This Jupyter notebook demonstrates how to create and use a vector database for searching English text content using Persian queries. The system uses LanceDB for vector storage and includes a translation pipeline to convert Persian queries to English before searching.
- Vector database creation with LanceDB
- Persian to English query translation
- Full-text search capabilities
- Sample dataset of 30 English sentences across various topics
The following packages are required:
pip install lancedb
pip install tantivy
pip install transformers

- Uses LanceDB to store vector embeddings and text data
- Creates a table called 'news' with vector embeddings and text content
- Implements full-text search indexing on the text field
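
The storage steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual `create_database` implementation: the helper names `build_records` and `create_news_table`, the `./lancedb_demo` path, the one-dimensional vectors, and the three sample sentences are all hypothetical. The LanceDB calls used (`lancedb.connect`, `create_table`, `create_fts_index`) are part of its Python API, though their defaults may differ between versions.

```python
def build_records(sentences):
    """Pair each sentence with a one-element placeholder vector (1, 2, 3, ...)."""
    return [
        {"vector": [float(i)], "text": text}
        for i, text in enumerate(sentences, start=1)
    ]


def create_news_table(records, uri="./lancedb_demo"):
    """Store the records in a LanceDB table named 'news' and index the text field."""
    import lancedb  # imported lazily so build_records works without LanceDB

    db = lancedb.connect(uri)  # hypothetical local path
    tbl = db.create_table("news", data=records, mode="overwrite")
    tbl.create_fts_index("text")  # full-text index on the 'text' column
    return tbl


# Hypothetical sample sentences; the notebook ships its own 30.
sample = [
    "Learning a new skill takes time and practice.",
    "Cloud computing changes how companies store data.",
    "Regular exercise improves health and wellness.",
]
records = build_records(sample)
```

Calling `create_news_table(records)` would then write the table to disk, provided `lancedb` and `tantivy` are installed.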
- Utilizes the Hugging Face transformers library
- Uses the "barghavani/Farsi-to-English" translation model
- Translates Persian queries to English before searching
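
The translate-then-search flow described above might look like the sketch below. It assumes the model loads under the generic `translation` task of the transformers `pipeline` helper, which has not been verified against this specific model. The function names `load_translator` and `search_persian`, and the stand-in translator and corpus, are hypothetical and exist only so the flow can be tried without downloading anything.

```python
def load_translator(model_name="barghavani/Farsi-to-English"):
    """Return a function that translates Persian text to English."""
    from transformers import pipeline  # heavy import, so done lazily

    translator = pipeline("translation", model=model_name)

    def translate(text):
        # Translation pipelines return a list of dicts keyed by 'translation_text'.
        return translator(text)[0]["translation_text"]

    return translate


def search_persian(query, translate, search_english):
    """Translate a Persian query, then pass it to an English search function."""
    english_query = translate(query).strip()
    return english_query, search_english(english_query)


# Stand-ins (assumptions) that mimic the translator and the search step.
fake_translate = {"مهارت": "skill"}.get
corpus = ["Learning a new skill takes time.", "Music is part of every culture."]
fake_search = lambda q: [s for s in corpus if q.lower() in s.lower()]

english, hits = search_persian("مهارت", fake_translate, fake_search)
print(english, hits)
```

Keeping translation and search as separate callables makes it easy to swap the stand-ins for `load_translator()` and a real table search once the model and database are available.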
The database includes 30 sample sentences covering various topics:
- Technology and Computing
- Business and Work
- Education and Learning
- Health and Wellness
- Environment
- Entertainment and Culture
- Run the installation cells to set up required packages
- Execute the database creation function:
tbl = create_database()
- Search using Persian queries:
results = search('مهارت', tbl)  # Searches for "skill" in English

The search results are returned as a pandas DataFrame containing:
- vector: The embedding vector
- text: The matched English text
- _score: The relevance score of the match
- The vector values in the sample data are simplified (1-30) for demonstration purposes
- In a production environment, these should be replaced with actual embeddings
- The full-text search index is created automatically on the 'text' field
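
As a rough illustration of the second note, the placeholder vectors could be swapped for model output by passing an embedding function into the record builder. The names `with_embeddings` and `toy_embed` are hypothetical, and `toy_embed` is only a stand-in that carries no semantic meaning; in practice it would be replaced by a call to a real embedding model.

```python
def with_embeddings(sentences, embed):
    """Build records whose vectors come from `embed` rather than a counter."""
    vectors = [list(map(float, v)) for v in embed(sentences)]
    dims = {len(v) for v in vectors}
    # A vector column needs one vector per row, all of the same length.
    if len(vectors) != len(sentences) or len(dims) > 1:
        raise ValueError("embed must return one equal-length vector per sentence")
    return [{"vector": v, "text": t} for v, t in zip(vectors, sentences)]


def toy_embed(texts):
    """Stand-in embedder: [character count, word count] for each text."""
    return [[len(t), len(t.split())] for t in texts]


records = with_embeddings(["Health matters.", "Art enriches culture."], toy_embed)
```

The resulting records have the same `vector` and `text` fields as the sample data, so they can be stored in the same way.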
# Create database
tbl = create_database()
# Search in Persian
result = search('فرهنگ', tbl)
# Returns matches related to "culture" in the English dataset