MongoDB Document Embedding and Search using Vectors

This repository contains two Python scripts that demonstrate how to generate embeddings using the DistilBERT model and process documents in a MongoDB database. The goal is to embed text data and perform vector searches to retrieve relevant documents.

Overview

main.py:
- Connects to MongoDB asynchronously.
- Uses the DistilBERT model to generate embeddings for documents and stores them back in MongoDB.
- Processes documents in batches and updates the `embedding` field in the database.

search.py:
- Retrieves documents from MongoDB by running a vector search with an embedding of the query text generated by the DistilBERT model.
- Uses MongoDB's `$vectorSearch` aggregation stage to find documents similar to the query text.

Dataset

The documents used for generating embeddings and performing vector search come from Wikidata, a free and open knowledge base. You can download the dataset from the following link:

Wikidata Download

Wikidata is a large dataset with rich textual descriptions, making it ideal for embedding and searching tasks. Make sure to preprocess the dataset as per your requirements before loading it into MongoDB.

Performance

When generating embeddings on an NVIDIA RTX 3090 GPU, the pipeline processes approximately 10,000 documents every 12 seconds (roughly 830 documents per second). Use this figure as a rough guideline for estimating processing time on larger datasets; actual throughput depends on your hardware.
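To turn the benchmark above into a time estimate for your own dataset, a small calculation is enough. The sketch below assumes throughput scales linearly with document count, which may not hold for very long texts or slower hardware; the function name `estimated_seconds` is only illustrative.

```python
def estimated_seconds(
    num_documents: int,
    docs_per_batch: int = 10_000,      # benchmark figure quoted above
    seconds_per_batch: float = 12.0,   # benchmark figure quoted above
) -> float:
    """Linearly extrapolate the RTX 3090 benchmark to `num_documents`."""
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    return num_documents / docs_per_batch * seconds_per_batch


if __name__ == "__main__":
    seconds = estimated_seconds(1_000_000)
    print(f"1,000,000 documents: about {seconds / 60:.0f} minutes")
```

Under these assumptions, one million documents would take about 1,200 seconds, or 20 minutes.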

Installation

Clone the repository:

```
git clone https://github.com/sethigoldy/text_embedding_processor_mongodb_vector_python.git
```

Navigate to the project directory:

```
cd text_embedding_processor_mongodb_vector_python
```

Install the required Python packages:
```
pip install -r requirements.txt
```

Usage

Running `main.py`

This script connects to MongoDB, processes documents, and generates embeddings using DistilBERT. It stores the embeddings back in the MongoDB collection.

Run the script:

```
python main.py
```

Make sure to replace the MongoDB connection URI, database, and collection names as needed.
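As a rough illustration of the batch-and-embed step, here is a minimal sketch. It is not the code in `main.py`: the `distilbert-base-uncased` checkpoint, the mean-pooling strategy, and the helper names `batched`, `load_distilbert`, and `embed_batch` are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def load_distilbert(model_name: str = "distilbert-base-uncased"):
    """Load tokenizer and model once; the checkpoint name is an assumption."""
    # Imported lazily so `batched` can be used without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    return tokenizer, model, device


def embed_batch(texts: List[str], tokenizer, model, device) -> List[List[float]]:
    """Return one vector per text using mean pooling (an assumed strategy)."""
    import torch

    inputs = tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, hidden)

    # Average the token vectors while ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled.cpu().tolist()
```

A driver loop would call `load_distilbert()` once, then pass each list from `batched(texts, size)` to `embed_batch` and write the resulting vectors back to the matching documents.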

Running `search.py`

This script performs a vector search based on a query string using the generated embeddings and retrieves similar documents from the MongoDB collection.

Run the script:

```
python search.py
```

Again, replace the MongoDB connection URI, database, and collection names as needed.
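For readers unfamiliar with `$vectorSearch`, the aggregation pipeline behind such a query can be sketched as follows. This is an illustration rather than the pipeline in `search.py`: the index name `vector_index`, the default limits, and the projected fields are assumptions.

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(
    query_vector: List[float],
    index_name: str = "vector_index",  # hypothetical index name
    path: str = "embedding",           # field that holds the stored vectors
    limit: int = 5,
    num_candidates: int = 100,
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch pipeline that returns the closest documents."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be >= limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep the readable fields and expose the similarity score.
            "$project": {
                "labels": 1,
                "description": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo, the resulting list can be passed to `collection.aggregate(pipeline)`. The query vector must have the same number of dimensions as the stored embeddings.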

MongoDB Configuration

Ensure you have a MongoDB collection that stores your documents with the following structure:

```
{
    "_id": ObjectId,
    "description": { "en": { "value": "document description" } },
    "labels": { "en": { "value": "document labels" } },
    "type": "document type",
    "embedding": [ ... ]  // Generated embeddings will be stored here
}
```
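Given this structure, the text to embed has to be pulled out of the nested `labels` and `description` fields. The helper below is a sketch of one way to do that; which fields `main.py` actually embeds is not stated here, so the choice to join label and description is an assumption, as is the name `text_for_embedding`.

```python
from typing import Any, Dict


def text_for_embedding(doc: Dict[str, Any], lang: str = "en") -> str:
    """Join the label and description for one language, skipping missing values."""
    parts = []
    for field in ("labels", "description"):
        # Tolerate absent or null fields at every nesting level.
        entry = (doc.get(field) or {}).get(lang) or {}
        value = entry.get("value")
        if value:
            parts.append(value)
    return ". ".join(parts)


if __name__ == "__main__":
    # Made-up sample document in the shape shown above.
    sample = {
        "labels": {"en": {"value": "document labels"}},
        "description": {"en": {"value": "document description"}},
    }
    print(text_for_embedding(sample))
```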

You should also create a vector index on the embedding field for efficient vector search.
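A vector index definition for the `embedding` field might look like the sketch below. It assumes 768-dimensional vectors (the hidden size of `distilbert-base-uncased`), cosine similarity, a hypothetical index name `vector_index`, PyMongo 4.7 or later, and a deployment that supports Atlas Vector Search. Adjust these to match your setup.

```python
from typing import Any, Dict


def vector_index_definition(
    path: str = "embedding",
    num_dimensions: int = 768,   # assumed: hidden size of distilbert-base-uncased
    similarity: str = "cosine",  # assumed similarity metric
) -> Dict[str, Any]:
    """Return a vector search index definition for one vector field."""
    if similarity not in {"cosine", "dotProduct", "euclidean"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on a PyMongo collection and return its name."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)
```

The same definition can also be pasted into the Atlas UI when creating the index by hand.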

Requirements

- Python 3.8 or higher
- MongoDB
- CUDA-enabled GPU (for faster processing, but not required)

Key Dependencies

- torch: PyTorch framework used for handling the DistilBERT model and GPU acceleration.
- transformers: Library for loading the pre-trained DistilBERT model and tokenizer.
- motor: Asynchronous MongoDB client for main.py.
- pymongo: MongoDB client used in search.py.

License

This project is licensed under the MIT License. See the LICENSE file for details.
