# Populating Embedding Vectors in MongoDB Atlas

In this Python notebook, we'll use the embedding models we've downloaded to our local device to create embedding attributes for our movies dataset. Once that's done, we'll use `pymongo` to add these new embedding attributes to our dataset in MongoDB.

The cool thing about what we're doing here is that we're generating all the embeddings locally (i.e. no external API calls needed), so if you have your own custom embedding models, you can use them for other projects too!

## Step 1: Load Settings
We start by loading some environment settings. These should all have been configured in Quest 1, so head back to Quest 1 if anything is missing.

In [1]:
# Load settings from .env file
import sys
from dotenv import find_dotenv, dotenv_values

# Add the root directory to the system path
sys.path.insert(0, '../')

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# For debugging purposes
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception("'ATLAS_URI' is not set. Please set it in your .env file to continue...")
else:
    print("ATLAS_URI Connection string found:", ATLAS_URI)

ATLAS_URI Connection string found: mongodb+srv://<username>:<password>@cluster0.w34yhfu.mongodb.net/
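
Printing the full connection string also prints the database username and password, which is easy to leak when a notebook is shared. Here is a minimal sketch of one way to mask them before printing, assuming the usual `scheme://user:password@host/` layout. `mask_credentials` is a hypothetical helper, not part of `pymongo` or `python-dotenv`, and the connection string in the example is made up.

```python
import re

def mask_credentials(uri: str) -> str:
    """Replace the user:password part of a connection string with a placeholder."""
    # Match everything between '//' and the first '@' (the credentials part).
    return re.sub(r"//[^@/]+@", "//<credentials>@", uri, count=1)

# Example with a made-up connection string
print(mask_credentials("mongodb+srv://user:secret@cluster0.example.mongodb.net/"))
```

In the cell above, this would mean printing `mask_credentials(ATLAS_URI)` instead of `ATLAS_URI`.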


In [2]:
# Our variables
DB_NAME = 'sample_mflix'
COLLECTION_NAME = 'embedded_movies'

## Step 2: Initialize Mongo Atlas Client
Next, we initialize a connection to MongoDB Atlas using our unique `ATLAS_URI` value. Once we've successfully connected, we can interact with our MongoDB database. For example, in the second code cell below, we get our `embedded_movies` collection and print its document count.

In [3]:
from AtlasClient import AtlasClient

atlas_client = AtlasClient (ATLAS_URI, DB_NAME)
print("Connected to the Mongo Atlas database!")

Connected to the Mongo Atlas database!


In [4]:
collection = atlas_client.get_collection(COLLECTION_NAME)
document_count = collection.count_documents({})

print (f"Document count = {document_count:,} movies")

Document count = 3,483 movies
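
`AtlasClient` is imported from a local module whose source is not shown in this notebook. As a rough idea of what such a wrapper could look like, here is a sketch built around `pymongo`. The class name `AtlasClientSketch`, the `from_uri` constructor, and all method bodies are assumptions for illustration, not the actual module.

```python
class AtlasClientSketch:
    """Hypothetical thin wrapper around a pymongo client (illustration only)."""

    def __init__(self, mongo_client, dbname):
        # mongo_client is expected to behave like pymongo.MongoClient
        self.mongo_client = mongo_client
        self.database = mongo_client[dbname]

    @classmethod
    def from_uri(cls, atlas_uri, dbname):
        # Imported lazily so the class can be exercised without pymongo installed
        from pymongo import MongoClient
        return cls(MongoClient(atlas_uri), dbname)

    def get_collection(self, collection_name):
        # Return the underlying collection object for direct pymongo calls
        return self.database[collection_name]

    def find(self, collection_name, filter=None, limit=0):
        # In pymongo, limit=0 means "no limit"
        cursor = self.database[collection_name].find(filter=filter or {}, limit=limit)
        return list(cursor)
```

Under these assumptions, `AtlasClientSketch.from_uri(ATLAS_URI, DB_NAME)` would play the role of `AtlasClient(ATLAS_URI, DB_NAME)` in the cell above.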


## Step 3: Generate Embeddings
Now for the fun part - we're going to generate all embeddings locally on our computer, using open source models. No API calls or API keys needed! 😄

As mentioned, we'll be using the following models:

| model name                              | overall score | model params | model size | embedding length | url                                                            |
|-----------------------------------------|---------------|--------------|------------|------------------|----------------------------------------------------------------|
| BAAI/bge-small-en-v1.5                  | 62.x          | 33.5 M       | 133 MB     | 384              | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          |              | 438 MB     | 768              | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          |              | 91 MB      | 384              | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |

In [8]:
import os
# LlamaIndex will download embedding models as needed
# Set llamaindex cache dir to ../cache dir here (Default is system tmp)
# This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath('../'), 'cache')

In [13]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
#from llama_index.embeddings import HuggingFaceEmbedding
import time

# Helper function to calculate embeddings, given a model.
# Each movie dict is updated in place: the plot's vector is stored under embedding_attr.
def create_embeddings (movies, embedding_model, embedding_attr):
    embed_model = HuggingFaceEmbedding(model_name=embedding_model)

    t2a = time.perf_counter()
    for movie in movies:
        movie[embedding_attr] = embed_model.get_text_embedding(movie['plot'])

    t2b = time.perf_counter()
    # print (f'Embeddings generated for {len(movies):,} movies  in {(t2b-t2a)*1000:,.0f} ms')
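
`create_embeddings` embeds one plot per call. Many embedding backends are faster when given several texts at once, so a batched variant is sketched below. It assumes an `embed_batch` callable that maps a list of strings to a list of vectors. With LlamaIndex, `embed_model.get_text_embedding_batch` should fit that shape, but treat that as an assumption to check against your installed version. `add_embeddings_in_batches` and `toy_embed_batch` are hypothetical names used only for illustration.

```python
def add_embeddings_in_batches(movies, embed_batch, embedding_attr, batch_size=32):
    """Embed movie plots in batches and store each vector on its movie dict in place."""
    for start in range(0, len(movies), batch_size):
        batch = movies[start:start + batch_size]
        vectors = embed_batch([movie["plot"] for movie in batch])
        for movie, vector in zip(batch, vectors):
            movie[embedding_attr] = vector

def toy_embed_batch(texts):
    # Stand-in for a real model: a 2-dimensional "embedding" per text
    return [[float(len(text)), float(text.count(" "))] for text in texts]

sample_movies = [
    {"plot": "A giant egg is unearthed"},
    {"plot": "Rangers travel far"},
    {"plot": "The end"},
]
add_embeddings_in_batches(sample_movies, toy_embed_batch, "plot_embedding_toy", batch_size=2)
print(sample_movies[0]["plot_embedding_toy"])
```

Whether batching actually helps depends on the model and hardware, so it is worth timing both variants on a small sample first.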

In [14]:
# Fetch all movies
t1a = time.perf_counter()
movies = [m for m in atlas_client.find (collection_name=COLLECTION_NAME, filter={'plot':{"$exists": True}}, limit=0)]
t1b = time.perf_counter()

print (f'Fetched {len(movies):,} from Atlas in {(t1b-t1a)*1000:,.0f} ms')

Fetched 3,403 from Atlas in 24,365 ms
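
The `time.perf_counter()` start/stop pattern above is repeated in several cells. As an optional sketch, it could be wrapped in a small context manager. `timed` is a hypothetical helper name, not something this notebook defines.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, count=None):
    """Print elapsed milliseconds for the enclosed block, plus a per-item average if count is given."""
    start = time.perf_counter()
    stats = {}
    try:
        yield stats
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        stats["elapsed_ms"] = elapsed_ms
        message = f"{label}: {elapsed_ms:,.0f} ms"
        if count:
            message += f", avg {elapsed_ms / count:,.0f} ms per item"
        print(message)

# Toy usage: time a trivial computation over 1,000 items
with timed("Summed numbers", count=1000) as stats:
    total = sum(range(1000))
```

The fetch cell could then be written as `with timed("Fetched movies from Atlas"):` around the `find` call.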


In [15]:
# Embedding models we want to use
model_mappings = {
    'BAAI/bge-small-en-v1.5' : {'embedding_attr' : 'plot_embedding_bge_small', 'index_name' : 'idx_plot_embedding_bge_small'},
    'sentence-transformers/all-mpnet-base-v2' : {'embedding_attr' : 'plot_embedding_mpnet_base_v2', 'index_name' : 'idx_plot_embedding_mpnet_base_v2'},
    'sentence-transformers/all-MiniLM-L6-v2' : {'embedding_attr' : 'plot_embedding_minilm_l6_v2', 'index_name' : 'idx_plot_embedding_minilm_l6_v2'},
}

In [16]:
# For selected embedding models above, we are going to create vectors for the plot field
# Each embedding model will have its own 'plot_embedding' attribute (i.e. we don't want to mix them up)
# It might take up to a few minutes to generate the embeddings

for key in model_mappings.keys():
    embedding_model = key
    embedding_attr = model_mappings[key]['embedding_attr']

    print (f'\n------- Embedding Model = {embedding_model} ---------')
    t1a = time.perf_counter()
    create_embeddings(movies=movies, embedding_model=embedding_model, embedding_attr=embedding_attr)
    t1b = time.perf_counter()
    avg_time_per_movie = (t1b-t1a)*1000 / len(movies)
    print (f'model={embedding_model}, created embeddings for {len(movies):,} movies in {(t1b-t1a)*1000:,.0f} ms, avg_time_per_movie={avg_time_per_movie:,.0f} ms')


------- Embedding Model = BAAI/bge-small-en-v1.5 ---------
model=BAAI/bge-small-en-v1.5, created embeddings for 3,403 movies in 1,219,040 ms, avg_time_per_movie=358 ms

------- Embedding Model = sentence-transformers/all-mpnet-base-v2 ---------
model=sentence-transformers/all-mpnet-base-v2, created embeddings for 3,403 movies in 262,977 ms, avg_time_per_movie=77 ms

------- Embedding Model = sentence-transformers/all-MiniLM-L6-v2 ---------
model=sentence-transformers/all-MiniLM-L6-v2, created embeddings for 3,403 movies in 689,534 ms, avg_time_per_movie=203 ms


## Step 4: Inspect Generated Embeddings
Great job! At this point, we should have successfully generated 3 sets of embeddings, 1 set for each embedding model we used. Try running the code cell below a few times to see the embeddings generated by the different embedding models for a randomly chosen movie.

In [17]:
import random

movie = random.choice(movies)
print ('Randomly selected movie: ', movie['title'])
print ('Movie plot: ', movie['plot'], '\n')
print (f'plot_embeddings (existing openAI generated), len={len(movie["plot_embedding"])} , {movie["plot_embedding"][:5]}...\n')
print (f'plot_embedding_bge_small , len={len(movie["plot_embedding_bge_small"])} , {movie["plot_embedding_bge_small"][:5]}...\n')
print (f'plot_embedding_mpnet_base_v2 , len={len(movie["plot_embedding_mpnet_base_v2"])} , {movie["plot_embedding_mpnet_base_v2"][:5]}...\n')
print (f'plot_embedding_minilm_l6_v2 , len={len(movie["plot_embedding_minilm_l6_v2"])} , {movie["plot_embedding_minilm_l6_v2"][:5]}...')

Randomly selected movie:  Mighty Morphin Power Rangers: The Movie
Movie plot:  A giant egg is unearthed at a construction site and soon opened, releasing the terrible Ivan Ooze, who wreaks vengeance on Zordon for imprisoning him millennia ago. With Zordon dying and their powers lost, the Rangers head to a distant planet to find the mystic warrior Dulcea. 

plot_embeddings (existing openAI generated), len=1536 , [0.008537359, -0.0419179, -0.016995177, -0.0045039873, -0.030145891]...

plot_embedding_bge_small , len=384 , [-0.04320862516760826, -0.014960034750401974, 0.005376668181270361, 0.04989439994096756, 0.0363854356110096]...

plot_embedding_mpnet_base_v2 , len=768 , [0.07135888189077377, 0.0057554468512535095, 0.022949937731027603, 0.01702651008963585, 0.02376585453748703]...

plot_embedding_minilm_l6_v2 , len=384 , [-0.08905382454395294, 0.0890040472149849, 0.0572551004588604, 0.009834816679358482, 0.04117322340607643]...
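
The lengths printed above match the embedding lengths listed in the model table (384, 768, 384). Before writing anything back to Atlas, a quick check over the whole list can catch missing or wrongly sized vectors. The sketch below assumes the attribute names from `model_mappings`. `find_bad_embeddings` is a hypothetical helper, and the toy data stands in for the real `movies` list.

```python
EXPECTED_DIMS = {
    "plot_embedding_bge_small": 384,
    "plot_embedding_mpnet_base_v2": 768,
    "plot_embedding_minilm_l6_v2": 384,
}

def find_bad_embeddings(movies, expected_dims):
    """Return (movie _id, attribute) pairs whose embedding is missing or has the wrong length."""
    problems = []
    for movie in movies:
        for attr, dim in expected_dims.items():
            vector = movie.get(attr)
            if vector is None or len(vector) != dim:
                problems.append((movie.get("_id"), attr))
    return problems

# Toy data: movie 1 is complete, movie 2 has one short vector and one missing vector
toy_movies = [
    {
        "_id": 1,
        "plot_embedding_bge_small": [0.0] * 384,
        "plot_embedding_mpnet_base_v2": [0.0] * 768,
        "plot_embedding_minilm_l6_v2": [0.0] * 384,
    },
    {
        "_id": 2,
        "plot_embedding_bge_small": [0.0] * 384,
        "plot_embedding_mpnet_base_v2": [0.0] * 10,
    },
]
print(find_bad_embeddings(toy_movies, EXPECTED_DIMS))
```

On the real data, an empty list from `find_bad_embeddings(movies, EXPECTED_DIMS)` would indicate that every movie has all three vectors at the expected size.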


## Step 5: Add Embeddings to MongoDB Atlas
Now that we've generated all our embeddings, let's go ahead and add them to our dataset in MongoDB Atlas! We'll use a bulk write (`bulk_write` with one `ReplaceOne` operation per movie), which saves time compared with sending a separate request for each movie.

In [18]:
from pymongo import ReplaceOne

collection = atlas_client.get_collection(COLLECTION_NAME)
replacements = [ReplaceOne ({"_id" : movie["_id"]}, movie) for movie in movies]

# Perform bulk replacement
print (f'About to update {len(replacements)} movies in Atlas...')
t1a = time.perf_counter()
result = collection.bulk_write(replacements)
t1b = time.perf_counter()

# Print result
print(f"Update matched count: {result.matched_count}")
print(f"Update modified count: {result.modified_count}")
print (f'Updated {len(movies):,} in Atlas in {(t1b-t1a)*1000:,.0f} ms')

About to update 3403 movies in Atlas...
Update matched count: 3403
Update modified count: 3403
Updated 3,403 in Atlas in 31,703 ms
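
`ReplaceOne` sends every full document back to Atlas, including fields that did not change. An alternative is to send only the new embedding fields with `$set`. The sketch below builds plain `(filter, update)` pairs so it runs without a database connection. Each pair could then be wrapped as `pymongo.UpdateOne(filter, update)` and passed to `bulk_write`. `build_set_updates` is a hypothetical name, and the attribute list is assumed to match `model_mappings`.

```python
def build_set_updates(movies, embedding_attrs):
    """Build (filter, update) pairs that set only the given embedding fields."""
    updates = []
    for movie in movies:
        fields = {attr: movie[attr] for attr in embedding_attrs if attr in movie}
        if fields:
            updates.append(({"_id": movie["_id"]}, {"$set": fields}))
    return updates

# Toy data standing in for the real movies list
toy_movies = [
    {"_id": 1, "title": "Movie A", "plot_embedding_bge_small": [0.1, 0.2]},
    {"_id": 2, "title": "Movie B"},  # no embedding yet, so it is skipped
]
pairs = build_set_updates(toy_movies, ["plot_embedding_bge_small"])
print(pairs)

# With pymongo installed and a live collection, the pairs could be sent as:
#   from pymongo import UpdateOne
#   collection.bulk_write([UpdateOne(f, u) for f, u in pairs])
```

This keeps each request smaller, at the cost of not overwriting any other fields that may have changed locally.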


And we're done with this notebook! Please **head back to the Quest page on StackUp now**, where we'll proceed with the next step: creating a search index for each of our newly created custom embeddings.