# Custom Embeddings

In the previous lab, we queried the `embedded_movies` collection, which already had a `plot_embedding` field.  This field was created using an OpenAI embedding model.

In this lab, we will set up a custom embedding field using a local model (so no API calls to OpenAI).  We will try a few embedding models and gauge their performance for semantic search.

## Load Settings

In [1]:
import os, sys

## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it in the .env file to continue...")

## Find My Public IP

This IP address should be added to Atlas's 'access list' for the connection to work.

In [2]:
# import requests
# ip = requests.get('https://api.ipify.org').text

from urllib.request import urlopen
# read() returns bytes; decode so the IP prints as plain text
ip = urlopen('https://api.ipify.org').read().decode('utf-8')
print (f"My public IP is '{ip}'.   Make sure this IP is allowed to connect to cloud Atlas")

My public IP is '67.160.193.201'.   Make sure this IP is allowed to connect to cloud Atlas


## Connect to Atlas

In [3]:
DB_NAME = 'sample_mflix'
COLLECTION_NAME = 'embedded_movies'
# INDEX_NAME = 'idx_plot_embedding'

In [4]:
from AtlasClient import AtlasClient

atlas_client = AtlasClient (ATLAS_URI, DB_NAME)
print("Connected to the Mongo Atlas database!")

Connected to the Mongo Atlas database!


## Get Started With Local Embeddings

We are going to use Hugging Face (HF) models to create embeddings locally.

Please refer to this notebook for details: [local-embeddings-test.ipynb](local-embeddings-test.ipynb)

We can choose from many available options.
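
When switching between models, it can help to keep the options in one lookup table. The sketch below is only an illustration: the model names and embedding sizes are the ones listed in this notebook, while `MODEL_OPTIONS`, `check_embedding_size`, and the field name `plot_embedding_bge_small` are assumed names that are not part of the lab code.

```python
# Candidate models and their embedding sizes, as listed in this notebook.
# ASSUMPTION: 'plot_embedding_bge_small' is a hypothetical field name that
# follows the pattern of 'plot_embedding_bge_large'.
MODEL_OPTIONS = {
    'BAAI/bge-large-en-v1.5': {'dim': 1024, 'field': 'plot_embedding_bge_large'},
    'BAAI/bge-small-en-v1.5': {'dim': 384, 'field': 'plot_embedding_bge_small'},
}

def check_embedding_size(model_name: str, embedding: list[float]) -> bool:
    """Return True if the vector length matches the size listed for the model."""
    return len(embedding) == MODEL_OPTIONS[model_name]['dim']

# Placeholder vector of the right length (not real model output)
print(check_embedding_size('BAAI/bge-small-en-v1.5', [0.0] * 384))
```

A check like this can catch a mismatch early, for example when a vector produced by one model is about to be written into a field that is meant for another.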

In [5]:
# EMBEDDING_MODEL =  'BAAI/bge-large-en-v1.5'  # embedding size = 1024
# EMBEDDING_FIELD  = 'plot_embedding_bge_large'


EMBEDDING_MODEL =  'BAAI/bge-small-en-v1.5'  # embedding size = 384

from llama_index.embeddings import HuggingFaceEmbedding

# Load the model once; reloading it on every call is slow
embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL)

def get_embeddings (text: str) -> list[float]:
    """Return the embedding vector for the given text."""
    return embed_model.get_text_embedding(text)

In [6]:
embeddings = get_embeddings ("Hello Atlas!")
print (f'embeddings length : {len(embeddings)}')
print (embeddings[:5])


embeddings length : 384
[-0.027998557314276695, -0.04962858557701111, 0.05050567910075188, -0.019165894016623497, 0.018842166289687157]
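
To get a feel for how such vectors are compared in semantic search, here is a minimal sketch of cosine similarity in plain Python. It is an illustration only: `cosine_similarity` is a hypothetical helper, the three-dimensional vectors are toy values rather than real model output, and a vector search index would normally compute the similarity for you, using whichever metric the index is configured with.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative values, not produced by an embedding model)
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query
doc_far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0. The same function works on the 384-dimensional lists returned by `get_embeddings`, as long as both vectors come from the same model.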


In [7]:
collection_movies = atlas_client.get_collection(COLLECTION_NAME)

for movie in collection_movies.find({'plot':{"$exists": True}}).limit(5):
    print (f'   id: {movie["_id"]}')
    print (f'   title: {movie["title"]}')
    print (f'   plot [{len(movie["plot"])}]: {movie["plot"]}')
    embeddings = get_embeddings (movie['plot'])
    print ('   embeddings : ', embeddings[:5])
    print()

   id: 573a1391f29313caabcd68d0
   title: From Hand to Mouth
   plot [99]: A penniless young man tries to save an heiress from kidnappers and help her secure her inheritance.
   embeddings :  [-0.07233031094074249, 0.027128444984555244, -0.03854302316904068, -0.04830155521631241, 0.027173439040780067]

   id: 573a1391f29313caabcd8268
   title: The Black Pirate
   plot [96]: Seeking revenge, an athletic young man joins the pirate band responsible for his father's death.
   embeddings :  [-0.039247386157512665, 0.025636354461312294, 0.029713360592722893, -0.06483838707208633, 0.056841544806957245]

   id: 573a1391f29313caabcd93a3
   title: Men Without Women
   plot [57]: Navy divers clear the torpedo tube of a sunken submarine.
   embeddings :  [-0.002184102777391672, -0.019694756716489792, -0.01831626333296299, -0.0525548979640007, 0.025484716519713402]

   id: 573a1391f29313caabcd820b
   title: Beau Geste
   plot [192]: Michael "Beau" Geste leaves England in disgrace and joins the infa