# Using OpenAI's Latest Embeddings In A RAG System With MongoDB

OpenAI recently released new embedding and moderation models. This article walks through a step-by-step implementation that uses one of the new embedding models, `text-embedding-3-small`, within a Retrieval-Augmented Generation (RAG) system powered by the MongoDB Atlas vector database.


## Step 1: Libraries Installation


Below are brief explanations of the tools and libraries utilised within the implementation code:
* **datasets**: This library is part of the Hugging Face ecosystem. By installing 'datasets', we gain access to a number of pre-processed and ready-to-use datasets, which are essential for training and fine-tuning machine learning models or benchmarking their performance.

* **pandas**: A data science library that provides robust data structures and methods for data manipulation, processing and analysis.

* **openai**: This is the official Python client library for accessing OpenAI's suite of AI models and tools, including GPT and embedding models.

* **pymongo**: PyMongo is a Python toolkit for MongoDB. It enables interactions with a MongoDB database.

In [1]:
!pip install datasets pandas openai pymongo



## Step 2: Data Loading


Load the dataset titled ["AIatMongoDB/embedded_movies"](https://huggingface.co/datasets/AIatMongoDB/embedded_movies). This dataset is a collection of movie-related details that include attributes such as the title, release year, cast, plot and more. A unique feature of this dataset is the `plot_embedding` field for each movie. These embeddings were generated using OpenAI's `text-embedding-ada-002` model.


In [2]:
# 1. Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Unnamed: 0,cast,genres,poster,writers,runtime,imdb,title,languages,plot,countries,num_mflix_comments,metacritic,directors,awards,type,plot_embedding,fullplot,rated
0,"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",[Action],https://m.media-amazon.com/images/M/MV5BMzgxOD...,"[Charles W. Goddard (screenplay), Basil Dickey...",199.0,"{'id': 4465, 'rating': 7.6, 'votes': 744}",The Perils of Pauline,[English],Young Pauline is left a lot of money when her ...,[USA],0,,"[Louis J. Gasnier, Donald MacKenzie]","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,"[0.00072939653, -0.026834568, 0.013515796, -0....",Young Pauline is left a lot of money when her ...,
1,"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...","[Comedy, Short, Action]",https://m.media-amazon.com/images/M/MV5BNzE1OW...,[H.M. Walker (titles)],22.0,"{'id': 10146, 'rating': 7.0, 'votes': 639}",From Hand to Mouth,[English],A penniless young man tries to save an heiress...,[USA],0,,"[Alfred J. Goulding, Hal Roach]","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,"[-0.022837115, -0.022941574, 0.014937485, -0.0...",As a penniless man worries about how he will m...,TV-G
2,"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...","[Action, Adventure, Drama]",,"[Herbert Brenon (adaptation), John Russell (ad...",101.0,"{'id': 16634, 'rating': 6.9, 'votes': 222}",Beau Geste,[English],"Michael ""Beau"" Geste leaves England in disgrac...",[USA],0,,[Herbert Brenon],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,"[0.00023330493, -0.028511643, 0.014653289, -0....","Michael ""Beau"" Geste leaves England in disgrac...",
3,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...","[Adventure, Action]",https://m.media-amazon.com/images/M/MV5BMzU0ND...,"[Douglas Fairbanks (story), Jack Cunningham (a...",88.0,"{'id': 16654, 'rating': 7.2, 'votes': 1146}",The Black Pirate,,"Seeking revenge, an athletic young man joins t...",[USA],1,,[Albert Parker],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,"[-0.005927917, -0.033394486, 0.0015323418, -0....",A nobleman vows to avenge the death of his fat...,
4,"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...","[Action, Comedy, Romance]",https://m.media-amazon.com/images/M/MV5BMTcxMT...,"[Ted Wilde (story), John Grey (story), Clyde B...",58.0,"{'id': 16895, 'rating': 7.6, 'votes': 918}",For Heaven's Sake,[English],An irresponsible young millionaire changes his...,[USA],0,,[Sam Taylor],"{'nominations': 1, 'text': '1 nomination.', 'w...",movie,"[-0.0059373598, -0.026604708, -0.0070914757, -...","The Uptown Boy, J. Harold Manners (Lloyd) is a...",PASSED


## Step 3: Data Cleaning and Preparation

This step cleans the data and prepares it for the next stage, which generates a new embedding for each movie plot using the new OpenAI embedding model.


In [3]:
print("Columns:", dataset_df.columns)
print("\nNumber of rows and columns:", dataset_df.shape)
print("\nBasic Statistics for numerical data:")
print(dataset_df.describe())
print("\nNumber of missing values in each column:")
print(dataset_df.isnull().sum())

Columns: Index(['cast', 'genres', 'poster', 'writers', 'runtime', 'imdb', 'title',
       'languages', 'plot', 'countries', 'num_mflix_comments', 'metacritic',
       'directors', 'awards', 'type', 'plot_embedding', 'fullplot', 'rated'],
      dtype='object')

Number of rows and columns: (1500, 18)

Basic Statistics for numerical data:
           runtime  num_mflix_comments  metacritic
count  1485.000000         1500.000000  572.000000
mean    111.977104            6.071333   51.646853
std      42.090386           27.378982   16.861996
min       6.000000            0.000000    9.000000
25%      96.000000            0.000000   40.000000
50%     106.000000            0.000000   51.000000
75%     121.000000            1.000000   63.000000
max    1256.000000          158.000000   97.000000

Number of missing values in each column:
cast                    1
genres                  0
poster                 89
writers                13
runtime                15
imdb                    0
title

In [4]:
# Remove data points where the plot column is missing
dataset_df = dataset_df.dropna(subset=['plot'])
print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())

# Remove the plot_embedding from each data point in the dataset, as we are going to create new embeddings with the new OpenAI embedding model "text-embedding-3-small"
dataset_df = dataset_df.drop(columns=['plot_embedding'])
dataset_df.head(5)


Number of missing values in each column after removal:
cast                    1
genres                  0
poster                 78
writers                13
runtime                14
imdb                    0
title                   0
languages               1
plot                    0
countries               0
num_mflix_comments      0
metacritic            903
directors              13
awards                  0
type                    0
plot_embedding          1
fullplot               21
rated                 284
dtype: int64


Unnamed: 0,cast,genres,poster,writers,runtime,imdb,title,languages,plot,countries,num_mflix_comments,metacritic,directors,awards,type,fullplot,rated
0,"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",[Action],https://m.media-amazon.com/images/M/MV5BMzgxOD...,"[Charles W. Goddard (screenplay), Basil Dickey...",199.0,"{'id': 4465, 'rating': 7.6, 'votes': 744}",The Perils of Pauline,[English],Young Pauline is left a lot of money when her ...,[USA],0,,"[Louis J. Gasnier, Donald MacKenzie]","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,Young Pauline is left a lot of money when her ...,
1,"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...","[Comedy, Short, Action]",https://m.media-amazon.com/images/M/MV5BNzE1OW...,[H.M. Walker (titles)],22.0,"{'id': 10146, 'rating': 7.0, 'votes': 639}",From Hand to Mouth,[English],A penniless young man tries to save an heiress...,[USA],0,,"[Alfred J. Goulding, Hal Roach]","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,As a penniless man worries about how he will m...,TV-G
2,"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...","[Action, Adventure, Drama]",,"[Herbert Brenon (adaptation), John Russell (ad...",101.0,"{'id': 16634, 'rating': 6.9, 'votes': 222}",Beau Geste,[English],"Michael ""Beau"" Geste leaves England in disgrac...",[USA],0,,[Herbert Brenon],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,"Michael ""Beau"" Geste leaves England in disgrac...",
3,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...","[Adventure, Action]",https://m.media-amazon.com/images/M/MV5BMzU0ND...,"[Douglas Fairbanks (story), Jack Cunningham (a...",88.0,"{'id': 16654, 'rating': 7.2, 'votes': 1146}",The Black Pirate,,"Seeking revenge, an athletic young man joins t...",[USA],1,,[Albert Parker],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,A nobleman vows to avenge the death of his fat...,
4,"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...","[Action, Comedy, Romance]",https://m.media-amazon.com/images/M/MV5BMTcxMT...,"[Ted Wilde (story), John Grey (story), Clyde B...",58.0,"{'id': 16895, 'rating': 7.6, 'votes': 918}",For Heaven's Sake,[English],An irresponsible young millionaire changes his...,[USA],0,,[Sam Taylor],"{'nominations': 1, 'text': '1 nomination.', 'w...",movie,"The Uptown Boy, J. Harold Manners (Lloyd) is a...",PASSED


## Step 4: Create embeddings with OpenAI

This stage generates new embeddings using OpenAI's `text-embedding-3-small` model.
This demonstration utilises a Google Colab Notebook, where secrets are configured in the notebook's Secrets section and accessed through the `userdata` module. In a production environment, secret keys are usually stored in a `.env` file or equivalent.
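
Outside Colab, the same key can be read from an environment variable. The following is a minimal sketch under that assumption; the variable name `OPENAI_API_KEY` and the helper `load_secret` are illustrative and are not part of the original notebook.

```python
import os


def load_secret(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Example usage (assumes OPENAI_API_KEY was exported in the shell
# or loaded from a .env file before the script starts):
# import openai
# openai.api_key = load_secret("OPENAI_API_KEY")
```

Failing early with a clear error tends to be easier to debug than passing an empty key to the API client.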


In [5]:
import openai
from google.colab import userdata

openai.api_key = userdata.get("open_ai")

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

dataset_df["plot_embedding_optimised"] = dataset_df['plot'].apply(get_embedding)

dataset_df.head()

Unnamed: 0,cast,genres,poster,writers,runtime,imdb,title,languages,plot,countries,num_mflix_comments,metacritic,directors,awards,type,fullplot,rated,plot_embedding_optimised
0,"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",[Action],https://m.media-amazon.com/images/M/MV5BMzgxOD...,"[Charles W. Goddard (screenplay), Basil Dickey...",199.0,"{'id': 4465, 'rating': 7.6, 'votes': 744}",The Perils of Pauline,[English],Young Pauline is left a lot of money when her ...,[USA],0,,"[Louis J. Gasnier, Donald MacKenzie]","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,Young Pauline is left a lot of money when her ...,,"[0.015450738370418549, -0.0037871389649808407,..."
1,"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...","[Comedy, Short, Action]",https://m.media-amazon.com/images/M/MV5BNzE1OW...,[H.M. Walker (titles)],22.0,"{'id': 10146, 'rating': 7.0, 'votes': 639}",From Hand to Mouth,[English],A penniless young man tries to save an heiress...,[USA],0,,"[Alfred J. Goulding, Hal Roach]","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,As a penniless man worries about how he will m...,TV-G,"[-0.024403288960456848, 0.009791935794055462, ..."
2,"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...","[Action, Adventure, Drama]",,"[Herbert Brenon (adaptation), John Russell (ad...",101.0,"{'id': 16634, 'rating': 6.9, 'votes': 222}",Beau Geste,[English],"Michael ""Beau"" Geste leaves England in disgrac...",[USA],0,,[Herbert Brenon],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,"Michael ""Beau"" Geste leaves England in disgrac...",,"[-0.03142419457435608, 0.07591885328292847, 0...."
3,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...","[Adventure, Action]",https://m.media-amazon.com/images/M/MV5BMzU0ND...,"[Douglas Fairbanks (story), Jack Cunningham (a...",88.0,"{'id': 16654, 'rating': 7.2, 'votes': 1146}",The Black Pirate,,"Seeking revenge, an athletic young man joins t...",[USA],1,,[Albert Parker],"{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,A nobleman vows to avenge the death of his fat...,,"[0.021738531067967415, 0.06848915666341782, 0...."
4,"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...","[Action, Comedy, Romance]",https://m.media-amazon.com/images/M/MV5BMTcxMT...,"[Ted Wilde (story), John Grey (story), Clyde B...",58.0,"{'id': 16895, 'rating': 7.6, 'votes': 918}",For Heaven's Sake,[English],An irresponsible young millionaire changes his...,[USA],0,,[Sam Taylor],"{'nominations': 1, 'text': '1 nomination.', 'w...",movie,"The Uptown Boy, J. Harold Manners (Lloyd) is a...",PASSED,"[0.008178990334272385, -0.019387762993574142, ..."
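
Applying `get_embedding` row by row issues one API request per plot. The embeddings endpoint also accepts a list of strings, so the requests could be batched instead. The sketch below illustrates that idea under some assumptions: `chunked`, `embed_in_batches` and the batch size of 100 are illustrative choices that do not appear in the original notebook, and `create_fn` stands in for `openai.embeddings.create` so the batching logic can be exercised without network access.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, create_fn, model="text-embedding-3-small", batch_size=100):
    """Embed `texts` in batches and return one embedding per input, in input order.

    `create_fn` is expected to behave like `openai.embeddings.create`:
    it takes `input` (a list of strings) and `model`, and returns an object
    whose `.data` items expose `.index` and `.embedding`.
    """
    embeddings = []
    for batch in chunked(texts, batch_size):
        response = create_fn(input=list(batch), model=model)
        # Sort by index so the output order matches the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the notebook's variables, a call might look like `embed_in_batches(dataset_df["plot"].tolist(), openai.embeddings.create)`. This assumes every plot is a non-empty string, which holds after the rows with a missing plot were dropped in Step 3.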


## Step 5: Vector Database Setup and Data Ingestion

MongoDB acts as both an operational and a vector database. It stores, queries and retrieves vector embeddings alongside operational data, which simplifies database maintenance and management and can reduce cost.

**To create a new MongoDB database, set up a database cluster:**

1. Head over to the official MongoDB site and register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register), or for existing users, [sign into MongoDB Atlas](https://account.mongodb.com/account/login?nds=true).

2. Select the 'Database' option on the left-hand pane, which will navigate to the Database Deployment page, where there is a deployment specification of any existing cluster. Create a new database cluster by clicking on the "+Create" button.

3. Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also enables the creation of free clusters on the “Shared” tab.

 *Note: Don’t forget to whitelist the IP for the Python host or 0.0.0.0/0 for any IP when creating proof of concepts.*

4. After successfully creating and deploying the cluster, the cluster becomes accessible on the ‘Database Deployment’ page.

5. Click on the “Connect” button of the cluster to view the option to set up a connection to the cluster via various language drivers.

6. This tutorial only requires the cluster's URI (uniform resource identifier). Grab the URI and copy it into the Google Colab Secrets environment in a variable named `MONGO_URI`, or place it in a `.env` file or equivalent.




In [6]:
import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
  """Establish a connection to MongoDB and verify it with a ping."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to confirm the connection
    client.admin.command("ping")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.PyMongoError as e:
    print(f"Connection failed: {e}")
    return None

# The secret name matches the MONGO_URI variable created in the previous step
mongo_uri = userdata.get('MONGO_URI')
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Ingest data into MongoDB
db = mongo_client['movies']
collection = db['movie_collection']

Connection to MongoDB successful


In [7]:
# Delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 1473, 'electionId': ObjectId('7fffffff000000000000000e'), 'opTime': {'ts': Timestamp(1709882073, 1474), 't': 14}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1709882073, 1476), 'signature': {'hash': b'\xcf\xa6e8\r\x9c\xda;\x8c2\x01_\xb4R\xff%\xe9\x0c\xa6\xa8', 'keyId': 7278884602146455558}}, 'operationTime': Timestamp(1709882073, 1474)}, acknowledged=True)

In [8]:
documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed


## Step 6: Create a Vector Search Index

Before running any queries, make sure that a vector search index has been created for the collection in MongoDB Atlas.

This step is mandatory for conducting efficient and accurate vector-based searches over the embeddings stored in the documents of the `movie_collection` collection. A Vector Search Index lets MongoDB traverse the documents efficiently and retrieve those whose embeddings are closest to the query embedding by vector similarity. The index must be named `vector_index` and must cover the `plot_embedding_optimised` field, because these are the names used by the query in the next step. Go here to read more about [MongoDB Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-search/field-types/knn-vector/).
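
As a rough guide, the index definition could look like the output of the sketch below. It follows the `knnVector` field type described on the linked page. The field path comes from Step 4, and 1536 is the default output dimension of `text-embedding-3-small`. The `cosine` similarity setting and the helper name `build_vector_index_definition` are assumptions made for illustration, not values taken from the original notebook.

```python
import json


def build_vector_index_definition(path, dimensions, similarity="cosine"):
    """Build an Atlas Search index definition with a knnVector field mapping."""
    return {
        "mappings": {
            "dynamic": True,
            "fields": {
                path: {
                    "type": "knnVector",
                    "dimensions": dimensions,
                    "similarity": similarity,  # assumption: cosine similarity
                }
            },
        }
    }


definition = build_vector_index_definition("plot_embedding_optimised", 1536)

# Print the definition as JSON so it can be reviewed or copied
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the JSON editor when creating the index in the Atlas UI, with `vector_index` as the index name.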


## Step 7: Perform Vector Search on User Queries

This step combines the work of the previous steps to support vector search on stored records based on embedded user queries.

This step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline. The pipeline, consisting of the `$vectorSearch` and `$project` stages, queries using the generated vector and formats the results to include only required information like plot, title, and genres while incorporating a search score for each result.

This selective projection improves query performance by reducing data transfer and makes better use of network and memory resources, which is especially important when handling large datasets. It also helps AI engineers and developers who consider data security at an early stage: excluding fields that are irrelevant to the user's query reduces the chance of leaking sensitive data to the client side.


In [12]:

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "plot_embedding_optimised",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "plot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1, # Include the genres field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = list(collection.aggregate(pipeline))
    return results
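
To make the idea of vector similarity more concrete, the following sketch ranks documents against a query vector by exhaustive cosine similarity. It is a conceptual illustration only: Atlas performs an approximate nearest-neighbour search over `numCandidates` rather than scanning every document, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, documents, field, limit=5):
    """Score every document that has `field` and return the top `limit` (score, document) pairs."""
    scored = [
        (cosine_similarity(query_vector, doc[field]), doc)
        for doc in documents
        if doc.get(field)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

An exhaustive scan like this grows linearly with the collection size, which is the cost a vector search index is meant to avoid.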

## Step 8: Handling User Query and Result

The final step in the implementation phase focuses on the practical application of our vector search functionality and AI integration to handle user queries effectively.

The `handle_user_query` function performs a vector search on the MongoDB collection based on the user's query and utilizes OpenAI's GPT-3.5 model to generate context-aware responses.


In [13]:
def handle_user_query(query, collection):

  get_knowledge = vector_search(query, collection)

  search_result = ''
  for result in get_knowledge:
      search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('plot', 'N/A')}\n"

  completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": "You are a movie recommendation system."},
          {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
      ]
  )

  return (completion.choices[0].message.content), search_result
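
The prompt assembly can also be separated from the API call, so that it can be inspected or tested in isolation and so that an empty search result is handled explicitly. This is a sketch rather than the notebook's implementation: `build_context`, `build_messages` and the fallback wording are illustrative names and text.

```python
def build_context(results):
    """Format retrieved movies as one 'Title: ..., Plot: ...' line per result."""
    return "\n".join(
        f"Title: {doc.get('title', 'N/A')}, Plot: {doc.get('plot', 'N/A')}"
        for doc in results
    )


def build_messages(query, results):
    """Build the chat messages, noting explicitly when no context was retrieved."""
    context = build_context(results)
    if context:
        user_content = (
            f"Answer this user query: {query} with the following context:\n{context}"
        )
    else:
        # Hypothetical fallback wording for the case where the search returned nothing
        user_content = (
            f"Answer this user query: {query}\n"
            "No matching movies were retrieved, so state that no context is available."
        )
    return [
        {"role": "system", "content": "You are a movie recommendation system."},
        {"role": "user", "content": user_content},
    ]
```

The returned list can be passed as the `messages` argument of `openai.chat.completions.create`. When the retrieved context is empty, the model can only answer from its own knowledge, so making that case visible in the prompt helps when checking whether the vector index is returning results.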

In [14]:
# Conduct query with retrieval of sources
query = "What is the best romantic movie to watch?"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"Source Information: \n{source_information}")


Response: If you are looking for a romantic movie to watch, I recommend "The Notebook." This classic film, based on the novel by Nicholas Sparks, tells a heartwarming love story that will tug at your heartstrings. With its captivating performances and beautiful cinematography, "The Notebook" is a beloved romantic movie that is sure to leave you feeling all the feels.
Source Information: 

