[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/quickstarts/quickstart-2-vector-search-atlas-openai.ipynb)

# Lab: Vector Search on MongoDB Atlas Using OpenAI Embeddings


This is a companion notebook for this [TODO - Quick start guide](#).
It will demonstrate the following:

- 👉 Creating a vector index on Atlas
- 👉 Performing vector search using OpenAI embeddings


### What you need to run this notebook

- A (free) MongoDB Atlas account
- An Atlas instance running in the cloud with sample data loaded
- Connection credentials for that instance
- An OpenAI API key (optional, see below)

Follow this [TODO quick start guide](#) to set this up before proceeding.

### How to run

This notebook can be run on Google Colab or in a standalone Python development environment.  Click the badge below to run it on Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/quickstarts/quickstart-2-vector-search-atlas-openai.ipynb)


References

- https://cookbook.openai.com/examples/vector_databases/mongodb_atlas/semantic_search_using_mongodb_atlas_vector_search

## Big Picture


![image missing](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/vector-search-1.png)

## Step-1: Setup Atlas

Atlas needs to be set up before we continue.

Follow the [instructions here](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-1-atlas-setup/setup-atlas.md).

The [TODO quick start guide](#) also has more information.

## Step-2: Create an Atlas Index

Refer to the [TODO quickstart guide](#) for more details.

Index name: `idx_plot_embedding`

Index definition

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding",
      "numDimensions": 1536,
      "similarity": "dotProduct"
    }
  ]
}
```
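
A quick sanity check can catch mismatches between this index definition and the vectors used in queries. The sketch below is illustrative and not part of the original notebook: `check_vector_against_index` is a hypothetical helper that confirms a vector has `numDimensions` entries and, because `dotProduct` similarity expects unit-length vectors, that its norm is close to 1. The tolerance value and the synthetic vector are assumptions.

```python
import math

# Same definition as the JSON above, expressed as a Python dict.
INDEX_DEFINITION = {
    "fields": [
        {
            "type": "vector",
            "path": "plot_embedding",
            "numDimensions": 1536,
            "similarity": "dotProduct",
        }
    ]
}

def check_vector_against_index(vector, index_definition, tolerance=1e-3):
    """Hypothetical helper: return a list of problems (empty list means OK)."""
    field = index_definition["fields"][0]
    problems = []
    if len(vector) != field["numDimensions"]:
        problems.append(
            f"expected {field['numDimensions']} dimensions, got {len(vector)}"
        )
    if field["similarity"] == "dotProduct":
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > tolerance:
            problems.append(f"vector norm is {norm:.4f}, expected about 1.0")
    return problems

# Synthetic unit vector with 1536 dimensions (stands in for a real embedding).
unit_vector = [1.0 / math.sqrt(1536)] * 1536
print(check_vector_against_index(unit_vector, INDEX_DEFINITION))  # prints []
```

An empty list means the vector is compatible with the index as defined here; any other result describes what to fix before querying.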



## Step-3: Configuration

We need the following settings:
- Atlas connection credentials
- OpenAI API key

**Note: we are keeping this very simple for the purpose of this quick start.  For production systems, consider using libraries like [python-dotenv](https://pypi.org/project/python-dotenv/) to load configuration settings.**

In [3]:
# We will keep all global variables in an object to not pollute the global namespace.
class MyConfig(object):
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.ATLAS_URI = "Enter your ATLAS URI string"  ## TODO
MY_CONFIG.OPENAI_API_KEY = "Enter your OpenAI API key"  ## TODO
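
As one alternative to hard-coding credentials, the same settings can be read from environment variables. This is a minimal sketch using only the standard library; the helper `load_config` and the variable names `ATLAS_URI` and `OPENAI_API_KEY` are assumptions chosen for illustration, not something the notebook defines.

```python
import os
from types import SimpleNamespace

def load_config(env=None):
    """Hypothetical helper: build a config object from environment variables."""
    env = os.environ if env is None else env
    required = ("ATLAS_URI", "OPENAI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return SimpleNamespace(
        ATLAS_URI=env["ATLAS_URI"],
        OPENAI_API_KEY=env["OPENAI_API_KEY"],
    )

# Example with an explicit mapping (placeholder values, not real credentials).
example = load_config({"ATLAS_URI": "mongodb+srv://...", "OPENAI_API_KEY": "sk-..."})
print(example.ATLAS_URI)
```

Failing early with a clear message when a setting is missing tends to be easier to debug than a connection error later in the notebook.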


## Step-4: Install dependencies

We will install required libraries.

In [2]:
!pip install \
                openai==1.13.3 \
                pymongo==4.6.2

Collecting openai==1.13.3
  Downloading openai-1.13.3-py3-none-any.whl (227 kB)
Collecting pymongo==4.6.2
  Downloading pymongo-4.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)
Collecting httpx<1,>=0.23.0 (from openai==1.13.3)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo==4.6.2)
  Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai==1.13.3)
  Downloading httpcore

## Step-5: AtlasClient and OpenAIClient

Here are a couple of handy classes.

For the full implementation, see:

- [AtlasClient.py](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/AtlasClient.py) - a handy class to interact with Atlas
- [OpenAIClient.py](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/OpenAIClient.py) - a handy class to interact with OpenAI

In [4]:
from pymongo import MongoClient

class AtlasClient ():

    def __init__ (self, atlas_uri, dbname):
        self.mongodb_client = MongoClient(atlas_uri)
        self.database = self.mongodb_client[dbname]

    ## A quick way to test if we can connect to Atlas instance
    def ping (self):
        self.mongodb_client.admin.command('ping')

    def get_collection (self, collection_name):
        collection = self.database[collection_name]
        return collection

    def find (self, collection_name, filter=None, limit=10):
        # A filter of None matches every document (and avoids a mutable default argument)
        collection = self.database[collection_name]
        items = list(collection.find(filter=filter, limit=limit))
        return items

    # https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/
    def vector_search(self, collection_name, index_name, attr_name, embedding_vector, limit=5):
        collection = self.database[collection_name]
        results = collection.aggregate([
            {
                '$vectorSearch': {
                    "index": index_name,
                    "path": attr_name,
                    "queryVector": embedding_vector,
                    "numCandidates": 50,
                    "limit": limit,
                }
            },
            ## Return selected fields plus the 'vectorSearchScore' metadata.
            ## In $project, fields set to 1 are included; fields not listed are omitted.
            {
                "$project": {
                    '_id' : 1,
                    'title' : 1,
                    'plot' : 1,
                    'year' : 1,
                    "search_score": { "$meta": "vectorSearchScore" }
                }
            }
            ])
        return list(results)

    def close_connection(self):
        self.mongodb_client.close()
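
The `$vectorSearch` stage used in `vector_search` can also be built and checked separately from the database call. The sketch below is an illustration, not part of the linked `AtlasClient.py`: `build_vector_search_pipeline` is a hypothetical helper, and the check that `numCandidates` is at least `limit` follows the constraint described in the Atlas documentation linked in the class above.

```python
def build_vector_search_pipeline(index_name, attr_name, embedding_vector,
                                 limit=5, num_candidates=50):
    """Hypothetical helper: return the aggregation pipeline as plain Python data."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be greater than or equal to limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": attr_name,
                "queryVector": embedding_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep a few movie fields and expose the similarity score.
            "$project": {
                "_id": 1,
                "title": 1,
                "plot": 1,
                "year": 1,
                "search_score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

# Placeholder 3-element vector; a real query vector would have 1536 entries.
pipeline = build_vector_search_pipeline(
    "idx_plot_embedding", "plot_embedding", [0.1, 0.2, 0.3], limit=10
)
print(pipeline[0]["$vectorSearch"]["numCandidates"])  # prints 50
```

Keeping the pipeline as plain data makes it easy to print or inspect before passing it to `collection.aggregate(...)`.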


In [5]:
from openai import OpenAI

class OpenAIClient():
    def __init__(self, api_key) -> None:
        self.client = OpenAI(
            api_key= api_key,  # defaults to os.environ.get("OPENAI_API_KEY")
        )
        # print ("OpenAI Client initialized!")


    def chat (self, messages, model="gpt-3.5-turbo"):
        chat_completion = self.client.chat.completions.create(
                        messages=messages, model=model,)
        return chat_completion

    def get_embedding(self, text: str,  model="text-embedding-ada-002") -> list[float]:
        text = text.replace("\n", " ")
        resp = self.client.embeddings.create (
            input=[text],
            model=model  )

        return resp.data[0].embedding

## Step-6: Connect to Atlas

See if we can connect to our Atlas cloud instance.

If this step fails, make sure 'connect from anywhere' is enabled in your Atlas network configuration (see the [TODO quickstart guide](#)).


In [6]:
MY_CONFIG.DB_NAME = 'sample_mflix'
MY_CONFIG.COLLECTION_NAME = 'embedded_movies'
MY_CONFIG.INDEX_NAME = 'idx_plot_embedding'

In [7]:
atlas_client = AtlasClient (MY_CONFIG.ATLAS_URI, MY_CONFIG.DB_NAME)
atlas_client.ping()
print ('Connected to Atlas instance! We are good to go!')

Connected to Atlas instance! We are good to go!


## Step-7: Initialize OpenAI Client

In [8]:
openAI_client = OpenAIClient (api_key=MY_CONFIG.OPENAI_API_KEY)
print ("OpenAI client initialized")

OpenAI client initialized


## Step-8: Do a Vector Search

Now that we have everything set up, this is the fun part!

We are going to query movies by the meaning of the query, not just by keywords that appear in the plot.

See the examples below, and try your own!

The process is as follows:

- Convert the query into an embedding (using the OpenAI API)
- Send the embedding to Atlas and get the results

### Note the Score

In addition to movie attributes (title, year, plot, etc.), we also display `search_score`.  This is a meta attribute: it is not stored in the movies collection but is generated by the vector search.

The score is a number between 0 and 1.  Values closer to 1 indicate a better match, and the results are sorted from best match down (scores closest to 1 first).

[You can read more about search score here](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/#atlas-vector-search-score)
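
For intuition, the Atlas documentation linked above describes the score for `dotProduct` (and `cosine`) similarity as the raw similarity rescaled from the range [-1, 1] to [0, 1]. The sketch below reproduces that rescaling on tiny made-up vectors; `vector_search_score` and `normalize` are hypothetical helpers, and the 2-dimensional vectors are for illustration only.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def vector_search_score(query_vector, document_vector):
    """Hypothetical helper: (1 + dot product) / 2 for unit-length vectors."""
    dot = sum(q * d for q, d in zip(query_vector, document_vector))
    return (1 + dot) / 2

query = normalize([1.0, 0.0])
same_direction = normalize([2.0, 0.0])
orthogonal = normalize([0.0, 3.0])
opposite = normalize([-1.0, 0.0])

print(vector_search_score(query, same_direction))  # 1.0
print(vector_search_score(query, orthogonal))      # 0.5
print(vector_search_score(query, opposite))        # 0.0
```

Under this rescaling, a score of 0.5 corresponds to unrelated (orthogonal) vectors, which helps put the recorded scores above 0.9 in context.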


### Troubleshooting

#### No search results?

Make sure the vector search index is defined and active! (Step-2)
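
One way to check the index from Python is to inspect the documents returned by PyMongo's `Collection.list_search_indexes()`. The sketch below keeps the database call out of the helper so the logic can be tried on sample data. `describe_index` is a hypothetical helper, and the `name`, `status`, and `queryable` fields are assumptions about the shape of those documents, so verify them against the output from your own cluster.

```python
def describe_index(index_docs, index_name):
    """Hypothetical helper: summarize whether a named search index is usable."""
    for doc in index_docs:
        if doc.get("name") == index_name:
            if doc.get("queryable"):
                return f"Index '{index_name}' is queryable."
            status = doc.get("status", "unknown")
            return (f"Index '{index_name}' exists but is not queryable yet "
                    f"(status: {status}).")
    return f"Index '{index_name}' was not found. Create it as described in Step-2."

# Made-up sample documents standing in for real cluster output.
sample_docs = [{"name": "idx_plot_embedding", "status": "READY", "queryable": True}]
print(describe_index(sample_docs, "idx_plot_embedding"))

# Against a live cluster (assumes the atlas_client object from Step-6):
# collection = atlas_client.get_collection(MY_CONFIG.COLLECTION_NAME)
# print(describe_index(collection.list_search_indexes(), MY_CONFIG.INDEX_NAME))
```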

In [9]:
import time

# Handy function
def do_vector_search (query:str) -> None:
    # cleanup query
    query = query.lower().strip()
    print ('query: ', query)

    # use openAI API to get embeddings for query text
    t1a = time.perf_counter()
    embedding = openAI_client.get_embedding(query)
    t1b = time.perf_counter()
    print (f"Getting embeddings from OpenAI took {(t1b-t1a)*1000:,.0f} ms")

    # And use the returned embeddings to perform vector search in Atlas
    t2a = time.perf_counter()
    movies = atlas_client.vector_search(
        collection_name=MY_CONFIG.COLLECTION_NAME,
        index_name=MY_CONFIG.INDEX_NAME,
        attr_name='plot_embedding',
        embedding_vector=embedding,
        limit=10,
    )
    t2b = time.perf_counter()

    # print out the results
    print (f"Atlas query returned {len (movies)} movies in {(t2b-t2a)*1000:,.0f} ms")
    print()

    for idx, movie in enumerate (movies):
        print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}' +
            f'\nsearch_score(meta):{movie["search_score"]}\nplot: {movie["plot"]}\n')

In [10]:
query="humans fighting aliens"

do_vector_search (query=query)

query:  humans fighting aliens
Getting embeddings from OpenAI took 430 ms
Atlas query returned 10 movies in 474 ms

1
id: 573a1398f29313caabce8f83
title: V: The Final Battle,
year: 1984
search_score(meta):0.9573380947113037
plot: A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.

2
id: 573a13c7f29313caabd75324
title: Falling Skies,
year: 2011è
search_score(meta):0.955032467842102
plot: Survivors of an alien attack on earth gather together to fight for their lives and fight back.

3
id: 573a139af29313caabcf0cff
title: Starship Troopers,
year: 1997
search_score(meta):0.952342689037323
plot: Humans in a fascistic, militaristic future do battle with giant alien bugs in a fight for survival.

4
id: 573a139ff29313caabd000f6
title: Battlefield Earth,
year: 2000
search_score(meta):0.9512579441070557
plot: After enslavement & near extermination by an alien race in the year 3000, humanity begins to fight back.



In [11]:
query="relationship drama between two good friends"

do_vector_search (query=query)

query:  relationship drama between two good friends
Getting embeddings from OpenAI took 432 ms
Atlas query returned 10 movies in 197 ms

1
id: 573a13a3f29313caabd0dfe2
title: Dark Blue World,
year: 2001
search_score(meta):0.9380691051483154
plot: The friendship of two men becomes tested when they both fall for the same woman.

2
id: 573a13a3f29313caabd0e14b
title: Dark Blue World,
year: 2001
search_score(meta):0.9380691051483154
plot: The friendship of two men becomes tested when they both fall for the same woman.

3
id: 573a1399f29313caabcec488
title: Once a Thief,
year: 1991
search_score(meta):0.9260262250900269
plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.

4
id: 573a13b3f29313caabd3b197
title: Hulchul,
year: 2004
search_score(meta):0.9249671697616577
plot: A man and woman from feuding families each pretend to fall in love, as part of a revenge plot. Chaos ensues when th

### Try your own searches!

Update the query string to whatever you like, and run it.

Remember, if you want to try queries other than the ones we cached, you will need your OPENAI_API_KEY.

In [12]:
## TODO: enter your query here
# query="technology gone wrong"

# do_vector_search (query=query)
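
This notebook does not show how queries are cached. As a rough sketch of the idea, a small wrapper can store embeddings by query text so that repeated queries do not call the OpenAI API again. `CachedEmbedder` and `fake_embed` are hypothetical names introduced here; in the notebook you would pass `openAI_client.get_embedding` instead of the stand-in function.

```python
class CachedEmbedder:
    """Hypothetical wrapper: remember embeddings by normalized query text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.api_calls = 0  # counts how often embed_fn was actually invoked

    def get_embedding(self, text):
        key = text.lower().strip()  # same cleanup as do_vector_search
        if key not in self.cache:
            self.api_calls += 1
            self.cache[key] = self.embed_fn(key)
        return self.cache[key]

def fake_embed(text):
    """Stand-in for a real embedding call; returns a made-up 2-element vector."""
    return [float(len(text)), 0.0]

embedder = CachedEmbedder(fake_embed)
embedder.get_embedding("humans fighting aliens")
embedder.get_embedding("Humans fighting aliens ")  # same key after cleanup
print(embedder.api_calls)  # prints 1
```

An in-memory dictionary only lasts for the current session; persisting the cache to a file would be needed to reuse embeddings across runs.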


In [13]:
## Close connection

# atlas_client.close_connection()