# Vector Search on Mongo Atlas

This Python notebook shows you how to connect to MongoDB Atlas programmatically and how to perform an Atlas Vector Search.

In [1]:
import os, sys
import pprint
import json
import time

# Add the parent (root) directory to the module search path
sys.path.insert(0, '../')

## Load Settings

In [2]:
# Load settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# For debugging purposes
# print (config)

ATLAS_URI = config.get('ATLAS_URI')
OPENAI_API_KEY = config.get("OPENAI_API_KEY")

if not ATLAS_URI:
    raise Exception("'ATLAS_URI' is not set. Please set it in your .env file to continue...")
else:
    # Avoid printing the connection string itself: it contains credentials.
    print("ATLAS_URI connection string found.")

if not OPENAI_API_KEY:
    raise Exception("'OPENAI_API_KEY' is not set. Please set it in your .env file to continue...")
else:
    # Avoid printing the API key itself.
    print("OPENAI_API_KEY found.")

ATLAS_URI connection string found.
OPENAI_API_KEY found.


In [3]:
# Our variables
DB_NAME = 'sample_mflix'
COLLECTION_NAME = 'embedded_movies'
INDEX_NAME = 'idx_plot_embedding'
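
The index named by `INDEX_NAME` is assumed to exist already on the `plot_embedding` field. As a rough sketch, an Atlas Vector Search index definition for this collection might look like the dictionary below. The 1536 dimensions match the embedding length used in this notebook; the `cosine` similarity setting is an assumption, not something this notebook confirms.

```python
import json

# Hypothetical index definition for the `plot_embedding` field.
# "numDimensions" matches the 1536-value embeddings used in this notebook.
# "similarity" is an assumption ("euclidean" and "dotProduct" are the other options).
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "plot_embedding",
            "numDimensions": 1536,
            "similarity": "cosine",
        }
    ]
}

print(json.dumps(vector_index_definition, indent=2))
```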

## Find My IP Address

This IP address should be added to Atlas' 'access list' for the connection to work. If you completed Quest 1, this should be configured correctly already.

In [None]:
from urllib.request import urlopen
ip = urlopen('https://api.ipify.org').read()
decoded_ip = ip.decode('utf-8')

print (f"My IP address is {decoded_ip} \nMake sure that this IP address is allowed to connect to cloud Atlas")

## Initialize Mongo Atlas Client

We start by initializing a connection to MongoDB Atlas.

In [4]:
from AtlasClient import AtlasClient

atlas_client = AtlasClient (ATLAS_URI, DB_NAME)
print("Connected to the Mongo Atlas database!")

Connected to the Mongo Atlas database!


## Initialize OpenAI Client
Recall that we'll be using OpenAI as our embedding model. Although the embedded_movies dataset already contains embeddings, we still need an embedding model to generate embeddings for our input queries, so that each query can be compared against the vectors stored in the database (i.e. vectors are compared against vectors, rather than text against vectors).

In [5]:
from OpenAIClient import OpenAIClient

openAI_client = OpenAIClient(api_key=OPENAI_API_KEY)
print ("OpenAI client initialized!")

OpenAI client initialized!
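
`OpenAIClient` is a helper module that ships alongside this notebook, and its source is not shown here. A minimal sketch of what such a wrapper might do is given below, assuming the `openai` Python package (version 1.x) and an assumed model name of `text-embedding-ada-002`, which returns 1536-dimensional vectors. The class name `SketchEmbeddingClient` is hypothetical.

```python
class SketchEmbeddingClient:
    """Hypothetical stand-in for the notebook's OpenAIClient helper."""

    def __init__(self, client, model="text-embedding-ada-002"):
        # `client` is expected to be an `openai.OpenAI` instance (openai>=1.0).
        # The model name is an assumption; the notebook does not state it.
        self.client = client
        self.model = model

    def get_embedding(self, text):
        # Ask the embeddings endpoint for the vector of a single string.
        response = self.client.embeddings.create(model=self.model, input=text)
        return response.data[0].embedding
```

To try it against the real API, you would pass in `OpenAI(api_key=OPENAI_API_KEY)` from the `openai` package as the `client` argument.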


With our OpenAI client initialized, let's do a **quick vectorization test** as a sanity check! Essentially, what we're doing here is using the vectorizer provided by OpenAI to get the vector representation (i.e. numerical representation) of the string "a futuristic Sci-fi movie".

In [6]:
text = 'a futuristic Sci-fi movie'

embedding = openAI_client.get_embedding(text)
print(f"Text: '{text}'\nEmbedding_length: {len(embedding)}\nFirst 10 numbers of embedding:", embedding[:10])

Text: 'a futuristic Sci-fi movie'
Embedding_length: 1536
First 10 numbers of embedding: [-0.008840689435601234, -0.034085843712091446, -0.011264224536716938, -0.030672045424580574, 0.009133859537541866, 0.015166636556386948, -0.021590305492281914, -0.027388548478484154, -0.006205421406775713, -0.016391433775424957]
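
To make "comparing vectors against vectors" concrete, here is a small sketch of cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors and their labels are made up for illustration; the real embeddings here have 1536 dimensions, and the similarity function Atlas actually uses depends on how the index was defined.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration, not real embeddings).
query_vec = [1.0, 0.0, 0.0]
toy_plots = {
    "alien invasion": [0.9, 0.1, 0.0],
    "romantic comedy": [0.0, 1.0, 0.0],
    "space documentary": [0.5, 0.5, 0.0],
}

# Rank the toy plots from most to least similar to the query.
ranked = sorted(
    toy_plots,
    key=lambda name: cosine_similarity(query_vec, toy_plots[name]),
    reverse=True,
)
print(ranked)
```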


## Atlas Vector Search 
Now for the fun part! We are going to do an embedding search on our embedded_movies dataset based on movie plots. What this means is that we're searching for movies based on the **meaning** of their plots.

We're **not** searching for keywords within plots; we're searching for movies whose plots are semantically closest to our input query.
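
The `atlas_client.vector_search` method used in this notebook comes from its own `AtlasClient` helper, whose source is not shown. On Atlas, a vector search is typically expressed as an aggregation pipeline with a `$vectorSearch` stage. The function below is a hypothetical sketch of how such a pipeline could be built; the `numCandidates` multiplier and the projected fields are assumptions.

```python
def build_vector_search_pipeline(index_name, attr_name, embedding_vector, limit=10):
    """Build an aggregation pipeline for Atlas Vector Search (illustrative sketch)."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": attr_name,
                "queryVector": embedding_vector,
                # Assumption: consider 10x more candidates than results returned.
                "numCandidates": limit * 10,
                "limit": limit,
            }
        },
        {
            # Keep a few readable fields and expose the similarity score.
            "$project": {
                "title": 1,
                "year": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# The short vector here is a placeholder; a real query vector would have 1536 values.
pipeline = build_vector_search_pipeline(
    "idx_plot_embedding", "plot_embedding", [0.1, 0.2, 0.3], limit=5
)
print(pipeline[0]["$vectorSearch"]["limit"])
```

With `pymongo`, a pipeline like this would be passed to `collection.aggregate(pipeline)`.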

Check out the examples below!

In [7]:
query = "imaginary characters from outerspace at war with earthlings"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')

Found 10 movies
1
id: 573a1398f29313caabce8f83
title: V: The Final Battle,
year: 1984
plot: A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.

2
id: 573a13d7f29313caabda215e
title: Pixels,
year: 2015
plot: When aliens misinterpret video feeds of classic arcade games as a declaration of war, they attack the Earth in the form of the video games.

3
id: 573a139ff29313caabd000f6
title: Battlefield Earth,
year: 2000
plot: After enslavement & near extermination by an alien race in the year 3000, humanity begins to fight back.

4
id: 573a13c7f29313caabd75324
title: Falling Skies,
year: 2011
plot: Survivors of an alien attack on earth gather together to fight for their lives and fight back.

5
id: 573a13a9f29313caabd1e90b
title: Battlestar Galactica,
year: 2003
plot: A re-imagining of the original series in which a "rag-tag fugitive fleet" of the last remnants of mankind flees pursuing robots while simultaneo

In [8]:
query = "superheroes saving earth"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')

Found 10 movies
1
id: 573a139bf29313caabcf3a3f
title: Mystery Men,
year: 1999
plot: A group of inept amateur superheroes must try to save the day when a supervillian threatens to destroy a major superhero and the city.

2
id: 573a13a9f29313caabd1f35b
title: The Incredibles,
year: 2004
plot: A family of undercover superheroes, while trying to live the quiet suburban life, are forced into action to save the world.

3
id: 573a13bbf29313caabd536b1
title: Justice League: The New Frontier,
year: 2008
plot: In the 1950s, a new generation of superheroes must join forces with the community's active veterans and a hostile US government to fight a menace to Earth.

4
id: 573a13baf29313caabd506e4
title: The Avengers,
year: 2012
plot: Earth's mightiest heroes must come together and learn to fight as a team if they are to stop the mischievous Loki and his alien army from enslaving humanity.

5
id: 573a13d1f29313caabd8db32
title: Superheroes,
year: 2011
plot: A journey inside the world of real life c

## Try Your Own Searches!

As you can see from the sample searches above, the results are ranked by how closely the semantic meaning of each movie's `plot` field matches our query. This is the power of Atlas Vector Search - we're searching by comparing semantic meaning (i.e. comparing vectors), as opposed to merely matching values 1:1.

Now, try to search for your own query! **Replace the placeholder value in the string below and enter your own custom query**. 

In [None]:
# TODO: enter your query here
query = "REPLACE-WITH-YOUR-QUERY"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')
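
The search cells above repeat the same formatting logic. As an optional refactor, a small hypothetical helper such as `format_movie` below could build the text for one result. It uses `dict.get` so that a document missing a field (an assumption about the data, not something observed above) does not raise a `KeyError`.

```python
def format_movie(idx, movie):
    """Return a printable block for one search result (hypothetical helper)."""
    return (
        f"{idx}\n"
        f"id: {movie.get('_id', 'n/a')}\n"
        f"title: {movie.get('title', 'n/a')}\n"
        f"year: {movie.get('year', 'n/a')}\n"
        f"plot: {movie.get('plot', 'n/a')}\n"
    )


# Sample document with the `plot` field deliberately left out.
sample = {"_id": "573a139bf29313caabcf3a3f", "title": "Mystery Men", "year": 1999}
print(format_movie(1, sample))
```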

Good job following along to the end! Now let's **head back to StackUp** to complete our submission.