# Load Movie Data and Query with Embedding Model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/llm-workshop/blob/main/embeddings/4_embeddings_similarty_search.ipynb)


The [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on Hugging Face ranks embedding models. We can sort the models by overall score or by RAG performance.

[How to Choose the Right Embedding Model for Your LLM Application](https://www.mongodb.com/developer/products/atlas/choose-embedding-model-rag/)


## Colab Setup

In [None]:
# Are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    RUNNING_IN_COLAB = True
else:
    print("NOT running in Colab")
    RUNNING_IN_COLAB = False

# Install the extra packages only when running in Colab
if RUNNING_IN_COLAB:
    !pip install --default-timeout=100 sentence_transformers datasets

NOT running in Colab


In [1]:
from datasets import load_dataset

dataset = load_dataset("MongoDB/embedded_movies")

movies = dataset['train']



In [2]:
import pandas as pd

movies_df = pd.DataFrame(movies)

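# Cleanup: keep only rows that have a plot, and drop the precomputed
# embedding column because we will compute our own below
movies_df = movies_df[movies_df["plot"].notna()]
movies_df = movies_df.drop('plot_embedding', axis=1)

movies_df.info()  # info() prints its report directly and returns None
movies_df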

<class 'pandas.core.frame.DataFrame'>
Index: 1473 entries, 0 to 1499
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   plot                1473 non-null   object 
 1   runtime             1459 non-null   float64
 2   genres              1473 non-null   object 
 3   fullplot            1452 non-null   object 
 4   directors           1460 non-null   object 
 5   writers             1460 non-null   object 
 6   countries           1473 non-null   object 
 7   poster              1395 non-null   object 
 8   languages           1472 non-null   object 
 9   cast                1472 non-null   object 
 10  title               1473 non-null   object 
 11  num_mflix_comments  1473 non-null   int64  
 12  rated               1189 non-null   object 
 13  imdb                1473 non-null   object 
 14  awards              1473 non-null   object 
 15  type                1473 non-null   object 
 16  metacritic 

Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,"In the ironically named city of Paradise, a re...",100.0,"[Action, Comedy, Thriller]",The story begins with a regular Joe who tries ...,[Uwe Boll],"[Uwe Boll, Bryan C. Knight]","[USA, Canada, Germany]",https://m.media-amazon.com/images/M/MV5BMTIzMD...,[English],"[Zack Ward, Dave Foley, Chris Coppola, Jackie ...",Postal,0,R,"{'id': 486640, 'rating': 4.4, 'votes': 19641}","{'nominations': 3, 'text': '1 win & 3 nominati...",movie,22.0
1496,A group of suburban biker wannabes looking for...,100.0,"[Action, Adventure, Comedy]",Four middle-aged men decide to take a road tri...,[Walt Becker],[Brad Copeland],[USA],https://m.media-amazon.com/images/M/MV5BZWZlMz...,[English],"[Tim Allen, John Travolta, Martin Lawrence, Wi...",Wild Hogs,0,PG-13,"{'id': 486946, 'rating': 5.9, 'votes': 94699}","{'nominations': 3, 'text': '3 nominations.', '...",movie,27.0
1497,"Shakespeare's masterpiece ""Othello"" set in mod...",155.0,"[Action, Crime, Drama]",Advocate Raghunath Mishra has arranged the mar...,[Vishal Bhardwaj],"[Vishal Bhardwaj (screenplay), Robin Bhatt (sc...",[India],https://m.media-amazon.com/images/M/MV5BY2NmNj...,[Hindi],"[Ajay Devgn, Kareena Kapoor, Saif Ali Khan, Ko...",Omkara,1,,"{'id': 488414, 'rating': 8.2, 'votes': 9800}","{'nominations': 13, 'text': '14 wins & 13 nomi...",movie,
1498,When a small Colorado town is overrun by the f...,86.0,"[Action, Horror]","In Leadville, Colorado, Captain Rhodes and his...",[Steve Miner],"[Jeffrey Reddick (screenplay), George A. Romer...",[USA],https://m.media-amazon.com/images/M/MV5BNzg1Mj...,[English],"[Mena Suvari, Nick Cannon, Michael Welch, Anna...",Day of the Dead,1,R,"{'id': 489018, 'rating': 4.5, 'votes': 17177}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,


## Calculate Embeddings for Data Frame

In [3]:
from typing import List
from sentence_transformers import SentenceTransformer

model_name = "BAAI/bge-small-en-v1.5"
model = SentenceTransformer(model_name, trust_remote_code=True)

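def get_embeddings2(string: str, model) -> List[float]:
    """Return the embedding vector for `string` (model.encode returns a NumPy array by default)."""
    # Replace newlines with spaces; embedded newlines can affect results
    string = string.replace('\n', ' ')
    embeddings = model.encode(string)
    return embeddings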




In [4]:

embeddings = get_embeddings2 ('hello world', model)
print (len(embeddings))
print (embeddings[:10])

384
[ 0.01519612 -0.02257068  0.00854709 -0.07417059  0.00383641  0.00271351
 -0.03126793  0.04463398  0.04405522 -0.00787113]


In [5]:
def df_calculate_embeddings (row, model) -> List[float]:
    return get_embeddings2 (row['plot'], model)

In [6]:
%%time

movies_df ['plot_embeddings']  = movies_df.apply(lambda row: df_calculate_embeddings(row, model), axis=1)

CPU times: user 10.8 s, sys: 24.5 ms, total: 10.8 s
Wall time: 10.8 s
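
The row-by-row `apply` above calls `model.encode` once per plot. `SentenceTransformer.encode` also accepts a list of strings and processes it in batches, which is usually faster. The sketch below illustrates that pattern. `encode_plots` and `StubModel` are hypothetical names introduced here, and the stub exists only so the snippet runs without downloading a model.

```python
from typing import List, Sequence

import numpy as np


def encode_plots(texts: Sequence[str], model, batch_size: int = 64) -> List[np.ndarray]:
    """Encode all texts in one call; returns one 1-D vector per input text."""
    # Same cleanup as get_embeddings2: replace newlines with spaces
    cleaned = [t.replace("\n", " ") for t in texts]
    vectors = model.encode(cleaned, batch_size=batch_size, show_progress_bar=False)
    return list(vectors)


class StubModel:
    """Hypothetical stand-in for SentenceTransformer (assumption: only `encode` is needed)."""

    def encode(self, sentences, batch_size=32, show_progress_bar=None):
        # Fake 2-dimensional "embedding": [character count, word count]
        return np.array([[len(s), len(s.split())] for s in sentences], dtype=np.float32)


plot_vectors = encode_plots(["line one\nline two", "hello world"], StubModel())
print(plot_vectors)
```

With the real model, something like `movies_df['plot_embeddings'] = encode_plots(movies_df['plot'].tolist(), model)` should fill the column in one pass. The speed-up was not measured for this notebook.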


In [7]:
movies_df.head(5)
# we should see 'plot_embeddings' column now

Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic,plot_embeddings
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.030162292, 0.021669663, 0.021604462, -0.04..."
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.072330326, 0.027128397, -0.038543023, -0.0..."
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.021819035, 0.07047935, -0.007299974, 0.022..."
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.03924741, 0.02563638, 0.029713377, -0.0648..."
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.049515534, 0.047105137, 0.006521498, -0.00..."


## Calculate Cosine Similarity

In [8]:
from sentence_transformers.util import cos_sim

def df_calculate_cosine(row, embedding_col_name: str, query_embedding: List[float]) -> float:
    # cos_sim returns a 2-D tensor of shape (1, 1). The value is a similarity
    # (higher means closer), not a distance.
    similarity = cos_sim(row[embedding_col_name], query_embedding)
    return similarity.tolist()[0][0]
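
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula, not the `sentence_transformers` implementation. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """dot(a, b) / (|a| * |b|); the result lies between -1 and 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not model output)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # opposite directions
print(same_direction, orthogonal, opposite)
```

Because only the angle between the vectors matters, scaling a vector does not change its score, which is why the first pair scores as identical.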

## Vector Search

In [9]:
# Convenience function: embed the query, score every movie, and return the top 5 matches

def do_vector_search(query: str, model):
    query_embedding = get_embeddings2(string=query, model=model)
    # Note: this adds (or overwrites) a 'similarity' column on the global movies_df
    movies_df['similarity'] = movies_df.apply(lambda row: df_calculate_cosine(row, embedding_col_name='plot_embeddings', query_embedding=query_embedding), axis=1)
    pd.set_option('display.max_colwidth', None)  # display full column width
    movies_df_sorted = movies_df.sort_values(by='similarity', ascending=False)
    return movies_df_sorted[['title', 'plot', 'similarity']].head(5)
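
The same ranking can also be computed without a per-row `apply`, by stacking all embeddings into one matrix and scoring them with a single matrix-vector product. The sketch below shows the idea with NumPy on toy 3-dimensional vectors. `top_k_by_cosine` is a hypothetical helper, and the numbers are made up for illustration.

```python
import numpy as np


def top_k_by_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Return (row indices, scores) of the k rows most similar to query_vec."""
    # Normalise to unit length so a plain dot product equals cosine similarity
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm       # shape: (n_docs,)
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]


# Toy "embeddings" (hypothetical values, not model output)
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.8, 0.6, 0.0])

idx, sims = top_k_by_cosine(query, docs, k=2)
print(idx, sims)
```

In this notebook, the document matrix could presumably be built with `np.vstack(movies_df['plot_embeddings'].tolist())`, and the returned positions mapped back to rows with `movies_df.iloc[idx]`.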


In [11]:
do_vector_search(query='where humans fight aliens', model=model)

Unnamed: 0,title,plot,similarity
691,Independence Day,"The aliens are coming and their goal is to invade and destroy Earth. Fighting superior technology, mankind's best weapon is the will to survive.",0.823629
759,Starship Troopers,"Humans in a fascistic, militaristic future do battle with giant alien bugs in a fight for survival.",0.787265
257,V: The Final Battle,A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.,0.745812
286,Enemy Mine,"A soldier from Earth crash-lands on an alien world after sustaining battle damage. Eventually he encounters another survivor, but from the enemy species he was fighting; they band together ...",0.735677
904,Battlefield Earth,"After enslavement & near extermination by an alien race in the year 3000, humanity begins to fight back.",0.728381


In [10]:
do_vector_search(query='relationship drama between two friends', model=model)

Unnamed: 0,title,plot,similarity
1390,Varalaaru,Relationships become entangled in an emotional web.,0.73714
987,Dark Blue World,The friendship of two men becomes tested when they both fall for the same woman.,0.725312
988,Dark Blue World,The friendship of two men becomes tested when they both fall for the same woman.,0.725312
1375,Harsh Times,A tough-minded drama about two friends in South Central Los Angeles and the violence that comes between them.,0.704631
471,Once a Thief,"A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.",0.69296
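
Note that "Dark Blue World" appears twice with the same plot and score, which suggests the dataset contains duplicate rows. One possible fix, sketched here on a toy frame, is to drop duplicates on the `title` and `plot` columns before computing embeddings. Restricting `subset` to string columns is deliberate: several other columns hold lists, which `drop_duplicates` is likely unable to hash. The names `toy_df` and `deduped_df` are illustrative only.

```python
import pandas as pd

# Toy frame reproducing the duplicate seen in the search results
toy_df = pd.DataFrame({
    "title": ["Dark Blue World", "Dark Blue World", "Harsh Times"],
    "plot": [
        "The friendship of two men becomes tested when they both fall for the same woman.",
        "The friendship of two men becomes tested when they both fall for the same woman.",
        "A tough-minded drama about two friends in South Central Los Angeles and the violence that comes between them.",
    ],
})

# Keep the first occurrence of each (title, plot) pair
deduped_df = toy_df.drop_duplicates(subset=["title", "plot"], keep="first")
print(deduped_df["title"].tolist())
```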


In [12]:
do_vector_search(query='futuristic christmas', model=model)

Unnamed: 0,title,plot,similarity
1297,The Girl from Monday,A comic drama about a time in the near future when citizens are happy to be property traded on the stock exchange.,0.606868
1263,Naechureol siti,"In the year 2080, the world is connected by a massive computer network. Combiners have developed a process that allows them to merge the souls of human and machine/cyborg, wreaking havoc in...",0.603312
1116,Immortal (Ad Vitam),"In the distant future, Earth is occupied by ancient gods and genetically altered humans. When a god is sentenced to death he seeks a new human host and a woman to bear his child.",0.598992
1018,Megiddo: The Omega Code 2,Megiddo is a supernatural ride into a world teetering on the edge of the Apocalypse. It follows the rise of a Machiavellian leader bent on amassing the armies of the world for the battle of...,0.592574
1457,9,A rag doll that awakens in a postapocalyptic future holds the key to humanity's salvation.,0.589821


In [13]:
do_vector_search(query='fatalistic sci-fi movies', model=model)

Unnamed: 0,title,plot,similarity
151,Logan's Run,An idyllic sci-fi future has one major drawback: life must end at 30.,0.635371
1071,Forklift Driver Klaus: The First Day on the Job,Short film depicting a fictional educational film about fork lift truck operational safety. The dangers of unsafe operation are presented in gory details.,0.615401
1102,Forklift Driver Klaus: The First Day on the Job,Short film depicting a fictional educational film about fork lift truck operational safety. The dangers of unsafe operation are presented in gory details.,0.615401
322,Dead End Drive-In,"In the near future, drive-in theatres are turned into concentration camps for the undesirable and unemployed. The prisoners don't really care to escape because they are fed and they have a ...",0.607888
208,Assassination Attempt,Alain Delon and Claude Jade stars in this Soviet movie: Documents reveal in 1980 that the Germans planned to kill the Big Three in Teheran in 1943.,0.602526
