# Vector Search of Movie Plots with Milvus

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/llm-workshop/blob/main/vector-db/milvus_2_movie_search.ipynb)


This notebook demonstrates how to do 'semantic search' on movie plots. For example, we can query for movies like:

- "Where humans fight aliens"
- "Relationship drama between two good friends"

We will:

- 👉 Use this [movie data](https://huggingface.co/datasets/MongoDB/embedded_movies)
- 👉 Index the plot text using embedding models
- 👉 Load the indexed data into [Milvus](https://milvus.io/), a popular vector database
- 👉 Run queries

References
- [Milvus quick start](https://milvus.io/docs/quickstart.md)

**This notebook is designed to run in both a local Python environment and Google Colab 😄**

## Embedding models

See the Hugging Face embedding models (sentence transformers) here: https://huggingface.co/models?library=sentence-transformers&sort=trending

Here are a few select models for comparison, taken from the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

| model name                              | overall score | model params | model size | embedding length | url                                                            |
|-----------------------------------------|---------------|--------------|------------|------------------|----------------------------------------------------------------|
| intfloat/e5-mistral-7b-instruct         | 66.x          | 7.11 B       | 15 GB      | 4096             | https://huggingface.co/intfloat/e5-mistral-7b-instruct         |
| BAAI/bge-large-en-v1.5                  | 64.x          | 335 M        | 1.34 GB    | 1024             | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 33.5 M       | 133 MB     | 384              | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          |              | 438 MB     | 768              | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          |              | 134 MB     | 384              | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          |              | 91 MB      | 384              | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |
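
Embedding length also determines how much space the vectors take. As a rough sketch, assuming each dimension is stored as a 32-bit float (4 bytes) and ignoring index and metadata overhead, the raw vector size for the 1473 plots loaded later in this notebook can be estimated as follows. The helper name `raw_vector_bytes` is made up for illustration.

```python
# Rough estimate of raw vector storage: rows x dimensions x bytes per value.
# Assumption: float32 values (4 bytes each); index and metadata overhead are ignored.
BYTES_PER_FLOAT32 = 4

def raw_vector_bytes(num_rows: int, embedding_length: int) -> int:
    return num_rows * embedding_length * BYTES_PER_FLOAT32

NUM_MOVIES = 1473  # number of movies with plots loaded later in this notebook

for name, dim in [
    ("BAAI/bge-small-en-v1.5", 384),
    ("sentence-transformers/all-mpnet-base-v2", 768),
    ("BAAI/bge-large-en-v1.5", 1024),
    ("intfloat/e5-mistral-7b-instruct", 4096),
]:
    size_mb = raw_vector_bytes(NUM_MOVIES, dim) / 1_000_000
    print(f"{name}: {size_mb:.2f} MB")
```

For a dataset this small, even the largest model's vectors stay in the tens of megabytes, so the choice here is driven more by model size and encoding speed than by storage.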


## Colab Setup

<span style="color:red;">**A note for Google Colab Users**</span>

<span style="color:red;">After installing the dependencies, if you get errors loading libraries, **restart the runtime** and **run the notebook** again</span>

In [1]:
# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT running in Colab")
   RUNNING_IN_COLAB = False
   
if RUNNING_IN_COLAB:
  !pip install  --default-timeout=100  pymilvus  pymilvus[model]  datasets  sentence-transformers

NOT running in Colab


## Config

In [3]:
class MyConfig:
    pass

MY_CONFIG = MyConfig()

## different embedding models to try

# MY_CONFIG.MODEL_NAME = "BAAI/bge-large-en-v1.5"
# MY_CONFIG.EMBEDDING_LENGTH = 1024

MY_CONFIG.MODEL_NAME = "BAAI/bge-small-en-v1.5"
MY_CONFIG.EMBEDDING_LENGTH = 384

# MY_CONFIG.MODEL_NAME = "all-mpnet-base-v2"
# MY_CONFIG.EMBEDDING_LENGTH = 768

# MY_CONFIG.MODEL_NAME = "all-MiniLM-L6-v2"
# MY_CONFIG.EMBEDDING_LENGTH = 384
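
Because the collection dimension must match the model's output size, it can help to keep each model name and its embedding length together. The sketch below is one possible alternative to the config above, not part of the original notebook; `EmbeddingConfig` and `KNOWN_MODELS` are hypothetical names, and the lengths come from the table in the Embedding models section.

```python
from dataclasses import dataclass

# Embedding lengths taken from the comparison table in the Embedding models section.
KNOWN_MODELS = {
    "BAAI/bge-large-en-v1.5": 1024,
    "BAAI/bge-small-en-v1.5": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}

@dataclass(frozen=True)
class EmbeddingConfig:
    model_name: str
    embedding_length: int

    @classmethod
    def for_model(cls, model_name: str) -> "EmbeddingConfig":
        # Fail early on an unknown or misspelled model name
        if model_name not in KNOWN_MODELS:
            raise ValueError(f"Unknown model: {model_name!r}")
        return cls(model_name, KNOWN_MODELS[model_name])

config = EmbeddingConfig.for_model("BAAI/bge-small-en-v1.5")
print(config.model_name, config.embedding_length)
```

Looking the length up from the name means switching models is a one-line change, and a typo in the name fails immediately instead of at insert time.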



## Load Data

We will load the movie data, which has the following fields:

- plot: A brief summary of the movie's plot.
- title: The title of the movie.
- and many more ...

See the [dataset description](https://huggingface.co/datasets/MongoDB/embedded_movies) for the full list of fields.


In [4]:
from datasets import load_dataset

dataset = load_dataset("MongoDB/embedded_movies")['train']

# convert the dataset to a list of dicts; we only want movies with plots
movies = [row for row in dataset if row['plot']]
print (f'Loaded {len(movies)} movies')

# select a few attributes
movies = [{k : v for k, v in m.items() if k in ['title', 'plot']} for m in movies ]


Loaded 1473 movies
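
The filter `if row['plot']` drops rows whose plot is missing or empty, since both `None` and `''` are falsy in Python. A tiny sketch with made-up rows (not taken from the dataset) shows the effect of the two steps:

```python
# Made-up sample rows for illustration; the real dataset has many more fields.
sample_rows = [
    {"title": "Movie A", "plot": "Humans defend Earth.", "year": 1996},
    {"title": "Movie B", "plot": None, "year": 2001},
    {"title": "Movie C", "plot": "", "year": 2010},
]

# Step 1: keep only rows with a non-empty plot (None and '' are both falsy)
with_plots = [row for row in sample_rows if row["plot"]]

# Step 2: keep only the attributes we need
KEEP = ("title", "plot")
trimmed = [{k: row[k] for k in KEEP} for row in with_plots]

print(trimmed)
```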


In [5]:
import pandas as pd
import pprint
import random

pprint.pprint (random.sample(movies, 5))
movies_df = pd.DataFrame(movies)
movies_df


[{'plot': 'Hercules and Deianeira go in search of fire to save the world from '
          "cold. All the world's fire are fast going out. Hercules' father, "
          'Zeus, is on hand to help (and sometime hinder) the two.',
  'title': 'Hercules: The Legendary Journeys - Hercules and the Circle of '
           'Fire'},
 {'plot': 'Taking place towards the end of WWII, 500 American Soldiers have '
          'been entrapped in a camp for 3 years. Beginning to give up hope '
          'they will ever be rescued, a group of Rangers goes on a dangerous '
          'mission to try and save them.',
  'title': 'The Great Raid'},
 {'plot': 'In the old west, a man becomes a sheriff just for the pay, figuring '
          'he can decamp if things get tough. In the end, he uses ingenuity '
          'instead.',
  'title': 'Support Your Local Sheriff!'},
 {'plot': 'A teacher is assigned to be the principal of a violence and crime '
          'ridden high school.',
  'title': 'The Principal'},
 {'pl

|      | plot                                               | title                 |
|------|----------------------------------------------------|-----------------------|
| 0    | Young Pauline is left a lot of money when her ...  | The Perils of Pauline |
| 1    | A penniless young man tries to save an heiress...  | From Hand to Mouth    |
| 2    | Michael "Beau" Geste leaves England in disgrac...  | Beau Geste            |
| 3    | Seeking revenge, an athletic young man joins t...  | The Black Pirate      |
| 4    | An irresponsible young millionaire changes his...  | For Heaven's Sake     |
| ...  | ...                                                | ...                   |
| 1468 | In the ironically named city of Paradise, a re...  | Postal                |
| 1469 | A group of suburban biker wannabes looking for...  | Wild Hogs             |
| 1470 | Shakespeare's masterpiece "Othello" set in mod...  | Omkara                |
| 1471 | When a small Colorado town is overrun by the f...  | Day of the Dead       |


## Setup Embedded Database

Milvus can run as an embedded database backed by a local file, which makes it easy to use.

After we execute this code, you will see the files `milvus_demo2.db` and `.milvus_demo2.db.lock` in the folder.

In [6]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo2.db")

## Create a Collection



In [7]:
# if we already have a collection, clear it first
if client.has_collection(collection_name="movies"):
    client.drop_collection(collection_name="movies")

client.create_collection(
    collection_name="movies",
    dimension=MY_CONFIG.EMBEDDING_LENGTH
)
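
With this quick-setup call, the collection is assumed to use the `MilvusClient` defaults: an integer primary key named `id` and a vector field named `vector`, which is the row layout the rest of this notebook builds. As a sketch, a plain-Python check such as the hypothetical `check_rows` below can catch a dimension mismatch before inserting:

```python
def check_rows(rows, dimension):
    """Raise ValueError if a row lacks an int 'id' or a vector of the expected length."""
    for row in rows:
        if not isinstance(row.get("id"), int):
            raise ValueError(f"row has no integer id: {row.get('title')!r}")
        vector = row.get("vector")
        if vector is None or len(vector) != dimension:
            raise ValueError(f"row {row['id']} has a vector of the wrong length")
    return len(rows)

# Toy rows with 4-dimensional vectors, for illustration only
rows = [
    {"id": 0, "title": "Movie A", "vector": [0.1, 0.2, 0.3, 0.4]},
    {"id": 1, "title": "Movie B", "vector": [0.5, 0.6, 0.7, 0.8]},
]
print(check_rows(rows, dimension=4))
```

In the notebook, the equivalent call would pass the movie rows and `MY_CONFIG.EMBEDDING_LENGTH`.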


## Calculate Embeddings for Plots

In [8]:
import torch

# Set the default device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print ('Using device : ', device)

Using device :  cuda


In [9]:
from pymilvus import model
import random

# If the connection to https://huggingface.co/ fails, uncomment the following lines
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# embedding_fn = model.DefaultEmbeddingFunction()

## initialize the SentenceTransformerEmbeddingFunction
embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(
    model_name=MY_CONFIG.MODEL_NAME,
    device=device
)

# calculate embeddings for plots
for i, m in enumerate (movies):
    m['id'] = i
    # pass a one-element list and take the first (only) embedding from the result
    m['vector'] = embedding_fn([m['plot']])[0]
    # m['vector'] = embedding_fn.encode_documents([m['plot']])[0]

# print a sample
for m in random.sample (movies, 3):
    print ('id:', m['id'] )
    print ('title: ', m['title'])
    print ('plot : ', m['plot'])
    print ('vector dim :',  len(m["vector"]))
    print ('vector[:10] :', m["vector"][:10])
    print()

id: 645
title:  Search and Destroy
plot :  A satire about desperate hustling, pop philosophy and big money.
vector dim : 384
vector[:10] : [-0.018120939, 0.039345175, -0.030752013, 0.013920638, 0.07516184, 0.035670273, 0.008708524, 0.01150346, -0.0020232312, -0.023062076]

id: 434
title:  The Adventures of Ford Fairlane
plot :  A vulgar private detective is hired to find a missing groupie and is drawn into a mystery involving a series of murders tied to the music industry.
vector dim : 384
vector[:10] : [0.018674074, -0.010004115, -0.017652493, -0.070779115, 0.061660517, -0.0072658765, 0.06216077, -0.06653017, 0.030017212, -0.011053714]

id: 88
title:  On Her Majesty's Secret Service
plot :  James Bond woos a mob boss's daughter and goes undercover to uncover the true reason for Blofeld's allergy research in the Swiss Alps that involves beautiful women from around the world.
vector dim : 384
vector[:10] : [0.015607916, 0.077515565, -0.030561024, 0.003635491, 0.049616575, -0.044629183, 
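
The loop above embeds one plot per call. Embedding functions of this kind generally accept a list of texts, so a batched loop is a common alternative. The sketch below shows only the batching pattern: `fake_embed` is a stand-in that returns dummy vectors so the example runs without a model, and `batched` is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    # Stand-in for a real embedding function: one dummy 3-dimensional vector per text
    return [[float(len(t)), 0.0, 1.0] for t in texts]

docs = [{"plot": f"plot number {i}"} for i in range(10)]

next_id = 0
for batch in batched(docs, batch_size=4):
    # one call embeds the whole batch
    vectors = fake_embed([d["plot"] for d in batch])
    for doc, vec in zip(batch, vectors):
        doc["id"] = next_id
        doc["vector"] = vec
        next_id += 1

print(len(docs), docs[-1]["id"], len(docs[0]["vector"]))
```

In the notebook, `embedding_fn` would take the place of `fake_embed`; whether batching gives a noticeable speedup depends on the model and device.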

## Insert data

In [10]:
res = client.insert(collection_name="movies", data=movies)

print('inserted # rows', res['insert_count'])
print('cost', res['cost'])

inserted # rows 1473
cost 0


## Perform Vector Search (the FUN part!)

Let's do a semantic search on plot lines
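
Conceptually, a vector search embeds the query and returns the stored vectors closest to it. The sketch below is a brute-force illustration in plain Python using cosine similarity and toy 3-dimensional vectors; it is not how Milvus is implemented, and the names and numbers are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, rows, k):
    """Score every row against the query and return the k best (score, title) pairs."""
    scored = [(cosine_similarity(query_vec, r["vector"]), r["title"]) for r in rows]
    return sorted(scored, reverse=True)[:k]

# Toy vectors, for illustration only
rows = [
    {"title": "Alien war movie", "vector": [0.9, 0.1, 0.0]},
    {"title": "Friendship drama", "vector": [0.1, 0.9, 0.2]},
    {"title": "Space comedy", "vector": [0.6, 0.3, 0.4]},
]
query_vec = [1.0, 0.0, 0.0]

for score, title in top_k(query_vec, rows, k=2):
    print(f"{score:.3f}  {title}")
```

A vector database does the same job with an index, so it does not need to compare the query against every stored vector.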

In [11]:
from pprint import pprint

## helper function to perform vector search
def  do_vector_search (query):
    # query_vectors = embedding_fn.encode_queries([query])
    query_vectors = embedding_fn([query])

    results = client.search(
        collection_name="movies",  # target collection
        data=query_vectors,  # query vectors
        limit=5,  # number of returned entities
        output_fields=["title", "plot"],  # specifies fields to be returned
    )
    return results
## ----


def  print_search_results (results):
    # pprint (results)
    print ('num results : ', len(results[0]))

    for i, r in enumerate (results[0]):
        #pprint(r, indent=4)
        print (i+1)
        print ('search score:', r['distance'])
        print ('title:', r['entity']['title'])
        print ('plot:', r['entity']['plot'])
        print()

In [12]:
query = "Where humans fight aliens"

results = do_vector_search (query)
print_search_results (results)

num results :  5
1
search score: 0.8236287832260132
title: Independence Day
plot: The aliens are coming and their goal is to invade and destroy Earth. Fighting superior technology, mankind's best weapon is the will to survive.

2
search score: 0.7872653007507324
title: Starship Troopers
plot: Humans in a fascistic, militaristic future do battle with giant alien bugs in a fight for survival.

3
search score: 0.7458124160766602
title: V: The Final Battle
plot: A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.

4
search score: 0.735677182674408
title: Enemy Mine
plot: A soldier from Earth crash-lands on an alien world after sustaining battle damage. Eventually he encounters another survivor, but from the enemy species he was fighting; they band together ...

5
search score: 0.7283808588981628
title: Battlefield Earth
plot: After enslavement & near extermination by an alien race in the year 3000, humanity begin

In [13]:
query = "Relationship drama between friends"

results = do_vector_search (query)
print_search_results (results)

num results :  5
1
search score: 0.7513976693153381
title: Varalaaru
plot: Relationships become entangled in an emotional web.

2
search score: 0.6959848403930664
title: Once a Thief
plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.

3
search score: 0.6907370090484619
title: Dark Blue World
plot: The friendship of two men becomes tested when they both fall for the same woman.

4
search score: 0.6907370090484619
title: Dark Blue World
plot: The friendship of two men becomes tested when they both fall for the same woman.

5
search score: 0.6906610131263733
title: Harsh Times
plot: A tough-minded drama about two friends in South Central Los Angeles and the violence that comes between them.
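
The second result list contains the same movie twice (results 3 and 4, with identical scores), which suggests the dataset holds a duplicate row. One way to tidy the output is to drop repeated titles after the search. The sketch below assumes hits shaped like the ones printed above (a `distance` plus an `entity` dict); `dedupe_hits` is a hypothetical helper, and the sample hits are abbreviated copies of that output.

```python
def dedupe_hits(hits):
    """Keep the first hit for each title, preserving the original ranking order."""
    seen = set()
    unique = []
    for hit in hits:
        title = hit["entity"]["title"]
        if title not in seen:
            seen.add(title)
            unique.append(hit)
    return unique

# Abbreviated hits copied from the search output above
hits = [
    {"distance": 0.7513976693153381, "entity": {"title": "Varalaaru"}},
    {"distance": 0.6959848403930664, "entity": {"title": "Once a Thief"}},
    {"distance": 0.6907370090484619, "entity": {"title": "Dark Blue World"}},
    {"distance": 0.6907370090484619, "entity": {"title": "Dark Blue World"}},
    {"distance": 0.6906610131263733, "entity": {"title": "Harsh Times"}},
]

for hit in dedupe_hits(hits):
    print(f"{hit['distance']:.4f}  {hit['entity']['title']}")
```

Matching on title alone would also merge distinct movies that share a name, such as remakes, so comparing title and plot together may be safer. Removing duplicates before inserting the data is another option.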

