# Setup

Let's start by importing pandas and loading our demo data from the Hugging Face Hub.

In [6]:
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "drossi/EDA_on_IMDB_Movies_Dataset"
FILENAME = "imdb_top_1000.csv"

df = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)

df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


# Data Preparation

We don't need the poster images, so let's drop the `Poster_Link` column.

In [7]:
df.drop('Poster_Link', axis='columns', inplace=True)

In [8]:
df.head()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


Let's check whether there are any null values and drop the rows that contain them. Nulls might cause problems in the embedding process, and we don't need every row for our purpose here.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   object 
 2   Certificate    899 non-null    object 
 3   Runtime        1000 non-null   object 
 4   Genre          1000 non-null   object 
 5   IMDB_Rating    1000 non-null   float64
 6   Overview       1000 non-null   object 
 7   Meta_score     843 non-null    float64
 8   Director       1000 non-null   object 
 9   Star1          1000 non-null   object 
 10  Star2          1000 non-null   object 
 11  Star3          1000 non-null   object 
 12  Star4          1000 non-null   object 
 13  No_of_Votes    1000 non-null   int64  
 14  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(12)
memory usage: 117.3+ KB


In [10]:
df.dropna(how='any', axis=0, inplace=True)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 997
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   714 non-null    object 
 1   Released_Year  714 non-null    object 
 2   Certificate    714 non-null    object 
 3   Runtime        714 non-null    object 
 4   Genre          714 non-null    object 
 5   IMDB_Rating    714 non-null    float64
 6   Overview       714 non-null    object 
 7   Meta_score     714 non-null    float64
 8   Director       714 non-null    object 
 9   Star1          714 non-null    object 
 10  Star2          714 non-null    object 
 11  Star3          714 non-null    object 
 12  Star4          714 non-null    object 
 13  No_of_Votes    714 non-null    int64  
 14  Gross          714 non-null    object 
dtypes: float64(2), int64(1), object(12)
memory usage: 89.2+ KB
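
To see which columns account for the dropped rows, a per-column null count can help. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the dataset). It uses `isna().sum()` and the same `dropna(how='any')` call as above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame(
    {
        "Series_Title": ["A", "B", "C", "D"],
        "Certificate": ["U", None, "A", "UA"],
        "Meta_score": [80.0, 90.0, None, 75.0],
    }
)

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# Drop every row that has at least one missing value.
cleaned = toy.dropna(how="any", axis=0)

# The surviving rows keep their original index labels, so the index
# is no longer contiguous after the drop.
print(cleaned.index.tolist())
```

This also explains why the second `df.info()` output reports an index running from 0 to 997 for only 714 entries.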


Next, let's prepare the data for embedding by converting each row into a dictionary.

In [12]:
data = df.to_dict('records')
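
`to_dict('records')` returns a list with one dictionary per row, keyed by column name. A small sketch on hypothetical data (the two rows below are illustrative stand-ins, not rows from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the real column names.
toy = pd.DataFrame(
    {
        "Series_Title": ["Movie A", "Movie B"],
        "Overview": ["First plot summary.", "Second plot summary."],
    }
)

# One dict per row; each dict maps column name to that row's value.
records = toy.to_dict("records")
print(records[0])
```

Each of these dictionaries later serves as the payload of one point, and its `Overview` text is what gets embedded.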

# Embeddings

In [13]:
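%pip install qdrant-client
# sentence-transformers was already available in this environment; if the import
# below fails in yours, install it with: %pip install sentence-transformers
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer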

In [14]:
encoder = SentenceTransformer('all-MiniLM-L6-v2')

In [15]:
qdrant = QdrantClient(":memory:")

In [16]:
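# recreate_collection is deprecated in recent qdrant-client versions;
# delete any existing collection first, then create it.
if qdrant.collection_exists("movies"):
    qdrant.delete_collection("movies")

qdrant.create_collection(
    collection_name="movies",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE
    )
)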

True

In [17]:
qdrant.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["Overview"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(data)
    ]
)

# Search

In [18]:
hits = qdrant.search(
    collection_name="movies",
    query_vector=encoder.encode("I am looking for a highly rated action movie").tolist(),
    limit=3
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'Series_Title': 'Sin City', 'Released_Year': '2005', 'Certificate': 'A', 'Runtime': '124 min', 'Genre': 'Crime, Thriller', 'IMDB_Rating': 8.0, 'Overview': 'A movie that explores the dark and miserable town, Basin City, tells the story of three different people, all caught up in violent corruption.', 'Meta_score': 74.0, 'Director': 'Frank Miller', 'Star1': 'Quentin Tarantino', 'Star2': 'Robert Rodriguez', 'Star3': 'Mickey Rourke', 'Star4': 'Clive Owen', 'No_of_Votes': 738512, 'Gross': '74,103,820'} score: 0.3836764229774456
{'Series_Title': 'Wonder', 'Released_Year': '2017', 'Certificate': 'U', 'Runtime': '113 min', 'Genre': 'Drama, Family', 'IMDB_Rating': 8.0, 'Overview': 'Based on the New York Times bestseller, this movie tells the incredibly inspiring and heartwarming story of August Pullman, a boy with facial differences who enters the fifth grade, attending a mainstream elementary school for the first time.', 'Meta_score': 66.0, 'Director': 'Stephen Chbosky', 'Star1': 'Jacob Tremb
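
Conceptually, a cosine-similarity search scores every stored vector against the query vector and returns the `limit` best matches. The sketch below reproduces that idea by brute force in plain Python. The three-dimensional vectors and titles are hypothetical, the helper names `cosine_similarity` and `top_k` are made up for this example, and Qdrant's actual internals (indexing, storage) are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, points, k):
    """Return the k (title, score) pairs with the highest similarity."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in points]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
points = [
    ("Movie A", [1.0, 0.0, 0.0]),
    ("Movie B", [0.0, 1.0, 0.0]),
    ("Movie C", [1.0, 1.0, 0.0]),
]
query = [1.0, 0.1, 0.0]

results = top_k(query, points, k=2)
for title, score in results:
    print(title, "score:", round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction. The scores of roughly 0.38 in the output above therefore indicate only a loose match between the query and the stored overviews.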

# Let's get to the fun part with OpenAI

In [19]:
search_results = [hit.payload for hit in hits]

In [20]:
search_results

[{'Series_Title': 'Sin City',
  'Released_Year': '2005',
  'Certificate': 'A',
  'Runtime': '124 min',
  'Genre': 'Crime, Thriller',
  'IMDB_Rating': 8.0,
  'Overview': 'A movie that explores the dark and miserable town, Basin City, tells the story of three different people, all caught up in violent corruption.',
  'Meta_score': 74.0,
  'Director': 'Frank Miller',
  'Star1': 'Quentin Tarantino',
  'Star2': 'Robert Rodriguez',
  'Star3': 'Mickey Rourke',
  'Star4': 'Clive Owen',
  'No_of_Votes': 738512,
  'Gross': '74,103,820'},
 {'Series_Title': 'Wonder',
  'Released_Year': '2017',
  'Certificate': 'U',
  'Runtime': '113 min',
  'Genre': 'Drama, Family',
  'IMDB_Rating': 8.0,
  'Overview': 'Based on the New York Times bestseller, this movie tells the incredibly inspiring and heartwarming story of August Pullman, a boy with facial differences who enters the fifth grade, attending a mainstream elementary school for the first time.',
  'Meta_score': 66.0,
  'Director': 'Stephen Chbosky',


In [21]:
# we have to use specific versions here to prevent an error from happening
!pip install openai==1.55.3 httpx==0.27.2



In [None]:
from openai import OpenAI

api_key = ""  # insert your OpenAI API key here

client = OpenAI(api_key=api_key)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a chatbot and a movie specialist. Your top priority is to help guide users into selecting movies they like and guide them with their requests."},
        {"role": "user", "content": "Suggest me an amazing action movie with a female superstar"},
        {"role": "assistant", "content": str(search_results)}
    ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content='If you\'re looking for an action-packed movie with a female superstar, I recommend "Mad Max: Fury Road" (2015). Starring Charlize Theron as the formidable Furiosa, this film is a high-octane ride through a post-apocalyptic wasteland. It\'s directed by George Miller and offers breathtaking visuals and intense action sequences. Theron\'s performance is widely praised, and the film is noted for its strong emphasis on female empowerment.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)
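
The recommendation above is not among the search results visible earlier, which suggests the model may not have treated them as context. One possible variation, sketched here as an assumption rather than as this notebook's method, is to format the hits into a readable context block and place it in the user message. The helper name `build_messages`, the prompt wording, and the sample record are all hypothetical.

```python
def build_messages(question, records):
    """Build chat messages that present retrieved records as explicit context."""
    context_lines = [
        f"- {r['Series_Title']} ({r['Released_Year']}), {r['Genre']}: {r['Overview']}"
        for r in records
    ]
    context = "\n".join(context_lines)
    return [
        {
            "role": "system",
            "content": "You are a movie specialist. Recommend only from the provided context.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ]


# Hypothetical record with the same keys as the payloads above.
sample = [
    {
        "Series_Title": "Movie A",
        "Released_Year": "2001",
        "Genre": "Action",
        "Overview": "A placeholder plot summary.",
    }
]

messages = build_messages("Suggest an action movie", sample)
print(messages[1]["content"])
```

The resulting list could be passed as `messages=` to `client.chat.completions.create` in place of the list used above, with `search_results` as the `records` argument.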
