# Movie Recommendation System Using Embeddings

In this project, we will design and build a Movie Recommendation System using OpenAI embeddings. We'll use the natural language understanding of OpenAI's embedding models to compare movie titles and descriptions and to recommend movies that match what a user is looking for.

## Project Objectives

In this project, we will build a recommendation system that:

1. **Processes Movie Data**: Converts movie titles and descriptions into embeddings that capture the essence of the content.
2. **Calculates Similarities**: Uses the embeddings to find similarities between movies based on user queries or past user interactions.
3. **Generates Recommendations**: Offers a list of movie recommendations tailored to the user's tastes and viewing history.
4. **Uses Pinecone**: Stores and searches the embeddings with the Pinecone vector database.


# 2. Importing libraries

Import the openai library to call the OpenAI APIs and python-dotenv to read the OpenAI API key from a .env file.

In [1]:
!pip install openai
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [2]:
import os
import openai
import numpy as np
import pandas as pd

from openai import OpenAI
from dotenv import load_dotenv

In [3]:
pwd()

'/content'

# 3. Sending a first request to the OpenAI API


### 3.1 Setting up API Key

In [4]:
# os.environ["OPENAI_API_KEY"] = "sk-XXXXXXXXXXXXX"
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")

client = OpenAI()

API key looks good so far


### 3.2 Vectors and their similarity


### Embeddings:

An embedding is a way of turning words, sentences, or things like movies into a list of numbers (we call this list a "vector") that represents different features, just like the list you made for the fruits. For example, for movies, the numbers might represent how action-packed they are, whether they are romantic, if they are funny, and so on. These numbers aren't random; they are calculated so that movies with similar numbers have similar features.

### Vector Similarity:

Vector similarity is calculated using something called "cosine similarity", which measures how closely two vectors (lists of numbers) point in the same direction. A value close to 1 means the vectors are closely aligned, so the movies are similar. A lower value means the movies are quite different.
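To make the idea concrete, here is a minimal sketch of cosine similarity with NumPy. The three-number feature vectors (action, romance, comedy) and the variable names are made up for illustration; real embeddings from the API are much longer, and their individual numbers are not directly interpretable like this.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical hand-made features: [action, romance, comedy]
action_movie_a = [0.9, 0.1, 0.2]
action_movie_b = [0.8, 0.2, 0.1]
romantic_comedy = [0.1, 0.9, 0.7]

sim_action = cosine_similarity(action_movie_a, action_movie_b)
sim_mixed = cosine_similarity(action_movie_a, romantic_comedy)
print(f"action vs action:  {sim_action:.3f}")
print(f"action vs rom-com: {sim_mixed:.3f}")
```

The two action movies should score noticeably higher than the action movie paired with the romantic comedy. Note that SciPy's `cosine` function, used in the cells that follow, returns the cosine distance, which is 1 minus this similarity.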


In [5]:
experiment_sentence = "The Terminator is a movie about AI going after human"

In [6]:
res = client.embeddings.create(
    model="text-embedding-ada-002",
    input=experiment_sentence
)

In [7]:
len(res.data[0].embedding)

1536

In [8]:
res.data[0].embedding[:10]

[-0.015121644362807274,
 -0.05992080271244049,
 -0.02566564828157425,
 -0.021782368421554565,
 0.018477723002433777,
 0.010974764823913574,
 -0.03266069293022156,
 -0.0015197187894955277,
 -0.019982172176241875,
 -0.007207212503999472]

## Similarity

In [9]:
toy_dataset = [
    "The Terminator is a movie that has AI-based robots inside of them",
    "Harry Potter is all amobut wizards and magic",
    "In the movie Matrix, AI already has become the most powerfull 'being'"
]

In [10]:
toy_embeddings = client.embeddings.create(
    model="text-embedding-ada-002",
    input=toy_dataset
)

In [11]:
# toy_embeddings.data[0].embedding[:10]
clean_embeds = [i.embedding for i in toy_embeddings.data]

In [12]:
clean_embeds[0][:10]

[-0.013006119057536125,
 -0.0584043487906456,
 -0.027775779366493225,
 -0.012967217713594437,
 0.01618308760225773,
 0.010522897355258465,
 -0.031302861869335175,
 -0.008253633975982666,
 -0.015534726902842522,
 -0.01243556197732687]

In [13]:
user_input = input("Enter movie description: ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_input
)

Enter movie description: A movie about toys that come to life


In [14]:
user_vector = user_vector.data[0].embedding

In [15]:
from scipy.spatial.distance import cosine, cdist

In [16]:
cosine(user_vector, clean_embeds[0])

0.16800135509912806

In [17]:
np.array(clean_embeds).shape

(3, 1536)

In [18]:
np.array(user_vector).shape

(1536,)

In [19]:
# Calculating distances
cdist(np.array(user_vector).reshape(1,-1), np.array(clean_embeds), metric='cosine')

array([[0.16800136, 0.18896994, 0.1991382 ]])

In [20]:
# Calculating similarities
similarities = 1 - cdist(np.array(user_vector).reshape(1,-1), np.array(clean_embeds), metric='cosine')

## Recommending most similar vector

In [21]:
np.argsort(-similarities)

array([[0, 1, 2]])

In [22]:
p_movies = [toy_dataset[idx] for idx in np.argsort(-similarities)[0]]  # [0] selects the single row of the 2D argsort result

In [23]:
p_movies

['The Terminator is a movie that has AI-based robots inside of them',
 'Harry Potter is all amobut wizards and magic',
 "In the movie Matrix, AI already has become the most powerfull 'being'"]
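The distance, similarity, and sorting steps above could be wrapped in one helper. The following is only a sketch: the function name `rank_by_similarity`, its `top_k` parameter, and the tiny two-dimensional demo vectors are assumptions for illustration, and it uses plain NumPy rather than `cdist`.

```python
import numpy as np

def rank_by_similarity(query_vector, candidate_vectors, top_k=3):
    """Return (index, similarity) pairs for the top_k most similar candidates."""
    query = np.asarray(query_vector, dtype=float)
    candidates = np.asarray(candidate_vectors, dtype=float)
    # Normalise to unit length so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = candidates @ query
    order = np.argsort(-sims)[:top_k]  # indices from most to least similar
    return [(int(i), float(sims[i])) for i in order]

# Tiny made-up vectors standing in for real embeddings
demo_candidates = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
demo_query = [0.9, 0.1]
ranking = rank_by_similarity(demo_query, demo_candidates, top_k=2)
print(ranking)
```

With the notebook's variables, a call such as `rank_by_similarity(user_vector, clean_embeds)` would return indices that can be looked up in `toy_dataset`.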

# 4. Scaling to the big dataset

The dataset can be downloaded from here: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

In [24]:
data = pd.read_csv("movies_metadata.csv")


In [25]:
data.head()

|   | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released |  | Toy Story | False | 7.7 | 5415.0 |
| 1 | False |  | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... |  | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |  | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
| 3 | False |  | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... |  | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] |  | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |


In [26]:
subset = data[['original_title', 'overview']]

In [27]:
subset.head()

|   | original_title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |


In [28]:
# Use only the first 100 movies for the recommendation system to save money on API calls :)
small_dataset = subset.iloc[:100]
small_dataset.shape

(100, 2)

In [29]:
# Drop rows with missing values. Reassigning the result (instead of using
# inplace=True on a slice) avoids pandas' SettingWithCopyWarning.
small_dataset = small_dataset.dropna()
small_dataset.shape

(99, 2)

In [30]:
small_dataset['overview'].values.tolist()

["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
 "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",
 "A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she 

In [31]:
movie_embeddings = client.embeddings.create(
    model="text-embedding-ada-002",
    input=small_dataset['overview'].values.tolist()
)

In [32]:
clean_movie_embeddings = [i.embedding for i in movie_embeddings.data]
clean_movie_embeddings[0][:10]

[-0.013767948374152184,
 -0.045178502798080444,
 -0.005214642733335495,
 -0.02619268372654915,
 -0.019567018374800682,
 -0.00416202750056982,
 0.013897104188799858,
 0.0019906051456928253,
 -0.010706969536840916,
 -0.007323102094233036]
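Sending all overviews in one request works for 99 texts, but the full dataset would likely need to be split into several requests, since a single request can only carry a limited amount of input. Below is a minimal sketch of such batching. The helper names `embed_in_batches` and `make_openai_embedder` and the default `batch_size` are assumptions, not part of the OpenAI library; the demo uses an offline stand-in so it runs without an API key.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts in fixed-size batches and return the vectors in input order.

    embed_batch is any callable mapping a list of strings to a list of vectors.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))
    return vectors

def make_openai_embedder(client, model="text-embedding-ada-002"):
    """Wrap the notebook's client.embeddings.create call as an embed_batch callable."""
    def embed_batch(batch):
        response = client.embeddings.create(model=model, input=batch)
        return [item.embedding for item in response.data]
    return embed_batch

def fake_embedder(batch):
    """Offline stand-in: 'embeds' each text as a one-element vector holding its length."""
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embedder, batch_size=2)
print(demo_vectors)
```

In the notebook, this might be used as `embed_in_batches(small_dataset['overview'].values.tolist(), make_openai_embedder(client))`.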

In [33]:
user_input = input("Enter movie description: ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_input
)

Enter movie description: action movie


In [34]:
user_vector = user_vector.data[0].embedding

In [35]:
from scipy.spatial.distance import cosine, cdist

In [36]:
cosine(user_vector, clean_movie_embeddings[0])

0.22936274223415798

In [37]:
np.array(clean_movie_embeddings).shape

(99, 1536)

In [38]:
np.array(user_vector).shape

(1536,)

In [39]:
# Calculating distances
cdist(np.array(user_vector).reshape(1,-1), np.array(clean_movie_embeddings), metric='cosine')

array([[0.22936274, 0.22480472, 0.21777495, 0.22739508, 0.25170424,
        0.21763775, 0.2236573 , 0.24620933, 0.1773976 , 0.22042161,
        0.25700483, 0.23728083, 0.21439638, 0.20887014, 0.24270086,
        0.23261925, 0.2574889 , 0.2407431 , 0.23864176, 0.2021232 ,
        0.23397959, 0.20925138, 0.21628225, 0.22467316, 0.21643221,
        0.23457715, 0.23605123, 0.23884589, 0.21878834, 0.21216349,
        0.2329473 , 0.22914391, 0.25982709, 0.27824418, 0.21746117,
        0.23263096, 0.22733017, 0.25667338, 0.20848717, 0.2122478 ,
        0.23375272, 0.26978787, 0.19526978, 0.19638146, 0.26965846,
        0.2111109 , 0.23419657, 0.23415499, 0.2280566 , 0.22325341,
        0.23369876, 0.23277425, 0.23502268, 0.25316228, 0.21852856,
        0.2353415 , 0.23235441, 0.26271758, 0.21832603, 0.22774811,
        0.23851211, 0.2227576 , 0.22708657, 0.22895865, 0.23438457,
        0.2025026 , 0.23423791, 0.22132647, 0.21673677, 0.23312581,
        0.23402775, 0.23099417, 0.22851986, 0.21

In [40]:
# Calculating similarities
similarities = 1 - cdist(np.array(user_vector).reshape(1,-1), np.array(clean_movie_embeddings), metric='cosine')

In [41]:
np.argsort(-similarities)

array([[ 8, 79, 42, 43, 81, 19, 65, 95, 38, 91, 13, 21, 45, 29, 39, 73,
        12, 22, 24, 68, 85, 98, 34,  5,  2, 58, 54, 28, 90,  9, 67, 87,
        92, 84, 61, 49,  6, 23,  1, 93, 62, 77, 36,  3, 59, 48, 74, 97,
        94, 72, 63, 31,  0, 71, 56, 15, 35, 51, 30, 69, 50, 40, 20, 70,
        47, 46, 66, 64, 25, 52, 86, 55, 26, 82, 11, 60, 18, 27, 17, 88,
        89, 14, 80, 76, 83,  7,  4, 53, 37, 10, 16, 32, 57, 75, 78, 44,
        41, 96, 33]])

In [42]:
p_movies = [small_dataset.iloc[idx]['original_title'] for idx in np.argsort(-similarities)[0]]

In [43]:
p_movies[:5]

['Sudden Death',
 "Things to Do in Denver When You're Dead",
 'Mortal Kombat',
 'To Die For',
 'Once Upon a Time... When We Were Colored']

# 5. Building a movie recommender with Pinecone


Pinecone website: https://www.pinecone.io/

Install the pinecone client.

In [44]:
pip install pinecone-client

Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone-client
Successfully installed pinecone-client-5.0.

Before inserting the embeddings, an empty index called movies must be created on the Pinecone website. Its dimension must match the length of the embeddings (1536 for text-embedding-ada-002).
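As an alternative to the website, the index can probably also be created from code. The following is a minimal sketch, assuming a pinecone-client version (v3 or later) in which `create_index` takes a `spec` argument. The helper name `ensure_index` is hypothetical, and the cosine metric is an assumption chosen to match the cosine similarity used earlier.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index if it does not exist yet, then return a handle to it.

    pc is a Pinecone client object; spec describes where the index is hosted.
    """
    if name not in pc.list_indexes().names():
        pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return pc.Index(name)
```

With the real client, this might be called as `ensure_index(pc, "movies", 1536, ServerlessSpec(cloud="aws", region="us-east-1"))` after `from pinecone import ServerlessSpec`. The cloud and region values are placeholders, not the author's setup.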

In [46]:
# import pinecone
from pinecone import Pinecone

load_dotenv(override=True)
pc_api_key = os.getenv('PINECONE_API_KEY')

pc = Pinecone(api_key=pc_api_key)
index = pc.Index("movies")

In [47]:
for i in range(len(small_dataset)):
    upload_stats = index.upsert(
        vectors=[
            (
                str(i),  # Vector ID as string
                clean_movie_embeddings[i],  # Embedding (list of floats) for the current movie only
                {'title': small_dataset.iloc[i]['original_title']}  # Metadata
            )
        ]
    )
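The loop above sends one request per movie, which is fine for 99 vectors but slow for larger collections. A possible refinement is to upsert several vectors per request. The sketch below assumes only that `index.upsert(vectors=...)` accepts a list of (id, values, metadata) tuples, as it does in the loop above. The helper names, the `batch_size` default, and the `FakeIndex` stand-in are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def upsert_in_batches(index, ids, embeddings, titles, batch_size=50):
    """Upsert (id, values, metadata) tuples in batches; return the number of requests sent."""
    records = [
        (str(vec_id), list(values), {"title": title})
        for vec_id, values, title in zip(ids, embeddings, titles)
    ]
    requests_sent = 0
    for batch in chunked(records, batch_size):
        index.upsert(vectors=batch)
        requests_sent += 1
    return requests_sent

class FakeIndex:
    """Offline stand-in that records upsert calls instead of contacting Pinecone."""
    def __init__(self):
        self.calls = []
    def upsert(self, vectors):
        self.calls.append(vectors)

fake_index = FakeIndex()
sent = upsert_in_batches(fake_index, range(5), [[0.1]] * 5, list("abcde"), batch_size=2)
print(sent, [len(call) for call in fake_index.calls])
```

In the notebook, this might be called as `upsert_in_batches(index, range(len(small_dataset)), clean_movie_embeddings, small_dataset['original_title'].tolist())`.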

## Searching for the most similar movies

In [48]:
user_input = input("Enter movie description: ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_input
)

user_vector = user_vector.data[0].embedding

Enter movie description: movie about toys coming to life


Retrieve the top 10 matches using vector similarity search.

In [49]:
matches = index.query(
    vector=user_vector,
    top_k=10,
    include_metadata=True
)
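The query response is a nested structure, so a small helper can turn it into a readable list of titles and scores. This is a sketch that assumes the response supports dictionary-style access with the keys shown in the printed output of the next cell. The name `summarize_matches` is hypothetical, and the sample response is a hand-written stand-in that reuses two entries from that output.

```python
def summarize_matches(response):
    """Return (title, score) pairs from a Pinecone-style query response."""
    return [
        (match["metadata"]["title"], round(match["score"], 3))
        for match in response["matches"]
    ]

# Hand-written stand-in with the same shape as the notebook's query response
sample_response = {
    "matches": [
        {"id": "0", "metadata": {"title": "Toy Story"},
         "score": 0.853124917, "values": []},
        {"id": "58", "metadata": {"title": "The Indian in the Cupboard"},
         "score": 0.843501508, "values": []},
    ]
}

for title, score in summarize_matches(sample_response):
    print(f"{score:.3f}  {title}")
```

In the notebook, `summarize_matches(matches)` would be the corresponding call.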

In [50]:
matches

{'matches': [{'id': '0',
              'metadata': {'title': 'Toy Story'},
              'score': 0.853124917,
              'values': []},
             {'id': '58',
              'metadata': {'title': 'The Indian in the Cupboard'},
              'score': 0.843501508,
              'values': []},
             {'id': '1',
              'metadata': {'title': 'Jumanji'},
              'score': 0.838532925,
              'values': []},
             {'id': '28',
              'metadata': {'title': 'La Cité des Enfants Perdus'},
              'score': 0.807533,
              'values': []},
             {'id': '73',
              'metadata': {'title': 'Big Bully'},
              'score': 0.794205189,
              'values': []},
             {'id': '26',
              'metadata': {'title': 'Now and Then'},
              'score': 0.792412639,
              'values': []},
             {'id': '42',
              'metadata': {'title': 'Mortal Kombat'},
              'score': 0.791257203,
        