# **Movie recommendation system**

**Dataset**

The movie recommendation system uses the TMDB 5000 Movie Dataset.

> *  **Content Based Filtering**: suggests similar items based on a particular item. This system uses item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations.

In [1]:
import pandas as pd
import numpy as np

df1=pd.read_csv('tmdb_5000_credits.csv')
df2=pd.read_csv('tmdb_5000_movies.csv')

In [18]:
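# Rename the credits columns so the two frames can be merged on 'id'.
# The name 'tittle' (sic) keeps this column from clashing with df2's own 'title' column.
df1.columns = ['id','tittle','cast','crew']
df2= df2.merge(df1,on='id')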

# **Content Based Filtering**
In this recommendation system, a movie's content (overview, cast, crew, keywords, tagline, etc.) is analyzed to find similarities with other movies. The system then suggests the movies most likely to be similar. We start with a recommender based on plot descriptions.

In [19]:
df2['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [20]:
# For text processing, TF-IDF vectors for overview
# Text vectorization and cleaning

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorizer object that removes English stop words; NaN overviews are replaced with ''
tfidf = TfidfVectorizer(stop_words='english')
df2['overview'] = df2['overview'].fillna('')

tfidf_matrix = tfidf.fit_transform(df2['overview'])

#final shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)

**Next step**

1. Compute a similarity score: use cosine similarity to quantify the similarity between two given movies.
2. Experiment with Euclidean distance or Pearson correlation as alternatives.
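
As a minimal sketch of why a plain dot product can stand in for cosine similarity here: `TfidfVectorizer` L2-normalises each row by default, so the dot product of two rows is already their cosine. The three overviews below are invented for illustration, and the variable names are not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, linear_kernel

# Toy overviews, made up for this example (not taken from the TMDB data).
docs = [
    "a marine explores an alien moon",
    "a pirate captain sails to the edge of the world",
    "a marine fights a war on an alien planet",
]

tfidf_demo = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Rows are unit length (norm='l2' is the default), so dot product == cosine.
dot = linear_kernel(tfidf_demo, tfidf_demo)
cos = cosine_similarity(tfidf_demo, tfidf_demo)
print(np.allclose(dot, cos))

# For unit-length vectors, squared Euclidean distance is 2 - 2 * cosine,
# so ranking by Euclidean distance gives the same order as ranking by cosine.
dist = euclidean_distances(tfidf_demo, tfidf_demo)
print(np.allclose(dist ** 2, 2 - 2 * cos))
```

On normalised TF-IDF vectors, then, switching to Euclidean distance should not change the ranking; it would only matter for unnormalised features.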

In [21]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[1][1]

1.0

The recommender should take one movie title as input and return the 5-10 most similar movies according to the cosine_sim scores.
1. Build a reverse map from movie title to row index.

In [22]:
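#Construct a reverse map of indices and movie titles
indices = pd.Series(df2.index, index=df2['title'])
# drop_duplicates() would compare the index values, which are all unique,
# so de-duplicate on the title labels instead (keep the first row per title)
indices = indices[~indices.index.duplicated(keep='first')]
indices[1:2]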

title
Pirates of the Caribbean: At World's End    1
dtype: int64
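
One caveat with a title-to-index map is repeated titles. The sketch below uses a made-up three-row frame to show that `Series.drop_duplicates()` compares values rather than index labels, and that filtering on `index.duplicated()` is one way to keep a single row per title. The names `toy`, `by_title` and `unique_titles` are illustrative only.

```python
import pandas as pd

# Toy frame with a repeated title (illustrative values).
toy = pd.DataFrame({"title": ["Batman", "Spectre", "Batman"]})

by_title = pd.Series(toy.index, index=toy["title"])

# drop_duplicates() compares the values (0, 1, 2), which are all distinct,
# so the repeated "Batman" label survives.
print(len(by_title.drop_duplicates()))  # 3

# Filtering on the index labels keeps the first row for each title.
unique_titles = by_title[~by_title.index.duplicated(keep="first")]
print(unique_titles["Batman"])  # 0
```

With a unique index, a lookup by title returns a single integer instead of a Series, which is what the recommendation function expects.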

Next steps:
1. Get the index of the movie given its title
2. Get the list of cosine_sim scores between that movie and all movies
3. Sort the generated list of tuples by similarity
4. Take the top 10 elements of this list, skipping the first one because it refers to the movie itself, and return the corresponding titles.
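
The steps above can be sketched on a toy similarity matrix. This is an illustrative variant using NumPy's `argsort`, not the notebook's own function; `top_k_similar` and `toy_sim` are hypothetical names.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return the indices of the k items most similar to item idx, excluding idx."""
    scores = sim_matrix[idx]
    # Sort indices by descending score; a stable sort keeps ties in index order.
    order = np.argsort(-scores, kind="stable")
    # Drop the query item explicitly instead of assuming it is ranked first.
    order = order[order != idx]
    return order[:k].tolist()

# Hypothetical 4x4 similarity matrix, made up for this example.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

print(top_k_similar(toy_sim, 0, k=2))  # [2, 1]
```

Removing the query index explicitly, rather than slicing off the first element, also covers the case where the query is not ranked first, for instance when its own score is 0 because its vector is all zeros.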

In [23]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    return df2['title'].iloc[movie_indices]

In [24]:
get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object

## **Generating metadata soup**
Extract the key variables: actors, director, keywords and genres.

Step 1: the data is stored as "stringified" lists, so we first convert it to a usable structure.
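
To make the conversion concrete, here is a small sketch of what `ast.literal_eval` does to one such cell. The string below is a shortened stand-in in the same shape as the dataset's raw values, not an exact row.

```python
from ast import literal_eval

# A shortened genres cell in the same shape as the raw CSV strings.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                   # str -> list of dicts
names = [entry["name"] for entry in parsed]  # pull out the readable names
print(type(parsed).__name__, names)          # list ['Action', 'Adventure']
```

`literal_eval` only evaluates Python literals (lists, dicts, strings, numbers), which makes it a safer choice than `eval` for data read from a file.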

In [25]:
list(df2.columns)

['budget',
 'genres',
 'homepage',
 'id',
 'keywords',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'vote_average',
 'vote_count',
 'tittle',
 'cast',
 'crew']

In [26]:
# using ast package to transform string to usable py obj
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

df2.head(2)

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | tittle | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... | http://www.avatarmovie.com/ | 19995 | [{'id': 1463, 'name': 'culture clash'}, {'id':... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | Avatar | [{'cast_id': 242, 'character': 'Jake Sully', '... | [{'credit_id': '52fe48009251416c750aca23', 'de... |
| 1 | 300000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | http://disney.go.com/disneypictures/pirates/ | 285 | [{'id': 270, 'name': 'ocean'}, {'id': 726, 'na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | Pirates of the Caribbean: At World's End | [{'cast_id': 4, 'character': 'Captain Jack Spa... | [{'credit_id': '52fe4232c3a36847f800b579', 'de... |


In [27]:
# Func to extract list then director

# Step1: Return top 3 elements or entire list
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Step2: Get the director's name from crew
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [28]:
## Generate new variable director and finalise the variables

In [29]:
# Define new director, cast, genres and keywords features that are in a suitable form.
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

In [30]:
# Print the new features of the first 3 films
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

| | title | cast | director | keywords | genres |
|---|---|---|---|---|---|
| 0 | Avatar | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron | [culture clash, future, space war] | [Action, Adventure, Fantasy] |
| 1 | Pirates of the Caribbean: At World's End | [Johnny Depp, Orlando Bloom, Keira Knightley] | Gore Verbinski | [ocean, drug abuse, exotic island] | [Adventure, Fantasy, Action] |
| 2 | Spectre | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes | [spy, based on novel, secret agent] | [Action, Adventure, Crime] |


**Data pre-processing**

In [31]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return ''
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [33]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)

**Create our "metadata soup", a single string that contains all the metadata**

In [34]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df2['soup'] = df2.apply(create_soup, axis=1)

Step 2: use **CountVectorizer()** instead of TF-IDF, because we do not want to down-weight an actor or director just because they appear in many movies.
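
A minimal sketch of that difference, assuming three invented soups that share one director token: raw counts give the shared token the same weight as a rare one, while TF-IDF gives it less. The soups and the names `cv`, `tv`, `count_row` and `tfidf_row` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical soups: every movie shares the token 'christophernolan'.
soups = [
    "batman christianbale christophernolan action",
    "dream leonardodicaprio christophernolan action",
    "space matthewmcconaughey christophernolan adventure",
]

cv = CountVectorizer()
tv = TfidfVectorizer()
count_row = cv.fit_transform(soups).toarray()[0]  # first movie, raw counts
tfidf_row = tv.fit_transform(soups).toarray()[0]  # first movie, TF-IDF weights

# Compare a token unique to this movie with the token shared by all movies.
for token in ("batman", "christophernolan"):
    print(token, count_row[cv.vocabulary_[token]], round(tfidf_row[tv.vocabulary_[token]], 3))
```

With counts, both tokens contribute equally to the similarity; with TF-IDF, the director shared by all three movies contributes less than the rare keyword, which is the effect this step avoids.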

In [35]:
df2.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'tittle', 'cast', 'crew', 'director', 'soup'],
      dtype='object')

In [36]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

In [41]:
#  Repeat the steps from initial recommendation function

In [42]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [43]:
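# Reset index of DF and reverse mapping as before
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])
# Keep the first row per title so that indices[title] returns a single index
indices = indices[~indices.index.duplicated(keep='first')]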

In [44]:
df_f = df2[['title', 'soup']]
df_f.head(5)

| | title | soup |
|---|---|---|
| 0 | Avatar | cultureclash future spacewar samworthington zo... |
| 1 | Pirates of the Caribbean: At World's End | ocean drugabuse exoticisland johnnydepp orland... |
| 2 | Spectre | spy basedonnovel secretagent danielcraig chris... |
| 3 | The Dark Knight Rises | dccomics crimefighter terrorist christianbale ... |
| 4 | John Carter | basedonnovel mars medallion taylorkitsch lynnc... |


In [45]:
df_f.to_csv('final_mv_dataset.csv', index=False)

In [46]:
get_recommendations('The Big Short', cosine_sim2)

925                  Crazy, Stupid, Love.
4247                Me You and Five Bucks
906     Anchorman 2: The Legend Continues
1571                         Hope Springs
1700              Florence Foster Jenkins
2485                          The Cookout
3238                 Little Miss Sunshine
3577                     The Way Way Back
4087                    American Splendor
1386                     Saving Mr. Banks
Name: title, dtype: object

# **T5 for recommendation**



In [3]:
## installations

# !pip install transformers
!pip install huggingface-hub
!pip install rouge_score==0.1.2
!pip install sentencepiece

import sys
!{sys.executable} -m pip install transformers torch torchvision


In [14]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import DataLoader, TensorDataset

# Load pre-trained T5 model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [15]:
import pandas as pd

# Load data and preprocess it
training_data = pd.read_csv('final_mv_dataset.csv')

# Convert the "soup" column to a list of strings
soup_list = training_data["soup"].astype(str).tolist()
title_list = training_data["title"].astype(str).tolist()

In [17]:

inputs = tokenizer(soup_list, return_tensors="pt", padding=True, truncation=True)
labels = tokenizer(title_list, return_tensors="pt", padding=True, truncation=True)["input_ids"]
# Mask padding positions with -100 so they are ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100

# Create dataset and a loader that reshuffles every epoch
dataset = TensorDataset(inputs["input_ids"], inputs["attention_mask"], labels)
batch_size = 16 #32
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Define the optimizer; the model computes the cross-entropy loss itself when labels are passed
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_epochs = 20
# Train the model
model.train()
for epoch in range(num_epochs):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Save the trained model
model.save_pretrained("kaggle/working/")

KeyboardInterrupt: 

In [None]:
training_data["soup"]

In [None]:
# test the sample dialogue-based t5-small code below

In [None]:
def current_context(dialog, instruction):
    # Join the dialogue turns with the EOS separator and prepend the instruction
    dialog = ' EOS '.join(dialog)
    context = f"{instruction} [CONTEXT] {dialog} "
    return context

def generate(context):
    # Keep the inputs on the same device as the model (the model is never moved to 'cuda' above)
    input_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output

instruction = 'Instruction: given a dialog about movie recommendation, you need to respond based on human preferences.'
dialog = [
    'Do you have any recommendation about a movie?',
    "Yes! Any particular genre that you'd like to see mentioned?",
]
print('User: ' + dialog[0])
print('Bot: ' + dialog[1])
# Chat loop: type 'end' to stop
while True:
    query = input('User: ')
    if query == 'end':
        break
    dialog.append(query)
    response = generate(current_context(dialog, instruction))
    print('Bot: ' + response)
    dialog.append(response)
