## Baseline Movie Recommender System

### Building a similarity-based recommender system

In [102]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [103]:
df = pd.read_csv("Data/IMDB_top_1000.csv")

In [104]:
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [105]:
df.shape


(1000, 16)

In [106]:
# Build one text document per movie: overview, then genre, director and the four lead actors
plots = df['Overview'].values.tolist()
for i in range(len(plots)):
    plots[i] = f"{plots[i]}  {df.loc[i,'Genre']}  {df.loc[i,'Director']} {df.loc[i,'Star1']} {df.loc[i,'Star2']} {df.loc[i,'Star3']} {df.loc[i,'Star4']}"
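
One thing to watch with the f-string above: a missing value is formatted as the literal text `nan`, which would then be indexed as a token. The sketch below shows one way to skip missing fields when building the text. It works on plain dictionaries; the `build_text` and `is_missing` helpers and the example record are hypothetical and not part of the notebook.

```python
import math

# Columns concatenated into the per-movie text, in the same order as above.
FIELDS = ["Overview", "Genre", "Director", "Star1", "Star2", "Star3", "Star4"]

def is_missing(value):
    """True for None and for float NaN, the usual markers of a missing cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_text(row, fields=FIELDS):
    """Join the available fields with single spaces, skipping missing ones."""
    parts = [str(row.get(name)) for name in fields if not is_missing(row.get(name))]
    return " ".join(parts)

# Hypothetical record: Star2 is NaN, Star3 and Star4 are absent.
example_row = {
    "Overview": "A heist goes wrong.",
    "Genre": "Crime",
    "Director": "Jane Doe",
    "Star1": "John Roe",
    "Star2": float("nan"),
}
text = build_text(example_row)
print(text)
```

With the DataFrame above, rows in this shape could be obtained from `df.to_dict('records')`.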

## TF-IDF Recommender

In [107]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform them into TF-IDF vectors
tfidf = tfidf_vectorizer.fit_transform(plots)
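
To make the weighting concrete, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes under its default settings: lowercased tokens of two or more word characters, raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The helper names (`tokenize`, `fit_idf`, `tfidf_vector`) and the toy documents are illustrative assumptions, not part of the notebook.

```python
import math
import re
from collections import Counter

# Default scikit-learn token pattern: runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Raw counts times IDF, L2-normalized; terms unseen at fit time are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}

# Toy corpus (hypothetical), standing in for the movie texts.
toy_docs = [
    "crime drama in boston",
    "romance drama at sea",
    "crime thriller heist",
]
idf = fit_idf(toy_docs)
vec = tfidf_vector("crime drama", idf)
print(idf["boston"] > idf["drama"])  # rarer terms receive a larger IDF
print(vec)
```

Because every vector has unit length, the cosine similarity between two of them reduces to their dot product.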

In [108]:
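def recomend_me(query, top_k=5):
    # Project the query into the TF-IDF space fitted on the movie texts
    tfidf_query = tfidf_vectorizer.transform([query])

    # Cosine similarity between the query and every movie, shape (1, n_movies)
    cos_similarities = cosine_similarity(tfidf_query, tfidf)

    # argsort is ascending, so the last top_k indices are the best matches
    sorted_idx = np.argsort(cos_similarities.squeeze())
    for idx in reversed(sorted_idx[-top_k:]):
        print(df.loc[idx, 'Series_Title'])
        print(plots[idx])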
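
The function above prints its results. The sketch below shows the same ranking step in isolation and returns `(title, score)` pairs instead, which can be easier to inspect or test. It is written in plain Python with dense toy vectors; `cosine`, `top_k_matches`, and the example data are hypothetical names introduced here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_matches(query_vec, doc_vecs, titles, k=5):
    """Return the k (title, score) pairs with the highest similarity, best first."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data (hypothetical): three "movies" in a two-dimensional feature space.
titles = ["A", "B", "C"]
doc_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
ranked = top_k_matches([1.0, 0.0], doc_vecs, titles, k=2)
for title, score in ranked:
    print(title, round(score, 3))
```

Returning the scores also makes it visible when the lower-ranked matches have a similarity close to zero.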

In [109]:
recomend_me('Crime with Leonardo DiCaprio')

The Departed
An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.  Crime, Drama, Thriller  Martin Scorsese Leonardo DiCaprio Matt Damon Jack Nicholson Mark Wahlberg
The Wolf of Wall Street
Based on the true story of Jordan Belfort, from his rise to a wealthy stock-broker living the high life to his fall involving crime, corruption and the federal government.  Biography, Crime, Drama  Martin Scorsese Leonardo DiCaprio Jonah Hill Margot Robbie Matthew McConaughey
Titanic
A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.  Drama, Romance  James Cameron Leonardo DiCaprio Kate Winslet Billy Zane Kathy Bates
Django Unchained
With the help of a German bounty hunter, a freed slave sets out to rescue his wife from a brutal Mississippi plantation owner.  Drama, Western  Quentin Tarantino Jamie Foxx Christoph Waltz Leonardo DiCaprio Kerry Washington
Shutter

## BERT Recommender

In [110]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [111]:
tokens = tokenizer(plots, padding=True, truncation=True, return_tensors='pt')

In [112]:
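# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**tokens)

# One pooled [CLS]-based vector per movie, shape (n_movies, 768)
embeddings = outputs.pooler_output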
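
`pooler_output` is the `[CLS]` hidden state passed through a linear layer and a tanh, and without fine-tuning it is often reported to be a weak summary of a sentence's meaning. A commonly suggested alternative is to average the last-layer token vectors while ignoring padding positions. The sketch below illustrates that masked mean on plain Python lists for a single sequence; `mean_pool` and the toy values are hypothetical, and whether this improves the recommendations here has not been verified.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        # No real tokens: fall back to a zero vector of the right size.
        return [0.0] * len(token_vectors[0])
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence (hypothetical): two real tokens followed by one padding token.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # the padding vector does not affect the average
```

With the cells above, the per-token vectors would come from `outputs.last_hidden_state` and the mask from `tokens['attention_mask']`; in practice the same average would be computed with tensor operations rather than Python loops.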

In [113]:
def recomend_me_bis(query, top_k=5):
    query_token = tokenizer([query], padding=True, truncation=True, return_tensors='pt')

    # Embed the query the same way as the movie texts
    with torch.no_grad():
        outputs = model(**query_token)

    embedding_query = outputs.pooler_output
    # Calculate cosine similarity between query and all descriptions
    cos_similarities = cosine_similarity(embedding_query, embeddings)
    # argsort is ascending, so the last top_k indices are the best matches
    sorted_idx = np.argsort(cos_similarities.squeeze())
    for idx in reversed(sorted_idx[-top_k:]):
        print(df.loc[idx, 'Series_Title'])
        print(plots[idx])

In [114]:
recomend_me_bis('Crime with Leonardo DiCaprio')

Do lok tin si
This Hong Kong-set crime drama follows the lives of a hitman, hoping to get out of the business, and his elusive female partner.  Comedy, Crime, Drama  Kar-Wai Wong Leon Lai Michelle Reis Takeshi Kaneshiro Charlie Yeung
Du rififi chez les hommes
Four men plan a technically perfect crime, but the human element intervenes...  Crime, Drama, Thriller  Jules Dassin Jean Servais Carl Möhner Robert Manuel Janine Darcey
There Will Be Blood
A story of family, religion, hatred, oil and madness, focusing on a turn-of-the-century prospector in the early days of the business.  Drama  Paul Thomas Anderson Daniel Day-Lewis Paul Dano Ciarán Hinds Martin Stringer
Le notti di Cabiria
A waifish prostitute wanders the streets of Rome looking for true love but finds only heartbreak.  Drama  Federico Fellini Giulietta Masina François Périer Franca Marzi Dorian Gray
Darbareye Elly
The mysterious disappearance of a kindergarten teacher during a picnic in the north of Iran is followed by a series