## Baseline Movie Recommender System

### Building a similarity-based recommender system

In [10]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [2]:
df = pd.read_csv("Data/IMDB_top_1000.csv")

In [3]:
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [4]:
df.shape


(1000, 16)

In [5]:
plots = df['Overview'].values.tolist()
for i in range(len(plots)):
    plots[i] = f"{plots[i]}  {df.loc[i,'Genre']}  {df.loc[i,'Director']} {df.loc[i,'Star1']} {df.loc[i,'Star2']} {df.loc[i,'Star3']} {df.loc[i,'Star4']}"
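
The loop above joins the overview, genre, director, and cast into one text field per movie. A vectorized variant is sketched below on a small hypothetical frame: `toy_df`, its values, and the `text_columns` subset are illustrative and not taken from the dataset. It assumes missing values should become empty strings; in the loop above, an f-string would render a missing value as the literal text `nan`.

```python
import pandas as pd

# Hypothetical subset of the text columns, kept short for illustration
text_columns = ["Overview", "Genre", "Director", "Star1"]

# Placeholder rows; the second one has a missing overview
toy_df = pd.DataFrame({
    "Overview": ["A heist goes wrong.", None],
    "Genre": ["Crime", "Drama"],
    "Director": ["Director A", "Director B"],
    "Star1": ["Actor A", "Actor B"],
})

toy_plots = (
    toy_df[text_columns]
    .fillna("")              # assumption: treat missing fields as empty text
    .astype(str)
    .agg(" ".join, axis=1)   # join the fields of each row with a space
    .tolist()
)

print(toy_plots[0])
```

Whether empty strings are the right treatment for missing fields depends on the data; dropping such rows is another option.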

## TF-IDF recommender

In [23]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gillesdeknache/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gillesdeknache/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [11]:
# Initialize Porter Stemmer and stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))


In [12]:
def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove punctuation and stopwords, and apply stemming
    processed_tokens = [stemmer.stem(word.lower()) for word in tokens if word.isalnum() and word.lower() not in stop_words]
    
    # Join the tokens back into a single string
    return ' '.join(processed_tokens)

In [13]:
preprocessed_plots = [preprocess_text(plot) for plot in plots]

In [17]:
preprocessed_plots[0]

'two imprison men bond number year find solac eventu redempt act common decenc drama frank darabont tim robbin morgan freeman bob gunton william sadler'

In [18]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform them into TF-IDF vectors
tfidf = tfidf_vectorizer.fit_transform(preprocessed_plots)
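
To make the fit-then-transform step concrete, here is a minimal sketch on a three-document toy corpus. The documents, the query, and every `toy_*` name are hypothetical; only `TfidfVectorizer` and `cosine_similarity` come from the notebook. The key point is that the query must go through `transform` on the already fitted vectorizer so that it shares the corpus vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-stemmed documents standing in for preprocessed plots
toy_docs = [
    "crime drama mafia famili",
    "space adventur wormhol explor",
    "romanc drama ship",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)   # sparse, one row per document

# The query is only transformed, never fitted
toy_query = toy_vectorizer.transform(["space adventur"])

# One similarity score per document
toy_scores = cosine_similarity(toy_query, toy_matrix).ravel()
best = int(toy_scores.argmax())
print(toy_docs[best])
```

Documents that share no term with the query get a score of exactly zero, which is why TF-IDF ranks well on names and genres but cannot match a query by meaning alone.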

In [47]:
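def recomend_me(query, top_k=5):
    # Apply the same preprocessing that was used for the corpus
    preprocessed_query = preprocess_text(query)

    # Project the query into the fitted TF-IDF space
    tfidf_query = tfidf_vectorizer.transform([preprocessed_query])

    # Cosine similarity between the query and every movie
    cos_similarities = cosine_similarity(tfidf_query, tfidf)

    # Ascending sort: the best matches are the last top_k indices
    sorted_idx = np.argsort(cos_similarities.squeeze())
    for idx in reversed(sorted_idx[-top_k:]):
        print(df.loc[idx, 'Series_Title'])
        print(plots[idx])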
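
`recomend_me` prints its matches, which makes the results hard to reuse or compare. One possible alternative, sketched here with a hypothetical helper `top_k_matches` and placeholder titles and scores, is to return the ranked pairs and let the caller decide how to display them.

```python
import numpy as np

def top_k_matches(scores, titles, k=5):
    """Return the k highest-scoring (title, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float).ravel()
    k = min(k, scores.size)                   # guard against k > number of items
    order = np.argsort(scores)[::-1][:k]      # descending order, truncated to k
    return [(titles[i], float(scores[i])) for i in order]

# Placeholder data, not taken from the dataset
toy_titles = ["Movie A", "Movie B", "Movie C"]
toy_scores = [0.10, 0.90, 0.50]

matches = top_k_matches(toy_scores, toy_titles, k=2)
print(matches)
```

In the notebook, `scores` would be the squeezed cosine similarities and `titles` the `Series_Title` column converted to a list.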

In [20]:
recomend_me('Crime movie with Leonardo DiCaprio')

The Wolf of Wall Street
Based on the true story of Jordan Belfort, from his rise to a wealthy stock-broker living the high life to his fall involving crime, corruption and the federal government.  Biography, Crime, Drama  Martin Scorsese Leonardo DiCaprio Jonah Hill Margot Robbie Matthew McConaughey
The Departed
An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.  Crime, Drama, Thriller  Martin Scorsese Leonardo DiCaprio Matt Damon Jack Nicholson Mark Wahlberg
Shutter Island
In 1954, a U.S. Marshal investigates the disappearance of a murderer who escaped from a hospital for the criminally insane.  Mystery, Thriller  Martin Scorsese Leonardo DiCaprio Emily Mortimer Mark Ruffalo Ben Kingsley
Titanic
A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.  Drama, Romance  James Cameron Leonardo DiCaprio Kate Winslet Billy Zane Kathy Bates
Inception
A th

In [22]:
recomend_me("Adventure movie with Christopher Nolan")

Batman Begins
After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from corruption.  Action, Adventure  Christopher Nolan Christian Bale Michael Caine Ken Watanabe Liam Neeson
Interstellar
A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.  Adventure, Drama, Sci-Fi  Christopher Nolan Matthew McConaughey Anne Hathaway Jessica Chastain Mackenzie Foy
Memento
A man with short-term memory loss attempts to track down his wife's murderer.  Mystery, Thriller  Christopher Nolan Guy Pearce Carrie-Anne Moss Joe Pantoliano Mark Boone Junior
Inception
A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.  Action, Adventure, Sci-Fi  Christopher Nolan Leonardo DiCaprio Joseph Gordon-Levitt Elliot Page Ken Watanabe
The Dark Knight Rises
Eight years after the Joker's reign of anarchy, Batman, with the help of the enigmati

## BERT Recommender

In [123]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [124]:
tokens = tokenizer(plots, padding=True, truncation=True, return_tensors='pt')

In [125]:
with torch.no_grad():
    outputs = model(**tokens)

embeddings = outputs.pooler_output
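
`pooler_output` is the `[CLS]` hidden state passed through a dense layer and a tanh, and it is not trained to act as a sentence embedding. A commonly used alternative is to average the token vectors while ignoring padding. The sketch below shows that masked mean in NumPy on toy arrays; `masked_mean_pool` is a hypothetical helper, and the arrays stand in for `outputs.last_hidden_state` and `tokens['attention_mask']` converted to NumPy. This is not the method used in this notebook, and whether it improves the recommendations here is untested.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)        # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # avoid division by zero
    return summed / counts

# Toy batch: one sequence with two real tokens and one padded position
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(toy_hidden, toy_mask)
print(pooled)
```

The padded vector `[100, 100]` does not affect the result, which is the point of weighting by the attention mask.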

In [126]:
def recomend_me_bis(query, top_k=5):
    # Tokenize and embed the query the same way as the plots
    query_token = tokenizer([query], padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**query_token)

    embedding_query = outputs.pooler_output

    # Cosine similarity between the query and all descriptions
    cos_similarities = cosine_similarity(embedding_query, embeddings)

    # Ascending sort: the best matches are the last top_k indices
    sorted_idx = np.argsort(cos_similarities.squeeze())
    for idx in reversed(sorted_idx[-top_k:]):
        print(df.loc[idx, 'Series_Title'])
        print(plots[idx])

In [128]:
recomend_me_bis('Crime with Leonardo DiCaprio')

Do lok tin si
This Hong Kong-set crime drama follows the lives of a hitman, hoping to get out of the business, and his elusive female partner.  Comedy, Crime, Drama  Kar-Wai Wong Leon Lai Michelle Reis Takeshi Kaneshiro Charlie Yeung
Du rififi chez les hommes
Four men plan a technically perfect crime, but the human element intervenes...  Crime, Drama, Thriller  Jules Dassin Jean Servais Carl Möhner Robert Manuel Janine Darcey
There Will Be Blood
A story of family, religion, hatred, oil and madness, focusing on a turn-of-the-century prospector in the early days of the business.  Drama  Paul Thomas Anderson Daniel Day-Lewis Paul Dano Ciarán Hinds Martin Stringer
Le notti di Cabiria
A waifish prostitute wanders the streets of Rome looking for true love but finds only heartbreak.  Drama  Federico Fellini Giulietta Masina François Périer Franca Marzi Dorian Gray
Darbareye Elly
The mysterious disappearance of a kindergarten teacher during a picnic in the north of Iran is followed by a series

In [129]:
recomend_me_bis('Fun and exciting with adventure')

There Will Be Blood
A story of family, religion, hatred, oil and madness, focusing on a turn-of-the-century prospector in the early days of the business.  Drama  Paul Thomas Anderson Daniel Day-Lewis Paul Dano Ciarán Hinds Martin Stringer
Darbareye Elly
The mysterious disappearance of a kindergarten teacher during a picnic in the north of Iran is followed by a series of misadventures for her fellow travelers.  Drama, Mystery  Asghar Farhadi Golshifteh Farahani Shahab Hosseini Taraneh Alidoosti Merila Zare'i
Swades: We, the People
A successful Indian scientist returns to an Indian village to take his nanny to America with him and in the process rediscovers his roots.  Drama  Ashutosh Gowariker Shah Rukh Khan Gayatri Joshi Kishori Ballal Smit Sheth
M.S. Dhoni: The Untold Story
The untold story of Mahendra Singh Dhoni's journey from ticket collector to trophy collector - the world-cup-winning captain of the Indian Cricket Team.  Biography, Drama, Sport  Neeraj Pandey Sushant Singh Rajput 

## SBERT

In [24]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [28]:
# Sentences are encoded by calling model.encode()
embedding_plots = model.encode(plots)

In [40]:
def recomend_me_sbert(query, top_k=1):
    movie_name = []
    movie_plot = []

    # Encode the query into the same embedding space as the plots
    encoded_query = model.encode(query)
    cos_similarities = cosine_similarity(encoded_query.reshape(1, -1), embedding_plots)

    # Ascending sort: the best matches are the last top_k indices
    sorted_idx = np.argsort(cos_similarities.squeeze())

    for idx in reversed(sorted_idx[-top_k:]):
        movie_name.append(df.loc[idx, 'Series_Title'])
        movie_plot.append(plots[idx])

    return movie_name, movie_plot
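
All three recommenders rank movies by cosine similarity between a query vector and a matrix of movie vectors. For reference, that quantity can be sketched in plain NumPy as below; `cosine_scores` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the vectors are toy values.

```python
import numpy as np

def cosine_scores(query_vec, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_vec = np.asarray(query_vec, dtype=float).ravel()
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(query_vec) * np.linalg.norm(matrix, axis=1)
    # Assumption: an all-zero vector gets a score of 0 instead of dividing by zero
    safe_norms = np.where(norms == 0, 1.0, norms)
    return (matrix @ query_vec) / safe_norms

toy_query = [1.0, 0.0]
toy_matrix = [[1.0, 0.0],   # same direction as the query
              [0.0, 1.0],   # orthogonal to the query
              [1.0, 1.0]]   # 45 degrees from the query

scores = cosine_scores(toy_query, toy_matrix)
print(scores)
```

Because the score depends only on direction, a long plot and a short query can still match closely as long as their vectors point the same way.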

In [43]:
recomend_me_sbert("Adventure and Comedy")

(384,)
(1000, 384)


(['Shrek'],
 ['A mean lord exiles fairytale creatures to the swamp of a grumpy ogre, who must go on a quest and rescue a princess for the lord in order to get his land back.  Animation, Adventure, Comedy  Andrew Adamson Vicky Jenson Mike Myers Eddie Murphy Cameron Diaz'])

In [44]:
recomend_me_sbert("Something that makes me cry")

(384,)
(1000, 384)


(['Mystic River'],
 ['The lives of three men who were childhood friends are shattered when one of them has a family tragedy.  Crime, Drama, Mystery  Clint Eastwood Sean Penn Tim Robbins Kevin Bacon Emmy Rossum'])

In [48]:
recomend_me("Something that makes me cry")

Stagecoach
A group of people traveling on a stagecoach find their journey complicated by the threat of Geronimo and learn something about each other in the process.  Adventure, Drama, Western  John Ford John Wayne Claire Trevor Andy Devine John Carradine
Fight Club
An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.  Drama  David Fincher Brad Pitt Edward Norton Meat Loaf Zach Grenier
Watchmen
In 1985 where former superheroes exist, the murder of a colleague sends active vigilante Rorschach into his own sprawling investigation, uncovering something that could completely change the course of history as we know it.  Action, Drama, Mystery  Zack Snyder Jackie Earle Haley Patrick Wilson Carla Gugino Malin Akerman
The Peanut Butter Falcon
Zak runs away from his care home to make his dream of becoming a wrestler come true.  Adventure, Comedy, Drama  Tyler Nilson Michael Schwartz Zack Gottsagen Ann Owens Dakota J