<a href="https://colab.research.google.com/github/terryliu1993/-Recommendation-System-Movie-Recommendation-Engine/blob/main/Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import re
nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import linear_kernel


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# use this only if you use google colab
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving wiki_movie_plots_deduped.csv to wiki_movie_plots_deduped.csv
User uploaded file "wiki_movie_plots_deduped.csv" with length 81193310 bytes


In [7]:
metadata=pd.read_csv('wiki_movie_plots_deduped.csv',header=0)
metadata.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


# EDA

First, get an overview of the data: its size, the column names, the number of missing values per column, and the column types.

In [None]:
print('Shape of data: ',metadata.shape)
print(metadata.columns)

print('========================')
print(metadata.isnull().sum())

print('========================')
print(metadata.dtypes)

Shape of data:  (34886, 8)
Index(['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast',
       'Genre', 'Wiki Page', 'Plot'],
      dtype='object')
Release Year           0
Title                  0
Origin/Ethnicity       0
Director               0
Cast                1422
Genre                  0
Wiki Page              0
Plot                   0
dtype: int64
Release Year         int64
Title               object
Origin/Ethnicity    object
Director            object
Cast                object
Genre               object
Wiki Page           object
Plot                object
dtype: object


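About 4% of the records (1,422 of 34,886) have a missing Cast. Those gaps may add noise when we build the model, so Cast is a candidate for removal. The cell below drops only Wiki Page; its commented-out line shows how to drop Cast as well.

The Wiki Page URL is not a feature of interest for the NLP model in this project, so it is dropped.

If the CSV loads with an empty trailing column Unnamed: 8 (it does not appear in the column list above), it can be dropped, too.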
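
As a quick check, the share of missing Cast values can be computed from the counts printed above. The two numbers are copied from that output, not recomputed from the file.

```python
# Counts copied from the isnull() and shape output above
missing_cast = 1422
total_rows = 34886

missing_share = missing_cast / total_rows
print(f"Cast is missing in {missing_share:.1%} of rows")
```

On the DataFrame itself, `metadata.isnull().mean()` gives the same ratio for every column at once.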

In [8]:
metadata=metadata.drop(['Wiki Page'],axis=1)
print('New data size is ',metadata.shape)


# alternative: also drop Cast and the empty 'Unnamed: 8' column (if present)
# metadata=metadata.drop(['Cast','Wiki Page','Unnamed: 8'],axis=1)
# print('New data size is ',metadata.shape)

New data size is  (34886, 7)


Find duplicate records.
Since a movie may be remade many years later under the same title, we introduce a Temporary feature combining Release Year + Director + Title to determine whether a record is a true duplicate.

In [9]:
# Find the true duplicates by Release Year+Director+Title
metadata['Temporary']=metadata['Release Year'].astype(str)+metadata['Director'].astype(str)+metadata['Title'].astype(str)
metadata['Temporary'].nunique()

34886

Since the number of unique keys equals the number of rows in metadata, we conclude there are no duplicate records.
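
The key above is built by joining strings without a separator, so two different records could in principle collapse to the same key. A minimal sketch of the same check using tuples as keys is shown here; the sample records and the helper name `duplicate_keys` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical sample records; the field names mirror the notebook's columns
records = [
    {"Release Year": 1947, "Director": "Director X", "Title": "Title A"},
    {"Release Year": 1947, "Director": "Director X", "Title": "Title B"},
    {"Release Year": 1947, "Director": "Director X", "Title": "Title A"},
]

def duplicate_keys(rows):
    """Return the (year, director, title) keys that occur more than once."""
    counts = Counter((r["Release Year"], r["Director"], r["Title"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]

print(duplicate_keys(records))
```

With pandas, a comparable check would be `metadata.duplicated(subset=['Release Year', 'Director', 'Title']).sum()`, which counts rows that repeat an earlier row on those three columns.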

In [None]:
metadata.Genre.value_counts()

unknown                               6083
drama                                 5964
comedy                                4379
horror                                1167
action                                1098
                                      ... 
drama / western / crime                  1
reincarnation drama                      1
romance/teen                             1
action-adventure, animated, family       1
crime drama, superhero                   1
Name: Genre, Length: 2265, dtype: int64

In [None]:
# inspect the string length of each Origin/Ethnicity value
metadata['Origin/Ethnicity'].apply(lambda x:len(str(x)))
# remove the 2 rows with wrong information (Origin/Ethnicity string length >= 20)
metadata=metadata[metadata['Origin/Ethnicity'].apply(lambda x:len(str(x)))<20]
print('Data size becomes: ',metadata.shape)

Data size becomes:  (34886, 8)


# The Recommendation Engine

In [10]:
# define a function to tokenize sentence, tokenize word, and stem words
stemmer = SnowballStemmer("english")
def token_stem(doc):
  # tokenize document into sentence, then tokenize sentence into words
  tokens=[word for sent in nltk.sent_tokenize(doc) for word in nltk.word_tokenize(sent)]
  # keep only tokens that contain at least one letter (drops pure numbers and punctuation)
  letter_only_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
  # stem the word token
  stemmed_tokens=[stemmer.stem(word) for word in letter_only_tokens]
  return stemmed_tokens
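
To make the letter filter in token_stem concrete, here is a small standalone sketch. The token list is made up; it only approximates what a word tokenizer might emit.

```python
import re

# Hypothetical tokens, roughly what a word tokenizer might emit
tokens = ["In", "1947", ",", "Yeong-jin", "3rd", "...", "film"]

# Same filter as in token_stem: keep tokens containing at least one ASCII letter
letter_tokens = [t for t in tokens if re.search("[a-zA-Z]", t)]
print(letter_tokens)  # ['In', 'Yeong-jin', '3rd', 'film']
```

Note that mixed tokens such as "3rd" survive, because the pattern only requires one letter somewhere in the token.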

In [11]:
# TfidfVectorizer
tfidf=TfidfVectorizer(stop_words='english',tokenizer=token_stem,ngram_range=(1,3),decode_error='ignore')
# generate tfidf_matrix by fit and transform on plot
tfidf_matrix=tfidf.fit_transform([plot for plot in metadata['Plot'].astype(str)])
print('tfidf_matrix size is: ',tfidf_matrix.shape)





tfidf_matrix size is:  (34886, 10107563)
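
The roughly 10 million columns are probably a consequence of ngram_range=(1,3), which makes the vectorizer count every unigram, bigram, and trigram. The sketch below illustrates how n-grams multiply; the helper `ngrams` and the token list are hypothetical, not part of the notebook.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams (joined by spaces) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["man", "lose", "memori", "kidnap"]  # hypothetical stemmed tokens
print(ngrams(tokens))
```

Four tokens already yield nine features. If the vocabulary becomes too large, TfidfVectorizer's `min_df` or `max_features` parameters are options worth trying.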


# Cosine Similarity

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

-- The numerator $A \cdot B$ is the dot product (the sum of the element-wise products); it returns a scalar.

-- The denominator uses the magnitude $\|A\| = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2}$.

-- TfidfVectorizer L2-normalizes each row by default (norm='l2'), so $\|A\| = \|B\| = 1$ for every pair of documents. It is therefore computationally cheap to use linear_kernel to calculate ONLY the numerator (the dot product).
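
A small pure-Python sketch of the formula, using made-up vectors, shows that the dot product of unit-length vectors equals their cosine similarity. The helper names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]  # made-up vectors
b = [2.0, 1.0, 1.0]
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Both printed values agree up to floating-point rounding.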

In [None]:
# calculate pairwise cosine_similarity on tfidf_matrix
cos_sim=linear_kernel(tfidf_matrix,tfidf_matrix)
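
Before running the cell above, it may be worth estimating its memory footprint: linear_kernel returns a dense array by default, so the result has one value per pair of movies. The estimate below assumes float64 values and uses the row count from the output above.

```python
n_movies = 34886      # rows in tfidf_matrix, from the output above
bytes_per_value = 8   # assumption: float64

dense_bytes = n_movies * n_movies * bytes_per_value
print(f"{dense_bytes / 1024**3:.1f} GiB")
```

If that does not fit in memory, one option is to compute a single row on demand, for example `linear_kernel(tfidf_matrix[row_number], tfidf_matrix)`, which yields a 1 x n array for one movie.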

In [2]:
# def get_recommendation(input_title,cos_sim=cos_sim):
input_title='Forgotten'
# find the positional row of the movie (the first match if several movies share the title)
row_number=np.flatnonzero((metadata['Title'] == input_title).to_numpy())[0]
# each element in that row is the pairwise cosine similarity with the input movie;
# pair every similarity with its row position
rankings=list(enumerate(cos_sim[row_number]))
# sort the ranking in descending order
sorted_rankings=sorted(rankings,key=lambda x:x[1],reverse=True)
# pick the top 10 most similar movies (with largest cosine_similarity),
# excluding the input movie itself (cosine_similarity=1)
top_10=[element for element in sorted_rankings if element[0]!=row_number][:10]
movie_id=[element[0] for element in top_10]
movie_names=[metadata['Title'].iloc[number] for number in movie_id]
movie_names

NameError: ignored
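
The ranking step can also be written without sorting the whole similarity row. This is a minimal sketch with a made-up row of scores; `top_k_similar` is a hypothetical helper, not a function from the notebook.

```python
import heapq

def top_k_similar(scores, query_index, k=10):
    """Return the indices of the k highest scores, skipping the query itself."""
    candidates = ((score, idx) for idx, score in enumerate(scores) if idx != query_index)
    return [idx for score, idx in heapq.nlargest(k, candidates)]

scores = [0.10, 1.00, 0.35, 0.80, 0.05]  # hypothetical similarity row for movie 1
print(top_k_similar(scores, query_index=1, k=3))  # [3, 2, 0]
```

Skipping the query by index, rather than assuming it sorts first, also behaves sensibly when another movie has an identical plot and therefore the same score.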

# Manually explore some similar movies:

In [None]:
metadata['Origin/Ethnicity'].value_counts()

Tamil           1662
Telugu          1311
Japanese        1188
South_Korean     522
Russian          232
Turkish           70
Malaysian         70
Maldivian          2
Name: Origin/Ethnicity, dtype: int64

In [None]:
metadata[metadata['Origin/Ethnicity']=='South_Korean']

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Genre,Wiki Page,Plot,Temporary,Title_and_plot
4468,1947,Arirang,South_Korean,Na Woon-gyu,drama,https://en.wikipedia.org/wiki/Arirang_(1926_film),Yeong-jin is a student who has become mentally...,1947Na Woon-gyuArirang,Arirang Yeong-jin is a student who has become ...
4469,1947,Nongjungjo,South_Korean,Lee Gyu-seol,melodrama,https://en.wikipedia.org/wiki/Nongjungjo,The story is a melodrama concerning two lovers...,1947Lee Gyu-seolNongjungjo,Nongjungjo The story is a melodrama concerning...
4470,1947,Soldier of Fortune,South_Korean,Na Woon-gyu,melodrama,https://en.wikipedia.org/wiki/Punguna,"In Punguna, Na Woon-gyu plays the role of Nico...",1947Na Woon-gyuSoldier of Fortune,"Soldier of Fortune In Punguna, Na Woon-gyu pla..."
4471,1947,Deuljwi,South_Korean,Na Woon-gyu,melodrama,https://en.wikipedia.org/wiki/Deuljwi,The plot concerns a young couple who have made...,1947Na Woon-gyuDeuljwi,Deuljwi The plot concerns a young couple who h...
4472,1947,Farewell,South_Korean,Na Woon-gyu,drama,https://en.wikipedia.org/wiki/Jalitgeola,This film is a melodrama telling a story of gr...,1947Na Woon-gyuFarewell,Farewell This film is a melodrama telling a st...
...,...,...,...,...,...,...,...,...,...
4985,2017,The Swindlers,South_Korean,Jang Chang-won,unknown,https://en.wikipedia.org/wiki/The_Swindlers_(2...,A con artist who had been reported dead after ...,2017Jang Chang-wonThe Swindlers,The Swindlers A con artist who had been report...
4986,2017,Forgotten,South_Korean,Jang Hang-jun,unknown,https://en.wikipedia.org/wiki/Forgotten_(2017_...,A man loses his memory after being kidnapped f...,2017Jang Hang-junForgotten,Forgotten A man loses his memory after being k...
4987,2017,Steel Rain,South_Korean,Yang Woo-suk,unknown,https://en.wikipedia.org/wiki/Steel_Rain,A former agent from the North Korean intellige...,2017Yang Woo-sukSteel Rain,Steel Rain A former agent from the North Korea...
4988,2017,Along With the Gods: The Two Worlds,South_Korean,Kim Yong-hwa,unknown,https://en.wikipedia.org/wiki/Along_With_the_G...,Story of the death of an ordinary fireman name...,2017Kim Yong-hwaAlong With the Gods: The Two W...,Along With the Gods: The Two Worlds Story of t...


In [None]:
metadata[metadata['Title']=='Santa Barbara']

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,Unnamed: 8
4839,2014,Santa Barbara,South_Korean,David Cho,"Lee Sang-yoon, Yoon Jin-seo",unknown,https://en.wikipedia.org/wiki/Santa_Barbara_(f...,Jung-woo is a naïve music director of film and...,


# Conclusion
To improve accuracy, we may:
1. Use web scraping to collect the missing Cast and Genre values.
2. Add release year, origin, director, cast, and genre to the feature matrix, then compute the similarity ranking.
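
One possible way to approach the second point is to merge the extra fields into a single text per movie before vectorizing. The sketch below assumes that approach; the helper `build_feature_text`, the field selection, and the sample record are all hypothetical.

```python
def build_feature_text(record, fields=("Title", "Director", "Genre", "Origin/Ethnicity", "Plot")):
    """Join the selected fields into one string, skipping missing or 'unknown' values."""
    parts = []
    for field in fields:
        value = record.get(field)
        if value is None or str(value).strip().lower() in ("", "unknown", "nan"):
            continue
        parts.append(str(value).strip())
    return " ".join(parts)

record = {  # hypothetical record, not taken from the dataset
    "Title": "Example Title",
    "Director": "Unknown",
    "Genre": "drama",
    "Origin/Ethnicity": "American",
    "Plot": "A short made-up plot.",
}
print(build_feature_text(record))
```

The combined strings could then be passed to the same TfidfVectorizer in place of the Plot column alone.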