## Content-based recommender system for movies

The recommender we will build in this practical uses data from IMDb (the Internet Movie Database), a website that provides information about millions of films and television programs.

The dataset that we will use in this practical contains the top 250 English movies and can be downloaded at the following link: https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7. The metadata used includes each movie's genre, director, main actors and plot.

This Python script is adapted from https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243

<b>Note:</b>

Make sure the `rake_nltk` library, which provides the `Rake` class (Rapid Automatic Keyword Extraction algorithm), is installed. <br />Refer to https://pypi.org/project/rake-nltk/ for more information.

In [1]:
# run this cell only once to install rake_nltk
# NLTK: Natural Language Toolkit

# from within this Jupyter notebook:
!pip install rake_nltk



In [2]:
import numpy as np
import pandas as pd
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to C:\Users\Wei
[nltk_data]     Ping\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Wei
[nltk_data]     Ping\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 1: Read in and analyse the data

In [4]:
# if you placed the CSV file in your current directory:
df = pd.read_csv('IMDB_Top250Engmovies2_OMDB_Detailed.csv')

# if you want to read the CSV file directly from its URL instead:
# url = 'https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7'
# df = pd.read_csv(url)

In [5]:
df.head()

| | Unnamed: 0 | Title | Year | Rated | Released | Runtime | Genre | Director | Writer | Actors | ... | tomatoConsensus | tomatoUserMeter | tomatoUserRating | tomatoUserReviews | tomatoURL | DVD | BoxOffice | Production | Website | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | R | 14-Oct-94 | 142 min | Crime, Drama | Frank Darabont | Stephen King (short story "Rita Hayworth and S... | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | ... | | | | | http://www.rottentomatoes.com/m/shawshank_rede... | 27-Jan-98 | | Columbia Pictures | | True |
| 1 | 2 | The Godfather | 1972 | R | 24-Mar-72 | 175 min | Crime, Drama | Francis Ford Coppola | Mario Puzo (screenplay), Francis Ford Coppola ... | Marlon Brando, Al Pacino, James Caan, Richard ... | ... | | | | | http://www.rottentomatoes.com/m/godfather/ | 9-Oct-01 | | Paramount Pictures | http://www.thegodfather.com | True |
| 2 | 3 | The Godfather: Part II | 1974 | R | 20-Dec-74 | 202 min | Crime, Drama | Francis Ford Coppola | Francis Ford Coppola (screenplay), Mario Puzo ... | Al Pacino, Robert Duvall, Diane Keaton, Robert... | ... | | | | | http://www.rottentomatoes.com/m/godfather_part... | 24-May-05 | | Paramount Pictures | http://www.thegodfather.com/ | True |
| 3 | 4 | The Dark Knight | 2008 | PG-13 | 18-Jul-08 | 152 min | Action, Crime, Drama | Christopher Nolan | Jonathan Nolan (screenplay), Christopher Nolan... | Christian Bale, Heath Ledger, Aaron Eckhart, M... | ... | | | | | http://www.rottentomatoes.com/m/the_dark_knight/ | 9-Dec-08 | $533,316,061 | Warner Bros. Pictures/Legendary | http://thedarkknight.warnerbros.com/ | True |
| 4 | 5 | 12 Angry Men | 1957 | APPROVED | 1-Apr-57 | 96 min | Crime, Drama | Sidney Lumet | Reginald Rose (story), Reginald Rose (screenplay) | Martin Balsam, John Fiedler, Lee J. Cobb, E.G.... | ... | | | | | http://www.rottentomatoes.com/m/1000013-12_ang... | 6-Mar-01 | | Criterion Collection | http://www.criterion.com/films/27871-12-angry-men | True |


In [6]:
# num rows and num cols
df.shape

(250, 38)

<img align="left" src='https://drive.google.com/uc?export=view&id=0B08uY8vosNfoeUJ4NUxtMlVNNnM' style="width: 60px; height: 60px;"><br />
If you want to do recommendations, do you need all the features?<br />
Which features do you think should be used?

Base the recommendations on the following input features.

In [7]:
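# create a subset of the initial dataframe (based on only 5 columns);
# .copy() makes it an independent dataframe, so that later column
# assignments do not trigger a SettingWithCopyWarning
df = df[['Title', 'Genre', 'Director', 'Actors', 'Plot']].copy()
df.head()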

| | Title | Genre | Director | Actors | Plot |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Crime, Drama | Frank Darabont | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | Crime, Drama | Francis Ford Coppola | Marlon Brando, Al Pacino, James Caan, Richard ... | The aging patriarch of an organized crime dyna... |
| 2 | The Godfather: Part II | Crime, Drama | Francis Ford Coppola | Al Pacino, Robert Duvall, Diane Keaton, Robert... | The early life and career of Vito Corleone in ... |
| 3 | The Dark Knight | Action, Crime, Drama | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, M... | When the menace known as the Joker emerges fro... |
| 4 | 12 Angry Men | Crime, Drama | Sidney Lumet | Martin Balsam, John Fiedler, Lee J. Cobb, E.G.... | A jury holdout attempts to prevent a miscarria... |


In [8]:
df.shape

(250, 5)

### Step 2a: Data pre-processing
Transform the full names of <b>actors</b>, <b>genres</b> and <b>directors</b> into single words so that each one is treated as a <b>unique value</b>.
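
To see why the names are merged, here is a minimal sketch in plain Python. The helper `merge_name` and the two sample names are illustrative assumptions, not part of the notebook's own code. It compares the word-level overlap of two people who share a first name with the overlap of their merged tokens.

```python
def merge_name(full_name):
    """Collapse a full name into one lowercase token, e.g. 'Al Pacino' -> 'alpacino'."""
    return full_name.strip().lower().replace(' ', '')

# Two sample names that share a first name (chosen only for illustration)
a, b = 'Christopher Nolan', 'Christopher Lloyd'

# Word-level tokens overlap on the shared first name
shared_words = set(a.lower().split()) & set(b.lower().split())

# Merged tokens do not overlap at all
shared_merged = {merge_name(a)} & {merge_name(b)}

print(shared_words)   # {'christopher'}
print(shared_merged)  # set()
```

Without merging, a word-based similarity measure would count the shared first name as common content between two otherwise unrelated movies.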

#### Transforming `Actors`

In [9]:
df['Actors'].head()

0    Tim Robbins, Morgan Freeman, Bob Gunton, Willi...
1    Marlon Brando, Al Pacino, James Caan, Richard ...
2    Al Pacino, Robert Duvall, Diane Keaton, Robert...
3    Christian Bale, Heath Ledger, Aaron Eckhart, M...
4    Martin Balsam, John Fiedler, Lee J. Cobb, E.G....
Name: Actors, dtype: object

In [10]:
# Keep only the first three actors: split the string on the commas
# between the actors' full names and store the names in a list
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])

In [11]:
df['Actors'].head()

0        [Tim Robbins,  Morgan Freeman,  Bob Gunton]
1           [Marlon Brando,  Al Pacino,  James Caan]
2         [Al Pacino,  Robert Duvall,  Diane Keaton]
3    [Christian Bale,  Heath Ledger,  Aaron Eckhart]
4       [Martin Balsam,  John Fiedler,  Lee J. Cobb]
Name: Actors, dtype: object

In [12]:
# merge each actor's first and last name into one lowercase token, so that
# two people who share a first name are not treated as partly the same
# (this also removes the leading space left over from the split)

# assign the result back to the column: changes made to the `row` objects
# yielded by iterrows() are not guaranteed to be written back to df
df['Actors'] = df['Actors'].map(lambda names: [x.lower().replace(' ', '') for x in names])

In [13]:
df['Actors'].head()

0        [timrobbins, morganfreeman, bobgunton]
1           [marlonbrando, alpacino, jamescaan]
2         [alpacino, robertduvall, dianekeaton]
3    [christianbale, heathledger, aaroneckhart]
4        [martinbalsam, johnfiedler, leej.cobb]
Name: Actors, dtype: object

#### Transforming `Genre`

In [14]:
# tip: to select several cells at once, click a cell, then Shift-click a cell a few cells further down
df['Genre'].head()

0            Crime, Drama
1            Crime, Drama
2            Crime, Drama
3    Action, Crime, Drama
4            Crime, Drama
Name: Genre, dtype: object

In [15]:
# Keep at most the first three genres: split the string on the commas
# between the genres and store the genre names in a list
df['Genre'] = df['Genre'].map(lambda x: x.split(',')[:3])

In [16]:
df['Genre'].head()

0             [Crime,  Drama]
1             [Crime,  Drama]
2             [Crime,  Drama]
3    [Action,  Crime,  Drama]
4             [Crime,  Drama]
Name: Genre, dtype: object

In [17]:
# lowercase each genre and remove spaces (including the leading space
# left over from the split) so that each genre is a single token

# assign the result back to the column: changes made to the `row` objects
# yielded by iterrows() are not guaranteed to be written back to df
df['Genre'] = df['Genre'].map(lambda genres: [x.lower().replace(' ', '') for x in genres])

In [18]:
df['Genre'].head()

0            [crime, drama]
1            [crime, drama]
2            [crime, drama]
3    [action, crime, drama]
4            [crime, drama]
Name: Genre, dtype: object

#### Transforming `Director`

In [19]:
df['Director'].head()

0          Frank Darabont
1    Francis Ford Coppola
2    Francis Ford Coppola
3       Christopher Nolan
4            Sidney Lumet
Name: Director, dtype: object

In [20]:
# split each director's full name on spaces into a list of words
df['Director'] = df['Director'].map(lambda x: x.split(' '))

In [21]:
df['Director'].head()

0           [Frank, Darabont]
1    [Francis, Ford, Coppola]
2    [Francis, Ford, Coppola]
3        [Christopher, Nolan]
4             [Sidney, Lumet]
Name: Director, dtype: object

In [22]:
# merge the words of each director's name into one lowercase token

# assign the result back to the column: changes made to the `row` objects
# yielded by iterrows() are not guaranteed to be written back to df
df['Director'] = df['Director'].map(lambda parts: ''.join(parts).lower())

In [23]:
df['Director'].head()

0         frankdarabont
1    francisfordcoppola
2    francisfordcoppola
3      christophernolan
4           sidneylumet
Name: Director, dtype: object

In [24]:
# Finally
df.head()

| | Title | Genre | Director | Actors | Plot |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | [crime, drama] | frankdarabont | [timrobbins, morganfreeman, bobgunton] | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | [crime, drama] | francisfordcoppola | [marlonbrando, alpacino, jamescaan] | The aging patriarch of an organized crime dyna... |
| 2 | The Godfather: Part II | [crime, drama] | francisfordcoppola | [alpacino, robertduvall, dianekeaton] | The early life and career of Vito Corleone in ... |
| 3 | The Dark Knight | [action, crime, drama] | christophernolan | [christianbale, heathledger, aaroneckhart] | When the menace known as the Joker emerges fro... |
| 4 | 12 Angry Men | [crime, drama] | sidneylumet | [martinbalsam, johnfiedler, leej.cobb] | A jury holdout attempts to prevent a miscarria... |


### Step 2b: Data pre-processing on plot

Extracting the key words from the plot description.
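
The idea behind this kind of keyword extraction can be sketched without any library. The code below is a simplified stand-in, not the `rake_nltk` implementation: the tiny `STOPWORDS` set and the helper names `candidate_phrases` and `word_degrees` are assumptions made for illustration. It splits the text into runs of words between stopwords and punctuation, then gives each word a degree equal to the total length of the phrases it appears in.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); rake_nltk uses NLTK's English list
STOPWORDS = {'a', 'an', 'the', 'of', 'over', 'in', 'and', 'to', 'by', 'as', 'from', 'on', 'with'}

def candidate_phrases(text):
    """Split text into runs of consecutive words that are not stopwords."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9']+|[.,;:!?]", text.lower()):
        # A stopword or a punctuation mark ends the current phrase
        if token in STOPWORDS or not token[0].isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases

def word_degrees(phrases):
    """Degree of a word = total length of the phrases it occurs in."""
    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

plot = 'Two imprisoned men bond over a number of years.'
phrases = candidate_phrases(plot)
print(phrases)                # [['two', 'imprisoned', 'men', 'bond'], ['number'], ['years']]
print(word_degrees(phrases))  # two/imprisoned/men/bond -> 4, number/years -> 1
```

In the notebook only the dictionary keys (the words) are kept; the scores themselves are discarded.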

In [25]:
def extract_key_words(plot):
    # instantiating a Rake object
    # by default it uses the English stopwords from NLTK (Natural Language Toolkit)
    # and discards all punctuation characters
    r = Rake()

    # extracting the key words from the plot text
    r.extract_keywords_from_text(plot)

    # getting the dictionary of key words and their scores, keeping only the words
    return list(r.get_word_degrees().keys())

# creating the new column by applying the extraction to every plot;
# assigning to `row` inside iterrows() is not guaranteed to update df
df['Key_words'] = df['Plot'].map(extract_key_words)

In [26]:
df['Key_words'].head()

0    [two, imprisoned, men, bond, number, years, fi...
1    [aging, patriarch, organized, crime, dynasty, ...
2    [early, life, career, vito, corleone, 1920s, n...
3    [menace, known, joker, emerges, mysterious, pa...
4    [jury, holdout, attempts, prevent, miscarriage...
Name: Key_words, dtype: object

In [27]:
# dropping the Plot column
df.drop(columns = ['Plot'], inplace = True)
# if this raises an error (older pandas versions), use df.drop('Plot', axis=1, inplace=True)

In [28]:
# check all the columns now
df.head()

| | Title | Genre | Director | Actors | Key_words |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | [crime, drama] | frankdarabont | [timrobbins, morganfreeman, bobgunton] | [two, imprisoned, men, bond, number, years, fi... |
| 1 | The Godfather | [crime, drama] | francisfordcoppola | [marlonbrando, alpacino, jamescaan] | [aging, patriarch, organized, crime, dynasty, ... |
| 2 | The Godfather: Part II | [crime, drama] | francisfordcoppola | [alpacino, robertduvall, dianekeaton] | [early, life, career, vito, corleone, 1920s, n... |
| 3 | The Dark Knight | [action, crime, drama] | christophernolan | [christianbale, heathledger, aaroneckhart] | [menace, known, joker, emerges, mysterious, pa... |
| 4 | 12 Angry Men | [crime, drama] | sidneylumet | [martinbalsam, johnfiedler, leej.cobb] | [jury, holdout, attempts, prevent, miscarriage... |


### Step 3: Create word representation via a bag of words 

Combine the values from every `df` column except `Title` into a single string of words per movie.
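
As a rough illustration of what one bag of words looks like, the sketch below builds it for a single movie held in a plain dictionary. The `row` values are taken from the first row of the table above, with the key words cut to four for brevity; the helper `build_bag` is a hypothetical name used only here.

```python
# One movie in the shape produced by Step 2 (key words truncated for brevity)
row = {
    'Title': 'The Shawshank Redemption',
    'Genre': ['crime', 'drama'],
    'Director': 'frankdarabont',
    'Actors': ['timrobbins', 'morganfreeman', 'bobgunton'],
    'Key_words': ['two', 'imprisoned', 'men', 'bond'],
}

def build_bag(row, skip=('Title',)):
    """Join every value except the skipped columns into one space-separated string."""
    parts = []
    for column, value in row.items():
        if column in skip:
            continue
        if isinstance(value, str):
            parts.append(value)   # a single token, e.g. the director
        else:
            parts.extend(value)   # a list of tokens
    return ' '.join(parts)

bag = build_bag(row)
print(bag)
# crime drama frankdarabont timrobbins morganfreeman bobgunton two imprisoned men bond
```

The title is left out because it identifies the movie rather than describing its content.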

In [29]:
# Title is omitted from the bag of words
columns = df.columns[1:]

def build_bag_of_words(row):
    words = ''
    for col in columns:
        if col != 'Director':
            # convert the list into a string of words separated by spaces
            words = words + ' '.join(row[col]) + ' '
        else:
            # Director is already a single string
            words = words + row[col] + ' '
    return words

# build the new column row by row; assigning to `row` inside iterrows()
# is not guaranteed to update df
df['bag_of_words'] = df.apply(build_bag_of_words, axis=1)

# let's keep only the title and the bag of words in the dataframe
df = df[['Title', 'bag_of_words']]

In [30]:
df.head()

| | Title | bag_of_words |
|---|---|---|
| 0 | The Shawshank Redemption | crime drama frankdarabont timrobbins morganfre... |
| 1 | The Godfather | crime drama francisfordcoppola marlonbrando al... |
| 2 | The Godfather: Part II | crime drama francisfordcoppola alpacino robert... |
| 3 | The Dark Knight | action crime drama christophernolan christianb... |
| 4 | 12 Angry Men | crime drama sidneylumet martinbalsam johnfiedl... |


### Step 4: Create cosine similarity matrix 

In [31]:
# instantiating and generating the count matrix using CountVectorizer
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
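
What a count matrix contains can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function `count_matrix_sketch` and the two shortened bags are assumptions. The regular expression mirrors the documented default `token_pattern` of `CountVectorizer`, which keeps tokens of two or more word characters and therefore splits a token such as `leej.cobb` at the dot.

```python
import re

# Mirrors CountVectorizer's default token_pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_matrix_sketch(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(tokenized):
        for tok in doc:
            matrix[i][column[tok]] += 1
    return vocabulary, matrix

# Two shortened, illustrative bags of words
bags = [
    'crime drama francisfordcoppola alpacino',
    'crime drama sidneylumet leej.cobb',
]
vocabulary, matrix = count_matrix_sketch(bags)
print(vocabulary)
# ['alpacino', 'cobb', 'crime', 'drama', 'francisfordcoppola', 'leej', 'sidneylumet']
print(matrix)
# [[1, 0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 1, 1]]
```

Each row is one movie and each column one vocabulary word, so two movies are similar to the extent that their rows have non-zero counts in the same columns.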

In [32]:
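# generating the cosine similarity matrix

# entry [i, j] is the similarity between movie i and movie j; the diagonal is 1
# because each movie is compared with itself, and a higher value means more shared words,
# e.g. 0.367 between The Godfather (row 1) and The Godfather: Part II (row 2)
cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)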

[[1.         0.15789474 0.13764944 ... 0.05263158 0.05263158 0.05564149]
 [0.15789474 1.         0.36706517 ... 0.05263158 0.05263158 0.05564149]
 [0.13764944 0.36706517 1.         ... 0.04588315 0.04588315 0.04850713]
 ...
 [0.05263158 0.05263158 0.04588315 ... 1.         0.05263158 0.05564149]
 [0.05263158 0.05263158 0.04588315 ... 0.05263158 1.         0.05564149]
 [0.05564149 0.05564149 0.04850713 ... 0.05564149 0.05564149 1.        ]]
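
To make the numbers in this matrix less opaque, here is a small hand-rolled version of cosine similarity, the dot product of two vectors divided by the product of their lengths. The function `cosine` and the two seven-element count vectors are illustrative assumptions and are not taken from the dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty bag of words is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical word counts for two movies over a seven-word vocabulary
movie_a = [1, 0, 1, 1, 1, 0, 0]
movie_b = [0, 1, 1, 1, 0, 1, 1]

print(round(cosine(movie_a, movie_b), 4))  # 0.4472 (2 shared words / (2 * sqrt(5)))
print(cosine(movie_a, movie_a))            # 1.0
```

Because word counts are never negative, the values always fall between 0 (no words in common) and 1 (identical word proportions).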


### Step 5: Create and Run the model (recommender)

In [33]:
# creating a Series for the movie titles and corresponding indexes
indices = pd.Series(df['Title'])
indices[:5]

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
3             The Dark Knight
4                12 Angry Men
Name: Title, dtype: object

In [34]:
# function that takes a movie title as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    
    recommended_movies = []
    
    # getting the index of the movie with the given title
    # (raises an IndexError if the title is not in the dataset)
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies:
    # position 0 is the movie itself (similarity 1), so take positions 1 to 10
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the 10 best matching movies
    for i in top_10_indexes:
        recommended_movies.append(df['Title'].iloc[i])
        
    return recommended_movies

In [35]:
recommendations('The Dark Knight')

['The Dark Knight Rises',
 'Batman Begins',
 'The Prestige',
 'The Green Mile',
 'Witness for the Prosecution',
 'Out of the Past',
 'Rush',
 'The Godfather',
 'V for Vendetta',
 'Reservoir Dogs']
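
The ranking step can also be written without pandas. The sketch below is one possible variant under stated assumptions: the function name `top_n_similar`, the four placeholder titles and the small similarity matrix are all made up for illustration. It excludes the query movie by its index rather than by assuming it is sorted first, and it reports unknown titles explicitly.

```python
def top_n_similar(title, titles, similarity, n=10):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in titles:
        raise KeyError(f'Unknown title: {title!r}')
    idx = titles.index(title)
    # Rank every other movie by its similarity to the query movie
    ranked = sorted(
        (j for j in range(len(titles)) if j != idx),
        key=lambda j: similarity[idx][j],
        reverse=True,
    )
    return [titles[j] for j in ranked[:n]]

# Placeholder titles and a made-up symmetric similarity matrix
titles = ['A', 'B', 'C', 'D']
similarity = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
]
print(top_n_similar('A', titles, similarity, n=2))  # ['C', 'B']
```

Excluding by index matters when another movie has a similarity of exactly 1 to the query: after sorting, that movie could take position 0 and be dropped in place of the query itself.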

<img align="left" src='https://drive.google.com/uc?export=view&id=1gvji0A564aEIZRmSzm4J2H5euNV3kY4Q' style="width: 200px; height: 150px;">

### You have built your first recommender!!!