<h3> Content Based Recommender System

This notebook builds two types of content-based movie recommendation systems:
1. A plot-based recommendation engine, which uses the summary (overview) of each movie.
2. A metadata-based recommendation engine, which uses metadata such as the cast, crew, keywords and genres of each movie.

In [4]:
import pandas as pd
import numpy as np

#Import data from the clean file 
df = pd.read_csv(r'C:\Users\sweth\OneDrive\Desktop\2nd_Semester\Machine Learning 1\Project\Dataset\the-movies-dataset\clean_data.csv')

#Print the head of the cleaned DataFrame
df.head()

Unnamed: 0,title,id,overview,popularity,genres,runtime,vote_average,vote_count,year
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,8844,When siblings Judy and Peter discover an encha...,17.015539,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...,11.7129,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom...",3.859495,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...,8.387519,['comedy'],106.0,5.7,173.0,1995


The overview column is vectorized with the TfidfVectorizer from scikit-learn. English stop words are removed, and NaN values are replaced with an empty string before fitting. The resulting TF-IDF matrix represents the overview of each of the 21,748 movies as a 48,083-dimensional vector.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(stop_words='english')

df['overview'] = df['overview'].fillna('')

tfidf_matrix = vector.fit_transform(df['overview'])

tfidf_matrix.shape
#tfidf_matrix

(21748, 48083)
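
To make the shape of the TF-IDF matrix more concrete, here is a minimal sketch on a toy corpus. The three overviews and the names toy_overviews, toy_vectorizer and toy_matrix are hypothetical and are not taken from the dataset; the last overview is left empty to mimic a missing value after fillna('').

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy overviews (assumption: not from the real dataset).
toy_overviews = [
    "A boy and his toys go on an adventure",
    "Two siblings discover a magical board game",
    "",  # stands in for a missing overview after fillna('')
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# One row per overview, one column per distinct term that survived stop-word removal.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# The empty overview is represented by an all-zero row.
print(toy_matrix.toarray()[2])
```

The number of columns equals the size of the learned vocabulary, which is how the real overviews end up with 48,083 dimensions.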

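Next, we compute the pairwise cosine similarity between all movies. The result is a 21,748 by 21,748 matrix in which the cell in row i and column j holds the similarity score between movies i and j. Because TfidfVectorizer L2-normalizes every row by default, the dot product computed by linear_kernel is equal to the cosine similarity. The diagonal entries of this matrix are 1, since each of them compares a movie with itself (a movie with an empty overview has an all-zero vector and scores 0 instead).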

In [6]:
from sklearn.metrics.pairwise import linear_kernel

cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
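
As a quick check of the computation above, the following sketch compares linear_kernel with scikit-learn's cosine similarity function on a toy TF-IDF matrix. TfidfVectorizer uses norm='l2' by default, so every non-empty row has unit length and the two results should agree. The toy overviews are hypothetical, and the function is imported under the alias sk_cosine_similarity only to avoid clashing with the cosine_similarity variable used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Hypothetical toy overviews (assumption: not from the real dataset).
toy_overviews = [
    "A boy and his toys go on an adventure",
    "Two siblings discover a magical board game",
    "Toys come alive when the boy leaves the room",
]

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)

# Plain dot products between the (already L2-normalized) TF-IDF rows.
dot_products = linear_kernel(toy_tfidf, toy_tfidf)
# Cosine similarity, which normalizes the rows again before the dot product.
cosines = sk_cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(dot_products, cosines))
print(np.allclose(np.diag(dot_products), 1.0))
```

Skipping the extra normalization is the usual reason to prefer linear_kernel for TF-IDF input, since it gives the same scores with less work.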

Create a pandas Series that maps each movie title (the index) to the corresponding row index of the DataFrame (the value).

In [7]:
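indices = pd.Series(df.index, index=df['title'])
# Keep only the first row for each title so that a title lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]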
# indices
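
A title-to-index lookup of this kind only returns a single integer when the title is unique. The sketch below illustrates this on a hypothetical three-row frame (toy_df, toy_indices and unique_indices are illustrative names) and shows one possible way to keep the first occurrence of each title.

```python
import pandas as pd

# Hypothetical frame in which one title appears twice.
toy_df = pd.DataFrame({'title': ['Heat', 'Jumanji', 'Heat']})
toy_indices = pd.Series(toy_df.index, index=toy_df['title'])

# A repeated title maps to several rows, so the lookup returns a Series.
print(type(toy_indices['Heat']))
# A unique title maps to a single integer position.
print(toy_indices['Jumanji'])

# Dropping duplicated index labels keeps only the first row per title.
unique_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]
print(unique_indices['Heat'])
```

Keeping the first occurrence is only one choice; disambiguating titles by year would be another, at the cost of a more complicated lookup key.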

The content_recommender function performs the following steps:
1. Take the title of the movie as an argument and look up the index of the movie that matches this title.
2. Get the pairwise cosine similarity of all movies with that movie as a list of (index, score) tuples.
3. Sort this list of tuples in decreasing order of cosine similarity.
4. Keep the top 10 elements of this list after dropping the first one, since the movie most similar to any movie is the movie itself.
5. Return the titles corresponding to the indices of these movies.

In [8]:
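# Function that takes in a movie title as input and returns the 10 most similar movies
def content_recommender(title, cosine_similarity=cosine_similarity, df=df, indices=indices):
    # Look up the row index of the movie that matches the title
    idx = indices[title]
    # Pair every movie index with its similarity score to the given movie
    similarity_scores = list(enumerate(cosine_similarity[idx]))
    # Sort the movies by similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 10
    similarity_scores = similarity_scores[1:11]
    # Extract the movie indices from the (index, score) tuples
    movie_indices = [i[0] for i in similarity_scores]
    # Return the titles of the recommended movies
    return df['title'].iloc[movie_indices]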
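
The ranking steps can be traced by hand on a small example. The following is a sketch only, not the function used in this notebook: top_k_similar, toy_titles and toy_similarity are hypothetical names, the 4 by 4 similarity matrix is made up, and NumPy's argsort stands in for the sorted list of tuples.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a made-up symmetric similarity matrix.
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

def top_k_similar(idx, similarity, titles, k=2):
    # Order all movie indices by decreasing similarity to movie idx.
    order = np.argsort(-similarity[idx], kind='stable')
    # Remove the query movie by index instead of assuming it is ranked first.
    order = order[order != idx][:k]
    return titles.iloc[order]

print(top_k_similar(0, toy_similarity, toy_titles).tolist())  # ['C', 'B']
print(top_k_similar(3, toy_similarity, toy_titles).tolist())  # ['B', 'C']
```

Removing the query movie by index is a small design choice: it still behaves as intended when another movie ties with the query at the top of the ranking, for example when two movies share an identical overview.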

As the output below shows, the movies recommended for Jumanji by the plot-based recommender have similar plots, in which the protagonists play games and end up facing severe consequences.

In [9]:
content_recommender("Jumanji")

13912        Table No. 21
21707                Quiz
6723              Quintet
4789            Brainscan
17401        Turkey Shoot
20086           Beta Test
11754              DeVour
4703     Poolhall Junkies
17115              Pixels
16658             Standby
Name: title, dtype: object

<h3> Metadata Based Recommender

The difference between the plot-based recommender and this one is the type of data used to produce the recommendations. To build this model, the following metadata will be used:
1. Genres
2. Directors
3. Stars (cast)
4. Keywords

In [10]:
# Load the keywords and credits files
credits = pd.read_csv(r'C:\Users\sweth\OneDrive\Desktop\2nd_Semester\Machine Learning 1\Project\Dataset\the-movies-dataset\credits.csv')
keywords = pd.read_csv(r'C:\Users\sweth\OneDrive\Desktop\2nd_Semester\Machine Learning 1\Project\Dataset\the-movies-dataset\keywords.csv')

In [11]:
#Print the head of the credit dataframe
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [12]:
#Print the head of the keywords dataframe
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [13]:
df['id'] = df['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')

df.head()

Unnamed: 0,title,id,overview,popularity,genres,runtime,vote_average,vote_count,year,cast,crew,keywords
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,8844,When siblings Judy and Peter discover an encha...,17.015539,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...,11.7129,"['romance', 'comedy']",101.0,6.5,92.0,1995,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom...",3.859495,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...,8.387519,['comedy'],106.0,5.7,173.0,1995,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [14]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)
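
The cast, crew, keywords and genres columns are read from CSV as strings that merely look like Python lists. The sketch below shows what literal_eval does to one such string; the value of raw_keywords is hypothetical and only imitates the format of the keywords column.

```python
from ast import literal_eval

# Hypothetical cell value in the same format as the keywords column.
raw_keywords = "[{'id': 1, 'name': 'board game'}, {'id': 2, 'name': 'jungle'}]"

parsed_keywords = literal_eval(raw_keywords)

# Before parsing the cell is a str; afterwards it is a list of dicts.
print(type(raw_keywords).__name__, type(parsed_keywords).__name__)
print([item['name'] for item in parsed_keywords])
```

literal_eval only accepts Python literals (strings, numbers, lists, dicts and so on), which makes it a safer choice than eval for data loaded from a file.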

In [15]:
df.iloc[0]['crew'][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

Extract the director's name from the crew list. The name is repeated twice in order to give the director more weight.


In [16]:
def director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return (crew_member['name'] + ' ' + crew_member['name'])

Define a new feature called director and display its values for the first 5 movies.

In [17]:
df['director'] = df['crew'].apply(director)
df['director'].head()

0        John Lasseter John Lasseter
1          Joe Johnston Joe Johnston
2        Howard Deutch Howard Deutch
3    Forest Whitaker Forest Whitaker
4        Charles Shyer Charles Shyer
Name: director, dtype: object
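
Repeating a name raises its weight because a bag-of-words vectorizer counts how often each token occurs. The sketch below illustrates this with CountVectorizer on two hypothetical strings. It also shows a caveat: the extra weight depends on the two copies staying separated by a space, so if a later cleanup step strips the space, the doubled name becomes one long token that is counted once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical soups: the same doubled name with and without the separating space.
doubled_with_space = "johnlasseter johnlasseter animation"
doubled_collapsed = "johnlasseterjohnlasseter animation"

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform([doubled_with_space, doubled_collapsed]).toarray()
vocab = toy_vectorizer.vocabulary_

# With the space, the director token is counted twice.
print(toy_counts[0, vocab['johnlasseter']])
# Without the space, there is a single, different token counted once.
print(toy_counts[1, vocab['johnlasseterjohnlasseter']])
print(toy_counts[1, vocab['johnlasseter']])
```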

The generate_list function returns a list containing the names of the first three elements, or of all elements if there are fewer than three.

In [18]:
def generate_list(x):
    if isinstance(x, list):
        # Slicing is safe for short lists, so no explicit length check is needed
        return [i['name'] for i in x][:3]
    # Return an empty list for missing or malformed data
    return []

In [19]:
df['cast'] = df['cast'].apply(generate_list)
df['keywords'] = df['keywords'].apply(generate_list)

In [20]:
df['genres'] = df['genres'].apply(lambda x: x[:3])
df[['title', 'cast', 'director', 'keywords', 'genres']].head()

The clean function lowercases each string and removes the spaces in it, so that a full name becomes a single token. This lets the vectorizer distinguish the Tom in Tom Hanks from the Tom in Tom Cruise.

In [22]:
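def clean(x):
    # Lists (cast, keywords, genres): clean every element
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    # Strings (director): clean the single value
    elif isinstance(x, str):
        return x.replace(" ", "").lower()
    # Anything else (e.g. a missing director) becomes an empty string
    else:
        return ''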

In [23]:
#Apply the clean function to cast, director, genres and keywords
for feature in ['cast', 'director', 'genres', 'keywords']:
    df[feature] = df[feature].apply(clean)
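
The effect of removing spaces can be seen by comparing the vocabularies that a vectorizer learns with and without the cleaning step. This is a minimal sketch; raw_names, joined_names, raw_vocab and joined_vocab are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_names = ["Tom Hanks", "Tom Cruise"]
# Same transformation as the cleaning step: drop spaces, then lowercase.
joined_names = [name.replace(" ", "").lower() for name in raw_names]

raw_vocab = set(CountVectorizer().fit(raw_names).vocabulary_)
joined_vocab = set(CountVectorizer().fit(joined_names).vocabulary_)

# Without cleaning, both actors share the token 'tom'.
print(sorted(raw_vocab))     # ['cruise', 'hanks', 'tom']
# With cleaning, each actor is a single distinct token.
print(sorted(joined_vocab))  # ['tomcruise', 'tomhanks']
```

Without the cleaning step, two movies would look partly similar just because their stars share a first name.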

Previously, only the overview was used to compute the similarity. Similarly, in this case we create a new feature that combines all the relevant metadata (keywords, cast, director and genres) into a single string, and we compute the similarity on that feature.

In [31]:
#Function that creates a soup out of the desired metadata
def whole_data(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [32]:
# Create the new soup feature called total
df['total'] = df.apply(whole_data, axis=1)

In [33]:
#Display the soup of the first movie
df.iloc[0]['total']

'jealousy toy boy tomhanks timallen donrickles johnlasseterjohnlasseter animation comedy family'

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['total'])
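
To see how the combined metadata strings translate into similarity scores, here is a small sketch on three made-up strings. All tokens in toy_soups are hypothetical placeholders, and the similarity function is imported under the alias sk_cosine_similarity only to avoid clashing with the cosine_similarity variable used earlier in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Hypothetical metadata strings (assumption: placeholder tokens, not real rows).
toy_soups = [
    "boardgame jungle actorone directorone adventure fantasy family",
    "jungle adventure family actortwo directortwo",
    "heist crime actorthree directorthree thriller",
]

toy_counts = CountVectorizer(stop_words='english').fit_transform(toy_soups)
toy_sim = sk_cosine_similarity(toy_counts)

# Movies 0 and 1 share three tokens, so their score is positive.
print(round(float(toy_sim[0, 1]), 3))
# Movies 0 and 2 share no tokens, so their score is zero.
print(float(toy_sim[0, 2]))
```

Every shared token (a keyword, an actor, a director or a genre) raises the score, so movies with overlapping metadata rank higher.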

In [35]:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity_2 = cosine_similarity(count_matrix, count_matrix)
df = df.reset_index()
indices2 = pd.Series(df.index, index=df['title'])

In [37]:
content_recommender("Jumanji", cosine_similarity_2, df, indices2)

10563                       Where the Wild Things Are
450                                    The Pagemaster
15348               Tinker Bell and the Lost Treasure
17233    Mostly Ghostly: Have You Met My Ghoulfriend?
9832                                    City of Ember
17240                 Zenon: Girl of the 21st Century
9335                                  The Water Horse
15980                                      Snow Queen
19782                           The Shamer's Daughter
1511                                     Return to Oz
Name: title, dtype: object