# Content-Based Movie Recommendation Engine

What movie should I watch next? Will I really have to scour through Reddit's Top 250 and IMDB and various other lists just to stumble across the name of the next fated film? 

No, I won't! Because today we will build a simple movie recommendation engine that uses content-based filtering to provide a list of movies similar to the one we select.

In [None]:
! pip install rake_nltk

import pandas as pd
import numpy as np
from rake_nltk import Rake

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

from ast import literal_eval

## Data Preprocessing
### Loading Data

In [None]:
movies = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')
credits = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')

# Rename the credits columns so both tables share the 'id' key, then join datasets
credits.columns = ['id', 'title', 'cast', 'crew']

alldata = movies.merge(credits, on = 'id')
alldata.head()

### Cleaning Features

Our content-based filtering system will not be using all of these columns, so I will cut the dataset down to only include the relevant features. Then we can clean up the feature contents.

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Trim dataset to include relevant features (copy so later column assignments modify df, not a view of alldata)
df = alldata[['id', 'original_title', 'genres', 'keywords', 'overview', 'original_language', 'cast', 'crew']].copy()

# Parse stringed list features into python objects
features = ['keywords', 'genres', 'cast', 'crew']
for i in features:
    df[i] = alldata[i].apply(literal_eval)
    
# Extract list of genres
def list_genres(x):
    l = [d['name'] for d in x]
    return(l)
df['genres'] = df['genres'].apply(list_genres)

# Extract top 3 cast members
def list_cast(x):
    l = [d['name'] for d in x]
    if len(l) > 3:
        l = l[:3]
    return(l)
df['cast'] = df['cast'].apply(list_cast)

# Extract top 5 keywords
def list_keywords(x):
    l = [d['name'] for d in x]
    if len(l) > 5:
        l = l[:5]
    return(l)
df['keywords'] = df['keywords'].apply(list_keywords)

# Extract director
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
df['director'] = df['crew'].apply(get_director)

# Drop the now unnecessary crew feature
df = df.drop('crew', axis = 1)

# Clean features of spaces and lowercase all to ensure uniques
def clean_feat(x):
    if isinstance(x, list):
        return [i.lower().replace(" ","") for i in x]
    else:
        if isinstance(x, str):
            return x.lower().replace(" ", "")
        else:
            return ''

features = ['keywords', 'genres', 'cast', 'director']
for i in features:
    df[i] = df[i].apply(clean_feat)

In [None]:
df.head()

Now we have several features holding lists of keywords that are all lowercase and stripped of spaces, so each name or phrase is treated as a single unique keyword.
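
To make the parsing and cleaning steps concrete, here is a minimal sketch on a single made-up value shaped like the raw genres column (the string below is an assumption for illustration, not a row from the dataset). Stripping spaces matters because it keeps a multi-word name as one token: "sciencefiction" stays a single keyword instead of splitting into "science" and "fiction".

```python
from ast import literal_eval

# Made-up raw value in the same stringified-list shape as the genres column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

# literal_eval turns the string into a real list of dicts
parsed = literal_eval(raw_genres)

# Keep only the names, then lowercase them and strip spaces
names = [d['name'] for d in parsed]
cleaned = [name.lower().replace(" ", "") for name in names]

print(names)    # ['Action', 'Science Fiction']
print(cleaned)  # ['action', 'sciencefiction']
```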

### Missing Values
Let's check for missing values, since they could cause problems when we extract keywords from the overview.

In [None]:
missing = df.columns[df.isnull().any()]
df[missing].isnull().sum().to_frame()

In [None]:
# Replace NaN from overview with an empty string
df['overview'] = df['overview'].fillna('')

### Creating bag of keywords

We will use genres, keywords, overview, cast, and director to create a bag of words column.

Let's use Rake from the rake_nltk package (which builds on NLTK) to extract keywords from the overview feature, which is a summary of the plot. We'll put those keywords into a new column: plotwords.

In [None]:
# Initialize empty column
df['plotwords'] = ''

# function to get keywords from a text
def get_keywords(x):
    plot = x
    
    # initialize Rake using english stopwords from NLTK, and all punctuation characters
    rake = Rake()
    
    # extract keywords from text
    rake.extract_keywords_from_text(plot)
    
    # get dictionary mapping each word to its degree score
    scores = rake.get_word_degrees()
    
    # return new keywords as list, ignoring scores
    return(list(scores.keys()))

# Apply function to generate keywords
df['plotwords'] = df['overview'].apply(get_keywords)

Now that we have our plot keywords, let's combine our cleaned features with them to create a bag of words. We'll make a new dataframe.

In [None]:
df_keys = pd.DataFrame() 

df_keys['title'] = df['original_title']
df_keys['keywords'] = ''

def bag_words(x):
    # director is a single string, so append it as-is; ' '.join on a string would split it into characters
    return(' '.join(x['genres']) + ' ' + ' '.join(x['keywords']) + ' ' +  ' '.join(x['cast']) + 
           ' ' + x['director'] + ' ' + ' '.join(x['plotwords']))
df_keys['keywords'] = df.apply(bag_words, axis = 1)

df_keys.head()
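
Here is a small sketch of how one row's bag of words is assembled, using made-up feature values (the row below is hypothetical). Because the director is a single string while the other features are lists, the sketch appends it directly; calling `' '.join` on a string would put a space between every character, and those single letters would not survive tokenization later.

```python
# Hypothetical cleaned row, shaped like one row of df
row = {
    'genres': ['action', 'sciencefiction'],
    'keywords': ['spacewar', 'alien'],
    'cast': ['samworthington', 'zoesaldana'],
    'director': 'jamescameron',   # a single string, not a list
    'plotwords': ['marine', 'moon'],
}

# Join each list feature with spaces; keep the director string whole
parts = [
    ' '.join(row['genres']),
    ' '.join(row['keywords']),
    ' '.join(row['cast']),
    row['director'],
    ' '.join(row['plotwords']),
]
bag = ' '.join(parts)

# What happens if a string is joined as though it were a list
split_director = ' '.join(row['director'])

print(bag)
print(split_director)  # j a m e s c a m e r o n
```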

## Creating Model

We will use CountVectorizer from scikit-learn to convert the keywords into a matrix of token counts, giving the frequency of each word in each movie's bag of words.

In [None]:
# create count matrix
cv = CountVectorizer()
cv_mx = cv.fit_transform(df_keys['keywords'])
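
To picture what the count matrix holds, here is a rough pure-Python stand-in on two made-up bags of words. It is not CountVectorizer itself, only a sketch of the idea: it lowercases the text and uses the same default token pattern (words of two or more characters), then counts each vocabulary word per document. The vocabulary is sorted here for readability.

```python
import re
from collections import Counter

# Two made-up bags of words
docs = [
    'action alien jamescameron alien',
    'comedy alien timallen',
]

# Tokens are runs of 2+ word characters, as in CountVectorizer's default pattern
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: every distinct token across all documents
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary word
count_matrix = [[Counter(toks)[word] for word in vocab] for toks in tokenized]

print(vocab)
for row in count_matrix:
    print(row)
```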

Now we apply the cosine_similarity function to find the similarity between every pair of movies. A brief overview drawn from [Machine Learning Plus](https://www.machinelearningplus.com/nlp/cosine-similarity/):

> Cosine similarity is a metric used to determine how similar the documents are irrespective of their size. 
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors I am talking about are arrays containing the word counts of two documents. 

>When plotted on a multi-dimensional space, where each dimension corresponds to a word in the document, the cosine similarity captures the orientation (the angle) of the documents and not the magnitude. If you want the magnitude, compute the Euclidean distance instead. The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance because of the size (like, the word ‘cricket’ appeared 50 times in one document and 10 times in another) they could still have a smaller angle between them. Smaller the angle, higher the similarity.

In [None]:
# create cosine similarity matrix
cosine_sim = cosine_similarity(cv_mx, cv_mx)
cosine_sim
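
To see why the angle matters more than the raw counts, here is a hand-rolled sketch comparing cosine similarity with Euclidean distance on three made-up count vectors (the numbers and the two-word vocabulary are assumptions for illustration). The long and short documents use the words in the same proportion, so their cosine similarity is essentially 1 even though they are far apart by Euclidean distance.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    # Straight-line distance between the two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Word counts for a two-word vocabulary: ['cricket', 'bat']
long_doc = [50, 5]    # long document, mostly about cricket
short_doc = [10, 1]   # short document with the same proportions
other_doc = [1, 10]   # short document with the opposite proportions

print(cosine(long_doc, short_doc), euclidean(long_doc, short_doc))
print(cosine(short_doc, other_doc), euclidean(short_doc, other_doc))
```

By Euclidean distance the short document looks closer to the unrelated one, while cosine similarity correctly pairs it with the long document.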

In [None]:
# create list of indices for later matching
indices = pd.Series(df_keys.index, index = df_keys['title'])

## Recommendation
Now we will write the actual recommendation function.

In [None]:
def recommend_movie(title, n = 10, cosine_sim = cosine_sim):
    # retrieve matching movie title index
    if title not in indices.index:
        print("Movie not in database.")
        return
    idx = indices[title]
    
    # if several movies share this title, use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    
    # cosine similarity scores of movies in descending order
    scores = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    
    # top n most similar movies indexes
    # use 1:n+1 because position 0 is the same movie entered
    top_n_idx = list(scores.iloc[1:n + 1].index)
        
    return df_keys['title'].iloc[top_n_idx]
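
The ranking step is easier to follow on a toy example. The sketch below uses a made-up 4x4 similarity matrix and placeholder titles, and a hypothetical helper `top_n`. One design difference, which is my assumption rather than what the function above does: it removes the query movie by its index instead of relying on it sorting into first place, which also holds up if another movie happens to tie with a similarity of 1.

```python
# Toy similarity matrix for 4 movies (symmetric, 1.0 on the diagonal)
titles = ['A', 'B', 'C', 'D']
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

def top_n(title, n):
    idx = titles.index(title)
    # Every movie except the query itself
    others = [i for i in range(len(titles)) if i != idx]
    # Rank by similarity to the query, highest first
    ranked = sorted(others, key=lambda i: sim[idx][i], reverse=True)
    return [titles[i] for i in ranked[:n]]

print(top_n('A', 2))  # ['C', 'D']
print(top_n('B', 3))  # ['D', 'A', 'C']
```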

### Testing our Recommendation Engine!

In [None]:
recommend_movie('Toy Story', n = 5)

In [None]:
recommend_movie('The Avengers')

In [None]:
recommend_movie('The Hobbit: An Unexpected Journey')

In [None]:
recommend_movie('Ocean\'s Eleven', n = 7)

## Remarks

In the future, I might join this project with the larger movies dataset on Kaggle. That way I can use release date as a keyword as well. In that vein, I could add a function that limits recommendations to movies released within a set number of years before or after the selected movie.

It'd also be good to explore whether I can weight the director keyword more heavily than other keywords.
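
One possible way to do that with count-based vectors, which this notebook does not do yet, is to repeat the director token in the bag of words so it counts more than once. The helper name `weighted_bag` and the default weight of 3 are hypothetical choices for this sketch.

```python
from collections import Counter

def weighted_bag(tokens, director, director_weight=3):
    # Repeat the director token so it appears director_weight times in the bag
    return ' '.join(tokens + [director] * director_weight)

bag = weighted_bag(['action', 'alien'], 'jamescameron')
counts = Counter(bag.split())

print(bag)                     # action alien jamescameron jamescameron jamescameron
print(counts['jamescameron'])  # 3
```

Two movies sharing a director would then overlap more strongly in the count matrix than two movies sharing a single genre or keyword.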

Thank you for reading! I hope you enjoyed it.