# %% [markdown]
# # Recommending Movies On MovieLens
# In this notebook we focus on content-based filtering, using movie features such as the overview, the crew and keywords.

# %% [markdown]
# ## Importing Libraries

# %% [code]
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the data directory ('/demo' here)

import os
for dirname, _, filenames in os.walk('/demo'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# %% [markdown]
# ## Inspecting Datasets
# Five datasets are available. We focus on three of them: the movie metadata (the features of each movie) plus the keywords and credits tables, which provide additional features.

# %% [code]
ratings = pd.read_csv("ratings.csv")
links = pd.read_csv("links.csv")
movies_meta = pd.read_csv("movies_metadata.csv")
keywords = pd.read_csv("keywords.csv")
credits = pd.read_csv("credits.csv")

# %% [markdown]
# ### Ratings
# Contains the ratings that users gave to movies, each with a timestamp.

# %% [code]
ratings.head(5)

# %% [markdown]
# ### Links
# Maps each movie to its IDs on both IMDB and TMDB.

# %% [code]
links.head(5)

# %% [markdown]
# ### Movie Metadata
# The main dataset, with features such as budget, genres, original language, overview and runtime.

# %% [code]
movies_meta.head(5)

# %% [markdown]
# ### Credits
# An additional feature table with information about the cast and crew of each movie. It can be combined with the movie metadata, as we will see later.

# %% [code]
credits.head(5)

# %% [markdown]
# ### Keywords
# An additional feature table listing the keywords attached to each movie. It can also be combined with the movie metadata.

# %% [code]
keywords.head(5)

# %% [markdown]
# ## Data Preprocessing
# In this part we preprocess the data to gain some insight into it and to make sure the results make sense.

# %% [markdown]
# ### Limiting the Dataset to Popular Movies
# Here we keep only the movies whose vote count is at or above the 90th percentile, that is, roughly the top 10% most-voted movies, so that averages based on a handful of votes do not distort the ranking. For example, a movie with an average rating of 7 from 50 votes is more trustworthy than another movie with the same average rating from only 3 votes.

# %% [code]
# Calculate mean of vote average column
C = movies_meta['vote_average'].mean()
print(C)

# %% [code]
# Calculate the minimum number of votes required to be in the chart, m
m = movies_meta['vote_count'].quantile(0.90)
print(m)

# %% [code]
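# Copy the filtered rows so that adding a 'score' column later does not write to a view of movies_meta
popular_movies = movies_meta.loc[movies_meta['vote_count'] >= m, :].copy()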
popular_movies.shape

# %% [markdown]
# ### Weighted Average Rating for Each Movie
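# A common way to normalize ratings is the IMDB weighted rating, which blends each movie's own average rating `R` with the mean rating of the whole dataset `C`, according to the movie's vote count `v`: `WR = v/(v+m) * R + m/(v+m) * C`. Here `m` is the vote-count cutoff computed above (about 160 votes), so only movies with at least that many votes are scored.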

# %% [code]
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# %% [code]
popular_movies['score'] = popular_movies.apply(weighted_rating, axis = 'columns')
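
# %% [markdown]
# To see what the formula does, here is a minimal standalone sketch with made-up numbers (`m_toy` and `C_toy` are assumptions, not values computed from the dataset, and `weighted_rating_scalar` is a hypothetical helper). A rating backed by few votes is pulled toward the global mean, while one backed by many votes stays close to its own average. Movies with so few votes were already filtered out above; the small count is used only to make the effect visible.

# %% [code]
```python
def weighted_rating_scalar(v, R, m, C):
    """IMDB-style weighted rating for a single movie.

    v: number of votes, R: the movie's average rating,
    m: vote-count cutoff, C: mean rating across all movies.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Toy values, chosen only for illustration
m_toy, C_toy = 160, 5.6

# Same average rating (7.0), very different vote counts
few_votes = weighted_rating_scalar(v=3, R=7.0, m=m_toy, C=C_toy)
many_votes = weighted_rating_scalar(v=5000, R=7.0, m=m_toy, C=C_toy)

print(round(few_votes, 3), round(many_votes, 3))
```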

# %% [markdown]
# ### Top Rated Movies
# Sorting by the weighted score shows which movies are both popular and highly rated. The resulting list makes much more sense than one in which movies with 5-10 votes are treated the same as movies with 160 or more.

# %% [code]
popular_movies_srt = popular_movies.sort_values('score', ascending = False)
popular_movies_srt[['title', 'vote_count', 'vote_average', 'score']].head(5)

# %% [markdown]
# ### Overview Based Recommendations
# First we look at how to derive features from a single text attribute, the overview, which is a short summary of the movie's plot and may include some teaser words.

# %% [code]
metadata = popular_movies.copy().reset_index(drop=True)

# %% [code]
metadata['overview'].head(5)

# %% [markdown]
# #### TF-IDF Vectorization
# Next we convert the overview text into numbers so that it can be processed. TF-IDF stands for Term Frequency-Inverse Document Frequency: each overview becomes a vector of term frequencies, and the IDF factor down-weights words that occur in many overviews so that the most common words do not dominate the representation.

# %% [code]
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape
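
# %% [markdown]
# The weighting idea can be sketched by hand on a few made-up overviews. This sketch uses the textbook `idf = log(N / df)` and is an assumption-laden simplification: scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. `toy_docs` and `toy_tfidf` are hypothetical names.

# %% [code]
```python
import math
from collections import Counter

# Made-up overviews, for illustration only
toy_docs = [
    "a hero saves the city",
    "a hero loses the city",
    "a chef opens a restaurant",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many overviews each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def toy_tfidf(tokens):
    """Return {term: tf * idf} with the unsmoothed idf = log(N / df)."""
    term_freq = Counter(tokens)
    return {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}


weights = toy_tfidf(tokenized[0])

# Terms that appear in every overview get weight 0; rarer terms get more weight
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>6}: {weight:.3f}")
```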

# %% [markdown]
# ### Measuring Similarity Between Content
# Next we measure the pairwise similarity between the TF-IDF vectors of the movies and store the values in a matrix. We will use this matrix to recommend movies whose overviews are similar to that of a given movie.

# %% [code]
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

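# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product computed by linear_kernel equals the cosine similarity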
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# %% [code]
cosine_sim.shape
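
# %% [markdown]
# The reason a dot product can stand in for cosine similarity is that the two coincide once the vectors have unit length. A small standalone check on made-up vectors (all names here are hypothetical helpers, not part of scikit-learn):

# %% [code]
```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


def cosine(u, v):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy vectors standing in for two TF-IDF rows before normalization
vec_a, vec_b = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
unit_a, unit_b = l2_normalize(vec_a), l2_normalize(vec_b)

# Both values agree up to floating-point error
print(dot(unit_a, unit_b), cosine(vec_a, vec_b))
```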

# %% [code]
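# Reverse map from title to row position; deduplicate on the title index
# (Series.drop_duplicates() compares values, which are all unique here)
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]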

# %% [code]
indices[:10]

# %% [markdown]
# ### Top Similar Movies Given a Movie
# Now we can look up the movies most similar to a given movie based on its content (in this case, the overview).

# %% [code]
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

# %% [code]
get_recommendations('The Dark Knight Rises')

# %% [code]
get_recommendations('The Godfather')
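
# %% [markdown]
# The ranking step can be illustrated on a tiny made-up similarity matrix. This sketch is one possible variation, not the function above: it excludes the query movie by position instead of assuming that it sorts first, which also holds when another movie ties with a similarity of 1. `toy_titles`, `toy_sim` and `top_similar` are hypothetical names.

# %% [code]
```python
# Made-up titles and a symmetric similarity matrix, for illustration only
toy_titles = ["A", "B", "C", "D"]
toy_sim = [
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]


def top_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding `title` itself."""
    idx = toy_titles.index(title)
    # Pair every other movie's position with its similarity to the query
    candidates = [(j, score) for j, score in enumerate(toy_sim[idx]) if j != idx]
    # Highest similarity first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [toy_titles[j] for j, _ in candidates[:k]]


print(top_similar("A"))
```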

# %% [markdown]
# ### Using Additional Features (Crew and Keywords)
# In this section we use the additional tables introduced earlier (credits and keywords) and repeat what we did with the overview, this time using a bag-of-words representation to evaluate similarity.

# %% [code]
# Remove rows with bad IDs.
# metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

# %% [code]
metadata.head(5)

# %% [code]
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
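
# %% [markdown]
# As a standalone illustration of what the parsing step does, here is a made-up cell value in the same general shape as the `crew` column (a string that looks like a list of dicts). The names and the `raw_crew` variable are assumptions for this sketch. `literal_eval` only evaluates Python literals, so it does not execute arbitrary code the way `eval` would.

# %% [code]
```python
from ast import literal_eval

# A made-up stringified value; in the CSV this arrives as plain text
raw_crew = "[{'job': 'Director', 'name': 'Jane Doe'}, {'job': 'Editor', 'name': 'John Roe'}]"

# Before parsing it is a str; after parsing it is a list of dicts
parsed_crew = literal_eval(raw_crew)

# Pick the first crew member whose job is 'Director', or None if there is none
toy_director = next((p["name"] for p in parsed_crew if p["job"] == "Director"), None)

print(type(raw_crew).__name__, type(parsed_crew).__name__, toy_director)
```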

# %% [markdown]
# #### Director's Name
# The director's name is an important factor in recommending movies: many directors have a distinctive style that recurs, with variations, across their movies. Quentin Tarantino's movies, for example, are known for graphic violence, while Alfred Hitchcock's are centered more on suspense and horror.

# %% [code]
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# %% [markdown]
# #### Text Extraction from Additional Features
# Movies differ widely in how many cast members, keywords and genres they list, which makes the data sparse. To keep the features consistent, we take at most the first 3 entries of each movie's cast, keywords and genres. Alternatives include dimensionality reduction techniques such as Principal Component Analysis, which would reduce the vectorized features to n_components dimensions.

# %% [code]
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# %% [code]
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

# %% [code]
# Print the new features of the first 5 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

# %% [markdown]
# #### Text Cleaning

# %% [code]
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# %% [code]
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

# %% [markdown]
# #### Feature Aggregation
# Here we join the cleaned features into a single text document per movie (a "soup", similar to the overview but larger) so that the text can be vectorized as a whole.

# %% [code]
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# %% [code]
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

# %% [code]
metadata[['soup']].head(5)
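
# %% [markdown]
# The reason for stripping spaces from names becomes visible on two made-up rows: once each name is a single token, two different people who share a first name no longer count as a match, while the shared director still does. The rows, names and helper functions below are hypothetical.

# %% [code]
```python
def clean_names(names):
    """Lower-case each name and strip spaces so it becomes a single token."""
    return [name.lower().replace(" ", "") for name in names]


def toy_soup(row):
    """Join the cleaned cast and director of a row into one string."""
    return " ".join(clean_names(row["cast"]) + clean_names([row["director"]]))


# Two made-up movies: different lead actors, same director
row_a = {"cast": ["Anna Smith"], "director": "Lee Park"}
row_b = {"cast": ["Anna Jones"], "director": "Lee Park"}

tokens_a = set(toy_soup(row_a).split())
tokens_b = set(toy_soup(row_b).split())

# Only the director token is shared; 'anna' alone never becomes a token
shared_tokens = tokens_a & tokens_b
print(toy_soup(row_a), "|", toy_soup(row_b), "|", shared_tokens)
```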

# %% [markdown]
# #### Count Vectorization

# %% [markdown]
# Here we use a different text representation to derive features from the keywords combined with the cast, director and genres. We use raw counts instead of TF-IDF because IDF would down-weight a director or actor who appears in many movies, and that shared presence is exactly the signal we want to stand out.

# %% [code]
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

# %% [code]
count_matrix.shape

# %% [markdown]
# ### Measuring Similarity Between Content
# As before, we measure the pairwise similarity between the extracted features of the movies and store the values in a matrix. We will use this matrix to recommend movies that are similar to a given movie in terms of cast, director, keywords and genres.

# %% [code]
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
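
# %% [markdown]
# For intuition, cosine similarity over raw counts can be computed by hand on a few made-up soups. This is a simplified sketch: it splits on whitespace and does not remove stop words, unlike the `CountVectorizer` configured above. `count_cosine` and the soups are hypothetical.

# %% [code]
```python
import math
from collections import Counter


def count_cosine(soup_a, soup_b):
    """Cosine similarity between the raw token-count vectors of two soups."""
    counts_a, counts_b = Counter(soup_a.split()), Counter(soup_b.split())
    # Only tokens present in both soups contribute to the dot product
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up soups: the first two share an actor, a director and a genre
soup_1 = "heist annasmith leepark crime"
soup_2 = "robbery annasmith leepark crime"
soup_3 = "romance paris musical"

print(count_cosine(soup_1, soup_2), count_cosine(soup_1, soup_3))
```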

# %% [code]
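# Reset the index of the main DataFrame and rebuild the title-to-position map,
# keeping the first row for any duplicated title
metadata = metadata.reset_index(drop=True)
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]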

# %% [code]
get_recommendations('The Dark Knight Rises', cosine_sim2)
