**Author**: Salma Elshahawy

**Date**: June 5, 2020

**Title**: DATA 612, Recommender system, project#1

**Github repo**: [Simple_recommender#2](https://github.com/salma71/MSDS_SU2020/blob/master/Recommender_system612/week_2/project_2_tmdb.ipynb)

## Introduction

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

In this notebook, I will demonstrate two different methods for building a recommender system.

* **Collaborative Filtering**: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

* **Content-Based Filtering**: This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.


In [None]:
import pandas as pd 
import numpy as np


In [None]:
# Load Movies credit
df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')

# Print the first five rows (the default for head())
df1.head()

The credits dataset contains the following features:
* **movie_id** - A unique identifier for each movie.
* **cast** - The name of lead and supporting actors.
* **crew** - The name of Director, Editor, Composer, Writer etc.

In [None]:
# Load Movies 
df2 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')

# Print the first five rows (the default for head())
df2.head()

The movies dataset has the following features:

* **budget** - The budget with which the movie was made.
* **genres** - The genres of the movie: Action, Comedy, Thriller, etc.
* **homepage** - A link to the homepage of the movie.
* **id** - The same identifier as movie_id in the first dataset.
* **keywords** - The keywords or tags related to the movie.
* **original_language** - The language in which the movie was made.
* **original_title** - The title of the movie before translation or adaptation.
* **overview** - A brief description of the movie.
* **popularity** - A numeric quantity specifying the movie popularity.
* **production_companies** - The production house of the movie.
* **production_countries** - The country in which it was produced.
* **release_date** - The date on which it was released.
* **revenue** - The worldwide revenue generated by the movie.
* **runtime** - The running time of the movie in minutes.
* **status** - "Released" or "Rumored".
* **tagline** - Movie's tagline.
* **title** - Title of the movie.
* **vote_average** - The average rating the movie received.
* **vote_count** - The number of votes the movie received.

In [None]:
# Rename the credits columns so both tables share the key 'id', then merge on it
df1.columns = ['id','movie_title','cast','crew']
df2 = df2.merge(df1, on='id')
df2.head()

# Explore the Data


I would use a weighted rating that takes into account the average rating and the number of votes it has accumulated. Such a system will make sure that a movie with a 9 rating from 100,000 voters gets a (far) higher score than a movie with the same rating but a mere few hundred voters.

$$\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$

* v is the number of votes for the movie;

* m is the minimum votes required to be listed in the chart;

* R is the average rating of the movie;

* C is the mean vote across the whole report.

I already have the values of v (`vote_count`) and R (`vote_average`) for each movie in the dataset, and C can be calculated directly from this data.
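To make the effect of the formula concrete, here is a small sketch with made-up numbers. The values `m_demo` and `C_demo` and the two vote counts are hypothetical and are not taken from the dataset; the helper name `weighted_rating_demo` is also just for illustration.

```python
def weighted_rating_demo(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values chosen only for illustration
m_demo = 1000   # minimum votes required to be listed
C_demo = 6.0    # mean vote across all movies

# Two movies with the same average rating but very different vote counts
popular = weighted_rating_demo(v=100_000, R=9.0, m=m_demo, C=C_demo)
niche = weighted_rating_demo(v=300, R=9.0, m=m_demo, C=C_demo)

print(round(popular, 2))  # 8.97
print(round(niche, 2))    # 6.69
```

With few votes the score is pulled toward the global mean, while with many votes it stays close to the movie's own average.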

In [None]:
# Calculate mean of vote average column
C = df2['vote_average'].mean()
print(C)

In [None]:
# Calculate m, the minimum number of votes required to be in the chart (90th percentile of vote_count)
m = df2['vote_count'].quantile(0.90)
print(m)

In [None]:
# Filter out all qualified movies into a new DataFrame
q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape

In [None]:
q_movies.describe()

In [None]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [None]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 5 movies (the default for head())
q_movies[['title', 'vote_count', 'vote_average', 'score']].head()

In [None]:
import matplotlib.pyplot as plt

pop= df2.sort_values('popularity', ascending=False)
plt.figure(figsize=(12,4))

plt.barh(pop['title'].head(10),pop['popularity'].head(10), align='center',
        color='red')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Top 10 Popular Movies")

# Build the model

## Content based filtering

Content-based filtering approaches leverage the descriptions or attributes of items the user has interacted with to recommend similar items. They depend only on the user's own previous choices, which makes this method robust to the cold-start problem for new items. For textual items, like articles, news and books, it is simple to use the raw text to build item profiles and user profiles. Here we are using a very popular technique in information retrieval (search engines) named TF-IDF. This technique converts unstructured text into a vector structure, where each word is represented by a position in the vector, and the value measures how relevant a given word is for a document. As all items will be represented in the same Vector Space Model, it is straightforward to compute the similarity between movie overviews.
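As a rough illustration of the idea, the sketch below builds TF-IDF vectors and compares them with cosine similarity using only the standard library. It is not the scikit-learn pipeline used in this notebook: the three texts are invented, tokenisation is a plain whitespace split, and no stop words are removed. The smoothed IDF term follows the form scikit-learn uses by default.

```python
import math
from collections import Counter

# Toy "overviews" invented for illustration (not from the TMDB data)
docs = [
    "a hero saves the city from a villain",
    "a villain threatens the city and a hero fights back",
    "two friends fall in love in paris",
]

def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, scaled to unit length."""
    tokenised = [t.lower().split() for t in texts]
    n_docs = len(tokenised)
    # Document frequency: number of texts that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """For unit-length vectors the dot product is the cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vecs = tfidf_vectors(docs)
sim_hero_pair = cosine(vecs[0], vecs[1])   # two texts that share several words
sim_unrelated = cosine(vecs[0], vecs[2])   # two texts with no words in common

print(sim_hero_pair > sim_unrelated)  # True
```

Texts that share vocabulary end up with a higher similarity than texts with no words in common, which is the property the recommender relies on.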

In [None]:
#Print overviews of the first 10 movies.
df2['overview'].head(10)

In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

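# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the plain dot product from linear_kernel equals the cosine similarity.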
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim

In [None]:
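# Construct a reverse map from movie title to DataFrame index.
# Series.drop_duplicates() compares the values (the indices, which are already unique),
# so duplicated titles are dropped explicitly, keeping the first occurrence.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices['The Shawshank Redemption']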

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df2['title'].iloc[movie_indices]

In [None]:
get_recommendations('Batman Forever')

## Collaborative filtering

Collaborative Filtering (CF) has two main implementation strategies:

1. **Memory-based**: This approach uses the memory of previous user interactions to compute user similarities based on the items they have interacted with (user-based approach) or to compute item similarities based on the users that have interacted with them (item-based approach). A typical example of this approach is User Neighbourhood-based CF, in which the top-N similar users (usually computed using Pearson correlation) for a user are selected and used to recommend items those similar users liked but the current user has not interacted with yet. This approach is very simple to implement, but usually does not scale well to many users.

2. **Model-based**: In this approach, models are developed using different machine learning algorithms to recommend items to users. There are many model-based CF algorithms, such as neural networks, Bayesian networks, clustering models, and latent factor models such as Singular Value Decomposition (SVD) and probabilistic latent semantic analysis.
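The memory-based, user-neighbourhood idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the ratings table `toy_ratings` is invented, all positively correlated users count as neighbours instead of a top-N cut, and the helper names are hypothetical. It is not the model used later in this notebook.

```python
import math

# Tiny invented ratings table: user -> {item: rating}
toy_ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def pearson(u, v):
    """Pearson correlation computed over the items both users rated."""
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((v[i] - mean_v) ** 2 for i in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0

def predict_user_based(user, item, ratings):
    """Mean-centred weighted average over positively correlated neighbours."""
    target = ratings[user]
    target_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(target, other_ratings)
        if sim <= 0:
            continue  # ignore users with opposite or unrelated taste
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        num += sim * (other_ratings[item] - other_mean)
        den += sim
    return target_mean + num / den if den else target_mean

# Predict how "ann" would rate item "D", which she has not rated
ann_d = predict_user_based("ann", "D", toy_ratings)
print(round(ann_d, 2))  # 4.33
```

Here "bob" rates like "ann" and liked item D, while "cat" has the opposite taste and is ignored, so the prediction for "ann" lands above her own average. Every prediction loops over all other users, which hints at why this approach scales poorly.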

In [None]:
from surprise import Reader, Dataset, SVD
# Ratings in this file use a 0.5-5 scale; Reader() alone would assume 1-5
reader = Reader(rating_scale=(0.5, 5))
ratings = pd.read_csv('../input/the-movies-dataset/ratings_small.csv')
ratings.head()

In [None]:
from surprise.model_selection import cross_validate
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

We got an average RMSE of about 0.89, which is good in our case.
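For reference, the two error measures reported by the cross-validation can be computed by hand as in the sketch below. The (true, estimated) rating pairs are hypothetical and are not output from the model above.

```python
import math

# Hypothetical (true rating, estimated rating) pairs for illustration only
rating_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

errors = [estimate - true for true, estimate in rating_pairs]

# RMSE squares the errors before averaging, so large misses are penalised more
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAE is the plain average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse)  # 0.75
print(mae)   # 0.625
```

Both measures are in the same units as the ratings, so an RMSE of 0.75 here means the estimates are typically off by about three quarters of a rating point.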

In [None]:
trainset = data.build_full_trainset()
algo.fit(trainset)

In [None]:
ratings[ratings['userId'] == 1]

Now let's apply the `predict` method to see how the recommender performs. It takes a user ID, a movie ID, and optionally the true rating (`r_ui`). The output is a prediction object that carries the true rating alongside the estimated rating (`est`).

In [None]:
algo.predict(1, 2150, 3.0)