# Chimera Movie Recommendations

## Overview

The new movie streaming service 'Cinemania' is looking for a way to increase movie streams. They have asked Chimera Solutions to connect subscribers to movies that they will enjoy, and then link them to other movies that they will enjoy. The goal of this notebook is to construct a recommendation system that accurately links subscribers to movies that fit their unique tastes.

## Business Problem

Cinemania is trying to enter the highly saturated streaming service market and is looking for a way to level the playing field. They need a recommendation system that is accurate enough to give users a reason to choose their service over the many alternatives.

- __Stakeholder:__ Cinemania


- __Significance of Recommendations:__ Recommendations need to be accurate so that users keep watching more and more movies on the stakeholder's streaming service


- __Deliverable:__ An interpretable Recommendation Model that the stakeholder can easily understand



## Notebook Summary 

## EDA

In [1]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader
from surprise import SVD, SVDpp, NMF, NormalPredictor, KNNBaseline, KNNBasic,\
KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne

from surprise import accuracy
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.model_selection import train_test_split
from surprise.model_selection import train_test_split as sur_tts

In [2]:
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')
links = pd.read_csv('data/links.csv')
tags = pd.read_csv('data/tags.csv')

#### Movies Dataset Analysis 

In [None]:
print(movies.shape)
movies.head()

In [None]:
movies.info()

Great! There are no missing values in this dataset.

In [None]:
print(movies['movieId'].nunique())
print(movies['title'].nunique())

There are 5 duplicated movie titles, but that is not a big deal for what we are trying to accomplish.
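A minimal sketch of how the duplicated titles could be inspected. It assumes a small stand-in DataFrame with the same `movieId` and `title` columns; the rows and the names `toy_movies`, `duplicated_titles`, and `n_duplicates` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame; the rows are illustrative only
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Film A (2000)', 'Film B (2001)', 'Film A (2000)', 'Film C (2002)'],
})

# keep=False marks every row whose title occurs more than once
duplicated_titles = toy_movies[toy_movies['title'].duplicated(keep=False)]

# Same comparison as the cell above: unique ids minus unique titles
n_duplicates = toy_movies['movieId'].nunique() - toy_movies['title'].nunique()

print(duplicated_titles.sort_values('title'))
print(n_duplicates)
```

Running the same two lines against `movies` would list the affected rows, which makes it easy to confirm they are harmless.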

#### Ratings Dataset Analysis

In [None]:
print(ratings.shape)
ratings.head()

In [None]:
ratings.info()

There are no missing values in our ratings dataset. Duplicate values are expected here because one user can rate multiple movies, and a movie can be rated by multiple users.
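Repeated `userId` or `movieId` values are fine, but a repeated (user, movie) pair would mean the same user rated the same movie twice. A hedged sketch of that check on made-up rows (`toy_ratings` and `n_repeated_pairs` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for the ratings DataFrame; the rows are illustrative only
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# Count rows whose (userId, movieId) pair already appeared earlier in the frame
n_repeated_pairs = int(toy_ratings.duplicated(subset=['userId', 'movieId']).sum())
print(n_repeated_pairs)
```

A result of zero on the real `ratings` frame would confirm that each user rated each movie at most once.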

In [None]:
ratings['rating'].hist()

There seems to be a slightly skewed distribution of ratings, but this is just how users rated their movies, so this will be left as is.

In [None]:
users = ratings['userId'].nunique()
movie = ratings['movieId'].nunique()

In [None]:
print(f'There are {users} users rating {movie} movies')

We have far more movies than users, which is to be expected with a new streaming service.
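One way to put a number on this imbalance is the share of all possible user-movie pairs that actually have a rating. The sketch below assumes illustrative counts rather than the notebook's real figures, and `rating_density` is a hypothetical helper name.

```python
def rating_density(n_ratings, n_users, n_movies):
    """Fraction of all possible user-movie pairs that have a rating."""
    return n_ratings / (n_users * n_movies)

# Illustrative numbers only, not taken from the notebook output
density = rating_density(n_ratings=1_000, n_users=50, n_movies=400)
print(f'{density:.1%} of the user-movie matrix is filled')
```

With the real data, the same function could be called as `rating_density(len(ratings), users, movie)`.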

#### Tags Dataset Analysis

In [None]:
print(tags.shape)
tags.head()

In [None]:
tags.info()

There are no missing values, and the tag column is just a user-generated phrase describing something that stood out to them in the movie.

This dataset is not really important for our recommendation system.

#### Links Dataset Analysis 

In [None]:
print(links.shape)
links.head()

In [None]:
links.info()

After further research into this dataset, we found that these are ID numbers that link to the corresponding movies on IMDb and The Movie Database (TMDb).

This data is not needed for our recommendation system because we want users to watch movies on Cinemania's streaming service.

##### Data for our recommendation system

In [None]:
ratings.head()

In [None]:
movies.head()

We are going to use the __ratings__ dataset to produce recommendations for users based on what similar users have rated highly, and we are going to use the title and genre columns from the __movies__ dataset to make the output of our model more interpretable by showing the title rather than the movieId.
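As a rough sketch of how titles could be attached to ratings, the example below joins two made-up frames on `movieId`. The rows are illustrative, and the column name `genres` is an assumption about how the genre column is labelled.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings and movies DataFrames
toy_ratings = pd.DataFrame({
    'userId': [1, 2],
    'movieId': [10, 20],
    'rating': [4.0, 3.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20],
    'title': ['Film A (2000)', 'Film B (2001)'],
    'genres': ['Drama', 'Comedy'],  # assumed column name
})

# A left join keeps every rating and attaches the matching title and genres
labelled = toy_ratings.merge(
    toy_movies[['movieId', 'title', 'genres']], on='movieId', how='left'
)
print(labelled)
```

The model itself only needs the ids and ratings; the join is only for presenting results in a readable form.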

#### Data Preprocessing

Given the high number of movies with only a few ratings (10 or fewer), we decided to remove the ratings for those movies so our model would provide more accurate and confident recommendations. This removes about 21,000 ratings (from 100,836 down to 79,636).

In [3]:
print(len(ratings.index))

# Count how many ratings each movie received
movie_count = ratings['movieId'].value_counts()

# Movies with 10 or fewer ratings are considered rare
rare_movies = movie_count[movie_count <= 10].index

# Drop every rating that belongs to a rare movie
to_delete = ratings.index[ratings['movieId'].isin(rare_movies)]
ratings.drop(to_delete, inplace=True)
print(len(ratings.index))


100836
79636


## Modeling

#### Compare Different Baseline Models

In [4]:
rating_data = ratings.drop('timestamp', axis=1)

In [5]:
reader = Reader(line_format='user item rating', sep=',')
data = Dataset.load_from_df(rating_data, reader=reader)

__Note:__ This cell takes a very long time to run. It does not need to be re-run; refer to its output instead.

In [None]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation to see which algorithms give lowest RMSE
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Average the cross-validation results and label the row with the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    name = pd.Series([type(algorithm).__name__], index=['Algorithm'])
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, name])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')   

- SVDpp, BaselineOnly, SVD, and KNNBaseline are our top 4 models with default parameters.
- SVDpp is only marginally better but is far more computationally expensive than the other 3: it took a total of 419 seconds (about 7 minutes) to fit and test.

### Model Tuning

#### Split our data into a training set and a test set, then train the best-performing models on the training set

In [6]:
# Split the ratings with scikit-learn's train_test_split
y = pd.DataFrame(rating_data['rating'])
X = rating_data.drop('rating', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Re-merge features and target so the reader can load them
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

# Convert to Surprise Dataset objects
trainset = Dataset.load_from_df(train_df, reader=reader)
testset = Dataset.load_from_df(test_df, reader=reader)

# Surprise's split returns the test portion as a list of rating tuples;
# note that test_size=.95 leaves 5% of the held-out ratings unused
blank, testset = sur_tts(testset, test_size=.95)

type(testset)

list
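The cell above passes the held-out data through Surprise's own split with `test_size=.95` in order to obtain a plain list. As a possible alternative, the list could be built straight from the DataFrame, which would keep every held-out row. This is a sketch under the assumption that the test list is made of `(user, item, rating)` tuples of raw ids; `df_to_testset` is a hypothetical helper, not part of Surprise.

```python
import pandas as pd

def df_to_testset(df):
    """Turn userId/movieId/rating rows into a list of (user, item, rating) tuples."""
    columns = df[['userId', 'movieId', 'rating']]
    return [(u, i, float(r)) for u, i, r in columns.itertuples(index=False, name=None)]

# Hypothetical held-out rows, for illustration only
toy_test_df = pd.DataFrame({
    'userId': [1, 2],
    'movieId': [10, 20],
    'rating': [4.0, 3.5],
})

toy_testset = df_to_testset(toy_test_df)
print(toy_testset)
```

Whether this matches the behavior wanted here depends on how the held-out 5% was meant to be used, so treat it as an option rather than a replacement.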

In [8]:
# use a gridsearch to find best params for SVD
param_grid = {'n_factors':[50,200,250],'n_epochs':[25,30,40],
              'lr_all':[.025,.05,.075],'reg_all':[.04,.05,.06]}

gs_svd = GridSearchCV(SVD,param_grid,measures=['rmse'],cv=3,n_jobs=-1)
gs_svd.fit(trainset)

params = gs_svd.best_params['rmse']

params

{'n_factors': 250, 'n_epochs': 40, 'lr_all': 0.025, 'reg_all': 0.06}
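The grid above has three candidate values for each of four hyperparameters, so it is worth counting how many fits the search performs. A small sketch using only the standard library (the names `n_combinations` and `n_fits` are ours):

```python
from itertools import product

# Same grid as in the search above
param_grid = {'n_factors': [50, 200, 250], 'n_epochs': [25, 30, 40],
              'lr_all': [.025, .05, .075], 'reg_all': [.04, .05, .06]}

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*param_grid.values())))

# Each combination is fitted once per cross-validation fold (cv=3)
cv_folds = 3
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)
```

Every selected value (`n_factors` 250, `n_epochs` 40, `lr_all` 0.025, `reg_all` 0.06) sits at an edge of its range, which may be a hint that widening the grid in those directions is worth trying.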

In [11]:
#test best SVD on testset

algo = SVD(n_factors= 250, n_epochs=40, lr_all= 0.025, reg_all= 0.06)

train_set = trainset.build_full_trainset()
algo.fit(train_set)

preds = algo.test(testset)

accuracy.rmse(preds)


RMSE: 0.8354


0.8353588248946192

In [12]:
accuracy.mae(preds)

MAE:  0.6374


0.6373621355575251
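For reference, the two metrics reported above can be written out by hand. This is a minimal sketch on made-up (actual, predicted) pairs, not the notebook's predictions. The function names are ours, and it does not use Surprise's `accuracy` module.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

# Illustrative (actual rating, predicted rating) pairs
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

print(rmse(toy_pairs))
print(mae(toy_pairs))
```

RMSE squares each error before averaging, so large misses weigh more heavily than in MAE. This is why the RMSE above (0.8354) is higher than the MAE (0.6374) on the same predictions.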