# Recommender Systems

I always look for Kaggle analyses before doing my own analysis in any domain. I think getting some ideas before starting is very important. I also like to read the comments to get a grasp of complications the code may have had, or of things that might not be well explained in the code itself. I will add to the Kaggle analysis or trim certain parts to make the code easier to understand.
The Kaggle notebook I was reading for this notebook is this one: https://www.kaggle.com/code/rounakbanik/movie-recommender-systems

In [2]:
# pip install scikit-surprise

In [9]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
import warnings; warnings.simplefilter('ignore')

In [10]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Movie information metadata

This is the movie metadata: it gives genres, descriptions, overall ratings, and other important metrics. In this notebook we will build a Collaborative Filtering model, so we will not use these metrics here, but we will explore them in the future when we create a hybrid model.

In [11]:
# Read the data

X_full = pd.read_csv('movies_metadata.csv')

X_full.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [12]:
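# We will use the same cleaning process as the Kaggle notebook linked above :)
# genres: parse the stringified list of dicts and keep only the genre names
X_full['genres'] = X_full['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
# year: take the year from release_date; unparseable dates become NaN
# (pd.notna is needed here because `x != np.nan` is always True)
X_full['year'] = pd.to_datetime(X_full['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)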
X_full.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995
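
To make the genres cleaning step concrete, here is a minimal sketch of what it does to a single cell. The sample string is hand-written to resemble the raw format, and `parse_genres` is a hypothetical helper name, not something from the notebook or the Kaggle code.

```python
from ast import literal_eval

def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of genre names."""
    # Missing values are treated as an empty list, mirroring fillna('[]').
    if raw is None:
        raw = '[]'
    parsed = literal_eval(raw)
    # Anything that is not a list (for example a bare number) yields no genres.
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed]

# Hand-written sample resembling one raw `genres` cell (an assumption, not real data).
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
genres = parse_genres(sample)
print(genres)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.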


In [14]:
str(X_full['overview'][0])

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [15]:
str(X_full['production_companies'][0])

"[{'name': 'Pixar Animation Studios', 'id': 3}]"

# Analysing Previous Kaggles and Understanding what is possible

(Here I continue to use the notebook by Rounak Banik linked above.)
The Kaggle notebook presents several ways to recommend movies that I would like to mention:

- Recommendation based on popularity and "general taste": this recommends movies that are liked by a lot of people, let us say the "good movies". The columns used are vote_count, vote_average, popularity, and a newly created column called wr.
Here is an extract from the Kaggle notebook:

"""

I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

"""
- Content-based recommendation: the idea here is to find similarities between movies based on their descriptions and the actors/directors in them. Columns to use: genres, overview, tagline. I would also add adult and production_companies.

- Collaborative Filtering: here we use ratings given by other users to make predictions. This is the model we will build!
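
The weighted rating formula quoted in the first bullet can be sketched in a few lines. This is only an illustration: `v` and `R` are Toy Story's vote_count and vote_average from the metadata shown earlier, while the values of `m` and `C` are made up for the example and are not taken from the dataset or the Kaggle notebook.

```python
def weighted_rating(v, m, R, C):
    """IMDB-style weighted rating: blends a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# v and R come from the Toy Story row above; m and C are hypothetical values.
toy_story_wr = weighted_rating(v=5415, m=400, R=7.7, C=5.2)
print(round(toy_story_wr, 3))

# A movie with no votes falls back to the global mean C.
print(weighted_rating(v=0, m=400, R=9.0, C=5.2))
```

The more votes a movie has relative to `m`, the closer its weighted rating gets to its own average `R`; with few votes it is pulled towards `C`.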

# Collaborative Filtering

In [20]:
# Read the data

X = pd.read_csv('ratings_small.csv')

X.head(1)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144


In [21]:
reader = Reader()

In [22]:
data = Dataset.load_from_df(X[['userId', 'movieId', 'rating']], reader)

In [24]:
# We'll use the famous SVD algorithm.
svd = SVD()

# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8936  0.8985  0.8952  0.8937  0.9023  0.8966  0.0033  
MAE (testset)     0.6876  0.6948  0.6884  0.6892  0.6912  0.6902  0.0026  
Fit time          1.19    1.21    1.25    1.23    1.20    1.21    0.02    
Test time         0.17    0.16    0.40    0.16    0.16    0.21    0.09    


{'test_rmse': array([0.89356467, 0.89846783, 0.895178  , 0.89370053, 0.90228305]),
 'test_mae': array([0.68757813, 0.69479583, 0.6883755 , 0.68918021, 0.6911662 ]),
 'fit_time': (1.1852765083312988,
  1.2101449966430664,
  1.2456355094909668,
  1.2283062934875488,
  1.2014307975769043),
 'test_time': (0.16829991340637207,
  0.16472816467285156,
  0.3965113162994385,
  0.1622910499572754,
  0.16328787803649902)}
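
As a rough illustration of what 5-fold cross-validation does with the ratings, here is a small sketch that splits row indices into five folds and uses each fold once as the test set. It is not Surprise's implementation; `k_fold_indices` and the seed are names and values made up for this example.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is in the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train)} train rows, {len(test)} test rows")
```

Each column `Fold 1` to `Fold 5` in the table above corresponds to one such train/test split.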

So on unseen data the predicted ratings are off by less than 1 rating point on average: across the five folds the mean RMSE is about 0.90 and the mean MAE is about 0.69.
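
To make the two metrics reported above concrete, here is a minimal sketch of how RMSE and MAE are computed. The ratings are made-up numbers for illustration, not values from `ratings_small.csv`.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error: penalises large mistakes more heavily."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(true_ratings, predicted_ratings):
    """Mean absolute error: the average size of a mistake, in rating points."""
    absolute_errors = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute_errors) / len(absolute_errors)

# Made-up example: three true ratings and three predictions.
true_ratings = [3.0, 4.0, 5.0]
predicted_ratings = [2.5, 4.0, 4.0]
example_rmse = rmse(true_ratings, predicted_ratings)
example_mae = mae(true_ratings, predicted_ratings)
print(example_rmse, example_mae)
```

RMSE is never smaller than MAE on the same predictions, which matches the cross-validation table where RMSE is higher than MAE in every fold.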

In [25]:
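# Predict the rating of user 1 for movie 302. The third argument (r_ui=3) is an
# optional "true" rating kept only for reference; it does not affect the estimate.
svd.predict(1, 302, 3)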

Prediction(uid=1, iid=302, r_ui=3, est=3.0866959622443635, details={'was_impossible': False})
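
To give an idea of where a value like `est` comes from: as I understand it, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user's and the item's latent factors, and then clips the result to the rating scale (1 to 5 is the default of `Reader()`). The sketch below uses made-up parameter values, not the ones learned by the model above, and `svd_estimate` is a hypothetical name.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(1, 5)):
    """Biased matrix-factorisation estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))

# Made-up parameters for one (user, movie) pair.
estimate = svd_estimate(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.3,
    user_factors=[0.1, -0.2],
    item_factors=[0.3, 0.4],
)
print(estimate)

# A raw estimate above the scale is clipped to the upper bound.
clipped = svd_estimate(4.8, 0.5, 0.0, [0.0], [0.0])
print(clipped)
```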

That was way easier than expected. We will continue exploring this topic and the other outputs of these functions in this notebook, stay tuned.