# Recommender Systems

I always look for Kaggle analyses before starting my own work in any domain. I think getting some ideas before starting is very important. I also like to read the comments to get a grasp of complications the code may have had, or of things that are not well explained in the code itself. I will add to the Kaggle analysis or trim certain parts to make the code easier to understand.
The Kaggle notebook I was reading for this notebook is this one:

https://www.kaggle.com/code/rounakbanik/movie-recommender-systems

I also highly recommend these resources:

https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

https://sifter.org/~simon/journal/20061211.html  

In [1]:
# pip install scikit-surprise

In [19]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
import warnings; warnings.simplefilter('ignore')

from sklearn.preprocessing import normalize


In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Movie information metadata

This is the movie metadata: it gives genres, descriptions, overall ratings and other important metrics. In this notebook we will build a Collaborative Filtering model, so we will not use these metrics here, but we will explore them in the future when we create a hybrid model.

In [4]:
# Read the data

X_full = pd.read_csv('movies_metadata.csv')

X_full.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [5]:
# We will use the same cleaning process as the kaggle code provided above :) 
X_full['genres'] = X_full['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
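# pd.notna is needed here: `x != np.nan` is always True, so unparseable dates would end up as the string 'NaT'
X_full['year'] = pd.to_datetime(X_full['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)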
X_full.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995


In [6]:
len(X_full.id.unique()) # how many movies

45436

In [7]:
str(X_full['overview'][0])

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [8]:
str(X_full['production_companies'][0])

"[{'name': 'Pixar Animation Studios', 'id': 3}]"

# Analysing Previous Kaggles and Understanding what is possible

(Here I continue to use the notebook by Rounak Banik presented above.)
The Kaggle notebook presents different ways to recommend movies that I would like to mention:

- Recommend based only on popularity and "general taste": this means recommending movies that are really liked by a lot of people; let us say that those are "good movies". The columns they use are: vote_count, vote_average, popularity and a newly created column called wr.
Here is an extract from the Kaggle notebook:

"""

I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $(\frac{v}{v + m} . R) + (\frac{m}{v + m} . C)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

"""
- Content-based: the idea here is to find similarities between movies based on their description and the actors/directors in them. Columns to use: genres, overview, tagline. I would also add: adult and production_companies.

- Collaborative Filtering: here we use ratings given by other users to make predictions. This is the model we will build!
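The weighted-rating formula quoted in the first approach can be sketched in a few lines of pandas. This is only an illustration: the toy data and the threshold `m = 100` are made up, and the helper name `weighted_rating` is hypothetical, not taken from the Kaggle notebook.

```python
import pandas as pd

def weighted_rating(df, m, C):
    """IMDB-style weighted rating: (v / (v + m)) * R + (m / (v + m)) * C."""
    v = df["vote_count"]
    R = df["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy data, made up for illustration (not rows of movies_metadata.csv)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [1000.0, 10.0, 100.0],
    "vote_average": [8.0, 9.0, 6.0],
})

C = toy["vote_average"].mean()  # mean vote across the toy chart
m = 100.0                       # assumed minimum-votes threshold
toy["wr"] = weighted_rating(toy, m, C)

print(toy.sort_values("wr", ascending=False))
```

Movie B has the highest raw average but only 10 votes, so its weighted rating is pulled towards the overall mean `C` and it ranks below movie A.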

# Collaborative Filtering

In [9]:
# Read the data

X = pd.read_csv('ratings_small.csv').sort_values(by=['movieId']).reset_index(drop=True)

X.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,68,1,4.0,1194741818
1,261,1,1.5,1101665532
2,383,1,5.0,852806429


In [10]:
X.shape

(100004, 4)

In [11]:
len(X.movieId.unique())

9066

In [12]:
len(X.userId.unique())

671

In [13]:
reader = Reader()

In [14]:
data = Dataset.load_from_df(X[['userId', 'movieId', 'rating']], reader)

In [15]:
# We'll use the famous SVD algorithm.
svd = SVD(biased= True)

# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8917  0.9033  0.8931  0.8992  0.8951  0.8965  0.0043  
MAE (testset)     0.6875  0.6925  0.6866  0.6942  0.6891  0.6900  0.0029  
Fit time          0.72    0.76    1.13    1.30    1.29    1.04    0.25    
Test time         0.17    0.08    0.15    0.29    0.16    0.17    0.07    


{'test_rmse': array([0.89170268, 0.90334639, 0.8930939 , 0.8991915 , 0.89509169]),
 'test_mae': array([0.68754049, 0.69248321, 0.68662815, 0.69424712, 0.68907984]),
 'fit_time': (0.7226223945617676,
  0.7557284832000732,
  1.1346337795257568,
  1.297290563583374,
  1.289625883102417),
 'test_time': (0.17198586463928223,
  0.08184170722961426,
  0.15314173698425293,
  0.28528642654418945,
  0.16073179244995117)}

So, on unseen data, our rating predictions are off by less than 1 point on average: the mean RMSE across the five folds is about 0.90 and the mean MAE is about 0.69.
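For reference, this is how the two error measures are computed. The sketch below uses plain NumPy on a handful of made-up ratings and predictions; it is not the Surprise implementation, only the same definitions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: large errors are penalised more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error in rating points."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up true ratings and predictions, for illustration only
true_ratings = np.array([4.0, 1.5, 5.0, 3.0])
predicted = np.array([3.5, 2.5, 4.0, 3.0])

print("RMSE:", rmse(true_ratings, predicted))
print("MAE:", mae(true_ratings, predicted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, which matches the cross-validation table above.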

In [16]:
# Fields of the returned Prediction:
#   uid: the (raw) user id
#   iid: the (raw) item id
#   r_ui (float): the true rating
#   est (float): the estimated rating
svd.predict(1, 302, r_ui=2)

Prediction(uid=1, iid=302, r_ui=2, est=2.8453853213349047, details={'was_impossible': False})

# What movies are similar?

This question may sound a little odd to answer with the model we created. How would we know which movies are similar if we built a collaborative filter instead of a content-based model?

As it turns out, the model learns two matrices: one is the user matrix and the other is the movie matrix. Let us explore this a little more.
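To make the role of the two matrices concrete, here is a minimal NumPy sketch of the biased prediction rule described in the Surprise documentation linked at the top, r̂_ui = μ + b_u + b_i + q_iᵀ p_u. The factors and biases below are random toy values, not the fitted model, and the sketch ignores the clipping to the rating scale that the library applies by default.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 3

# Toy stand-ins for the learned parameters (random, for illustration only)
pu = rng.normal(scale=0.1, size=(n_users, n_factors))  # user matrix, one row per user
qi = rng.normal(scale=0.1, size=(n_items, n_factors))  # movie matrix, one row per movie
bu = rng.normal(scale=0.1, size=n_users)               # user biases
bi = rng.normal(scale=0.1, size=n_items)               # movie biases
mu = 3.5                                               # assumed global mean rating

def predict(u, i):
    """Estimated rating of user u for movie i (inner indices, no clipping)."""
    return mu + bu[u] + bi[i] + qi[i] @ pu[u]

# The same rule for every (user, movie) pair at once
all_preds = mu + bu[:, None] + bi[None, :] + pu @ qi.T

print(predict(1, 2))
print(all_preds.shape)
```

Each row of the movie matrix is a compact description of one movie in "taste space", which is what makes it possible to compare movies with each other.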

In [17]:
# First, let us use the whole dataset
data = Dataset.load_from_df(X[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()
svd = SVD(n_factors=100,biased = True)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1d01b1c7e10>

In [20]:
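# movie matrix: one row per movie, one column per latent factor
movie_matrix_temp = svd.qi
# note: axis=0 with norm='l1' rescales each latent factor (column) so that its absolute values sum to 1;
# it does not turn each movie (row) into a unit vector
movie_matrix = normalize(movie_matrix_temp, axis=0, norm='l1')
len(movie_matrix[8])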

100

In [22]:
sum(movie_matrix[0])

0.0015639126558568755
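The `normalize` call above works column by column (`axis=0`), so the dot products computed later are not cosine similarities and their scale is hard to interpret. As an alternative sketch, assuming cosine similarity between movie vectors is what we want, each row can be scaled to unit L2 length before taking dot products. The toy matrix and the helper name `cosine_similarity_rows` are hypothetical.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = matrix / norms    # each row now has L2 norm 1
    return unit @ unit.T

# Toy movie matrix: 4 movies, 2 latent factors (made-up values)
toy_qi = np.array([
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as movie 0
    [0.0, 3.0],  # orthogonal to movie 0
    [1.0, 1.0],
])

sims = cosine_similarity_rows(toy_qi)
print(np.round(sims, 3))
```

With this scaling every movie has similarity 1 with itself and all values lie between -1 and 1, which makes the ranking easier to read.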

In [23]:
# users matrix
svd.pu.shape

(671, 100)

In [29]:
# finding some popular movie to see what the model "sees" as similar
X_full['vote_count'] = X_full['vote_count'].apply(lambda x: float(x))
X_full[['id','original_title','vote_count']].sort_values(by='vote_count',ascending=False).head(10)

Unnamed: 0,id,original_title,vote_count
15480,27205,Inception,14075.0
12481,155,The Dark Knight,12269.0
14551,19995,Avatar,12114.0
17818,24428,The Avengers,12000.0
26564,293660,Deadpool,11444.0
22879,157336,Interstellar,11187.0
20051,68718,Django Unchained,10297.0
23753,118340,Guardians of the Galaxy,10014.0
2843,550,Fight Club,9678.0
18244,70160,The Hunger Games,9634.0


Let us take Fight Club (id 550), the ninth movie in the list above.

In [30]:
# Fight Club id: 550
index_movie = 550
def get_similar_movies(index_movie, movie_matrix, X_full, X):
    # ids of the rated movies, in order of first appearance in X (the row order assumed for movie_matrix)
    movies_id = list(X.movieId.unique())
    # find the row position of index_movie; it is not necessarily index_movie - 1 because the ids have gaps
    position_in_X = movies_id.index(index_movie)
    # latent vector of the movie to analyse
    movie = movie_matrix[position_in_X].reshape(1, -1)
    # dot product of that vector with every movie vector
    similarities = np.dot(movie, movie_matrix.T).flatten()
    df_similarities_temp = pd.DataFrame({'id': movies_id, 'similarities': similarities}).sort_values(by='similarities', ascending=False)
    # merge on string ids, without modifying the caller's X_full
    df_similarities_temp['id'] = df_similarities_temp['id'].astype(str)
    titles = X_full[['id', 'title']].assign(id=X_full['id'].astype(str))
    df_similarities = df_similarities_temp.merge(titles, on='id', how='left')
    return df_similarities

In [31]:
df_similarities = get_similar_movies(index_movie,movie_matrix,X_full,X)
df_similarities.head(10)

Unnamed: 0,id,similarities,title
0,550,2.257285e-06,Fight Club
1,593,8.505792e-07,Solaris
2,2132,8.284846e-07,Totally Blonde
3,231,8.202836e-07,Syriana
4,3384,7.374309e-07,
5,95067,7.114425e-07,
6,608,7.03843e-07,Men in Black II
7,909,7.033629e-07,Meet Me in St. Louis
8,39,6.946409e-07,
9,2231,6.642197e-07,


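Do these movies look similar? To the model they are, even if to us they are not: they are similar in the sense that similar users rate them similarly. One caveat: the movieId values in ratings_small.csv appear to be MovieLens ids, while id in movies_metadata.csv is a TMDB id, so the titles joined above may not belong to the movies that were actually rated.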
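If the ratings file and the metadata file use different id schemes (MovieLens `movieId` versus TMDB `id`), titles should be attached through a link table rather than by merging the two id columns directly. The sketch below assumes a table with `movieId` and `tmdbId` columns, like the `links_small.csv` file that ships with this dataset; all three frames are tiny toy stand-ins, and the similarity values are made up.

```python
import pandas as pd

# Toy stand-ins for links_small.csv, movies_metadata.csv and the similarity table
links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, 8844.0, None]})
metadata = pd.DataFrame({"id": ["862", "8844"], "title": ["Toy Story", "Jumanji"]})
similar = pd.DataFrame({"movieId": [2, 1, 3], "similarities": [0.9, 0.8, 0.1]})

# Keep only the links that have a TMDB id and express it as a string, like metadata['id']
links = links.dropna(subset=["tmdbId"]).copy()
links["id"] = links["tmdbId"].astype(int).astype(str)
metadata["id"] = metadata["id"].astype(str)

# movieId -> TMDB id -> title; unmatched movies keep a missing title
titled = (
    similar
    .merge(links[["movieId", "id"]], on="movieId", how="left")
    .merge(metadata[["id", "title"]], on="id", how="left")
)
print(titled[["movieId", "similarities", "title"]])
```

The left joins keep every row of the similarity table, so movies without a link or without metadata show up with a missing title instead of silently disappearing.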

Stay tuned for new developments in this notebook :)