## Collaborative filtering

In [12]:
from pandas.api.types import CategoricalDtype
from datetime import datetime
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize

from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from funk_svd.dataset import fetch_ml_ratings
from funk_svd import SVD
from sklearn.metrics import mean_absolute_error
from surprise import Reader, Dataset, SVD, accuracy  # this SVD rebinds the name imported from funk_svd above
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV


#### Loading the data

In [13]:
movies= pd.read_csv("Data/movies.csv")
movies.drop("Unnamed: 0",axis=1,inplace=True)

In [14]:
ratings_sample = pd.read_csv("Data/ratings_sample.csv")

In [15]:
ratings_sample

Unnamed: 0,userId,movieId,rating,liked
0,3,356,4.0,1
1,3,4167,3.5,0
2,3,4306,4.0,1
3,3,4979,4.0,1
4,3,5574,4.0,1
...,...,...,...,...
813465,162534,122892,2.5,0
813466,162534,136016,2.0,0
813467,162534,152081,2.5,0
813468,162534,174055,3.5,0


In [31]:
movies.head(2)

Unnamed: 0,movieId,original_language,original_title,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,...,0,0,0,0,0,0,0,0,0,0
1,8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,...,0,0,0,0,0,0,0,0,0,0


In item-based collaborative filtering, an item's rating is predicted from how the same user has rated similar items.

That is, the prediction uses the user's own ratings on neighbouring (closely related) items.
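
The neighbourhood idea can be sketched in a few lines of NumPy. This is only an illustration, not the method used later in this notebook: the rating matrix `R` is made up, the helper names `item_similarity` and `predict_item_based` are hypothetical, and unrated entries are treated as 0 when computing cosine similarity, which is a simplification.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are made up for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 2.0, 4.0, 5.0],
])

def item_similarity(R):
    """Cosine similarity between item columns (unrated entries count as 0)."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    return (R.T @ R) / np.outer(norms, norms)

def predict_item_based(R, user, item, k=2):
    """Similarity-weighted average of the user's own ratings on the k most similar rated items."""
    sim = item_similarity(R)[item]
    rated = np.flatnonzero(R[user] > 0)   # items this user has rated
    rated = rated[rated != item]          # never use the target item itself
    if rated.size == 0:
        return float("nan")
    top = rated[np.argsort(sim[rated])[::-1][:k]]  # k nearest neighbours among rated items
    weights = sim[top]
    if weights.sum() == 0:
        return float(R[user, top].mean())
    return float(weights @ R[user, top] / weights.sum())

# User 0 has not rated item 2; the neighbours are taken from items 0, 1 and 3.
pred = predict_item_based(R, user=0, item=2)
print(round(pred, 2))
```

Item 2 is most similar to item 3, which user 0 rated low, so the prediction is pulled towards that low rating.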

In [17]:
ratings = ratings_sample.sort_values(by='userId')

In [18]:
ratings.drop("liked", axis=1, inplace=True)
# Dropping the 'liked' column, since Funk SVD only needs the user, item and rating columns

In [19]:
ratings

Unnamed: 0,userId,movieId,rating
0,3,356,4.0
27,3,148855,4.0
26,3,134853,4.0
25,3,130634,3.0
24,3,117529,3.0
...,...,...,...
813446,162534,1198,4.0
813445,162534,745,4.0
813468,162534,174055,3.5
813456,162534,33004,3.0


Re-index `userId` to run from 0 to `len(ratings) - 1` so that it can be used with Funk SVD. Because the new values are assigned per row, every rating ends up with its own `userId`.

In [20]:
# Create a helper column 'UserID' with incremental values starting from 0 (one value per row)
ratings['UserID'] = range(len(ratings))



In [21]:
# Overwrite the 'userId' column with the values from the helper 'UserID' column
ratings['userId'] = ratings['UserID']

# Remove the helper 'UserID' column
ratings = ratings.drop('UserID', axis=1)

In [22]:
#sanity check
ratings

Unnamed: 0,userId,movieId,rating
0,0,356,4.0
27,1,148855,4.0
26,2,134853,4.0
25,3,130634,3.0
24,4,117529,3.0
...,...,...,...
813446,813465,1198,4.0
813445,813466,745,4.0
813468,813467,174055,3.5
813456,813468,33004,3.0
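
Since `range(len(ratings))` numbers rows rather than users, a per-user re-indexing may be closer to the intent. Here is a minimal sketch of that alternative, assuming a toy frame with the same column names as `ratings` (the values are made up) and using `pd.factorize`:

```python
import pandas as pd

# Toy ratings frame with the same columns as `ratings`; the values are made up.
toy = pd.DataFrame({
    "userId": [3, 3, 3, 17, 17, 162534],
    "movieId": [356, 4167, 4306, 356, 745, 1198],
    "rating": [4.0, 3.5, 4.0, 5.0, 4.0, 4.0],
})

# factorize() maps each distinct userId to a code 0, 1, 2, ... in order of first appearance,
# so all ratings by the same user keep the same new id.
codes, original_ids = pd.factorize(toy["userId"])
toy["userId"] = codes

print(toy["userId"].tolist())   # new ids, one per user
print(list(original_ids))       # lookup from new id back to the original userId
```

Keeping `original_ids` around makes it possible to translate a prediction back to the original user.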


### Funk SVD

Funk SVD is a matrix factorization technique. It belongs to the model-based recommender engines, which are part of the collaborative filtering family.

Funk SVD is well suited to sparse matrices, because it is fit only on the ratings that were actually observed.
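
To make the idea concrete, here is a minimal NumPy sketch of Funk-style matrix factorization trained by stochastic gradient descent. It is not Surprise's implementation: the function names `funk_svd` and `training_rmse`, the toy ratings and the hyperparameter values are all assumptions for illustration, and bias terms are left out.

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Fit user factors P and item factors Q by SGD on the observed ratings only (no bias terms)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error on one observed rating
            p_old = P[u].copy()            # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def training_rmse(ratings, P, Q):
    """RMSE of the dot-product predictions on the given (user, item, rating) triples."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Made-up (user, item, rating) triples; 3 of the 9 user-item pairs are unobserved.
toy_ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = funk_svd(toy_ratings, n_users=3, n_items=3, n_epochs=0)   # untrained factors
P, Q = funk_svd(toy_ratings, n_users=3, n_items=3)                 # trained factors

rmse_before = training_rmse(toy_ratings, P0, Q0)
rmse_after = training_rmse(toy_ratings, P, Q)
print(rmse_before, rmse_after)
```

The loop only ever touches observed ratings, which is why missing entries in a sparse matrix do not need to be filled in.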



We need to load our data into a `Dataset` object using the `Surprise` library.

In [23]:
#using Funk SVD
my_dataset = Dataset.load_from_df(ratings, Reader(rating_scale=(0.5, 5)))
my_train_dataset = my_dataset.build_full_trainset()

In [24]:
my_train_dataset

<surprise.trainset.Trainset at 0x7fcde963e430>

Now all we do is initialize the algorithm, specify the number of latent variables and iterations we'd like to use, and then let the algorithm run.

A major downside is that Funk SVD cannot make predictions for new users, because no latent factors have been learned for them.
For this reason I define an existing user ID and a movie ID below and pass them to my Funk SVD model to see the results.

In [25]:
#example
user_1 = 0
movie = 62

In [26]:

my_algorithm = FunkSVD(n_factors=15,    # Number of latent factors
                       n_epochs=50,     # Number of SGD passes over the data
                       lr_all=0.001,    # Learning rate for all parameters
                       biased=False,    # This forces the algorithm to store all latent information in the matrices
                       verbose=0,
                       reg_all=0.08)    # Regularization term for all parameters
my_algorithm.fit(my_train_dataset)




<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fcdc8f024f0>

The fitted model stores two matrices, `pu` and `qi`, which we call the user and item latent factors.

Then, by taking the product of the two matrices, we can reconstruct our rating matrix.
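
The reconstruction can be sketched with two small factor matrices. The values of `P` and `Q` below are hypothetical and only stand in for the learned factors; the point is that one entry of the reconstructed matrix equals the dot product of one user row and one item row.

```python
import numpy as np

# Hypothetical latent factors: 3 users x 2 factors and 4 items x 2 factors.
P = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [1.2, 0.1]])
Q = np.array([[3.0, 1.0],
              [2.5, 0.5],
              [0.5, 2.5],
              [1.0, 2.0]])

# Full reconstructed rating matrix: every user against every item.
R_hat = P @ Q.T            # shape: (3 users, 4 items)

# A single predicted rating is the dot product of one user profile and one item profile.
user, item = 1, 2
single = np.dot(P[user], Q[item])

print(R_hat.shape)
print(single, R_hat[user, item])
```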

In [27]:
# User matrix: each row of U holds the latent factor values associated with one user
U = my_algorithm.pu

In [28]:
#movie matrix
M = my_algorithm.qi.T


In [29]:
inner_user_id = my_train_dataset.to_inner_uid(user_1) # find the inner representation of user 1
user_profile = U[inner_user_id]
inner_movie_id = my_train_dataset.to_inner_iid(movie) # find the inner representation of item 
movie_profile = M[:, inner_movie_id]


`np.dot` will give us an expected rating.

I'm going to say if the expected rating is above 3.5 it means I'm predicting the user will like the movie. 

In [29]:
expected_rating = np.dot(user_profile, movie_profile) #compute the dot product between the row and column found in order to make our prediction.
print(f"------ Result for 10 factors:")
print(f"expected rating for {movies[movies['movieId']==movie]['title']} is {expected_rating}")

------ Result for 10 factors:
expected rating for 332    2001: A Space Odyssey
Name: title, dtype: object is -0.7451322818646982


In [30]:
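# Surprise's test() needs a testset (a list of ratings), not a Trainset, so we split one off here.
# Note: my_algorithm was fit on the full trainset above, so it has already seen these test ratings.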
my_train_dataset, my_test_dataset = train_test_split(my_dataset, test_size=0.3)

predictions = my_algorithm.test(my_test_dataset)

In [31]:
if expected_rating > 3.5:
    print("We predict that the user will like this movie! ")

We know the rating user 1 gave movie 2 (it's 3.0), so let's use this to demonstrate how we calculate ratings using these latent factors.

First, we grab the user profile and the movie profile. Our expected rating of this movie by this user is the dot product of these two profiles.

# Evaluation

The goal is a model that generalizes well. A machine learning model that reaches 100 per cent accuracy on its training data has most likely overfit rather than learned patterns that carry over to new data.
So we need to evaluate our model to get good accuracy while avoiding overfitting.

For that we need to optimize our hyperparameters so we can get good results.

We use `GridSearchCV` from Surprise's `model_selection` module to find the best hyperparameters for our model.

It is a cross-validation procedure: it splits the data into subsets (folds), evaluates every parameter combination on each fold, and returns the set of parameters that minimises the mean score across folds.
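
The procedure can be sketched without any recommender library. The code below is an illustration only: `k_fold_indices`, `fit_predict_shrunk_mean` and `grid_search_cv` are hypothetical helpers, the ratings are made up, and the "model" is a deliberately simple stand-in (a shrunken training mean) so that the search loop stays in focus.

```python
import numpy as np
from itertools import product

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def fit_predict_shrunk_mean(train_y, n_test, shrink):
    """Stand-in model (an assumption for illustration): predict a shrunken training mean."""
    return np.full(n_test, train_y.sum() / (len(train_y) + shrink))

def grid_search_cv(y, param_grid, n_folds=3):
    """Return (best_params, best_mean_rmse) over all parameter combinations."""
    folds = k_fold_indices(len(y), n_folds)
    names = list(param_grid)
    best_params, best_score = None, np.inf
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for k, test_idx in enumerate(folds):
            # Train on every fold except fold k, then score on fold k.
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
            preds = fit_predict_shrunk_mean(y[train_idx], len(test_idx), **params)
            fold_scores.append(np.sqrt(np.mean((y[test_idx] - preds) ** 2)))
        mean_score = float(np.mean(fold_scores))
        if mean_score < best_score:   # lower RMSE is better
            best_params, best_score = params, mean_score
    return best_params, best_score

toy_ratings = np.array([4.0, 3.5, 4.0, 4.0, 2.5, 2.0, 2.5, 3.5, 5.0, 3.0, 4.5, 1.0])
best_params, best_score = grid_search_cv(toy_ratings, {"shrink": [0.0, 1.0, 50.0]})
print(best_params, best_score)
```

Each parameter combination gets one mean score across the folds, and the combination with the lowest mean RMSE wins.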

Below I use 3 folds and RMSE as the accuracy measure.

RMSE: root mean squared error.

Its value is in the same unit as the target variable, which makes the loss easy to interpret.

The formula for RMSE is the square root of the mean squared error: `RMSE = sqrt(MSE(y_test, y_pred))`.
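
As a small worked example, the formula can be written directly in NumPy. The helper name `rmse` and the rating values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean of the squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up true and predicted ratings on the 0.5 to 5 scale.
y_test = [4.0, 3.5, 2.0, 5.0]
y_pred = [3.0, 3.5, 4.0, 4.0]

# Errors are 1, 0, -2, 1, so the mean squared error is (1 + 0 + 4 + 1) / 4 = 1.5.
score = rmse(y_test, y_pred)
print(score)
```

The result is expressed in rating points, so it reads as a typical prediction error on the rating scale.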


We do this to make sure that we are doing better for each iteration.

In [30]:
# Run for fast pass
# n_factors: the number of latent factors
# n_epochs: the number of iterations of the SGD procedure
# lr_all: the learning rate for all parameters
# reg_all: the regularization term for all parameters
param_grid = {'n_factors': [30], 'n_epochs': [35], 'lr_all': [0.001],
              'reg_all': [0.08]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(my_dataset)
algo = gs.best_estimator['rmse']
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

#Assigning values
t = gs.best_params
factors = t['rmse']['n_factors']
epochs = t['rmse']['n_epochs']
lr_value = t['rmse']['lr_all']
reg_value = t['rmse']['reg_all']

0.9615857978579115
{'n_factors': 30, 'n_epochs': 35, 'lr_all': 0.001, 'reg_all': 0.08}


The performance of recommender systems is often evaluated by usage, that is, by how users interact with the recommended items.


Matrix factorization is a collaborative filtering method that models the relationship between user and item entities.

Latent features (the factors linking the user and movie matrices) are learned in order to find similarity and make predictions based on both item and user entities.
The user and item factor matrices are obtained by minimizing the RMSE cost function over the known ratings.