<h1>1. BUSINESS PROBLEM </h1>

<p>
Build a recommender system to help predict whether someone will enjoy a movie based on past rating information. 
<br>For this project MovieLens 100k dataset is used which is publicy avaiable for research and analysis purposes

https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html
</p>    


In [47]:
# importing the necessary libraries
from surprise import Reader, Dataset
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
from datetime import datetime
from scipy import sparse
from scipy.sparse import csr_matrix
import os
import numpy as np
import pandas as pd
from surprise import Reader, Dataset,SVD
from sklearn.utils.extmath import randomized_svd
from sklearn.metrics import mean_squared_error
from scipy import sparse

In [44]:
# reading the data and performing a time based split
df = pd.read_csv(r'movielens_data.csv')
df = a.sort_values(ascending=True,by=['timestamp'])
_80_percent_mark = int(0.80*df.shape[0])
train_df = a[0:_80_percent_mark][['user','movie','rating']]
test_df = a[_80_percent_mark:][['user','movie','rating']]

In [48]:
# creating sparse matrix representation our dataframes
test_sparse_matrix = sparse.csr_matrix((test_df.rating.values, (test_df.user.values,
                                               test_df.movie.values)))

train_sparse_matrix = sparse.csr_matrix((train_df.rating.values, (train_df.user.values,
                                               train_df.movie.values)))

In [49]:
def check_sparsity(sparse_matrix):
    '''returns sparisty of matrix'''
    us,mv = sparse_matrix.shape
    elem = sparse_matrix.count_nonzero()
    sparsity = np.round((1-(elem/(us*mv))) * 100,4)
    return sparsity

In [50]:
# checking sparsity of train and test matrix
print("Sparsity of Train matrix : {} % | Sparsity of Test Matrix : {} %".format(check_sparsity(train_sparse_matrix),check_sparsity(test_sparse_matrix)))

Sparsity of Train matrix : 94.9646 % | Sparsity of Test Matrix : 98.7383 %


In [51]:
def initialize(dim):
    '''In this function, we will initialize bias value 'B' and 'C'.'''
    # initalize the value to zeros 
    # return output as a list of zeros 
    return np.zeros(dim)

<h1>2. RECOMMENDER SYSTEM FROM SCRATCH (COLLABORATIVE FILTERING) </h1>

$$
L = \min_{ b, c, \{ u_i \}_{i=1}^N, \{ v_j \}_{j=1}^M}
\quad
\alpha \Big(
    \sum_{j} \sum_{k} v_{jk}^2 
    + \sum_{i} \sum_{k} u_{ik}^2 
    + \sum_{i} b_i^2
    + \sum_{j} c_i^2
    \Big)
+ \sum_{i,j \in \mathcal{I}^{\text{train}}}
    (y_{ij} - \mu - b_i - c_j - u_i^T v_j)^2
$$


#### We will minimize the above cost function using the below gradient descent algorithm i.e computing gradients w.r.t user and movie biases and learning these biases during the training process

<pre>
for each epoch:

    for each pair of (user, movie):

        b_i =  b_i - learning_rate * dL/db_i

        c_j =  c_j - learning_rate * dL/dc_j

predict the ratings with formula
</pre>

$\hat{y}_{ij} = \mu + b_i + c_j + \text{dot_product}(u_i , v_j) $

In [52]:
# methods to compute gradients
def derivative_db(user_id,item_id,rating,U,V,mu,alpha):
    '''In this function, we will compute dL/db_i'''
    loss =  (2*alpha*b_i[user_id]) - 2*(rating - mu - b_i[user_id] - c_j[item_id] - np.dot(U[user_id],V.T[item_id]))
    return loss
def derivative_dc(user_id,item_id,rating,U,V,mu,alpha):
    '''In this function, we will compute dL/dc_j'''
    loss =  (2*alpha*c_j[item_id]) - 2*(rating - mu - b_i[user_id] - c_j[item_id] - np.dot(U[user_id],V.T[item_id]))
    return loss    

In [59]:
# user and movie biases arrays are intialized (size equal to total users,movies in train matrix)
dim= train_sparse_matrix.shape[0]
b_i=initialize(dim)
dim= train_sparse_matrix.shape[1]
c_j=initialize(dim)

#### FIRST LET US GET A BASELINE RMSE USING GOLBAL AVERAGE RATING TO COMPARE OUR MODELS 

In [60]:
def get_baseline_rmse(train_df,test_df):
    mu = train_df['rating'].mean() # computing global average train data
    # train rmse for random model 
    y_true_train = train_df['rating'].tolist()
    y_pred_train = [mu]*train_df.shape[0]
    train_mse = mean_squared_error(y_true_train,y_pred_train)
    # test rmse for random model
    y_true_test = test_df['rating'].tolist()
    y_pred_test = np.random.randint(1,5,test_df.shape[0])
    y_pred_train = [mu]*test_df.shape[0]
    test_mse = mean_squared_error(y_true_test,y_pred_test)
    print("=============================GOLBAL MEAN RATING MODEL=========================")
    print("Train RMSE is : {} and Test RMSE : {}".format(train_mse,test_mse))

In [61]:
get_baseline_rmse(train_df,test_df)

Train RMSE is : 1.2709884775 and Test RMSE : 3.6814


In [62]:
def get_prediction(df,b_i,c_j,mu):
    '''calculates net rmse'''
    y_true = []
    y_pred = []
    for user,movie,rate in df[['user','movie','rating']].values:
        try:
            y_hat = mu + b_i[user] + c_j[movie] + np.dot(U1[user],V1.T[movie])
        except:
            # handling cold start problem assigning global average for test users/movies not in training set
            y_hat = mu
        y_true.append(rate)
        y_pred.append(y_hat)
    return mean_squared_error(y_true,y_pred)    

In [66]:
def fit_recommender(total_epochs,learning_rate,train_sparse_matrix,train_df,test_df,svd_components = 5):
    '''learns parameters for the recommednder'''
    mu = train_df['rating'].mean() # global average rating in train data
    total_train_mse = []
    total_test_mse = []
    U1, Sigma, V1 = randomized_svd(train_sparse_matrix, n_components=svd_components,n_iter=2, random_state=24)
    total_train_mse = []
    total_test_mse = []
    alpha = 10
    for epoch in range(total_epochs):
        for user,movie,rate in train_df[['user','movie','rating']].values:
            b_i[user] = b_i[user] - learning_rate *  derivative_db(user,movie,rate,U1,V1,mu,alpha) 
            c_j[movie] = c_j[movie] - learning_rate *  derivative_dc(user,movie,rate,U1,V1,mu,alpha)
        train_error = get_prediction(train_df,b_i,c_j,mu)
        test_error = get_prediction(test_df,b_i,c_j,mu)
        total_train_mse.append(train_error)
        total_test_mse.append(test_error)
        print("After Epoch {}------Train rmse:{}  Test rmse:{}".format(epoch,train_error,test_error))
        print()
        print("=======================================================================================")

### This is the model based recommender system trained to learn user and movie biases, we can clearly see the difference in the test RMSE, hence our model did pick up some useful user/movie specific patterns during the training

In [67]:
fit_recommender(total_epochs = 5, learning_rate = 0.01, 
                train_sparse_matrix = train_sparse_matrix, 
                train_df = train_df,test_df=test_df,svd_components = 5)

After Epoch 0------Train rmse:1.2709884775  Test rmse:1.2524334124999998

After Epoch 1------Train rmse:1.2709884775  Test rmse:1.2524334124999998

After Epoch 2------Train rmse:1.2709884775  Test rmse:1.2524334124999998

After Epoch 3------Train rmse:1.2709884775  Test rmse:1.2524334124999998

After Epoch 4------Train rmse:1.2709884775  Test rmse:1.2524334124999998



<h1>3. RECOMMENDER SYSTEM USING SURPRISE LIBRARY </h1>

here we are doing the same thing using an already available library called surprise , except now we are learning the user,movie latent vectors along with the user and movie biases during training as well i.e we are performing matrix factorization using our training data, this is more robust and less prone to overfitting. Here is the offical documentation of surprise library to get a better understanding
https://surprise.readthedocs.io/en/stable/matrix_factorization.html

In [68]:
# methods to get ratings and compute errors using surprise
def get_ratings(predictions):
    actual = np.array([pred.r_ui for pred in predictions])
    pred = np.array([pred.est for pred in predictions])
    return actual, pred

def get_errors(predictions, print_them=False):
    actual, pred = get_ratings(predictions)
    rmse = np.sqrt(np.mean((pred - actual)**2))
    return rmse

In [69]:
reader = Reader(rating_scale=(1,5))
# create the traindata from the dataframe...
train_data = Dataset.load_from_df(train_df[['user', 'movie', 'rating']], reader)
# build the trainset from traindata.., It is of dataset format from surprise library..
trainset = train_data.build_full_trainset() 
testset = list(zip(test_df.user.values, test_df.movie.values, test_df.rating.values))

In [70]:
def run_surprise(surprise_algo,trainset):
    svd.fit(trainset)
    train_preds = svd.test(trainset.build_testset())
    train_actual_ratings, train_pred_ratings = get_ratings(train_preds)
    train_rmse = get_errors(train_preds) 
    test_preds = svd.test(testset)
    test_actual_ratings, test_pred_ratings = get_ratings(test_preds)
    test_rmse = get_errors(test_preds)    
    print("Train rmse : {}  Test rmse : {}".format(train_rmse,test_rmse))    

### We can clearly see the test rmse improved further, hence learning the user-movie latent vectors as well during training proved to be useful

In [71]:
svd = SVD(n_factors=5, biased=True, random_state=15, verbose=True,n_epochs=5)
run_surprise(surprise_algo = svd,trainset = trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Train rmse : 0.9319215610883156  Test rmse : 1.0371686201507941


<h1>4. SUMMARY and CONCLUSION</h1>

| Recommender System Type | TEST RMSE |
| --- | --- |
| Global Average | 3.68 |
|Model-1         |1.25  |
|Model-2         |1.03   |  

* Here __Model-1__ is the our custom recommender system where we learnt these __user,movie biases__ from the training data. These biases can be thought of as values depicting a user/movie properties for example a customer who is very critical and usually gives lesser ratings as compared to other customers could have a negative user bias like -0.05, similarly some movies are popular and tend to have high rating hence movie biases for some movies would to high therefore these biases help in accomodating these user/movie specific properties and help in refining the predictions, hence Model-1 has better RMSE than a simple average prediction model
<br>
<br>
* In __Model_2__ (trained using surprise) we have also learnt these user,movie latent vectors hence this helped in refining the prediction further by incorporrating the factor of a user-movie interaction in the form of a dot product b/w user movie vectors.
<br>
<br>
* This is just some baseline models,there are still many scope of refinements
   * Using a bigger training set
   * There are some users/movies in our test data who do not appear in training at all, hence for these users our collabortive model can't say anything since there aren't any past interactions avaialable, this is called a __cold start problem__ and to overcome this we could introduce some additional user/movie like features like age of person, genre of movie,etc basically using a hybrid appraoch by combining Content Based and Collobarative Based approached
