In this notebook, you will get an introductory acquaintance with baseline recommendation algorithms and evaluation methods.
The dataset comes from MovieLens, a collection of users' movie ratings that is widely used for implementing and testing recommender systems. For this lab we use the MovieLens 100K dataset, which contains 100,000 movie ratings from 943 users on a selection of 1682 movies. Recommendation research usually relies on a larger version, MovieLens 20M.
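To put these numbers in perspective, the short calculation below (a sketch that uses only the counts quoted above) shows how sparse the resulting user-item matrix is.

```python
# Counts quoted in the dataset description above.
num_ratings = 100_000
num_users = 943
num_items = 1682

# Fraction of the user-item matrix that holds an observed rating.
density = num_ratings / (num_users * num_items)
print(f"Matrix density: {density:.2%}")  # roughly 6.3%
print(f"Average ratings per user: {num_ratings / num_users:.1f}")
print(f"Average ratings per item: {num_ratings / num_items:.1f}")
```

Only about one in sixteen user-item pairs carries a rating, which is why every model below has to predict values for unobserved pairs.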
First, we import the necessary packages.


In [None]:
# import required libraries
!pip install wget
import os
import os.path
import numpy as np
import pandas as pd
from math import sqrt
from heapq import nlargest
from tqdm import trange
from tqdm import tqdm
import scipy
from scipy import stats
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import wget



Load the dataset and take a quick look at its statistics.

In [None]:
MOVIELENS_DIR = "/ssd003/projects/aieng/public/recsys_datasets/movielens/ml-100k"

In [None]:
!ls {MOVIELENS_DIR}

In [None]:
def getData(folder_path, file_name):
    fields = ['userID', 'itemID', 'rating', 'timestamp']
    data = pd.read_csv(os.path.join(folder_path, file_name), sep='\t', names=fields)
    return data 

In [None]:
rating_df = getData(MOVIELENS_DIR, 'u.data')

In [None]:
rating_df_train = getData(MOVIELENS_DIR, 'u1.base')
rating_df_test = getData(MOVIELENS_DIR, 'u1.test')

In [None]:
rating_df_train.head()

In [None]:
rating_df_test.head()

In [None]:
print("Number of users in rating df:", len(rating_df.userID.unique()))
print("Number of items in rating df:", len(rating_df.itemID.unique()))
print("Number of users in train df:", len(rating_df_train.userID.unique()))
print("Number of items in train df:", len(rating_df_train.itemID.unique()))
print("Number of users in test df:", len(rating_df_test.userID.unique()))
print("Number of items in test df:", len(rating_df_test.itemID.unique()))

Data in recommendation systems is usually encoded as a DataFrame with three or more columns: (user, item, rating, plus additional metadata if present). Here, we implement the function dataPreprocessor, which takes the DataFrame, the total number of users, and the total number of items, and outputs a user-item utility matrix as demonstrated in the tutorial. See the function comments for more guidance. All of the following experiments use dataPreprocessor.


In [None]:
def dataPreprocessor(rating_df, num_users, num_items):
    """
        INPUT: 
            rating_df: pandas DataFrame. columns=['userID', 'itemID', 'rating' ...]
            num_users: int. number of users
            num_items: int. number of items
            
        OUTPUT:
            matrix: 2D numpy array of shape (num_users, num_items). 0 marks an unrated pair.
            
        NOTE 1: see where something very similar is done in the lab in function 'buildUserItemMatrix'    
            
        NOTE 2: data can have more columns, but your function should ignore 
              additional columns.
    """
    matrix = np.zeros((num_users, num_items), dtype=np.int8)
    # Select only the three needed columns so that any extra columns are ignored.
    for (userID, itemID, rating) in rating_df[['userID', 'itemID', 'rating']].itertuples(index=False):
        # IDs are 1-based, matrix indices are 0-based.
        matrix[userID-1, itemID-1] = rating
    return matrix
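The same utility matrix can also be built without a Python loop. The sketch below is only an illustration, not the lab's reference solution: `build_utility_matrix` and the toy DataFrame are hypothetical names and made-up data, and the approach relies on NumPy integer-array indexing.

```python
import numpy as np
import pandas as pd

def build_utility_matrix(rating_df, num_users, num_items):
    """Hypothetical vectorized variant of dataPreprocessor (illustrative only)."""
    matrix = np.zeros((num_users, num_items), dtype=np.int8)
    # IDs are 1-based in MovieLens, so shift them to 0-based row/column indices.
    rows = rating_df['userID'].to_numpy() - 1
    cols = rating_df['itemID'].to_numpy() - 1
    matrix[rows, cols] = rating_df['rating'].to_numpy()
    return matrix

# Toy data (made up for illustration): 3 users, 4 items.
toy_df = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'itemID': [1, 3, 2, 4],
    'rating': [5, 3, 4, 1],
    'timestamp': [0, 0, 0, 0],  # extra column, ignored by the function
})
toy_matrix = build_utility_matrix(toy_df, num_users=3, num_items=4)
print(toy_matrix)
```

Each row of the result is one user and each column one item, with 0 standing for "not rated".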

Two baseline recommender models are implemented in this class. The first, useraverage, predicts a user's rating for an unseen item as the average of the ratings that user has already given. The second, popularity, scores each item by its popularity (here, the fraction of its raters who gave it a rating of 4 or more) and recommends the most popular items.

In [None]:
class BaseLineRecSys(object):
    def __init__(self, method, processor=dataPreprocessor):
        """
            method: string. From ['popularity','useraverage']
            processor: function name. dataPreprocessor by default
        """
        self.method_name = method
        self.method = self._getMethod(self.method_name)
        self.processor = processor
        self.pred_column_name = self.method_name
        
    def _getMethod(self, method_name):
        """
            Don't change this
        """
        switcher = {
            'popularity': self.popularity,
            'useraverage': self.useraverage,
        }
        
        return switcher[method_name]
    
    @staticmethod
    def useraverage(train_matrix, num_users, num_items):
        """
            INPUT:
                train_matrix: 2D numpy array.
                num_users: int. Number of Users.
                num_items: int. Number of Items.
            OUTPUT:
                predictionMatrix: 2D numpy array.
                
            NOTE: see where something very similar is done in the lab in function 'predictByUserAverage'    
        """
        
        # Initialize the predicted rating matrix with zeros
        predictionMatrix = np.zeros((num_users, num_items))

        # Predict each unrated (user, item) pair with the user's mean observed rating
        for (user,item), rating in np.ndenumerate(train_matrix):
          if rating==0:
            userVector = train_matrix[user, :]
            ratedItems = userVector[userVector.nonzero()]
            if ratedItems.size == 0:
              itemAvg=0
            else:
              itemAvg= ratedItems.mean()
            predictionMatrix[user, item] = itemAvg
          #if (user % 100 == 0 and item == 1):
          #  print ("calculated %d users" % (user,))
        return predictionMatrix
    
    @staticmethod
    def popularity(train_matrix, num_users, num_items):
        """
            INPUT:
                train_matrix: 2D numpy array.
                num_users: int. Number of Users.
                num_items: int. Number of Items.
            OUTPUT:
                predictionMatrix: 2D numpy array.
                
            NOTE: see where something very similar is done in the lab in function 'predictByPopularity'    
        """
        # Initialize the predicted rating matrix with zeros
        predictionMatrix = np.zeros((num_users, num_items))

        # Item popularity: fraction of the users who rated an item that liked it (rating >= 4)
        vf = np.vectorize(lambda x: 1 if x >= 4 else 0)
        itemPopularity = np.zeros((num_items))
        for item in range(num_items):
          numOfUsersRated = len(train_matrix[:, item].nonzero()[0])
          numOfUsersLiked = len(vf(train_matrix[:, item]).nonzero()[0])
          if numOfUsersRated == 0:
            itemPopularity[item] = 0
          else:
            itemPopularity[item] = numOfUsersLiked/numOfUsersRated
        for (user,item), rating in np.ndenumerate(train_matrix):
          if rating==0:
            predictionMatrix[user, item] = itemPopularity[item]
         # if (user % 100 == 0 and item == 1):
         #   print ("calculated %d users" % (user,))

        return predictionMatrix    
    
    def predict_all(self, train_df, num_users, num_items):
        
        train_matrix = self.processor(train_df, num_users, num_items)
        self.__model = self.method(train_matrix, num_users, num_items)
        
    def evaluate_test(self, test_df, copy=False):
        
        if copy:
            prediction = test_df.copy()
        else:
            prediction = test_df
            
        prediction[self.pred_column_name] = np.nan
        
        for (index, 
             userID, 
             itemID) in tqdm(prediction[['userID','itemID']].itertuples()):
            prediction.loc[index, self.pred_column_name] = self.__model[userID-1, itemID-1]

        return prediction
        
    def getModel(self):
        """
            return predicted user-item matrix
        """
        return self.__model
    
    def getPredColName(self):
        """
            return prediction column name
        """
        return self.pred_column_name
    
    def reset(self):
        """
            reuse the instance of the class by removing model
        """
        self.__model = None
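Both baselines can be expressed with a few array operations. The following is a minimal sketch on a made-up 3x3 matrix, assuming the same conventions as the class above (0 means unrated, a rating of 4 or more counts as a like, and predictions are filled in only for unrated pairs). The variable names are illustrative and not part of the lab code.

```python
import numpy as np

# Toy utility matrix (made up): rows are users, columns are items, 0 = unrated.
toy_train = np.array([
    [5, 4, 0],
    [0, 2, 0],
    [4, 0, 0],
])
rated = toy_train != 0

# User average: mean of each user's observed ratings (0 if the user rated nothing).
counts_per_user = rated.sum(axis=1)
user_mean = toy_train.sum(axis=1) / np.maximum(counts_per_user, 1)

# Popularity: fraction of an item's raters who gave it 4 or more.
raters_per_item = rated.sum(axis=0)
likes_per_item = (toy_train >= 4).sum(axis=0)
item_popularity = likes_per_item / np.maximum(raters_per_item, 1)

# As in the class above, predictions are filled in only for unrated pairs.
user_avg_pred = np.where(rated, 0.0, user_mean[:, None])
popularity_pred = np.where(rated, 0.0, item_popularity[None, :])
print(user_avg_pred)
print(popularity_pred)
```

Note that the popularity scores lie in [0, 1] rather than on the 1-5 rating scale, so they are meaningful for ranking but not for RMSE.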

In [None]:
popularity_recsys = BaseLineRecSys('popularity')


In [None]:
popularity_recsys.predict_all(rating_df_train,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))


In [None]:
x = popularity_recsys.getModel()


In [None]:
np.all(x<=1)


In [None]:
rating_df_test.head()


In [None]:
popularity_recsys.evaluate_test(rating_df_test,copy=True).head()


In [None]:
average_user_rating_recsys = BaseLineRecSys('useraverage')


In [None]:
average_user_rating_recsys.predict_all(rating_df_train, len(rating_df.userID.unique()), len(rating_df.itemID.unique()))


In [None]:
average_user_rating_recsys.getModel()


In [None]:
average_user_rating_recsys.evaluate_test(rating_df_test,copy=True).head()


The class SimBasedRecSys provides three similarity functions (cosine, Euclidean, and Manhattan distance). These metrics are used to measure user-user and item-item similarities in collaborative filtering.

In [None]:
class SimBasedRecSys(object):

    def __init__(self, base, method, processor=dataPreprocessor):
        """
            base: string. From ['user', 'item']. User-based Similarity or Item-based
            method: string. From ['cosine', 'euclidean', 'somethingelse']
            processor: function name. dataPreprocessor by default
        """
        self.base = base
        self.method_name = method
        self.method = self._getMethod(self.method_name)
        self.processor = processor
        self.pred_column_name = self.base+'-'+self.method_name
    
    def _getMethod(self, method_name):
        """
            Don't change this
        """
        switcher = {
            'cosine': self.cosine,
            'euclidean': self.euclidean,
            'somethingelse': self.somethingelse,
        }
        
        return switcher[method_name]
    
    @staticmethod
    def cosine(matrix):
        """
            cosine similarity
        """
        similarity_matrix = 1 - pairwise_distances(matrix, metric='cosine')
        return similarity_matrix
    
    @staticmethod
    def euclidean(matrix):
        """
            euclidean similarity
        """
 
        similarity_matrix = 1 / (pairwise_distances(matrix, metric='euclidean')+1)  
        
        return similarity_matrix
    
    @staticmethod
    def somethingelse(matrix):
        """
            manhattan
        """
        similarity_matrix = 1 /(pairwise_distances(matrix, metric='l1')+1)      
        return similarity_matrix
        
    def predict_all(self, train_df, num_users, num_items):
        """
            INPUT: 
                train_df: pandas DataFrame. columns=['userID', 'itemID', 'rating'...]
                num_users: scalar. number of users
                num_items: scalar. number of items
            OUTPUT:
                no return... this method assigns the result to self.__model
            
            NOTES:
                self.__model should contain predictions for *all* user and items
                (don't worry about predicting for observed (user,item) pairs,
                 since we won't be using these predictions in the evaluation)
                (see code in for an efficient vectorized example)
        """
        train_matrix = self.processor(train_df, num_users, num_items)
        
        if self.base == 'user':
            temp_matrix = np.zeros(train_matrix.shape)
            temp_matrix[train_matrix.nonzero()] = 1
            # Use the similarity function selected in the constructor
            uu_similarity=self.method(train_matrix)
            normalizer = np.matmul(uu_similarity, temp_matrix)
            normalizer[normalizer == 0] = 1e-5
            predictionMatrix = np.matmul(uu_similarity, train_matrix)/normalizer
            useraverage = np.sum(train_matrix, axis=1)/np.sum(temp_matrix, axis=1)
            columns = np.sum(predictionMatrix, axis=0)
            predictionMatrix[:, columns==0] = predictionMatrix[:, columns==0] + np.expand_dims(useraverage, axis=1)

            
        elif self.base == 'item':

            train_matrix_item=np.transpose(train_matrix)
            temp_matrix = np.zeros(train_matrix_item.shape)
            temp_matrix[train_matrix_item.nonzero()] = 1
            # Use the similarity function selected in the constructor
            ii_similarity=self.method(train_matrix_item)
            normalizer = np.matmul(ii_similarity, temp_matrix)
            normalizer[normalizer == 0] = 1e-5
            predictionMatrix = np.matmul(ii_similarity, train_matrix_item)/normalizer
            temp_matrix[temp_matrix==0] = 1e-5
            itemaverage = np.sum(train_matrix_item, axis=1)/np.sum(temp_matrix, axis=1)
            columns = np.sum(predictionMatrix, axis=0)
            predictionMatrix[:, columns==0] = predictionMatrix[:, columns==0] + np.expand_dims(itemaverage, axis=1)
            predictionMatrix=np.transpose(predictionMatrix)
        else:
            # Fail early: without a prediction matrix there is nothing to store
            raise ValueError("base must be 'user' or 'item'")
        self.__model=predictionMatrix
        
    def evaluate_test(self, test_df, copy=False):
        """
            INPUT:
                data: pandas DataFrame. columns=['userID', 'itemID', 'rating'...]
            OUTPUT:
                predictions:  pandas DataFrame. 
                              columns=['userID', 'itemID', 'rating', 'base-method'...]
                              
            NOTE: 1. data can have more columns, but your function should ignore 
                  additional columns.
                  2. 'base-method' depends on your 'base' and 'method'. For example,
                  if base == 'user' and method == 'cosine', 
                  then base-method == 'user-cosine'
                  3. your predictions go to 'base-method' column
        """
        if copy:
            prediction = test_df.copy()
        else:
            prediction = test_df
        prediction[self.pred_column_name] = np.nan
        
        for (index, 
             userID, 
             itemID) in tqdm(prediction[['userID','itemID']].itertuples()):
            prediction.loc[index, self.pred_column_name] = self.__model[userID-1, itemID-1]
    
        return prediction
    
    def getModel(self):
        """
            return predicted user-item matrix
        """
        return self.__model
    
    def getPredColName(self):
        """
            return prediction column name
        """
        return self.pred_column_name
    
    def reset(self):
        """
            reuse the instance of the class by removing model
        """
        self.__model = None

In [None]:
# Examples of how to call similarity functions.
I = np.eye(3)
SimBasedRecSys.cosine(I)

In [None]:
SimBasedRecSys.euclidean(I)


In [None]:
SimBasedRecSys.somethingelse(I)
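To make the three calls above concrete, the sketch below recomputes the same quantities directly with `pairwise_distances` on the 3x3 identity matrix. It assumes the same conversions as the class: cosine similarity is one minus the cosine distance, and a distance d is turned into a similarity through 1 / (1 + d).

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

I = np.eye(3)

# Cosine: 1 - cosine distance. Orthogonal vectors get similarity 0.
cos_sim = 1 - pairwise_distances(I, metric='cosine')

# Distance-based similarities: 1 / (1 + d), so d = 0 maps to similarity 1.
euc_sim = 1 / (1 + pairwise_distances(I, metric='euclidean'))
man_sim = 1 / (1 + pairwise_distances(I, metric='l1'))

print(cos_sim[0, 1])  # orthogonal rows: similarity 0
print(euc_sim[0, 1])  # 1 / (1 + sqrt(2)), about 0.414
print(man_sim[0, 1])  # 1 / (1 + 2), about 0.333
```

All three matrices have ones on the diagonal, since every row is at distance 0 from itself.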


## Collaborative Filtering
We have implemented vectorized versions of collaborative filtering since loop-based versions will take excessively long to run.
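The vectorized prediction can be read as a similarity-weighted average: for user u and item i, the similarities to all users are multiplied by their ratings of i and summed, then divided by the total similarity of the users who actually rated i. The sketch below checks this on a made-up 3x3 matrix by comparing the matrix form with an explicit loop. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Toy utility matrix (made up): 3 users x 3 items, 0 = unrated.
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
])
observed = (R != 0).astype(float)

# User-user cosine similarity.
S = 1 - pairwise_distances(R, metric='cosine')

# Weighted sum of neighbours' ratings, normalised by the similarity mass of
# the neighbours who actually rated each item.
numerator = S @ R
denominator = S @ observed
denominator[denominator == 0] = 1e-5  # avoid division by zero
pred = numerator / denominator

# Same prediction for user 0, item 2, written out as an explicit loop.
u, i = 0, 2
num = sum(S[u, v] * R[v, i] for v in range(R.shape[0]))
den = sum(S[u, v] for v in range(R.shape[0]) if R[v, i] != 0)
print(pred[u, i], num / den)
```

Because the prediction is a weighted average of the observed ratings for the item, it always lies between the smallest and largest of them (here between 1 and 2).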

In [None]:
user_cosine_recsys = SimBasedRecSys('user','cosine')


In [None]:
user_cosine_recsys.predict_all(rating_df_train,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))


In [None]:
user_cosine_recsys.getModel()


In [None]:
rating_df_test.head()


In [None]:
user_cosine_recsys.evaluate_test(rating_df_test,copy=True).head()


CrossValidation will be used to report comparative RMSE results (averages and confidence intervals) between user-user and item-item based collaborative filtering for cosine similarity.
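The average and confidence interval are computed from the five per-fold scores with a Student t interval. The sketch below shows the calculation in isolation. The fold scores are placeholders made up for illustration, not results from this notebook.

```python
import numpy as np
from scipy import stats

# Placeholder per-fold scores, made up purely for illustration.
fold_scores = [1.02, 1.01, 1.03, 1.00, 1.02]

mean = np.mean(fold_scores)
sem = stats.sem(fold_scores)   # standard error of the mean
dof = len(fold_scores) - 1     # degrees of freedom for 5 folds
ci_low, ci_high = stats.t.interval(0.95, dof, loc=mean, scale=sem)
print(f"mean={mean:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

With only five folds the t distribution has heavy tails, so the interval is noticeably wider than a normal approximation would give.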

In [None]:
class CrossValidation(object):
    def __init__(self, metric, data_path=MOVIELENS_DIR):
        """
            INPUT:
                metric: string. from['RMSE','P@K','R@K','RPrecision']
                data_path: string. folder that contains the u1..u5 .base/.test folds
        """
        self.folds = self._getData(data_path)
        self.metric_name = metric
        self.metric = self._getMetric(self.metric_name)
        
    def _getMetric(self, metric_name):
        """
            Don't change this
        """
        switcher = {
            'RMSE': self.rmse,
            'P@K': self.patk,
            'R@K': self.ratk,
            'RPrecision': self.rprecision
        }
        
        return switcher[metric_name]
    
    @staticmethod
    def rmse(data, k, num_users, num_items, pred, true='rating'):
        """
            data: pandas DataFrame. 
            pred: string. Column name that corresponds to the prediction
            true: string. Column name that corresponds to the true rating
        """
        return sqrt(mean_squared_error(data[pred], data[true]))
    
    # Precision at k
    def patk(self, data, k, num_users, num_items, pred, true='rating'):
        """
            data: pandas DataFrame. 
            k: number of top-ranked items retrieved
            pred: string. Column name that corresponds to the prediction
            true: string. Column name that corresponds to the true rating
        """
        prediction = self.getMatrix(data, num_users, num_items, pred)
        testSet =  self.getMatrix(data, num_users, num_items, true)
    
        # Initialize sum and count vars for average calculation
        sumPrecisions = 0
        countPrecisions = 0

        # Define function for converting 1-5 rating to 0/1 (like / don't like)
        vf = np.vectorize(lambda x: 1 if x >= 4 else 0)

        for userID in range(num_users):
            # Pick top K based on predicted rating
            userVector = prediction[userID,:]
            topK = nlargest(k, range(len(userVector)), userVector.take)

            # Convert test set ratings to like / don't like
            userTestVector = vf(testSet[userID,:]).nonzero()[0]

            # Calculate precision
            precision = float(len([item for item in topK if item in userTestVector]))/len(topK)

            # Update sum and count
            sumPrecisions += precision
            countPrecisions += 1

        # Return average P@k
        return float(sumPrecisions)/countPrecisions
    
    # Recall at k
    def ratk(self, data, k, num_users, num_items, pred, true='rating'):
        """
            data: pandas DataFrame. 
            k: number of top-ranked items retrieved
            pred: string. Column name that corresponds to the prediction
            true: string. Column name that corresponds to the true rating
        """
        prediction = self.getMatrix(data, num_users, num_items, pred)
        testSet =  self.getMatrix(data, num_users, num_items, true)
        # Initialize sum and count vars for average calculation
        sumRecalls = 0
        countRecalls = 0

        # Define function for converting 1-5 rating to 0/1 (like / don't like)
        vf = np.vectorize(lambda x: 1 if x >= 4 else 0)

        for userID in range(num_users):
            # Pick top K based on predicted rating
            userVector = prediction[userID,:]
            topK = nlargest(k, range(len(userVector)), userVector.take)

            # Convert test set ratings to like / don't like
            userTestVector = vf(testSet[userID,:]).nonzero()[0]

            # Ignore user if has no ratings in the test set
            if (len(userTestVector) == 0):
                continue

            # Calculate recall
            recall = float(len([item for item in topK if item in userTestVector]))/len(userTestVector)

            # Update sum and count
            sumRecalls += recall
            countRecalls += 1

        # Return average R@k
        return float(sumRecalls)/countRecalls

    def rprecision(self, data, k, num_users, num_items, pred, true='rating'):
        """
            data: pandas DataFrame.
            k: unused. The cutoff R is the number of liked items in the user's test set
            pred: string. Column name that corresponds to the prediction
            true: string. Column name that corresponds to the true rating
        """
        prediction = self.getMatrix(data, num_users, num_items, pred)
        testSet = self.getMatrix(data, num_users, num_items, true)
        # Initialize sum and count vars for average calculation
        sumRPs = 0
        countRPs = 0

        # Define function for converting 1-5 rating to 0/1 (like / don't like)
        vf = np.vectorize(lambda x: 1 if x >= 4 else 0)

        for userID in range(num_users):
            # Predicted ratings for this user
            userVector = prediction[userID, :]

            # Convert test set ratings to like / don't like
            userTestVector = vf(testSet[userID, :]).nonzero()[0]

            # Ignore user if has no ratings in the test set
            if (len(userTestVector) == 0):
                continue

            # Pick the top R items by predicted rating, where R is the number of liked test items
            topK = nlargest(len(userTestVector), range(len(userVector)), userVector.take)
            # Calculate R-precision
            rp = float(len([item for item in topK if item in userTestVector])) / len(userTestVector)

            # Update sum and count
            sumRPs += rp
            countRPs += 1

        # Return average R-precision
        return float(sumRPs) / countRPs

    @staticmethod
    def getMatrix(rating_df, num_users, num_items, column_name):
        matrix = np.zeros((num_users, num_items))
    
        for (index, userID, itemID, value) in rating_df[['userID','itemID', column_name]].itertuples():
            matrix[userID-1, itemID-1] = value
            
        return matrix
    
    @staticmethod
    def _getData(data_path):
        """
            Don't change this function
        """
        folds = []
        data_types = ['u{0}.base','u{0}.test']
        for i in range(1,6):
            train_set = getData(data_path, data_types[0].format(i))
            test_set = getData(data_path, data_types[1].format(i))
            folds.append([train_set, test_set])
        return folds
    
    def run(self, algorithms, num_users, num_items, k=1):
        """
            5-fold cross-validation
            algorithms: list. a list of algorithms. 
                        eg: [user_cosine_recsys, item_euclidean_recsys]
        """
        
        scores = {}
        for algorithm in algorithms:
            print('Processing algorithm {0}'.format(algorithm.getPredColName()))
            fold_scores = []
            for fold in self.folds:
                algorithm.reset()
                algorithm.predict_all(fold[0], num_users, num_items)
                prediction = algorithm.evaluate_test(fold[1])
                pred_col = algorithm.getPredColName()
                fold_scores.append(self.metric(prediction, k, num_users, num_items, pred_col))
                
            mean = np.mean(fold_scores)
            ci_low, ci_high = stats.t.interval(0.95, len(fold_scores)-1, loc=mean, scale=stats.sem(fold_scores))
            scores[algorithm.getPredColName()] = [fold_scores, mean, ci_low, ci_high]
            
        results = scores    
    
        return results
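For a single user, precision at k and recall at k reduce to counting how many of the k top-ranked items the user actually liked. The sketch below walks through one made-up user with the same conventions as the class (a rating of 4 or more counts as a like, and `nlargest` selects the top k). The numbers are illustrative only.

```python
import numpy as np
from heapq import nlargest

# Toy example for one user (values made up): predicted scores and held-out ratings.
predicted = np.array([4.8, 2.1, 4.5, 3.9, 1.0])
test_ratings = np.array([5, 0, 2, 4, 0])   # 0 = not in the test set

k = 2
# Indices of the k highest predicted scores.
top_k = nlargest(k, range(len(predicted)), predicted.take)
# Items the user actually liked (rating >= 4).
liked = set(np.flatnonzero(test_ratings >= 4))

hits = len([item for item in top_k if item in liked])
precision_at_k = hits / k
recall_at_k = hits / len(liked)
print(top_k, precision_at_k, recall_at_k)
```

Items 0 and 2 are recommended, items 0 and 3 are liked, so there is one hit: precision is 1/2 and recall is 1/2.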
      

In [None]:
# 1. gather your algorithms in previous steps.
item_cosine_recsys = SimBasedRecSys('item','cosine')
algorithm_instances = [user_cosine_recsys,
                       item_cosine_recsys]

In [None]:
# 2. Instantiate a CrossValidation instance and assign the measurement that you want to use
# RMSE, P@K, R@K, RPrecision
# RMSE in this example (despite its name, cv_patk computes RMSE here)
cv_patk = CrossValidation('RMSE')

In [None]:

# 3. Run CV by giving:
#    1> algorithms just gathered
#    2> number of users in the full dataset
#    3> number of items in the full dataset
#    4> a K value, which only P@K and R@K use; RMSE ignores k
# Results include independent results from 5 folds, their mean, and confidence interval.
cv_patk.run(algorithm_instances,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()),k=5)

In [None]:
all_ratings=rating_df_train.shape[0]
avg_Ratings_peruser=all_ratings/len(rating_df.userID.unique())
avg_Ratings_peritem=all_ratings/len(rating_df.itemID.unique())
print("average ratings per user is:"+str(avg_Ratings_peruser))
print("average ratings per item is:"+str(avg_Ratings_peritem))

Considering the RMSE values, user-user cosine similarity gives 1.0173541216605808 and item-item cosine similarity gives 1.020082900106248. Item-item similarity is generally expected to perform better, since item attributes vary less than users' tastes. Here, however, user-user similarity yields the lower RMSE. The difference is small, and a proper statistical test might show that it is not statistically significant. One possible explanation for this small difference is the average number of ratings per user and per item computed above: the average number of ratings per user is about double the average per item. We therefore have more information about each user than about each item, so similarities between users can be more informative and result in a lower RMSE.
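One way to check whether such a difference is statistically significant is a paired t-test on the per-fold scores, since both methods are evaluated on the same five folds. The sketch below only illustrates the mechanics: the two lists are placeholders made up for this example, and in practice they would be the per-fold lists returned by `run`.

```python
from scipy import stats

# Placeholder per-fold RMSE values, made up purely for illustration;
# in practice these would be the first entry of each list returned by run().
user_fold_rmse = [1.026, 1.021, 1.013, 1.009, 1.018]
item_fold_rmse = [1.030, 1.022, 1.017, 1.011, 1.021]

# The folds are shared, so a paired test compares the two methods fold by fold.
t_stat, p_value = stats.ttest_rel(user_fold_rmse, item_fold_rmse)
print(f"t={t_stat:.3f}, p={p_value:.3f}")
```

A small p-value would suggest that the gap is consistent across folds. With only five folds, such a test has limited power and should be read with caution.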

## Performance Comparison
Here, using the CrossValidation class, we compare all the recommenders.

In [None]:
item_cosine_recsys = SimBasedRecSys('item','cosine')
algorithm_instances = [popularity_recsys, 
                       average_user_rating_recsys, 
                       user_cosine_recsys,
                       item_cosine_recsys]
cv_patk = CrossValidation('P@K')  # the earlier cv_patk instance was created with 'RMSE'
cv_patk_output=cv_patk.run(algorithm_instances,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()),k=5)

In [None]:
cv_rmse = CrossValidation('RMSE')
cv_rmse_output=cv_rmse.run(algorithm_instances,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()),k=5)
cv_ratk = CrossValidation('R@K')
cv_ratk_output=cv_ratk.run(algorithm_instances,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()),k=5)
cv_rprec = CrossValidation('RPrecision')
cv_rprec_output=cv_rprec.run(algorithm_instances,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()),k=5)

In [None]:
output_dict={'item-cosine':['item-cosine',cv_patk_output['item-cosine'][1],cv_rmse_output['item-cosine'][1],cv_ratk_output['item-cosine'][1],cv_rprec_output['item-cosine'][1]],
             'popularity':['popularity',cv_patk_output['popularity'][1],cv_rmse_output['popularity'][1],cv_ratk_output['popularity'][1],cv_rprec_output['popularity'][1]],
             'user-cosine':['user-cosine',cv_patk_output['user-cosine'][1],cv_rmse_output['user-cosine'][1],cv_ratk_output['user-cosine'][1],cv_rprec_output['user-cosine'][1]],
             'useraverage':['useraverage',cv_patk_output['useraverage'][1],cv_rmse_output['useraverage'][1],cv_ratk_output['useraverage'][1],cv_rprec_output['useraverage'][1]]}

print ("{:<12} {:<22} {:<22} {:<22} {:<22}".format('Method', 'P@K', 'RMSE', 'R@K', 'RPREC'))
for key, value in output_dict.items():
    method, patk, rmse, ratk, rprec = value
    print ("{:<12} {:<22} {:<22} {:<22} {:<22}".format(method, patk, rmse, ratk, rprec))

Considering the results above, user-user collaborative filtering with cosine similarity is the top-performing method for Precision@k. It is also the best method for RMSE, Recall@k, and R-Precision, so overall it is the best model on all metrics. Compared with the popularity and user-average baselines, this method is more complex and uses the rating information more intelligently: it relies on the ratings of similar users rather than on a simple per-user average or on item popularity. It differs from item-cosine in that it uses user similarities, and we saw in the previous section that we have denser information about users than about items, so user-cosine is expected to perform better than item-cosine.

Please note that good performance on RMSE does not necessarily imply good performance on ranking metrics, and vice versa. RMSE penalizes errors on high ratings and low ratings equally. In recommender systems, we are mostly interested in finding the top-ranked items to present to the user, so accuracy on low ratings matters less than accuracy on high ratings.
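A small constructed example can illustrate how RMSE and a ranking metric may disagree. The ratings and the two sets of predictions below are made up for this purpose: predictor A stays close to the mean rating but orders the items wrongly, while predictor B is badly scaled but orders them correctly.

```python
import numpy as np

true_ratings = np.array([5.0, 4.0, 1.0, 1.0])   # made-up ratings for one user
pred_a = np.array([3.0, 3.0, 3.2, 3.1])          # close to the mean, wrong order
pred_b = np.array([1.0, 0.9, 0.2, 0.1])          # badly scaled, right order

def rmse(pred, true):
    """Root mean squared error between predictions and true ratings."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def precision_at_k(pred, true, k):
    """Share of the k top-ranked items whose true rating is 4 or more."""
    top_k = np.argsort(pred)[::-1][:k]
    return float(np.mean(true[top_k] >= 4))

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, round(rmse(pred, true_ratings), 3), precision_at_k(pred, true_ratings, k=2))
```

Predictor A has the lower RMSE but a precision at 2 of 0, whereas predictor B has the higher RMSE and a precision at 2 of 1.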

## Similarity Evaluation
Let's go through the list of movies, pick three not-so-popular movies that we know well, and list the names of the 5 most similar movies according to item-item cosine similarity.

In [None]:
train_matrix = dataPreprocessor(rating_df_train,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
np.shape(train_matrix)

In [None]:
train_matrix_item=np.transpose(train_matrix)
ii_similarity=SimBasedRecSys.cosine(train_matrix_item)
#Movie1: ID97: Dances with Wolves
x=np.argsort(ii_similarity[96,:])[-6:]
#np.shape(train_matrix_item)
x

Most similar movies to movie 1: Forrest Gump (1994) (both dramas), E.T. the Extra-Terrestrial (both adventure), Field of Dreams (1989) (both dramas), Raiders of the Lost Ark (action and adventure), When Harry Met Sally... (1989)
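The cells in this section take the last six entries of an ascending `argsort` because the query movie itself (similarity 1) normally appears among them. A small helper can make that step explicit. The sketch below is illustrative: `top_similar` is a hypothetical function and the similarity matrix is made up.

```python
import numpy as np

def top_similar(similarity, item_index, n=5):
    """Hypothetical helper: indices of the n items most similar to item_index."""
    order = np.argsort(similarity[item_index])[::-1]   # most similar first
    order = order[order != item_index]                 # drop the item itself
    return order[:n]

# Toy symmetric similarity matrix (made up) for 4 items.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
neighbours = top_similar(toy_sim, item_index=0, n=2)
print(neighbours)
```

The returned indices are 0-based, so 1 has to be added to obtain MovieLens item IDs.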

In [None]:
#Movie2: ID178: 12 Angry Men
y=np.argsort(ii_similarity[177,:])[-6:]
y

Most similar movies to movie 2: Breakfast at Tiffany's (1961), Raising Arizona (1987)(both well-known crime related movies), Raiders of the Lost Ark (1981), Amadeus (1984), Hoop Dreams (1994)

In [None]:
#Movie3: ID170: Cinema Paradiso
#Most similar movies to movie 3: Like Water For Chocolate, Mediterraneo (1991), Piano, The (1993), Unbearable Lightness of Being, The (1988), Jean de Florette (1986)
z=np.argsort(ii_similarity[169,:])[-6:]
z

Histogram of the number of ratings per user.

In [None]:

test_matrix = dataPreprocessor(rating_df_test,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
train_matrix = dataPreprocessor(rating_df_train,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))

In [None]:
temp_matrix = np.zeros(train_matrix.shape)
temp_matrix[train_matrix.nonzero()] = 1
# Count the ratings of each user (summing train_matrix would add up rating values instead)
ratesPerUser=np.sum(temp_matrix, axis=1)
_ = plt.hist(ratesPerUser, bins='auto')
plt.xlabel('Number of ratings per user')
plt.ylabel('Number of users')
plt.show()

We pick 250 as the threshold that separates users with few ratings from those with a moderate to large number of ratings, and evaluate the RMSE of user-user and item-item collaborative filtering on users below and above the threshold.

In [None]:
# Group users by the number of ratings they have in the training set
THRESHOLD = 250
train_counts = rating_df_train.groupby('userID').size()
users_above = train_counts[train_counts > THRESHOLD].index
users_below = train_counts[train_counts <= THRESHOLD].index

# Keep only the rows that belong to each user group, for both train and test
rating_df_train_above = rating_df_train[rating_df_train.userID.isin(users_above)].copy()
rating_df_train_below = rating_df_train[rating_df_train.userID.isin(users_below)].copy()
rating_df_test_above = rating_df_test[rating_df_test.userID.isin(users_above)].copy()
rating_df_test_below = rating_df_test[rating_df_test.userID.isin(users_below)].copy()


#Above Threshold
user_cosine_recsys = SimBasedRecSys('user','cosine')
user_cosine_recsys.predict_all(rating_df_train_above,  len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
user_cosine_recsys.getModel()
pred_col=user_cosine_recsys.getPredColName()
prediction=user_cosine_recsys.evaluate_test(rating_df_test_above,copy=True)
rmse1=CrossValidation.rmse(prediction,None,None,None,pred_col)
rmse1

In [None]:
# Above threshold, item-item cosine similarity
item_cosine_recsys = SimBasedRecSys('item', 'cosine')
item_cosine_recsys.predict_all(rating_df_train_above, len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
item_cosine_recsys.getModel()
pred_col = item_cosine_recsys.getPredColName()
prediction = item_cosine_recsys.evaluate_test(rating_df_test_above, copy=True)
rmse2 = CrossValidation.rmse(prediction, None, None, None, pred_col)
rmse2

In [None]:
# Below threshold, user-user cosine similarity
user_cosine_recsys_below = SimBasedRecSys('user', 'cosine')
user_cosine_recsys_below.predict_all(rating_df_train_below, len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
user_cosine_recsys_below.getModel()
pred_col = user_cosine_recsys_below.getPredColName()
prediction = user_cosine_recsys_below.evaluate_test(rating_df_test_below, copy=True)
rmse3 = CrossValidation.rmse(prediction, None, None, None, pred_col)
rmse3

In [None]:
# Below threshold, item-item cosine similarity
item_cosine_recsys_below = SimBasedRecSys('item', 'cosine')
item_cosine_recsys_below.predict_all(rating_df_train_below, len(rating_df.userID.unique()), len(rating_df.itemID.unique()))
item_cosine_recsys_below.getModel()
pred_col = item_cosine_recsys_below.getPredColName()
prediction = item_cosine_recsys_below.evaluate_test(rating_df_test_below, copy=True)
rmse4 = CrossValidation.rmse(prediction, None, None, None, pred_col)
rmse4

For both user-user and item-item similarity, training and testing on the below-threshold users yields a lower RMSE than training and testing on the above-threshold users. One possible explanation is that below-threshold users far outnumber above-threshold users, so their ratings shape the overall rating distribution and predictions for them incur less error. Intuitively, we might expect users above the threshold to provide more reliable information and therefore a lower error, but here the larger number of below-threshold users appears to outweigh that effect.
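
One way to probe this explanation is to train a single model on all users and split only the evaluation by user group, so that both groups are scored against the same model. The following is a minimal sketch under stated assumptions: the helper `rmse_by_user_group`, the column name `pred`, and the toy data are hypothetical and not part of the notebook, and the grouping here uses the number of training ratings per user rather than the sum of their values.

```python
import numpy as np
import pandas as pd

def rmse_by_user_group(pred_df, train_df, threshold, pred_col):
    """RMSE of pred_df, split by how many training ratings each user has."""
    # Number of training ratings per user (a count, not a sum of rating values).
    n_train = train_df.groupby('userID').size()
    # Users that never appear in the training data get a count of 0.
    counts = pred_df['userID'].map(n_train).fillna(0)
    sq_err = (pred_df['rating'] - pred_df[pred_col]) ** 2
    above = counts >= threshold
    return {
        'above': float(np.sqrt(sq_err[above].mean())),
        'below': float(np.sqrt(sq_err[~above].mean())),
    }

# Toy data: user 1 has three training ratings, user 2 has one.
toy_train = pd.DataFrame({'userID': [1, 1, 1, 2],
                          'itemID': [1, 2, 3, 1],
                          'rating': [5, 4, 3, 2]})
toy_pred = pd.DataFrame({'userID': [1, 1, 2],
                         'itemID': [4, 5, 2],
                         'rating': [4.0, 2.0, 3.0],
                         'pred': [3.0, 3.0, 5.0]})

group_rmse = rmse_by_user_group(toy_pred, toy_train, threshold=2, pred_col='pred')
print(group_rmse)
```

If the gap between the two groups persists when both are scored against the same model, the difference is less likely to be an artifact of training on different subsets.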

## Using SVD as a Factorization-Based Model

In [None]:
class CompetitionRecSys(object):
    """
    You can define new methods if you need to. Don't use global variables in the class.
    """
    def __init__(self, processor=dataPreprocessor):
        """
        Initialization of the class
        1. Make sure to fill out self.pred_column_name, the name you give to your competition method
        
        """
        self.pred_column_name = 'advanced'
        self.processor = processor

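    def predict_all(self, train_vec, num_user, num_item):
        """
        INPUT: 
            train_vec: pandas DataFrame. columns=['userID', 'itemID', 'rating'...]
            num_user: scalar. number of users
            num_item: scalar. number of items
        OUTPUT:
            no return... 
        
        NOTES:
            This function is where you train your model
        """
        train_matrix = self.processor(train_vec, num_user, num_item)

        # Thin SVD of the user-item matrix: train_matrix = U @ diag(s) @ Vt
        U, s, Vt = np.linalg.svd(train_matrix, full_matrices=False)

        # Keep the k largest singular values (returned in descending order)
        k = 4
        s_root = np.sqrt(s[:k])

        # Split the singular values evenly between user and item factors
        user_factors = U[:, :k] * s_root            # shape (num_user, k)
        item_factors = s_root[:, None] * Vt[:k, :]  # shape (k, num_item)

        # The rank-k reconstruction serves as the predicted rating matrix
        self.__model = user_factors @ item_factors
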
    def evaluate_test(self, test_df, copy=False):
        """
            INPUT:
                test_df: pandas DataFrame. columns=['userID', 'itemID', 'rating'...]
                copy: bool. if True, work on a copy of test_df
            OUTPUT:
                predictions:  pandas DataFrame. 
                              columns=['userID', 'itemID', 'rating', self.pred_column_name]

            NOTES:
                This function is where your model makes predictions by looking up
                entry [userID-1, itemID-1] of the reconstructed rating matrix.
                              
        """
        if copy:
            prediction = pd.DataFrame(test_df.copy(), columns=['userID', 'itemID', 'rating'])
        else:
            prediction = pd.DataFrame(test_df, columns=['userID', 'itemID', 'rating'])
        prediction[self.pred_column_name] = np.nan
        
        for (index, 
             userID, 
             itemID) in tqdm(prediction[['userID','itemID']].itertuples()):
            prediction.loc[index, self.pred_column_name] = self.__model[userID-1, itemID-1]

        return prediction
          
    def getPredColName(self):
        """
            return prediction column name
        """
        return self.pred_column_name
    
    def reset(self):
        """
            reuse the instance of the class by removing model
        """
        self.__model = None
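
The factorization used in `predict_all` can be illustrated on a small matrix. The following is a minimal sketch, assuming a dense rating matrix in which unrated entries are stored as 0; the function name `rank_k_reconstruction` and the toy matrix are hypothetical. It shows that splitting the square roots of the singular values between the two factors still reproduces the usual rank-k approximation, and that the reconstruction error shrinks as k grows.

```python
import numpy as np

def rank_k_reconstruction(matrix, k):
    """Rank-k approximation of matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    s_root = np.sqrt(s[:k])
    user_factors = U[:, :k] * s_root            # shape (n_users, k)
    item_factors = s_root[:, None] * Vt[:k, :]  # shape (k, n_items)
    return user_factors @ item_factors

# Toy matrix: two groups of users with opposite tastes (0 = unrated).
toy = np.array([[5., 4., 0., 1.],
                [4., 5., 1., 0.],
                [0., 1., 5., 4.],
                [1., 0., 4., 5.]])

singular_values = np.linalg.svd(toy, compute_uv=False)
errors = {}
for k in (1, 2, 4):
    # Frobenius norm of the reconstruction error
    errors[k] = np.linalg.norm(toy - rank_k_reconstruction(toy, k))
    print(k, errors[k])
```

Because the SVD treats the zeros as observed ratings rather than as missing values, the reconstruction is pulled toward 0 for unrated items. This is a known limitation of applying plain SVD to a sparse rating matrix.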

In [None]:
# Evaluate the SVD-based model with cross-validated R-precision
competition = CompetitionRecSys()
algorithm_instances = [competition]
cv_rp = CrossValidation('RPrecision')
rp = cv_rp.run(algorithm_instances, len(rating_df.userID.unique()), len(rating_df.itemID.unique()))