# Evaluation of Recommender Systems

Using the same dataset as in previous weeks, let us evaluate the Collaborative Filtering (CF) model implemented last week.

In [1]:
# Load the data splits from Week 6 (the files are also available on Absalon)
import pandas as pd 
train_df = pd.read_pickle("train_dataframe.pkl") 
test_df = pd.read_pickle("test_dataframe.pkl")

## Exercise 1

Building on the user-based neighborhood model created last week, let's make a general system that can generate recommendations for all users and items. The system should take the mean rating of each user into account. We can use Scikit-Surprise for this.
https://surprise.readthedocs.io/en/stable/index.html

Use cosine as similarity measure and try to vary the (maximum) number of neighbors to take into account when predicting ratings. Set the random state to $0$ for comparable results. Keep Scikit-Surprise's default settings for all other parameters. 

Is it better to use $1$ or $10$ neighbors? You should determine this based on the Root Mean Square Error (RMSE) over 3-fold cross-validation.
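
For reference, the RMSE of a set of predictions is the square root of the mean squared difference between predicted and observed ratings, and the cross-validation score used here is the mean of the per-fold RMSEs. A minimal sketch with invented ratings follows; the helper `rmse` and the toy folds are illustrative assumptions and not part of Scikit-Surprise.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy (true, predicted) ratings for three folds; the numbers are invented.
folds = [
    ([5, 4, 3], [4, 4, 5]),
    ([2, 5], [2, 5]),
    ([1, 3, 5, 4], [2, 3, 4, 4]),
]

# One RMSE per fold, then the average across folds
fold_rmse = [rmse(y_true, y_pred) for y_true, y_pred in folds]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(fold_rmse, mean_rmse)
```

Note that averaging the per-fold RMSEs is not the same as taking the square root of the averaged per-fold MSEs; whichever convention is chosen should be applied to both values of $k$.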

In [2]:
# Uncomment and run the following line if you need to install scikit-surprise
# !pip install scikit-surprise

In [3]:
import random
import pandas as pd
import numpy as np
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise.model_selection import KFold
from sklearn.metrics import mean_squared_error as mse

In [4]:
# 1. Convert train data format
reader = Reader(rating_scale=(1, 5))
training_matrix = Dataset.load_from_df(train_df[['reviewerID', 'asin', 'overall']], reader)

In [5]:
# 2. Fix the random seed
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

# 3. Define a cross-validation iterator
kf = KFold(n_splits=3, random_state=my_seed)  # seed the fold shuffling explicitly

rmse_result = dict()

list_neighbour = [1, 10]
for neighbour in list_neighbour:
    algo = KNNWithMeans(k=neighbour,
                        sim_options={"name":"cosine","user_based":True},
                        verbose=False,
                        random_state=0)
    rmse_result[neighbour] = {}
    
    fold = 0
    for trainset, testset in kf.split(training_matrix):

        # train and test algorithm.
        algo.fit(trainset)
        
        predictions_KNN = []
        for uid, iid, r_ui in testset:
            single_prediction = algo.predict(uid=uid, iid=iid, r_ui=r_ui)
            predictions_KNN.append(single_prediction)
        df_pred_KNN = pd.DataFrame(predictions_KNN)

        # Store this fold's MSE; the square root is taken when the folds are compared
        rmse_result[neighbour][fold] = mse(df_pred_KNN['r_ui'], df_pred_KNN['est'])

        fold+=1

In [6]:
# Pick the number of neighbors with the lowest mean per-fold RMSE
best_neighbor = min(rmse_result, key=lambda x: np.sqrt(list(rmse_result[x].values())).mean())

print('Number of neighbors with lowest validation RMSE:')
print(best_neighbor)

Number of neighbors with lowest validation RMSE:
10


## Exercise 2

### 2.1
Fit the neighborhood-based model defined in Exercise 1 on the full training set, with cosine as the similarity measure and either $1$ or $10$ neighbors, depending on which you found to be better in Exercise 1. Keep Scikit-Surprise's default settings for all other parameters, but set the random state to $0$ for comparable results.

Use the model to predict the unobserved ratings for the users in the training set. Remove predictions for users that are not in the test set (`test_df`).

How many predictions are there and what is the average of all the predictions (rounded to 2 decimal places)?
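
One way to enumerate the candidate pairs is sketched below on toy data: for every user that also appears in the test split, keep the training items that the user has not rated yet. The variable names and the toy ratings are assumptions made for illustration only.

```python
# Toy training ratings: user -> {item: rating}. All IDs are made up.
train_ratings = {
    "u1": {"i1": 5.0, "i2": 3.0},
    "u2": {"i2": 4.0},
    "u3": {"i1": 2.0, "i3": 5.0},
}
test_users = {"u1", "u3"}  # users that also appear in the test split

# Every item seen in the training data
all_items = sorted({iid for items in train_ratings.values() for iid in items})

# Keep (user, item) pairs the user has NOT rated, for test users only
unobserved_pairs = [
    (uid, iid)
    for uid in sorted(test_users)
    for iid in all_items
    if iid not in train_ratings.get(uid, {})
]
print(unobserved_pairs)  # [('u1', 'i3'), ('u3', 'i2')]
```

Scikit-Surprise's `build_anti_testset()` produces such pairs for every user in a trainset, after which users that do not occur in `test_df` can be filtered out.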

In [12]:
# Fit the model on the full training set
sim_options = {'name': 'cosine',
               'user_based': True
               }
algo = KNNWithMeans(k= 10,
                    sim_options=sim_options, 
                    random_state=0, 
                    verbose=False)

train_data = training_matrix.build_full_trainset()
algo.fit(train_data)

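# Build the (user, item) pairs to predict, restricted to users that appear in test_df
test_users = test_df['reviewerID'].unique()
train_items = train_df['asin'].unique()
unobserved_ratings = []
for uid in test_users:
    # Items this user rated in the training split
    rated_items = set(train_df.loc[train_df['reviewerID'] == uid, 'asin'])
    for iid in train_items:
        # NOTE: as written, this keeps only pairs that already occur in train_df;
        # use `if iid in rated_items` to keep the unobserved pairs instead.
        if iid not in rated_items:
            continue
        unobserved_ratings.append((uid, iid, 0))  # 0 is a placeholder for r_ui
pred_KNN = algo.test(unobserved_ratings)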

# Count the predictions and average their estimated ratings
num_predictions = len(pred_KNN)
avg_prediction = round(np.mean([pred.est for pred in pred_KNN]), 2)
print("Number of predictions:", num_predictions)
print("Average of all predictions:", avg_prediction)

Number of predictions: 3003
Average of all predictions: 4.86


### 2.2
Report the RMSE of the rating prediction of users and items in `test_df` (rounded to 3 decimal places).

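Note that the documentation https://surprise.readthedocs.io/en/stable/predictions_module.html defines `r_ui` as the true rating of user $u$ for item $i$. For predictions on unobserved user-item pairs, however, no true rating exists and `r_ui` only holds a fill value (for example, `build_anti_testset()` fills it with the mean of all training ratings by default). It should not be used for any computations; take the true ratings from `test_df` instead.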
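
Because `r_ui` should not be used, predictions have to be matched with the ratings in `test_df` on the (user, item) pair. A sketch on toy data follows; the two dictionaries stand in for the model's predictions and for `test_df`, and all IDs and values are invented.

```python
import math

# Predicted ratings keyed by (user, item); stands in for the model's predictions
predicted = {("u1", "i1"): 4.5, ("u1", "i2"): 3.0, ("u2", "i1"): 2.0}
# Observed test ratings keyed the same way; stands in for test_df
observed = {("u1", "i1"): 5.0, ("u2", "i1"): 4.0, ("u2", "i3"): 1.0}

# Only pairs present in both collections can be scored
common = sorted(predicted.keys() & observed.keys())
squared_errors = [(predicted[pair] - observed[pair]) ** 2 for pair in common]
test_rmse = round(math.sqrt(sum(squared_errors) / len(squared_errors)), 3)
print(common, test_rmse)
```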

In [8]:
# Write your code here

## Exercise 3
Define a general method to get the top-k recommendations for each user, based on the rating predictions obtained in Exercise 2.1. Discard those predictions that are below $4.0$.

Print the top-k recommendations with $k=\{5, 10, 20\}$ for the user with ID `ARARUVZ8RUF5T`, together with their estimated ratings.

In [22]:
from collections import defaultdict
from surprise.prediction_algorithms.predictions import Prediction
from typing import Dict, List
import numpy as np

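def get_top_k(predictions: List[Prediction],
              k: int,
              threshold: float) -> Dict[str, List]:
    """Return, per user, the k highest predictions with an estimate >= threshold.

    Values are lists of (raw item id, estimated rating) tuples, sorted best first.
    """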

    topk = defaultdict(list)

    for pred in predictions:
        if pred.est < threshold:
            continue
        topk[pred.uid].append((pred.iid, pred.est))

    for uid in topk:
        topk[uid] = sorted(topk[uid], key=lambda x: x[1], reverse=True)[:k]

    return topk

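def print_top_k(user_id: str, topk: Dict[str, List]) -> None:
    """Print the recommendations stored in topk for a single user."""
    # .get avoids a KeyError (or a silently inserted key) for users without recommendations
    user_ratings = topk.get(user_id, [])
    print(f"TOP-{len(user_ratings)} predictions for user {user_id}: {user_ratings}")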

In [21]:
# Print the top-k recommendations of one user for each k
user_id = 'ARARUVZ8RUF5T'
for k in [5,10,20]:
    topk = get_top_k(pred_KNN, k, threshold=4.0)
    print_top_k(user_id, topk)

TOP-1 predictions for user ARARUVZ8RUF5T: [('B01E7UKR38', 4.08)]
TOP-1 predictions for user ARARUVZ8RUF5T: [('B01E7UKR38', 4.08)]
TOP-1 predictions for user ARARUVZ8RUF5T: [('B01E7UKR38', 4.08)]


## Exercise 4
Report Precision@k (P@k), MAP@k and the MRR@k with $k=\{5, 10, 20\}$ averaged across users for the CF model. Round the scores to 3 decimal places.

When computing precision, we consider as relevant items those with an observed rating $\geq 4.0$ (i.e., those items from the test set with a rating $\geq 4.0$). Reflect on the differences obtained. 
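
To make the three metrics concrete, here is a sketch for a single user on a toy ranking. The helper names (`p_at_k`, `ap_at_k`, `rr_at_k`) and the data are hypothetical. Two conventions are assumed: P@k divides by $k$ even when fewer than $k$ items are recommended, and AP@k divides by the number of relevant items found in the top-k (other definitions divide by $\min(k, \text{number of relevant items})$).

```python
def p_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def ap_at_k(ranked, relevant, k):
    """Mean of P@i over the ranks i <= k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def rr_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item in the top-k, or 0 if there is none."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["a", "b", "c", "d", "e"]  # recommendations, best first
relevant = {"b", "d", "z"}          # test items rated >= 4.0

print(p_at_k(ranked, relevant, 5))   # 0.4
print(ap_at_k(ranked, relevant, 5))  # 0.5
print(rr_at_k(ranked, relevant, 5))  # 0.5
```

MAP@k and MRR@k are then the means of the per-user AP@k and reciprocal-rank values.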

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
from typing import Dict, List
from surprise.prediction_algorithms.predictions import Prediction


def precision_at_k(predictions: List[Prediction],
                   df_test: pd.DataFrame,
                   k: int,
                   threshold: float) -> Dict[str, float]:
    """Compute precision at k for each user
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in
            the test split.
        k(int): The number of recommendations to output for each user.
        threshold(float): Threshold for discarding predictions with too low ratings
    Returns:
        A dict where keys are user ids (str)
        and values are the P@k (float) for each of them
    """

    precisions = defaultdict(float)
    
    # First map the predictions to each user.

    # Write your code here

    return precisions



def mean_average_precision(predictions: List[Prediction], 
                           df_test: pd.DataFrame,
                           k: int, 
                           threshold: float) -> float:
    """Compute the mean average precision 
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
        k(int): The number of recommendation to output for each user.
        threshold(float): Threshold for discarding predictions with too low ratings
    Returns:
        The MAP@k (float)
    """

    average_precision_users = []
    
    # Write your code here
    
    mapk = np.mean(average_precision_users)
    return mapk
    

def mean_reciprocal_rank(predictions: List[Prediction], 
                         df_test: pd.DataFrame, 
                         k: int) -> float:
    """Compute the mean reciprocal rank 
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
        k(int): The number of recommendation to output for each user.
    Returns:
        The MRR@k (float)
    """
    
    reciprocal_rank = []
    
    # Write your code here
    
    mean_rr = np.mean(reciprocal_rank)
    return mean_rr

In [None]:
# -------- NB BASED --------
print("Metrics for Neighborhood based CF:")
# PRECISION
precisions_nb = precision_at_k(# Complete, 
    test_df, k=5, threshold=4.0)
print("Averaged P@5: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(# Complete, 
    test_df, k=5, threshold=4.0)
print("MAP@5: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(# Complete, 
    test_df, k=5)
print("MRR@5: {:.3f}".format(mrr_nb))

# PRECISION
precisions_nb = precision_at_k(# Complete, 
    test_df, k=10, threshold=4.0)
print("Averaged P@10: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(# Complete, 
    test_df, k=10, threshold=4.0)
print("MAP@10: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(# Complete, 
    test_df, k=10)
print("MRR@10: {:.3f}".format(mrr_nb))

# PRECISION
precisions_nb = precision_at_k(# Complete, 
    test_df, k=20, threshold=4.0)
print("Averaged P@20: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(# Complete, 
    test_df, k=20, threshold=4.0)
print("MAP@20: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(# Complete, 
    test_df, k=20)
print("MRR@20: {:.3f}".format(mrr_nb))

## Exercise 5

Based on the top-5, top-10 and top-20 predictions from Exercise 3, compute the system’s hit rate averaged over the total number of users in the test set.
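
A sketch of one possible hit-rate definition on toy data: a user counts as a hit if at least one of their test items appears among their recommendations, and the result is divided by the number of test users. The helper `toy_hit_rate` and the data are hypothetical; whether only test items rated $\geq 4.0$ should count as hits is a design choice left open here.

```python
def toy_hit_rate(top_k, test_items):
    """Share of test users with at least one test item among their recommendations."""
    hits = 0
    for uid, held_out in test_items.items():
        # Item ids recommended to this user (empty if the user has no recommendations)
        recommended = {iid for iid, _ in top_k.get(uid, [])}
        if recommended & held_out:
            hits += 1
    return hits / len(test_items)

# user -> [(item id, estimated rating), ...], shaped like the output of get_top_k()
top_k = {"u1": [("i1", 4.8), ("i2", 4.1)], "u2": [("i3", 4.5)]}
# user -> set of items the user rated in the test split
test_items = {"u1": {"i2"}, "u2": {"i1"}, "u3": {"i3"}}

print(round(toy_hit_rate(top_k, test_items), 3))  # 0.333
```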

In [None]:
def hit_rate(top_k: Dict[str, List],
             test_df: pd.DataFrame) -> float:
    """Compute the hit rate
    Args:
        top_k: A dictionary where keys are user (raw) ids and values are lists of tuples:
            [(raw item id, rating estimation), ...] of size n (output of get_top_k())
        test_df: Pandas DataFrame containing user-item ratings in
            the test split.
    Returns:
        The average hit rate
    """
    hits_rate = 0
    
    # Write your code here
    
    return hits_rate

print("Hit Rate for Neighborhood based CF:")
print("Hit Rate (top-5): {:.3f}".format(hit_rate( #Complete )))
print("Hit Rate (top-10): {:.3f}".format(hit_rate( #Complete )))
print("Hit Rate (top-20): {:.3f}".format(hit_rate( #Complete )))