# Evaluation of Recommender Systems

Using the same dataset as in previous weeks, let us evaluate the Collaborative Filtering (CF) models implemented last week.

## Exercise 1

1. Load the test set and the predictions made with both Collaborative Filtering models in the previous session. 
2. Detect the users who are in the training set but not in the test set. Remove their predictions before evaluating the systems.
3. Report the Root Mean Square Error (RMSE) for both CF models defined in the previous session.
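
Steps 2 and 3 can be illustrated on toy data with a minimal plain-Python sketch. It is not the Surprise implementation: the names `ToyPrediction`, `drop_users_outside_test` and `rmse`, as well as the toy ratings, are assumptions made for illustration only.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a prediction record; only these fields are used here.
ToyPrediction = namedtuple("ToyPrediction", ["uid", "r_ui", "est"])

def drop_users_outside_test(predictions, test_users):
    """Keep only the predictions whose user appears in the test set."""
    test_users = set(test_users)  # set membership is O(1)
    return [p for p in predictions if p.uid in test_users]

def rmse(predictions):
    """Root Mean Square Error between true ratings (r_ui) and estimates (est)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy data (not taken from the dataset).
toy_predictions = [
    ToyPrediction("u1", 5.0, 4.0),
    ToyPrediction("u1", 3.0, 3.0),
    ToyPrediction("u2", 4.0, 2.0),
    ToyPrediction("u3", 1.0, 5.0),  # u3 is absent from the toy test set
]
kept = drop_users_outside_test(toy_predictions, test_users={"u1", "u2"})
toy_rmse = rmse(kept)  # square root of the mean of the squared errors 1, 0 and 4
```

Dropping `u3` before computing the error mirrors the requirement that only users present in the test split contribute to the reported RMSE.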

In [2]:
import os
import sys
sys.path.append('../')
import pickle
import pandas as pd
import import_ipynb

# TEST
from Session_1 import test_data

# PREDICTIONS
from Session_2 import predictions_neigbor, predictions_factor

pred_nb_list = predictions_neigbor
pred_lf_list = predictions_factor
# Detect users from training set that are not in test
test_users = set(test_data['reviewerID'])
nb_users = {pred.uid for pred in pred_nb_list}
lf_users = {pred.uid for pred in pred_lf_list}
# Keep the differences as sets: comparing lists built from sets depends on
# iteration order, and set membership speeds up the filtering below.
nb_users_in_pred_but_not_in_test = nb_users - test_users
lf_users_in_pred_but_not_in_test = lf_users - test_users
assert nb_users_in_pred_but_not_in_test == lf_users_in_pred_but_not_in_test
print(f"There are {len(lf_users_in_pred_but_not_in_test)} users in the training set that are not in the test set.")

# Remove these users' predictions for evaluation
pred_nb_list_removed = [pred for pred in pred_nb_list if pred.uid not in nb_users_in_pred_but_not_in_test]
pred_lf_list_removed = [pred for pred in pred_lf_list if pred.uid not in lf_users_in_pred_but_not_in_test]
assert len(pred_nb_list_removed) == len(pred_lf_list_removed)
print(f"Evaluating the systems with {len(pred_nb_list_removed)} predictions for users in the test split.")

There are 32 users in the training set that are not in the test set.
Evaluating the systems with 52988 predictions for users in the test split.


In [None]:
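from surprise import accuracy
# accuracy.rmse prints the value itself (verbose=True by default) and also returns it,
# which is why each RMSE appears twice in the output below.
print("RMSE for Neighborhood based Collaborative Filtering: {:.3f}".format(accuracy.rmse(pred_nb_list_removed)))
print("RMSE for Latent Factor based Collaborative Filtering: {:.3f}".format(accuracy.rmse(pred_lf_list_removed)))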

RMSE: 0.6856
RMSE for Neighborhood based Collaborative Filtering: 0.686
RMSE: 0.5486
RMSE for Latent Factor based Collaborative Filtering: 0.549


## Exercise 2
Define a general method to get the top-k recommendations for each user. Print the top-k recommendations, with k={5, 10}, for the user with ID 'ARARUVZ8RUF5T', together with their estimated ratings.
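
As a point of comparison, here is a minimal sketch of the same idea using `heapq.nlargest` from the standard library, which selects the k best items per user without sorting the full list. The function name `top_k_per_user` and the toy triples are hypothetical and not part of the exercise data.

```python
import heapq
from collections import defaultdict

def top_k_per_user(triples, k):
    """Return {uid: [(iid, est), ...]} with at most k items per user,
    ordered from highest to lowest estimated rating.

    `triples` is an iterable of (uid, iid, est) tuples.
    """
    by_user = defaultdict(list)
    for uid, iid, est in triples:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(k, items, key=lambda pair: pair[1])
        for uid, items in by_user.items()
    }

# Toy estimates (illustrative values only).
toy_triples = [
    ("u1", "a", 3.5),
    ("u1", "b", 4.8),
    ("u1", "c", 4.1),
    ("u2", "a", 2.0),
]
toy_top2 = top_k_per_user(toy_triples, k=2)
```

A user with fewer than k predictions, such as `u2` here, simply gets all of them back.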

In [7]:
from collections import defaultdict

def transfer_to_user_item_rating(pred_list):
    # Group the predictions by user.
    user_item_rating = defaultdict(list)
    for uid, iid, _, est, _ in pred_list:
        user_item_rating[uid].append((iid, est))
    # {uid: [(iid, est), ...]}
    return user_item_rating

def top_k_recommendations(k, user_item_rating):
    # Sort each user's predictions by estimated rating and keep the k highest.
    top_k = defaultdict(list)
    for uid, user_ratings in user_item_rating.items():
        # sorted() leaves the caller's lists untouched.
        ranked = sorted(user_ratings, key=lambda x: x[1], reverse=True)
        # {uid: [(iid, est), ...]}
        top_k[uid] = ranked[:k]

    return top_k

user_item_rating_nb = transfer_to_user_item_rating(pred_nb_list_removed)
user_item_rating_lf = transfer_to_user_item_rating(pred_lf_list_removed)

print("Neighborhood based Collaborative Filtering:")
top_5_nb = top_k_recommendations(5, user_item_rating_nb)
print("TOP-5 predictions for user ARARUVZ8RUF5T:", top_5_nb["ARARUVZ8RUF5T"])
top_10_nb = top_k_recommendations(10, user_item_rating_nb)
print("TOP-10 predictions for user ARARUVZ8RUF5T:", top_10_nb["ARARUVZ8RUF5T"])
top_20_nb = top_k_recommendations(20, user_item_rating_nb)

print("Latent Factor based Collaborative Filtering:")
top_5_lf = top_k_recommendations(5, user_item_rating_lf)
print("TOP-5 predictions for user ARARUVZ8RUF5T:", top_5_lf["ARARUVZ8RUF5T"])
top_10_lf = top_k_recommendations(10, user_item_rating_lf)
print("TOP-10 predictions for user ARARUVZ8RUF5T:", top_10_lf["ARARUVZ8RUF5T"])
top_20_lf = top_k_recommendations(20, user_item_rating_lf)

Neighborhood based Collaborative Filtering:
TOP-5 predictions for user ARARUVZ8RUF5T: [('B000WR2HB6', 5), ('B000FOI48G', 4.675), ('B000VV1YOY', 4.666666666666667), ('B001ET7FZE', 4.6), ('B000PKKAGO', 4.5)]
TOP-10 predictions for user ARARUVZ8RUF5T: [('B000WR2HB6', 5), ('B000FOI48G', 4.675), ('B000VV1YOY', 4.666666666666667), ('B001ET7FZE', 4.6), ('B000PKKAGO', 4.5), ('B00EF1QRMU', 4.470205150915517), ('B016V8YWBC', 4.458333333333333), ('B00W259T7G', 4.42), ('B00CZH3K1C', 4.333333333333334), ('B000GLRREU', 4.233333333333333)]
Latent Factor based Collaborative Filtering:
TOP-5 predictions for user ARARUVZ8RUF5T: [('B006IB5T4W', 5), ('B001F51RAG', 5), ('B00NT0AR7E', 5), ('B006WYJM8Y', 5), ('B00155Z6V2', 5)]
TOP-10 predictions for user ARARUVZ8RUF5T: [('B006IB5T4W', 5), ('B001F51RAG', 5), ('B00NT0AR7E', 5), ('B006WYJM8Y', 5), ('B00155Z6V2', 5), ('B00021DJ32', 4.993923352784167), ('B001QY8QXM', 4.948791090341968), ('B000X7ST9Y', 4.840934395462433), ('B00126LYJM', 4.839149265457805), ('B002R

## Exercise 3
Report Precision@k (P@k), MAP@k and MRR@k with k={5, 10, 20}, averaged across users, for both CF systems. When computing precision, we consider as relevant the items with an observed rating >= 4.0 (i.e., items from the test set rated >= 4.0). Reflect on the differences obtained.
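
To make the three per-user quantities concrete, the sketch below evaluates them on a toy binary relevance list that is already sorted by estimated rating (1 = relevant, 0 = not relevant). The function names and toy lists are illustrative assumptions. Average precision is normalised by the total number of relevant items in the user's list, which is one convention among several.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k ranked items that are relevant."""
    return sum(relevance[:k]) / k

def average_precision_at_k(relevance, k):
    """Sum of P@i over the relevant ranks i <= k, divided by the
    total number of relevant items in the whole list."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank  # equals P@rank at a relevant position
    return score / total_relevant

def reciprocal_rank_at_k(relevance, k):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Toy ranking: relevant items sit at ranks 2 and 4.
toy_relevance = [0, 1, 0, 1, 0]
toy_p5 = precision_at_k(toy_relevance, 5)            # 2 relevant out of 5
toy_ap5 = average_precision_at_k(toy_relevance, 5)   # (1/2 + 2/4) / 2
toy_rr5 = reciprocal_rank_at_k(toy_relevance, 5)     # first hit at rank 2

# A list with a single relevant item, at rank 3.
single_relevance = [0, 0, 1, 0]
single_ap = average_precision_at_k(single_relevance, 4)
single_rr = reciprocal_rank_at_k(single_relevance, 4)
```

MAP@k and MRR@k are the means of the per-user AP@k and RR@k values. When a user has exactly one relevant item, AP@k and RR@k are the same number, as `single_ap` and `single_rr` show, so identical MAP and MRR values can arise on data where users have at most one relevant test item.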

In [50]:
from collections import defaultdict

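# Keyword arguments are required by recent pandas versions (positional ones were removed in 2.0).
user_item_test = test_data.pivot(index='reviewerID', columns='asin', values='overall')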
user_item_test = user_item_test.fillna(0)

# compute P@k for one user (user_ratings must already be sorted by estimate)
def precision_at_k(k, user_ratings):
    n_rel = sum(relevant for (_, _, relevant) in user_ratings[:k])
    return n_rel / k

# compute RR@k for one user
def RR_at_k(k, user_ratings):
    # Slicing avoids an IndexError for users with fewer than k predictions.
    for rank, (_, _, relevant) in enumerate(user_ratings[:k], start=1):
        if relevant == 1:
            return 1.0 / rank
    return 0.0

def total_relevant(user_ratings):
    return sum((relevant) for (_, _, relevant) in user_ratings)

def transfer_to_user_item_rating_relevant(pred_list):
    # map the predictions to each user.
    user_item_rating = defaultdict(list)
    for pred in pred_list:
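        # Items absent from the test set count as unrated (0); membership is tested on the column index directly.
        true_rating = user_item_test.loc[pred.uid, pred.iid] if pred.iid in user_item_test.columns else 0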
        relevant = 1 if true_rating >= 4.0 else 0
        user_item_rating[pred.uid].append((pred.iid, pred.est, relevant))
    # {uid: [(iid, est, relevant)]}
    return user_item_rating

def compute_metrics(k, user_item_rating):
    precisions = dict() # precision
    ap = dict() # average precision
    rr = dict() # reciprocal rank
    for uid, user_ratings in user_item_rating.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        precisions[uid] = precision_at_k(k, user_ratings)
        n_relevant = total_relevant(user_ratings)
        # Visit only ranks that exist, so users with fewer than k predictions do not raise an IndexError.
        top = min(k, len(user_ratings))
        ap[uid] = (sum(precision_at_k(i, user_ratings) * user_ratings[i-1][2] for i in range(1, top+1)) / n_relevant) if n_relevant != 0 else 0
        rr[uid] = RR_at_k(k, user_ratings)

    # (mean P@k, MAP@k, MRR@k), each averaged over users
    return sum(precisions.values()) / len(precisions), sum(ap.values()) / len(ap), sum(rr.values()) / len(rr)

In [51]:
user_item_rating_nb_relevant = transfer_to_user_item_rating_relevant(pred_nb_list_removed)
user_item_rating_lf_relevant = transfer_to_user_item_rating_relevant(pred_lf_list_removed)

k_set = [5, 10, 20]
print("Metrics for Neighborhood based CF:")
for k in k_set:
    p_at_k_nb, map_at_k_nb, mrr_at_k_nb = compute_metrics(k, user_item_rating_nb_relevant)
    print(f"Averaged P@{k}:", round(p_at_k_nb, 3))
    print(f"MAP@{k}:", round(map_at_k_nb, 3))
    print(f"MRR@{k}:", round(mrr_at_k_nb, 3))

print("")

print("Metrics for Latent Factor based CF:")
for k in k_set:
    p_at_k_lf, map_at_k_lf, mrr_at_k_lf = compute_metrics(k, user_item_rating_lf_relevant)
    print(f"Averaged P@{k}:", round(p_at_k_lf, 3))
    print(f"MAP@{k}:", round(map_at_k_lf, 3))
    print(f"MRR@{k}:", round(mrr_at_k_lf, 3))

Metrics for Neighborhood based CF:
Averaged P@5: 0.148
MAP@5: 0.167
MRR@5: 0.167
Averaged P@10: 0.078
MAP@10: 0.174
MRR@10: 0.174
Averaged P@20: 0.04
MAP@20: 0.175
MRR@20: 0.175

Metrics for Latent Factor based CF:
Averaged P@5: 0.024
MAP@5: 0.073
MRR@5: 0.073
Averaged P@10: 0.05
MAP@10: 0.116
MRR@10: 0.116
Averaged P@20: 0.038
MAP@20: 0.137
MRR@20: 0.137


## Exercise 4

Based on the top-5, top-10 and top-20 predictions from Exercise 2, compute the systems’ hit rate averaged over the total number of users in the test set.
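
The hit rate can be illustrated on toy data: a user counts as a hit when at least one relevant item appears among their top-k predictions, and the system's hit rate is the fraction of users with a hit. The names `hit_at_k` and `mean_hit_rate` and the toy relevance lists in this sketch are assumptions for illustration.

```python
def hit_at_k(relevance, k):
    """1.0 if any of the first k ranked items is relevant, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0

def mean_hit_rate(relevance_by_user, k):
    """Average of the per-user hit indicator over all users."""
    if not relevance_by_user:
        return 0.0
    hits = sum(hit_at_k(rel, k) for rel in relevance_by_user.values())
    return hits / len(relevance_by_user)

# Toy binary relevance lists, each already sorted by estimated rating.
toy_relevance_by_user = {
    "u1": [0, 1, 0],
    "u2": [0, 0, 0],
    "u3": [0, 0, 1],
    "u4": [1],        # fewer than k predictions is fine thanks to slicing
}
toy_hr2 = mean_hit_rate(toy_relevance_by_user, k=2)  # hits: u1, u4
toy_hr3 = mean_hit_rate(toy_relevance_by_user, k=3)  # hits: u1, u3, u4
```

Because a larger k can only add hits, the hit rate never decreases as k grows.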

In [52]:
# compute HR@k for one user
def HR_at_k(k, user_ratings):
    # Slicing keeps this safe for users with fewer than k predictions.
    for _, _, relevant in user_ratings[:k]:
        if relevant == 1:
            return 1.0
    return 0.0

def compute_hit_rate(k, user_item_rating):
    hr = dict() # hit rate
    for uid, user_ratings in user_item_rating.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        hr[uid] = HR_at_k(k, user_ratings)

    return sum(hr.values()) / len(hr)

k_set = [5, 10, 20]
print("Hit Rate for Neighborhood based CF:")
for k in k_set:
    mhr_at_k_nb = compute_hit_rate(k, user_item_rating_nb_relevant)
    print(f"Hit Rate (top-{k}):", round(mhr_at_k_nb, 3))

print("")

print("Hit Rate for Latent Factor based CF:")
for k in k_set:
    mhr_at_k_lf = compute_hit_rate(k, user_item_rating_lf_relevant)
    print(f"Hit Rate (top-{k}):", round(mhr_at_k_lf, 3))

Hit Rate for Neighborhood based CF:
Hit Rate (top-5): 0.74
Hit Rate (top-10): 0.781
Hit Rate (top-20): 0.797

Hit Rate for Latent Factor based CF:
Hit Rate (top-5): 0.121
Hit Rate (top-10): 0.502
Hit Rate (top-20): 0.756


In [43]:
# from collections import defaultdict

# user_item_test = test_data.pivot('reviewerID', 'asin', 'overall')
# user_item_test = user_item_test.fillna(0)

# # compute RR@k for one user
# def RR_at(relevants):
#     for i in range(len(relevants)):
#         if relevants[i] == 1.0:
#             return 1.0 / (i+1)
#     return 0.0

# def precision_at(relevants):
#     result = []
#     sum_k = 0
#     for i in range(len(relevants)):
#         sum_k += relevants[i]
#         if relevants[i] == 1.0:
#             result.append(sum_k / (i+1))
#     return result

# def precision_at_k(k, top_k_list, user_item_rating):
#     precisions = dict() # precision
#     ap = dict() # average precision
#     rr = dict() # reciprocal rank
#     for uid, top_k_value in top_k_list.items():
#         relevants = []
#         for iid, _ in top_k_value:
#             if iid in list(user_item_test.columns):
#                 relevant = 1.0 if user_item_test.loc[uid, iid] >= 4.0 else 0 
#                 relevants.append(relevant)
#             else:
#                 relevants.append(0)
#         precisions[uid] = sum(relevants) / k
#         ap[uid] = (sum(precision_at(relevants)) / total_relevant(user_item_rating[uid])) if total_relevant(user_item_rating[uid]) != 0 else 0 
#         rr[uid] = RR_at(relevants)
#     return sum(prec for prec in precisions.values()) / len(precisions), sum(prec for prec in rr.values()) / len(rr), sum(prec for prec in ap.values()) / len(ap)
        
# print("P@5:", precision_at_k(5, top_5_nb, user_item_rating_nb_relevant))
# print("P@10:", precision_at_k(10, top_10_nb, user_item_rating_nb_relevant))
# print("P@20:", precision_at_k(20, top_20_nb, user_item_rating_nb_relevant))

P@5: (0.14794520547945267, 0.1674569722514925, 0.1674569722514925)
P@10: (0.07808219178082199, 0.17352559921053023, 0.17352559921053023)
P@20: (0.03983140147523708, 0.1746243276769131, 0.1746243276769131)
