# Evaluation of Recommender Systems

Based on the same dataset used on previous weeks, let us evaluate the Collaborative Filtering (CF) models implemented last week.

## Exercise 1

1. Load the test set and the predictions made with both Collaborative Filtering models in the previous session. 
2. Detect those users which are in the training set but not in the test set. Remove their predictions before evaluating the systems.
3. Report the Root Mean Square Error (RMSE) for both CF models defined in the previous session.

In [15]:
import os
import sys
sys.path.append('../')
import pickle
import pandas as pd
import numpy as np

In [16]:

# TEST
df_test = pd.read_pickle("testset.pkl")[["reviewerID", "asin", "overall"]]
df_test = df_test.rename(columns={"reviewerID": "uid", "asin": "iid"})

# PREDICTIONS
nb = pd.read_pickle("preds1.pkl")
lf = pd.read_pickle("preds2.pkl")
pred_nb_list = list(nb.itertuples(index=False))
pred_lf_list = list(lf.itertuples(index=False))


# Detect users from training set that are not in test
nb_users = set([pred.uid for pred in pred_nb_list])
lf_users = set([pred.uid for pred in pred_lf_list])
nb_users_in_pred_but_not_in_test = list(nb_users.difference(set(df_test['uid'])))
lf_users_in_pred_but_not_in_test = list(lf_users.difference(set(df_test['uid'])))
assert nb_users_in_pred_but_not_in_test == lf_users_in_pred_but_not_in_test
print(f"There are {len(lf_users_in_pred_but_not_in_test)} users in the training set that are not in the test set.")

# Remove these users' predictions for evaluation
print("Lengths before removing preds not in test set:", len(nb), len(lf))
nb = nb[~nb.uid.isin(nb_users_in_pred_but_not_in_test)]
lf = lf[~lf.uid.isin(nb_users_in_pred_but_not_in_test)]
print("After removing:", len(nb), len(lf))

nb_merge = nb.merge(df_test, how="inner", on=["uid", "iid"])
print("\nkNN RMSE:", np.sqrt((nb_merge["overall"] - nb_merge["est"])**2).mean())

lf_merge = lf.merge(df_test, how="inner", on=["uid", "iid"])
print("SVD RMSE:", np.sqrt((lf_merge["overall"] - lf_merge["est"])**2).mean())

There are 32 users in the training set that are not in the test set.
Lengths before removing preds not in test set: 72404 72404
After removing: 70070 70070

kNN RMSE: 0.06964517083597653
SVD RMSE: 0.07125026112605343


## Exercise 2
Define a general method to get the top-k recommendations for each user. Print the top-k with k={5, 10} recommendations for the user with ID 'ARARUVZ8RUF5T' and its estimated ratings.

In [17]:
top5nb = nb.groupby(['uid']).apply(lambda x: x.nlargest(5,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
top5nb[top5nb["uid"]=="ARARUVZ8RUF5T"][["iid","est"]]

Unnamed: 0,iid,est
4440,B000VV1YOY,5.0
4441,B000WR2HB6,5.0
4442,B000PKKAGO,4.75
4443,B000FOI48G,4.675
4444,B001ET7FZE,4.5


## Excercise 3
Report Precision@k (P@k), MAP@k and the MRR@k with k={5, 10, 20} averaged across users for both CF systems. When computing precision, we consider as relevant items those with an observed rating >= 4.0 (i.e., those items from the test set with a rating >= 4.0). Reflect on the differences obtained. 

In [18]:
def relevant_column(preds, df_test, k): 
    topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(k,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
    merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="left", on=["uid", "iid"])
    merged["relevant"] = (merged["overall"] >= 4) * 1 
    return merged

In [19]:
def PatK(preds, df_test, k):
    merged = relevant_column(preds, df_test, k)
    score  = merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].mean().mean()
    return score

PatK(nb, df_test, 5), PatK(lf, df_test, 5)

(0.004636459430979979, 0.00969441517386723)

In [20]:
def MAPatK(preds, df_test, k):
    merged = relevant_column(preds, df_test, k)
    score  = merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: 1./np.arange(1,k+1) @ x).mean()
    return score

MAPatK(nb, df_test, 5), MAPatK(lf, df_test, 5), 

(0.008201615735862311, 0.036178433438707414)

In [21]:
# k = 5
# preds = lf
# topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(k,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
# merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="left", on=["uid", "iid"])
# merged["relevant"] = (merged["overall"] >= 4) * 1 
# merged[["uid", "iid", "relevant"]].groupby(by="uid").apply(first).mean()

In [22]:
# inverse of rank position of first relevant item
def first(x):
    for i in range(len(x)):
        if x.iloc[i].relevant == 1:
            return 1/(i+1)
    return 0 

#slide 47 lecture 3
def MRRatK(preds, df_test, k):
    merged = relevant_column(preds, df_test, k)
    score  = merged[["uid", "iid", "relevant"]].groupby(by="uid").apply(first).mean()
    return score

MRRatK(nb, df_test, 5), MRRatK(lf, df_test, 5), 

(0.008201615735862311, 0.036178433438707414)

In [23]:
def HRatK(preds, df_test, k):
    merged = relevant_column(preds, df_test, k)
    score = merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: x.any()*1).mean()
    return score
HRatK(nb, df_test, 5), HRatK(lf, df_test, 5)

(0.023182297154899896, 0.04847207586933614)

In [24]:
ks = [5, 10, 20]

print(12*" " + "NB | LF")

for k in ks:
    P_nb, P_lf = PatK(nb, df_test, k), PatK(lf, df_test, k)
    MAP_nb, MAP_lf = MAPatK(nb, df_test, k), MAPatK(lf, df_test, k)
    MRR_nb, MRR_lf = MRRatK(nb, df_test, k), MRRatK(lf, df_test, k)
    print(f"  P@{k:2g} = {P_nb  :.4f}|{P_lf  :.4f}")
    print(f"MAP@{k:2g} = {MAP_nb:.4f}|{MAP_lf:.4f}")
    print(f"MRR@{k:2g} = {MRR_nb:.4f}|{MRR_lf:.4f}\n")


            NB | LF
  P@ 5 = 0.0046|0.0097
MAP@ 5 = 0.0082|0.0362
MRR@ 5 = 0.0082|0.0362

  P@10 = 0.0208|0.0128
MAP@10 = 0.0313|0.0452
MRR@10 = 0.0313|0.0452

  P@20 = 0.0455|0.0349
MAP@20 = 0.0837|0.0834
MRR@20 = 0.0837|0.0834



## Excercise 4

Based on the top-5, top-10 and top-20 predictions from Exercise 2, compute the systemsâ€™ hit rate averaged over the total number of users in the test set.

In [25]:
merged = relevant_column(nb, df_test, 5)
scores = merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: x.any()*1)
scores.mean()

0.023182297154899896

In [26]:
ks = [5, 10, 20]

print(12*" " + "NB | LF")

for k in ks:
    MHR_nb, MHR_lf = HRatK(nb, df_test, k), HRatK(lf, df_test, k)
    print(f"MHR@{k:2g} = {MHR_nb:.4f}|{MHR_lf:.4f}")

            NB | LF
MHR@ 5 = 0.0232|0.0485
MHR@10 = 0.2076|0.1275
MHR@20 = 0.9104|0.6976
