# Evaluation of Recommender Systems

Based on the same dataset used on previous weeks, let us evaluate the Collaborative Filtering (CF) models implemented last week.

## Exercise 1

1. Load the test set and the predictions made with both Collaborative Filtering models in the previous session. 
2. Detect those users which are in the training set but not in the test set. Remove their predictions before evaluating the systems.
3. Report the Root Mean Square Error (RMSE) for both CF models defined in the previous session.

In [1]:
import os
import sys
sys.path.append('../')
import pickle
import pandas as pd
import numpy as np

In [2]:

# TEST
df_test = pd.read_pickle("testset.pkl")[["reviewerID", "asin", "overall"]]
df_test = df_test.rename(columns={"reviewerID": "uid", "asin": "iid"})

# PREDICTIONS
nb = pd.read_pickle("preds1.pkl")
lf = pd.read_pickle("preds2.pkl")
pred_nb_list = list(nb.itertuples(index=False))
pred_lf_list = list(lf.itertuples(index=False))


# Detect users from training set that are not in test
nb_users = set([pred.uid for pred in pred_nb_list])
lf_users = set([pred.uid for pred in pred_lf_list])
nb_users_in_pred_but_not_in_test = list(nb_users.difference(set(df_test['uid'])))
lf_users_in_pred_but_not_in_test = list(lf_users.difference(set(df_test['uid'])))
assert nb_users_in_pred_but_not_in_test == lf_users_in_pred_but_not_in_test
print(f"There are {len(lf_users_in_pred_but_not_in_test)} users in the training set that are not in the test set.")

# Remove these users' predictions for evaluation
print("Lengths before removing preds not in test set:", len(nb), len(lf))
nb = nb[~nb.uid.isin(nb_users_in_pred_but_not_in_test)]
lf = lf[~lf.uid.isin(nb_users_in_pred_but_not_in_test)]
print("After removing:", len(nb), len(lf))

nb_merge = nb.merge(df_test, how="inner", on=["uid", "iid"])
print("\nkNN RMSE:", np.sqrt((nb_merge["overall"] - nb_merge["est"])**2).mean())

lf_merge = lf.merge(df_test, how="inner", on=["uid", "iid"])
print("SVD RMSE:", np.sqrt((lf_merge["overall"] - lf_merge["est"])**2).mean())

There are 32 users in the training set that are not in the test set.
Lengths before removing preds not in test set: 72404 72404
After removing: 70070 70070

kNN RMSE: 0.07280920646947436
SVD RMSE: 0.06459303590839446


## Exercise 2
Define a general method to get the top-k recommendations for each user. Print the top-k with k={5, 10} recommendations for the user with ID 'ARARUVZ8RUF5T' and its estimated ratings.

In [3]:
top5nb = nb.groupby(['uid']).apply(lambda x: x.nlargest(5,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
top5nb[top5nb["uid"]=="ARARUVZ8RUF5T"][["iid","est"]]

Unnamed: 0,iid,est
4440,B000WR2HB6,5.0
4441,B000VV1YOY,5.0
4442,B016V8YWBC,4.777778
4443,B000GLRREU,4.675
4444,B01BNEYGQU,4.6


In [19]:

topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(3,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="outer", on=["uid", "iid"])
merged["relevant"] = (merged["overall"] >= 4) * 1
merged


Unnamed: 0,uid,iid,est,overall,relevant
0,A105A034ZG9EHO,B000URXP6E,5.0,5.0,1
1,A105A034ZG9EHO,B00006L9LC,5.0,,0
2,A105A034ZG9EHO,B001OHV1H4,5.0,,0
3,A10JB7YPWZGRF4,B000FI4S1E,5.0,5.0,1
4,A10JB7YPWZGRF4,B00006L9LC,5.0,,0
...,...,...,...,...,...
2966,AQ6AGWKWMTY8H,B001E96LUO,,4.0,1
2967,ARARUVZ8RUF5T,B019FWRG3C,,5.0,1
2968,AUAC7GPINCR54,B019809F9Y,,5.0,1
2969,AYNTULRNAIPNY,B0010ZBORW,,4.0,1


In [44]:
topKpreds[:5]

Unnamed: 0,uid,iid,est
0,A105A034ZG9EHO,B000URXP6E,5.0
1,A105A034ZG9EHO,B00006L9LC,5.0
2,A105A034ZG9EHO,B001OHV1H4,5.0
3,A10JB7YPWZGRF4,B000FI4S1E,5.0
4,A10JB7YPWZGRF4,B00006L9LC,5.0


In [51]:
1./np.arange(1,k+1) @  topKpreds[:5]["est"]#.astype(np.float32)

11.416666666666666

In [55]:
1./np.arange(1,k+1) 

array([1.        , 0.5       , 0.33333333, 0.25      , 0.2       ])

In [62]:
k = 5
preds = lf

topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(k,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="left", on=["uid", "iid"])
merged["relevant"] = (merged["overall"] >= 4) * 1 

merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: 1./np.arange(1,k+1) @ x ).mean()
# merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: np.mean(x)).mean()


0.2236916051984545

In [37]:
def PatK(preds, df_test, k):
    topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(k,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
    merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="left", on=["uid", "iid"])
    merged["relevant"] = (merged["overall"] >= 4) * 1
    return merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].mean().mean()
PatK(nb, df_test, 5), PatK(lf, df_test, 5), 

(0.17618545837723917, 0.05395152792413067)

In [63]:
def MAPatK(preds, df_test, k):
    topKpreds = preds.groupby(['uid']).apply(lambda x: x.nlargest(k,['est'])).reset_index(drop=True)[["uid", "iid", "est"]]
    merged = topKpreds.merge(df_test[["uid", "iid", "overall"]], how="left", on=["uid", "iid"])
    merged["relevant"] = (merged["overall"] >= 4) * 1 
    return merged[["uid", "iid", "relevant"]].groupby(by="uid")["relevant"].apply(lambda x: 1./np.arange(1,k+1) @ x ).mean()
MAPatK(nb, df_test, 5), MAPatK(lf, df_test, 5), 

(0.6114506498068142, 0.2236916051984545)

In [None]:
def MRRatK(preds, df_test, k):
    

## Excercise 3
Report Precision@k (P@k), MAP@k and the MRR@k with k={5, 10, 20} averaged across users for both CF systems. When computing precision, we consider as relevant items those with an observed rating >= 4.0 (i.e., those items from the test set with a rating >= 4.0). Reflect on the differences obtained. 

In [None]:
df_test["reviewerID"]

In [None]:
df_test.index

In [None]:
relevant = df_test[df_test["overall"]>=4]

In [None]:
uid = nb.uid[500]
top5nb[top5nb.uid==uid]["est"]
df_test[df_test["reviewerID"]==uid]["asin"]

In [None]:
top5nb = nb.groupby(['uid']).apply(lambda x: x.nlargest(5,['est'])).reset_index(drop=True)
# top5nb["relevant"] = 
top5nb

## Excercise 4

Based on the top-5, top-10 and top-20 predictions from Exercise 2, compute the systems’ hit rate averaged over the total number of users in the test set.