# Sampled item evaluation protocol

This notebook aims to reproduce the results for ML1M and Pinterest reported in the paper ["Revisiting the Performance of iALS on Item Recommendation Benchmarks"](https://arxiv.org/abs/2110.14037). On these two datasets, the task for each user is to rank one held-out positive item (one the user actually interacted with) above 100 randomly selected negative items (ones the user has not interacted with).

This protocol has been widely used for recommender-system benchmarking since the [NeuMF paper](https://arxiv.org/abs/1708.05031), so below we show how to measure a recommender's performance under it. Note, however, that [a study](https://dl.acm.org/doi/10.1145/3394486.3403226) argues that ranking metrics computed this way may not be a good indicator of recommender performance.
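To make the protocol concrete, here is a minimal sketch (not irspack's implementation) of how HIT@k and NDCG@k depend only on the positive item's rank when each user has exactly one positive among the candidates. The helper names `rank_of_positive`, `hit_at_k`, and `ndcg_at_k` and the toy scores are hypothetical, and ties are assumed to count against the positive.

```python
import math


def rank_of_positive(positive_score, negative_scores):
    """0-based rank of the positive among all candidates (ties count against it)."""
    return sum(1 for s in negative_scores if s >= positive_score)


def hit_at_k(rank, k=10):
    """1.0 if the positive lands in the top k, otherwise 0.0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k=10):
    """With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# Toy user: 100 negatives scored 0.00, 0.01, ..., 0.99 and a positive scored 0.955.
negative_scores = [i / 100 for i in range(100)]
rank = rank_of_positive(0.955, negative_scores)

print(rank, hit_at_k(rank), ndcg_at_k(rank))
```

Under this protocol a scorer that orders the 101 candidates uniformly at random would place the positive in the top 10 with probability 10/101, which is a useful baseline when reading HIT@10 values.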

In [1]:
from IPython.display import clear_output
from irspack import Evaluator, IALSRecommender, df_to_sparse, split_last_n_interaction_df
from irspack.dataset.neu_mf import NeuMFML1MDownloader, NeuMFMPinterestDownloader
import numpy as np
import pandas as pd


# Either ml-1m or pinterest
DATA_TYPE = 'ml-1m'
assert DATA_TYPE in ['ml-1m', 'pinterest']

USER = 'user_id'
ITEM = 'item_id'
TIME = 'timestamp'

if DATA_TYPE == 'ml-1m':
    dm = NeuMFML1MDownloader()
else:
    dm = NeuMFMPinterestDownloader()

## Read the train & test dataset

The training set is an ordinary user/item interaction dataframe.

In [2]:
train, test = dm.read_train_test()
train.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 32 | 4 | 2001-01-06 23:38:50 |
| 1 | 0 | 34 | 4 | 2001-01-06 23:38:50 |
| 2 | 0 | 4 | 5 | 2001-01-06 23:38:11 |
| 3 | 0 | 35 | 4 | 2001-01-06 23:38:11 |
| 4 | 0 | 30 | 4 | 2001-01-06 23:38:11 |


In [3]:
item_list = sorted(list(set(train['item_id'])))
item_set = set(item_list)

## Create validation data

Split `train` into a training (`tt`) and validation (`tv`) pair.

The validation data is created in the same way as the test set.

In [4]:
g = train.groupby('user_id')['item_id']
user_id_vs_interacted_items = g.agg(set).to_dict()

rng = np.random.default_rng(0)

# tv holds each user's last interaction (by timestamp); tt holds the rest.
tt, tv = split_last_n_interaction_df(train, USER, timestamp_column=TIME, n_heldout=1)
tv['positive'] = True
dfs = []
for user_id in tv[USER]:
    items_not_interacted = list(item_set - user_id_vs_interacted_items[user_id])
    negatives = rng.choice(items_not_interacted, size=100, replace=False)
    dfs.append(pd.DataFrame({USER: user_id, ITEM: negatives}))
valid = pd.concat(dfs)
valid['positive'] = False
valid = pd.concat([valid, tv[[USER, ITEM, 'positive']]]).sort_values([USER, 'positive'])

The validation dataframe has an extra `positive` column indicating whether the user/item pair is the held-out positive (`True`) or a sampled negative (`False`).

In [5]:
valid.head()

| | user_id | item_id | positive |
|---|---|---|---|
| 0 | 0 | 1014 | False |
| 1 | 0 | 131 | False |
| 2 | 0 | 1281 | False |
| 3 | 0 | 2669 | False |
| 4 | 0 | 372 | False |
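The sampling step can be checked on toy data. The sketch below, in plain Python with a hypothetical helper `sample_candidates` and made-up item ids, mirrors the logic of the cell that builds `valid` and verifies the invariants the protocol relies on: one positive per user, the requested number of distinct negatives, and no negative drawn from the items the user has interacted with. As in the notebook, the interacted set is taken before the split, so it includes the held-out item and the positive can never be drawn as a negative.

```python
import random


def sample_candidates(all_items, interacted, positive, n_negatives, rng):
    """Return [(item, is_positive), ...]: sampled negatives followed by the positive."""
    # Negatives are drawn without replacement from items the user never touched.
    pool = sorted(all_items - interacted)
    negatives = rng.sample(pool, n_negatives)
    return [(item, False) for item in negatives] + [(positive, True)]


rng = random.Random(0)
all_items = set(range(50))   # toy catalogue
interacted = {1, 2, 3, 7}    # includes the held-out item 7
candidates = sample_candidates(all_items, interacted, positive=7, n_negatives=10, rng=rng)

negatives = [item for item, is_positive in candidates if not is_positive]
assert len(candidates) == 11                 # 10 negatives + 1 positive
assert len(set(negatives)) == 10             # no duplicates
assert not set(negatives) & interacted       # never an interacted item
```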


Let us convert the dataframes into sparse matrices.

In [6]:
X_tt, tt_users, _ = df_to_sparse(tt, USER, ITEM, item_ids=item_list)
X_tv_gt, _, __ = df_to_sparse(valid[valid['positive']], USER, ITEM, user_ids=tt_users, item_ids=item_list)
X_tv_recommendable, _, __ = df_to_sparse(valid, USER, ITEM, user_ids=tt_users, item_ids=item_list)


- Non-zeroes in `X_tv_gt` indicate the locations of the positive pairs.
- Non-zeroes in `X_tv_recommendable` are the positive and randomly selected negative pairs.
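Conceptually, the recommendable matrix restricts which items compete in the ranking, and the ground-truth matrix says which of them counts as a hit. The following plain-Python sketch illustrates that idea for a single user; it is not how irspack's `Evaluator` is implemented, and the function name `top_k_within_candidates` and the toy scores are hypothetical.

```python
def top_k_within_candidates(scores, candidates, k):
    """Rank only the candidate item indices by score and return the best k."""
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


# One user's scores over a toy catalogue of 6 items.
scores = [0.1, 0.9, 0.8, 0.3, 0.7, 0.2]
# Recommendable items for this user: the positive (item 4) plus sampled negatives.
candidates = [0, 3, 4, 5]
positive = 4

top2 = top_k_within_candidates(scores, candidates, k=2)
hit = positive in top2

print(top2, hit)
```

Items 1 and 2 have the highest raw scores, but they are not among the candidates, so they cannot push the positive out of the top-k list.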

During parameter tuning, I found that starting pruning too early harms the final recommendation quality.
We can control the pruning parameters by explicitly providing an `optuna.Study`. In the cell below, pruning is delayed by setting `prunning_n_startup_trials=20`.

In [7]:
validation_evaluator = Evaluator(X_tv_gt, per_user_recommendable_items=X_tv_recommendable, cutoff=10)
best_parameter, validation_recoder = IALSRecommender.tune(
    X_tt, validation_evaluator, fixed_params=dict(n_components=192),
    n_trials=40, random_seed=0, prunning_n_startup_trials=20
)
clear_output()

In [8]:
X_train_all, user_ids, _ = df_to_sparse(train, USER, ITEM, item_ids=item_list)
X_test_gt, _, __ = df_to_sparse(test[test["positive"]], USER, ITEM, user_ids=user_ids, item_ids=item_list)
X_test_recommendable, _, __ = df_to_sparse(test, USER, ITEM, user_ids=user_ids, item_ids=item_list)

NDCG@10 and HIT@10 are similar to those reported in the reference.

In [9]:
Evaluator(X_test_gt, per_user_recommendable_items=X_test_recommendable, cutoff=10).get_score(
    IALSRecommender(X_train_all, **best_parameter).learn()
)

{'appeared_item': 2714.0,
 'entropy': 7.405239384868466,
 'gini_index': 0.6666932769922957,
 'hit': 0.7310367671414376,
 'map': 0.3656433198210277,
 'n_items': 3704.0,
 'ndcg': 0.4521301417636353,
 'precision': 0.07310367671414376,
 'recall': 0.7310367671414376,
 'total_user': 6040.0,
 'valid_user': 6038.0}
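Because every user has exactly one held-out positive, several of the reported metrics are tied together: recall@10 coincides with HIT@10, and precision@10 is HIT@10 divided by the cutoff. The short check below uses only the numbers printed above and assumes the usual definitions of these metrics; the variable names are illustrative.

```python
# Values copied from the evaluator output above.
scores = {
    "hit": 0.7310367671414376,
    "precision": 0.07310367671414376,
    "recall": 0.7310367671414376,
    "valid_user": 6038.0,
}
cutoff = 10

# One positive per user: a hit means the single relevant item was retrieved.
assert scores["recall"] == scores["hit"]

# At most one of the 10 recommended items can be relevant.
implied_precision = scores["hit"] / cutoff
assert abs(implied_precision - scores["precision"]) < 1e-12

# Number of evaluated users whose positive reached the top 10.
implied_hits = round(scores["hit"] * scores["valid_user"])
print(implied_hits)
```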