# Sampled item evaluation protocol

This notebook aims to reproduce the ML1M and Pinterest results from the paper ["Revisiting the Performance of iALS on Item Recommendation Benchmarks"](https://arxiv.org/abs/2110.14037). On these two datasets, each user has one held-out positive item (an item the user actually interacted with), and the task is to rank it above 100 randomly selected negative items (items the user has not interacted with).

This protocol has been widely used for recommender-system benchmarking since the [NeuMF paper](https://arxiv.org/abs/1708.05031), so below we show how to measure a recommender's performance under it. Note, however, that [a study](https://dl.acm.org/doi/10.1145/3394486.3403226) asserts that ranking metrics computed this way may not be a good indicator of recommender performance.
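To make the protocol concrete, here is a minimal sketch of how HIT@10 and NDCG@10 can be computed for a single user who has one positive among 101 candidates. The function names and the toy scores are assumptions made for illustration; this is not how irspack computes its metrics internally.

```python
import math


def rank_of_positive(scores, positive_index):
    """1-based rank of the positive item when candidates are sorted by descending score."""
    positive_score = scores[positive_index]
    # Count candidates scored strictly higher than the positive
    # (ties are resolved in favor of the positive in this sketch).
    return 1 + sum(1 for s in scores if s > positive_score)


def hit_at_k(rank, k=10):
    """1.0 if the positive appears within the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k=10):
    """With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy example: one positive (index 0) and 100 negatives with made-up scores 0..99.
scores = [90.5] + list(range(100))
rank = rank_of_positive(scores, positive_index=0)

print(rank)             # 10 (nine negatives, 91..99, score higher)
print(hit_at_k(rank))   # 1.0
print(ndcg_at_k(rank))  # 1 / log2(11), roughly 0.289
```

Averaging these per-user values over all users gives the dataset-level scores. Because there is exactly one positive per user, HIT@10 and recall@10 coincide under this protocol.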

In [1]:
from irspack import Evaluator, IALSRecommender, IALSOptimizer, df_to_sparse, split_last_n_interaction_df
from irspack.dataset.neu_mf import NeuMFML1MDownloader, NeuMFMPinterestDownloader
import numpy as np
import pandas as pd


# Either ml-1m or pinterest
DATA_TYPE = 'ml-1m'
assert DATA_TYPE in ['ml-1m', 'pinterest']

USER = 'user_id'
ITEM = 'item_id'
TIME = 'timestamp'

if DATA_TYPE == 'ml-1m':
    dm = NeuMFML1MDownloader()
else:
    dm = NeuMFMPinterestDownloader()

## Read the train & test dataset

The train set is an ordinary user/item interaction dataframe.

In [2]:
train, test = dm.read_train_test()
train.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 32 | 4 | 2001-01-06 23:38:50 |
| 1 | 0 | 34 | 4 | 2001-01-06 23:38:50 |
| 2 | 0 | 4 | 5 | 2001-01-06 23:38:11 |
| 3 | 0 | 35 | 4 | 2001-01-06 23:38:11 |
| 4 | 0 | 30 | 4 | 2001-01-06 23:38:11 |


In [3]:
item_list = sorted(list(set(train['item_id'])))
item_set = set(item_list)

## Create validation data

Split `train` into a training part (`tt`) and a validation part (`tv`).

The validation data is created in the same way as the test set: each user's held-out positive is paired with 100 randomly sampled negatives.

In [4]:
# Items each user has interacted with anywhere in `train` (including the held-out one).
user_id_vs_interacted_items = train.groupby(USER)[ITEM].agg(set).to_dict()

rng = np.random.default_rng(0)

# tv holds each user's last interaction; tt holds the rest.
tt, tv = split_last_n_interaction_df(train, USER, timestamp_column=TIME, n_heldout=1)
tv['positive'] = True
dfs = []
for user_id in tv[USER]:
    # Sample 100 distinct negatives from the items this user never interacted with.
    items_not_interacted = list(item_set - user_id_vs_interacted_items[user_id])
    negatives = rng.choice(items_not_interacted, size=100, replace=False)
    dfs.append(pd.DataFrame({USER: user_id, ITEM: negatives}))
valid = pd.concat(dfs)
valid['positive'] = False
# Append the positives; sorting on 'positive' puts each user's negatives first.
valid = pd.concat([valid, tv[[USER, ITEM, 'positive']]]).sort_values([USER, 'positive'])

The validation dataframe has an extra `positive` column that indicates whether the user/item pair is the held-out positive.

In [5]:
valid.head()

| | user_id | item_id | positive |
|---|---|---|---|
| 0 | 0 | 1014 | False |
| 1 | 0 | 131 | False |
| 2 | 0 | 1281 | False |
| 3 | 0 | 2669 | False |
| 4 | 0 | 372 | False |
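Before moving on, it can be worth checking that every user really has one positive and 100 distinct negatives, with no overlap between the two. The helper below is a hypothetical sketch (the name `check_sampled_protocol` is not part of irspack). It takes any iterable of `(user_id, item_id, positive)` records, which could be obtained from the dataframe with `valid[[USER, ITEM, 'positive']].itertuples(index=False)`.

```python
from collections import defaultdict


def check_sampled_protocol(records, n_negatives=100):
    """Return the users whose candidates violate the 1-positive / n-negatives layout."""
    positives = defaultdict(set)
    negatives = defaultdict(set)
    for user_id, item_id, is_positive in records:
        (positives if is_positive else negatives)[user_id].add(item_id)

    bad_users = []
    for user_id in set(positives) | set(negatives):
        ok = (
            len(positives[user_id]) == 1
            and len(negatives[user_id]) == n_negatives
            and positives[user_id].isdisjoint(negatives[user_id])
        )
        if not ok:
            bad_users.append(user_id)
    return sorted(bad_users)


# Toy data: user 0 is well-formed, user 1 has its positive repeated among the negatives.
toy = [
    (0, 5, True), (0, 1, False), (0, 2, False),
    (1, 7, True), (1, 7, False), (1, 3, False),
]
bad = check_sampled_protocol(toy, n_negatives=2)
print(bad)  # [1]
```

An empty list means that every user's candidate set has the expected shape.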


Let us convert the dataframes into sparse matrices.

In [6]:
X_tt, tt_users, _ = df_to_sparse(tt, USER, ITEM, item_ids=item_list)
X_tv_gt, _, __ = df_to_sparse(valid[valid['positive']], USER, ITEM, user_ids=tt_users, item_ids=item_list)
X_tv_recommendable, _, __ = df_to_sparse(valid, USER, ITEM, user_ids=tt_users, item_ids=item_list)


- Non-zeroes in `X_tv_gt` indicate the locations of the positive pairs.
- Non-zeroes in `X_tv_recommendable` are the positive and the randomly selected negative pairs.
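The role of the two matrices can be illustrated with a small dense example: the recommendable matrix restricts which items may be ranked for each user, and the ground-truth matrix says which of them counts as a hit. This is a sketch of the idea with made-up scores and a cutoff of 2; it is not irspack's internal implementation.

```python
import numpy as np

# Model scores for one user over six items (made-up numbers).
scores = np.array([[0.2, 0.9, 0.5, 0.7, 0.1, 0.8]])
# Candidate set: the held-out positive plus the sampled negatives.
recommendable = np.array([[1, 0, 1, 1, 0, 0]], dtype=bool)
# Ground truth: the held-out positive only.
ground_truth = np.array([[0, 0, 1, 0, 0, 0]], dtype=bool)

# Items outside the candidate set can never be recommended.
masked = np.where(recommendable, scores, -np.inf)
top_k = np.argsort(-masked, axis=1)[:, :2]  # cutoff k = 2

# A user is a hit if any of their top-k items is a ground-truth positive.
rows = np.arange(top_k.shape[0])[:, None]
hit = ground_truth[rows, top_k].any(axis=1)

print(top_k)  # [[3 2]]
print(hit)    # [ True]
```

Without the mask, the two highest-scoring items would be items 1 and 5, neither of which is in the candidate set, so the same user would count as a miss.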

While tuning the parameters, I found that starting pruning too early harms the final recommendation quality.
We can control the pruning parameters by explicitly providing an `optuna.Study`.
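To see why a minimum trial count delays pruning, consider the following simplified sketch of a median rule for a minimized objective. It is an illustration under stated assumptions, not Optuna's actual implementation: the function name `should_prune` and its arguments are hypothetical, and the real pruner has further options such as startup trials and warm-up steps.

```python
from statistics import median


def should_prune(current_best, others_at_step, n_min_trials=20):
    """Simplified median rule for a minimized objective.

    current_best   : best (lowest) intermediate value reported so far by the running trial
    others_at_step : intermediate values that earlier trials reported at the same step
    """
    # Too few reference values at this step: never prune.
    if len(others_at_step) < n_min_trials:
        return False
    # Prune when the running trial is worse than the median of the earlier trials.
    return current_best > median(others_at_step)


# Made-up history of three earlier trials (negated scores, lower is better).
history = [-0.45, -0.44, -0.43]

print(should_prune(-0.30, history, n_min_trials=20))  # False: only 3 reference values
print(should_prune(-0.30, history, n_min_trials=3))   # True: -0.30 is worse than the median -0.44
```

With a larger threshold, early trials always run to completion, so the median used later is computed from a broader set of configurations.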

In [7]:
import optuna

study = optuna.create_study(
    pruner=optuna.pruners.MedianPruner(n_min_trials=20),
    sampler=optuna.samplers.TPESampler(seed=0)
)
validation_evaluator = Evaluator(X_tv_gt, per_user_recommendable_items=X_tv_recommendable, cutoff=10)
best_parameter, validation_recorder = IALSOptimizer(
    X_tt, validation_evaluator, fixed_params=dict(n_components=192)
).optimize_with_study(study, n_trials=40)

[IRSPACK:I 2022-06-18 08:22:32,738] Config 30 obtained the following scores: {'appeared_item': 2737.0, 'entropy': 7.388188803893926, 'gini_index': 0.672751231888204, 'hit': 0.741225165562914, 'map': 0.36230158730158746, 'n_items': 3704.0, 'ndcg': 0.45201342424819135, 'precision': 0.07412251655629137, 'recall': 0.741225165562914, 'total_user': 6040.0, 'valid_user': 6040.0} within 2.528101 seconds.
[I 2022-06-18 08:22:32,739] Trial 30 finished with value: -0.45201342424819135 and parameters: {'alpha0': 0.06550696205925911, 'reg': 0.008018740558286335}. Best is trial 29 with value: -0.45715350596349286.
[IRSPACK:I 2022-06-18 08:22:32,748] Trial 31:
[IRSPACK:I 2022-06-18 08:22:32,749] parameter = {'alpha0': 0.11161354265163098, 'reg': 0.002921810545550655, 'n_components': 192}
[I 2022-06-18 08:22:33,870] Trial 31 pruned.
[IRSPACK:I 2022-06-18 08:22:33,879] Trial 32:
[IRSPACK:I 2022-06-18 08:22:33,880] parameter = {'alpha0': 0.1815237716363487, 'reg': 0.005541874501225793, 'n_components': 192}
[I 2022-06-18 08:22:34,150] Trial 32 pruned.
[IRSPACK:I 2022-06-18 08:22:34,158] Trial 33:
[IRSPACK:I 2022-06-18 08:22:34,159] parameter = {'alpha0': 0.09433547150366324, 'reg': 0.023452818918026867, 'n_components': 192}
[I 2022-06-18 08:22:34,429] Trial 33 pruned.
[IRSPACK:I 2022-06-18 08:22:34,438] Trial 34:
[IRSPACK:I 2022-06-18 08:22:34,439] parameter = {'alpha0': 0.025909271245252625, 'reg': 0.009492148252279448, 'n_components': 192}
[IRSPACK:I 2022-06-18 08:22:37,600] Config 34 obtained the following scores: {'appeared_item': 2689.0, 'entropy': 7.285864754091686, 'gini_index': 0.706350026103872, 'hit': 0.7322847682119206, 'map': 0.3520368180384737, 'n_items': 3704.0, 'ndcg': 0.441952566073967, 'precision': 0.07322847682119206, 'recall': 0.7322847682119206, 'total_user': 6040.0, 'valid_user': 6040.0} within 3.170762 seconds.
[I 2022-06-18 08:22:37,602] Trial 34 finished with value: -0.441952566073967 and parameters: {'alpha0': 0.025909271245252625, 'reg': 0.009492148252279448}. Best is trial 29 with value: -0.45715350596349286.
[IRSPACK:I 2022-06-18 08:22:37,611] Trial 35:
[IRSPACK:I 2022-06-18 08:22:37,611] parameter = {'alpha0': 0.6613351951462518, 'reg': 0.003978156749885238, 'n_components': 192}
[I 2022-06-18 08:22:37,878] Trial 35 pruned.
[IRSPACK:I 2022-06-18 08:22:37,887] Trial 36:
[IRSPACK:I 2022-06-18 08:22:37,888] parameter = {'alpha0': 0.11884551336807544, 'reg': 0.0019010987778407956, 'n_components': 192}
[I 2022-06-18 08:22:38,586] Trial 36 pruned.
[IRSPACK:I 2022-06-18 08:22:38,595] Trial 37:
[IRSPACK:I 2022-06-18 08:22:38,595] parameter = {'alpha0': 0.07794758005102936, 'reg': 0.014163842943737471, 'n_components': 192}
[I 2022-06-18 08:22:38,870] Trial 37 pruned.
[IRSPACK:I 2022-06-18 08:22:38,879] Trial 38:
[IRSPACK:I 2022-06-18 08:22:38,880] parameter = {'alpha0': 0.04587692461736193, 'reg': 0.007922630332602817, 'n_components': 192}
[IRSPACK:I 2022-06-18 08:22:41,402] Config 38 obtained the following scores: {'appeared_item': 2765.0, 'entropy': 7.373578701117112, 'gini_index': 0.6780481097936006, 'hit': 0.7379139072847682, 'map': 0.35815627299484903, 'n_items': 3704.0, 'ndcg': 0.4479948470243146, 'precision': 0.0737913907284768, 'recall': 0.7379139072847682, 'total_user': 6040.0, 'valid_user': 6040.0} within 2.531258 seconds.
[I 2022-06-18 08:22:41,404] Trial 38 finished with value: -0.4479948470243146 and parameters: {'alpha0': 0.04587692461736193, 'reg': 0.007922630332602817}. Best is trial 29 with value: -0.45715350596349286.
[IRSPACK:I 2022-06-18 08:22:41,413] Trial 39:
[IRSPACK:I 2022-06-18 08:22:41,413] parameter = {'alpha0': 0.3323156467701052, 'reg': 0.0028721481496044128, 'n_components': 192}
[IRSPACK:I 2022-06-18 08:22:43,290] Config 39 obtained the following scores: {'appeared_item': 2624.0, 'entropy': 7.4281180751395715, 'gini_index': 0.6569347617753494, 'hit': 0.7299668874172185, 'map': 0.3592799327236413, 'n_items': 3704.0, 'ndcg': 0.44715632826157664, 'precision': 0.07299668874172188, 'recall': 0.7299668874172185, 'total_user': 6040.0, 'valid_user': 6040.0} within 1.885472 seconds.
[I 2022-06-18 08:22:43,292] Trial 39 finished with value: -0.44715632826157664 and parameters: {'alpha0': 0.3323156467701052, 'reg': 0.0028721481496044128}. Best is trial 29 with value: -0.45715350596349286.


In [8]:
X_train_all, user_ids, _ = df_to_sparse(train, USER, ITEM, item_ids=item_list)
X_test_gt, _, __ = df_to_sparse(test[test["positive"]], USER, ITEM, user_ids=user_ids, item_ids=item_list)
X_test_recommendable, _, __ = df_to_sparse(test, USER, ITEM, user_ids=user_ids, item_ids=item_list)

The resulting NDCG@10 and HIT@10 are similar to those reported in the reference.

In [9]:
Evaluator(X_test_gt, per_user_recommendable_items=X_test_recommendable, cutoff=10).get_score(
    IALSRecommender(X_train_all, **best_parameter).learn()
)

{'appeared_item': 2714.0,
 'entropy': 7.405239384868466,
 'gini_index': 0.6666932769922957,
 'hit': 0.7310367671414376,
 'map': 0.36564331982102766,
 'n_items': 3704.0,
 'ndcg': 0.4521301417636353,
 'precision': 0.07310367671414375,
 'recall': 0.7310367671414376,
 'total_user': 6040.0,
 'valid_user': 6038.0}