# RePlay recommender models comparison

We will show the main RePlay functionality and compare the performance of RePlay models on the well-known MovieLens dataset. If you have not used RePlay before, start with 01_replay_basics.ipynb, which introduces the base concepts and describes the main classes and functionality.

### Dataset
We will compare RePlay models on __MovieLens 1m__. 

### Dataset preprocessing: 
Ratings greater than or equal to 3 are considered as positive interactions.
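
The thresholding rule above can be sketched in plain Python. This is only an illustration on hypothetical `(user_id, item_id, rating)` tuples; the helper name `keep_positives` is an assumption of the sketch, and the notebook itself does this step with a Spark filter.

```python
# Hypothetical (user_id, item_id, rating) rows used only for illustration.
ratings = [(1, 1193, 5), (1, 661, 3), (2, 1357, 2)]

POSITIVE_THRESHOLD = 3  # ratings >= 3 count as positive interactions


def keep_positives(rows, threshold=POSITIVE_THRESHOLD):
    """Keep rows with rating >= threshold and replace the rating with relevance 1."""
    return [(user, item, 1) for user, item, rating in rows if rating >= threshold]


positives = keep_positives(ratings)
print(positives)
```

Rows below the threshold are dropped here rather than labeled 0.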

### Data split
The dataset is split by date so that the last 20% of interactions are placed in the test part. Cold items and users are dropped.
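
As a rough illustration of this kind of split, the sketch below sorts a toy log by time, sends the latest 20% of rows to test, and then drops test rows whose user or item never occurs in train. The function `time_split` and the toy data are assumptions made for the example; the library's splitter may pick the time threshold differently.

```python
def time_split(log, test_fraction=0.2):
    """Split (user, item, timestamp) rows by time; the latest rows go to test.

    Test rows whose user or item never appears in train (cold) are dropped.
    """
    ordered = sorted(log, key=lambda row: row[2])
    n_train = len(ordered) - int(len(ordered) * test_fraction)
    train, test = ordered[:n_train], ordered[n_train:]
    warm_users = {user for user, _, _ in train}
    warm_items = {item for _, item, _ in train}
    test = [row for row in test if row[0] in warm_users and row[1] in warm_items]
    return train, test


# Toy log with integer timestamps (hypothetical data).
log = [
    ("u1", "i1", 1), ("u2", "i1", 2), ("u1", "i2", 3), ("u2", "i2", 4),
    ("u3", "i1", 5), ("u1", "i3", 6), ("u2", "i3", 7), ("u3", "i2", 8),
    ("u4", "i1", 9),   # cold user: appears only in the test period
    ("u3", "i3", 10),
]
train, test = time_split(log)
print(len(train), test)
```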

### Predict:
We will predict top-10 most relevant films for each user.
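
Selecting the top-k items for one user can be sketched as follows. The score dictionary, the `top_k` helper, and the choice to skip already seen items are assumptions of this example, not a description of how any particular model produces its recommendations.

```python
import heapq


def top_k(scores, seen, k=10):
    """Return the k highest-scored item ids for one user, skipping seen items."""
    candidates = ((score, item) for item, score in scores.items() if item not in seen)
    return [item for score, item in heapq.nlargest(k, candidates)]


# Hypothetical model scores for a single user.
scores = {"i1": 0.9, "i2": 0.4, "i3": 0.7, "i4": 0.1}
recs = top_k(scores, seen={"i1"}, k=2)
print(recs)
```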

### Metrics
Quality metrics used: __ndcg@k, hitrate@k, map@k, mrr@k__. HitRate is reported for k = 1, 5, 10; the other metrics for k = 10.

Additional metrics used: __coverage@k__ and __surprisal@k__.
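
For readers who want to see the ranking metrics spelled out, here is a minimal per-user sketch with binary relevance. The normalization choices (average precision divided by k, ideal DCG truncated at the number of relevant items) are assumptions of this sketch and conventions differ between libraries. The reported value is the mean over users.

```python
import math


def hit_rate(recs, relevant, k):
    """1.0 if any of the top-k recommendations is relevant, else 0.0."""
    return float(any(item in relevant for item in recs[:k]))


def mrr(recs, relevant, k):
    """Reciprocal rank of the first relevant item in the top-k, 0.0 if none."""
    for rank, item in enumerate(recs[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(recs, relevant, k):
    """Sum of precision@rank over relevant positions, divided by k (assumed convention)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recs[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / k


def ndcg(recs, relevant, k):
    """Binary-relevance DCG divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recs[:k], start=1) if item in relevant)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / idcg if idcg else 0.0


# Toy example: relevant items sit at ranks 2 and 4.
recs, relevant = ["a", "b", "c", "d"], {"b", "d"}
print(hit_rate(recs, relevant, 4), mrr(recs, relevant, 4),
      average_precision(recs, relevant, 4), round(ndcg(recs, relevant, 4), 4))
```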

In [1]:
!pip install -q rs-datasets

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
%config Completer.use_jedi = False

In [4]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [5]:
import logging
import time

from pyspark.sql import functions as sf, types as st
from pyspark.sql.types import IntegerType
from rs_datasets import MovieLens
from sklearn.cluster import KMeans

from replay.experimental.models import ULinUCB, HierarchicalRecommender
from replay.experimental.models.base_rec import HybridRecommender
from replay.experimental.preprocessing.data_preparator import DataPreparator, Indexer
from replay.metrics import Experiment
from replay.metrics import Coverage, HitRate, MRR, MAP, NDCG, Surprisal
from replay.models import PopRec, RandomRec, UCB, Wilson
from replay.utils.session_handler import State
from replay.splitters import TimeSplitter
from replay.utils.spark_utils import get_log_info

The `State` object lets you pass an existing Spark session or create a new one, which is then used by all RePlay modules.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` parameters, use the `get_spark_session` function from the `session_handler` module.

In [None]:
spark = State().session
spark

In [7]:
spark.sparkContext.setLogLevel('ERROR')

In [8]:
logger = logging.getLogger("replay")

In [9]:
K = 10
K_list_metrics = [1, 5, 10]
BUDGET = 20
BUDGET_NN = 10
SEED = 12345

## 0. Preprocessing <a name='data-preparator'></a>

### 0.1 Data loading

In [10]:
data = MovieLens("1m")
data.info()

ratings

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |

users

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |

items

| | item_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |


#### log preprocessing

- converting to spark dataframe
- renaming columns
- checking for nulls
- converting timestamp to Timestamp format
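
The steps in this list can be illustrated on plain Python dictionaries. This is a sketch under stated assumptions: `prepare_log` and `COLUMNS_MAPPING` are hypothetical names, the mapping here goes from source column to target column, and raising on nulls is just one possible way to "check" them. The Spark conversion is done by the cells that follow.

```python
from datetime import datetime, timezone

# Assumed mapping from source column names to the names used in the log.
COLUMNS_MAPPING = {
    "user_id": "user_id",
    "item_id": "item_id",
    "rating": "relevance",
    "timestamp": "timestamp",
}


def prepare_log(rows):
    """Rename columns, reject rows with nulls, convert Unix seconds to UTC datetimes."""
    prepared = []
    for row in rows:
        renamed = {COLUMNS_MAPPING[name]: value for name, value in row.items()}
        if any(value is None for value in renamed.values()):
            raise ValueError(f"null value in row: {row}")
        renamed["timestamp"] = datetime.fromtimestamp(renamed["timestamp"], tz=timezone.utc)
        prepared.append(renamed)
    return prepared


raw = [{"user_id": 1, "item_id": 1193, "rating": 5, "timestamp": 978300760}]
log_rows = prepare_log(raw)
print(log_rows[0]["timestamp"])
```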

In [11]:
preparator = DataPreparator()

In [12]:
%%time
log = preparator.transform(columns_mapping={'user_id': 'user_id',
                                      'item_id': 'item_id',
                                      'relevance': 'rating',
                                      'timestamp': 'timestamp'
                                     }, 
                           data=data.ratings)

07-Nov-24 13:51:09, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


CPU times: user 79.6 ms, sys: 22.6 ms, total: 102 ms
Wall time: 26.1 s


In [13]:
log.show(2)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
+-------+-------+---------+-------------------+
only showing top 2 rows



In [14]:
# will consider ratings >= 3 as positive feedback. A positive feedback is treated with relevance = 1
only_positives_log = log.filter(sf.col('relevance') >= 3).withColumn('relevance', sf.lit(1))
only_positives_log.count()

                                                                                

836478

In [15]:
# we will use only algorithms which do not require user and item features and thus set feature dataframes to None
user_features=None
item_features=None

<a id='indexing'></a>
### 0.2. Indexing

Convert given users' and items' identifiers (\_id) to integers starting at zero without gaps (\_idx) with Indexer class.

In [16]:
indexer = Indexer(user_col='user_id', item_col='item_id')

Take all available user and item ids from the log and features and pass them to Indexer. The _ids_ may repeat; indexes are ordered by label frequency, so the most frequent label gets index 0.
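
Frequency-ordered indexing can be sketched with a counter. The helper `fit_index` is hypothetical, and breaking ties by label order is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def fit_index(labels):
    """Map each distinct label to an integer; more frequent labels get smaller indices.

    Ties are broken by label order, which is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Hypothetical raw user ids with repeats.
user_ids = [10, 42, 42, 7, 42, 10]
user_index = fit_index(user_ids)
print(user_index)
```

The resulting indices are contiguous integers starting at zero, which is the property described above.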

In [17]:
%%time
indexer.fit(users=log.select('user_id'),
           items=log.select('item_id'))



CPU times: user 32.1 ms, sys: 14.1 ms, total: 46.2 ms
Wall time: 48 s


                                                                                

In [18]:
%%time
log_replay = indexer.transform(df=only_positives_log)
log_replay.show(2)

                                                                                

+--------+--------+---------+-------------------+
|user_idx|item_idx|relevance|          timestamp|
+--------+--------+---------+-------------------+
|    4131|      43|        1|2000-12-31 22:12:40|
|    4131|     585|        1|2000-12-31 22:35:09|
+--------+--------+---------+-------------------+
only showing top 2 rows

CPU times: user 53.4 ms, sys: 8.98 ms, total: 62.4 ms
Wall time: 1min 23s


### 0.3. Data split

In [20]:
# train/test split 
train_spl = TimeSplitter(
    time_threshold=0.2,
    drop_cold_items=True,
    drop_cold_users=True,
    query_column="user_idx",
    item_column="item_idx",
)
train, test = train_spl.split(log_replay)
print('train info:\n', get_log_info(train))
print('test info:\n', get_log_info(test))

train info:
 total lines: 669181, total users: 5397, total items: 3569
test info:
 total lines: 86542, total users: 1139, total items: 3279


In [21]:
train.is_cached

False

In [22]:
# train/test split for hyperparameters selection
opt_train, opt_val = train_spl.split(train)
opt_train.count(), opt_val.count()

(535343, 24241)

In [23]:
opt_train.is_cached

False

In [24]:
# negative feedback will be used for Wilson and UCB models
only_negatives_log = indexer.transform(df=log.filter(sf.col('relevance') < 3).withColumn('relevance', sf.lit(0.)))
test_start = test.agg(sf.min('timestamp')).collect()[0][0]

# train with both positive and negative feedback
pos_neg_train=(train
              .withColumn('relevance', sf.lit(1.))
              .union(only_negatives_log.filter(sf.col('timestamp') < test_start))
             )
pos_neg_train.cache()
pos_neg_train.count()

                                                                                

798993

In [25]:
pos_neg_train.is_cached

True

In [26]:
train.show(2)

+--------+--------+---------+-------------------+
|user_idx|item_idx|relevance|          timestamp|
+--------+--------+---------+-------------------+
|     677|    1314|        1|2000-12-02 05:30:12|
|     677|    1282|        1|2000-12-02 05:53:52|
+--------+--------+---------+-------------------+
only showing top 2 rows



# 1. Metrics definition
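
Coverage and Surprisal are less familiar than the ranking metrics, so here is a hedged sketch of one common way to define them. The exact formulas (coverage as the share of train items that get recommended, surprisal as normalized self-information of item popularity) and all names below are assumptions of the sketch and are not taken from the library's source.

```python
import math


def coverage(recs_by_user, train_items, k):
    """Share of train items that appear in at least one user's top-k list."""
    recommended = {item for recs in recs_by_user.values() for item in recs[:k]}
    return len(recommended & train_items) / len(train_items)


def surprisal(recs_by_user, item_user_counts, n_users, k):
    """Mean normalized self-information of recommended items.

    An item seen by c of n_users users weighs -log2(c / n_users) / log2(n_users),
    so an item with a single interaction weighs 1 and an item seen by everyone weighs 0.
    """
    def weight(item):
        return -math.log2(item_user_counts[item] / n_users) / math.log2(n_users)

    per_user = [sum(weight(item) for item in recs[:k]) / k
                for recs in recs_by_user.values()]
    return sum(per_user) / len(per_user)


# Toy data: 4 users in train, number of users who interacted with each item.
item_user_counts = {"a": 4, "b": 2, "c": 1, "d": 1}
recs_by_user = {"u1": ["a", "b"], "u2": ["a", "c"]}
cov = coverage(recs_by_user, set(item_user_counts), k=2)
sur = surprisal(recs_by_user, item_user_counts, n_users=4, k=2)
print(cov, sur)
```

Under these definitions, a model that recommends only popular items scores low on both metrics.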

In [27]:
# experiment is used for metrics calculation
e = Experiment(
    [MAP(K), 
      NDCG(K), 
      HitRate(K_list_metrics), 
      Coverage(K),
      Surprisal(K),
      MRR(K)],
    test,
    pos_neg_train,
    query_column="user_idx", item_column="item_idx", rating_column="relevance"
)

# 2. Model training

In [28]:
def fit_predict_add_res(name, model, experiment, train, suffix=''):
    """
    Run fit_predict for the `model`, measure time on fit_predict and evaluate metrics
    """
    start_time=time.time()
    
    logs = {'log': train}
    predict_params = {'k': K, 'users': test.select('user_idx').distinct()}
    
    if isinstance(model, ULinUCB):
        logs['log'] = pos_neg_train

    if isinstance(model, HybridRecommender):
        logs['item_features'] = item_features
        logs['user_features'] = user_features
    
    predict_params.update(logs)

    model.fit(**logs)
    fit_time = time.time() - start_time

    pred=model.predict(**predict_params)
    pred.cache()
    pred.count()
    predict_time = time.time() - start_time - fit_time

    experiment.add_result(name + suffix, pred)
    metric_time = time.time() - start_time - fit_time - predict_time
    experiment.results.loc[name + suffix, 'fit_time'] = fit_time
    experiment.results.loc[name + suffix, 'predict_time'] = predict_time
    experiment.results.loc[name + suffix, 'metric_time'] = metric_time
    experiment.results.loc[name + suffix, 'full_time'] = (fit_time + 
                                                          predict_time +
                                                          metric_time)
    pred.unpersist()
    print(experiment.results[['NDCG@{}'.format(K), 'MRR@{}'.format(K), 'Coverage@{}'.format(K), 'fit_time']].sort_values('NDCG@{}'.format(K), ascending=False))

In [29]:
def full_pipeline(models, experiment, train, suffix='', budget=BUDGET):
    """
    For each model:
        -  if required: run hyperparameters search, set best params and save param values to `experiment`
        - pass model to `fit_predict_add_res`        
    """
    
    for name, [model, params] in models.items():
        model.logger.info(msg='{} started'.format(name))
        if params != 'no_opt':
            model.logger.info(msg='{} optimization started'.format(name))
            best_params = model.optimize(opt_train, 
                                         opt_val, 
                                         param_borders=params, 
                                         item_features=item_features,
                                         user_features=user_features,
                                         k=K, 
                                         budget=budget)
            logger.info(msg='best params for {} are: {}'.format(name, best_params))
            model.set_params(**best_params)
        
        logger.info(msg='{} fit_predict started'.format(name))
        fit_predict_add_res(name, model, experiment, train, suffix)
        # here we call protected attribute to get all parameters set during model initialization
        experiment.results.loc[name + suffix, 'params'] = str(model._init_args)

## 2.1 Hierarchical contextual bandit

### 2.1.1. Feature preprocessing

In [31]:
item_features_original = preparator.transform(columns_mapping={'item_id': 'item_id'}, 
                           data=data.items)
item_features = indexer.transform(df=item_features_original)
item_features.show(2)

07-Nov-24 14:11:22, replay, INFO: Column with ids of users or items is absent in mapping. The dataframe will be treated as a users'/items' features dataframe.


+--------+----------------+--------------------+
|item_idx|           title|              genres|
+--------+----------------+--------------------+
|      29|Toy Story (1995)|Animation|Childre...|
|     393|  Jumanji (1995)|Adventure|Childre...|
+--------+----------------+--------------------+
only showing top 2 rows



In [32]:
year = item_features.withColumn('year', sf.substring(sf.col('title'), -5, 4).astype(st.IntegerType())).select('item_idx', 'year')
year.show(2)

+--------+----+
|item_idx|year|
+--------+----+
|      29|1995|
|     393|1995|
+--------+----+
only showing top 2 rows



In [33]:
genres = (
    item_features.select(
        "item_idx",
        sf.split("genres", r"\|").alias("genres")  # raw string: the pattern is a regex, so the pipe is escaped
    )
)

In [34]:
genres_list = (
    genres.select(sf.explode("genres").alias("genre"))
    .distinct().filter('genre <> "(no genres listed)"')
    .toPandas()["genre"].tolist()
)

In [35]:
genres_list

["Children's",
 'Musical',
 'Action',
 'Crime',
 'Fantasy',
 'Adventure',
 'Romance',
 'War',
 'Sci-Fi',
 'Drama',
 'Comedy',
 'Horror',
 'Documentary',
 'Animation',
 'Mystery',
 'Thriller',
 'Western',
 'Film-Noir']

In [36]:
item_features = genres
for genre in genres_list:
    item_features = item_features.withColumn(
        genre,
        sf.array_contains(sf.col("genres"), genre).astype(IntegerType())
    )
item_features = item_features.drop("genres").cache()
item_features.count()

3883

In [37]:
item_features = item_features.join(year, on='item_idx', how='inner')
item_features.cache()
item_features.count()

3883

In [38]:
item_features.show(2)

+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+----+
|item_idx|Children's|Musical|Action|Crime|Fantasy|Adventure|Romance|War|Sci-Fi|Drama|Comedy|Horror|Documentary|Animation|Mystery|Thriller|Western|Film-Noir|year|
+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+----+
|      29|         1|      0|     0|    0|      0|        0|      0|  0|     0|    0|     1|     0|          0|        1|      0|       0|      0|        0|1995|
|     393|         1|      0|     0|    0|      1|        1|      0|  0|     0|    0|     0|     0|          0|        0|      0|       0|      0|        0|1995|
+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+----+
only showing top 2 rows



In [39]:
# HierarchicalRecommender requires both user and item features (a dataframe with only the index column is enough)
# user features are not used here, so we keep only the user_idx column
user_features_original = preparator.transform(columns_mapping={'user_id': 'user_id'}, 
                           data=data.users)
user_features = indexer.transform(df=user_features_original)
empty_user_features = user_features.select("user_idx")
user_features = empty_user_features
user_features.show(2)

07-Nov-24 14:11:46, replay, INFO: Column with ids of users or items is absent in mapping. The dataframe will be treated as a users'/items' features dataframe.


+--------+
|user_idx|
+--------+
|    4131|
|    2364|
+--------+
only showing top 2 rows



In [40]:
# the imbalance between the numerical year feature and the one-hot genre features makes the regression in ULinUCB fit much worse, so we drop year
item_features = item_features.drop("year")
item_features.show(2)

+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+
|item_idx|Children's|Musical|Action|Crime|Fantasy|Adventure|Romance|War|Sci-Fi|Drama|Comedy|Horror|Documentary|Animation|Mystery|Thriller|Western|Film-Noir|
+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+
|      29|         1|      0|     0|    0|      0|        0|      0|  0|     0|    0|     1|     0|          0|        1|      0|       0|      0|        0|
|     393|         1|      0|     0|    0|      1|        1|      0|  0|     0|    0|     0|     0|          0|        0|      0|       0|      0|        0|
+--------+----------+-------+------+-----+-------+---------+-------+---+------+-----+------+------+-----------+---------+-------+--------+-------+---------+
only showing top 2 rows



### 2.1.2. Experiment

In [43]:
hcbs = {
    'HCB (depth=1)': [
        HierarchicalRecommender(
            depth=1, cluster_model=KMeans(n_clusters=5), recommender_class=ULinUCB,
            recommender_params={"alpha" : -2.0}
        ), 'no_opt'
    ],
    'HCB (depth=2)': [
        HierarchicalRecommender(
            depth=2, cluster_model=KMeans(n_clusters=5), recommender_class=ULinUCB,
            recommender_params={"alpha" : -2.0}
        ), 'no_opt'
    ],
    'HCB (depth=3)': [
        HierarchicalRecommender(
            depth=3, cluster_model=KMeans(n_clusters=5), recommender_class=ULinUCB,
            recommender_params={"alpha" : -2.0}
        ), 'no_opt'
    ]
}

In [45]:
%%time
e = Experiment(
    [MAP(K), 
      NDCG(K), 
      HitRate(K_list_metrics), 
      Coverage(K),
      Surprisal(K),
      MRR(K)],
    test,
    pos_neg_train,
    query_column="user_idx", item_column="item_idx", rating_column="relevance"
)
full_pipeline(hcbs, e, train, budget=BUDGET)

07-Nov-24 15:06:37, replay, INFO: HCB (depth=1) started
07-Nov-24 15:06:37, replay, INFO: HCB (depth=1) fit_predict started
07-Nov-24 15:31:58, replay, INFO: HCB (depth=2) started                         
07-Nov-24 15:31:58, replay, INFO: HCB (depth=2) fit_predict started


                NDCG@10    MRR@10  Coverage@10    fit_time
HCB (depth=1)  0.036343  0.076012       0.0071  205.793682


07-Nov-24 15:58:28, replay, INFO: HCB (depth=3) started                         
07-Nov-24 15:58:28, replay, INFO: HCB (depth=3) fit_predict started


                NDCG@10    MRR@10  Coverage@10    fit_time
HCB (depth=2)  0.152523  0.240873     0.032496  256.318547
HCB (depth=1)  0.036343  0.076012     0.007100  205.793682



                NDCG@10    MRR@10  Coverage@10    fit_time
HCB (depth=2)  0.152523  0.240873     0.032496  256.318547
HCB (depth=3)  0.152274  0.240668     0.034407  372.256509
HCB (depth=1)  0.036343  0.076012     0.007100  205.793682
CPU times: user 21min 42s, sys: 20min 58s, total: 42min 41s
Wall time: 1h 20min 4s


# 3. Results

In [46]:
e.results.sort_values('NDCG@10', ascending=False)

| | MAP@10 | NDCG@10 | HitRate@1 | HitRate@5 | HitRate@10 | Coverage@10 | Surprisal@10 | MRR@10 | fit_time | predict_time | metric_time | full_time | params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HCB (depth=2) | 0.08132 | 0.152523 | 0.118525 | 0.397717 | 0.568042 | 0.032496 | 0.182862 | 0.240873 | 256.318547 | 876.468735 | 457.694904 | 1590.482186 | {'depth': 2, 'cluster_model': KMeans(n_cluster... |
| HCB (depth=3) | 0.081155 | 0.152274 | 0.119403 | 0.396839 | 0.562774 | 0.034407 | 0.183635 | 0.240668 | 372.256509 | 879.736982 | 440.705198 | 1692.698689 | {'depth': 3, 'cluster_model': KMeans(n_cluster... |
| HCB (depth=1) | 0.01452 | 0.036343 | 0.035996 | 0.105356 | 0.210711 | 0.0071 | 0.360603 | 0.076012 | 205.793682 | 861.154209 | 453.383323 | 1520.331214 | {'depth': 1, 'cluster_model': KMeans(n_cluster... |
