# RePlay Tutorial
This notebook introduces the RePlay library, including:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- models comparison

In [29]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
%config Completer.use_jedi = False

The `State` object lets you pass an existing Spark session or create a new one. All RePlay modules then use that session.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` values, use the `get_spark_session` function from the `session_handler` module.

In [31]:
from replay.session_handler import State

spark = State().session
spark

In [32]:
K = 5
SEED = 1234

## 0. Data preprocessing <a name='data-preparator'></a>
We will use MovieLens 1m as an example.

In [33]:
import pandas as pd
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### 0.1. DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The interaction matrix requires the ``user_id`` and ``item_id`` columns.
The ``relevance`` and ``timestamp`` columns are optional.

The `DataPreparator` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.

In [34]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [35]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
|      1|    914|      3.0|2000-12-31 22:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows
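To make the preparation steps described above more concrete, here is a minimal pure-Python sketch of renaming columns, checking for nulls, and parsing timestamps. It is an illustration only, not RePlay's implementation: the `prepare_rows` helper, the source column names (`uid`, `movie`, `ts`), the default relevance of `1.0`, and the assumption that timestamps are Unix seconds interpreted as UTC are all hypothetical.

```python
from datetime import datetime, timezone


def prepare_rows(rows, columns_names):
    """Rename source columns to canonical names, reject nulls, parse timestamps.

    columns_names maps canonical name -> source column name,
    mirroring the shape of the mapping used in the cell above.
    """
    prepared = []
    for row in rows:
        out = {}
        for canonical, source in columns_names.items():
            value = row.get(source)
            if value is None:
                raise ValueError(f"null value in column {source!r}")
            out[canonical] = value
        if "timestamp" in out:
            # Assumption: raw timestamps are Unix seconds, shown here in UTC.
            out["timestamp"] = datetime.fromtimestamp(int(out["timestamp"]), tz=timezone.utc)
        # Assumption of this sketch: a missing relevance column defaults to 1.0.
        out.setdefault("relevance", 1.0)
        prepared.append(out)
    return prepared


# Hypothetical raw rows with non-canonical column names.
raw = [{"uid": 1, "movie": 1193, "ts": 978300760}]
mapping = {"user_id": "uid", "item_id": "movie", "timestamp": "ts"}
prepared = prepare_rows(raw, mapping)
formatted = prepared[0]["timestamp"].strftime("%Y-%m-%d %H:%M:%S")
```

A row with a `None` in any mapped column raises `ValueError`, which is one simple way to surface the null check early.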



In [36]:
from replay.utils import convert2spark
users = convert2spark(users)

### 0.2. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` moves ``item_test_size`` items per user into the test dataset, for ``user_test_size`` sampled users.

In [37]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2500)
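The counts above are consistent with 500 users contributing 5 items each to the test set (500 × 5 = 2500). The following is a minimal pure-Python sketch of such a per-user holdout, assuming random user sampling, random item selection, and removal of test rows whose user or item is missing from train. The `user_holdout_split` function and the toy log are hypothetical and are not RePlay's `UserSplitter` code.

```python
import random
from collections import defaultdict


def user_holdout_split(log, item_test_size, user_test_size, seed):
    """Split a list of (user_id, item_id) pairs into train and test lists."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in log:
        by_user[row[0]].append(row)

    # Sample the users whose interactions are partly moved to the test set.
    test_users = set(rng.sample(sorted(by_user), user_test_size))

    train, test = [], []
    for user, rows in by_user.items():
        if user in test_users:
            shuffled = rows[:]
            rng.shuffle(shuffled)
            test.extend(shuffled[:item_test_size])
            train.extend(shuffled[item_test_size:])
        else:
            train.extend(rows)

    # Drop "cold" test rows: users or items that never occur in train.
    train_users = {user for user, _ in train}
    train_items = {item for _, item in train}
    test = [(u, i) for u, i in test if u in train_users and i in train_items]
    return train, test


# Toy log: 10 users, each interacting with the same 8 items.
toy_log = [(user, item) for user in range(10) for item in range(8)]
toy_train, toy_test = user_holdout_split(toy_log, item_test_size=2, user_test_size=3, seed=1234)
```

With this toy log, 3 users each give up 2 interactions, so the test set holds 6 rows and the train set the remaining 74.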

## 1. Models training

#### SLIM

In [48]:
from replay.models import SLIM

slim = SLIM(lambda_=0.01, beta=0.01, seed=SEED)

In [49]:
%%time

slim.fit(log=train)

24-May-21 08:56:43, replay, DEBUG: Starting SLIM training
DEBUG:replay:Starting SLIM training
24-May-21 08:56:43, replay, DEBUG: Preliminary training stage (pre-fit)
DEBUG:replay:Preliminary training stage (pre-fit)
24-May-21 08:56:44, replay, DEBUG: Main training stage (fit)
DEBUG:replay:Main training stage (fit)


CPU times: user 2.23 s, sys: 148 ms, total: 2.37 s
Wall time: 4.5 s


In [50]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

24-May-21 08:56:48, replay, DEBUG: Starting SLIM prediction
DEBUG:replay:Starting SLIM prediction


CPU times: user 728 ms, sys: 180 ms, total: 908 ms
Wall time: 3.42 s
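The `k` and `filter_seen_items` arguments of the `predict` call above can be pictured with a small pure-Python sketch. It assumes that filtering removes items the user already interacted with in the log and that the `k` highest-scoring remaining items are returned. The `top_k_unseen` helper and the scores are hypothetical, not RePlay internals.

```python
import heapq


def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring items that are not in the user's seen set."""
    candidates = ((score, item) for item, score in scores.items() if item not in seen)
    return [item for _, item in heapq.nlargest(k, candidates)]


# Hypothetical model scores for one user.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
seen = {"a"}  # items already present in the user's training log
top_items = top_k_unseen(scores, seen, k=2)
```

Item `a` has the highest score but is skipped because the user has already interacted with it.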


## 2. Models evaluation

RePlay implements several popular quality metrics for recommender systems. You can compute metrics individually, or use the ``Experiment`` class to calculate a chosen set of metrics and compare models.

In [51]:
from replay.metrics import HitRate, NDCG, MAP
from replay.experiment import Experiment

metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K]})
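For reference, the three metrics requested above can be sketched for a single user in pure Python. These are common textbook definitions with binary relevance. Normalization conventions vary (for example, whether average precision divides by `k` or by the number of relevant items), so RePlay's implementation may differ. The function names and the example data are hypothetical.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """Return 1.0 if at least one of the top-k items is relevant, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: DCG of the list divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision_at_k(recommended, relevant, k):
    """AP@k normalised by min(k, number of relevant items) (assumed convention)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denominator = min(k, len(relevant))
    return score / denominator if denominator > 0 else 0.0


# Hypothetical single-user example: items 20 and 50 are hits at ranks 2 and 5.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 99}

hit_1 = hit_rate_at_k(recommended, relevant, k=1)
hit_5 = hit_rate_at_k(recommended, relevant, k=5)
ap_5 = average_precision_at_k(recommended, relevant, k=5)
ndcg_5 = ndcg_at_k(recommended, relevant, k=5)
```

A model-level score such as MAP@5 is then typically the mean of the per-user values over all test users.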


In [52]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 64.8 ms, sys: 24.4 ms, total: 89.2 ms
Wall time: 11.4 s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.346 | 0.053353 | 0.096688 |


## 3. Hyperparameters optimization

In [44]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [None]:
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=10)

In [17]:
best_params

{'beta': 1.8656829517898104e-07, 'lambda_': 0.11039466478947813}
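The idea of a budgeted search, where `budget` is the number of candidate parameter sets to try, can be sketched with a simple random search over a log-uniform space. This is only an illustration: the sampler, the bounds, and the toy objective below are assumptions and do not reproduce the search strategy that `optimize` uses.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, space, budget, seed):
    """Evaluate `budget` random parameter sets and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(budget):
        params = {name: log_uniform(rng, low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for 'fit on train_opt, score NDCG on val_opt'."""
    return -(
        (math.log10(params["beta"]) + 6) ** 2
        + (math.log10(params["lambda_"]) + 1) ** 2
    )


# Hypothetical search bounds for the two SLIM regularization parameters.
space = {"beta": (1e-8, 1.0), "lambda_": (1e-8, 1.0)}
found_params, found_score = random_search(toy_objective, space, budget=10, seed=1234)
```

In a real run the objective would train the model on the optimization train split and return the chosen criterion on the validation split, which is why each trial is expensive and the budget matters.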

In [53]:
slim = SLIM(**best_params, seed=SEED)

slim.fit(log=train)

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

metrics.add_result("SLIM_optimized", recs)
metrics.results

24-May-21 08:57:04, replay, DEBUG: Starting SLIM training
DEBUG:replay:Starting SLIM training
24-May-21 08:57:04, replay, DEBUG: Preliminary training stage (pre-fit)
DEBUG:replay:Preliminary training stage (pre-fit)
24-May-21 08:57:05, replay, DEBUG: Main training stage (fit)
DEBUG:replay:Main training stage (fit)
24-May-21 08:57:09, replay, DEBUG: Starting SLIM prediction
DEBUG:replay:Starting SLIM prediction


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.346 | 0.053353 | 0.096688 |
| SLIM_optimized | 0.18 | 0.47 | 0.072093 | 0.130902 |


### Convert to pandas

In [54]:
recs_pd = recs.toPandas()
recs_pd.head(3)

| | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 1344 | 2918 | 0.747178 |
| 1 | 1344 | 595 | 0.695038 |
| 2 | 1344 | 1196 | 0.665159 |


## 4. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [55]:
from replay.models import ALSWrap

als = ALSWrap(rank=100, seed=SEED)

In [56]:
%%time
als.fit(log=train)

24-May-21 08:57:23, replay, DEBUG: Starting ALSWrap training
DEBUG:replay:Starting ALSWrap training
24-May-21 08:57:23, replay, DEBUG: Preliminary training stage (pre-fit)
DEBUG:replay:Preliminary training stage (pre-fit)
24-May-21 08:57:24, replay, DEBUG: Main training stage (fit)
DEBUG:replay:Main training stage (fit)


CPU times: user 295 ms, sys: 9.81 ms, total: 305 ms
Wall time: 18.3 s


In [57]:
%%time
recs = als.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

24-May-21 08:57:42, replay, DEBUG: Starting ALSWrap prediction
DEBUG:replay:Starting ALSWrap prediction


CPU times: user 723 ms, sys: 153 ms, total: 876 ms
Wall time: 3.11 s


In [58]:
%%time
metrics.add_result("ALS", recs)
metrics.results

CPU times: user 41.7 ms, sys: 30.9 ms, total: 72.7 ms
Wall time: 6.12 s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.346 | 0.053353 | 0.096688 |
| SLIM_optimized | 0.18 | 0.47 | 0.072093 | 0.130902 |
| ALS | 0.236 | 0.542 | 0.092333 | 0.163345 |


#### MultVAE 
A variational autoencoder for recommendation tasks.

In [59]:
from replay.models import MultVAE

multvae = MultVAE(epochs=100)

In [60]:
%%time
multvae.fit(log=train)

24-May-21 08:57:51, replay, DEBUG: Starting MultVAE training
DEBUG:replay:Starting MultVAE training
24-May-21 08:57:51, replay, DEBUG: Preliminary training stage (pre-fit)
DEBUG:replay:Preliminary training stage (pre-fit)
24-May-21 08:57:51, replay, DEBUG: Main training stage (fit)
DEBUG:replay:Main training stage (fit)
24-May-21 08:57:53, replay, DEBUG: Building batches:
DEBUG:replay:Building batches:
24-May-21 08:57:54, replay, DEBUG: Training the model
DEBUG:replay:Training the model
24-May-21 08:57:54, replay, DEBUG: Epoch[1] current loss: 1351.10847
DEBUG:replay:Epoch[1] current loss: 1351.10847
24-May-21 08:57:54, replay, DEBUG: Epoch[1] validation average loss: 1475.69263
DEBUG:replay:Epoch[1] validation average loss: 1475.69263
24-May-21 08:57:55, replay, DEBUG: Epoch[2] current loss: 1245.90040
DEBUG:replay:Epoch[2] current loss: 1245.90040
24-May-21 08:57:55, replay, DEBUG: Epoch[2] validation average loss: 1475.77625
DEBUG:replay:Epoch[2] validation average loss:

24-May-21 08:58:12, replay, DEBUG: Epoch[31] current loss: 1180.80163
DEBUG:replay:Epoch[31] current loss: 1180.80163
24-May-21 08:58:12, replay, DEBUG: Epoch[31] validation average loss: 1322.21924
DEBUG:replay:Epoch[31] validation average loss: 1322.21924
24-May-21 08:58:12, replay, DEBUG: Epoch[32] current loss: 1174.51145
DEBUG:replay:Epoch[32] current loss: 1174.51145
24-May-21 08:58:12, replay, DEBUG: Epoch[32] validation average loss: 1320.87622
DEBUG:replay:Epoch[32] validation average loss: 1320.87622
24-May-21 08:58:13, replay, DEBUG: Epoch[33] current loss: 1177.81200
DEBUG:replay:Epoch[33] current loss: 1177.81200
24-May-21 08:58:13, replay, DEBUG: Epoch[33] validation average loss: 1317.95276
DEBUG:replay:Epoch[33] validation average loss: 1317.95276
24-May-21 08:58:13, replay, DEBUG: Epoch[34] current loss: 1174.42239
DEBUG:replay:Epoch[34] current loss: 1174.42239
24-May-21 08:58:14, replay, DEBUG: Epoch[34] validation average loss: 1316.22388
DEBUG:replay:Epoch[34] vali

DEBUG:replay:Epoch[62] validation average loss: 1280.51172
24-May-21 08:58:30, replay, DEBUG: Epoch[63] current loss: 1150.26690
DEBUG:replay:Epoch[63] current loss: 1150.26690
24-May-21 08:58:30, replay, DEBUG: Epoch[63] validation average loss: 1281.41919
DEBUG:replay:Epoch[63] validation average loss: 1281.41919
24-May-21 08:58:31, replay, DEBUG: Epoch[64] current loss: 1149.68372
DEBUG:replay:Epoch[64] current loss: 1149.68372
24-May-21 08:58:31, replay, DEBUG: Epoch[64] validation average loss: 1279.96204
DEBUG:replay:Epoch[64] validation average loss: 1279.96204
24-May-21 08:58:31, replay, DEBUG: Epoch[65] current loss: 1147.41157
DEBUG:replay:Epoch[65] current loss: 1147.41157
24-May-21 08:58:32, replay, DEBUG: Epoch[65] validation average loss: 1278.59192
DEBUG:replay:Epoch[65] validation average loss: 1278.59192
24-May-21 08:58:32, replay, DEBUG: Epoch[66] current loss: 1147.76613
DEBUG:replay:Epoch[66] current loss: 1147.76613
24-May-21 08:58:32, replay, DEBUG: Epoch[66] vali

DEBUG:replay:Epoch[94] validation average loss: 1268.73120
24-May-21 08:58:49, replay, DEBUG: Epoch[95] current loss: 1128.16562
DEBUG:replay:Epoch[95] current loss: 1128.16562
24-May-21 08:58:49, replay, DEBUG: Epoch[95] validation average loss: 1268.49170
DEBUG:replay:Epoch[95] validation average loss: 1268.49170
24-May-21 08:58:49, replay, DEBUG: Epoch[96] current loss: 1138.97953
DEBUG:replay:Epoch[96] current loss: 1138.97953
24-May-21 08:58:49, replay, DEBUG: Epoch[96] validation average loss: 1268.69019
DEBUG:replay:Epoch[96] validation average loss: 1268.69019
24-May-21 08:58:50, replay, DEBUG: Epoch[97] current loss: 1136.25470
DEBUG:replay:Epoch[97] current loss: 1136.25470
24-May-21 08:58:50, replay, DEBUG: Epoch[97] validation average loss: 1268.05981
DEBUG:replay:Epoch[97] validation average loss: 1268.05981
24-May-21 08:58:50, replay, DEBUG: Epoch[98] current loss: 1133.11271
DEBUG:replay:Epoch[98] current loss: 1133.11271
24-May-21 08:58:50, replay, DEBUG: Epoch[98] vali

CPU times: user 5min 2s, sys: 3min 42s, total: 8min 45s
Wall time: 1min


In [61]:
%%time

recs = multvae.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

24-May-21 08:58:52, replay, DEBUG: Starting MultVAE prediction
DEBUG:replay:Starting MultVAE prediction
24-May-21 08:58:54, replay, DEBUG: Model prediction
DEBUG:replay:Model prediction


CPU times: user 795 ms, sys: 161 ms, total: 955 ms
Wall time: 12.8 s


In [62]:
%%time
metrics.add_result("MultVAE", recs)
metrics.results

CPU times: user 71.6 ms, sys: 0 ns, total: 71.6 ms
Wall time: 14.8 s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.346 | 0.053353 | 0.096688 |
| SLIM_optimized | 0.18 | 0.47 | 0.072093 | 0.130902 |
| ALS | 0.236 | 0.542 | 0.092333 | 0.163345 |
| MultVAE | 0.152 | 0.452 | 0.06524 | 0.121208 |


## 5. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them into a dataframe and pass them to ``Experiment``.

#### 5.1. Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`

In [65]:
from pyspark.sql.functions import rand

In [66]:
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)

#### 5.2. Read with DataPreparator

In [67]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    header=True,
    format_type="csv"
)
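The file exchanged in steps 5.1 and 5.2 is a plain CSV with a header row and three columns. As a library-independent illustration of that layout, the sketch below writes and reads such a file with the standard `csv` module. The `write_recs` and `read_recs` helpers are hypothetical, and an in-memory buffer stands in for `recs.csv`.

```python
import csv
import io

FIELDS = ["user_id", "item_id", "relevance"]


def write_recs(recs, handle):
    """Write recommendation rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(recs)


def read_recs(handle):
    """Read recommendation rows back, restoring numeric types."""
    return [
        {
            "user_id": int(row["user_id"]),
            "item_id": int(row["item_id"]),
            "relevance": float(row["relevance"]),
        }
        for row in csv.DictReader(handle)
    ]


# Rows shaped like the recommendations shown earlier in this notebook.
example_recs = [
    {"user_id": 1344, "item_id": 2918, "relevance": 0.747178},
    {"user_id": 1344, "item_id": 595, "relevance": 0.695038},
]

buffer = io.StringIO()  # stands in for a file on disk
write_recs(example_recs, buffer)
buffer.seek(0)
restored_recs = read_recs(buffer)
```

Any external model whose output can be written in this three-column form can be compared with the RePlay models in the same results table.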

#### 5.3. Compare with Experiment

In [68]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)

| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| ALS | 0.236 | 0.542 | 0.092333 | 0.163345 |
| SLIM_optimized | 0.18 | 0.47 | 0.072093 | 0.130902 |
| MultVAE | 0.152 | 0.452 | 0.06524 | 0.121208 |
| my_model | 0.124 | 0.452 | 0.059987 | 0.11485 |
| SLIM | 0.128 | 0.346 | 0.053353 | 0.096688 |
