# RePlay Tutorial
This notebook introduces the RePlay library, covering:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- model comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

The `State` object stores the Spark session used by all RePlay modules. You can pass it an existing session or let it create a new one.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` values, use the `get_spark_session` function from the `session_handler` module.

In [3]:
from replay.session_handler import State

spark = State().session
spark

In [4]:
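K = 5  # number of recommendations per user, also used as the number of test items per user
SEED = 1234  # random seed for reproducible splits and models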

## 0. Data preprocessing <a name='data-preparator'></a>
We will use the MovieLens 1M dataset as an example.

In [5]:
import pandas as pd
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### 0.1. DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The interaction matrix requires the ``user_id`` and ``item_id`` columns.
The ``relevance`` and ``timestamp`` columns are optional.

The `DataPreparator` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.
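
The preprocessing steps listed above (renaming, null check, date parsing) can be pictured with a small plain-Python sketch. This is not RePlay's implementation: `prepare_rows` is a hypothetical helper, and the default relevance of 1.0 and the treatment of numeric timestamps as Unix seconds in UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone

REQUIRED = ("user_id", "item_id")


def prepare_rows(rows, columns_names):
    """Rename source columns to the standard names and run basic checks.

    `columns_names` maps a standard name to the source column name,
    mirroring the dictionary passed to DataPreparator in this notebook.
    """
    missing = [name for name in REQUIRED if name not in columns_names]
    if missing:
        raise ValueError(f"required columns are not mapped: {missing}")

    prepared = []
    for row in rows:
        new_row = {}
        for target, source in columns_names.items():
            value = row[source]
            if value is None:
                raise ValueError(f"null value in column {source!r}")
            new_row[target] = value
        # Assumption: a missing relevance column defaults to 1.0.
        new_row.setdefault("relevance", 1.0)
        # Assumption: numeric timestamps are Unix seconds, parsed as UTC.
        if isinstance(new_row.get("timestamp"), (int, float)):
            new_row["timestamp"] = datetime.fromtimestamp(
                new_row["timestamp"], tz=timezone.utc
            )
        prepared.append(new_row)
    return prepared


# Hypothetical raw rows with non-standard column names.
raw = [
    {"uid": 1, "movie": 1193, "rating": 5.0, "ts": 978300760},
    {"uid": 1, "movie": 661, "rating": 3.0, "ts": 978302109},
]
prepared = prepare_rows(
    raw,
    {"user_id": "uid", "item_id": "movie", "relevance": "rating", "timestamp": "ts"},
)
print(prepared[0])
```

The mapping direction (standard name to source name) follows the `columns_names` argument used in the next cell.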

In [6]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [7]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2001-01-01 01:12:40|
|      1|    661|      3.0|2001-01-01 01:35:09|
|      1|    914|      3.0|2001-01-01 01:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows



In [8]:
from replay.utils import convert2spark
users = convert2spark(users)

### 0.2. Split

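RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` moves ``item_test_size`` interactions of each test user to the test dataset, and ``user_test_size`` sets how many users are sampled for the test. Below, 500 users with K = 5 held-out items each give 2500 test rows.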
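
To make the splitting scheme concrete, here is a minimal plain-Python sketch of a per-user holdout split. It is an illustration under assumptions, not `UserSplitter`'s actual code: `user_split` is a hypothetical function, and the cold-user and cold-item filtering is a simplified reading of the `drop_cold_users` and `drop_cold_items` flags.

```python
import random
from collections import defaultdict


def user_split(log, item_test_size, user_test_size, seed):
    """Hold out `item_test_size` interactions for `user_test_size` random users.

    `log` is a list of (user_id, item_id) pairs. Returns (train, test).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in log:
        by_user[user].append(item)

    # Sample the users whose interactions contribute to the test set.
    test_users = set(rng.sample(sorted(by_user), user_test_size))

    train, test = [], []
    for user, items in by_user.items():
        if user not in test_users:
            train.extend((user, item) for item in items)
            continue
        shuffled = items[:]
        rng.shuffle(shuffled)  # random held-out items, as with shuffle=True
        test.extend((user, item) for item in shuffled[:item_test_size])
        train.extend((user, item) for item in shuffled[item_test_size:])

    # Drop test rows whose user or item never occurs in train (cold users/items).
    train_users = {user for user, _ in train}
    train_items = {item for _, item in train}
    test = [(u, i) for u, i in test if u in train_users and i in train_items]
    return train, test


# Toy log: 10 users, each interacting with the same 8 items.
log = [(user, item) for user in range(10) for item in range(8)]
train, test = user_split(log, item_test_size=2, user_test_size=3, seed=1234)
print(len(train), len(test))  # 74 6
```

With 3 test users and 2 held-out items each, the test set has 6 rows, in the same way that 500 users and 5 items yield 2500 rows in the notebook.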

In [9]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2500)

## 1. Models training

#### SLIM

In [10]:
from replay.models import SLIM

slim = SLIM(lambda_=0.01, beta=0.01, seed=SEED)



In [11]:
%%time

slim.fit(log=train)

14-Sep-21 15:37:40, replay, DEBUG: Starting fit SLIM
14-Sep-21 15:37:40, replay, DEBUG: Creating indexers
14-Sep-21 15:37:44, replay, DEBUG: Main fit stage


CPU times: user 3.6 s, sys: 457 ms, total: 4.06 s
Wall time: 15.2 s


In [12]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

14-Sep-21 15:37:55, replay, DEBUG: Starting predict SLIM


CPU times: user 1.67 s, sys: 431 ms, total: 2.1 s
Wall time: 7.6 s
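
The `filter_seen_items=True` option in the call above can be read as "exclude items the user already interacted with in `log`, then keep the top `k`". A minimal sketch of that idea for a single user, assuming a hypothetical dictionary of model scores (this is not how RePlay computes it on Spark):

```python
import heapq


def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring (item, score) pairs whose item is not in `seen`."""
    candidates = ((item, score) for item, score in scores.items() if item not in seen)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}  # hypothetical scores for one user
seen = {"a"}  # items the user already interacted with in the training log
print(top_k_unseen(scores, seen, k=2))  # [('b', 0.8), ('c', 0.7)]
```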


## 2. Models evaluation

RePlay implements several popular quality metrics for recommender systems. Use the metrics on their own, or compute a chosen set of metrics and compare models with the ``Experiment`` class.
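
As a rough guide to what these metrics measure, the sketch below computes HitRate@k, NDCG@k and average precision@k for one user with binary relevance. It shows one common formulation; RePlay's implementations may normalize differently (for example in MAP), so treat the function names and formulas here as assumptions rather than the library's definitions. The reported values would then be averages over test users.

```python
import math


def hit_rate_at_k(recs, relevant, k):
    """1.0 if at least one of the top-k recommendations is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recs[:k]) else 0.0


def ndcg_at_k(recs, relevant, k):
    """Binary-relevance NDCG: discounted gain of hits divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recs[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision_at_k(recs, relevant, k):
    """Sum of precision values at each hit, normalized by min(k, number of relevant items)."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recs[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0


recs = [10, 20, 30]   # ranked recommendations for one user (no duplicates assumed)
relevant = {20, 40}   # ground-truth items of that user in the test set
print(hit_rate_at_k(recs, relevant, 3))
print(round(ndcg_at_k(recs, relevant, 3), 4))
print(average_precision_at_k(recs, relevant, 3))
```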

In [13]:
from replay.metrics import HitRate, NDCG, MAP
from replay.experiment import Experiment

metrics = Experiment(test, {NDCG(): K,
                            MAP(): K,
                            HitRate(): [1, K]})


In [14]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 119 ms, sys: 31 ms, total: 150 ms
Wall time: 1min 37s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.124 | 0.388 | 0.055293 | 0.103301 |


## 3. Hyperparameters optimization

In [15]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [16]:
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=10)

[I 2021-09-14 17:21:58,972] A new study created in memory with name: no-name-2abc5a32-1e33-48e9-88c7-964e9afdb108
14-Sep-21 17:21:58, replay, DEBUG: Fitting model inside optimization
14-Sep-21 17:21:59, replay, DEBUG: Starting fit SLIM
14-Sep-21 17:21:59, replay, DEBUG: Main fit stage
14-Sep-21 17:22:10, replay, DEBUG: Predicting inside optimization
14-Sep-21 17:22:10, replay, DEBUG: Starting predict SLIM
14-Sep-21 17:22:20, replay, DEBUG: Calculating criterion
14-Sep-21 17:26:24, replay, DEBUG: NDCG=0.025548
[I 2021-09-14 17:26:24,619] Trial 0 finished with value: 0.025547901548758217 and parameters: {'beta': 0.0025240804025565, 'lambda_': 1.3775675803000321e-08}. Best is trial 0 with value: 0.025547901548758217.
14-

In [17]:
best_params

{'beta': 1.1734055285692488e-09, 'lambda_': 0.06648765558145842}
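
The trial values in the log span many orders of magnitude, which is typical of sampling regularization parameters on a logarithmic scale. The sketch below illustrates that idea with a plain random search over a fixed `budget` of trials. It is a simplification under assumptions: the log above comes from an Optuna study, whose default sampler is more sophisticated than random search, and the bounds and `toy_objective` here are hypothetical stand-ins for "fit, predict, compute NDCG@K".

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, bounds, budget, seed):
    """Evaluate `budget` random parameter sets and return the best one found."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {
            name: sample_log_uniform(rng, low, high)
            for name, (low, high) in bounds.items()
        }
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Hypothetical search bounds; the ranges RePlay actually uses are not shown here.
bounds = {"beta": (1e-9, 1.0), "lambda_": (1e-9, 1.0)}


def toy_objective(params):
    # Purely illustrative: highest when lambda_ is close to 0.05.
    return -abs(math.log10(params["lambda_"]) - math.log10(0.05))


best, best_value = random_search(toy_objective, bounds, budget=10, seed=1234)
print(best, best_value)
```

A larger `budget` can only match or improve the best value found with the same seed, at the cost of one full fit and evaluation per trial.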

In [18]:
slim = SLIM(**best_params, seed=SEED)

slim.fit(log=train)

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

metrics.add_result("SLIM_optimized", recs)
metrics.results

14-Sep-21 18:07:02, replay, DEBUG: Starting fit SLIM
14-Sep-21 18:07:02, replay, DEBUG: Creating indexers
14-Sep-21 18:07:04, replay, DEBUG: Main fit stage
14-Sep-21 18:07:10, replay, DEBUG: Starting predict SLIM


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.124 | 0.388 | 0.055293 | 0.103301 |
| SLIM_optimized | 0.154 | 0.44 | 0.0704 | 0.126539 |


### Convert to pandas

In [19]:
recs_pd = recs.toPandas()
recs_pd.head(3)

| | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 1018 | 1240 | 1.024585 |
| 1 | 1018 | 1370 | 0.87208 |
| 2 | 1018 | 1221 | 0.804827 |


## 4. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [20]:
from replay.models import ALSWrap

als = ALSWrap(rank=100, seed=SEED)

In [21]:
%%time
als.fit(log=train)

14-Sep-21 18:08:25, replay, DEBUG: Starting fit ALSWrap
14-Sep-21 18:08:25, replay, DEBUG: Creating indexers
14-Sep-21 18:08:26, replay, DEBUG: Main fit stage


CPU times: user 606 ms, sys: 107 ms, total: 714 ms
Wall time: 36.6 s


In [22]:
%%time
recs = als.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

14-Sep-21 18:09:02, replay, DEBUG: Starting predict ALSWrap


CPU times: user 1.03 s, sys: 219 ms, total: 1.25 s
Wall time: 3.58 s


In [23]:
%%time
metrics.add_result("ALS", recs)
metrics.results

CPU times: user 59.3 ms, sys: 15.7 ms, total: 75 ms
Wall time: 10.4 s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.124 | 0.388 | 0.055293 | 0.103301 |
| SLIM_optimized | 0.154 | 0.44 | 0.0704 | 0.126539 |
| ALS | 0.198 | 0.56 | 0.090033 | 0.162485 |


#### MultVAE
A variational autoencoder for recommendation.

In [24]:
from replay.models import MultVAE

multvae = MultVAE(epochs=100)

In [25]:
%%time
multvae.fit(log=train)

14-Sep-21 18:09:16, replay, DEBUG: Starting fit MultVAE
14-Sep-21 18:09:16, replay, DEBUG: Creating indexers
14-Sep-21 18:09:16, replay, DEBUG: Main fit stage
14-Sep-21 18:09:19, replay, DEBUG: Creating batch:
14-Sep-21 18:09:20, replay, DEBUG: Training VAE
14-Sep-21 18:09:22, replay, DEBUG: Epoch[1] current loss: 1351.33661
14-Sep-21 18:09:22, replay, DEBUG: Epoch[1] validation average loss: 1476.55960
14-Sep-21 18:09:24, replay, DEBUG: Epoch[2] current loss: 1246.04034
14-Sep-21 18:09:24, replay, DEBUG: Epoch[2] validation average loss: 1478.62748
14-Sep-21 18:09:27, replay, DEBUG: Epoch[3] current loss: 1290.10793

CPU times: user 35.7 s, sys: 9.03 s, total: 44.8 s
Wall time: 36 s


In [26]:
%%time

recs = multvae.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

14-Sep-21 18:09:52, replay, DEBUG: Starting predict MultVAE
14-Sep-21 18:09:55, replay, DEBUG: Model prediction


CPU times: user 1.03 s, sys: 243 ms, total: 1.27 s
Wall time: 3.78 s


In [27]:
%%time
metrics.add_result("MultVAE", recs)
metrics.results

CPU times: user 66.5 ms, sys: 15.1 ms, total: 81.6 ms
Wall time: 6.99 s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.124 | 0.388 | 0.055293 | 0.103301 |
| SLIM_optimized | 0.154 | 0.44 | 0.0704 | 0.126539 |
| ALS | 0.198 | 0.56 | 0.090033 | 0.162485 |
| MultVAE | 0.006 | 0.134 | 0.0103 | 0.024421 |


## 5. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them into a dataframe and pass them to ``Experiment``.

### 5.1. Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`

In [31]:
from pyspark.sql.functions import rand

In [32]:
# Simulate recommendations from an external model by replacing relevance with random scores
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)
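
When the recommendations come from a tool outside Spark, it can help to check that the file header matches the expected columns before loading it. A minimal sketch with the standard `csv` module, assuming hypothetical rows and an in-memory buffer in place of a real file such as `recs.csv`:

```python
import csv
import io

COLUMNS = ["user_id", "item_id", "relevance"]

# Hypothetical recommendations produced by an external model.
external_recs = [
    {"user_id": 1018, "item_id": 1240, "relevance": 0.91},
    {"user_id": 1018, "item_id": 1370, "relevance": 0.47},
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(external_recs)

# Read the data back and verify the header before further processing.
buffer.seek(0)
reader = csv.DictReader(buffer)
header = reader.fieldnames
rows = list(reader)

missing = set(COLUMNS) - set(header)
if missing:
    raise ValueError(f"columns missing from file: {sorted(missing)}")
print(header, len(rows))  # ['user_id', 'item_id', 'relevance'] 2
```

Values read this way are strings, so numeric columns still need to be cast before use.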

### 5.2. Read with DataPreparator

In [34]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    header=True,
    format_type="csv"
)

ValueError: feature_columns or columns_names has columns that are not present in DataFrame {'item_id', 'relevance', 'user_id'}

### 5.3. Compare with Experiment

In [None]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)