# RePlay Tutorial
This notebook introduces the RePlay library, including:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- models comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

The `State` object holds the Spark session used by all RePlay modules. You can pass it an existing session or let it create a new one.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` values, use the `get_spark_session` function from the `session_handler` module.

In [3]:
from replay.session_handler import State

spark = State().session
spark

In [4]:
K = 5
SEED = 1234

## 0. Data preprocessing <a name='data-preparator'></a>
We will use the MovieLens 1M dataset as an example.

In [5]:
import pandas as pd
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### 0.1. DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The interaction matrix requires the columns ``item_id`` and ``user_id``.
The columns ``relevance`` and ``timestamp`` (time of the interaction) are optional.

The `DataPreparator` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.
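
The preprocessing steps named above (column renaming, null check, date parsing) can be illustrated without Spark. The following is a minimal sketch in plain Python over a list of dicts. `prepare_rows`, the raw column names, the default relevance of 1.0, and the assumption that timestamps are Unix seconds are all hypothetical choices for illustration; this is not RePlay's implementation, and `DataPreparator` may handle more cases.

```python
from datetime import datetime, timezone

REQUIRED = ("user_id", "item_id")


def prepare_rows(rows, columns_names):
    """Rename columns, check required fields for nulls, parse timestamps.

    `columns_names` maps a target column name to the name used in `rows`,
    mirroring the argument passed to DataPreparator in this notebook.
    """
    missing = [name for name in REQUIRED if name not in columns_names]
    if missing:
        raise ValueError(f"required columns are not mapped: {missing}")

    prepared = []
    for row in rows:
        # Renaming: read each source column and store it under its target name.
        new_row = {target: row.get(source) for target, source in columns_names.items()}
        # Null check on the required columns only.
        for name in REQUIRED:
            if new_row[name] is None:
                raise ValueError(f"null value in required column {name!r}")
        # Assumption: an unmapped relevance column is created with value 1.0.
        new_row.setdefault("relevance", 1.0)
        if new_row.get("timestamp") is not None:
            # Assumption: timestamps are Unix seconds, as in MovieLens.
            new_row["timestamp"] = datetime.fromtimestamp(
                int(new_row["timestamp"]), tz=timezone.utc
            )
        prepared.append(new_row)
    return prepared


raw = [
    {"uid": 1, "movie": 1193, "rating": 5.0, "ts": 978300760},
    {"uid": 1, "movie": 661, "rating": 3.0, "ts": 978302109},
]
mapping = {"user_id": "uid", "item_id": "movie", "relevance": "rating", "timestamp": "ts"}
prepared = prepare_rows(raw, mapping)
print(prepared[0])
```

The sketch parses timestamps as UTC, so the printed times can differ by a fixed offset from output produced in another time zone.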

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.

In [6]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [7]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2001-01-01 01:12:40|
|      1|    661|      3.0|2001-01-01 01:35:09|
|      1|    914|      3.0|2001-01-01 01:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows



In [8]:
from replay.utils import convert2spark
users = convert2spark(users)

### 0.2. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` selects ``user_test_size`` users and moves ``item_test_size`` interactions of each selected user to the test dataset.

In [9]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2500)
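
The split above holds out `item_test_size` = 5 interactions for each of `user_test_size` = 500 users, which accounts for the 2500 test rows. The idea can be sketched in plain Python as follows. `user_split` is a hypothetical helper, not RePlay code; it ignores the cold user and cold item filtering, and it assumes that `shuffle=False` means holding out the latest interactions by timestamp.

```python
import random
from collections import defaultdict


def user_split(log, user_test_size, item_test_size, seed=None, shuffle=True):
    """Hold out `item_test_size` interactions for `user_test_size` random users.

    `log` is a list of (user_id, item_id, timestamp) tuples. This is a
    simplified stand-in for the idea behind UserSplitter, not its code.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in log:
        by_user[row[0]].append(row)

    # Pick the users whose interactions will partly go to the test set.
    test_users = set(rng.sample(sorted(by_user), user_test_size))
    train, test = [], []
    for user, rows in by_user.items():
        if user not in test_users:
            train.extend(rows)
            continue
        if shuffle:
            rows = rows[:]
            rng.shuffle(rows)  # hold out random interactions
        else:
            rows = sorted(rows, key=lambda r: r[2])  # assumed: hold out the latest
        train.extend(rows[:-item_test_size])
        test.extend(rows[-item_test_size:])
    return train, test


# Toy log: 10 users with 8 interactions each.
toy_log = [(u, i, u * 100 + i) for u in range(10) for i in range(8)]
toy_train, toy_test = user_split(toy_log, user_test_size=4, item_test_size=2, seed=1234)
print(len(toy_train), len(toy_test))  # 72 8
```

Every row of the toy log ends up in exactly one of the two parts, so the sizes always add up to the size of the input.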

## 1. Models training

#### SLIM

In [10]:
from replay.models import SLIM

slim = SLIM(lambda_=0.01, beta=0.01, seed=SEED)



In [11]:
%%time

slim.fit(log=train)

14-Sep-21 11:42:25, replay, DEBUG: Starting fit SLIM
DEBUG:replay:Starting fit SLIM
14-Sep-21 11:42:25, replay, DEBUG: Creating indexers
DEBUG:replay:Creating indexers
14-Sep-21 11:42:26, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage


CPU times: user 2.66 s, sys: 274 ms, total: 2.94 s
Wall time: 8.68 s


In [12]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

14-Sep-21 11:42:40, replay, DEBUG: Starting predict SLIM
DEBUG:replay:Starting predict SLIM


CPU times: user 1.36 s, sys: 322 ms, total: 1.68 s
Wall time: 6.19 s


## 2. Models evaluation

RePlay implements popular quality metrics for recommender systems. You can call a metric directly, or use the ``Experiment`` class to calculate a chosen set of metrics and compare models.

In [13]:
from replay.metrics import HitRate, NDCG, MAP
from replay.experiment import Experiment

metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K]})


In [14]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 136 ms, sys: 42.7 ms, total: 179 ms
Wall time: 1min 27s


| | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.124 | 0.388 | 0.055293 | 0.103301 |
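
For intuition about what these metrics measure, here is a minimal plain-Python sketch of HitRate@K, NDCG@K, and MAP@K with binary relevance. The function names and the toy data are hypothetical. Normalization conventions differ between libraries (this sketch divides average precision by `min(k, number of held-out items)`), so the values are not guaranteed to match RePlay's implementation.

```python
import math


def hit_rate_at_k(recs, truth, k):
    """Share of users with at least one held-out item among their top-k."""
    hits = sum(1 for u in truth if set(recs.get(u, [])[:k]) & truth[u])
    return hits / len(truth)


def ndcg_at_k(recs, truth, k):
    """Mean NDCG@k with binary relevance; assumes non-empty truth sets."""
    total = 0.0
    for u, relevant in truth.items():
        # Discounted gain of the hits, by 0-based position in the list.
        dcg = sum(
            1.0 / math.log2(pos + 2)
            for pos, item in enumerate(recs.get(u, [])[:k])
            if item in relevant
        )
        # Best achievable DCG: all relevant items ranked first.
        idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
        total += dcg / idcg
    return total / len(truth)


def map_at_k(recs, truth, k):
    """Mean average precision at k; assumes non-empty truth sets."""
    total = 0.0
    for u, relevant in truth.items():
        hits, ap = 0, 0.0
        for pos, item in enumerate(recs.get(u, [])[:k]):
            if item in relevant:
                hits += 1
                ap += hits / (pos + 1)  # precision at this cut-off
        total += ap / min(k, len(relevant))
    return total / len(truth)


# Toy data: ranked recommendations and held-out items per user.
toy_recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
toy_truth = {"u1": {"a", "c"}, "u2": {"x"}}

print(hit_rate_at_k(toy_recs, toy_truth, 3))  # 0.5
print(ndcg_at_k(toy_recs, toy_truth, 3))
print(map_at_k(toy_recs, toy_truth, 3))
```

All three metrics are averaged over the users in the test set, so a user with no hits contributes zero rather than being skipped.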


## 3. Hyperparameters optimization

In [15]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [None]:
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=10)

[I 2021-09-14 11:44:21,549] A new study created in memory with name: no-name-26d8ace4-56f2-49d3-9061-cef206003327
14-Sep-21 11:44:21, replay, DEBUG: Fitting model inside optimization
DEBUG:replay:Fitting model inside optimization
14-Sep-21 11:44:21, replay, DEBUG: Starting fit SLIM
DEBUG:replay:Starting fit SLIM
14-Sep-21 11:44:21, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage
14-Sep-21 11:44:33, replay, DEBUG: Predicting inside optimization
DEBUG:replay:Predicting inside optimization
14-Sep-21 11:44:33, replay, DEBUG: Starting predict SLIM
DEBUG:replay:Starting predict SLIM
14-Sep-21 11:44:39, replay, DEBUG: Calculating criterion
DEBUG:replay:Calculating criterion
14-Sep-21 11:48:06, replay, DEBUG: NDCG=0.021556
DEBUG:replay:NDCG=0.021556
[I 2021-09-14 11:48:06,621] Trial 0 finished with value: 0.02155556969535634 and parameters: {'beta': 0.0006486899564626721, 'lambda_': 1.0474894652929496e-07}. Best is trial 0 with value: 0.02155556969535634.
14

In [None]:
best_params
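
Each trial in the log above fits the model, predicts, and computes the criterion, and the best parameters are kept. The loop can be sketched with a plain random search under a fixed budget. This is a swapped-in technique for illustration, not the sampler used by the study in the log. `random_search`, `toy_criterion`, and the search bounds are hypothetical; the toy criterion stands in for the expensive fit, predict, and NDCG steps.

```python
import math
import random


def random_search(criterion, bounds, budget, seed=None):
    """Try `budget` random parameter sets and return the best one.

    `bounds` maps a parameter name to (low, high). Values are sampled
    log-uniformly, which suits regularization strengths that span
    several orders of magnitude.
    """
    rng = random.Random(seed)
    best, best_value = None, -math.inf
    for _ in range(budget):
        params = {
            name: math.exp(rng.uniform(math.log(low), math.log(high)))
            for name, (low, high) in bounds.items()
        }
        value = criterion(**params)  # stands in for fit -> predict -> metric
        if value > best_value:
            best, best_value = params, value
    return best, best_value


def toy_criterion(lambda_, beta):
    """Hypothetical smooth objective that peaks at lambda_ = beta = 0.01."""
    return -(math.log10(lambda_) + 2) ** 2 - (math.log10(beta) + 2) ** 2


toy_bounds = {"lambda_": (1e-6, 1.0), "beta": (1e-6, 1.0)}
toy_best, toy_value = random_search(toy_criterion, toy_bounds, budget=10, seed=1234)
print(toy_best, toy_value)
```

With a fixed seed, a larger budget extends the same sequence of trials, so the best value can only stay the same or improve.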

In [None]:
slim = SLIM(**best_params, seed=SEED)

slim.fit(log=train)

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

metrics.add_result("SLIM_optimized", recs)
metrics.results

### Convert to pandas

In [None]:
recs_pd = recs.toPandas()
recs_pd.head(3)

## 4. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [None]:
from replay.models import ALSWrap

als = ALSWrap(rank=100, seed=SEED)

In [None]:
%%time
als.fit(log=train)

In [None]:
%%time
recs = als.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

In [None]:
%%time
metrics.add_result("ALS", recs)
metrics.results

#### MultVAE 
A variational autoencoder for recommendation tasks.

In [None]:
from replay.models import MultVAE

multvae = MultVAE(epochs=100)

In [None]:
%%time
multvae.fit(log=train)

In [None]:
%%time

recs = multvae.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

In [None]:
%%time
metrics.add_result("MultVAE", recs)
metrics.results

## 5. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them into a dataframe and pass it to ``Experiment``.

#### 5.1. Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`

In [None]:
from pyspark.sql.functions import rand

In [None]:
# Overwrite the scores with random values to imitate recommendations from an external source
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)
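
If the recommendations come from a tool outside Spark, the same file layout can be produced with the standard library. The sketch below writes a CSV with the three expected columns and checks the header when reading it back. The file name, the temporary directory, and the sample triples are hypothetical.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ["user_id", "item_id", "relevance"]

# Hypothetical output of an external model: (user, item, score) triples.
external_recs = [(1, 1193, 0.93), (1, 661, 0.71), (2, 914, 0.88)]

path = os.path.join(tempfile.mkdtemp(), "external_recs.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(REQUIRED_COLUMNS)  # header row with the expected names
    writer.writerows(external_recs)

# Read the file back and verify that no expected column is missing.
with open(path, newline="") as f:
    reader = csv.DictReader(f)
    missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames)
    loaded = list(reader)

print(missing, len(loaded))
```

Writing the header explicitly keeps the column names aligned with the ``columns_names`` mapping used when the file is read.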

#### 5.2. Read with DataPreparator

In [None]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    header=True,
    format_type="csv"
)

#### 5.3. Compare with Experiment

In [None]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)