# RePlay Tutorial
This notebook introduces the RePlay library and covers:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- model saving and loading
- models comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

The `State` object accepts an existing Spark session or creates a new one. All RePlay modules then use this session.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` parameters, use the `get_spark_session` function from the `session_handler` module.

In [4]:
from replay.session_handler import State

spark = State().session
spark

In [5]:
K = 5
SEED=1234

## 0. Data preprocessing <a name='data-preparator'></a>
We will use MovieLens 1m as an example.

In [6]:
import pandas as pd
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### 0.1. DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The interaction matrix requires the ``item_id`` and ``user_id`` columns.
The ``relevance`` and interaction ``timestamp`` columns are optional.

The `DataPreparator` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.
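The preprocessing steps listed above (column renaming, null check, date parsing) can be pictured with a small plain-Python sketch. This is not RePlay code: `prepare_log` and its arguments are hypothetical, rows are plain dictionaries rather than Spark rows, and integer timestamps are assumed to be Unix seconds in UTC.

```python
from datetime import datetime, timezone


def prepare_log(rows, columns_names):
    """Rename columns to canonical names, reject nulls, and parse timestamps.

    `rows` is a list of dicts; `columns_names` maps canonical name -> source name.
    Hypothetical helper for illustration only, not part of RePlay.
    """
    prepared = []
    for row in rows:
        out = {}
        for canonical, source in columns_names.items():
            value = row.get(source)
            if value is None:
                raise ValueError(f"null value in column {source!r}")
            out[canonical] = value
        # Assumption: integer timestamps are Unix seconds in UTC.
        if isinstance(out.get("timestamp"), int):
            out["timestamp"] = datetime.fromtimestamp(out["timestamp"], tz=timezone.utc)
        # Relevance is optional; cast it to float when present.
        if "relevance" in out:
            out["relevance"] = float(out["relevance"])
        prepared.append(out)
    return prepared


# Source column names (uid, iid, rating, ts) are made up for this example.
raw = [
    {"uid": 1, "iid": 1193, "rating": 5, "ts": 978300760},
    {"uid": 1, "iid": 661, "rating": 3, "ts": 978302109},
]
log_rows = prepare_log(
    raw,
    {"user_id": "uid", "item_id": "iid", "relevance": "rating", "timestamp": "ts"},
)
```

The mapping direction (canonical name to source name) mirrors the `columns_names` argument used in the next cell.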

In [7]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [8]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
|      1|    914|      3.0|2000-12-31 22:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows



In [9]:
from replay.utils import convert2spark
users = convert2spark(users)
users.show(3)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
|      3|     M| 25|        15|   55117|
+-------+------+---+----------+--------+
only showing top 3 rows



### 0.2. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` moves ``item_test_size`` items for each of ``user_test_size`` users to the test dataset.
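To make the splitting rule concrete, here is a minimal plain-Python sketch of the idea: pick a fixed number of users at random, hold out a fixed number of their interactions, then drop test rows with cold users or items. `user_split` is a hypothetical helper operating on `(user_id, item_id)` pairs; it only approximates the behaviour described above and is not the RePlay implementation.

```python
import random
from collections import defaultdict


def user_split(log, item_test_size, user_test_size, seed=None):
    """Hold out `item_test_size` random items for `user_test_size` random users.

    `log` is a list of unique (user_id, item_id) pairs. Illustration only.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in log:
        by_user[user].append(item)

    test_users = set(rng.sample(sorted(by_user), user_test_size))
    train, test = [], []
    for user, items in by_user.items():
        held_out = set()
        if user in test_users:
            held_out = set(rng.sample(items, min(item_test_size, len(items))))
        for item in items:
            (test if item in held_out else train).append((user, item))

    # Drop cold users and cold items: keep test rows seen in train only.
    train_users = {user for user, _ in train}
    train_items = {item for _, item in train}
    test = [(u, i) for u, i in test if u in train_users and i in train_items]
    return train, test


# Toy log: 4 users, each interacting with the same 6 items.
toy_log = [(user, item) for user in range(4) for item in range(6)]
toy_train, toy_test = user_split(toy_log, item_test_size=2, user_test_size=2, seed=1234)
```

With this toy log, two users lose two interactions each, so the test part holds 4 rows and the train part the remaining 20.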

In [10]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2499)

## 1. Models training

#### SLIM

In [11]:
from replay.models import SLIM

slim = SLIM(seed=SEED)

In [13]:
%%time

slim.fit(log=train)

CPU times: user 2.5 s, sys: 239 ms, total: 2.74 s
Wall time: 10.2 s


In [14]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

CPU times: user 955 ms, sys: 308 ms, total: 1.26 s
Wall time: 11.4 s


In [15]:
recs.show(2)

+-------+-------+-------------------+
|user_id|item_id|          relevance|
+-------+-------+-------------------+
|   5207|   1196| 0.3033848470126411|
|   5207|   2797|0.24751908929278302|
+-------+-------+-------------------+
only showing top 2 rows



## 2. Models evaluation

RePlay implements several popular quality metrics for recommender systems. Use the metrics directly, or calculate a chosen set of metrics and compare models with the ``Experiment`` class.
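For intuition about what these metrics measure, the sketch below computes HitRate@k, NDCG@k, and coverage for a single user using common textbook definitions with binary relevance. The function names are hypothetical, and RePlay's own implementations may differ in normalisation or edge-case handling.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / idcg if idcg else 0.0


def coverage(all_recommended, catalog):
    """Share of catalog items that appear in at least one recommendation list."""
    return len(set(all_recommended) & set(catalog)) / len(catalog)


# One user: three ranked recommendations, two held-out relevant items.
ranked = [1, 2, 3]
ground_truth = {2, 4}
hit_1 = hit_rate_at_k(ranked, ground_truth, k=1)
hit_3 = hit_rate_at_k(ranked, ground_truth, k=3)
ndcg_3 = ndcg_at_k(ranked, ground_truth, k=3)
cov = coverage(ranked, catalog={1, 2, 3, 4, 5, 6})
```

Per-user values like these are typically averaged over all test users to obtain the figures reported in a results table.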

In [16]:
from replay.metrics import Coverage, HitRate, NDCG, MAP
from replay.experiment import Experiment

In [17]:
metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K],
                            Coverage(train): K
                           })

In [18]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 182 ms, sys: 59.4 ms, total: 241 ms
Wall time: 27.3 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.154926 | 0.236 | 0.544 | 0.094453 | 0.163997 |


## 3. Hyperparameters optimization

#### 3.1 Search

In [19]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [20]:
%%time
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=15)

[I 2021-11-11 10:21:26,213] A new study created in memory with name: no-name-0d9bf027-99b0-46b4-9e68-e4671635cc68
[I 2021-11-11 10:22:00,291] Trial 0 finished with value: 0.17884682905765073 and parameters: {'beta': 0.01, 'lambda_': 0.01}. Best is trial 0 with value: 0.17884682905765073.
[I 2021-11-11 10:22:30,488] Trial 1 finished with value: 0.17838596144637037 and parameters: {'beta': 1.5843648655110386e-05, 'lambda_': 0.0019381432430530966}. Best is trial 0 with value: 0.17884682905765073.
[I 2021-11-11 10:22:58,659] Trial 2 finished with value: 0.17845294084271404 and parameters: {'beta': 0.8747161053651644, 'lambda_': 3.2945196499038316e-06}. Best is trial 0 with value: 0.17884682905765073.
[I 2021-11-11 10:23:42,449] Trial 3 finished with value: 0.17574650496622896 and parameters: {'beta': 1.2583517304821009e-05, 'lambda_': 1.2436091696595869e-05}. Best is trial 0 with value: 0.17884682905765073.
[I 2021-11-11

CPU times: user 53 s, sys: 7.62 s, total: 1min
Wall time: 7min 3s


In [21]:
best_params

{'beta': 0.0009102123564610077, 'lambda_': 0.012004677143063356}
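The search above spends a fixed budget of trials, each evaluating one `beta` and `lambda_` pair on the validation split and keeping the best. The sketch below shows that loop in its simplest form, as random search over log-uniformly sampled values. It is deliberately simpler than what Optuna does by default, and `random_search`, `toy_objective`, and the search bounds are all assumptions made for illustration.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, space, budget, seed=None):
    """Evaluate `budget` random parameter sets and return the best one found."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {name: log_uniform(rng, low, high) for name, (low, high) in space.items()}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    """Stand-in for 'fit on train, score on validation'; peaks at beta=1e-3, lambda_=1e-2."""
    return -(math.log10(params["beta"]) + 3) ** 2 - (math.log10(params["lambda_"]) + 2) ** 2


# Hypothetical bounds; the real search space is defined by the model.
space = {"beta": (1e-6, 5.0), "lambda_": (1e-6, 2.0)}
found_params, found_value = random_search(toy_objective, space, budget=15, seed=1234)
```

In a real run the objective would fit the model on `train_opt` and return the chosen criterion computed on `val_opt`.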

#### 3.2 Compare with previous

In [22]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(log=train)

    recs = model.predict(
        k=K,
        users=test.select('user_id').distinct(),
        log=train,
        filter_seen_items=True
    )

    experiment.add_result(name, recs)
    return recs

In [23]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, 'SLIM_optimized')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 3.68 s, sys: 565 ms, total: 4.24 s
Wall time: 44.8 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.156275 | 0.24 | 0.548 | 0.095153 | 0.165387 |
| SLIM | 0.154926 | 0.236 | 0.544 | 0.094453 | 0.163997 |


### Convert to pandas

In [24]:
recs_pd = recs.toPandas()
recs_pd.head(2)

| | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 5207 | 1196 | 0.304172 |
| 1 | 5207 | 2797 | 0.24848 |


## 4. Save and load

RePlay lets you save and load fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder containing all the necessary parameters and data.
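The "model as a folder" idea can be sketched with the standard library: hyperparameters go into one file and fitted data into another, and loading reads both back. The file names, the JSON format, and the `save_model` and `load_model` helpers are assumptions for illustration; they do not describe the layout RePlay actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_model(params, item_similarity, path):
    """Write hyperparameters and fitted data as separate files inside one folder."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "params.json").write_text(json.dumps(params))
    (folder / "similarity.json").write_text(json.dumps(item_similarity))


def load_model(path):
    """Read back what save_model wrote."""
    folder = Path(path)
    params = json.loads((folder / "params.json").read_text())
    item_similarity = json.loads((folder / "similarity.json").read_text())
    return params, item_similarity


# Round trip through a temporary folder; values are made up for the example.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_model"
    save_model({"beta": 0.01, "lambda_": 0.01}, {"1193": {"661": 0.3}}, target)
    saved_files = sorted(p.name for p in target.iterdir())
    loaded_params, loaded_similarity = load_model(target)
```

Keeping parameters and data in separate files lets a loader rebuild the model object first and attach the fitted state afterwards.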

In [25]:
from replay.model_handler import save, load

In [26]:
save(slim, path='./slim_best_params')
slim_loaded = load('./slim_best_params')

In [27]:
%%time
pred_from_loaded = slim_loaded.predict(k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True)
pred_from_loaded.show(2)

+-------+-------+-------------------+
|user_id|item_id|          relevance|
+-------+-------+-------------------+
|   5207|   1196| 0.3135545764212523|
|   5207|   2918|0.24504774750833722|
+-------+-------+-------------------+
only showing top 2 rows

CPU times: user 1.04 s, sys: 269 ms, total: 1.31 s
Wall time: 14.2 s


In [28]:
slim_loaded.beta, slim_loaded.lambda_

(1.0330972425382408e-06, 0.6909563043593303)

## 5. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [29]:
from replay.models import ALSWrap

In [30]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, 'ALS')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 1.6 s, sys: 449 ms, total: 2.05 s
Wall time: 1min 5s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.156275 | 0.24 | 0.548 | 0.095153 | 0.165387 |
| SLIM | 0.154926 | 0.236 | 0.544 | 0.094453 | 0.163997 |
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |


#### KNN
A commonly used item-based recommender.
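As a rough picture of how an item-based nearest-neighbour recommender works, the sketch below computes cosine similarity between items from binary interactions, keeps the closest neighbours of each item, and scores unseen items by summing similarities to the items a user has seen. The helper names and the choice of cosine similarity are assumptions; RePlay's `KNN` may compute similarities and scores differently.

```python
import math
from collections import defaultdict


def fit_item_knn(log, num_neighbours):
    """Return, for every item, its `num_neighbours` most similar items (cosine)."""
    users_by_item = defaultdict(set)
    for user, item in log:
        users_by_item[item].add(user)

    neighbours = {}
    for item, users in users_by_item.items():
        sims = []
        for other, other_users in users_by_item.items():
            if other == item:
                continue
            common = len(users & other_users)
            if common:
                sims.append((common / math.sqrt(len(users) * len(other_users)), other))
        sims.sort(reverse=True)
        neighbours[item] = {other: sim for sim, other in sims[:num_neighbours]}
    return neighbours


def recommend(neighbours, seen_items, k):
    """Score unseen items by summed similarity to seen items and return the top k."""
    scores = defaultdict(float)
    for item in seen_items:
        for other, sim in neighbours.get(item, {}).items():
            if other not in seen_items:
                scores[other] += sim
    return sorted(scores, key=lambda candidate: (-scores[candidate], candidate))[:k]


# Toy log of (user, item) pairs.
toy_interactions = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "b"), (3, "c")]
toy_neighbours = fit_item_knn(toy_interactions, num_neighbours=2)
toy_recs = recommend(toy_neighbours, seen_items={"a"}, k=2)
```

Here item `b` shares two users with `a` while `c` shares only one, so `b` ranks first for a user who has seen only `a`.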

In [31]:
from replay.models import KNN

In [32]:
%%time
recs = fit_predict_evaluate(KNN(num_neighbours=100), metrics, 'KNN')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 1.61 s, sys: 468 ms, total: 2.08 s
Wall time: 1min 3s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.156275 | 0.24 | 0.548 | 0.095153 | 0.165387 |
| SLIM | 0.154926 | 0.236 | 0.544 | 0.094453 | 0.163997 |
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |
| KNN | 0.053441 | 0.158 | 0.394 | 0.05669 | 0.104892 |


## 6 Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them and pass them to ``Experiment``.

#### 6.1 Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`
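Before handing external recommendations to an evaluation step, it helps to check that the file has the expected three columns and parseable values. The sketch below does that with the standard `csv` module on an in-memory file; `read_recommendations` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io

REQUIRED_COLUMNS = ("user_id", "item_id", "relevance")


def read_recommendations(handle):
    """Parse a CSV with a header row and validate the required columns."""
    reader = csv.DictReader(handle)
    missing = [name for name in REQUIRED_COLUMNS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        {
            "user_id": int(row["user_id"]),
            "item_id": int(row["item_id"]),
            "relevance": float(row["relevance"]),
        }
        for row in reader
    ]


# Illustrative file content, not real model output.
csv_text = "user_id,item_id,relevance\n5207,1196,0.30\n5207,2797,0.25\n"
external_recs = read_recommendations(io.StringIO(csv_text))
```

A file missing any of the three columns raises a `ValueError` instead of failing later during metric calculation.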

In [33]:
from pyspark.sql.functions import rand

In [34]:
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)

#### 6.2 Read with DataPreparator

In [35]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    reader_kwargs={"header":True},
    format_type="csv"
)

#### 6.3 Compare with Experiment

In [36]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)

| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.156275 | 0.24 | 0.548 | 0.095153 | 0.165387 |
| SLIM | 0.154926 | 0.236 | 0.544 | 0.094453 | 0.163997 |
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |
| KNN | 0.053441 | 0.158 | 0.394 | 0.05669 | 0.104892 |
| my_model | 0.053441 | 0.116 | 0.394 | 0.04995 | 0.096634 |
