# RePlay Tutorial
This notebook introduces the RePlay library and covers:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- model saving and loading
- model comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

The `State` object lets you pass an existing Spark session or create a new one. All RePlay modules then use that session.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` parameters, use the `get_spark_session` function from the `session_handler` module.

In [4]:
from replay.session_handler import State

spark = State().session
spark

In [5]:
K = 5
SEED = 1234

## 0. Data preprocessing <a name='data-preparator'></a>
We will use MovieLens 1m as an example.

In [6]:
import pandas as pd
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### 0.1. DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The ``item_id`` and ``user_id`` columns are required for the interaction matrix.
The optional interaction matrix columns are ``relevance`` and the interaction ``timestamp``.

The `DataPreparator` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.
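
The preprocessing steps listed above (column renaming, null check, date parsing) can be pictured with a small plain-Python sketch. This is not RePlay's implementation: `prepare_log`, the raw column names `uid`, `movie`, `rating`, `ts`, the sample row, and the default relevance of 1.0 are all assumptions made for illustration. The mapping follows the same direction as `columns_names` in this notebook, with the standard name as the key and the source column as the value.

```python
from datetime import datetime, timezone

REQUIRED = ("user_id", "item_id")


def prepare_log(rows, columns_names):
    """Rename columns, reject nulls, and parse Unix-second timestamps."""
    missing = [name for name in REQUIRED if name not in columns_names]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    prepared = []
    for row in rows:
        # Map the caller's column names onto the standard ones.
        out = {target: row[source] for target, source in columns_names.items()}
        if any(value is None for value in out.values()):
            raise ValueError(f"null value in row: {row}")
        # Assumed default: create the optional relevance column when absent.
        out.setdefault("relevance", 1.0)
        if "timestamp" in out:
            out["timestamp"] = datetime.fromtimestamp(
                int(out["timestamp"]), tz=timezone.utc
            )
        prepared.append(out)
    return prepared


# Illustrative raw row with hypothetical column names.
raw = [{"uid": 1, "movie": 1193, "rating": 5, "ts": 978300760}]
log_rows = prepare_log(
    raw,
    columns_names={
        "user_id": "uid",
        "item_id": "movie",
        "relevance": "rating",
        "timestamp": "ts",
    },
)
print(log_rows[0]["timestamp"].strftime("%Y-%m-%d %H:%M:%S"))  # 2000-12-31 22:12:40
```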

In [7]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [8]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
|      1|    914|      3.0|2000-12-31 22:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows



In [9]:
from replay.utils import convert2spark
users = convert2spark(users)
users.show(3)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
|      3|     M| 25|        15|   55117|
+-------+------+---+----------+--------+
only showing top 3 rows



### 0.2. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` moves ``item_test_size`` items for each of ``user_test_size`` users to the test dataset.
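
As a rough picture of this splitting scheme, the sketch below holds out a fixed number of random items for a fixed number of random users. It is a plain-Python illustration under simplifying assumptions, not RePlay's code: `user_split` and the toy log are hypothetical, each user interacts with an item at most once, and the cold user/item filtering is left out.

```python
import random
from collections import defaultdict


def user_split(log, item_test_size, user_test_size, seed):
    """Move `item_test_size` random items of `user_test_size` random users to test."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, item_id in log:
        by_user[user_id].append(item_id)

    # Choose which users contribute to the test set.
    test_users = set(rng.sample(sorted(by_user), user_test_size))

    train, test = [], []
    for user_id in sorted(by_user):
        items = by_user[user_id]
        if user_id in test_users:
            held_out = set(rng.sample(items, min(item_test_size, len(items))))
        else:
            held_out = set()
        for item_id in items:
            (test if item_id in held_out else train).append((user_id, item_id))
    return train, test


# Toy log: 10 users, each with 20 distinct items.
toy_log = [(u, i) for u in range(10) for i in range(20)]
train_toy, test_toy = user_split(toy_log, item_test_size=5, user_test_size=3, seed=1234)
print(len(train_toy), len(test_toy))  # 185 15
```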

In [10]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2499)

## 1. Models training

#### SLIM

In [11]:
from replay.models import SLIM

slim = SLIM(seed=SEED)

In [12]:
%%time

slim.fit(log=train)

CPU times: user 2.47 s, sys: 236 ms, total: 2.7 s
Wall time: 9.3 s


In [13]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

CPU times: user 1 s, sys: 258 ms, total: 1.26 s
Wall time: 10.8 s


In [14]:
recs.show(2)

+-------+-------+-------------------+
|user_id|item_id|          relevance|
+-------+-------+-------------------+
|   5207|   1196|0.21695371083720696|
|   5207|    260|0.16495029676206974|
+-------+-------+-------------------+
only showing top 2 rows
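
With `filter_seen_items=True`, items that the user has already interacted with in `log` are left out of the recommendations. Conceptually this is a filter followed by a top-`k` selection, as in the sketch below. The function name and the scores are made up for illustration and are not taken from the model above.

```python
def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring (item_id, score) pairs that are not in `seen`."""
    candidates = [(item, score) for item, score in scores.items() if item not in seen]
    # Sort by score descending; break ties by item id for a stable order.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:k]


# Made-up scores for one user; item 40 has the best score but was already seen.
scores = {10: 0.9, 20: 0.8, 30: 0.7, 40: 0.95}
seen = {40}
print(top_k_unseen(scores, seen, k=2))  # [(10, 0.9), (20, 0.8)]
```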



## 2. Models evaluation

RePlay implements several popular quality metrics for recommender systems. Use the metrics on their own, or use the ``Experiment`` class to calculate a chosen set of metrics and compare models.
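
To make the metric names used in this section concrete, here is a plain-Python sketch of common binary-relevance definitions of HitRate@K, NDCG@K and Coverage@K. The function names and toy data are hypothetical, and RePlay's exact definitions may differ in details, so treat this as intuition rather than a reference implementation.

```python
import math


def hit_rate_at_k(pred, truth, k):
    """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in truth for item in pred[:k]) else 0.0


def ndcg_at_k(pred, truth, k):
    """Binary-relevance NDCG: discounted hits divided by the best achievable value."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(pred[:k])
        if item in truth
    )
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(truth))))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_preds, catalog, k):
    """Share of catalog items that appear in at least one user's top-k list."""
    recommended = {item for pred in all_preds for item in pred[:k]}
    return len(recommended & catalog) / len(catalog)


pred = ["a", "b", "c"]   # ranked recommendations for one user
truth = {"b", "d"}       # that user's held-out items
print(hit_rate_at_k(pred, truth, 1), hit_rate_at_k(pred, truth, 3))  # 0.0 1.0
print(round(ndcg_at_k(pred, truth, 3), 4))

catalog = {"a", "b", "c", "d", "e", "f", "g", "h"}
print(coverage_at_k([["a", "b", "c"], ["a", "d", "e"]], catalog, 3))  # 0.625
```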

In [15]:
from replay.metrics import Coverage, HitRate, NDCG, MAP
from replay.experiment import Experiment

In [16]:
metrics = Experiment(
    test,
    {
        NDCG(): K,
        MAP(): K,
        HitRate(): [1, K],
        Coverage(train): K,
    },
)

In [17]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 200 ms, sys: 61.1 ms, total: 261 ms
Wall time: 36.6 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.102024 | 0.232 | 0.534 | 0.091587 | 0.160643 |


## 3. Hyperparameters optimization

#### 3.1 Search

In [18]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [19]:
%%time
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=10)

[I 2021-11-10 12:36:51,508] A new study created in memory with name: no-name-c72ad0d2-7fce-4c86-9b21-2fecfd9e06ad
[I 2021-11-10 12:37:26,470] Trial 0 finished with value: 0.17096044647105743 and parameters: {'beta': 4.0, 'lambda_': 0.02}. Best is trial 0 with value: 0.17096044647105743.
[I 2021-11-10 12:37:50,985] Trial 1 finished with value: 0.16159904375486778 and parameters: {'beta': 0.05471053849264981, 'lambda_': 0.10778024693813405}. Best is trial 0 with value: 0.17096044647105743.
[I 2021-11-10 12:38:23,438] Trial 2 finished with value: 0.1769496456868256 and parameters: {'beta': 0.00044924832342506534, 'lambda_': 2.3001573649768165e-05}. Best is trial 2 with value: 0.1769496456868256.
[I 2021-11-10 12:38:42,970] Trial 3 finished with value: 0.08840223781877281 and parameters: {'beta': 3.137366770695763e-05, 'lambda_': 1.5241217094477804}. Best is trial 2 with value: 0.1769496456868256.
[I 2021-11-10 12:39:13,

CPU times: user 35.2 s, sys: 5.56 s, total: 40.7 s
Wall time: 4min 24s


In [20]:
best_params

{'beta': 0.00549724414230249, 'lambda_': 2.615874217001037e-05}
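
The log above comes from Optuna, which drives the search. For intuition about what a budgeted search over `beta` and `lambda_` does, the sketch below uses plain random search with log-uniform sampling. This is a deliberately simpler technique than the one Optuna applies. The search bounds and `toy_objective` are assumptions: in the notebook, the objective would be "fit on `train_opt`, compute NDCG@K on `val_opt`".

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, space, budget, seed):
    """Evaluate `budget` random parameter sets and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {name: log_uniform(rng, low, high) for name, (low, high) in space.items()}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    # Hypothetical stand-in for a validation metric: highest when both
    # parameters are close to 1e-3 on a log scale.
    return -sum((math.log10(value) + 3) ** 2 for value in params.values())


# Assumed bounds, chosen only for this illustration.
space = {"beta": (1e-6, 5.0), "lambda_": (1e-6, 2.0)}
best, value = random_search(toy_objective, space, budget=10, seed=1234)
print(best, value)
```

Sampling on a log scale matters here because plausible values span several orders of magnitude, as the trial parameters in the log show.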

#### 3.2 Compare with previous

In [21]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(log=train)

    recs = model.predict(
        k=K,
        users=test.select('user_id').distinct(),
        log=train,
        filter_seen_items=True
    )

    experiment.add_result(name, recs)
    return recs

In [22]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, 'SLIM_optimized')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 3.62 s, sys: 634 ms, total: 4.26 s
Wall time: 52.9 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.102024 | 0.232 | 0.534 | 0.091587 | 0.160643 |
| SLIM_optimized | 0.161943 | 0.234 | 0.524 | 0.09126 | 0.15897 |


### Convert to pandas

In [23]:
recs_pd = recs.toPandas()
recs_pd.head(2)

| | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 5207 | 1196 | 0.303419 |
| 1 | 5207 | 2797 | 0.245172 |


## 4. Save and load

RePlay lets you save and load fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder containing all the necessary parameters and data.
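
The idea of a model stored as a folder of parameters plus data can be sketched without RePlay. Everything below is hypothetical (`ToyModel`, `save_model`, `load_model`, the file names) and says nothing about the on-disk layout `model_handler` actually uses. It only shows a round trip that restores both the hyperparameters and the fitted state.

```python
import json
import tempfile
from pathlib import Path


class ToyModel:
    """Hypothetical model: two hyperparameters plus fitted item counts."""

    def __init__(self, beta, lambda_):
        self.beta = beta
        self.lambda_ = lambda_
        self.item_counts = {}

    def fit(self, log):
        for _, item_id in log:
            self.item_counts[item_id] = self.item_counts.get(item_id, 0) + 1


def save_model(model, path):
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    params = {"beta": model.beta, "lambda_": model.lambda_}
    (folder / "params.json").write_text(json.dumps(params))
    # JSON object keys must be strings, so store the counts as a list of pairs.
    (folder / "data.json").write_text(json.dumps(sorted(model.item_counts.items())))


def load_model(path):
    folder = Path(path)
    model = ToyModel(**json.loads((folder / "params.json").read_text()))
    pairs = json.loads((folder / "data.json").read_text())
    model.item_counts = {item: count for item, count in pairs}
    return model


with tempfile.TemporaryDirectory() as tmp:
    original = ToyModel(beta=0.01, lambda_=0.5)
    original.fit([(1, 10), (2, 10), (2, 20)])
    save_model(original, Path(tmp) / "toy_model")
    restored = load_model(Path(tmp) / "toy_model")

print(restored.beta == original.beta, restored.item_counts == original.item_counts)  # True True
```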

In [24]:
from replay.model_handler import save, load

In [25]:
save(slim, path='./slim_best_params')
slim_loaded = load('./slim_best_params')

In [26]:
%%time
pred_from_loaded = slim_loaded.predict(k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True)
pred_from_loaded.show(2)

+-------+-------+------------------+
|user_id|item_id|         relevance|
+-------+-------+------------------+
|   5207|   1196|0.3291081525427723|
|   5207|   2797|0.2627330979873368|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 934 ms, sys: 350 ms, total: 1.28 s
Wall time: 13.8 s


In [27]:
slim_loaded.beta, slim_loaded.lambda_

(2.7264231112255945e-06, 0.40390670161348624)

## 5. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [28]:
from replay.models import ALSWrap

In [29]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, 'ALS')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 1.58 s, sys: 503 ms, total: 2.09 s
Wall time: 1min 2s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |
| SLIM | 0.102024 | 0.232 | 0.534 | 0.091587 | 0.160643 |
| SLIM_optimized | 0.161943 | 0.234 | 0.524 | 0.09126 | 0.15897 |


#### KNN
A commonly used item-based recommender.
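
For readers new to item-based nearest-neighbour recommenders, the sketch below shows the general idea on a toy log: items are similar when the same users interacted with them, and a user's candidates are scored by summing similarities to the items they already have. It uses cosine similarity over binary interactions. The function names and data are hypothetical, and RePlay's `KNN` may weight or normalize differently.

```python
import math
from collections import defaultdict


def item_cosine_similarities(log):
    """Cosine similarity between items, based on the sets of users who chose them."""
    users_by_item = defaultdict(set)
    for user_id, item_id in log:
        users_by_item[item_id].add(user_id)
    sims = defaultdict(dict)
    items = sorted(users_by_item)
    for a in items:
        for b in items:
            if a == b:
                continue
            common = len(users_by_item[a] & users_by_item[b])
            if common:
                norm = math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[a][b] = common / norm
    return sims


def recommend(user_items, sims, k, num_neighbours):
    """Score unseen items by summed similarity to the user's items; return the top k."""
    scores = defaultdict(float)
    for seen_item in user_items:
        neighbours = sorted(
            sims.get(seen_item, {}).items(), key=lambda pair: (-pair[1], pair[0])
        )[:num_neighbours]
        for candidate, similarity in neighbours:
            if candidate not in user_items:
                scores[candidate] += similarity
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


# Toy log of (user_id, item_id) pairs.
toy_log = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "b"), (3, "c"), (4, "a")]
sims = item_cosine_similarities(toy_log)
print(recommend({"a"}, sims, k=1, num_neighbours=2))  # item "b" ranks first
```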

In [30]:
from replay.models import KNN

In [31]:
%%time
recs = fit_predict_evaluate(KNN(num_neighbours=100), metrics, 'KNN')
metrics.results.sort_values('NDCG@5', ascending=False)

CPU times: user 1.61 s, sys: 447 ms, total: 2.06 s
Wall time: 1min 3s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |
| SLIM | 0.102024 | 0.232 | 0.534 | 0.091587 | 0.160643 |
| SLIM_optimized | 0.161943 | 0.234 | 0.524 | 0.09126 | 0.15897 |
| KNN | 0.053441 | 0.158 | 0.394 | 0.05669 | 0.104892 |


## 6. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them and pass them to ``Experiment``.

#### 6.1 Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`
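
Before handing an external file to the pipeline, it can help to check that it has the three expected columns. The sketch below does this with the standard `csv` module. `read_recommendations` and the sample rows are hypothetical, and integer ids are assumed, as in the data used in this notebook.

```python
import csv
import io

EXPECTED_COLUMNS = ("user_id", "item_id", "relevance")


def read_recommendations(handle):
    """Parse a CSV with a header row and check that the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        (int(row["user_id"]), int(row["item_id"]), float(row["relevance"]))
        for row in reader
    ]


# Made-up file contents; in practice, pass an open file instead of StringIO.
csv_text = "user_id,item_id,relevance\n1,10,0.9\n1,20,0.4\n"
rows = read_recommendations(io.StringIO(csv_text))
print(rows)  # [(1, 10, 0.9), (1, 20, 0.4)]
```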

In [32]:
from pyspark.sql.functions import rand

In [33]:
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)

#### 6.2 Read with DataPreparator

In [34]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    reader_kwargs={"header":True},
    format_type="csv"
)

#### 6.3 Compare with Experiment

In [35]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)

| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.201619 | 0.222 | 0.546 | 0.092473 | 0.161675 |
| SLIM | 0.102024 | 0.232 | 0.534 | 0.091587 | 0.160643 |
| SLIM_optimized | 0.161943 | 0.234 | 0.524 | 0.09126 | 0.15897 |
| KNN | 0.053441 | 0.158 | 0.394 | 0.05669 | 0.104892 |
| my_model | 0.053441 | 0.116 | 0.394 | 0.04995 | 0.096634 |
