# RePlay Library Experiments on ML-1m [Part 1]

## Setup

### Spark installation

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install -q pyspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

import findspark
findspark.init()

### RePlay library installation

In [None]:
!pip install replay-rec #v0.6.1

In [None]:
import replay
replay.__version__

'0.6.1'

### Environment setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import logging
logger = logging.getLogger("replay")

### Downloading ML-1m datasets

In [None]:
!wget -q --show-progress https://github.com/RecoHut-Datasets/movielens_1m/raw/main/ml1m_items.dat
!wget -q --show-progress https://github.com/RecoHut-Datasets/movielens_1m/raw/main/ml1m_ratings.dat
!wget -q --show-progress https://github.com/RecoHut-Datasets/movielens_1m/raw/main/ml1m_users.dat
!wget -q --show-progress https://github.com/RecoHut-Datasets/movielens_1m/raw/main/ml_ratings.csv



### Spark Session State

The ``State`` object either wraps an existing Spark session or creates a new one; all RePlay modules then use that session.

In [None]:
from replay.session_handler import State

spark = State().session
spark

### Params

In [None]:
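K = 5  # number of items recommended (and held out for testing) per user
SEED = 1234  # random seed for reproducible splits and models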

## Data preprocessing

We will use MovieLens 1m as an example.

In [None]:
import pandas as pd
df = pd.read_csv("ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "relevance", "timestamp"])
users = pd.read_csv("ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

### DataPreparator

RePlay uses Spark dataframes as its internal data format.
You can pass either a Spark or a pandas dataframe as input. The interaction matrix requires the columns ``item_id`` and ``user_id``.
The optional interaction matrix columns are ``relevance`` and the interaction ``timestamp``.

The ``DataPreparator`` class converts dataframes to Spark format and preprocesses the data: it renames or creates the required and optional interaction matrix columns, checks for nulls, and parses dates.

To convert a pandas dataframe to Spark as is, use the ``convert2spark`` function from ``replay.utils``.
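The preprocessing steps described here (renaming columns, checking for nulls, parsing dates) can be pictured with a small plain-Python sketch. This is not RePlay's implementation: `prepare_rows`, the source column names (`uid`, `movie`, `rating`, `ts`), and the sample row are hypothetical, and the mapping direction (target name to source name) is an assumption modeled on the `columns_names` argument used in this notebook.

```python
from datetime import datetime, timezone


def prepare_rows(rows, columns_names):
    """Rename columns, reject nulls, and parse Unix timestamps (illustrative only)."""
    prepared = []
    for row in rows:
        # Rename: columns_names maps the target name to the source column name.
        new_row = {target: row[source] for target, source in columns_names.items()}
        # Null check on the two required columns.
        for required in ("user_id", "item_id"):
            if new_row.get(required) is None:
                raise ValueError(f"null value in required column {required!r}")
        # Optional columns: cast relevance to float, parse the timestamp as UTC.
        if "relevance" in new_row:
            new_row["relevance"] = float(new_row["relevance"])
        if "timestamp" in new_row:
            new_row["timestamp"] = datetime.fromtimestamp(
                new_row["timestamp"], tz=timezone.utc
            )
        prepared.append(new_row)
    return prepared


# Hypothetical raw row with a Unix timestamp in seconds.
raw = [{"uid": 1, "movie": 1193, "rating": 5, "ts": 978300760}]
mapping = {"user_id": "uid", "item_id": "movie", "relevance": "rating", "timestamp": "ts"}
prepared = prepare_rows(raw, mapping)
print(prepared[0]["timestamp"].strftime("%Y-%m-%d %H:%M:%S"))  # 2000-12-31 22:12:40
```

The real class works on Spark dataframes rather than lists of dicts; the sketch only shows the order of the checks.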

In [None]:
from replay.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=df,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

In [None]:
log.show(3)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
|      1|    914|      3.0|2000-12-31 22:32:48|
+-------+-------+---------+-------------------+
only showing top 3 rows



In [None]:
from replay.utils import convert2spark
users = convert2spark(users)

### Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems.

`UserSplitter` moves ``item_test_size`` items per user to the test dataset; ``user_test_size`` limits how many users are sampled, so the split below yields 500 × 5 = 2500 test rows.
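As a rough mental model of this kind of split, the sketch below holds out a fixed number of random interactions for a fixed number of random users, using only the standard library. It is an illustration under stated assumptions, not RePlay's code: `user_split` is a hypothetical helper, each user is assumed to interact with an item at most once, and cold-user or cold-item filtering is not modeled.

```python
import random
from collections import defaultdict


def user_split(log, item_test_size, user_test_size, seed):
    """Hold out `item_test_size` random items for `user_test_size` random users."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, item_id in log:
        by_user[user_id].append(item_id)
    # Pick the users whose interactions are partly moved to the test set.
    test_users = set(rng.sample(sorted(by_user), user_test_size))
    train, test = [], []
    for user_id, items in by_user.items():
        if user_id in test_users:
            held_out = set(rng.sample(items, min(item_test_size, len(items))))
        else:
            held_out = set()
        # Every interaction goes to exactly one of the two sets.
        for item_id in items:
            (test if item_id in held_out else train).append((user_id, item_id))
    return train, test


# Toy log: 10 users, 8 items each.
toy_log = [(u, i) for u in range(10) for i in range(8)]
toy_train, toy_test = user_split(toy_log, item_test_size=5, user_test_size=4, seed=1234)
print(len(toy_train), len(toy_test))  # 60 20
```

With 4 test users and 5 held-out items each, the test set has 4 × 5 = 20 rows and the remaining 60 rows stay in train.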

In [None]:
from replay.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(997709, 2500)

## Models training

#### SLIM

In [None]:
from replay.models import SLIM

slim = SLIM(lambda_=0.01, beta=0.01, seed=SEED)

In [None]:
%%time

slim.fit(log=train)

12-Oct-21 14:57:25, replay, DEBUG: Starting fit SLIM
DEBUG:replay:Starting fit SLIM
12-Oct-21 14:57:25, replay, DEBUG: Creating indexers
DEBUG:replay:Creating indexers
12-Oct-21 14:57:29, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage


CPU times: user 3.88 s, sys: 544 ms, total: 4.43 s
Wall time: 16.4 s


In [None]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

12-Oct-21 14:57:41, replay, DEBUG: Starting predict SLIM
DEBUG:replay:Starting predict SLIM


CPU times: user 1.49 s, sys: 404 ms, total: 1.89 s
Wall time: 7.26 s


## Models evaluation

RePlay implements several popular quality metrics for recommender systems. Use a metric on its own, or use the ``Experiment`` class to calculate a chosen set of metrics and compare models.
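For intuition about what the three metrics used in this notebook measure, here is a minimal per-user sketch with binary relevance. These are common textbook definitions written for illustration; normalization conventions vary between libraries (for example, average precision is sometimes divided by the number of relevant items instead of `k`), so the functions below are assumptions and may not match RePlay's formulas exactly. The item ids are hypothetical.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one of the top-k recommendations is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the hits, divided by the gain of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def average_precision_at_k(recommended, relevant, k):
    """Sum of precision@rank over the ranks that hold a hit, divided by k."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / k


recommended = [10, 20, 30, 40, 50]  # ranked list for one user
relevant = {20, 50, 99}             # that user's held-out test items
print(hit_rate_at_k(recommended, relevant, 1))  # 0.0
print(hit_rate_at_k(recommended, relevant, 5))  # 1.0
print(round(average_precision_at_k(recommended, relevant, 5), 4))  # 0.18
print(round(ndcg_at_k(recommended, relevant, 5), 4))
```

The reported value of each metric is the mean of the per-user values over all test users.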

In [None]:
from replay.metrics import HitRate, NDCG, MAP
from replay.experiment import Experiment

metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K]})


In [None]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 126 ms, sys: 66.2 ms, total: 192 ms
Wall time: 1min 30s


| Model | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.418 | 0.056027 | 0.106655 |


## Hyperparameters optimization

In [None]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)
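The search loop behind a budgeted optimization can be sketched as follows: for each of `budget` trials, sample hyperparameters, score them on the validation split, and keep the best. The sketch swaps in plain random search with log-uniform sampling and a toy objective; it is not the sampler RePlay uses, and `random_search`, `toy_objective`, and the bounds are hypothetical names and values chosen for illustration.

```python
import math
import random


def random_search(objective, bounds, budget, seed):
    """Sample each parameter log-uniformly and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {
            name: math.exp(rng.uniform(math.log(low), math.log(high)))
            for name, (low, high) in bounds.items()
        }
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    """Stand-in for: fit on train_opt, predict, compute NDCG on val_opt."""
    return -abs(math.log10(params["beta"])) - abs(math.log10(params["lambda_"]) + 3)


# Hypothetical search ranges, not RePlay's defaults.
toy_bounds = {"beta": (1e-6, 5.0), "lambda_": (1e-6, 2.0)}
toy_best, toy_score = random_search(toy_objective, toy_bounds, budget=10, seed=1234)
print(toy_best, toy_score)
```

Sampling on a log scale spreads trials evenly across orders of magnitude, which suits regularization strengths that can range from 1e-6 to above 1.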

In [None]:
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=10)

[I 2021-10-12 14:59:20,761] A new study created in memory with name: no-name-8a5ce520-8017-46ce-9e75-be24871278dd
12-Oct-21 14:59:20, replay, DEBUG: Fitting model inside optimization
DEBUG:replay:Fitting model inside optimization
12-Oct-21 14:59:20, replay, DEBUG: Starting fit SLIM
DEBUG:replay:Starting fit SLIM
12-Oct-21 14:59:20, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage
12-Oct-21 14:59:28, replay, DEBUG: Predicting inside optimization
DEBUG:replay:Predicting inside optimization
12-Oct-21 14:59:28, replay, DEBUG: Starting predict SLIM
DEBUG:replay:Starting predict SLIM
12-Oct-21 14:59:35, replay, DEBUG: Calculating criterion
DEBUG:replay:Calculating criterion
12-Oct-21 15:04:49, replay, DEBUG: NDCG=0.028056
DEBUG:replay:NDCG=0.028056
[I 2021-10-12 15:04:49,686] Trial 0 finished with value: 0.028055917242384147 and parameters: {'beta': 2.4225132371536803e-05, 'lambda_': 4.690301379365397e-06}. Best is trial 0 with value: 0.028055917242384147.


In [None]:
best_params

{'beta': 2.742456249258626, 'lambda_': 0.0017326235909461067}

In [None]:
slim = SLIM(**best_params, seed=SEED)

slim.fit(log=train)

recs = slim.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

metrics.add_result("SLIM_optimized", recs)
metrics.results

12-Oct-21 15:19:45, replay, DEBUG: Starting fit SLIM
DEBUG:replay:Starting fit SLIM
12-Oct-21 15:19:45, replay, DEBUG: Creating indexers
DEBUG:replay:Creating indexers
12-Oct-21 15:19:47, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage
12-Oct-21 15:19:52, replay, DEBUG: Starting predict SLIM
DEBUG:replay:Starting predict SLIM


| Model | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.418 | 0.056027 | 0.106655 |
| SLIM_optimized | 0.218 | 0.54 | 0.086473 | 0.153695 |


### Convert to pandas

In [None]:
recs_pd = recs.toPandas()
recs_pd.head(3)

|  | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 1064 | 527 | 0.516111 |
| 1 | 1064 | 1213 | 0.505202 |
| 2 | 1064 | 2571 | 0.466897 |


## Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [None]:
from replay.models import ALSWrap

als = ALSWrap(rank=100, seed=SEED)

In [None]:
%%time
als.fit(log=train)

12-Oct-21 15:21:07, replay, DEBUG: Starting fit ALSWrap
DEBUG:replay:Starting fit ALSWrap
12-Oct-21 15:21:07, replay, DEBUG: Creating indexers
DEBUG:replay:Creating indexers
12-Oct-21 15:21:07, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage


CPU times: user 576 ms, sys: 114 ms, total: 690 ms
Wall time: 38.9 s


In [None]:
%%time
recs = als.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

12-Oct-21 15:21:46, replay, DEBUG: Starting predict ALSWrap
DEBUG:replay:Starting predict ALSWrap


CPU times: user 1.05 s, sys: 225 ms, total: 1.27 s
Wall time: 3.94 s


In [None]:
%%time
metrics.add_result("ALS", recs)
metrics.results

CPU times: user 68.1 ms, sys: 21.1 ms, total: 89.1 ms
Wall time: 11 s


| Model | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.418 | 0.056027 | 0.106655 |
| SLIM_optimized | 0.218 | 0.54 | 0.086473 | 0.153695 |
| ALS | 0.224 | 0.582 | 0.094293 | 0.168814 |


#### MultVAE
A variational autoencoder for the recommendation task.

In [None]:
from replay.models import MultVAE

multvae = MultVAE(epochs=100)

In [None]:
%%time
multvae.fit(log=train)

12-Oct-21 15:22:01, replay, DEBUG: Starting fit MultVAE
DEBUG:replay:Starting fit MultVAE
12-Oct-21 15:22:01, replay, DEBUG: Creating indexers
DEBUG:replay:Creating indexers
12-Oct-21 15:22:01, replay, DEBUG: Main fit stage
DEBUG:replay:Main fit stage
12-Oct-21 15:22:03, replay, DEBUG: Creating batch:
DEBUG:replay:Creating batch:
12-Oct-21 15:22:04, replay, DEBUG: Training VAE
DEBUG:replay:Training VAE
12-Oct-21 15:22:06, replay, DEBUG: Epoch[1] current loss:
1340.81750
DEBUG:replay:Epoch[1] current loss:
1340.81750
12-Oct-21 15:22:07, replay, DEBUG: Epoch[1] validation
average loss: 1475.84116
DEBUG:replay:Epoch[1] validation
average loss: 1475.84116
12-Oct-21 15:22:09, replay, DEBUG: Epoch[2] current loss:
1235.64831
DEBUG:replay:Epoch[2] current loss:
1235.64831
12-Oct-21 15:22:09, replay, DEBUG: Epoch[2] validation
average loss: 1478.26356
DEBUG:replay:Epoch[2] validation
average loss: 1478.26356
12-Oct-21 15:22:11, replay, DEBUG: Epoch[3] current loss:
1257.92309
DEBUG:replay:Epoc

CPU times: user 4min 35s, sys: 1min 7s, total: 5min 42s
Wall time: 4min 21s


In [None]:
%%time

recs = multvae.predict(
    k=K,
    users=test.select('user_id').distinct(),
    log=train,
    filter_seen_items=True
)

12-Oct-21 15:26:22, replay, DEBUG: Starting predict MultVAE
DEBUG:replay:Starting predict MultVAE
12-Oct-21 15:26:30, replay, DEBUG: Model prediction
DEBUG:replay:Model prediction


CPU times: user 1.39 s, sys: 340 ms, total: 1.73 s
Wall time: 9.39 s


In [None]:
%%time
metrics.add_result("MultVAE", recs)
metrics.results

CPU times: user 93.2 ms, sys: 26.5 ms, total: 120 ms
Wall time: 15.6 s


| Model | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| SLIM | 0.128 | 0.418 | 0.056027 | 0.106655 |
| SLIM_optimized | 0.218 | 0.54 | 0.086473 | 0.153695 |
| ALS | 0.224 | 0.582 | 0.094293 | 0.168814 |
| MultVAE | 0.142 | 0.464 | 0.06196 | 0.118569 |


## Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them into a dataframe and pass it to ``Experiment``.

#### Save your recommendations as a dataframe with columns `user_id`, `item_id`, `relevance`
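The expected file layout is one row per recommended item, with a header row naming the three columns. A minimal standard-library sketch of writing and re-reading such a file is shown here; the recommendation values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical recommendations from an external model: (user_id, item_id, relevance).
external_recs = [(1, 527, 0.91), (1, 1213, 0.47), (2, 2571, 0.66)]

buffer = io.StringIO()  # stands in for a file such as "recs.csv"
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["user_id", "item_id", "relevance"])  # header row
writer.writerows(external_recs)
csv_text = buffer.getvalue()

# Read the rows back by column name, as a header-aware CSV reader would.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text.splitlines()[0])  # user_id,item_id,relevance
print(parsed[0])
```

Keeping the header row matters when the reader is told to expect one, as with `reader_kwargs={"header": True}` in the cell that loads the file.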

In [None]:
from pyspark.sql.functions import rand

In [None]:
# Emulate an external model: overwrite the relevance scores with random values, then save to CSV.
recs.withColumn('relevance', rand(seed=123)).toPandas().to_csv("recs.csv", index=False)

#### Read with DataPreparator

In [None]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    reader_kwargs={"header":True},
    format_type="csv"
)

#### Compare with Experiment

In [None]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)

| Model | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|
| ALS | 0.224 | 0.582 | 0.094293 | 0.168814 |
| SLIM_optimized | 0.218 | 0.54 | 0.086473 | 0.153695 |
| MultVAE | 0.142 | 0.464 | 0.06196 | 0.118569 |
| my_model | 0.13 | 0.464 | 0.059013 | 0.114787 |
| SLIM | 0.128 | 0.418 | 0.056027 | 0.106655 |
