# RePlay Tutorial
This notebook introduces the RePlay library and covers:
- creating SparkSession or passing your own session to RePlay
- data preprocessing
- dataset users and items re-indexing
- data splitting
- model training and inference
- model optimization
- model saving and loading
- models comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings

from optuna.exceptions import ExperimentalWarning

warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import pandas as pd
from pyspark.sql import SparkSession

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType
from replay.data.dataset_utils import DatasetLabelEncoder
from replay.metrics import MAP, NDCG, Coverage, Experiment, HitRate, OfflineMetrics
from replay.models import SLIM, ALSWrap, ItemKNN
from replay.splitters import TwoStageSplitter
from replay.utils.model_handler import load, load_encoder, save, save_encoder
from replay.utils.session_handler import State, get_spark_session
from replay.utils.spark_utils import convert2spark, get_log_info

In [5]:
K = 5
SEED = 42

## Managing SparkSession

RePlay uses Spark as its backend, so a `SparkSession` must exist before RePlay runs. Depending on your needs, you can handle the `SparkSession` in one of three ways:

- Option 1: use the default RePlay `SparkSession`
- You can pass your own session to RePlay. The `State` class stores the current session. Here you have two options:
    - Option 2: call `get_spark_session` to use the default RePlay `SparkSession` with custom driver memory and number of partitions
    - Option 3: create a `SparkSession` from scratch


### Option 1: Use the default RePlay SparkSession
This is the simplest approach and the recommended one for local execution. RePlay gets the existing SparkSession or creates a new one with the default configuration. The default session parameters are defined in `replay/utils/session_handler.py`. The driver memory and the number of partitions depend on the available RAM and the number of cores, respectively.

You can trigger creation of the default session explicitly, e.g. to preprocess Spark DataFrames, get a link to the Spark UI, or set the logging level. If you do not create it yourself, RePlay creates the session anyway.

In [None]:
spark = State().session
spark.sparkContext.setLogLevel("ERROR")
spark

In [7]:
def print_config_param(session, conf_name):
    # collect the current Spark configuration as (key, value) pairs
    conf = session.sparkContext.getConf().getAll()
    # print the value of the requested configuration parameter
    print(f"{conf_name}: {dict(conf)[conf_name]}")

In [8]:
print_config_param(spark, "spark.sql.shuffle.partitions")

spark.sql.shuffle.partitions: 48


### Option 2: Call `get_spark_session` to customize driver memory and number of partitions
`get_spark_session` lets you set the driver memory (`spark.driver.memory`) and the number of partitions (`spark.sql.shuffle.partitions`) while keeping the default RePlay settings for all other configuration parameters. As an example, we specify 16 partitions and 4 GB of driver memory, then pass the created session to RePlay's `State`.

In [9]:
spark.stop()
session = get_spark_session(spark_memory=4, shuffle_partitions=16)
spark = State(session).session

In [10]:
print_config_param(spark, "spark.sql.shuffle.partitions")

spark.sql.shuffle.partitions: 16


### Option 3: Create your own session
Pass the created session to RePlay's `State`.

In [11]:
spark.stop()
session = (
    SparkSession.builder.config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "50")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "localhost")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()
)
spark = State(session).session
print_config_param(spark, "spark.sql.shuffle.partitions")

spark.sql.shuffle.partitions: 50


#### Return to the default session config

In [12]:
spark.stop()
spark = State(get_spark_session()).session
spark.sparkContext.setLogLevel("ERROR")
spark

## 0. Data preprocessing <a name='data-preparator'></a>
We will use the MovieLens 1M dataset as an example.

In [13]:
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

In [14]:
df.head(2)

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109


In [15]:
users.head(2)

Unnamed: 0,user_id,gender,age,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072


### 0.1. Preprocessing

RePlay's internal data format is a Spark DataFrame.

Interactions require columns with user and item identifiers. The ``rating`` and interaction ``timestamp`` columns are optional.

In [16]:
df_spark = convert2spark(df)
df_spark.show(2)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|   1193|     5|978300760|
|      1|    661|     3|978302109|
+-------+-------+------+---------+
only showing top 2 rows



In [17]:
users_spark = convert2spark(users)
users_spark.show(2)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



### 0.2. Filtering
It is common to filter interactions by date or rating value, or to remove items or users with a small number of interactions. RePlay offers several filters in the `replay.preprocessing.filters` module.
We will keep ratings greater than or equal to 3 and remove users with 4 or fewer interactions.
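To make the two filtering rules concrete, here is a minimal pure-Python sketch of the same logic on toy data. It is not the RePlay implementation: the helper names `keep_ratings_at_least` and `keep_users_with_min_count` and the toy values are hypothetical, and a smaller `min_count` is used so the effect is visible on six rows.

```python
from collections import Counter

# Toy interactions as (user_id, item_id, rating) tuples; the values are made up.
toy_rows = [
    (1, 10, 5), (1, 11, 3), (1, 12, 2),
    (2, 10, 4), (2, 13, 1),
    (3, 11, 3),
]


def keep_ratings_at_least(rows, threshold):
    """Keep interactions whose rating is greater than or equal to threshold."""
    return [row for row in rows if row[2] >= threshold]


def keep_users_with_min_count(rows, min_count):
    """Keep interactions of users that have at least min_count interactions."""
    counts = Counter(user for user, _, _ in rows)
    return [row for row in rows if counts[row[0]] >= min_count]


# Apply the rating filter first, then count interactions on what is left.
toy_rated = keep_ratings_at_least(toy_rows, threshold=3)
toy_filtered = keep_users_with_min_count(toy_rated, min_count=2)
print(toy_filtered)  # [(1, 10, 5), (1, 11, 3)]
```

The order matters: users 2 and 3 are dropped because only one of their interactions survives the rating filter.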

In [18]:
from replay.preprocessing.filters import filter_by_min_count, filter_out_low_ratings

In [19]:
filtered_df = filter_out_low_ratings(df_spark, value=3, rating_column="rating")
get_log_info(filtered_df, user_col="user_id", item_col="item_id")

'total lines: 836478, total users: 6039, total items: 3628'

In [20]:
filtered_df = filter_by_min_count(filtered_df, num_entries=5, group_by="user_id")
get_log_info(filtered_df, user_col="user_id", item_col="item_id")

10-Nov-23 16:46:36, replay, INFO: current threshold removes 1.1954887038272376e-06% of data


'total lines: 836477, total users: 6038, total items: 3628'

In [21]:
filtered_df.show(5)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|     28|   1179|     4|978126422|
|     28|   2550|     4|978125885|
|     28|    648|     3|978982323|
|     28|   3793|     4|978982233|
|     28|    650|     3|978126224|
+-------+-------+------+---------+
only showing top 5 rows



### 0.3. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems. Splitters are intended to return cached dataframes so that they are computed once and reused for model training, inference, and metric calculation; you can verify this with `is_cached`, as done below.

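`TwoStageSplitter` selects ``first_divide_size`` users and moves ``second_divide_size`` items of each selected user to the test dataset. With ``first_divide_size=500`` and ``second_divide_size=K`` (5), the test dataset holds 500 × 5 = 2500 interactions.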
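The two-stage idea can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions, not RePlay's code: the function `two_stage_split` is hypothetical, it works on `(user, item)` tuples, and it does not drop cold users or items.

```python
import random


def two_stage_split(rows, n_users, n_items, seed):
    """Move n_items interactions of n_users randomly chosen users to test."""
    rng = random.Random(seed)
    by_user = {}
    for user, item in rows:
        by_user.setdefault(user, []).append(item)

    # Stage 1: choose the users that contribute to the test set.
    test_users = set(rng.sample(sorted(by_user), n_users))

    train, test = [], []
    for user in sorted(by_user):
        items = by_user[user][:]
        if user in test_users:
            # Stage 2: pick n_items random interactions of this user for test.
            rng.shuffle(items)
            test += [(user, item) for item in items[:n_items]]
            items = items[n_items:]
        train += [(user, item) for item in items]
    return train, test


# 10 users with 8 interactions each.
toy_rows = [(u, i) for u in range(10) for i in range(8)]
toy_train, toy_test = two_stage_split(toy_rows, n_users=3, n_items=2, seed=42)
print(len(toy_train), len(toy_test))  # 74 6
```

The test size is always `n_users * n_items` when every selected user has enough interactions, which mirrors the 500 × 5 arithmetic of the real split.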

In [22]:
splitter = TwoStageSplitter(
    query_column="user_id",
    item_column="item_id",
    first_divide_column="user_id",
    second_divide_column="item_id",
    drop_cold_items=True,
    drop_cold_users=True,
    second_divide_size=K,
    first_divide_size=500,
    seed=SEED,
    shuffle=True,
)
train, test = splitter.split(filtered_df)
print(train.count(), test.count())

                                                                                

833977 2500


In [23]:
test.is_cached

False

### 0.4. Dataset

RePlay provides a universal container that holds interactions and user/item features and feeds data to models. A Dataset instance requires dataframes of the same type and a description of the features, written with the FeatureSchema class. In the next section, it helps to encode the whole dataset at once.


In [24]:
total_user_count = filtered_df.select("user_id").distinct().count()
total_item_count = filtered_df.select("item_id").distinct().count()
print(total_user_count, total_item_count)

6038 3628


In [25]:
feature_schema = FeatureSchema(
    [
        FeatureInfo(
            column="user_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.QUERY_ID,
            cardinality=total_user_count,
        ),
        FeatureInfo(
            column="item_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.ITEM_ID,
            cardinality=total_item_count,
        ),
        FeatureInfo(
            column="rating",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.RATING,
        ),
        FeatureInfo(
            column="timestamp",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.TIMESTAMP,
        ),
    ]
)

In [26]:
train_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=train,
)

test_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=test,
)

### 0.5. DatasetLabelEncoder

RePlay models require columns with user and item identifiers _(ids)_, which should be integers starting at zero without gaps. This matters for models that use sparse matrices and define the matrix size from the largest seen user and item index. Storing _ids_ as integers also helps reduce memory usage compared to string _ids_.

You should convert user and item _ids_ in the interactions and features dataframes. RePlay offers the DatasetLabelEncoder class to convert the _ids_ and to convert them back after recommendations are generated (predict). The DatasetLabelEncoder stores label encoders for users and items (and optionally for other columns), so it can also transform user and item ids that arrive after the encoder is fit.
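As a rough illustration of what label encoding does, the sketch below maps arbitrary ids to contiguous integers starting at zero and back again. The class `ToyLabelEncoder` is hypothetical and far simpler than `DatasetLabelEncoder`; for instance, it raises a `KeyError` on ids it has not seen during `fit`.

```python
class ToyLabelEncoder:
    """Map arbitrary ids to contiguous integers 0..n-1 and back."""

    def fit(self, ids):
        # Assign the next free integer to each id in order of first appearance.
        self.mapping = {}
        for raw in ids:
            self.mapping.setdefault(raw, len(self.mapping))
        self.inverse = {code: raw for raw, code in self.mapping.items()}
        return self

    def transform(self, ids):
        return [self.mapping[raw] for raw in ids]

    def inverse_transform(self, codes):
        return [self.inverse[code] for code in codes]


toy_user_ids = ["u42", "u7", "u42", "u99"]
toy_encoder = ToyLabelEncoder().fit(toy_user_ids)
toy_encoded = toy_encoder.transform(toy_user_ids)
print(toy_encoded)  # [0, 1, 0, 2]
print(toy_encoder.inverse_transform(toy_encoded) == toy_user_ids)  # True
```

Because the codes are dense, the largest code plus one equals the number of distinct ids, which is the property sparse-matrix models rely on.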

In [27]:
encoder = DatasetLabelEncoder()
train_dataset = encoder.fit_transform(train_dataset)
test_dataset = encoder.transform(test_dataset)

                                                                                

## 1. Models training

#### SLIM

In [28]:
slim = SLIM(seed=SEED)

In [29]:
%%time
slim.fit(train_dataset)



CPU times: user 847 ms, sys: 59.2 ms, total: 906 ms
Wall time: 20.6 s


                                                                                

In [30]:
%%time
recs = slim.predict(
    k=K,
    queries=test_dataset.query_ids,
    dataset=train_dataset,
    filter_seen_items=False
)



CPU times: user 49.1 ms, sys: 6.24 ms, total: 55.3 ms
Wall time: 8.59 s


                                                                                

In [31]:
recs.show(2)

+-------+-------+------------------+
|user_id|item_id|            rating|
+-------+-------+------------------+
|    271|   1342|1.4208161339117393|
|    271|   3000|  1.37722459857679|
+-------+-------+------------------+
only showing top 2 rows



## 2. Models evaluation

RePlay implements several popular quality metrics for recommender systems. They can be calculated directly as follows:

In [32]:
print(
    NDCG(K, query_column="user_id", item_column="item_id", rating_column="rating")(recs, test_dataset.interactions),
    MAP(K, query_column="user_id", item_column="item_id", rating_column="rating")(recs, test_dataset.interactions),
    HitRate([1, K], query_column="user_id", item_column="item_id", rating_column="rating")(
        recs, test_dataset.interactions
    ),
    Coverage(K, query_column="user_id", item_column="item_id", rating_column="rating")(recs, test_dataset.interactions),
)

                                                                                

{'NDCG@5': 0.006616433861608617} {'MAP@5': 0.0035333333333333336} {'HitRate@1': 0.01, 'HitRate@5': 0.024} {'Coverage@5': 0.3651635720601238}
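For intuition about what these numbers measure, here is a small sketch of HitRate@K, NDCG@K, and Coverage@K using their common textbook definitions with binary relevance. It is an assumption that these match RePlay's formulas in every detail, so treat the functions below (all hypothetical names) as an illustration rather than a reference implementation.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if set(recommended[:k]) & set(relevant) else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(recommended[:k])
        if item in relevant
    )
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(recs_by_user, catalog, k):
    """Share of catalog items that appear in at least one top-k list."""
    shown = {item for items in recs_by_user.values() for item in items[:k]}
    return len(shown & set(catalog)) / len(catalog)


toy_recs = {"a": [1, 2, 3], "b": [4, 5, 6]}  # ranked recommendations per user
toy_truth = {"a": [2], "b": [7]}             # held-out test items per user
toy_catalog = range(1, 13)                   # 12 items seen in training
TOP_K = 3

# Per-user metrics are averaged over users; coverage is computed globally.
mean_hit = sum(hit_rate_at_k(toy_recs[u], toy_truth[u], TOP_K) for u in toy_recs) / len(toy_recs)
mean_ndcg = sum(ndcg_at_k(toy_recs[u], toy_truth[u], TOP_K) for u in toy_recs) / len(toy_recs)
toy_coverage = coverage_at_k(toy_recs, toy_catalog, TOP_K)
print(mean_hit, round(mean_ndcg, 4), toy_coverage)  # 0.5 0.3155 0.5
```

User "a" has one hit at rank 2, user "b" has none, so HitRate@3 is 0.5 and NDCG@3 is half of 1/log2(3).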


If you need to calculate multiple metrics for the same input data, using the `OfflineMetrics` class is much more efficient than calculating the metrics individually. It is meant to produce the same values as the individual metric classes.

In [33]:
offline_metrics = OfflineMetrics(
    [NDCG(K), MAP(K), HitRate([1, K]), Coverage(K)],
    query_column="user_id",
    item_column="item_id",
    rating_column="rating",
)
offline_metrics(recs, test_dataset.interactions, train_dataset.interactions)

                                                                                

{'NDCG@5': 0.06347640619606824,
 'MAP@5': 0.034,
 'HitRate@1': 0.086,
 'HitRate@5': 0.242,
 'Coverage@5': 0.11383682469680265}

In [34]:
metrics = Experiment(
    [
        NDCG(K),
        MAP(K),
        HitRate([1, K]),
        Coverage(K)
    ],
    test_dataset.interactions,
    train_dataset.interactions,
    query_column="user_id",
    item_column="item_id",
    rating_column="rating",
)

In [35]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

CPU times: user 52.1 ms, sys: 19.2 ms, total: 71.3 ms
Wall time: 3.63 s


Unnamed: 0,NDCG@5,MAP@5,HitRate@1,HitRate@5,Coverage@5
SLIM,0.063476,0.034,0.086,0.242,0.113837


## 3. Hyperparameters optimization

#### 3.1 Search

In [None]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

train_opt_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=train_opt,
)
val_opt_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=val_opt,
)
train_opt_dataset = encoder.transform(train_opt_dataset)
val_opt_dataset = encoder.transform(val_opt_dataset)

In [None]:
%%time
best_params = slim.optimize(train_opt_dataset, val_opt_dataset, criterion=NDCG, k=K, budget=5)

In [None]:
best_params

#### 3.2 Compare with previous

In [36]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(train_dataset)

    recs = model.predict(dataset=train_dataset, k=K, queries=test_dataset.query_ids, filter_seen_items=False)

    experiment.add_result(name, recs)
    return recs

In [None]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, "SLIM_optimized")
recs.cache()  # cache for further processing
metrics.results.sort_values("NDCG@5", ascending=False)

The optimized model was better on the validation dataset but shows comparable quality on the test dataset (better on HitRate@5 and worse on the other quality metrics).

## 4. Getting final recommendations 

### Return to original user and item identifiers

In [37]:
%%time
recs = encoder.query_and_item_id_encoder.inverse_transform(recs)
recs.show(2)

+------------------+-------+-------+
|            rating|user_id|item_id|
+------------------+-------+-------+
|1.4208161339117393|    949|   1374|
|1.3478753222783952|    949|   1580|
+------------------+-------+-------+
only showing top 2 rows

CPU times: user 114 ms, sys: 993 µs, total: 115 ms
Wall time: 488 ms


### Convert to pandas or save

In [38]:
recs_pd = recs.toPandas()
recs_pd.head(2)

Unnamed: 0,rating,user_id,item_id
0,0.928045,3745,2762
1,1.388134,5550,1304


In [39]:
%%time
recs.write.parquet(path="./slim_recs.parquet", mode="overwrite")

CPU times: user 5.08 ms, sys: 634 µs, total: 5.72 ms
Wall time: 1 s


## 5. Save and load

RePlay can save and load fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder with all the necessary parameters and data.

In [40]:

%%time
save_encoder(encoder, "./encoder")
encoder_loaded = load_encoder("./encoder")

CPU times: user 2.02 ms, sys: 714 µs, total: 2.73 ms
Wall time: 2.46 ms


In [41]:
%%time
save(slim, path="./slim_best_params")
slim_loaded = load("./slim_best_params")

                                                                                

CPU times: user 42.5 ms, sys: 15.8 ms, total: 58.3 ms
Wall time: 4.53 s


In [42]:
slim_loaded.beta, slim_loaded.lambda_

(0.01, 0.01)

In [43]:
%%time
pred_from_loaded = slim_loaded.predict(
    k=K,
    queries=test_dataset.query_ids,
    dataset=train_dataset,
    filter_seen_items=True)
pred_from_loaded.show(2)

                                                                                

+-------+-------+------------------+
|user_id|item_id|            rating|
+-------+-------+------------------+
|    271|    592|1.0208525984297374|
|    271|   1032|0.9421524513268297|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 41.9 ms, sys: 32.1 ms, total: 73.9 ms
Wall time: 7.41 s


In [44]:
%%time
recs = encoder_loaded.query_and_item_id_encoder.inverse_transform(pred_from_loaded)
recs.show(2)

+------------------+-------+-------+
|            rating|user_id|item_id|
+------------------+-------+-------+
|1.0208525984297374|    949|   3638|
|0.9421524513268297|    949|   1608|
+------------------+-------+-------+
only showing top 2 rows

CPU times: user 106 ms, sys: 7.46 ms, total: 114 ms
Wall time: 669 ms


## 6. Other RePlay models

#### ALS
Commonly-used matrix factorization algorithm.

In [45]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, "ALS")
metrics.results.sort_values("NDCG@5", ascending=False)

                                                                                

CPU times: user 144 ms, sys: 61.1 ms, total: 206 ms
Wall time: 41.9 s


Unnamed: 0,NDCG@5,MAP@5,HitRate@1,HitRate@5,Coverage@5
ALS,0.068995,0.034867,0.072,0.286,0.16097
SLIM,0.063476,0.034,0.086,0.242,0.113837


#### ItemKNN
Commonly-used item-based recommender.

In [46]:
%%time
recs = fit_predict_evaluate(ItemKNN(num_neighbours=100), metrics, "ItemKNN")
metrics.results.sort_values("NDCG@5", ascending=False)

                                                                                

CPU times: user 165 ms, sys: 63.5 ms, total: 228 ms
Wall time: 37.2 s


Unnamed: 0,NDCG@5,MAP@5,HitRate@1,HitRate@5,Coverage@5
ALS,0.068995,0.034867,0.072,0.286,0.16097
SLIM,0.063476,0.034,0.086,0.242,0.113837
ItemKNN,0.058481,0.029853,0.07,0.238,0.037762


## 7. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them and pass them to ``Experiment``.

In [47]:
import pyspark.sql.functions as sf

In [48]:
metrics.add_result("my_model", recs.withColumn("rating", sf.rand()))
metrics.results.sort_values("NDCG@5", ascending=False)

Unnamed: 0,NDCG@5,MAP@5,HitRate@1,HitRate@5,Coverage@5
ALS,0.068995,0.034867,0.072,0.286,0.16097
SLIM,0.063476,0.034,0.086,0.242,0.113837
ItemKNN,0.058481,0.029853,0.07,0.238,0.037762
my_model,0.057533,0.029093,0.068,0.238,0.037762
