# RePlay Tutorial
This notebook introduces the RePlay library and covers:
- creating a SparkSession or passing your own session to RePlay
- data preprocessing
- re-indexing users and items in a dataset
- data splitting
- model training and inference
- model optimization
- model saving and loading
- model comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import pandas as pd

from pyspark.sql import SparkSession

from replay.metrics import Coverage, HitRate, NDCG, MAP, Experiment
from replay.utils.model_handler import save, load, save_encoder, load_encoder
from replay.utils.session_handler import get_spark_session, State 
from replay.splitters import TwoStageSplitter
from replay.utils import convert2spark, get_log_info

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType
from replay.data.dataset_utils import DatasetLabelEncoder

from replay.models import ALSWrap, ItemKNN, SLIM


In [5]:
K = 5
SEED = 42

## Managing SparkSession

RePlay uses Spark as a backend, so a `SparkSession` must exist before RePlay runs. Depending on your needs, you can handle the `SparkSession` in one of three ways:

- Option 1: use the default RePlay `SparkSession`.
- Pass your own session to RePlay. The `State` class stores the current session. Here you have two options:
    - Option 2: call `get_spark_session` to create the default RePlay `SparkSession` with custom driver memory and number of partitions.
    - Option 3: create a `SparkSession` from scratch.


### Option 1: use the default RePlay SparkSession
This is the simplest way and the recommended one for local execution. RePlay gets the existing SparkSession or creates a new one with the default configuration. The default session parameters are defined in `replay/utils/session_handler.py`. The driver memory and the number of partitions depend on the available RAM and the number of cores, respectively.

You can trigger creation of the default session explicitly, e.g. to preprocess Spark DataFrames, get a link to the Spark UI, or set the logging level. If you do not create it yourself, RePlay creates the session anyway.

In [6]:
spark = State().session
spark.sparkContext.setLogLevel('ERROR')
spark

23/11/08 15:09:24 WARN Utils: Your hostname, ecs-eemalov-large resolves to a loopback address: 127.0.1.1; using 10.11.10.44 instead (on interface eth0)
23/11/08 15:09:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/08 15:09:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/08 15:09:25 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/11/08 15:09:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [7]:
def print_config_param(session, conf_name):
    """Print the value of one Spark configuration parameter."""
    # getAll() returns (name, value) pairs of the current session configuration
    conf = dict(session.sparkContext.getConf().getAll())
    print(f'{conf_name}: {conf[conf_name]}')

In [8]:
print_config_param(spark, 'spark.sql.shuffle.partitions')

spark.sql.shuffle.partitions: 48


### Option 2: call `get_spark_session`
Call the `get_spark_session` function to customize the driver memory (`spark.driver.memory`) or the number of partitions (`spark.sql.shuffle.partitions`) while keeping the default RePlay settings for the other configuration parameters. As an example, we specify 16 partitions and 4 GB of driver memory, then pass the created session to RePlay's `State`.

In [9]:
spark.stop()
session = get_spark_session(spark_memory=4, shuffle_partitions=16)
spark = State(session).session

In [10]:
print_config_param(spark, 'spark.sql.shuffle.partitions')

spark.sql.shuffle.partitions: 16


### Option 3: create your own session
Pass the created session to RePlay's `State`.

In [11]:
spark.stop()
session = (
    SparkSession.builder.config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "50")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "localhost")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()
)
spark = State(session).session
print_config_param(spark, 'spark.sql.shuffle.partitions')

spark.sql.shuffle.partitions: 50


#### Return to the default session config

In [12]:
spark.stop()
spark = State(get_spark_session()).session
spark.sparkContext.setLogLevel('ERROR')
spark

## 0. Data preprocessing <a name='data-preparator'></a>
We will use the MovieLens 1M dataset as an example.

In [13]:
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

In [14]:
df.head(2)

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |


In [15]:
users.head(2)

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |


### 0.1. Preprocessing

RePlay's internal data format is a Spark DataFrame.

Interactions must contain columns with user and item identifiers. The ``rating`` and interaction ``timestamp`` columns are optional.

In [17]:
df_spark = convert2spark(df)
df_spark.show(2)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|   1193|     5|978300760|
|      1|    661|     3|978302109|
+-------+-------+------+---------+
only showing top 2 rows



In [18]:
users_spark = convert2spark(users)
users_spark.show(2)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



### 0.2. Filtering
It is common to filter interactions by date or rating value, or to remove items or users with few interactions. RePlay offers several filters in the `replay.preprocessing.filters` module.
We will keep ratings greater than or equal to 3 and remove users with fewer than 5 interactions.

In [16]:
from replay.preprocessing.filters import filter_by_min_count, filter_out_low_ratings

In [19]:
filtered_df = filter_out_low_ratings(df_spark, value=3, rating_column="rating")
get_log_info(filtered_df, user_col='user_id', item_col='item_id')

'total lines: 836478, total users: 6039, total items: 3628'

In [20]:
filtered_df = filter_by_min_count(filtered_df, num_entries=5, group_by='user_id')
get_log_info(filtered_df, user_col='user_id', item_col='item_id')

08-Nov-23 15:09:55, replay, INFO: current threshold removes 1.1954887038272376e-06% of data


'total lines: 836477, total users: 6038, total items: 3628'

In [21]:
filtered_df.show(10)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|     28|   1179|     4|978126422|
|     28|   2550|     4|978125885|
|     28|    648|     3|978982323|
|     28|   3793|     4|978982233|
|     28|    650|     3|978126224|
|     28|   2997|     4|978125846|
|     28|   3000|     3|978126172|
|     28|      1|     3|978985309|
|     28|   2132|     5|978985335|
|     28|   1265|     4|978126107|
+-------+-------+------+---------+
only showing top 10 rows
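
The two filters used above can be pictured with a minimal plain-Python sketch. It is not RePlay's implementation: the helper names `keep_ratings_at_least` and `keep_users_with_min_count` and the toy data are assumptions made for illustration, and the real filters operate on Spark DataFrames.

```python
from collections import Counter

# Toy interactions as (user_id, item_id, rating); the values are made up.
interactions = [
    ("u1", "i1", 5), ("u1", "i2", 3), ("u1", "i3", 1),
    ("u2", "i1", 4), ("u2", "i2", 2),
    ("u3", "i3", 3),
]


def keep_ratings_at_least(rows, value):
    """Keep rows whose rating is greater than or equal to `value`."""
    return [row for row in rows if row[2] >= value]


def keep_users_with_min_count(rows, num_entries):
    """Keep rows of users that have at least `num_entries` rows."""
    counts = Counter(user for user, _, _ in rows)
    return [row for row in rows if counts[row[0]] >= num_entries]


# Same order as in the notebook: rating filter first, then the count filter.
rated = keep_ratings_at_least(interactions, value=3)
filtered = keep_users_with_min_count(rated, num_entries=2)
print(filtered)
```

The order matters: counting interactions after the rating filter means that low-rated interactions do not help a user pass the minimum-count threshold.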



### 0.3. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems. Splitters return cached dataframes, so they are computed once and reused for model training, inference, and metric calculation.

`TwoStageSplitter` moves ``second_divide_size`` items of each of ``first_divide_size`` users to the test dataset. With the settings below, 500 users contribute 5 interactions each.

In [22]:
splitter = TwoStageSplitter(
    query_column="user_id",
    item_column="item_id",
    first_divide_column="user_id",
    second_divide_column="item_id",
    drop_cold_items=True,
    drop_cold_users=True,
    second_divide_size=K,
    first_divide_size=500,
    seed=SEED,
    shuffle=True
)
train_df, test_df = splitter.split(filtered_df)
print(train_df.count(), test_df.count())

                                                                                

833977 2500
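
The idea behind the two-stage split can be sketched in plain Python. This is an illustration under simplifying assumptions, not `TwoStageSplitter` itself: the function `two_stage_split` and the toy log are hypothetical, and the sketch ignores the cold user and cold item handling that the real splitter is configured with above.

```python
import random
from collections import defaultdict


def two_stage_split(rows, n_users, n_items, seed):
    """Hold out `n_items` random interactions of `n_users` random users."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for index, (user, _item) in enumerate(rows):
        by_user[user].append(index)

    # Stage 1: choose the users that contribute to the test set.
    test_users = rng.sample(sorted(by_user), n_users)

    # Stage 2: for each chosen user, choose the interactions to hold out.
    test_indices = set()
    for user in test_users:
        candidates = by_user[user]
        test_indices.update(rng.sample(candidates, min(n_items, len(candidates))))

    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Toy log: 10 users with 8 interactions each (made-up data).
log = [(f"u{u}", f"i{i}") for u in range(10) for i in range(8)]
train, test = two_stage_split(log, n_users=4, n_items=5, seed=42)
print(len(train), len(test))
```

With 4 users and 5 held-out interactions each, the test part has 20 rows, in the same way that 500 users with 5 interactions each give the 2500 test rows printed above.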


In [None]:
test_df.is_cached

### 0.4. Dataset

RePlay provides a universal container, `Dataset`, that holds interactions and user/item features and feeds data to models. A `Dataset` instance requires dataframes of the same type and a description of the features, written with the `FeatureSchema` class. In the next section, this container makes it possible to encode the whole dataset at once.


In [23]:
total_user_count = filtered_df.select("user_id").distinct().count()
total_item_count = filtered_df.select("item_id").distinct().count()
print(total_user_count, total_item_count)

6038 3628


In [24]:
feature_schema = FeatureSchema(
    [
        FeatureInfo(
            column="user_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.QUERY_ID,
            cardinality=total_user_count,
        ),
        FeatureInfo(
            column="item_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.ITEM_ID,
            cardinality=total_item_count,
        ),
        FeatureInfo(
            column="rating",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.RATING,
        ),
        FeatureInfo(
            column="timestamp",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.TIMESTAMP,
        ),
    ]
)

In [25]:
train_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=train_df,
    query_features=None,
    item_features=None,
    check_consistency=True,
    categorical_encoded=False,
)

test_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=test_df,
    query_features=None,
    item_features=None,
    check_consistency=True,
    categorical_encoded=False,
)

### 0.5. DatasetLabelEncoder

RePlay models require columns with user and item identifiers _(ids)_, which should be integers starting at zero without gaps. This matters for models that use sparse matrices and define the matrix size by the largest user and item index seen. Storing _ids_ as integers also reduces memory usage compared to string _ids_.

You should convert user and item _ids_ in the interaction and feature dataframes. RePlay offers the `DatasetLabelEncoder` class to convert the _ids_ and to convert them back after recommendations are generated (predict). `DatasetLabelEncoder` stores label encoders for users and items (and optionally for other columns) and can transform ids of users and items that arrive after the encoder is fitted.

In [26]:
encoder = DatasetLabelEncoder()
train_dataset = encoder.fit_transform(train_dataset)
test_dataset = encoder.transform(test_dataset)
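
What "integers starting at zero without gaps" means can be shown with a minimal sketch. The `IdEncoder` class below is hypothetical and written only for illustration; it is not the `DatasetLabelEncoder` API, and unlike the description above it simply raises `KeyError` for ids that were not seen during `fit`.

```python
class IdEncoder:
    """Map arbitrary ids to integers 0..n-1 and back (illustration only)."""

    def __init__(self):
        self.mapping = {}   # raw id -> integer code
        self.inverse = []   # integer code -> raw id

    def fit(self, ids):
        # Codes are assigned in order of first appearance, so they have no gaps.
        for raw_id in ids:
            if raw_id not in self.mapping:
                self.mapping[raw_id] = len(self.inverse)
                self.inverse.append(raw_id)
        return self

    def transform(self, ids):
        return [self.mapping[raw_id] for raw_id in ids]

    def inverse_transform(self, codes):
        return [self.inverse[code] for code in codes]


item_ids = [1193, 661, 1193, 914]
encoder_sketch = IdEncoder().fit(item_ids)
encoded = encoder_sketch.transform(item_ids)
decoded = encoder_sketch.inverse_transform(encoded)
print(encoded, decoded)
```

Because the codes are dense, a sparse interaction matrix built from them needs exactly as many rows and columns as there are distinct users and items.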

                                                                                

## 1. Models training

#### SLIM

In [28]:
slim = SLIM(seed=SEED)

In [29]:
%%time
slim.fit(train_dataset)



CPU times: user 815 ms, sys: 69.5 ms, total: 884 ms
Wall time: 20.3 s


                                                                                

In [30]:
%%time
recs = slim.predict(
    k=K,
    queries=test_dataset.query_ids,
    dataset=train_dataset,
    filter_seen_items=False
)



CPU times: user 39.8 ms, sys: 14.9 ms, total: 54.7 ms
Wall time: 8.23 s


                                                                                

In [31]:
recs.show(2)

+-------+-------+------------------+
|user_id|item_id|            rating|
+-------+-------+------------------+
|    271|   1342|1.4208161339117393|
|    271|   3000|  1.37722459857679|
+-------+-------+------------------+
only showing top 2 rows



## 2. Models evaluation

RePlay implements several popular quality metrics for recommenders. Use the metrics individually, or calculate a set of chosen metrics and compare models with the ``Experiment`` class.

In [32]:
col_names = {
    "item_column": train_dataset.feature_schema.item_id_column,
    "query_column": train_dataset.feature_schema.query_id_column,
    "rating_column": train_dataset.feature_schema.interactions_rating_column,
}

In [33]:
metrics = Experiment(
    test_dataset, 
    {
        NDCG(**col_names): K,
        MAP(**col_names) : K,
        HitRate(**col_names): [1, K],
        Coverage(
            interactions=train_dataset.interactions,
            **col_names,
        ): K
    }
)

In [34]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

                                                                                

CPU times: user 76.8 ms, sys: 15.3 ms, total: 92.1 ms
Wall time: 8.12 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.113837 | 0.086 | 0.242 | 0.034 | 0.063476 |
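
To make the metric names concrete, here is a small plain-Python sketch of HitRate@K, NDCG@K, and Coverage@K. It uses common textbook definitions with binary relevance, which is an assumption: RePlay's implementations may differ in details such as normalization. All function names and the toy data are hypothetical.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(recommendations_by_user, catalog, k):
    """Share of catalog items that appear in at least one top-k list."""
    shown = {item for recs in recommendations_by_user.values() for item in recs[:k]}
    return len(shown & catalog) / len(catalog)


recommended = ["a", "b", "c", "d", "e"]   # ranked list for one user
relevant = {"b", "e", "z"}                # that user's test items
hit = hit_rate_at_k(recommended, relevant, k=5)
ndcg = ndcg_at_k(recommended, relevant, k=5)
coverage = coverage_at_k(
    {"u1": recommended, "u2": ["a", "b", "f"]}, set("abcdefghij"), k=5
)
print(hit, round(ndcg, 4), coverage)
```

HitRate and NDCG are computed per user and then averaged over users, while Coverage is computed once over all recommendation lists.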


## 3. Hyperparameters optimization

#### 3.1 Search

In [35]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train_df)

train_opt_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=train_opt,
    query_features=None,
    item_features=None,
    check_consistency=True,
    categorical_encoded=False,
)
val_opt_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=val_opt,
    query_features=None,
    item_features=None,
    check_consistency=True,
    categorical_encoded=False,
)
train_opt_dataset = encoder.transform(train_opt_dataset)
val_opt_dataset = encoder.transform(val_opt_dataset)

                                                                                

In [36]:
%%time
best_params = slim.optimize(train_opt_dataset, val_opt_dataset, criterion=NDCG, k=K, budget=15)

[I 2023-11-08 15:15:10,431] A new study created in memory with name: no-name-e75ebc7a-4db1-47d1-9c3a-9222e5cf7bd7
  res[param] = suggest_fn(param, low=low, high=high)
[I 2023-11-08 15:16:03,718] Trial 0 finished with value: 0.0019448038281215333 and parameters: {'beta': 0.01, 'lambda_': 0.01}. Best is trial 0 with value: 0.0019448038281215333.
[I 2023-11-08 15:16:59,880] Trial 1 finished with value: 0.0019448038281215333 and parameters: {'beta': 0.14885702854659433, 'lambda_': 2.0270331887746544e-05}. Best is trial 0 with value: 0.0019448038281215333.
[I 2023-11-08 15:17:50,299] Trial 2 finished with value: 0.0019448038281215333 and parameters: {'beta': 1.6900127682870672e-06, 'lambda_': 0.03491238597526555}. Best is trial 0 with value: 0.0019448038281215333.
[I 2023-11-08 15:18:40,244] Trial 3 finished with value: 0.0022077398552003743 and parameters: {'beta': 0.00012878358743355014, 'lambda_': 0.06352770193460419}. Best is trial 3 with value: 0.0022077398552003743.
[I 2023-11-08 15:1

CPU times: user 14.4 s, sys: 1.4 s, total: 15.8 s
Wall time: 16min 19s


In [37]:
best_params

{'beta': 0.08287653801805005, 'lambda_': 0.2871501039218261}
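
The role of `budget` (the number of trials) and of the returned best parameters can be illustrated with a plain random search. This is a deliberately simpler stand-in for the Optuna-based search shown in the log above, not what `optimize` does internally. The objective `toy_objective` and the search ranges in `bounds` are made up for the sketch.

```python
import math
import random


def toy_objective(beta, lambda_):
    """Stand-in for 'fit on train, score NDCG on validation' (made-up function)."""
    return -((math.log10(beta) + 1) ** 2) - (math.log10(lambda_) + 0.5) ** 2


def random_search(objective, bounds, budget, seed):
    """Try `budget` log-uniform samples; return the best parameters and value."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {
            name: 10 ** rng.uniform(math.log10(low), math.log10(high))
            for name, (low, high) in bounds.items()
        }
        value = objective(**params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Hypothetical search ranges, chosen only for this illustration.
bounds = {"beta": (1e-6, 5.0), "lambda_": (1e-6, 2.0)}
best, best_score = random_search(toy_objective, bounds, budget=15, seed=42)
print(best, best_score)
```

Sampling on a logarithmic scale is a common choice for regularization strengths, because plausible values span several orders of magnitude.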

#### 3.2 Compare with previous

In [38]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(train_dataset)

    recs = model.predict(
        dataset=train_dataset,
        k=K,
        queries=test_dataset.query_ids,
        filter_seen_items=False
    )

    experiment.add_result(name, recs)
    return recs

In [39]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, 'SLIM_optimized')
recs.cache()  # cache for further processing
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 927 ms, sys: 110 ms, total: 1.04 s
Wall time: 31.8 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.113837 | 0.086 | 0.242 | 0.034 | 0.063476 |
| SLIM_optimized | 0.06312 | 0.078 | 0.244 | 0.032753 | 0.06192 |


The optimized model scored better on the validation dataset but shows comparable quality on the test set: it is better by HitRate@5 and worse by the other quality metrics.
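
The size of these differences can be checked with a few lines of arithmetic. The sketch below copies the metric values from the results table for SLIM and SLIM_optimized and computes the relative change of each metric; the variable names are chosen for this illustration only.

```python
# Metric values copied from the results table for the two SLIM runs.
baseline = {
    "Coverage@5": 0.113837, "HitRate@1": 0.086, "HitRate@5": 0.242,
    "MAP@5": 0.034, "NDCG@5": 0.063476,
}
optimized = {
    "Coverage@5": 0.06312, "HitRate@1": 0.078, "HitRate@5": 0.244,
    "MAP@5": 0.032753, "NDCG@5": 0.06192,
}

# Relative change of the optimized model with respect to the baseline.
relative_change = {
    name: (optimized[name] - baseline[name]) / baseline[name]
    for name in baseline
}
for name, change in relative_change.items():
    print(f"{name}: {change:+.1%}")
```

HitRate@5 rises by less than 1%, MAP@5 and NDCG@5 drop by a few percent, and Coverage@5 falls by roughly 45%, so the largest effect of the optimized parameters is on the diversity of recommended items.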

## 4. Getting final recommendations 

### Return to original user and item identifiers

In [40]:
%%time
recs = encoder.query_and_item_id_encoder.inverse_transform(recs)
recs.show(2)

+------------------+-------+-------+
|            rating|user_id|item_id|
+------------------+-------+-------+
|1.1819098819880203|    949|    260|
|1.0638413372139444|    949|   1196|
+------------------+-------+-------+
only showing top 2 rows

CPU times: user 113 ms, sys: 624 µs, total: 114 ms
Wall time: 540 ms


### Convert to pandas or save

In [41]:
recs_pd = recs.toPandas()
recs_pd.head(2)

| | rating | user_id | item_id |
|---|---|---|---|
| 0 | 1.18191 | 949 | 260 |
| 1 | 1.150479 | 949 | 2571 |


In [42]:
%%time
recs.write.parquet(path='./slim_recs.parquet', mode='overwrite')

CPU times: user 6.78 ms, sys: 0 ns, total: 6.78 ms
Wall time: 1.24 s


                                                                                

## 4. Save and load

RePlay saves and loads fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder with all necessary parameters and data. The fitted encoder is saved and loaded the same way with `save_encoder` and `load_encoder`.

In [43]:

%%time
save_encoder(encoder, "./encoder")
encoder_loaded = load_encoder("./encoder")

CPU times: user 2.9 ms, sys: 0 ns, total: 2.9 ms
Wall time: 3.23 ms


In [44]:
%%time
save(slim, path='./slim_best_params')
slim_loaded = load('./slim_best_params')

                                                                                

CPU times: user 59 ms, sys: 14.4 ms, total: 73.4 ms
Wall time: 27.3 s


In [45]:
slim_loaded.beta, slim_loaded.lambda_

(0.08287653801805005, 0.2871501039218261)

In [46]:
%%time
pred_from_loaded = slim_loaded.predict(
    k=K,
    queries=test_dataset.query_ids,
    dataset=train_dataset,
    filter_seen_items=True)
pred_from_loaded.show(2)

                                                                                

+-------+-------+------------------+
|user_id|item_id|            rating|
+-------+-------+------------------+
|    271|   1032| 0.671871745783729|
|    271|    730|0.5642839036157286|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 62.7 ms, sys: 8.47 ms, total: 71.2 ms
Wall time: 7.49 s


In [47]:
%%time
recs = encoder_loaded.query_and_item_id_encoder.inverse_transform(pred_from_loaded)
recs.show(2)

+------------------+-------+-------+
|            rating|user_id|item_id|
+------------------+-------+-------+
| 0.671871745783729|    949|   1608|
|0.5642839036157286|    949|   3114|
+------------------+-------+-------+
only showing top 2 rows

CPU times: user 224 ms, sys: 7.89 ms, total: 232 ms
Wall time: 767 ms


## 5. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [48]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, 'ALS')
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 163 ms, sys: 64 ms, total: 227 ms
Wall time: 44.9 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.16097 | 0.072 | 0.286 | 0.034867 | 0.068995 |
| SLIM | 0.113837 | 0.086 | 0.242 | 0.034 | 0.063476 |
| SLIM_optimized | 0.06312 | 0.078 | 0.244 | 0.032753 | 0.06192 |


#### ItemKNN
A commonly used item-based recommender.

In [49]:
%%time
recs = fit_predict_evaluate(ItemKNN(num_neighbours=100), metrics, 'ItemKNN')
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 195 ms, sys: 43 ms, total: 238 ms
Wall time: 36.8 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.16097 | 0.072 | 0.286 | 0.034867 | 0.068995 |
| SLIM | 0.113837 | 0.086 | 0.242 | 0.034 | 0.063476 |
| SLIM_optimized | 0.06312 | 0.078 | 0.244 | 0.032753 | 0.06192 |
| ItemKNN | 0.037762 | 0.07 | 0.238 | 0.029853 | 0.058481 |


## 6. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them and pass them to ``Experiment``.

In [50]:
import pyspark.sql.functions as sf

In [52]:
metrics.add_result("my_model", recs.withColumn("rating", sf.rand()))
metrics.results.sort_values("NDCG@5", ascending=False)

                                                                                

| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| ALS | 0.16097 | 0.072 | 0.286 | 0.034867 | 0.068995 |
| SLIM | 0.113837 | 0.086 | 0.242 | 0.034 | 0.063476 |
| SLIM_optimized | 0.06312 | 0.078 | 0.244 | 0.032753 | 0.06192 |
| ItemKNN | 0.037762 | 0.07 | 0.238 | 0.029853 | 0.058481 |
| my_model | 0.037762 | 0.056 | 0.238 | 0.027313 | 0.05509 |
