# Working with multi-framework models in RePlay

This notebook shows how to work with models implemented on different frameworks and compares the performance of this new feature on the well-known MovieLens dataset. If you have not used RePlay before, start with 01_replay_basics.ipynb, which introduces the base concepts and describes the main classes and functionality.

### Dataset
We will compare RePlay models on __MovieLens 1m__. 

# TERMS:
- Implementation: an actual model that implements the common functions of a recommendation model: fit, predict, fit_predict...  
It is written on one of the frameworks: ``Pandas``, ``Polars`` or ``Spark``.
- Client: an object that holds a link to the 3 implementations: on Polars, on Pandas and on Spark.  
It also provides conversion from one framework to another (for example, from a Spark-fitted model to Pandas).  
It wraps the functions of the implementation: usually, when you call ``BaseRecommenderClient.fit()``, this method calls ``Client._impl.fit()`` and adds some functionality, for example type checking.  
- Conversion: changing the ``_impl`` link from one implementation to another. When you call ``to_pandas()``, an implementation of the Pandas type is created and the Client switches its ``_impl`` link to this new implementation.  
All DataFrameLike properties are also converted to the selected framework type.  
All other properties retain their values.
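
The relationship between these three terms can be illustrated with a toy sketch. The classes below (`ToyImpl`, `ToyPandasImpl`, `ToySparkImpl`, `ToyClient`) are hypothetical stand-ins written for this illustration and are not RePlay classes. They only mimic the idea of a client that delegates to `_impl` and swaps it on conversion, under the assumption that fitted state is copied over to the new implementation.

```python
class ToyImpl:
    """Hypothetical implementation: holds the fitted state for one framework."""
    framework = "base"

    def __init__(self):
        self.item_popularity = None

    def fit(self, interactions):
        # Count how often each item occurs (a stand-in for real training).
        counts = {}
        for _, item in interactions:
            counts[item] = counts.get(item, 0) + 1
        self.item_popularity = counts


class ToyPandasImpl(ToyImpl):
    framework = "pandas"


class ToySparkImpl(ToyImpl):
    framework = "spark"


class ToyClient:
    """Hypothetical client: wraps one implementation and can swap it."""

    def __init__(self):
        self._impl = None

    def fit(self, interactions, framework):
        # Extra functionality added by the client, e.g. type checking.
        if not isinstance(interactions, list):
            raise TypeError("interactions must be a list of (user, item) pairs")
        impl_cls = {"pandas": ToyPandasImpl, "spark": ToySparkImpl}[framework]
        self._impl = impl_cls()
        self._impl.fit(interactions)

    def to_pandas(self):
        # Conversion: create a new implementation and copy the fitted state.
        new_impl = ToyPandasImpl()
        new_impl.item_popularity = dict(self._impl.item_popularity)
        self._impl = new_impl
        return self


client = ToyClient()
client.fit([(1, "a"), (2, "a"), (2, "b")], framework="spark")
framework_before = client._impl.framework
client.to_pandas()
framework_after = client._impl.framework
popularity = client._impl.item_popularity
```

After `to_pandas()` the client points at a different implementation object, but the fitted state (`item_popularity`) is unchanged.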

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import logging
import time

from pyspark.sql import functions as sf
import pandas as pd
import polars as pl

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType
from replay.data.dataset_utils import DatasetLabelEncoder
from replay.models import PopRec
from replay.models.implementations import _PopRecSpark, _PopRecPolars, _PopRecPandas

from replay.utils.session_handler import State, get_spark_session
from replay.utils.spark_utils import convert2spark, get_log_info

The `State` object allows passing an existing Spark session or creating a new one, which will be used by all RePlay modules.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` parameters, use the `get_spark_session` function from the `session_handler` module.

In [5]:
spark = State().session
spark

25/03/31 05:54:38 WARN Utils: Your hostname, ecs-alaleksepetrov-2 resolves to a loopback address: 127.0.1.1; using 10.11.12.199 instead (on interface eth0)
25/03/31 05:54:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/31 05:54:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/03/31 05:54:39 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
25/03/31 05:54:40 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
25/03/31 05:54:40 WARN SQLConf: The SQL config 'spark.sql.execution.arro

In [6]:
spark.sparkContext.setLogLevel('ERROR')

In [7]:
logger = logging.getLogger("replay")

In [8]:
K = 10
BUDGET = 5
SEED = 42

## 0. Preprocessing <a name='data-preparator'></a>

### 0.1 Data loading

In [9]:
import replay
from os.path import dirname, join

# Use a name that does not shadow the built-in `dir`
base_dir = dirname(replay.__file__)
pandas_ratings = pd.read_csv(join(base_dir, "../examples/data/ml1m_ratings.dat"), sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
pandas_users = pd.read_csv(join(base_dir, "../examples/data/ml1m_users.dat"), sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])
pandas_items = pd.read_csv(join(base_dir, "../examples/data/ml1m_items.dat"), sep="\t", names=["item_id", "name", "category"])
print("ratings")
print(pandas_ratings.info())
print("users")
print(pandas_users.info())
print("items")
print(pandas_items.info())
print(pandas_items)

ratings
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   user_id    1000209 non-null  int64
 1   item_id    1000209 non-null  int64
 2   rating     1000209 non-null  int64
 3   timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB
None
users
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     6040 non-null   int64 
 1   gender      6040 non-null   object
 2   age         6040 non-null   int64 
 3   occupation  6040 non-null   int64 
 4   zip_code    6040 non-null   object
dtypes: int64(3), object(2)
memory usage: 236.1+ KB
None
items
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtyp

### 0.2. Dataset preparation

In [10]:
feature_schema = FeatureSchema(
    [
        FeatureInfo(
            column="user_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.QUERY_ID,
        ),
        FeatureInfo(
            column="item_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.ITEM_ID,
        ),
        FeatureInfo(
            column="rating",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.RATING,
        ),
        FeatureInfo(
            column="timestamp",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.TIMESTAMP,
        ),
    ]
)

polars_ratings = pl.from_pandas(pandas_ratings)






In [12]:

spark_ratings = convert2spark(pandas_ratings)
all_datasets = {
        "pandas": Dataset(feature_schema, pandas_ratings),
        "polars": Dataset(feature_schema, polars_ratings),
        "spark": Dataset(feature_schema, spark_ratings)
    }


train_datasets = {
        "pandas": Dataset(feature_schema, pandas_ratings[pandas_ratings["user_id"]%10!=0]),
        "polars": Dataset(feature_schema, polars_ratings.filter(pl.col("user_id")%10!=0)),
        "spark": Dataset(feature_schema, spark_ratings.filter(sf.col("user_id")%10!=0))
    }

predict_datasets = {
        "pandas": Dataset(feature_schema, pandas_ratings[pandas_ratings["user_id"]%10==0]),
        "polars": Dataset(feature_schema, polars_ratings.filter(pl.col("user_id")%10==0)),
        "spark": Dataset(feature_schema, spark_ratings.filter(sf.col("user_id")%10==0))
    }
encoder = DatasetLabelEncoder()
encoder.fit(all_datasets["spark"])

for framework in ["spark", "pandas", "polars"]:
    train_datasets[framework] = encoder.transform(train_datasets[framework])
    predict_datasets[framework] = encoder.transform(predict_datasets[framework])

                                                                                

# 1. Models training

In [28]:
def fit_predict_add_res(model, framework="spark", suffix='', count_runs=100):
    """
    Run fit and predict for the `model` `count_runs` times and measure the time of each step.
    Returns two lists with the fit and predict times in seconds.
    """
    train_dataset, predict_dataset = train_datasets[framework], predict_datasets[framework]
    fit_times = []
    predict_times = []
    for _ in range(count_runs):
        start_time = time.time()

        model.fit(train_dataset)
        fit_time = time.time() - start_time

        pred = model.predict(dataset=predict_dataset, k=K)
        # Spark is lazy: force the computation so that the prediction itself is timed
        if hasattr(model, "_impl") and model._get_implementation_type() == "spark":
            pred.count()
        predict_time = time.time() - start_time - fit_time

        fit_times.append(fit_time)
        predict_times.append(predict_time)
    return fit_times, predict_times

In [29]:
import numpy as np
def full_pipeline(models, framework="spark", suffix='', budget=BUDGET):
    """
    For each model:
        - pass the model to `fit_predict_add_res`
        - print the mean and standard deviation of the fit and predict times
    Returns a list of [name, (fit_mean, fit_std, predict_mean, predict_std)].
    """
    results = []
    for name, [model, params] in models.items():
        fit_times, predict_times = fit_predict_add_res(model, framework, suffix)
        fit_mean, fit_std, predict_mean, predict_std = (
            np.array(fit_times).mean(), 
            np.array(fit_times).std(), 
            np.array(predict_times).mean(), 
            np.array(predict_times).std()
        )
        print(f"\n{name}:")
        print(f"\n{fit_mean=}\n{fit_std=}\n{predict_mean=}\n{predict_std=}")
        results.append([name, (fit_mean, fit_std, predict_mean, predict_std)])
    return results
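
Note that `np.array(...).std()` uses `ddof=0` by default, so the reported value is the population standard deviation of the measured runs. The sketch below reproduces the same summary with the standard library only. `summarize_times` is a hypothetical helper and the sample values are made up for illustration.

```python
import statistics


def summarize_times(times):
    """Return (mean, population std) of a list of timings in seconds."""
    mean = statistics.fmean(times)
    # pstdev divides by N, which matches numpy's default ddof=0
    std = statistics.pstdev(times)
    return mean, std


# Made-up timings, only to show the calculation
sample_times = [0.5, 1.0, 1.5, 1.0]
mean, std = summarize_times(sample_times)
# For comparison: the sample standard deviation divides by N - 1 and is larger
sample_std = statistics.stdev(sample_times)
print(f"{mean=}, {std=}, {sample_std=}")
```

With 100 runs the difference between the two estimates is small, but it matters when `count_runs` is reduced to a handful of runs.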

In [32]:
models_to_compare = {
    "Implementation-only": [_PopRecPolars(), None],
    "Client+Implementation": [PopRec(), None],
}
# TODO: think about how to restart the SparkSession between runs to measure time on Spark
results = full_pipeline(models_to_compare, framework="polars")

Implementation-only:

fit_mean=0.024392220973968506
fit_std=0.0018462059990847015
predict_mean=0.14525318622589112
predict_std=0.0075125902241277055
Client+Implementation:

fit_mean=0.024563350677490235
fit_std=0.0018489209806340904
predict_mean=0.14441093921661377
predict_std=0.007492736718608101
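
To put these numbers in perspective, the relative difference between the two variants can be compared with the run-to-run spread. The snippet below is a small sketch that only reuses the means and standard deviations printed above; the variable names are chosen for this illustration.

```python
# Mean timings in seconds, copied from the output above (Polars, 100 runs each)
impl_only = {"fit": 0.024392220973968506, "predict": 0.14525318622589112}
client = {"fit": 0.024563350677490235, "predict": 0.14441093921661377}
# Standard deviations of the implementation-only runs, from the same output
impl_std = {"fit": 0.0018462059990847015, "predict": 0.0075125902241277055}

relative_diff = {}
within_one_std = {}
for stage in ("fit", "predict"):
    diff = client[stage] - impl_only[stage]
    relative_diff[stage] = diff / impl_only[stage]
    within_one_std[stage] = abs(diff) < impl_std[stage]
    print(f"{stage}: {relative_diff[stage]:+.2%}, within one std: {within_one_std[stage]}")
```

In this run both differences are below 1% and smaller than one standard deviation, so the overhead of the client wrapper is not distinguishable from measurement noise here.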


# 2. Two ways to use the client


#### default - fit client and th

In [None]:
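# The same client works with a dataset on any framework
PopRec().fit_predict(all_datasets["spark"], 1)
PopRec().fit_predict(all_datasets["pandas"], 1)
PopRec().fit_predict(all_datasets["polars"], 1)

# Fit on pandas and inspect the fitted model
model = PopRec()
model.fit(train_datasets["pandas"])
print(f"{model.study}")
model.items_count
model.item_popularity

# Convert the fitted model to Spark and predict on Spark data
model.to_spark()
res = model.predict(predict_datasets["spark"], 1)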

#### advanced

In [None]:
model = PopRec()
model._impl = _PopRecPandas()
model.fit_items = pd.DataFrame([1, 2, 3])
model.fit_queries = pd.DataFrame([1, 2])
# Compare DataFrames with .equals(): `==` is element-wise and cannot be used in an assert
assert model._impl.fit_queries.equals(model.fit_queries)

model._impl.can_predict_cold_items = False
assert model.can_predict_cold_items == model._impl.can_predict_cold_items

# If you need to change an attribute that has no setter, set it on the implementation instead:
try:
    model.can_predict_cold_items = True  # raises an error: no setter
except AttributeError:
    pass
model._impl.can_predict_cold_items = True
assert model.can_predict_cold_items
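
The "no setter" behaviour above is what a read-only Python property gives you. The sketch below shows the mechanism with hypothetical classes (`DemoImpl`, `DemoClient`) that are not part of RePlay; it assumes the client exposes the attribute as a property that reads from `_impl`.

```python
class DemoImpl:
    """Hypothetical implementation that owns the actual attribute value."""

    def __init__(self):
        self.can_predict_cold_items = False


class DemoClient:
    """Hypothetical client exposing a read-only view of the attribute."""

    def __init__(self):
        self._impl = DemoImpl()

    @property
    def can_predict_cold_items(self):
        # No setter is defined: the value always comes from the implementation
        return self._impl.can_predict_cold_items


demo = DemoClient()
try:
    demo.can_predict_cold_items = True
    setter_failed = False
except AttributeError:
    # Assigning to a property without a setter raises AttributeError
    setter_failed = True

# Changing the value on the implementation is visible through the client
demo._impl.can_predict_cold_items = True
value_after = demo.can_predict_cold_items
```

Keeping a single source of truth in the implementation means the client and `_impl` cannot disagree about the value.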