# Using the available models
This notebook walks through the basic workflow with the library:
- loading a dataset
- splitting it into train and test sets
- training a model
- comparing against a baseline

The documentation can be built from the `docs` folder with `make html`. This creates a `_build` folder containing the documentation; open `sponge-bob-magic/docs/_build/html/index.html`.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
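import os
import sys

# Add the parent of the current working directory to sys.path
# so that the `sponge_bob_magic` package can be imported from this notebook.
parent_dir = os.path.split(os.getcwd())[0]
if parent_dir not in sys.path:
    sys.path.append(parent_dir)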

The library's target architecture is ЛД 3.0, so we use Spark.
For debugging models off the cluster, we create a local session.

The `State` object lets different objects share the same Spark session.
If a session has already been initialized, pass it when initializing `State`. Library modules that need a Spark session, for example to convert a pandas DataFrame to Spark, look for it in `State`.

If no session is passed, a default one is created. A simple way to get a session with a given amount of allocated memory is the `get_spark_session` function in the `session_handler` module.
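
The shared-holder idea can be illustrated with a minimal sketch. The `SharedState` class and the `make_default_session` function are hypothetical names, not the library's `State` implementation, and a plain dict stands in for a Spark session.

```python
from typing import Any, Optional


def make_default_session() -> dict:
    """Hypothetical stand-in for building a default local session."""
    return {"master": "local[*]", "memory": "default"}


class SharedState:
    """Keeps one session that every instance of the class shares."""

    _session: Optional[Any] = None  # class-level storage shared by all instances

    def __init__(self, session: Optional[Any] = None) -> None:
        if session is not None:
            # An explicitly passed session replaces whatever was stored before.
            SharedState._session = session
        elif SharedState._session is None:
            # Nothing stored yet: fall back to a default session.
            SharedState._session = make_default_session()

    @property
    def session(self) -> Any:
        return SharedState._session


first = SharedState(session={"master": "local[2]", "memory": "4g"})
second = SharedState()  # no argument: reuses the session stored by `first`
```

Under these assumptions, any code that constructs `SharedState()` later gets the session that was registered first, which is the behaviour the text describes for `State`.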

In [3]:
from sponge_bob_magic.session_handler import State

spark = State().session
spark

## Data preparation <a name='data-preparator'></a>
The library includes utilities for downloading and parsing popular recommendation datasets.
If a dataset is not found in the given directory, it is downloaded and processed automatically.

The dataset's tables are available as `pandas.DataFrame` attributes of the object.
The `info` method shows which data is available.

In [4]:
from sponge_bob_magic.datasets import MovieLens

data = MovieLens("1m")
data.info()

ratings


Unnamed: 0,user_id,item_id,relevance,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968



users


Unnamed: 0,user_id,gender,age,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117



items


Unnamed: 0,item_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance





The library's internal data format is a Spark DataFrame. Splitters can convert pandas DataFrames automatically, but models currently expect Spark specifically, with the required columns `user_id, item_id, timestamp, relevance, context`.

`DataPreparator.transform` produces data in this format and creates any required columns that are missing.
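
The column mapping can be pictured with a small plain-Python sketch. It is not the library's code: the `to_log_rows` helper and the fill values in `ASSUMED_DEFAULTS` are hypothetical, and lists of dicts stand in for DataFrames.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ["user_id", "item_id", "timestamp", "relevance", "context"]

# Hypothetical fill values for required columns the source data lacks.
ASSUMED_DEFAULTS = {"timestamp": None, "relevance": 1.0, "context": "no_context"}


def to_log_rows(rows: List[dict], columns_names: Dict[str, str]) -> List[dict]:
    """Rename source columns to the required names and fill in missing ones."""
    for key in ("user_id", "item_id"):
        if key not in columns_names:
            raise ValueError(f"a source column must be given for {key!r}")
    result = []
    for row in rows:
        converted = {}
        for column in REQUIRED_COLUMNS:
            if column in columns_names:
                # Copy the value from the source column mapped to this name.
                converted[column] = row[columns_names[column]]
            else:
                # The column was not supplied: create it with an assumed default.
                converted[column] = ASSUMED_DEFAULTS[column]
        result.append(converted)
    return result


raw = [{"uid": 1, "movie": 1193, "rating": 5}]
log_rows = to_log_rows(
    raw, {"user_id": "uid", "item_id": "movie", "relevance": "rating"}
)
```

Here `columns_names` maps each required name to the column that holds it in the source, mirroring the argument passed in the cell that follows.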

In [5]:
from sponge_bob_magic.data_preparator import DataPreparator

log = DataPreparator().transform(
    data=data.ratings,
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance",
        "timestamp": "timestamp"
    }
)

The library implements various validation schemes for recommender systems found in the literature.

`UserSplitter` holds out a fixed number or a fraction of each user's items for the test set.
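
A per-user holdout of this kind can be sketched in plain Python. The `holdout_split` function is a hypothetical illustration rather than `UserSplitter` itself: it borrows the parameter names `item_test_size`, `user_test_size` and `seed` from the cell below, assumes they mean "items held out per user" and "number of users sampled", and ignores cold-user and cold-item filtering.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Interaction = Tuple[int, int]  # (user_id, item_id)


def holdout_split(
    log: List[Interaction], item_test_size: int, user_test_size: int, seed: int
) -> Tuple[List[Interaction], List[Interaction]]:
    """Move `item_test_size` random items of `user_test_size` random users to test."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in log:
        by_user[user].append(item)

    # Only users who keep at least one item in train may donate test items.
    eligible = sorted(u for u, items in by_user.items() if len(items) > item_test_size)
    test_users = set(rng.sample(eligible, min(user_test_size, len(eligible))))

    train, test = [], []
    for user in sorted(by_user):
        items = list(by_user[user])
        if user in test_users:
            rng.shuffle(items)
            test.extend((user, item) for item in items[:item_test_size])
            train.extend((user, item) for item in items[item_test_size:])
        else:
            train.extend((user, item) for item in items)
    return train, test


# Four users with three items each; hold out one item for two of the users.
log = [(user, item) for user in range(1, 5) for item in (10, 20, 30)]
train, test = holdout_split(log, item_test_size=1, user_test_size=2, seed=1234)
```

Every interaction ends up in exactly one of the two parts, so the sizes of `train` and `test` always add up to the size of the original log.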

In [6]:
from sponge_bob_magic.splitters import UserSplitter

splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=1,
    user_test_size=500,
    seed=1234,
    shuffle=True
)
train, test = splitter.split(log)
(
    train.count(), 
    test.count()
)

(999709, 500)

## NMF
The simplest example of using deep learning in recommendations.

In [7]:
from sponge_bob_magic.models import NeuroMFRec

nmf = NeuroMFRec(
    learning_rate=0.01,
    epochs=10,
    embedding_dimension=100
)



In [8]:
%%time

nmf.fit(log=train)

25-Mar-20 20:36:24, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=10.
25-Mar-20 20:36:46, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:21
25-Mar-20 20:36:46, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[1] Loss: 0.8105


25-Mar-20 20:36:53, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:06
25-Mar-20 20:36:53, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:36:53, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 1 Avg loss: 0.7336


25-Mar-20 20:36:55, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:36:55, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:02


Validation set Results - Epoch: 1 Avg loss: 0.7392


25-Mar-20 20:37:18, ignite.engine.engine.Engine, INFO: Epoch[2] Complete. Time taken: 00:00:23
25-Mar-20 20:37:18, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[2] Loss: 0.3452


25-Mar-20 20:37:25, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:37:25, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:37:25, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 2 Avg loss: 0.3239


25-Mar-20 20:37:27, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:37:27, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01


Validation set Results - Epoch: 2 Avg loss: 0.3247


25-Mar-20 20:37:49, ignite.engine.engine.Engine, INFO: Epoch[3] Complete. Time taken: 00:00:22
25-Mar-20 20:37:49, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[3] Loss: 0.2487


25-Mar-20 20:37:55, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:37:55, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:05
25-Mar-20 20:37:55, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 3 Avg loss: 0.2432


25-Mar-20 20:37:57, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:37:57, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01


Validation set Results - Epoch: 3 Avg loss: 0.2442


25-Mar-20 20:38:18, ignite.engine.engine.Engine, INFO: Epoch[4] Complete. Time taken: 00:00:21
25-Mar-20 20:38:18, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[4] Loss: 0.2286


25-Mar-20 20:38:24, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:38:24, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:38:24, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 4 Avg loss: 0.2235


25-Mar-20 20:38:26, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:38:26, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01


Validation set Results - Epoch: 4 Avg loss: 0.2253


25-Mar-20 20:38:50, ignite.engine.engine.Engine, INFO: Epoch[5] Complete. Time taken: 00:00:23
25-Mar-20 20:38:50, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[5] Loss: 0.2158


25-Mar-20 20:38:56, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:38:56, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:38:56, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 5 Avg loss: 0.2151


25-Mar-20 20:38:58, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:38:58, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:02


Validation set Results - Epoch: 5 Avg loss: 0.2172


25-Mar-20 20:39:22, ignite.engine.engine.Engine, INFO: Epoch[6] Complete. Time taken: 00:00:23
25-Mar-20 20:39:22, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[6] Loss: 0.2118


25-Mar-20 20:39:28, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:39:28, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:39:28, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 6 Avg loss: 0.2102


25-Mar-20 20:39:31, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:02
25-Mar-20 20:39:31, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:02


Validation set Results - Epoch: 6 Avg loss: 0.2120


25-Mar-20 20:39:54, ignite.engine.engine.Engine, INFO: Epoch[7] Complete. Time taken: 00:00:23
25-Mar-20 20:39:54, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[7] Loss: 0.2076


25-Mar-20 20:40:00, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:40:00, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:05
25-Mar-20 20:40:00, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 7 Avg loss: 0.2068


25-Mar-20 20:40:02, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:40:02, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01


Validation set Results - Epoch: 7 Avg loss: 0.2084


25-Mar-20 20:40:26, ignite.engine.engine.Engine, INFO: Epoch[8] Complete. Time taken: 00:00:24
25-Mar-20 20:40:26, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[8] Loss: 0.2051


25-Mar-20 20:40:33, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:06
25-Mar-20 20:40:33, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:40:33, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 8 Avg loss: 0.2041


25-Mar-20 20:40:35, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:40:35, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:02


Validation set Results - Epoch: 8 Avg loss: 0.2058


25-Mar-20 20:41:00, ignite.engine.engine.Engine, INFO: Epoch[9] Complete. Time taken: 00:00:24
25-Mar-20 20:41:00, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[9] Loss: 0.2032


25-Mar-20 20:41:06, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:41:06, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:06
25-Mar-20 20:41:06, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 9 Avg loss: 0.2019


25-Mar-20 20:41:08, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:41:08, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01


Validation set Results - Epoch: 9 Avg loss: 0.2038


25-Mar-20 20:41:31, ignite.engine.engine.Engine, INFO: Epoch[10] Complete. Time taken: 00:00:22
25-Mar-20 20:41:31, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Epoch[10] Loss: 0.2009


25-Mar-20 20:41:37, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:05
25-Mar-20 20:41:37, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:05
25-Mar-20 20:41:37, ignite.engine.engine.Engine, INFO: Engine run starting with max_epochs=1.


Training set Results - Epoch: 10 Avg loss: 0.1998


25-Mar-20 20:41:39, ignite.engine.engine.Engine, INFO: Epoch[1] Complete. Time taken: 00:00:01
25-Mar-20 20:41:39, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:00:01
25-Mar-20 20:41:39, ignite.engine.engine.Engine, INFO: Engine run complete. Time taken 00:05:14


Validation set Results - Epoch: 10 Avg loss: 0.2018
CPU times: user 3min 31s, sys: 9.64 s, total: 3min 40s
Wall time: 5min 45s


In [9]:
%%time

recs = nmf.predict(
    k=10,
    users=test.select('user_id').distinct(),
    items=test.select('item_id').distinct(),
    log=train,
    filter_seen_items=True
)

CPU times: user 2.66 s, sys: 237 ms, total: 2.89 s
Wall time: 45.2 s


The library implements various quality metrics for recommender systems found in the literature.
They can be used directly, or their results can be collected with the `Experiment` class.
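
For reference, the two metrics used below can be written out in a few lines. This is a sketch of the common textbook definitions with binary relevance; the library's `HitRate` and `NDCG` classes may differ in details, and the function names here are hypothetical.

```python
import math
from typing import Dict, List, Set


def hit_rate_at_k(recs: Dict[int, List[int]], truth: Dict[int, Set[int]], k: int) -> float:
    """Share of test users with at least one relevant item in their top-k."""
    hits = sum(
        1 for user, relevant in truth.items() if set(recs.get(user, [])[:k]) & relevant
    )
    return hits / len(truth)


def ndcg_at_k(recs: Dict[int, List[int]], truth: Dict[int, Set[int]], k: int) -> float:
    """Mean NDCG@k with binary gains and a log2 position discount."""
    total = 0.0
    for user, relevant in truth.items():
        ranked = recs.get(user, [])[:k]
        # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), ...
        dcg = sum(
            1.0 / math.log2(pos + 2) for pos, item in enumerate(ranked) if item in relevant
        )
        ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
        if ideal > 0:
            total += dcg / ideal
    return total / len(truth)


# User 1 has the relevant item at rank 2; user 2 has no relevant item recommended.
example_recs = {1: [10, 20, 30], 2: [40, 50, 60]}
example_truth = {1: {20}, 2: {99}}
hit_rate = hit_rate_at_k(example_recs, example_truth, k=3)
ndcg = ndcg_at_k(example_recs, example_truth, k=3)
```

With a single held-out item per user, each user's NDCG@k is at most 1 when there is a hit and 0 otherwise, so under these definitions the average NDCG@k cannot exceed HitRate@k.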

In [10]:
from sponge_bob_magic.metrics import HitRate, NDCG
from sponge_bob_magic.experiment import Experiment

metrics = Experiment(test, {NDCG(): 10, 
                            HitRate(): 10})

metrics.add_result("NMF", recs)
metrics.pandas_df.loc["NMF"]

HitRate@10    0.038000
NDCG@10       0.018884
Name: NMF, dtype: float64

## SLIM
One of the simple yet effective algorithms.

In [12]:
from sponge_bob_magic.models import SlimRec

slim = SlimRec(lambda_=0.05, beta=2.)

In [13]:
%%time

slim.fit(log=train)

CPU times: user 6.53 s, sys: 384 ms, total: 6.91 s
Wall time: 11.2 s


In [14]:
%%time

recs = slim.predict(
    k=10,
    users=test.select('user_id').distinct(),
    items=test.select('item_id').distinct(),
    log=train,
    filter_seen_items=True
)

CPU times: user 39 ms, sys: 9.07 ms, total: 48.1 ms
Wall time: 9.14 s


In [15]:
metrics.add_result("SLIM", recs)
metrics.pandas_df.loc["SLIM"]

HitRate@10    0.290000
NDCG@10       0.163229
Name: SLIM, dtype: float64

## ALS
The library also includes classic recommendation algorithms such as matrix factorization.

In [16]:
from sponge_bob_magic.models import ALSRec

als = ALSRec(rank=100)

In [17]:
%%time
als.fit(log=train)

CPU times: user 64.8 ms, sys: 11.4 ms, total: 76.2 ms
Wall time: 43.1 s


In [26]:
%%time
recs = als.predict(
    k=10,
    users=test.select('user_id').distinct(),
    items=test.select('item_id').distinct(),
    log=train,
    filter_seen_items=True
)

CPU times: user 25.5 ms, sys: 6.76 ms, total: 32.2 ms
Wall time: 9.51 s


In [31]:
metrics.add_result("ALS", recs)
metrics.pandas_df.loc["ALS"]

HitRate@10    0.140000
NDCG@10       0.057582
Name: ALS, dtype: float64

## Your own model
To build your own model, use the same split as for the baselines.

In [32]:
train.toPandas().to_csv("train.csv", index=False)



In [33]:
!head train.csv

user_id,item_id,relevance,timestamp
1090,1961,4.0,2000-11-23 01:20:08
1090,2428,2.0,2000-11-23 01:27:11
1090,1208,3.0,2000-11-23 01:21:17
1090,296,5.0,2000-11-23 01:06:15
1090,318,5.0,2000-11-23 01:11:51
1090,1704,4.0,2000-11-23 01:12:48
1090,2433,3.0,2000-11-23 01:26:01
1090,1569,4.0,2000-11-23 01:02:22
1090,110,3.0,2000-11-23 01:12:48


Here is the magic: you can take train and fit your favorite model on it outside the library.

You also need to produce recommendations with the trained model.
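
As a stand-in for an external model, a most-popular recommender could be sketched as follows. The `popular_recs` function and the sample rows are hypothetical; the sketch only assumes the `train.csv` column layout shown above and the three recommendation columns `user_id`, `item_id`, `relevance`.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def popular_recs(train_rows: Iterable[Dict[str, str]], k: int) -> List[Dict[str, object]]:
    """Recommend to each user the k most popular items they have not seen."""
    popularity: Counter = Counter()
    seen = defaultdict(set)
    for row in train_rows:
        popularity[row["item_id"]] += 1
        seen[row["user_id"]].add(row["item_id"])

    ranked = [item for item, _ in popularity.most_common()]
    recs = []
    for user in sorted(seen):
        top_unseen = [item for item in ranked if item not in seen[user]][:k]
        for item in top_unseen:
            # The interaction count serves as the relevance score.
            recs.append({"user_id": user, "item_id": item, "relevance": popularity[item]})
    return recs


# A tiny in-memory sample in the same layout as train.csv (values are made up).
sample_train = io.StringIO(
    "user_id,item_id,relevance,timestamp\n"
    "1,10,5.0,2000-11-23 01:20:08\n"
    "1,20,3.0,2000-11-23 01:21:17\n"
    "2,10,4.0,2000-11-23 01:06:15\n"
    "2,30,4.0,2000-11-23 01:11:51\n"
    "3,10,5.0,2000-11-23 01:12:48\n"
    "3,20,2.0,2000-11-23 01:26:01\n"
)
recs_rows = popular_recs(csv.DictReader(sample_train), k=1)

# Write the recommendations with the three columns that are read back later.
out = io.StringIO()  # swap for open("recs.csv", "w", newline="") to write a file
writer = csv.DictWriter(out, fieldnames=["user_id", "item_id", "relevance"])
writer.writeheader()
writer.writerows(recs_rows)
```

Any model works here as long as its output ends up in a file with these three columns.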

In [34]:
# Placeholder: the ALS recommendations in `recs` stand in for your own model's output
recs.toPandas().to_csv("recs.csv", index=False)

Now read the recommendations back in a format the library supports.

In [35]:
recs = DataPreparator().transform(
    path="recs.csv",
    columns_names={
        "user_id": "user_id",
        "item_id": "item_id",
        "relevance": "relevance"
    },
    header=True,
    format_type="csv"
)

and compare the quality of your model with the baselines.

In [36]:
metrics.add_result("my_model", recs)
metrics.pandas_df

Model,HitRate@10,NDCG@10
NMF,0.038,0.018884
SLIM,0.29,0.163229
ALS,0.14,0.057582
my_model,0.14,0.057582
