# Webinar 6. Two-level recommendation models


### Why two levels?
- Classical classification models (lightgbm) often work better than recommender models (als, lightfm)
- There is a lot of data and a lot of predictions (# items * # users) --> lightgbm cannot cope with that volume
- But recommender models can!

We select the top-N (200) *candidates* with a simple model (als) --> then re-rank them with a complex model (lightgbm)
and pick the top-k (10).

---

### How do we select candidates?

There are many options. *MainRecommender* will help us here. So far it implements far from all possible ways of generating candidates.

- Generate the top-k candidates
- Measure candidate quality with **recall@k**
- recall@k shows what share of the purchased items our model managed to surface (recommend)
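
To make the metric concrete, here is a minimal sketch of recall@k. The notebook imports its own `recall_at_k` from `metrics_2`, whose source is not shown, so the function name `recall_at_k_sketch` and its exact behavior (for example, returning 0 for a user with no purchases) are assumptions made for illustration.

```python
def recall_at_k_sketch(recommended, bought, k=5):
    """Share of the bought items that appear among the first k recommendations."""
    bought = set(bought)
    if not bought:
        # Assumption: a user with no purchases contributes zero recall
        return 0.0
    top_k = set(list(recommended)[:k])
    hits = len(top_k & bought)
    return hits / len(bought)


# Three items were bought; only item 20 is among the first three recommendations
example = recall_at_k_sketch([10, 20, 30, 40], [20, 40, 50], k=3)
# Widening the cut-off to four recommendations also catches item 40
example_wider = recall_at_k_sketch([10, 20, 30, 40], [20, 40, 50], k=4)
```

The denominator is the number of purchased items, not k, which is why recall can only grow as k increases.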

In [1]:
!pip install implicit --no-use-pep517

Collecting implicit
  Downloading implicit-0.4.8.tar.gz (1.1 MB)
[?25l[K     |▎                               | 10 kB 15.4 MB/s eta 0:00:01[K     |▋                               | 20 kB 15.1 MB/s eta 0:00:01[K     |▉                               | 30 kB 10.7 MB/s eta 0:00:01[K     |█▏                              | 40 kB 9.3 MB/s eta 0:00:01[K     |█▍                              | 51 kB 5.8 MB/s eta 0:00:01[K     |█▊                              | 61 kB 6.0 MB/s eta 0:00:01[K     |██                              | 71 kB 5.8 MB/s eta 0:00:01[K     |██▎                             | 81 kB 6.4 MB/s eta 0:00:01[K     |██▋                             | 92 kB 6.4 MB/s eta 0:00:01[K     |██▉                             | 102 kB 5.5 MB/s eta 0:00:01[K     |███▏                            | 112 kB 5.5 MB/s eta 0:00:01[K     |███▍                            | 122 kB 5.5 MB/s eta 0:00:01[K     |███▊                            | 133 kB 5.5 MB/s eta 0:00:01[K     |██

In [2]:
# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
# Add the Google Drive directory with our modules to the import path
import sys
sys.path.insert(0,"/content/drive/My Drive/")

----

# Practical part

# Import libs

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# For working with matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als

# Second-level model
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# Functions we wrote ourselves
from metrics_2 import precision_at_k, recall_at_k
from utils import prefilter_items
from recommenders_2 import MainRecommender

## Read data

In [5]:
data = pd.read_csv('/content/drive/MyDrive/retail_train.csv')
item_features = pd.read_csv('/content/drive/MyDrive/product.csv')
user_features = pd.read_csv('/content/drive/MyDrive/hh_demographic.csv')

# Process features dataset

In [6]:
ITEM_COL = 'item_id'
USER_COL = 'user_id'

In [7]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': ITEM_COL}, inplace=True)
user_features.rename(columns={'household_key': USER_COL }, inplace=True)

# Split dataset for train, eval, test

In [8]:
# The training and validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 
# tune the size of the 2nd dataset (6 weeks) --> learning curve (how recall@k depends on dataset size)


VAL_MATCHER_WEEKS = 6
VAL_RANKER_WEEKS = 3

In [9]:
# data for training the matching model
data_train_matcher = data[data['week_no'] < data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)]

# data for validating the matching model
data_val_matcher = data[(data['week_no'] >= data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)) &
                      (data['week_no'] < data['week_no'].max() - (VAL_RANKER_WEEKS))]


# data for training the ranking model
data_train_ranker = data_val_matcher.copy()  # For clarity. We will modify it later, and the two will diverge

# data for testing the ranking and matching models
data_val_ranker = data[data['week_no'] >= data['week_no'].max() - VAL_RANKER_WEEKS]
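
The same time-based split can be wrapped in a helper, which makes it easier to vary the window sizes when building a learning curve. This is only a sketch: the function name `split_by_weeks` and the toy frame are illustrative, and the boundaries simply mirror the comparisons used in the cell above.

```python
import pandas as pd


def split_by_weeks(df, matcher_weeks=6, ranker_weeks=3, week_col="week_no"):
    """Split interactions into matcher-train, matcher-validation and ranker-validation parts."""
    last_week = df[week_col].max()
    ranker_start = last_week - ranker_weeks
    matcher_start = last_week - (matcher_weeks + ranker_weeks)

    train_matcher = df[df[week_col] < matcher_start]
    val_matcher = df[(df[week_col] >= matcher_start) & (df[week_col] < ranker_start)]
    val_ranker = df[df[week_col] >= ranker_start]
    return train_matcher, val_matcher, val_ranker


# Toy data: one row per week, weeks 1..20
toy = pd.DataFrame({"week_no": range(1, 21)})
parts = split_by_weeks(toy)
sizes = [len(p) for p in parts]
```

Because the last slice uses `>= max - ranker_weeks`, it covers `ranker_weeks + 1` distinct week numbers (weeks 17 to 20 in the toy data), while the middle slice covers exactly `matcher_weeks` of them.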

In [10]:
def print_stats_data(df_data, name_df):
    print(name_df)
    print(f"Shape: {df_data.shape} Users: {df_data[USER_COL].nunique()} Items: {df_data[ITEM_COL].nunique()}")

In [11]:
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (2108779, 12) Users: 2498 Items: 83685
val_matcher
Shape: (169711, 12) Users: 2154 Items: 27649
train_ranker
Shape: (169711, 12) Users: 2154 Items: 27649
val_ranker
Shape: (118314, 12) Users: 2042 Items: 24329


In [None]:
# above we can see how the number of users and items varies across the splits

In [12]:
data_train_matcher.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


# Prefilter items

In [13]:
n_items_before = data_train_matcher['item_id'].nunique()

data_train_matcher = prefilter_items(data_train_matcher, take_n_popular=5000)

n_items_after = data_train_matcher['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 5001


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


# Make cold-start to warm-start

In [14]:
# find the users common to all splits
common_users = list(set(data_train_matcher.user_id.values)&(set(data_val_matcher.user_id.values))&set(data_val_ranker.user_id.values))

data_train_matcher = data_train_matcher[data_train_matcher.user_id.isin(common_users)]
data_val_matcher = data_val_matcher[data_val_matcher.user_id.isin(common_users)]
data_train_ranker = data_train_ranker[data_train_ranker.user_id.isin(common_users)]
data_val_ranker = data_val_ranker[data_val_ranker.user_id.isin(common_users)]

print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (1912681, 12) Users: 1915 Items: 4998
val_matcher
Shape: (163261, 12) Users: 1915 Items: 27118
train_ranker
Shape: (163261, 12) Users: 1915 Items: 27118
val_ranker
Shape: (115989, 12) Users: 1915 Items: 24042


In [None]:
# Now we have a warm start for users

# Init/train recommender

In [15]:
recommender = MainRecommender(data_train_matcher)




### Ways to obtain candidates

All of these options can later be combined into one

(!) If the model recommends < N items, the recommendations are padded with top-popular items up to N
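
The padding rule can be sketched as follows. `MainRecommender`'s own implementation is not shown in this notebook, so `pad_with_popular` is a hypothetical helper, and dropping repeated items before padding is an assumption about the desired behavior.

```python
def pad_with_popular(recs, popular_items, n):
    """Return exactly n unique items when possible: model output first, then popular items."""
    # Drop repeats while preserving the model's ordering
    result = list(dict.fromkeys(recs))[:n]
    seen = set(result)
    for item in popular_items:
        if len(result) >= n:
            break
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result


# The model returned only two items; item 7 is already present, so it is not added twice
padded = pad_with_popular([5, 7], [7, 1, 2, 3], n=4)
```

If the popular list is too short to reach `n`, the helper returns fewer than `n` items rather than repeating anything.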

In [None]:
# Take test user 2375

In [16]:
recommender.get_als_recommendations(2375, N=5)

[981760, 923746, 845208, 899624, 1037863]

In [17]:
recommender.get_own_recommendations(2375, N=5)

[1036501, 1085983, 1079023, 907099, 910439]

In [18]:
recommender.get_similar_items_recommendation(2375, N=5)

[891542, 889731, 1055646, 1046545, 9527160]

In [19]:
recommender.get_similar_users_recommendation(2375, N=5)

[934399, 10456226, 1138292, 856942, 1075305]

# Eval recall of matching

### Measuring recall@k

This will be in the homework: 

A) Try different ways of generating candidates. Which of them give the highest recall@k?
- For now we try to select 50 candidates (k=50)
- Quality is measured on data_val_matcher: the 6 weeks following the training period

Do own recommendations + top-popular give the best recall?  

B)* How does recall@k depend on k? For one candidate-generation scheme, plot this dependence for k = {20, 50, 100, 200, 500}  
C)* Based on the previous question, which value of k do you think is the most reasonable?


In [20]:
ACTUAL_COL = 'actual'

In [21]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96..."


In [22]:
# number of candidates to generate per user
N_PREDICT = 50 

In [None]:
%%time
# for clarity everything is written inline, without functions; your task is to be able to wrap all of this in functions
result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=50))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=50))

CPU times: user 1min 21s, sys: 48.7 s, total: 2min 10s
Wall time: 1min 14s


In [None]:
result_eval_matcher.head(8)

Unnamed: 0,user_id,actual,own_rec,sim_item_rec,als_rec
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[856942, 9297615, 5577022, 1074612, 9655212, 9...","[931136, 999999, 1082185, 9526410, 1124432, 92...","[1082212, 1082185, 958046, 7467039, 935578, 10..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96...","[13003092, 1119051, 9911484, 5569792, 1048257,...","[999999, 874149, 904360, 845208, 948650, 87390...","[1082185, 878996, 857006, 930118, 965267, 1024..."
2,7,"[836281, 843306, 845294, 914190, 920456, 93886...","[1075524, 845814, 1097544, 1112957, 6944571, 9...","[999999, 1056762, 1015247, 1094955, 1131351, 1...","[1123086, 1029504, 1404121, 1003188, 857390, 8..."
3,8,"[868075, 886787, 945611, 1005186, 1008787, 101...","[1116578, 969932, 981660, 1105433, 5577022, 10...","[999999, 1106523, 873902, 9836195, 1110843, 55...","[1082185, 981760, 995242, 840361, 916122, 9615..."
4,9,"[883616, 1029743, 1039126, 1051323, 1082772, 1...","[1056005, 862799, 1018588, 1090017, 918046, 99...","[1098066, 826249, 1008032, 1098066, 882830, 10...","[1082212, 849843, 998558, 1092149, 893018, 112..."
5,13,"[6544236, 822407, 908317, 1056775, 1066289, 11...","[6544236, 965772, 893802, 1038985, 862070, 104...","[961554, 1098066, 1082185, 981760, 999999, 571...","[1014948, 960732, 970866, 1039589, 1029504, 11..."
6,14,"[917277, 981760, 878234, 925514, 986394, 10220...","[1123106, 1056651, 1008012, 932006, 6533765, 8...","[961554, 896974, 999999, 960318, 1098066, 9106...","[871756, 1131344, 981760, 1033615, 911311, 969..."
7,15,"[996016, 1014509, 1044404, 1087353, 976199, 10...","[7024847, 910439, 920002, 1053530, 835595, 103...","[999999, 1126899, 1098066, 1082185, 981760, 10...","[6979089, 1028816, 895268, 957772, 863632, 933..."


In [None]:
%%time
#result_eval_matcher['sim_user_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_users_recommendation(x, N=50))

CPU times: user 30min 50s, sys: 43.1 s, total: 31min 33s
Wall time: 2min 37s


### Example of wrapping this in a function

In [23]:
# a rough and simple example of how this can be wrapped in a function
def evalRecall(df_result, target_col_name, recommend_model):
    result_col_name = 'result'
    # generate as many candidates as the recall cut-off below expects
    df_result[result_col_name] = df_result[target_col_name].apply(lambda x: recommend_model(x, N=N_PREDICT))
    return df_result.apply(lambda row: recall_at_k(row[result_col_name], row[ACTUAL_COL], k=N_PREDICT), axis=1).mean()

In [None]:
# evalRecall(result_eval_matcher, USER_COL, recommender.get_own_recommendations)

In [24]:
def calc_recall(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: recall_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()

In [25]:
def calc_precision(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: precision_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()

### Recall@50 of matching

In [26]:
TOPK_RECALL = 50

In [None]:
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('own_rec', 0.10356617070924073),
 ('als_rec', 0.07583310645220356),
 ('sim_item_rec', 0.055207909107046)]

### Precision@5 of matching

In [54]:
TOPK_PRECISION = 5

In [None]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.2736292428198411),
 ('als_rec', 0.1620887728459512),
 ('sim_item_rec', 0.11049608355091306)]

# Ranking part

### Training the second-level model on the selected candidates

- Train on data_train_ranker
- Train *only* on the selected candidates
- *As an example*, I will generate the top-50 candidates with get_own_recommendations
- (!) If a user bought < 50 items, get_own_recommendations pads the recommendations with top-popular items

In [None]:
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 

## Preparing the training data

In [27]:
# take the users from the ranking training set
df_match_candidates = pd.DataFrame(data_train_ranker[USER_COL].unique())
df_match_candidates.columns = [USER_COL]

In [28]:
# collect the candidates from the first stage (matcher)
df_match_candidates['candidates'] = df_match_candidates[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

In [29]:
df_match_candidates.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[834103, 878302, 1085604, 1119399, 13511722, 9..."
1,2021,"[1119454, 871279, 1019142, 863762, 835578, 101..."


In [30]:
df_items = df_match_candidates.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
df_items.name = 'item_id'

In [31]:
df_match_candidates = df_match_candidates.drop('candidates', axis=1).join(df_items)
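
The two cells above turn the per-user candidate lists into one row per (user, item) pair. In recent pandas versions the same reshaping can be written with `DataFrame.explode`; the sketch below uses a toy frame rather than the notebook's data, and resetting the index is a choice made here that the original cells do not make.

```python
import pandas as pd

# Toy stand-in for df_match_candidates before flattening
toy_candidates = pd.DataFrame({
    "user_id": [1, 2],
    "candidates": [[10, 20, 30], [40, 50]],
})

flat = (
    toy_candidates
    .explode("candidates")                      # one row per list element
    .rename(columns={"candidates": "item_id"})
    .reset_index(drop=True)
)
# explode leaves an object column, so restore an integer dtype
flat["item_id"] = flat["item_id"].astype("int64")
```

This avoids the row-wise `apply` with `pd.Series`, which tends to be slow on larger candidate sets.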

In [32]:
df_match_candidates.head(8)

Unnamed: 0,user_id,item_id
0,2070,834103
0,2070,878302
0,2070,1085604
0,2070,1119399
0,2070,13511722
0,2070,925258
0,2070,1055863
0,2070,975938


### Check warm start

In [33]:
print_stats_data(df_match_candidates, 'match_candidates')

match_candidates
Shape: (95750, 2) Users: 1915 Items: 4813


### Building the ranking training set from the stage 1 candidates

In [34]:
df_ranker_train = data_train_ranker[[USER_COL, ITEM_COL]].copy()
df_ranker_train['target'] = 1  # purchases only here

df_ranker_train = df_match_candidates.merge(df_ranker_train, on=[USER_COL, ITEM_COL], how='left')

# candidates with no matching purchase get target 0
df_ranker_train['target'] = df_ranker_train['target'].fillna(0)

In [35]:
df_ranker_train.target.value_counts()

0.0    83403
1.0    20065
Name: target, dtype: int64

In [38]:
df_ranker_train.shape

(103468, 3)

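(!) Each user has 50 item_id candidates (1915 users x 50 = 95,750 rows in df_match_candidates). df_ranker_train has 103,468 rows because the left merge repeats a candidate once for every matching purchase row in data_train_ranker.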
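
If one row per (user, candidate) pair is wanted, the purchases can be de-duplicated before the merge. This is a sketch on toy data, not a change to the notebook's pipeline: the frames `candidates` and `purchases` are hypothetical, and the numbers reported in this notebook were produced without this step.

```python
import pandas as pd

# Toy candidates from the matcher and toy purchase rows
candidates = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 20, 10]})
purchases = pd.DataFrame({"user_id": [1, 1, 1, 2], "item_id": [10, 10, 30, 99]})

# Keep each purchased (user, item) pair once, then label it as a positive
positives = purchases[["user_id", "item_id"]].drop_duplicates().assign(target=1)

train = candidates.merge(positives, on=["user_id", "item_id"], how="left")
train["target"] = train["target"].fillna(0).astype(int)
```

User 1 bought item 10 twice, yet the pair appears only once in `train`, so the row count stays equal to the number of candidates.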

In [37]:
df_ranker_train['target'].mean()

0.19392469169211737

![hard_choice.png](attachment:hard_choice.png)

Slide from the [presentation](https://github.com/aprotopopov/retailhero_recommender/blob/master/slides/retailhero_recommender.pdf) of the 2nd-place solution at X5 Retail Hero

- For now, to keep training simple, we choose LightGBM with loss = binary. This is classic binary classification
- This is an example *without* feature generation

## Preparing features for model training

In [40]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [41]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [42]:
df_ranker_train = df_ranker_train.merge(item_features, on='item_id', how='left')
df_ranker_train = df_ranker_train.merge(user_features, on='user_id', how='left')

df_ranker_train.head(9)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
2,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
3,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
4,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
5,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
6,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
7,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
8,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


**user_id features:**
    - Average check
    - Average purchase amount per item in each category
    - Number of purchases in each category
    - Purchase frequency (times per month)
    - Share of purchases made on weekends
    - Share of purchases made in the morning/afternoon/evening

**item_id features**:
    - Number of purchases per week
    - Average number of purchases per item in the category per week
    - (Number of purchases per week) / (Average number of purchases per item in the category per week)
    - Price (can be computed from retail_train.csv)
    - Price / Average item price in the category
    
**user_id - item_id pair features**
    - (Average purchase amount per item in each category (take the category of item_id)) - (Price of item_id)
    - (Number of purchases by the user in the given category per week) - (Average number of purchases by all users in the given category per week)
    - (Number of purchases by the user in the given category per week) / (Average number of purchases by all users in the given category per week)
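
Two of the features listed above, the average check per user and the item price, can be sketched on a toy transaction log. The column names follow retail_train.csv as shown earlier. Treating a basket total as the "check" and computing price as total sales divided by total quantity are assumptions made for this illustration.

```python
import pandas as pd

# Toy transaction log with the same column names as the training data
toy = pd.DataFrame({
    "user_id":     [1, 1, 1, 2],
    "basket_id":   [100, 100, 101, 200],
    "item_id":     [10, 20, 10, 10],
    "quantity":    [1, 2, 1, 2],
    "sales_value": [2.0, 6.0, 2.0, 4.0],
})

# User feature: mean basket total (assumed definition of the average check)
basket_totals = toy.groupby(["user_id", "basket_id"])["sales_value"].sum()
avg_check = basket_totals.groupby(level="user_id").mean().rename("avg_check").reset_index()

# Item feature: average unit price, skipping items with zero total quantity
item_totals = toy.groupby("item_id")[["sales_value", "quantity"]].sum()
item_totals = item_totals[item_totals["quantity"] > 0]
item_price = (item_totals["sales_value"] / item_totals["quantity"]).rename("avg_price").reset_index()

# Attach both features to the unique (user, item) pairs
features = (
    toy[["user_id", "item_id"]].drop_duplicates()
    .merge(avg_check, on="user_id", how="left")
    .merge(item_price, on="item_id", how="left")
)
```

The remaining features in the lists follow the same pattern: aggregate with `groupby`, then left-merge onto the candidate pairs.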

In [43]:
X_train = df_ranker_train.drop('target', axis=1)
y_train = df_ranker_train[['target']]

In [44]:
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')

cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

## Training the ranking model

In [45]:
lgb = LGBMClassifier(objective='binary',
                     max_depth=8,
                     n_estimators=300,
                     learning_rate=0.05,
                     categorical_column=cat_feats)

lgb.fit(X_train, y_train)

train_preds = lgb.predict_proba(X_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [46]:
df_ranker_predict = df_ranker_train.copy()

In [47]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

In [48]:
df_ranker_predict.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,proba_item_purchase
0,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
1,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018


In [49]:
df_ranker_predict.shape

(103468, 17)

In [50]:
df_ranker_predict.loc[df_ranker_predict['user_id']==2070].sort_values('proba_item_purchase', ascending=False)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,proba_item_purchase
0,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
10,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
1,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
18,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
17,2070,834103,1.0,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.706018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46,2070,968932,0.0,1276,GROCERY,National,ISOTONIC DRINKS,ISOTONIC DRINKS SINGLE SERVE,32 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.072163
65,2070,1070803,0.0,2224,GROCERY,National,SOFT DRINKS,SOFT DRINKS 6PK/4PK CAN CARB (,8 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.059592
32,2070,1107824,0.0,544,GROCERY,National,WAREHOUSE SNACKS,SNACK MIX,7.75 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.044618
30,2070,975938,0.0,69,GROCERY,Private,SOFT DRINKS,MIXERS(CLUB SODA/SELTZERS)FLAV,1 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,0.027355


## Summary

    We trained the ranking model on purchases from data_train_ranker and on candidates from own_recommendations, which together form the training set; our task now is to predict and evaluate on the test set.

# Evaluation on test dataset

In [51]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


## Eval matching on test dataset

In [52]:
%%time
result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

CPU times: user 17.3 s, sys: 949 ms, total: 18.2 s
Wall time: 18.6 s


In [55]:
# measure precision of the matching model alone, to understand how ranking affects the metrics

sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.22005221932114677)]

## Eval re-ranked matched result on test dataset
    Recall the df_match_candidates set produced by own_recommendations for our users: we fixed the set of users, so it is the same here, which means the predictions are the same too, and we can reuse this dataframe for re-ranking.
    

In [56]:
def rerank(user_id):
    return df_ranker_predict[df_ranker_predict[USER_COL]==user_id].sort_values('proba_item_purchase', ascending=False).head(5).item_id.tolist()

In [57]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))

In [58]:
result_eval_ranker['reranked_own_rec']

0       [9655212, 9655212, 8293439, 8293439, 8293439]
1       [1098844, 1098844, 1037863, 1037863, 1037863]
2       [1122358, 1072483, 9338009, 1079023, 1017061]
3         [1029915, 969932, 1116578, 972931, 1048483]
4         [9416729, 13007284, 896085, 897120, 907647]
                            ...                      
1910       [1056509, 1056509, 995876, 986760, 827271]
1911        [900802, 965719, 870515, 1066893, 896938]
1912       [1077143, 967762, 1028473, 901776, 917277]
1913      [1070820, 1070820, 947798, 866528, 6533889]
1914     [869195, 1013389, 1065538, 1065538, 1065538]
Name: reranked_own_rec, Length: 1915, dtype: object

In [59]:
print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('own_rec', 0.22005221932114677)
('reranked_own_rec', 0.15561357702349707)


Take the top-k predictions for each user, ranked by probability
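
Several of the re-ranked lists printed above contain the same item_id more than once, which wastes positions in a top-5 list and may contribute to the lower precision@5. Below is a sketch of a top-k selection that drops repeated (user, item) rows first. The helper `top_k_per_user` and the toy scores are hypothetical, and the effect on this notebook's metrics has not been measured here.

```python
import pandas as pd

# Toy scored candidates; user 1 has a duplicated row for item 10
scored = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "item_id": [10, 10, 20, 30, 40, 50],
    "proba_item_purchase": [0.9, 0.9, 0.4, 0.7, 0.2, 0.8],
})


def top_k_per_user(df, k=5):
    """Return, for each user, the k unique items with the highest predicted probability."""
    ranked = (
        df.drop_duplicates(["user_id", "item_id"])
          .sort_values(["user_id", "proba_item_purchase"], ascending=[True, False])
    )
    return ranked.groupby("user_id")["item_id"].apply(lambda s: s.head(k).tolist())


top2 = top_k_per_user(scored, k=2)
```

Doing the selection in one grouped pass also avoids filtering the full dataframe once per user, as the `rerank` function above does.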

# Homework

**Task 1.**

A) Try different ways of generating candidates. Which of them give the highest recall@k?
- For now we try to select 50 candidates (k=50)
- Quality is measured on data_val_matcher: the 6 weeks following the training period

Do own recommendations + top-popular give the best recall?  


In [60]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96..."


In [61]:
%%time
result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=50))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=50))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=50))

CPU times: user 1min 25s, sys: 53.9 s, total: 2min 19s
Wall time: 1min 20s


In [62]:
TOPK_RECALL = 50
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('own_rec', 0.10356617070924073),
 ('als_rec', 0.0756988865413088),
 ('sim_item_rec', 0.05938433867201691)]

**Judging by these results, own recommendations + top-popular give the best recall_at_k**




B)* How does recall@k depend on k? For one candidate-generation scheme, plot this dependence for k = {20, 50, 100, 200, 500}  
C)* Based on the previous question, which value of k do you think is the most reasonable?

In [63]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96..."


In [64]:
result_eval_matcher['als_rec_500'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=500))

In [65]:
k = [20, 50, 100, 200, 500]

for i in k:
    print(f'k={i}: recall_at_k={sorted(calc_recall(result_eval_matcher, i), key=lambda x: x[1],reverse=True)}')

k=20: recall_at_k=[('als_rec_500', 0.040358890236414674)]
k=50: recall_at_k=[('als_rec_500', 0.0756988865413088)]
k=100: recall_at_k=[('als_rec_500', 0.1167094720984737)]
k=200: recall_at_k=[('als_rec_500', 0.1734926119165839)]
k=500: recall_at_k=[('als_rec_500', 0.2732072921826846)]


As k grows, recall_at_k grows, which makes sense: recall is the share of purchased items that appeared in the recommendations, so the more items we recommend, the more of the actually purchased ones end up among them. But we need to pick a small number of items to recommend so that the customer actually looks at all of them. Of the options offered, k=20 therefore looks the most reasonable.

**Task 2.**

Train a second-level model, and in doing so:

- Add at least 2 features each for the user, the item, and the user-item pair

In [81]:
data_train_ranker.shape

(163261, 12)

In [101]:
df_train_ranker = data_train_ranker.copy()

df_train_ranker['target'] = 1  # purchases only here

df_train_ranker = df_match_candidates.merge(df_train_ranker, on=[USER_COL, ITEM_COL], how='left')

# candidates with no matching purchase get target 0
df_train_ranker['target'] = df_train_ranker['target'].fillna(0)

df_train_ranker.head(4)

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target
0,2070,834103,40642690000.0,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0
1,2070,834103,40666680000.0,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0
2,2070,834103,40679930000.0,598.0,1.0,1.0,311.0,-0.29,2029.0,86.0,0.0,0.0,1.0
3,2070,834103,40715280000.0,601.0,1.0,1.0,311.0,-0.29,205.0,87.0,0.0,0.0,1.0


In [103]:
# Compute the number of purchase rows, the total spend, and the average spend per purchase row (stored as av_check)
user_sum_all_amounts = data_train_ranker.groupby('user_id')['sales_value'].sum().reset_index()
user_all_quantity = data_train_ranker.groupby('user_id')['quantity'].count().reset_index()
user_new_features = user_sum_all_amounts.merge(user_all_quantity, on=[USER_COL], how='left')
user_new_features['av_check'] = user_new_features['sales_value'] / user_new_features['quantity']
user_new_features.rename(columns={'sales_value': 'all_sales_sum', 'quantity': 'user_total_quantity'}, inplace=True)
user_new_features

Unnamed: 0,user_id,all_sales_sum,user_total_quantity,av_check
0,1,341.78,133,2.569774
1,6,329.00,102,3.225490
2,7,187.65,90,2.085000
3,8,304.14,123,2.472683
4,9,185.25,57,3.250000
...,...,...,...,...
1910,2496,346.86,142,2.442676
1911,2497,560.65,160,3.504063
1912,2498,184.34,59,3.124407
1913,2499,246.26,78,3.157179


In [104]:
# add the new features to the training set
df_train_ranker = df_train_ranker.merge(user_new_features, on='user_id', how='left')
df_train_ranker

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check
0,2070,834103,4.064269e+10,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931
1,2070,834103,4.066668e+10,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931
2,2070,834103,4.067993e+10,598.0,1.0,1.0,311.0,-0.29,2029.0,86.0,0.0,0.0,1.0,617.29,204,3.025931
3,2070,834103,4.071528e+10,601.0,1.0,1.0,311.0,-0.29,205.0,87.0,0.0,0.0,1.0,617.29,204,3.025931
4,2070,834103,4.071528e+10,601.0,1.0,1.0,311.0,-0.29,407.0,87.0,0.0,0.0,1.0,617.29,204,3.025931
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103463,1745,948832,,,,,,,,,,,0.0,13.97,3,4.656667
103464,1745,1007191,,,,,,,,,,,0.0,13.97,3,4.656667
103465,1745,1054030,,,,,,,,,,,0.0,13.97,3,4.656667
103466,1745,1088147,,,,,,,,,,,0.0,13.97,3,4.656667


In [105]:
# add user_features and item_features to the training set
df_train_ranker = df_train_ranker.merge(item_features, on='item_id', how='left')
df_train_ranker = df_train_ranker.merge(user_features, on='user_id', how='left')
df_train_ranker.head(4)

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,834103,40642690000.0,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,834103,40666680000.0,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
2,2070,834103,40679930000.0,598.0,1.0,1.0,311.0,-0.29,2029.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
3,2070,834103,40715280000.0,601.0,1.0,1.0,311.0,-0.29,205.0,87.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


In [106]:
# Compute the purchase quantity and total spend per user and commodity
sales_of_cat_per_user = df_train_ranker.groupby(['user_id', 'commodity_desc'])[['sales_value', 'quantity']].sum().reset_index()
sales_of_cat_per_user.rename(columns={'sales_value': 'user_sales_in_category', 'quantity': 'commodity_quantity' }, inplace=True)
sales_of_cat_per_user.head(4)

Unnamed: 0,user_id,commodity_desc,user_sales_in_category,commodity_quantity
0,1,BACON,0.0,0.0
1,1,BAG SNACKS,0.0,0.0
2,1,BAKED BREAD/BUNS/ROLLS,5.49,2.0
3,1,BAKED SWEET GOODS,0.0,0.0
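
The `0.0` rows in the output above most likely come from candidates that were never bought: their `sales_value` and `quantity` are NaN, and pandas `sum` skips NaN, returning 0.0 for a group with no valid values. A small sketch on a hypothetical toy frame shows this, along with the `min_count=1` option that keeps such groups as NaN instead:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: BACON is a candidate without purchases (NaN values)
toy = pd.DataFrame({
    'user_id': [1, 1, 1],
    'commodity_desc': ['BACON', 'BAKED BREAD/BUNS/ROLLS', 'BAKED BREAD/BUNS/ROLLS'],
    'sales_value': [np.nan, 2.50, 2.99],
    'quantity': [np.nan, 1.0, 1.0],
})

keys = ['user_id', 'commodity_desc']
cols = ['sales_value', 'quantity']

# Default behaviour: an all-NaN group sums to 0.0
default_sum = toy.groupby(keys)[cols].sum().reset_index()

# min_count=1: a group needs at least one valid value, otherwise the result is NaN
strict_sum = toy.groupby(keys)[cols].sum(min_count=1).reset_index()

print(default_sum)
print(strict_sum)
```

Whether 0.0 or NaN is the better encoding for "no purchases in this category" depends on the downstream model; LightGBM accepts both.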


In [107]:
# Add the new features to the training set
df_train_ranker = df_train_ranker.merge(sales_of_cat_per_user, on=['user_id', 'commodity_desc'], how='left')
df_train_ranker.head()

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_sales_in_category,commodity_quantity
0,2070,834103,40642690000.0,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0
1,2070,834103,40666680000.0,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0
2,2070,834103,40679930000.0,598.0,1.0,1.0,311.0,-0.29,2029.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0
3,2070,834103,40715280000.0,601.0,1.0,1.0,311.0,-0.29,205.0,87.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0
4,2070,834103,40715280000.0,601.0,1.0,1.0,311.0,-0.29,407.0,87.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0


In [108]:
# For each user, compute the share of spending in each commodity and the average number of purchases of each commodity per week
df_train_ranker['share_of_cat_per_user'] = df_train_ranker['user_sales_in_category'] / df_train_ranker['all_sales_sum']
df_train_ranker['commodity_purchases_per_week'] = df_train_ranker['commodity_quantity'] / VAL_MATCHER_WEEKS
df_train_ranker.head(4)

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_sales_in_category,commodity_quantity,share_of_cat_per_user,commodity_purchases_per_week
0,2070,834103,40642690000.0,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333
1,2070,834103,40666680000.0,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333
2,2070,834103,40679930000.0,598.0,1.0,1.0,311.0,-0.29,2029.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333
3,2070,834103,40715280000.0,601.0,1.0,1.0,311.0,-0.29,205.0,87.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333
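
The two new columns can be verified by hand from the first displayed row (user 2070, SOFT DRINKS). The value of `VAL_MATCHER_WEEKS` is not shown in this section, so the sketch below infers it from the displayed ratio; treat that number as an assumption:

```python
# Values copied from the first displayed row (user 2070, SOFT DRINKS)
user_sales_in_category = 20.11
all_sales_sum = 617.29
commodity_quantity = 20.0
displayed_per_week = 3.333333

# Share of the user's total spending that falls into this commodity
share_of_cat_per_user = user_sales_in_category / all_sales_sum

# Assumption: VAL_MATCHER_WEEKS is inferred from the displayed ratio,
# it is not defined anywhere in this section
inferred_weeks = round(commodity_quantity / displayed_per_week)
commodity_purchases_per_week = commodity_quantity / inferred_weeks

print(round(share_of_cat_per_user, 6))
print(inferred_weeks, round(commodity_purchases_per_week, 6))
```

Both results agree with the `share_of_cat_per_user` and `commodity_purchases_per_week` values in the table.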


In [110]:
# Prepare the training set and separate the target
X_train = df_train_ranker.drop(['target', 'basket_id', 'store_id'], axis=1)
y_train = df_train_ranker[['target']]
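
In the rows displayed earlier, the transaction-level columns (`day`, `quantity`, `sales_value`, `retail_disc` and so on) are filled when `target` is 1.0 and NaN when it is 0.0, and they remain in `X_train`. A classifier could therefore separate the classes from missingness alone. A minimal sketch of a check for this, on a hypothetical toy frame (the helper `leaky_columns` is illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the displayed rows:
# purchased candidates carry transaction values, the others carry NaN
toy = pd.DataFrame({
    'item_id': [834103, 949616, 935546, 906262],
    'quantity': [1.0, 1.0, np.nan, np.nan],
    'sales_value': [1.00, 0.25, np.nan, np.nan],
    'av_check': [3.03, 3.03, 3.03, 3.03],
    'target': [1.0, 1.0, 0.0, 0.0],
})

def leaky_columns(df, target_col='target'):
    """Return columns whose missingness pattern alone reproduces the target."""
    leaks = []
    for col in df.columns.drop(target_col):
        present = df[col].notna().astype(float)
        # Only columns that actually contain NaN can leak through missingness
        if df[col].isna().any() and (present == df[target_col]).all():
            leaks.append(col)
    return leaks

print(leaky_columns(toy))
```

Columns flagged this way are candidates for dropping before training, since they are not available for an item the user has not bought yet.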

In [112]:
# Select the categorical features
cat_feats = ['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

X_train[cat_feats] = X_train[cat_feats].astype('category')

cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

In [113]:
# Train the model
lgb = LGBMClassifier(objective='binary',
                     max_depth=8,
                     n_estimators=300,
                     learning_rate=0.05,
                     categorical_column=cat_feats)

lgb.fit(X_train, y_train)

train_preds = lgb.predict_proba(X_train)



Make predictions

In [114]:
df_ranker_predict = df_train_ranker.copy()

In [115]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

In [116]:
df_ranker_predict.head(2)

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_sales_in_category,commodity_quantity,share_of_cat_per_user,commodity_purchases_per_week,proba_item_purchase
0,2070,834103,40642690000.0,595.0,1.0,1.0,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333,1.0
1,2070,834103,40666680000.0,597.0,1.0,1.0,311.0,-0.29,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333,1.0


In [117]:
df_ranker_predict.shape

(103468, 34)

In [118]:
# Look at the predictions for a specific user
df_ranker_predict.loc[df_ranker_predict['user_id']==2070].sort_values('proba_item_purchase', ascending=False)

Unnamed: 0,user_id,item_id,basket_id,day,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,target,all_sales_sum,user_total_quantity,av_check,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_sales_in_category,commodity_quantity,share_of_cat_per_user,commodity_purchases_per_week,proba_item_purchase
0,2070,834103,4.064269e+10,595.0,1.0,1.00,311.0,-0.29,2209.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,2224,GROCERY,National,SOFT DRINKS,SFT DRNK SNGL SRV BTL CARB (EX,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,20.11,20.0,0.032578,3.333333,9.999998e-01
41,2070,949616,4.088898e+10,615.0,1.0,0.25,311.0,-0.35,205.0,89.0,0.0,0.0,1.0,617.29,204,3.025931,857,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,9.999998e-01
29,2070,1055863,4.116013e+10,629.0,1.0,0.59,311.0,0.00,1902.0,91.0,0.0,0.0,1.0,617.29,204,3.025931,693,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1.55 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,9.999998e-01
33,2070,949616,4.066668e+10,597.0,1.0,0.25,311.0,-0.35,1813.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,857,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,9.999998e-01
34,2070,949616,4.067993e+10,598.0,1.0,0.25,311.0,-0.35,2029.0,86.0,0.0,0.0,1.0,617.29,204,3.025931,857,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,1 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,9.999998e-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43,2070,935546,,,,,,,,,,,0.0,617.29,204,3.025931,693,DRUG GM,National,CANDY - CHECKLANE,CANDY BARS (SINGLES)(INCLUDING,2.8 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,5.899988e-08
80,2070,906262,,,,,,,,,,,0.0,617.29,204,3.025931,1208,GROCERY,National,WATER - CARBONATED/FLVRD DRINK,NON-CRBNTD DRNKING/MNERAL WATE,20 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,4.26,6.0,0.006901,1.000000,5.899988e-08
49,2070,936511,,,,,,,,,,,0.0,617.29,204,3.025931,348,DRUG GM,National,CANDY - CHECKLANE,CHEWING GUM,14 PC,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,3.34,13.0,0.005411,2.166667,5.899988e-08
28,2070,925258,,,,,,,,,,,0.0,617.29,204,3.025931,2150,GROCERY,National,BAG SNACKS,POTATO CHIPS,3.5 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,1.80,6.0,0.002916,1.000000,5.899988e-08


- Measure precision@5 separately for the first-level model and for the two-level model on data_val_ranker

- Did precision@5 improve with the two-level model?

In [119]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


Compute the metric for the first-level model only

In [120]:
%%time
result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

CPU times: user 18.1 s, sys: 351 ms, total: 18.5 s
Wall time: 18.5 s


In [121]:
sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.22005221932114677)]
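
The helper `calc_precision` is defined outside this section. Assuming it averages the usual precision@k over users, the per-user metric can be sketched as follows (the function name `precision_at_k` and the item ids are illustrative):

```python
def precision_at_k(recommended, actual, k=5):
    """Share of the top-k recommended items that appear in the actual purchases."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / len(top_k)

# Two of the five recommended items were actually bought
example = precision_at_k([856942, 1043064, 1050310, 999, 1000],
                         [821867, 856942, 1050310], k=5)
print(example)
```

A value of about 0.22 then means that, on average, roughly one of the five recommended items was bought.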

Rerank the recommendations with the trained second-level model and compute the metric

    

In [122]:
def rerank(user_id):
    # Top-5 items for the user, sorted by predicted purchase probability
    user_rows = df_ranker_predict[df_ranker_predict[USER_COL] == user_id]
    return user_rows.sort_values('proba_item_purchase', ascending=False).head(5).item_id.tolist()
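
`df_ranker_predict` holds one row per transaction, so the same `item_id` can take several of the five slots. A possible variant that keeps only distinct items is sketched below on hypothetical toy data (`rerank_unique` and `toy_predict` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy predictions: item 856942 appears in two transactions
toy_predict = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'item_id': [856942, 1043064, 856942, 1050310],
    'proba_item_purchase': [0.99, 0.95, 0.90, 0.10],
})

def rerank_unique(df, user_id, k=5):
    """Top-k distinct items for a user, ordered by predicted purchase probability."""
    user_rows = df[df['user_id'] == user_id]
    ranked = user_rows.sort_values('proba_item_purchase', ascending=False)
    # drop_duplicates keeps the first (highest-scored) row of each item
    return ranked.drop_duplicates('item_id')['item_id'].head(k).tolist()

print(rerank_unique(toy_predict, 1, k=3))
```

With distinct items every slot in the top-k is a separate chance to hit an actual purchase, so the metric reported for the reranked list may change.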

In [123]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))

In [124]:
result_eval_ranker['reranked_own_rec']

0          [856942, 1043064, 1050310, 856942, 10149640]
1          [1037863, 895268, 1119051, 1119051, 1119051]
2           [1079023, 949836, 840386, 9338009, 1072483]
3          [1116578, 950824, 969932, 1029915, 12172240]
4         [1056005, 1040346, 948254, 13007284, 9416729]
                             ...                       
1910         [10285187, 957741, 827271, 820122, 995876]
1911        [1066685, 838487, 1066685, 1081479, 870515]
1912    [1100379, 12949855, 10456152, 9859182, 9526100]
1913        [1060872, 1070820, 833458, 1070820, 947798]
1914      [1065538, 1065538, 1065538, 1054945, 1019643]
Name: reranked_own_rec, Length: 1915, dtype: object

In [125]:
print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.22464751958224255)
('own_rec', 0.22005221932114677)


Thanks to feature engineering, the two-level model achieved a slightly higher precision (0.2246 vs. 0.2201 for the first-level model).

## Telegram: @Artem1_55