This notebook provides a ranking baseline that uses item and user features with LightGBM as the ranker model. The code for preparing the item features is [here](https://www.kaggle.com/alexvishnevskiy/ranking-item-features), and the code for preparing the user features is [here](https://www.kaggle.com/alexvishnevskiy/ranking-user-features). Some code is taken from [this repo](https://github.com/radekosmulski/personalized_fashion_recs).

In [3]:
# !pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-6.0.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
     |################################| 25.6 MB 3.8 MB/s eta 0:00:01
Installing collected packages: pyarrow
Successfully installed pyarrow-6.0.1
You should consider upgrading via the '/home/tarique/myvenv/bin/python3.6 -m pip install --upgrade pip' command.


In [4]:
from lightgbm.sklearn import LGBMRanker
from datetime import timedelta
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

### Load all data

In [5]:
# user_features = pd.read_parquet('../input/ranking-features/user_features.parquet')
# item_features = pd.read_parquet('../input/ranking-features/item_features.parquet')

user_features = pd.read_parquet('user_features.parquet')
item_features = pd.read_parquet('item_features.parquet')

transactions_df = pd.read_csv('transactions_train.csv')
transactions_df.t_dat = pd.to_datetime( transactions_df.t_dat )

The last 4 weeks of transactions are used to build the baseline candidates.

In [6]:
user_features.head()

customer_id,mean_transactions,max_transactions,min_transactions,median_transactions,sum_transactions,max_minus_min_transactions,n_transactions,n_transactions_bigger_mean,n_online_articles,n_unique_articles,...,top_index_name_2,top_index_group_name_0,top_index_group_name_1,top_index_group_name_2,top_section_name_0,top_section_name_1,top_section_name_2,top_garment_group_name_0,top_garment_group_name_1,top_garment_group_name_2
00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657,0.050831,0.050831,0.050831,0.050831,0.050831,0.0,21,10,12,19,...,0,0,0,0,0,0,0,0,0,0
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,0.027102,0.027102,0.027102,0.027102,0.027102,0.0,78,32,74,58,...,0,0,0,0,0,0,0,0,0,0
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0.061,0.061,0.061,0.061,0.061,0.0,15,6,15,12,...,1,1,1,1,1,1,1,1,1,1
00006413d8573cd20ed7128e53b7b13819fe5cfc2d801fe7fc0f26dd8d65a85a,0.032186,0.047441,0.020322,0.030492,0.128746,0.027119,11,5,11,10,...,2,2,2,2,2,2,2,0,0,0
0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,0.038119,0.042356,0.033881,0.038119,0.076237,0.008475,6,2,6,6,...,0,0,0,0,0,0,0,0,0,0


In [7]:
item_features.head()

article_id,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,...,product_group_name_3,graphical_appearance_name_3,colour_group_name_3,perceived_colour_value_name_3,perceived_colour_master_name_3,department_name_3,index_name_3,index_group_name_3,section_name_3,garment_group_name_3
108775015,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,0,1,1,0,1
108775044,0,0,0,1,1,1,0,0,0,0,...,1,1,1,1,1,0,1,1,0,1
110065001,1,1,0,0,0,0,1,1,0,1,...,0,1,1,1,1,0,0,1,0,0
110065002,1,1,0,1,1,1,1,1,0,1,...,0,1,1,1,1,0,0,1,0,0
110065011,1,1,0,3,2,2,1,1,0,1,...,0,1,0,1,0,0,0,1,0,0


In [8]:
df_4w = transactions_df[transactions_df['t_dat'] >= pd.to_datetime('2020-08-24')].copy()
df_3w = transactions_df[transactions_df['t_dat'] >= pd.to_datetime('2020-08-31')].copy()
df_2w = transactions_df[transactions_df['t_dat'] >= pd.to_datetime('2020-09-07')].copy()
df_1w = transactions_df[transactions_df['t_dat'] >= pd.to_datetime('2020-09-15')].copy()
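The four cut-off dates above are hard-coded. As a minimal sketch, assuming the windows are meant to count back from the last day in the data, the cut-offs could be derived with `timedelta` instead. The helper `last_n_weeks` and the toy frame are hypothetical, and the resulting boundaries will not match the hard-coded dates exactly (this sketch uses a strict `>` on whole weeks).

```python
from datetime import timedelta

import pandas as pd

# Toy transactions; the column names mirror transactions_train.csv.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-08-20", "2020-09-01", "2020-09-10", "2020-09-22"]),
    "article_id": [1, 2, 3, 4],
})

last_day = toy["t_dat"].max()


def last_n_weeks(df, n_weeks):
    """Hypothetical helper: keep rows from the last n_weeks * 7 days."""
    cutoff = last_day - timedelta(weeks=n_weeks)
    return df[df["t_dat"] > cutoff].copy()


windows = {n: last_n_weeks(toy, n) for n in (1, 2, 3, 4)}
sizes = {n: len(w) for n, w in windows.items()}
print(sizes)
```

Each window contains the shorter ones, which is the property the candidate fallback in this notebook relies on.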

Factorize all categorical features, i.e. replace each category with an integer code.

In [9]:
user_features[['club_member_status', 'fashion_news_frequency']]

customer_id,club_member_status,fashion_news_frequency
00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657,ACTIVE,NONE
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,ACTIVE,NONE
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,ACTIVE,NONE
00006413d8573cd20ed7128e53b7b13819fe5cfc2d801fe7fc0f26dd8d65a85a,ACTIVE,Regularly
0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,ACTIVE,NONE
...,...,...
ffff61677073258d461e043cc9ed4ed97be5617a920640ff61024f4619bf41c4,ACTIVE,Regularly
ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e4747568cac33e8c541831,ACTIVE,NONE
ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab53481233731b5c4f8b7,ACTIVE,NONE
ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264,ACTIVE,Regularly


In [10]:
user_features[['club_member_status', 'fashion_news_frequency']] = (
                   user_features[['club_member_status', 'fashion_news_frequency']]
                   .apply(lambda x: pd.factorize(x)[0])
).astype('int8')
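To make the encoding concrete, here is a small sketch of what `pd.factorize` returns on a toy column. The category values are illustrative only; the point is that codes are assigned in order of first appearance and missing values get the sentinel `-1`.

```python
import pandas as pd

# Illustrative values only; None stands for a missing entry.
status = pd.Series(["ACTIVE", "PRE-CREATE", "ACTIVE", None, "LEFT CLUB"])

codes, uniques = pd.factorize(status)
print(codes.tolist())
print(list(uniques))
```

Because the codes depend on the order of appearance, they carry no ordinal meaning; a tree model such as LightGBM simply treats them as numeric split points unless the columns are declared categorical.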

Merge the user and item features into the transactions.

In [11]:
transactions_df = (
    transactions_df
    .merge(user_features, on = ('customer_id'))
    .merge(item_features, on = ('article_id'))
)
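One thing worth keeping in mind: `merge` defaults to an inner join, so any transaction whose customer or article has no feature row is dropped. The sketch below shows this on toy frames (all values are made up); as in this notebook, the feature frame is indexed by the join key, and `on` is allowed to name an index level.

```python
import pandas as pd

tx = pd.DataFrame({"customer_id": ["a", "b", "c"], "article_id": [1, 2, 3]})

# Feature frame indexed by customer_id; customer "c" has no features.
users = pd.DataFrame(
    {"n_transactions": [5, 7]},
    index=pd.Index(["a", "b"], name="customer_id"),
)

merged = tx.merge(users, on="customer_id")
print(merged)
```

Comparing `len(tx)` with `len(merged)` is a cheap check for silently lost rows.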

In [12]:
transactions_df.sort_values(['t_dat', 'customer_id'], inplace=True)

In [13]:
transactions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16582699 entries, 77193 to 9896816
Data columns (total 88 columns):
 #   Column                              Dtype         
---  ------                              -----         
 0   t_dat                               datetime64[ns]
 1   customer_id                         object        
 2   article_id                          int64         
 3   price                               float64       
 4   sales_channel_id                    int64         
 5   mean_transactions                   float32       
 6   max_transactions                    float32       
 7   min_transactions                    float32       
 8   median_transactions                 float32       
 9   sum_transactions                    float32       
 10  max_minus_min_transactions          float32       
 11  n_transactions                      int8          
 12  n_transactions_bigger_mean          int8          
 13  n_online_articles                   i

In [14]:
# N_ROWS = 1_000_000

# train = transactions_df.loc[ transactions_df.t_dat <= pd.to_datetime('2020-09-15') ].iloc[:N_ROWS]
# valid = transactions_df.loc[ transactions_df.t_dat >= pd.to_datetime('2020-09-16') ]

N_ROWS = 4_057_000

train = transactions_df.loc[ transactions_df.t_dat <= pd.to_datetime('2020-09-22') ].iloc[-N_ROWS:]

In [15]:
#delete transactions to save memory
del transactions_df

In [16]:
# train.shape, valid.shape

train.shape

(4057000, 88)

### Prepare candidates

In [17]:
purchase_dict_4w = {}

for i,x in enumerate(zip(df_4w['customer_id'], df_4w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_4w:
        purchase_dict_4w[cust_id] = {}
    
    if art_id not in purchase_dict_4w[cust_id]:
        purchase_dict_4w[cust_id][art_id] = 0
    
    purchase_dict_4w[cust_id][art_id] += 1

dummy_list_4w = list((df_4w['article_id'].value_counts()).index)[:12]

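For each time window, count how many times each customer bought the same article.

The counts are stored in the following form:

> `{'customer_id (who?)': {article_id (what?): purchase count (how many times?)}}`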


In [18]:
# Toy example to verify the counting logic
names = ['Alice', 'Bob', 'Charlie','Alice']
ages = [24, 50, 18,24]
test_dict = {}

for i, (name, age) in enumerate(zip(names, ages)):
    print(i, name, age)
    if name not in test_dict:
        test_dict[name] = {}
    
    if age not in test_dict[name]:
        test_dict[name][age] = 0
    
    test_dict[name][age] += 1
test_dict

0 Alice 24
1 Bob 50
2 Charlie 18
3 Alice 24


{'Alice': {24: 2}, 'Bob': {50: 1}, 'Charlie': {18: 1}}
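The same nested dictionary can also be built from a grouped count rather than a row-by-row loop. This is a sketch on toy data, not the notebook's method; `purchase_dict` here is a hypothetical stand-in for `purchase_dict_4w` and its siblings.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a"],
    "article_id": [10, 20, 10, 30],
})

# One row per (customer, article) pair with the number of purchases.
counts = toy.groupby(["customer_id", "article_id"]).size()

# Reshape into {customer_id: {article_id: count}}.
purchase_dict = {}
for (cust_id, art_id), n in counts.items():
    purchase_dict.setdefault(cust_id, {})[art_id] = int(n)

print(purchase_dict)
```

Wrapping this in a function would also remove the need to repeat the loop once per window.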

In [19]:
purchase_dict_3w = {}

for i,x in enumerate(zip(df_3w['customer_id'], df_3w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_3w:
        purchase_dict_3w[cust_id] = {}
    
    if art_id not in purchase_dict_3w[cust_id]:
        purchase_dict_3w[cust_id][art_id] = 0
    
    purchase_dict_3w[cust_id][art_id] += 1

dummy_list_3w = list((df_3w['article_id'].value_counts()).index)[:12]

In [20]:
purchase_dict_2w = {}

for i,x in enumerate(zip(df_2w['customer_id'], df_2w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_2w:
        purchase_dict_2w[cust_id] = {}
    
    if art_id not in purchase_dict_2w[cust_id]:
        purchase_dict_2w[cust_id][art_id] = 0
    
    purchase_dict_2w[cust_id][art_id] += 1

dummy_list_2w = list((df_2w['article_id'].value_counts()).index)[:12]

In [21]:
purchase_dict_1w = {}

for i,x in enumerate(zip(df_1w['customer_id'], df_1w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_1w:
        purchase_dict_1w[cust_id] = {}
    
    if art_id not in purchase_dict_1w[cust_id]:
        purchase_dict_1w[cust_id][art_id] = 0
    
    purchase_dict_1w[cust_id][art_id] += 1

dummy_list_1w = list((df_1w['article_id'].value_counts()).index)[:12]

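What `prepare_candidates` does:

- For each customer, take the articles that customer bought most often in the most recent window that contains their purchases (1, 2, 3, then 4 weeks) as candidates, up to `n_candidates` (12 by default).
- If the customer has fewer articles than that, fill the shortfall with the best-selling articles of the same window (computed over all transactions in that window).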

In [22]:
def prepare_candidates(customers_id, n_candidates = 12):
  """
  customers_id - iterable of customer ids (customers should be unique)
  Returns a dataframe with one (customer_id, article_id) row per candidate.
  """
  prediction_dict = {}
  dummy_list = list((df_2w['article_id'].value_counts()).index)[:n_candidates]

  for i, cust_id in tqdm(enumerate(customers_id)):
    # comment this for validation
    if cust_id in purchase_dict_1w:
        # Sort this customer's purchased articles by purchase count, descending
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        # Keep only the article IDs from the sorted (article_id, count) pairs
        l = [y[0] for y in l]
        # More articles than the candidate limit: keep only the first n_candidates
        if len(l)>n_candidates:
            s = l[:n_candidates]
        else:
            # Otherwise pad with dummy values: the best-selling articles of
            # that week (dummy_list_1w holds at most 12 articles)
            s = l+dummy_list_1w[:(n_candidates-len(l))]
    elif cust_id in purchase_dict_2w:
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>n_candidates:
            s = l[:n_candidates]
        else:
            s = l+dummy_list_2w[:(n_candidates-len(l))]
    elif cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>n_candidates:
            s = l[:n_candidates]
        else:
            s = l+dummy_list_3w[:(n_candidates-len(l))]
    elif cust_id in purchase_dict_4w:
        l = sorted((purchase_dict_4w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>n_candidates:
            s = l[:n_candidates]
        else:
            s = l+dummy_list_4w[:(n_candidates-len(l))]
    else:
        s = dummy_list
    prediction_dict[cust_id] = s

  k = list(map(lambda x: x[0], prediction_dict.items()))
  v = list(map(lambda x: x[1], prediction_dict.items()))
  negatives_df = pd.DataFrame({'customer_id': k, 'negatives': v})
  negatives_df = (
      negatives_df
      .explode('negatives')
      .rename(columns = {'negatives': 'article_id'})
  )
  return negatives_df
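Each branch of `prepare_candidates` applies the same rule: rank the customer's own articles by purchase count, truncate, then pad from a popularity list. A minimal sketch of that rule on toy data follows; `top_n_with_padding` is a hypothetical helper, not part of the notebook.

```python
def top_n_with_padding(purchase_counts, popular, n):
    """Hypothetical helper mirroring one branch of prepare_candidates."""
    # Rank the customer's own articles by purchase count, descending.
    ranked = sorted(purchase_counts.items(), key=lambda kv: kv[1], reverse=True)
    own = [article for article, _ in ranked][:n]
    # Pad with the most popular articles until n candidates are reached.
    return own + popular[: n - len(own)]


popular = [900, 901, 902, 903]
with_history = top_n_with_padding({10: 1, 20: 3}, popular, 4)
cold_start = top_n_with_padding({}, popular, 3)
print(with_history)
print(cold_start)
```

As in the notebook's function, the padding does not check whether a popular article is already among the customer's own, so duplicates are possible.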

### Train model

In [23]:
train['rank'] = range(len(train))
train.assign(rn = train.groupby(['customer_id'])['rank'].rank(method='first', ascending=False))

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,mean_transactions,max_transactions,min_transactions,median_transactions,sum_transactions,...,colour_group_name_3,perceived_colour_value_name_3,perceived_colour_master_name_3,department_name_3,index_name_3,index_group_name_3,section_name_3,garment_group_name_3,rank,rn
3756864,2020-06-21,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,806388005,0.013542,2,0.033294,0.083390,0.011847,0.022017,0.699170,...,0,0,0,0,1,1,0,1,0,21.0
5691096,2020-06-21,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,853916001,0.033881,2,0.033294,0.083390,0.011847,0.022017,0.699170,...,1,1,1,0,0,0,0,1,1,20.0
6490847,2020-06-21,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,891199002,0.022017,2,0.033294,0.083390,0.011847,0.022017,0.699170,...,1,1,1,1,1,1,0,1,2,19.0
6600464,2020-06-21,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,825580001,0.016932,2,0.033294,0.083390,0.011847,0.022017,0.699170,...,1,1,1,0,1,1,0,1,3,18.0
7599051,2020-06-21,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,800691007,0.011847,2,0.033294,0.083390,0.011847,0.022017,0.699170,...,1,1,1,0,1,1,0,1,4,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3443216,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,891322004,0.042356,2,0.029208,0.076254,0.005068,0.025407,3.972237,...,1,1,1,0,1,1,1,0,4056995,2.0
12204387,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,929511001,0.059305,2,0.029208,0.076254,0.005068,0.025407,3.972237,...,1,1,1,0,1,1,1,0,4056996,1.0
16573416,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,918325001,0.043203,1,0.020110,0.043203,0.010153,0.013542,0.080441,...,1,1,1,0,0,1,0,1,4056997,1.0
10375456,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,833459002,0.006763,1,0.035576,0.067780,0.006763,0.030102,0.782678,...,0,1,0,0,0,1,0,1,4056998,1.0


In [24]:
# Take only the last 15 transactions
# Number the rows sequentially (the frame is sorted by date)
train['rank'] = range(len(train))
# Keep each customer's 15 most recent transactions as training data
train = (
    train
    .assign(
        rn = train.groupby(['customer_id'])['rank']
                  .rank(method='first', ascending=False))
    .query("rn <= 15")
    .drop(columns = ['price', 'sales_channel_id'])
    .sort_values(['t_dat', 'customer_id'])
)
train['label'] = 1

del train['rank']
del train['rn']

# valid.sort_values(['t_dat', 'customer_id'], inplace = True)
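On a frame that is already sorted by date, the rank-and-filter step selects the same rows as `groupby(...).tail(n)`. The sketch below checks that equivalence on toy data with `n = 2`; it is an alternative formulation under that sorting assumption, not the notebook's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-05", "2020-09-09", "2020-09-02", "2020-09-03"]
    ),
    "article_id": [1, 2, 3, 4, 5],
}).sort_values(["t_dat", "customer_id"])

# Rank-based selection, as in the cell above (with 2 instead of 15).
rank_based = (
    toy.assign(rank=range(len(toy)))
    .assign(rn=lambda d: d.groupby("customer_id")["rank"]
                          .rank(method="first", ascending=False))
    .query("rn <= 2")
    .drop(columns=["rank", "rn"])
)

# Equivalent selection: the last 2 rows of each customer's group.
tail_based = toy.groupby("customer_id").tail(2)

print(rank_based.equals(tail_based))
```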

Append negatives to positives using last dates from train

In [25]:
# Get each customer's most recent purchase date
last_dates = (
    train
    .groupby('customer_id')['t_dat']
    .max()
    .to_dict()
)

# Build 15 candidate articles per customer and date them with that customer's last purchase
negatives = prepare_candidates(train['customer_id'].unique(), 15)
negatives['t_dat'] = negatives['customer_id'].map(last_dates)

negatives = (
    negatives
    .merge(user_features, on = ('customer_id'))
    .merge(item_features, on = ('article_id'))
)
negatives['label'] = 0

535431it [00:02, 252230.77it/s]


In [26]:
negatives

Unnamed: 0,customer_id,article_id,t_dat,mean_transactions,max_transactions,min_transactions,median_transactions,sum_transactions,max_minus_min_transactions,n_transactions,...,graphical_appearance_name_3,colour_group_name_3,perceived_colour_value_name_3,perceived_colour_master_name_3,department_name_3,index_name_3,index_group_name_3,section_name_3,garment_group_name_3,label
0,ff0f526e93a150f5f0ccef8453ee1dcb5e96f4376aaa22...,891591001,2020-08-30,0.033294,0.083390,0.011847,0.022017,0.699170,0.071542,-110,...,0,1,1,1,0,1,1,0,0,0
1,0b45b16d9d448c2c029ea9cee8f464be3e4bca2bc89a1e...,891591001,2020-09-19,0.040435,0.084729,0.016932,0.033881,0.606525,0.067797,108,...,0,1,1,1,0,1,1,0,0,0
2,176e33213f43ebfaa7b7cd4ebdc7dd4ca13d24c0380cb4...,891591001,2020-09-19,0.029194,0.084729,0.016932,0.022017,0.379525,0.067797,79,...,0,1,1,1,0,1,1,0,0,0
3,69c8baca290d79a6162dabce58bba8488a6d4aff4249a0...,891591001,2020-09-02,0.035092,0.084729,0.007424,0.033881,0.491288,0.077305,39,...,0,1,1,1,0,1,1,0,0,0
4,7b47db7787f919bc3800d065e8b13d01b4d557458ed27b...,891591001,2020-08-26,0.025638,0.084729,0.015237,0.022017,0.282017,0.069492,59,...,0,1,1,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7815442,ef42ee5fa92d612288a6d42a5a2eb25e55383ded18d70f...,823685002,2020-09-22,0.018627,0.025407,0.013542,0.016932,0.149017,0.011864,8,...,1,0,1,0,0,0,1,0,0,0
7815443,f1d7a9b981448439a2bfc989aea90621aa79d1318474d2...,798622005,2020-09-22,0.020661,0.030492,0.010153,0.016932,0.103305,0.020339,5,...,1,1,1,1,0,0,1,0,0,0
7815444,f26132ea566e3aac25c89925ba0ad88a34b67545f3c6db...,554450034,2020-09-22,0.033782,0.084729,0.010153,0.033881,0.574288,0.074576,17,...,1,0,0,0,0,1,1,1,0,0
7815445,f79e372e21c1359dfebc7da0bf7f321d55e47b3275c351...,533261032,2020-09-22,0.033881,0.033881,0.033881,0.033881,0.067763,0.000000,62,...,1,0,0,0,0,0,1,1,1,0


In [27]:
train = pd.concat([train, negatives])
train.sort_values(['customer_id', 't_dat'], inplace = True)

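`LGBMRanker` needs its `group` argument to know which contiguous block of rows belongs to each customer. That is why the rows were sorted by customer ID above; the cell below counts the number of rows per customer ID.

Passing those counts as an array, in the same order as the sorted rows, tells the ranker where each customer's block starts and ends.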
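A small sketch of the group-size array on toy data follows. It assumes the rows are sorted by `customer_id` first, so that the order of the counts (groupby sorts its keys by default) matches the order of the row blocks.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["b", "a", "a", "c", "a", "b"],
    "label": [1, 0, 1, 0, 0, 0],
}).sort_values("customer_id", kind="stable")

# Number of rows per customer, in sorted customer order.
group_sizes = toy.groupby("customer_id")["label"].count().to_numpy()
print(group_sizes)

# The sizes must add up to the number of training rows.
total_rows = int(group_sizes.sum())
print(total_rows == len(toy))
```

Note that `count()` skips missing values in the counted column; `size()` is the safer choice if that column could contain NaN.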


In [28]:
# train_baskets = train.groupby(['customer_id'])['article_id'].count().values
# valid_baskets = valid.groupby(['customer_id'])['article_id'].count().values
train_baskets = train.groupby(['customer_id'])['article_id'].count().values

In [29]:
train_baskets

array([14, 16, 14, ..., 23, 25, 16])

Fit the LightGBM ranker model.

In [30]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    max_depth=7,
    n_estimators=300,
    importance_type='gain',
    verbose=10
)

In [None]:
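ranker = ranker.fit(
    # drop() returns a copy, so 'label' is still in train when pop() runs next
    train.drop(columns = ['t_dat', 'customer_id', 'article_id', 'label']),
    # pop() returns the labels and removes the column from train
    train.pop('label'),
    group=train_baskets,
)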

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.892217
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.524799
[LightGBM] [Debug] init for col-wise cost 1.967700 seconds, init for row-wise cost 5.682926 seconds
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 5157
[LightGBM] [Info] Number of data points in the train set: 11153983, number of used features: 83
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7

In [None]:
ranker

### Predictions

In [None]:
sample_sub = pd.read_csv('sample_submission.csv')

In [None]:
candidates = prepare_candidates(sample_sub.customer_id.unique(), 12)
candidates = (
    candidates
    .merge(user_features, on = ('customer_id'))
    .merge(item_features, on = ('article_id'))
)

Predict in batches; otherwise the candidates do not fit into memory.

In [None]:
batch_size = 1_000_000
for bucket in tqdm(range(0, len(candidates), batch_size)):
    print(bucket)
    print(batch_size)
    print(bucket+batch_size)
    #candidates.iloc[bucket: bucket+batch_size]

In [None]:
preds = []
batch_size = 1_000_000
# Take the candidates 1_000_000 rows at a time and predict
# The results of each batch are appended to preds
for bucket in tqdm(range(0, len(candidates), batch_size)):
  outputs = ranker.predict(
      candidates.iloc[bucket: bucket+batch_size]
      .drop(columns = ['customer_id', 'article_id'])
      )
  preds.append(outputs)
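The batching pattern itself does not depend on the model. Here is a sketch with a stand-in scoring function in place of `ranker.predict`; both `predict_in_batches` and `fake_predict` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def predict_in_batches(predict_fn, frame, batch_size):
    """Hypothetical helper: apply predict_fn to consecutive row slices."""
    parts = [
        predict_fn(frame.iloc[start:start + batch_size])
        for start in range(0, len(frame), batch_size)
    ]
    return np.concatenate(parts) if parts else np.empty(0)


def fake_predict(batch):
    """Stand-in for ranker.predict: score each row by the sum of its features."""
    return batch.sum(axis=1).to_numpy()


features = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0, 5.0], "f2": [0.5] * 5})
scores = predict_in_batches(fake_predict, features, batch_size=2)
print(scores)
```

The last slice is simply shorter than `batch_size`, so no special handling of the remainder is needed.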

In [None]:
preds

In [None]:
preds = np.concatenate(preds)
preds

In [None]:
candidates['preds'] = preds
candidates['preds']

In [None]:
preds = candidates[['customer_id', 'article_id', 'preds']]
preds

In [None]:
preds = preds.sort_values(['customer_id', 'preds'], ascending=False)
preds

In [None]:
preds = (
    preds
    .groupby('customer_id')[['article_id']]
    .aggregate(lambda x: x.tolist())
)
preds

In [None]:
preds['article_id'] = preds['article_id'].apply(lambda x: ' '.join(['0'+str(k) for k in x]))
preds['article_id'] 
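The sort, group, and join steps above can be condensed into one chain. The sketch below does this on toy scores and uses `str.zfill(10)` for the padding, assuming article IDs are meant to be 10-character strings with leading zeros (which is what prefixing `'0'` to a 9-digit ID produces).

```python
import pandas as pd

scored = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": [108775015, 554450034, 891591001, 110065001],
    "preds": [0.2, 0.9, 0.5, 0.1],
})

submission = (
    scored
    # Highest score first within each customer.
    .sort_values(["customer_id", "preds"], ascending=[True, False])
    .groupby("customer_id")["article_id"]
    # Zero-pad each id to 10 characters and join with spaces.
    .apply(lambda ids: " ".join(str(i).zfill(10) for i in ids))
    .rename("prediction")
    .reset_index()
)
print(submission)
```

Unlike a fixed `'0'` prefix, `zfill` stays correct if an ID happens to have fewer than 9 digits.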

Join with the sample submission and fill missing predictions with the articles from `dummy_list_2w`.

In [None]:
preds = sample_sub[['customer_id']].merge(
    preds
    .reset_index()
    .rename(columns = {'article_id': 'prediction'}), how = 'left')
preds['prediction'] = preds['prediction'].fillna(' '.join(['0'+str(art) for art in dummy_list_2w]))

In [None]:
preds.to_csv('submisssion_v1.csv', index = False)

In [3]:
import kaggle

In [5]:
# !kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f "submisssion_v1.csv" -m "LGBM Ranker 457M with 300 Trees"


100%|########################################| 258M/258M [00:06<00:00, 44.3MB/s]
Successfully submitted to H&M Personalized Fashion Recommendations