## Capstone Part 4 - LGBM Model

The objective of this notebook is to use a LightGBM (LGBM) ranker model to beat the base model created in Part 2. The notebook was configured to run on Google Colab because it needs more RAM than a normal local Jupyter notebook could provide. The file paths pointing to Google Drive are therefore kept as they are, so that the notebook runs on Colab.

In [None]:
# Import libraries
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# read parquet files processed in previous part
transactions = pd.read_parquet('/content/drive/MyDrive/datasets/transactions_5w_train.parquet')
customers = pd.read_parquet('/content/drive/MyDrive/datasets/customers.parquet')
articles = pd.read_parquet('/content/drive/MyDrive/datasets/articles.parquet')

I will be using the last 5 weeks of data. This was the initial choice because I had limited RAM on my computer. After subscribing to Google Colab, I was able to use up to around 26 weeks of data. Across the iterations I tried, however, the 5-week LGBM gave the best final model, so only the 5-week model is presented in this notebook.

As previously discussed, this might be because H&M is a fast-fashion retailer. As part of its strategy, it refers to fashion-show designs and produces them quickly. Its product range therefore changes frequently, and customers' purchase trends change with it.

In [None]:
transactions.shape

(1300034, 6)

In [None]:
transactions.columns

Index(['t_dat', 'customer_id', 'article_id', 'price', 'sales_channel_id',
       'week'],
      dtype='object')

In [None]:
# map each customer_id to the weeks in which they made a purchase
c2weeks = transactions.groupby('customer_id')['week'].unique()

In [None]:
# display output; all weeks fall between 100 and 104, i.e. 5 weeks of data
c2weeks

customer_id
28847241659200          [101, 102]
116809474287335         [101, 103]
200292573348128              [102]
272412481300040              [103]
329094189075899              [100]
                           ...    
18446590778427270109         [102]
18446630855572834764         [103]
18446662237889060501         [100]
18446705133201055310         [102]
18446737527580148316         [104]
Name: week, Length: 273166, dtype: object

Note that 273,166 unique customers made purchases in weeks 100 to 104.

In [None]:
# use last week as test week
test_week = transactions.week.max()

In [None]:
# week 104 will be used as test week for ranking
test_week

104

As the purpose of the recommender is to predict what will be bought within the next 7 days, it essentially has to predict the items each customer will buy in week 105. We therefore use the last available week (104) as the test dataset.

In [None]:
# Build a nested dictionary: customer_id -> {purchase week -> shifted week}
# Each purchase week maps to the customer's next purchase week in c2weeks
# The customer's last purchase week maps to the test week (104)
# An entry 104: 104 means the customer's last purchase was in week 104 itself
c2weeks2shifted_weeks = {}

for c_id, weeks in c2weeks.items():
    c2weeks2shifted_weeks[c_id] = {}
    for i in range(weeks.shape[0]-1):
        c2weeks2shifted_weeks[c_id][weeks[i]] = weeks[i+1]
    c2weeks2shifted_weeks[c_id][weeks[-1]] = test_week

In [None]:
c2weeks2shifted_weeks

{28847241659200: {101: 102, 102: 104},
 116809474287335: {101: 103, 103: 104},
 200292573348128: {102: 104},
 272412481300040: {103: 104},
 329094189075899: {100: 104},
 519262836338427: {102: 104},
 690285180337957: {103: 104},
 745180086074610: {100: 102, 102: 104},
 762483386043116: {100: 104},
 805095543045062: {102: 104},
 964326548579219: {102: 104},
 1037449031262554: {101: 102, 102: 104},
 1195818762005827: {100: 104},
 1200402310946735: {103: 104},
 1219588721247131: {102: 103, 103: 104},
 1289455304111298: {101: 104},
 1292700965481018: {103: 104},
 1296218836199721: {101: 104},
 1394073833551710: {100: 102, 102: 103, 103: 104},
 1402273113592184: {104: 104},
 1428037123270201: {100: 104},
 1456826891333599: {102: 104},
 1520973890714130: {102: 104},
 1563099511359960: {100: 102, 102: 104},
 1667805948360801: {100: 104},
 1827730561464445: {102: 104, 104: 104},
 1830503753738904: {100: 102, 102: 104},
 1905990147027598: {101: 103, 103: 104},
 1951136007097426: {104: 104},
 20

In [None]:
# create an independent copy of the transactions dataframe (DataFrame.copy() is deep by default)
candidates_last_purchase = transactions.copy()

In [None]:
# Replace each transaction's week with its shifted week from c2weeks2shifted_weeks
weeks = [
    c2weeks2shifted_weeks[c_id][week]
    for c_id, week in zip(transactions['customer_id'], transactions['week'])
]

candidates_last_purchase['week'] = weeks

In [None]:
# check that a shifted week was produced for every transaction (transactions has 1300034 rows)
len(weeks)

1300034

In [None]:
candidates_last_purchase.groupby('customer_id')['week'].unique()

customer_id
28847241659200          [102, 104]
116809474287335         [103, 104]
200292573348128              [104]
272412481300040              [104]
329094189075899              [104]
                           ...    
18446590778427270109         [104]
18446630855572834764         [104]
18446662237889060501         [104]
18446705133201055310         [104]
18446737527580148316         [104]
Name: week, Length: 273166, dtype: object

Similar to `c2weeks`, there are 273,166 unique customer_id values. However, each week has now been shifted forward to the customer's next purchase week, with the last purchase week mapped to the test week (104).
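
To make the shift concrete, here is a minimal sketch on toy data. The customer ids and weeks are made up for illustration, and `shift_weeks` is a hypothetical helper that mirrors the logic of the loop above: each purchase week maps to the customer's next purchase week, and the final purchase week maps to the test week.

```python
# Toy illustration of the week-shifting logic; ids and weeks are hypothetical.
TEST_WEEK = 104

# customer -> weeks with at least one purchase, in ascending order
toy_c2weeks = {
    "A": [101, 102],
    "B": [100],
    "C": [104],
}

def shift_weeks(weeks, test_week):
    """Map each purchase week to the next purchase week; the last one maps to test_week."""
    shifted = {week: nxt for week, nxt in zip(weeks, weeks[1:])}
    shifted[weeks[-1]] = test_week
    return shifted

toy_shifted = {c_id: shift_weeks(weeks, TEST_WEEK) for c_id, weeks in toy_c2weeks.items()}
print(toy_shifted)
```

The mapping only makes sense when each customer's weeks are in ascending order. `unique()` preserves the order of appearance, so this presumably relies on the transactions being sorted by date.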

## Bestseller candidates

In [None]:
# Calculate mean price per week and article_id
mean_price = transactions \
    .groupby(['week', 'article_id'])['price'].mean()

In [None]:
# rank articles by weekly sales count (dense rank) and keep the top 12 per week as bestseller_rank
sales = transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').rank(method='dense', ascending=False) \
    .groupby('week').head(12).rename('bestseller_rank').astype('int8')

In [None]:
sales

week  article_id
100   916468003      1
      896152003      2
      896152002      3
      751471001      4
      706016001      5
      918292001      6
      921906003      7
      751471043      8
      706016003      9
      918292004     10
      915526002     11
      920610001     12
101   898694001      1
      933706001      2
      751471001      3
      915526001      4
      915529003      5
      706016001      6
      918292001      7
      751471043      8
      915526002      9
      915529001     10
      862970001     11
      863595006     12
102   915526001      1
      751471043      2
      751471001      3
      706016001      4
      919365008      5
      915529003      6
      918292001      7
      863595006      8
      896152002      9
      448509014     10
      909916001     11
      762846031     12
103   909370001      1
      865799006      2
      918522001      3
      924243001      4
      448509014      5
      751471001      6
      809238001  

In [None]:
# attach the mean price, then shift the week forward by 1 so each week is paired with the previous week's bestsellers
bestsellers_previous_week = pd.merge(sales, mean_price, on=['week', 'article_id']).reset_index()
bestsellers_previous_week.week += 1

In [None]:
bestsellers_previous_week

Unnamed: 0,week,article_id,bestseller_rank,price
0,101,916468003,1,0.032983
1,101,896152003,2,0.033229
2,101,896152002,3,0.033338
3,101,751471001,4,0.033391
4,101,706016001,5,0.033502
5,101,918292001,6,0.041646
6,101,921906003,7,0.033327
7,101,751471043,8,0.033299
8,101,706016003,9,0.033337
9,101,918292004,10,0.041617
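
The ranking and the one-week shift can be checked on a small example. The sketch below uses made-up transactions and keeps only the top 2 articles per week (the notebook keeps 12); the names `toy`, `TOP_N`, `toy_rank` and `toy_prev` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical toy transactions: two weeks, a handful of articles.
toy = pd.DataFrame({
    "week":       [100, 100, 100, 100, 100, 100, 101, 101, 101],
    "article_id": [11,  11,  11,  22,  22,  44,  33,  33,  11],
})

TOP_N = 2  # the notebook keeps 12; 2 keeps the toy output small

# Count sales per (week, article), rank within each week, keep the top N.
counts = toy.groupby("week")["article_id"].value_counts()
toy_rank = (
    counts.groupby("week").rank(method="dense", ascending=False)
    .groupby("week").head(TOP_N)
    .rename("bestseller_rank").astype("int8")
)

# Shift by one week so that week w carries the bestsellers of week w - 1.
toy_prev = toy_rank.reset_index()
toy_prev["week"] += 1
print(toy_prev)
```

Article 44 drops out because it ranks third in week 100, and the surviving rows are labelled with weeks 101 and 102 rather than 100 and 101. With a dense rank, tied articles share a rank, so `head(TOP_N)` limits the number of rows rather than the highest rank value.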


In [None]:
# create dataframe containing unique transaction for each week and customer_id
unique_transactions = transactions \
    .groupby(['week', 'customer_id']) \
    .head(1) \
    .drop(columns=['article_id', 'price']) \
    .copy()

In [None]:
unique_transactions.head()

Unnamed: 0,t_dat,customer_id,sales_channel_id,week
30500174,2020-08-19,6435666514878045,2,100
30495830,2020-08-19,6930054433895293,1,100
30520935,2020-08-19,8383252499052781,1,100
30492047,2020-08-19,9057218560097811,1,100
30491229,2020-08-19,11942017059998426,1,100


In [None]:
# join unique_transactions with bestsellers_previous_week 
candidates_bestsellers = pd.merge(
    unique_transactions,
    bestsellers_previous_week,
    on='week',
)

In [None]:
candidates_bestsellers.head()

Unnamed: 0,t_dat,customer_id,sales_channel_id,week,article_id,bestseller_rank,price
0,2020-08-26,116809474287335,1,101,916468003,1,0.032983
1,2020-08-26,116809474287335,1,101,896152003,2,0.033229
2,2020-08-26,116809474287335,1,101,896152002,3,0.033338
3,2020-08-26,116809474287335,1,101,751471001,4,0.033391
4,2020-08-26,116809474287335,1,101,706016001,5,0.033502


In [None]:
# keep one row per customer_id and set its week to the test week (104)
test_set_transactions = unique_transactions.drop_duplicates('customer_id').reset_index(drop=True)
test_set_transactions.week = test_week

In [None]:
# create dataframe for bestseller ranking and test set transactions
candidates_bestsellers_test_week = pd.merge(
    test_set_transactions,
    bestsellers_previous_week,
    on='week')

In [None]:
candidates_bestsellers = pd.concat([candidates_bestsellers, candidates_bestsellers_test_week])
candidates_bestsellers.drop(columns='bestseller_rank', inplace=True)

In [None]:
candidates_bestsellers

Unnamed: 0,t_dat,customer_id,sales_channel_id,week,article_id,price
0,2020-08-26,116809474287335,1,101,916468003,0.032983
1,2020-08-26,116809474287335,1,101,896152003,0.033229
2,2020-08-26,116809474287335,1,101,896152002,0.033338
3,2020-08-26,116809474287335,1,101,751471001,0.033391
4,2020-08-26,116809474287335,1,101,706016001,0.033502
...,...,...,...,...,...,...
3277987,2020-09-22,18438270306572912089,1,104,918292001,0.041424
3277988,2020-09-22,18438270306572912089,1,104,762846027,0.025104
3277989,2020-09-22,18438270306572912089,1,104,809238005,0.041656
3277990,2020-09-22,18438270306572912089,1,104,673677002,0.024925


## Combining transactions and candidates / negative examples

In [None]:
transactions['purchased'] = 1

In [None]:
transactions.shape

(1300034, 7)

In [None]:
candidates_last_purchase.shape

(1300034, 6)

In [None]:
candidates_bestsellers.shape

(6842928, 6)

In [None]:
# concatenate positive transactions with the two candidate sets (last purchase and bestsellers)
data = pd.concat([transactions, candidates_last_purchase, candidates_bestsellers])
# candidate rows have no 'purchased' value yet; label them as negatives
data['purchased'] = data['purchased'].fillna(0)

data.purchased.mean()

0.1376717728144754

In [None]:
data.shape

(9442996, 7)

In [None]:
# drop candidate rows that duplicate an actual purchase by the same customer in the same week
# (positives come first in the concat, so the default keep='first' retains them)
data.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)

In [None]:
data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week,purchased
30500174,2020-08-19,6435666514878045,816423005,0.011847,2,100,1.0
30500175,2020-08-19,6435666514878045,599718043,0.016932,2,100,1.0
30500176,2020-08-19,6435666514878045,806528004,0.025407,2,100,1.0
30500177,2020-08-19,6435666514878045,903211001,0.042356,2,100,1.0
30500178,2020-08-19,6435666514878045,779781006,0.042356,2,100,1.0


At this stage, we have a dataframe containing the purchased items together with two kinds of candidates: the previous week's bestsellers and each customer's earlier purchases carried forward to their next purchase week.
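
The labelling step relies on the order of the concatenation. A minimal sketch with made-up rows (the frames `positives` and `candidates` are hypothetical) shows why a candidate that was actually bought keeps the label 1:

```python
import pandas as pd

# Hypothetical toy frames: one real purchase and two candidate rows,
# one of which duplicates the real purchase.
positives = pd.DataFrame(
    {"customer_id": [1], "article_id": [10], "week": [104], "purchased": [1]}
)
candidates = pd.DataFrame(
    {"customer_id": [1, 1], "article_id": [10, 20], "week": [104, 104]}
)

combined = pd.concat([positives, candidates], ignore_index=True)
combined["purchased"] = combined["purchased"].fillna(0)  # candidates become negatives

# Positives come first, so keep="first" (the default) retains label 1 for (1, 10, 104).
combined = combined.drop_duplicates(["customer_id", "article_id", "week"])
print(combined)
```

If the candidates were concatenated before the positives, the same call would keep the negative copy instead, so the order of the list passed to `pd.concat` matters.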

## Add bestseller information

In [None]:
# merge data with the bestseller rank from the previous week; using the current week's rank would leak information
data = pd.merge(
    data,
    bestsellers_previous_week[['week', 'article_id', 'bestseller_rank']],
    on=['week', 'article_id'],
    how='left'
)

In [None]:
data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week,purchased,bestseller_rank
0,2020-08-19,6435666514878045,816423005,0.011847,2,100,1.0,
1,2020-08-19,6435666514878045,599718043,0.016932,2,100,1.0,
2,2020-08-19,6435666514878045,806528004,0.025407,2,100,1.0,
3,2020-08-19,6435666514878045,903211001,0.042356,2,100,1.0,
4,2020-08-19,6435666514878045,779781006,0.042356,2,100,1.0,


In [None]:
# fill missing ranks with 999 (the article was not among the previous week's top 12)
#data = data[data.week != data.week.min()]
data['bestseller_rank'] = data['bestseller_rank'].fillna(999)

In [None]:
data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week,purchased,bestseller_rank
0,2020-08-19,6435666514878045,816423005,0.011847,2,100,1.0,999.0
1,2020-08-19,6435666514878045,599718043,0.016932,2,100,1.0,999.0
2,2020-08-19,6435666514878045,806528004,0.025407,2,100,1.0,999.0
3,2020-08-19,6435666514878045,903211001,0.042356,2,100,1.0,999.0
4,2020-08-19,6435666514878045,779781006,0.042356,2,100,1.0,999.0


In [None]:
data = pd.merge(data, articles, on='article_id', how='left')
data = pd.merge(data, customers, on='customer_id', how='left')

In [None]:
data.sort_values(['week', 'customer_id'], inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
# train test split manually
train = data[data.week != test_week]
test = data[data.week==test_week].drop_duplicates(['customer_id', 'article_id', 'sales_channel_id']).copy()

In [None]:
data['purchased'].value_counts(normalize=True)

0.0    0.85575
1.0    0.14425
Name: purchased, dtype: float64

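The classes are imbalanced: about 86% of the observations are 0. `scale_pos_weight` is therefore set in the model below to give the positive class more weight. Note that the LightGBM documentation describes `scale_pos_weight` as applying to the binary and multiclassova objectives, so its effect with `lambdarank` may be limited.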
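
A common heuristic for a positive-class weight is the ratio of negative to positive examples. The sketch below computes that ratio from the proportions printed above, purely as a reference point. The proportions cover the full `data` frame (train and test together), and the variable names are assumptions for illustration.

```python
# Class proportions as printed by value_counts(normalize=True) above.
neg_share = 0.85575
pos_share = 0.14425

# Heuristic: weight positives by the negative-to-positive ratio.
ratio = neg_share / pos_share
print(round(ratio, 2))  # roughly 5.93
```

The model below uses a weight of 8, which is somewhat higher than this ratio.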

In [None]:
train_baskets = train.groupby(['week', 'customer_id'])['article_id'].count().values

In [None]:
train_baskets

array([ 1, 17,  3, ..., 13, 30, 15])
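
LightGBM's ranker expects `group` to be a list of group sizes for consecutive rows, so the sizes must follow the same order as the rows of the training frame. The sketch below checks this on a made-up frame (`toy_train` and `toy_groups` are hypothetical names) that is already sorted by week and customer_id, as `data` is above.

```python
import pandas as pd

# Hypothetical toy training frame, already sorted by week and customer_id.
toy_train = pd.DataFrame({
    "week":        [100, 100, 100, 101, 101],
    "customer_id": [1,   1,   2,   1,   1],
    "article_id":  [10,  20,  10,  30,  40],
})

# groupby sorts its keys, which matches the sort order of the rows.
toy_groups = toy_train.groupby(["week", "customer_id"])["article_id"].count().values

# Each entry is the number of consecutive rows in one (week, customer) basket,
# so the sizes must add up to the number of rows.
assert toy_groups.sum() == len(toy_train)
print(toy_groups.tolist())
```

If the frame were not sorted by the same keys, the sizes would still sum correctly but would no longer describe contiguous blocks of rows, which is why the `sort_values` call above comes before the split.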

In [None]:
# features used by the ranker; further feature engineering could be explored
columns_to_use = ['article_id', 'product_type_no', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id',
'perceived_colour_master_id', 'department_no', 'index_code',
'index_group_no', 'section_no', 'garment_group_no', 'age', 'bestseller_rank']

In [None]:
%%time

X_train = train[columns_to_use]
y_train = train['purchased']

CPU times: user 65.3 ms, sys: 579 µs, total: 65.9 ms
Wall time: 64.7 ms


In [None]:
test_baskets = test.groupby(['week', 'customer_id'])['article_id'].count().values

In [None]:
test_baskets

array([13, 13, 13, ..., 14, 13, 16])

In [None]:
X_test = test[columns_to_use]
y_test = test['purchased']

## Modelling

In [None]:
from lightgbm.sklearn import LGBMRanker

In [None]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=5,
    scale_pos_weight=8,
    importance_type='gain',
    verbose=10,
)

In [None]:
%%time

ranker = ranker.fit(
    X_train,
    y_train,
    group=train_baskets,
    eval_set=[(X_test, y_test)],
    eval_group=[test_baskets], 
    eval_at=[12,20,25], 
    early_stopping_rounds=50
)

[1]	valid_0's ndcg@12: 0.98269	valid_0's ndcg@20: 0.983556	valid_0's ndcg@25: 0.983665
[2]	valid_0's ndcg@12: 0.977037	valid_0's ndcg@20: 0.978033	valid_0's ndcg@25: 0.978153
[3]	valid_0's ndcg@12: 0.974776	valid_0's ndcg@20: 0.975787	valid_0's ndcg@25: 0.975921
[4]	valid_0's ndcg@12: 0.974135	valid_0's ndcg@20: 0.975177	valid_0's ndcg@25: 0.975315
[5]	valid_0's ndcg@12: 0.973656	valid_0's ndcg@20: 0.974699	valid_0's ndcg@25: 0.974836
CPU times: user 19.8 s, sys: 100 ms, total: 19.9 s
Wall time: 3.7 s


In [None]:
for i in ranker.feature_importances_.argsort()[::-1]:
    print(columns_to_use[i], ranker.feature_importances_[i]/ranker.feature_importances_.sum())

NotFittedError: ignored

The feature importances show that 'bestseller_rank' was the most important feature, with a much higher importance than any other feature.

In [None]:
%time

test['preds'] = ranker.predict(X_test)

c_id2predicted_article_ids = test \
    .sort_values(['customer_id', 'preds'], ascending=False) \
    .groupby('customer_id')['article_id'].apply(list).to_dict()

bestsellers_last_week = \
    bestsellers_previous_week[bestsellers_previous_week.week == bestsellers_previous_week.week.max()]['article_id'].tolist()

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


In [None]:
test.sort_values("preds", ascending=False)

Unnamed: 0.1,t_dat,customer_id,article_id,price,sales_channel_id,week,purchased,bestseller_rank,Unnamed: 0,product_code,...,garment_group_name,detail_desc,kangol,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,preds
4553572,2020-09-16,2980539755859949658,908799002,0.027102,2,104,1.0,999.0,103033,908799,...,19,27340,0,1,1,0,1,45,177854,0.217927
4370366,2020-09-11,2182892762944674298,909371001,0.033881,1,104,0.0,999.0,103109,909371,...,19,27262,0,0,0,0,0,26,6276,0.217927
7001954,2020-09-10,13711006276754330551,923020001,0.067780,2,104,0.0,999.0,104429,923020,...,19,26020,0,1,1,0,1,24,17051,0.217927
7001955,2020-09-10,13711006276754330551,909371001,0.033881,2,104,0.0,999.0,103109,909371,...,19,27262,0,1,1,0,1,24,17051,0.217927
7015931,2020-09-11,13773443448661361320,923028002,0.135576,2,104,0.0,999.0,104431,923028,...,19,19560,0,1,1,0,1,41,40553,0.217927
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7403977,2020-08-26,15460656271975412889,923758001,0.033478,2,104,0.0,12.0,104527,923758,...,6,25984,0,1,1,0,1,40,6075,-0.211990
4187043,2020-08-30,1367164984103984313,923758001,0.033478,2,104,0.0,12.0,104527,923758,...,6,25984,0,0,0,0,0,30,151117,-0.211990
5765547,2020-09-11,8274906784288763104,923758001,0.033478,1,104,0.0,12.0,104527,923758,...,6,25984,0,1,1,0,1,27,140613,-0.211990
7403963,2020-09-14,15460643812847391449,923758001,0.033478,1,104,0.0,12.0,104527,923758,...,6,25984,0,1,1,0,1,72,0,-0.211990


## Submission

In [None]:
sub = pd.read_csv('/content/drive/MyDrive/datasets/sample_submission.csv')

In [None]:
sub.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0924243001 0909370001 0865799006 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243001 0924243002 0918522001 0923758001 08...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0924243001 0762846027 0918522001 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243001 0924243002 0918522001 0923758001 08...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0924243001 0924243002 0918522001 0923758001 08...


In [None]:
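def hex_id_to_int(hex_id):
    # interpret the last 16 hex characters of the id as an integer
    return int(hex_id[-16:], 16)

def customer_hex_id_to_int(series):
    # convert a Series of hex customer ids to the integer ids used elsewhere in this notebook
    return series.str[-16:].apply(hex_id_to_int)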
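
As a quick check of the conversion, the sketch below applies the same logic as the helper above to a made-up 64-character hex id. The id is hypothetical and not a real customer.

```python
def hex_id_to_int(hex_id):
    """Same logic as above: interpret the last 16 hex characters as an integer."""
    return int(hex_id[-16:], 16)

# Hypothetical 64-character hex id (not a real customer).
example_id = "0" * 48 + "00000000000000ff"
print(hex_id_to_int(example_id))  # 255
```

Sixteen hex characters encode 64 bits, so every converted id fits in an unsigned 64-bit integer.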

In [None]:
%%time
preds = []
for c_id in customer_hex_id_to_int(sub.customer_id):
    pred = c_id2predicted_article_ids.get(c_id, [])
    pred = pred + bestsellers_last_week
    preds.append(pred[:12])

CPU times: user 4.27 s, sys: 180 ms, total: 4.45 s
Wall time: 4.42 s
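
The loop above appends last week's bestsellers after each customer's ranked articles and truncates to 12, so a bestseller that the model already ranked can appear twice in one prediction. A possible variant, shown here only as a sketch and not what this notebook submitted, removes repeats while preserving order before truncating. The helper name `pad_with_bestsellers` is hypothetical.

```python
def pad_with_bestsellers(ranked, bestsellers, k=12):
    """Return up to k unique ids: ranked items first, then bestsellers as fallback."""
    seen = set()
    result = []
    for article_id in list(ranked) + list(bestsellers):
        if article_id not in seen:
            seen.add(article_id)
            result.append(article_id)
        if len(result) == k:
            break
    return result

print(pad_with_bestsellers([5, 7], [7, 8, 9], k=4))  # [5, 7, 8, 9]
```

Whether this changes the score would need to be tested; it simply avoids spending one of the 12 slots on a repeated article.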


In [None]:
preds = [' '.join(['0' + str(p) for p in ps]) for ps in preds]
sub.prediction = preds

In [None]:
sub.to_csv('/content/drive/MyDrive/datasets/LGBM5w_NDCG@122025ES.csv', index=False)

The Kaggle score returned for MAP@12 is 0.0199. Other LGBM versions with different features, hyperparameters and numbers of weeks scored higher on their own. However, this version was part of the ensemble that returned the highest MAP@12, so it is the one kept for presentation.
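
For reference, MAP@12 can be estimated locally before submitting. The sketch below is my reading of the metric, not the official Kaggle implementation: for each customer, precision is accumulated at every rank where a purchased article appears among the first 12 predictions, divided by min(number of purchases, 12), and then averaged over customers. The function names are assumptions, and the sketch expects the caller to pass only customers who bought something.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer; `actual` holds the purchased article ids."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # count each relevant article once, even if it is predicted twice
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)

def map_at_k(actuals, predictions, k=12):
    """Mean of AP@k over customers; both arguments are parallel lists."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)

# Toy check: AP = (1/1 + 2/3) / 2 for the first customer, 1/2 for the second.
toy_map = map_at_k([[1, 2], [5]], [[1, 3, 2], [4, 5]])
print(toy_map)
```

Holding out week 104 and scoring predictions with a function like this gives a local estimate that can be compared against the leaderboard value.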