The other two notebooks have been fine in the sense that I learned a lot about the problem.

I got a sense for how candidate generation feels, what trends might exist in the data, etc.

Let's now try generating a submission by implementing a full (albeit small) data processing pipeline, at the end of which we will train an LGBM model and make a submission.

Let's get started.

# Feature (and candidate) Engineering

In [1]:
# import pandas as pd
import modin.pandas as pd
# import swifter
import numpy as np

from distributed import Client

client = Client(n_workers=16)

In [2]:
VALID_RUN = True

In [3]:
%%time
# https://www.kaggle.com/paweljankiewicz/hm-create-dataset-samples

transactions = pd.read_csv('data/transactions_train.csv', dtype={"article_id": "str"})
customers = pd.read_csv('data/customers.csv')
articles = pd.read_csv('data/articles.csv', dtype={"article_id": "str"})

CPU times: user 6.08 s, sys: 2.31 s, total: 8.39 s
Wall time: 13 s


We can skip customers and articles for now; the heart of the problem is the transactions.

We want to use features we generate for weeks up to and including week $t$ to generate predictions for week $t+1$.

Of course, that is just one way to structure the problem. We could treat it as a time series problem where we look at the sequence of weeks $t_1, \dots, t_n$ and predict the purchases for week $t_{n+1}$. There are many ways to frame the problem.

But we want to do something that
* is simple
* has a chance of lending itself well to gradient boosted trees

In line with [the suggestion from Pawel Jankiewicz](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307288), we will predict baskets. And to make things even simpler, we will aggregate purchases by weeks.
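
To make the weekly aggregation concrete, here is a minimal sketch in plain pandas (the notebook itself runs on modin). The `toy` frame, its ids and its dates are made up for illustration, and the sketch assumes ISO calendar weeks. A dense rank turns the year/week key into consecutive integers, so "next week" is always `week + 1`.

```python
import pandas as pd

# made-up transactions, only for illustration
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b"],
    "article_id": ["x", "y", "x", "z"],
    "t_dat": ["2020-09-01", "2020-09-03", "2020-09-08", "2020-09-16"],
})

# ISO year and ISO week of every purchase date
iso = pd.to_datetime(toy["t_dat"], format="%Y-%m-%d").dt.isocalendar()
week_key = iso["year"].astype(int) * 100 + iso["week"].astype(int)

# dense rank: the earliest week becomes 1, the next distinct week 2, and so on
toy["week"] = week_key.rank(method="dense").astype(int)

print(toy[["t_dat", "week"]])
```

The first two purchases fall in the same calendar week and therefore share a week index.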

In [4]:
# %%time

# for sample_repr, sample in [("01", 0.001), ("1", 0.01), ("5", 0.05)]:
#     print(sample)
#     customers_sample = customers.sample(int(customers.shape[0]*sample), replace=False)
#     customers_sample_ids = set(customers_sample["customer_id"])
#     transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
#     articles_sample_ids = set(transactions_sample["article_id"])
#     articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]
#     customers_sample.to_csv(f"data/customers_sample_{sample_repr}.csv.gz", index=False)
#     transactions_sample.to_csv(f"data/transactions_train_sample_{sample_repr}.csv.gz", index=False)
#     articles_sample.to_csv(f"data/articles_train_sample_{sample_repr}.csv.gz", index=False)

In [5]:
# transactions = pd.read_csv('data/transactions_train_sample_01.csv.gz', dtype={"article_id": "str"})
# customers = pd.read_csv('data/customers_sample_01.csv.gz')
# articles = pd.read_csv('data/articles_train_sample_01.csv.gz', dtype={"article_id": "str"})

In [6]:
%%time

transactions['week'] = pd.to_datetime(transactions.t_dat, format='%Y-%m-%d') \
    .apply(lambda t: t.year * 100 + t.week) \
    .rank(method='dense') \
    .astype('int')

transactions.drop(columns='t_dat', inplace=True)

CPU times: user 10.1 s, sys: 1.42 s, total: 11.5 s
Wall time: 35.3 s


In [7]:
transactions = transactions[transactions.week > 97]

In [8]:
# I'm using the last week for local validation

if VALID_RUN: transactions = transactions[transactions.week != transactions.week.max()]
transactions.reset_index(inplace=True, drop=True)

In [9]:
%%time

transactions['purchased'] = 1 # our positive examples

transactions.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)



In [10]:
%%time

bought_previous_week = transactions.copy()
bought_previous_week.week += 1

bought_previous_week.purchased = 0 # negative examples, possibly
bought_previous_week['bought_previous_week'] = 1

In [11]:
%%time

# updating true sales already in transactions with information on whether an article was bought by a customer
# in the previous week

transactions = pd.merge(
    transactions,
    bought_previous_week[['customer_id', 'article_id', 'week', 'bought_previous_week']],
    on=['customer_id', 'article_id', 'week'], how='left'
)

transactions.bought_previous_week.fillna(0, inplace=True)

In [12]:
%%time

# adding our negative examples

transactions = pd.concat([transactions, bought_previous_week])

`transactions` now contains some fake transactions. Notice, though, that I always add these possibly fake transactions (for instance, negative examples where the customer did in fact buy that article in that week) at the bottom of the dataframe.

This insight is key. I will be able to remove them by using `drop_duplicates` once I am done with generating negative examples (or data to predict on for the week in the test set).

Once you trace what is going on here and spot the techniques I am using to generate candidate purchases / negative examples, you will have everything to completely rock this competition! This is one framing of the problem, but a very good one.
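
The trick described above can be traced on a tiny example. This is a sketch in plain pandas with made-up rows, not the notebook's data: real purchases come first, shifted copies are appended as candidate negatives, and `drop_duplicates` (which keeps the first occurrence by default) discards any candidate that collides with a real purchase.

```python
import pandas as pd

# made-up real purchases (positives)
real = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["x", "x", "y"],
    "week": [1, 2, 1],
    "purchased": [1, 1, 1],
})

# candidates: "bought last week", so shift the week forward by one
candidates = real.copy()
candidates["week"] += 1
candidates["purchased"] = 0  # negative examples, possibly

# real rows first, candidates at the bottom
combined = pd.concat([real, candidates], ignore_index=True)

# keep='first' is the default: on a key collision the real row survives
deduped = combined.drop_duplicates(["customer_id", "article_id", "week"])

print(deduped)
```

Customer `a` really did buy article `x` in week 2, so the candidate row for `(a, x, 2)` with `purchased == 0` is dropped and only the true purchase remains.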

In [13]:
transactions

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,bought_previous_week
0,000c41868d0170bf1e022a985a37f52344ba14ca5c331b...,0854338008,0.016932,1,98,1,0.0
1,000c41868d0170bf1e022a985a37f52344ba14ca5c331b...,0684209026,0.015237,1,98,1,0.0
2,000c41868d0170bf1e022a985a37f52344ba14ca5c331b...,0844294002,0.015237,1,98,1,0.0
3,000c41868d0170bf1e022a985a37f52344ba14ca5c331b...,0599580061,0.008458,1,98,1,0.0
4,0011c0c71f6e3a871a9520a4055ed5a9b0d2f428e35b77...,0787946003,0.016932,1,98,1,0.0
...,...,...,...,...,...,...,...
2491886,fff7e7674509592818bf453391af43a85eaaac9a52d858...,0624486049,0.013542,1,107,0,1.0
2491887,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,0717490010,0.008458,2,107,0,1.0
2491888,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,0717490058,0.008458,2,107,0,1.0
2491889,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,0717490057,0.008458,2,107,0,1.0


In [14]:
%%time

bestsellers_previous_week = pd.DataFrame._to_pandas(transactions) \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').head(20) \
    .groupby('week').rank(method='dense', ascending=False) \
    .to_frame('bestseller_previous_week_rank').reset_index()
bestsellers_previous_week = pd.DataFrame(bestsellers_previous_week)

bestsellers_previous_week.week += 1

bestsellers_previous_week = bestsellers_previous_week[bestsellers_previous_week.week != bestsellers_previous_week.max().week.item()]



CPU times: user 4.64 s, sys: 1.03 s, total: 5.66 s
Wall time: 6.82 s


In [15]:
bestsellers_previous_week

Unnamed: 0,week,article_id,bestseller_previous_week_rank
0,99,0372860002,1.0
1,99,0827968001,2.0
2,99,0717490064,3.0
3,99,0706016003,4.0
4,99,0760084003,5.0
...,...,...,...
175,107,0863646001,16.0
176,107,0896169005,17.0
177,107,0715624001,18.0
178,107,0762846031,19.0
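
For readers who want to see how a table like the one above comes together, here is a minimal sketch in plain pandas on made-up sales. `TOP_K`, `n_sold` and the toy rows are assumptions for illustration (the notebook keeps the top 20 articles per week). The ranks are computed within each week and the week is then shifted forward by one, so that each row describes last week's bestsellers.

```python
import pandas as pd

# made-up sales: one row per sold item
sales = pd.DataFrame({
    "week": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "article_id": ["x", "x", "x", "y", "y", "z", "y", "y", "x"],
})
TOP_K = 2  # hypothetical cut-off for this toy example

# units sold per (week, article), best sellers first within each week
counts = (
    sales.groupby(["week", "article_id"]).size()
    .rename("n_sold").reset_index()
    .sort_values(["week", "n_sold"], ascending=[True, False])
)

# keep the top K articles of every week and rank them by units sold
top = counts.groupby("week").head(TOP_K).copy()
top["bestseller_previous_week_rank"] = (
    top.groupby("week")["n_sold"].rank(method="dense", ascending=False)
)

# a bestseller of week t is a feature (and a candidate) for week t + 1
top["week"] += 1

print(top[["week", "article_id", "bestseller_previous_week_rank"]])
```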


In [16]:
transactions = pd.merge(transactions, bestsellers_previous_week, on=['week', 'article_id'], how='left')

transactions.bestseller_previous_week_rank.fillna(-1, inplace=True)

In [17]:
%%time

negative_examples_bestsellers_previous_week = pd.merge( # negative examples AND candidates for week 108!
    transactions[['customer_id', 'week', 'sales_channel_id']].drop_duplicates(),
    bestsellers_previous_week, how='outer', on='week')



In [18]:
negative_examples_bestsellers_previous_week.shape

(24879519, 5)
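
The row count grows quickly because every (customer, week) pair is combined with every bestseller of the preceding week. Below is a small sketch of that join in plain pandas with made-up frames. It uses `how="inner"` for brevity, whereas the notebook uses `how="outer"`, which additionally keeps pairs from weeks that have no bestseller list (these show up as rows with missing `article_id`).

```python
import pandas as pd

# made-up (customer, week) pairs: who was active in which week
active = pd.DataFrame({
    "customer_id": ["a", "b", "a"],
    "week": [2, 2, 3],
})

# made-up bestseller table, already shifted so that `week` is the target week
bestsellers = pd.DataFrame({
    "week": [2, 2, 3, 3],
    "article_id": ["x", "y", "y", "x"],
    "bestseller_previous_week_rank": [1, 2, 1, 2],
})

# each active pair receives every bestseller of its week as a candidate
candidates = active.merge(bestsellers, on="week", how="inner")

print(candidates)
```

Three active pairs times two bestsellers give six candidate rows; with 20 bestsellers per week the multiplication is what produces tens of millions of rows.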

In [19]:
%%time

# need to recover price

negative_examples_bestsellers_previous_week.week -= 1
negative_examples_bestsellers_previous_week = pd.merge(
    negative_examples_bestsellers_previous_week,
    transactions[['week', 'article_id', 'price']].groupby(['week', 'article_id']).mean().reset_index(), # mean of prices
                                                                                                        # across days / channels
    on=['week', 'article_id'],
    how='left'
) # not perfect without the 'sales_channel_id' but probably good enough
negative_examples_bestsellers_previous_week.week += 1

In [20]:
negative_examples_bestsellers_previous_week

Unnamed: 0,customer_id,week,sales_channel_id,article_id,bestseller_previous_week_rank,price
0,000c41868d0170bf1e022a985a37f52344ba14ca5c331b...,98,1,,,
1,0011c0c71f6e3a871a9520a4055ed5a9b0d2f428e35b77...,98,1,,,
2,00130faf36f2571cf7e08451b317545004a8a85327661f...,98,2,,,
3,001824e6c2a853c60b4998d3f402f01b6864d0a1f5ed01...,98,2,,,
4,00201f34c8c92683263346d78c2b45ffd0c6927229542d...,98,2,,,
...,...,...,...,...,...,...
24879514,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,107,2,0863646001,16.0,0.033373
24879515,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,107,2,0896169005,17.0,0.049964
24879516,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,107,2,0715624001,18.0,0.024931
24879517,fff871bf24b40fd1290215414d760afaa69bb164d2b970...,107,2,0762846031,19.0,0.025006


In [21]:
transactions = pd.concat([transactions, negative_examples_bestsellers_previous_week])
transactions.reset_index(inplace=True, drop=True)

Let's generate candidate examples for all the customers in the test set as well.

In [22]:
test_data = pd.read_csv('data/sample_submission.csv')[['customer_id']]

test_data['sales_channel_id'] = 1

test_data_2 = test_data.copy()
test_data_2['sales_channel_id'] = 2

test_data = pd.concat([test_data, test_data_2])

In [23]:
negative_examples_bestsellers_previous_week = \
    negative_examples_bestsellers_previous_week[
        negative_examples_bestsellers_previous_week.week == negative_examples_bestsellers_previous_week.week.max()
]

In [29]:
%%time

bestseller_article_info = negative_examples_bestsellers_previous_week.drop_duplicates(['article_id', 'sales_channel_id'])[['article_id', 'sales_channel_id', 'bestseller_previous_week_rank', 'price']]

CancelledError: 

In [None]:
%%time

test_candidates = pd.merge( # negative examples AND candidates for week 108!
    test_data,
    bestseller_article_info,
    how='outer',
    on='sales_channel_id'
)

test_candidates['week'] = transactions.week.max()

transactions = pd.concat([transactions, test_candidates])

In [26]:
%%time

transactions.bought_previous_week.fillna(0, inplace=True)
transactions.purchased.fillna(0, inplace=True)



In [None]:
%%time

# removing "fake", incorrect transactions
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

In [None]:
transactions.shape

In [None]:
%%time
# Yet another way to create negative examples, to further differentiate between user preferences.

negative_transactions = transactions[transactions.purchased == 1].copy()
negative_transactions.purchased = 0
negative_transactions.bought_previous_week = 0

In [None]:
%%time

negative_transactions.customer_id = negative_transactions.groupby('week')['customer_id'].transform(np.random.permutation)
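
The shuffling step can be sketched on a toy frame as follows. The data and the seeded generator `rng` are assumptions for illustration; the idea is simply that, within each week, the purchased articles are reassigned to random customers who were active that week.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# made-up positive examples
positives = pd.DataFrame({
    "customer_id": ["a", "b", "c", "a", "b"],
    "article_id": ["x", "y", "z", "y", "x"],
    "week": [1, 1, 1, 2, 2],
    "purchased": 1,
})

negatives = positives.copy()
negatives["purchased"] = 0

# permute customer ids inside each week; articles and weeks stay where they are
negatives["customer_id"] = (
    negatives.groupby("week")["customer_id"]
    .transform(lambda s: rng.permutation(s.to_numpy()))
)

print(negatives)
```

Some shuffled rows will coincide with true purchases. Because the real rows sit above the shuffled ones after concatenation, a subsequent `drop_duplicates` keeps the true purchase and discards the accidental negative.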

In [None]:
%%time

transactions = pd.concat([transactions, negative_transactions])
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

In [None]:
transactions.shape

Let's merge `customers` and `articles` with `transactions`.

In [None]:
articles.drop(columns=['detail_desc', 'prod_name'], inplace=True)

In [None]:
%%time

transactions = pd.merge(transactions, articles, on='article_id', how='left')
transactions = pd.merge(transactions, customers, on='customer_id', how='left')

In [None]:
transactions.shape

In [None]:
%%time

# we need to sort transactions to create baskets,
# where a basket is a sequence of records
# representing purchases (and purchase candidates / negative examples)
# for a customer for a given week
transactions.sort_values(['customer_id', 'week'], inplace=True)



In [None]:
%%time

transactions.to_pickle(f"data/transactions_last_10_weeks_sorted.pkl")
# transactions = pd.read_pickle(f"data/transactions_last_10_weeks_sorted.pkl")

# Dataset preparation

In [40]:
first_two_weeks = transactions.week.sort_values().unique()[:2]

test_set_week = transactions.week.max()
valid_set_week = test_set_week - 1
train_set_weeks = set(transactions.week) - set([test_set_week]) - set([valid_set_week])

In [41]:
%%time

transactions = transactions[~transactions.week.isin(set(first_two_weeks))]

test_candidates = transactions[transactions.week == test_set_week]
train_set = transactions[transactions.week != test_set_week]

In [43]:
train_set.shape, test_candidates.shape

((25358610, 36), (55077568, 36))

In [44]:
%%time

train_X = train_set[train_set.week.isin(train_set_weeks)]
valid_X = train_set[train_set.week == valid_set_week]

train_y = train_X['purchased']
valid_y = valid_X['purchased']

train_X = train_X.drop(columns='purchased')
valid_X = valid_X.drop(columns='purchased')

In [45]:
del transactions

In [46]:
test_X = test_candidates.drop(columns='purchased')
del test_candidates

In [47]:
%%time

train_baskets = train_X.groupby(['customer_id', 'week'])['article_id'].count().values
valid_baskets = valid_X.groupby(['customer_id', 'week'])['article_id'].count().values
test_baskets = test_X.groupby(['customer_id', 'week'])['article_id'].count().values

CPU times: user 19.3 s, sys: 1.29 s, total: 20.6 s
Wall time: 20.6 s
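
The basket arrays above are group sizes: LightGBM's ranker reads the rows as consecutive blocks, one block per query, and the sizes must add up to the number of rows. That only works if the rows are sorted by the same keys that define the groups. A minimal sketch with made-up rows:

```python
import pandas as pd

# made-up candidate rows, deliberately out of order
rows = pd.DataFrame({
    "customer_id": ["b", "a", "a", "b", "a"],
    "week": [1, 1, 2, 1, 1],
    "article_id": ["x", "y", "z", "y", "x"],
})

# sort so that every (customer, week) basket is one contiguous block
rows = rows.sort_values(["customer_id", "week"], kind="stable").reset_index(drop=True)

# groupby sorts its keys by default, so the sizes come out in row order
group_sizes = rows.groupby(["customer_id", "week"])["article_id"].count().to_numpy()

print(group_sizes)
print(group_sizes.sum() == len(rows))
```

If the frame were not sorted, the sizes would still sum to the row count, but the blocks would mix rows from different baskets.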


# Vectorize datasets

In [48]:
import lightgbm
lightgbm.__version__

'3.3.2'

In [49]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

from sklearn.compose import ColumnTransformer
from lightgbm.sklearn import LGBMRanker

In [50]:
# https://github.com/logicai-io/recsys2019/blob/master/src/recsys/transformers.py
# https://github.com/logicai-io/recsys2019/blob/master/src/recsys/vectorizers.py

class PandasToRecords(BaseEstimator, TransformerMixin):
    def fit(self, X, *arg):
        return self

    def transform(self, X):
        return X.to_dict(orient="records")

class SparsityFilter(BaseEstimator, TransformerMixin):
    def __init__(self, min_nnz=None):
        self.min_nnz = min_nnz

    def fit(self, X, y=None):
        self.sparsity = X.getnnz(0)
        return self

    def transform(self, X):
        return X[:, self.sparsity >= self.min_nnz]

class PandasToNpArray(BaseEstimator, TransformerMixin):
    def fit(self, X, *arg):
        return self

    def transform(self, X):
        return X.values.astype(float)

In [51]:
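class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples

    def fit(self, X, y=None):
        # for every column, keep only the values seen more than `min_examples` times
        self.categories = {}
        for col in X.columns:
            vc = X[col].value_counts()
            self.categories[col] = vc[vc > self.min_examples].index.tolist()
        return self

    def transform(self, X):
        # encode each column as integer codes; rare or unseen values become -1
        return pd.DataFrame(data={
            col: pd.Categorical(X[col], categories=self.categories[col]).codes
            for col in X.columns
        })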
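
The encoding inside `Categorize` relies on how `pd.Categorical` treats values outside the given category list. Here is a small sketch with made-up colour values; the names `train_col`, `new_col` and the threshold are assumptions for illustration.

```python
import pandas as pd

# made-up training column: 'green' is rare
train_col = pd.Series(["red", "red", "red", "blue", "blue", "green"])
min_examples = 1

# keep only values that occur more than `min_examples` times
vc = train_col.value_counts()
categories = vc[vc > min_examples].index.tolist()

# at transform time, anything outside `categories` is encoded as -1
new_col = pd.Series(["blue", "green", "red", "purple"])
codes = pd.Categorical(new_col, categories=categories).codes

print(categories)
print(codes)
```

Rare values (`green`) and values never seen during fitting (`purple`) both end up with the code -1, so they share a single "other" bucket.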

In [52]:
categorical_variables = ['article_id', 'postal_code', 'sales_channel_id', 'bought_previous_week', 'product_type_name', 'product_group_name', 
 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name',
 'index_code', 'index_name', 'index_group_name', 'section_name', 'garment_group_name',
 'FN', 'Active', 'club_member_status', 'fashion_news_frequency', 'graphical_appearance_name',
 'bestseller_previous_week_rank', 'product_type_no', 'colour_group_code']
numerical_variables = ['graphical_appearance_no', 'perceived_colour_value_id', 'perceived_colour_master_id', 
 'department_no', 'index_group_no', 'section_no', 'garment_group_no', 'age', 'price', 'product_code']

In [53]:
pipe = make_pipeline(
    ColumnTransformer(
        [   
            (
                "customer_id",
                Categorize(150),
                ['customer_id'],
            ),
            (
                "categorical",
                Categorize(60),
                categorical_variables,
            ),
            (
                "numerical",
                make_pipeline(PandasToNpArray(), SimpleImputer(strategy="constant", fill_value=-1)),
                numerical_variables,
            ),
        ]
    )
)

In [55]:
# pipe = make_pipeline(
#     ColumnTransformer(
#         [   
#             (
#                 "customer_id",
#                 make_pipeline(PandasToRecords(), DictVectorizer(), SparsityFilter(min_nnz=150)),
#                 ['customer_id'],
#             ),
#             (
#                 "categorical",
#                 make_pipeline(PandasToRecords(), DictVectorizer(), SparsityFilter(min_nnz=60)),
#                 categorical_variables,
#             ),
#             (
#                 "numerical",
#                 make_pipeline(PandasToNpArray(), SimpleImputer(strategy="constant", fill_value=-1)),
#                 numerical_variables,
#             ),
#         ]
#     )
# )

In [1]:
%%time
train_X_vec = pipe.fit_transform(train_X)

NameError: name 'pipe' is not defined

In [2]:
%%time
valid_X_vec = pipe.transform(valid_X)

NameError: name 'pipe' is not defined

In [3]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=10,
    num_rounds=3,
)

NameError: name 'LGBMRanker' is not defined

In [None]:
%%time

r = ranker.fit(
    train_X_vec,
    train_y,
    group=train_baskets,
    eval_set=[(valid_X_vec, valid_y)],
    eval_group=[valid_baskets],
)

In [62]:
ranker.feature_importances_.argsort()

array([  0, 288, 287, 286, 285, 284, 283, 282, 280, 279, 278, 289, 277,
       275, 274, 273, 272, 271, 270, 269, 268, 267, 266, 276, 265, 290,
       292, 315, 314, 313, 312, 311, 310, 309, 308, 307, 305, 291, 304,
       302, 301, 300, 299, 298, 297, 296, 295, 294, 293, 303, 316, 264,
       262, 233, 232, 231, 230, 229, 228, 227, 226, 225, 224, 234, 223,
       221, 220, 219, 218, 217, 216, 214, 213, 212, 211, 222, 263, 235,
       237, 261, 260, 259, 258, 257, 255, 254, 253, 252, 251, 236, 250,
       247, 246, 245, 244, 243, 242, 241, 240, 239, 238, 249, 317, 318,
       319, 399, 398, 397, 396, 395, 394, 393, 392, 391, 389, 401, 388,
       386, 385, 384, 383, 382, 381, 380, 379, 378, 376, 387, 375, 402,
       404, 426, 425, 424, 423, 422, 421, 420, 419, 418, 417, 403, 416,
       414, 413, 412, 411, 410, 409, 408, 407, 406, 405, 415, 374, 373,
       372, 343, 342, 341, 340, 339, 338, 337, 336, 335, 333, 344, 331,
       329, 328, 327, 326, 325, 324, 323, 322, 321, 320, 330, 34

In [63]:
%%time

ranker.fit(
    train_X_vec,
    train_y,
    group=train_baskets,
    eval_set=[(valid_X_vec, valid_y)],
    eval_group=[valid_baskets],
)

[1]	valid_0's ndcg@1: 0.8	valid_0's ndcg@2: 0.853102	valid_0's ndcg@3: 0.877094	valid_0's ndcg@4: 0.893132	valid_0's ndcg@5: 0.899708
CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s
Wall time: 43.2 ms


LGBMRanker(boosting_type='dart', metric='ndcg', n_estimators=10, num_rounds=1,
           objective='lambdarank')

In [64]:
%%time

ranker.fit(
    train_X_vec,
    train_y,
    group=train_baskets,
    eval_set=[(valid_X_vec, valid_y)],
    eval_group=[valid_baskets],
)

#  20 ests

[1]	valid_0's ndcg@1: 0.8	valid_0's ndcg@2: 0.853102	valid_0's ndcg@3: 0.877094	valid_0's ndcg@4: 0.893132	valid_0's ndcg@5: 0.899708
CPU times: user 1.19 s, sys: 14.8 ms, total: 1.21 s
Wall time: 50.3 ms


LGBMRanker(boosting_type='dart', metric='ndcg', n_estimators=10, num_rounds=1,
           objective='lambdarank')

In [65]:
%%time

ranker.fit(
    train_X_vec,
    train_y,
    group=train_baskets,
    eval_set=[(valid_X_vec, valid_y)],
    eval_group=[valid_baskets],
)

# 10 ests

[1]	valid_0's ndcg@1: 0.8	valid_0's ndcg@2: 0.853102	valid_0's ndcg@3: 0.877094	valid_0's ndcg@4: 0.893132	valid_0's ndcg@5: 0.899708
CPU times: user 1.16 s, sys: 19.4 ms, total: 1.18 s
Wall time: 49.1 ms


LGBMRanker(boosting_type='dart', metric='ndcg', n_estimators=10, num_rounds=1,
           objective='lambdarank')

In [66]:
del valid_X, train_X
del valid_X_vec, train_X_vec

In [67]:
import gc
gc.collect()

241

In [68]:
del train_set

In [69]:
import joblib
# save model
# joblib.dump(ranker, 'data/ranker.pkl')
# load model
ranker = joblib.load('data/ranker.pkl')

In [None]:
test_X = pd.read_pickle('data/test_X.pkl')

In [None]:
test_baskets = pd.read_pickle('data/test_baskets.pkl')

In [None]:
import joblib
pipe = joblib.load('data/pipe.pkl')

In [None]:
import gc
gc.collect()

In [None]:
%%time

chunk_size = 5_000_000
preds = []

for i in range(0, test_X.shape[0], chunk_size):
    test_X_chunk_vectorized = pipe.transform(test_X.iloc[i:i+chunk_size])
    preds.append(ranker.predict(test_X_chunk_vectorized))

In [None]:
preds = np.concatenate(preds)
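
The chunked prediction loop can be written as a small helper. This is a sketch only: `predict_in_chunks` and the stand-in scoring function are hypothetical names, used here to show that scoring in slices and concatenating gives the same result as scoring everything at once, while only one slice has to be held in memory in vectorized form.

```python
import numpy as np

def predict_in_chunks(predict_fn, X, chunk_size):
    """Apply predict_fn to consecutive row slices of X and stitch the results together."""
    parts = [
        predict_fn(X[start:start + chunk_size])
        for start in range(0, len(X), chunk_size)
    ]
    return np.concatenate(parts)

# stand-in for a fitted model: the score is twice the first feature
def toy_score(chunk):
    return chunk[:, 0] * 2

X = np.arange(10, dtype=float).reshape(-1, 1)
scores = predict_in_chunks(toy_score, X, chunk_size=4)

print(scores.shape)
```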

In [None]:
test_X.shape

In [None]:
preds.shape

In [None]:
pd.to_pickle(preds, 'data/preds.pkl')

In [None]:
test_X.shape

In [None]:
test_X['preds'] = preds

In [None]:
test_X = test_X[['customer_id', 'article_id', 'preds']]

In [None]:
del test_X_chunk_vectorized

In [None]:
del preds

In [None]:
gc.collect()

In [None]:
%%time
cust_id2pred = {}

for grp in test_X.groupby('customer_id'):
    cust_id2pred[grp[0]] = grp[1][['article_id', 'preds']].sort_values('preds', ascending=False)['article_id'].head(12).tolist()
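
Looping over every customer group in Python can be slow on tens of millions of rows. One possible alternative, sketched here in plain pandas on made-up scores, is to sort once and take the head of each group. `TOP_N` is set to 2 only to keep the toy output short (the loop above keeps 12).

```python
import pandas as pd

# made-up scored candidates
scored = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "article_id": ["x", "y", "z", "x", "y"],
    "preds": [0.1, 0.9, 0.5, 0.3, 0.2],
})
TOP_N = 2  # hypothetical cut-off for this toy example

# highest score first within each customer, then keep the first TOP_N rows
top = (
    scored.sort_values(["customer_id", "preds"], ascending=[True, False])
    .groupby("customer_id")
    .head(TOP_N)
)

# one ordered list of article ids per customer
cust_id2pred = top.groupby("customer_id")["article_id"].agg(list).to_dict()

# the submission format is a space-separated string of article ids
as_strings = {cust: " ".join(arts) for cust, arts in cust_id2pred.items()}

print(as_strings)
```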

In [None]:
sub = pd.read_csv('data/sample_submission.csv')

In [None]:
preds_str = []
for c in sub.customer_id:
    preds_str.append(' '.join(cust_id2pred[c]))

In [None]:
sub.prediction = preds_str

In [None]:
sub.prediction

In [None]:
sub.to_csv('data/subs/test.csv', index=False)

In [None]:
from utils import eval_sub

In [None]:
eval_sub('data/subs/test.csv')

In [None]:
eval_sub('data/subs/test.csv', skip_cust_with_no_purchases=0)
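
The `eval_sub` helper lives in a local `utils` module that is not shown here. Assuming it scores a submission with MAP@12, the metric used by this competition, the core computation might look roughly like the sketch below. The function names `apk` and `mapk` and the toy inputs are assumptions, not the helper's actual code.

```python
def apk(actual, predicted, k=12):
    """Average precision at k for one customer."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for i, article in enumerate(predicted[:k]):
        # count each correct article once, weighted by precision at its position
        if article in actual and article not in seen:
            hits += 1
            score += hits / (i + 1)
            seen.add(article)
    return score / min(len(actual), k)

def mapk(actuals, predictions, k=12):
    """Mean of apk over all customers."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predictions)) / len(actuals)

# made-up ground truth and predictions for two customers
actuals = [["x", "y"], ["x"]]
predictions = [["x", "z", "y"], ["z", "x"]]

print(mapk(actuals, predictions))
```

Whether customers without purchases in the validation week are counted as zeros or skipped changes the average, which is presumably what the `skip_cust_with_no_purchases` argument controls.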

In [None]:
!kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f 'data/subs/test.csv' -m 'test'

In [None]:
sub['customer_id'].iloc[0]

In [None]:
del test_X

In [None]:
gc.collect()

In [None]:
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

In [None]:
len(cust_id2pred)

In [None]:
grp[1][['article_id', 'preds']].sort_values('preds', ascending=False)['article_id'].head(12).tolist()

In [None]:
len(cust_id2pred)

In [None]:
grp[0]

In [None]:
pd.to_pickle(test_baskets, 'data/test_baskets.pkl')

In [None]:
joblib.dump(pipe, 'data/pipe.pkl')

In [None]:
ranker.predict(valid_X_vec).shape

In [None]:
import gc

In [None]:
chunks =  np.array_split(test_X.index, 100)

In [None]:
for chunk in chunks:
    break

In [None]:
%%time
o = pipe.transform(test_X.loc[chunk])

In [None]:
a[0].shape