The other two notebooks have been fine in the sense that I learned a lot about the problem.

I got a sense for how candidate generation feels, what trends might exist in the data, etc.

Let's now implement a full (albeit small) data processing pipeline, at the end of which we will train an LGBM model and make a submission.

Let's get started.

# Feature (and candidate) Engineering

In [1]:
import pandas as pd
import numpy as np
import arrow

In [3]:
%%time
# https://www.kaggle.com/paweljankiewicz/hm-create-dataset-samples

transactions = pd.read_csv('data/transactions_train.csv', dtype={"article_id": "str"})
customers = pd.read_csv('data/customers.csv')
articles = pd.read_csv('data/articles.csv', dtype={"article_id": "str"})

CPU times: user 20.7 s, sys: 1.83 s, total: 22.5 s
Wall time: 23.9 s


In [6]:
%%time

for sample_repr, sample in [("01", 0.001), ("1", 0.01), ("5", 0.05)]:
    print(sample)
    customers_sample = customers.sample(int(customers.shape[0]*sample), replace=False)
    customers_sample_ids = set(customers_sample["customer_id"])
    transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
    articles_sample_ids = set(transactions_sample["article_id"])
    articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]
    customers_sample.to_csv(f"data/customers_sample_{sample_repr}.csv.gz", index=False)
    transactions_sample.to_csv(f"data/transactions_train_sample_{sample_repr}.csv.gz", index=False)
    articles_sample.to_csv(f"data/articles_train_sample_{sample_repr}.csv.gz", index=False)

0.001
0.01
0.05
CPU times: user 15.9 s, sys: 53.7 ms, total: 15.9 s
Wall time: 15.9 s


In [2]:
transactions = pd.read_csv('data/transactions_train_sample_01.csv.gz', dtype={"article_id": "str"})
customers = pd.read_csv('data/customers_sample_01.csv.gz')
articles = pd.read_csv('data/articles_train_sample_01.csv.gz', dtype={"article_id": "str"})

We can skip customers and articles for now; the heart of the problem is the transactions.

We want to use features generated from data up to and including week $t$ to generate predictions for week $t+1$.

Of course, that is just one way to structure the problem. We could treat it as a time series problem where we look at the sequence of weeks $t_1, \dots, t_n$ and predict the purchases for week $t_{n+1}$. There are many ways to frame the problem.

But we want to do something that
* is simple
* has a chance of lending itself well to gradient boosted trees

In line with [the suggestion from Pawel Jankiewicz](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307288), we will predict baskets. And to make things even simpler, we will aggregate purchases by weeks.

In [3]:
transactions['week'] = transactions.t_dat \
    .apply(lambda t: arrow.get(t)) \
    .apply(lambda t: t.year * 100 + t.week) \
    .rank(method='dense').astype('int')

transactions.drop(columns='t_dat', inplace=True)
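
The `apply` calls above run Python code once per row, which gets slow on the full dataset. Also, as far as I can tell, `arrow`'s `.week` is the ISO week number while `.year` is the calendar year, so a date like 2018-12-31 (ISO week 1 of 2019) would get the key 201801 and sort before the rest of 2018. Below is a minimal sketch of a vectorized alternative that pairs the ISO year with the ISO week. It assumes `t_dat` parses with `pd.to_datetime`; the `toy` frame and its dates are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `transactions`; the dates are made up for illustration.
toy = pd.DataFrame({"t_dat": ["2018-12-28", "2018-12-31", "2019-01-02", "2019-01-08"]})

# isocalendar() returns a frame with the columns year, week and day (ISO 8601).
iso = pd.to_datetime(toy["t_dat"]).dt.isocalendar()

# ISO year and ISO week belong together, so 2018-12-31 lands in 2019, week 1.
week_key = iso["year"].astype(int) * 100 + iso["week"].astype(int)

# Dense rank turns the sparse keys into consecutive week indices 1, 2, 3, ...
toy["week"] = week_key.rank(method="dense").astype(int)

print(toy)
```

Switching to this would change the week numbering compared to the outputs shown in this notebook, so treat it as an option rather than a drop-in replacement.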

In [4]:
transactions['purchased'] = 1 # our positive examples

transactions.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)

In [5]:
bought_previous_week = transactions.copy()
bought_previous_week.week += 1

bought_previous_week.purchased = 0 # negative examples, possibly
bought_previous_week['bought_previous_week'] = 1

In [6]:
# updating true sales already in transactions with information on whether an article was bought by a customer
# in the previous week

transactions = pd.merge(
    transactions,
    bought_previous_week[['customer_id', 'article_id', 'week', 'bought_previous_week']],
    on=['customer_id', 'article_id', 'week'], how='left'
)

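# assigning the result back avoids relying on in-place modification through attribute access
transactions['bought_previous_week'] = transactions['bought_previous_week'].fillna(0)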
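
To make the shifting trick easier to trace, here is a minimal sketch on a toy frame. The customer and article IDs are made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy purchases (made-up IDs): customer "a" buys article "x" in weeks 1 and 2.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["x", "x", "y"],
    "week": [1, 2, 1],
})

# Shift every purchase one week forward: a row at week t+1 now means "bought in week t".
shifted = toy.copy()
shifted["week"] += 1
shifted["bought_previous_week"] = 1

# The left merge keeps only the real purchases and flags the repeat purchases.
flagged = toy.merge(shifted, on=["customer_id", "article_id", "week"], how="left")
flagged["bought_previous_week"] = flagged["bought_previous_week"].fillna(0).astype(int)

print(flagged)
```

Only the week-2 purchase of `x` by `a` gets the flag, because it is the only row with a matching purchase one week earlier.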

In [7]:
# adding our negative examples

transactions = pd.concat([transactions, bought_previous_week])

`transactions` now contains some fake transactions. Note, however, that I always append these possibly fake rows (for instance, a negative example for an article that the customer did in fact buy that week) at the bottom of the dataframe, below the real ones.

This insight is key. I will be able to remove them by using `drop_duplicates` once I am done with generating negative examples (or data to predict on for the week in the test set).
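
A minimal sketch of why the ordering matters, on made-up toy rows: `drop_duplicates` keeps the first occurrence of each key by default, so a real purchase placed above a conflicting fake row survives.

```python
import pandas as pd

keys = ["customer_id", "article_id", "week"]

# Real purchases come first (made-up toy rows).
real = pd.DataFrame({
    "customer_id": ["a", "a"],
    "article_id": ["x", "x"],
    "week": [1, 2],
    "purchased": [1, 1],
})

# Candidate negatives: the same purchases shifted one week forward and labeled 0.
fake = real.assign(week=real["week"] + 1, purchased=0)

combined = pd.concat([real, fake], ignore_index=True)

# keep="first" is the default: for a repeated key, the earlier (real) row survives.
deduped = combined.drop_duplicates(keys, keep="first")

print(deduped)
```

The fake row for week 2 collides with a real purchase and is dropped, while the fake row for week 3 has no real counterpart and stays as a negative example.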

Once you trace what is going on here and spot the techniques I am using to generate candidate purchases / negative examples, you will have everything to completely rock this competition! This is one framing of the problem, but a very good one.

In [8]:
transactions

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,bought_previous_week
0,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,0613497006,0.033881,2,2,1,0.0
1,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,0474138001,0.033881,2,2,1,0.0
2,2e50a35485613ab376d1d1acd3a1efef779fcfe80bca46...,0686631001,0.042356,1,2,1,0.0
3,4670f2d4f48201418c5f79af70ae2dd5e53affacb21b4a...,0514134001,0.067780,1,2,1,0.0
4,4670f2d4f48201418c5f79af70ae2dd5e53affacb21b4a...,0639685006,0.042356,1,2,1,0.0
...,...,...,...,...,...,...,...
32187,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,0905811003,0.033881,2,108,0,1.0
32188,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,0801673001,0.016932,2,108,0,1.0
32189,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,0886566001,0.033881,2,108,0,1.0
32190,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,0905811002,0.033881,2,108,0,1.0


In [9]:
bestsellers_previous_week = transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').head(20) \
    .groupby('week').rank(method='dense', ascending=False) \
    .to_frame('bestseller_previous_week_rank').reset_index()

bestsellers_previous_week.week += 1

# drop the last shifted week: it lies past the final week present in `transactions`
bestsellers_previous_week = bestsellers_previous_week[
    bestsellers_previous_week.week != bestsellers_previous_week.week.max()
]
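
The ranking step can be sketched on a toy frame as follows. The data, the `n_sold` column name and the cutoff of 2 are made up for illustration. Note that `transactions` at this point already includes the shifted `purchased == 0` rows, so the counts above include them; the sketch counts real rows only, which is an assumption about the intent.

```python
import pandas as pd

# Made-up real purchases: one row per sale.
toy = pd.DataFrame({
    "week": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "article_id": ["x", "x", "x", "y", "y", "z", "y", "y", "x"],
})

# Units sold per (week, article).
counts = toy.groupby(["week", "article_id"]).size().reset_index(name="n_sold")

# Dense rank within each week: the most sold article gets rank 1, ties share a rank.
counts["bestseller_previous_week_rank"] = (
    counts.groupby("week")["n_sold"].rank(method="dense", ascending=False).astype(int)
)

# Keep the top 2 per week (the notebook keeps 20).
top = counts.sort_values(["week", "bestseller_previous_week_rank"]).groupby("week").head(2)

# Ranks computed from week t describe candidates for week t+1.
top = top.assign(week=top["week"] + 1)

print(top)
```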

In [10]:
bestsellers_previous_week

Unnamed: 0,week,article_id,bestseller_previous_week_rank
0,2,0451229014,1.0
1,2,0651335004,1.0
2,2,0658848001,1.0
3,2,0666143001,1.0
4,2,0671515003,1.0
...,...,...,...
2122,108,0372860068,2.0
2123,108,0377277002,2.0
2124,108,0436083001,2.0
2125,108,0448509014,2.0


In [11]:
transactions = pd.merge(transactions, bestsellers_previous_week, on=['week', 'article_id'], how='left')

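# -1 marks articles that were not among last week's bestsellers
transactions['bestseller_previous_week_rank'] = transactions['bestseller_previous_week_rank'].fillna(-1)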

In [12]:
negative_examples_bestsellers_previous_week = pd.merge( # negative examples AND candidates for week 108!
    transactions[['customer_id', 'week', 'sales_channel_id']].drop_duplicates(),
    bestsellers_previous_week, how='outer', on='week')

In [13]:
negative_examples_bestsellers_previous_week.shape

(301767, 5)
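
What the merge on `week` does can be seen on a toy example: every customer active in a week is paired with every bestseller listed for that week. The frames below are made up, and the sketch uses an inner join for simplicity, whereas the cell above uses `how='outer'`, which additionally keeps weeks that appear on only one side.

```python
import pandas as pd

# Made-up customers, one row per (customer, week) in which they were active.
active = pd.DataFrame({"customer_id": ["a", "b", "a"], "week": [2, 2, 3]})

# Made-up bestseller lists, keyed by the week they apply to.
bestsellers = pd.DataFrame({
    "week": [2, 2, 3],
    "article_id": ["x", "y", "y"],
    "bestseller_previous_week_rank": [1, 2, 1],
})

# Every (customer, week) pair receives all bestsellers for that week.
candidates = active.merge(bestsellers, on="week", how="inner")

print(candidates)
```

The row count grows as (active customers in a week) times (bestsellers for that week), which is why the shape above is so much larger than the number of real transactions.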

In [14]:
# need to recover price

negative_examples_bestsellers_previous_week.week -= 1
negative_examples_bestsellers_previous_week = pd.merge(
    negative_examples_bestsellers_previous_week,
    transactions[['week', 'article_id', 'price']].groupby(['week', 'article_id']).mean().reset_index(), # mean of prices
                                                                                                        # across days / channels
    on=['week', 'article_id'],
    how='left'
) # not perfect without the 'sales_channel_id' but probably good enough
negative_examples_bestsellers_previous_week.week += 1
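
An equivalent way to look at the price lookup, sketched on made-up data: instead of moving the candidates back one week and forward again, shift the per-week mean price table forward by one week and merge once. The frame names here are hypothetical.

```python
import pandas as pd

# Made-up sales with prices.
sales = pd.DataFrame({
    "week": [1, 1, 1, 2],
    "article_id": ["x", "x", "y", "x"],
    "price": [0.02, 0.04, 0.10, 0.05],
})

# Made-up candidates that still lack a price.
candidates = pd.DataFrame({
    "customer_id": ["a", "a"],
    "week": [2, 3],
    "article_id": ["x", "x"],
})

# Mean price per (week, article), averaged across days and channels.
mean_price = sales.groupby(["week", "article_id"], as_index=False)["price"].mean()

# A price observed in week t is attached to candidates for week t+1.
mean_price["week"] += 1

candidates = candidates.merge(mean_price, on=["week", "article_id"], how="left")

print(candidates)
```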

In [15]:
negative_examples_bestsellers_previous_week

Unnamed: 0,customer_id,week,sales_channel_id,article_id,bestseller_previous_week_rank,price
0,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,2,2,0451229014,1.0,0.050831
1,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,2,2,0651335004,1.0,0.014492
2,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,2,2,0658848001,1.0,0.011847
3,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,2,2,0666143001,1.0,0.011847
4,05d0c63e8a3ff46f9519e38f1af70007d474650975ef40...,2,2,0671515003,1.0,0.015237
...,...,...,...,...,...,...
301762,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,108,2,0372860068,2.0,0.013542
301763,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,108,2,0377277002,2.0,0.008085
301764,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,108,2,0436083001,2.0,0.024322
301765,db1f0328dcdc4e12318b7091159ea412af2934047fc191...,108,2,0448509014,2.0,0.040237


In [16]:
transactions = pd.concat([transactions, negative_examples_bestsellers_previous_week])
transactions.reset_index(inplace=True, drop=True)

Let's generate candidate examples for all the customers in the test set as well.

In [17]:
test_data = pd.read_csv('data/sample_submission.csv')[['customer_id']]

In [18]:
test_data['sales_channel_id'] = 1

In [19]:
test_data_2 = test_data.copy()
test_data_2['sales_channel_id'] = 2

In [20]:
test_data = pd.concat([test_data, test_data_2])

In [21]:
negative_examples_bestsellers_previous_week = \
    negative_examples_bestsellers_previous_week[
        negative_examples_bestsellers_previous_week.week == negative_examples_bestsellers_previous_week.week.max()
]

In [22]:
bestseller_article_info = negative_examples_bestsellers_previous_week.drop_duplicates(['article_id', 'sales_channel_id'])[['article_id', 'sales_channel_id', 'bestseller_previous_week_rank', 'price']]

In [23]:
test_data.shape

(2743960, 2)

In [24]:
bestseller_article_info.shape

(40, 4)

In [25]:
test_candidates = pd.merge( # negative examples AND candidates for week 108!
    test_data,
    bestseller_article_info,
    how='outer',
    on='sales_channel_id'
)

In [26]:
test_candidates['week'] = transactions.week.max()

In [27]:
transactions = pd.concat([transactions, test_candidates])

In [28]:
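# candidate rows have no label or repeat-purchase flag yet; treat both as 0
transactions['bought_previous_week'] = transactions['bought_previous_week'].fillna(0)
transactions['purchased'] = transactions['purchased'].fillna(0)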

In [29]:
# removing "fake", incorrect transactions
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

In [30]:
transactions.shape

(55234762, 8)

In [31]:
# Yet another way to create negative examples, to further differentiate between user preferences.

negative_transactions = transactions[transactions.purchased == 1].copy()
negative_transactions.purchased = 0
negative_transactions.bought_previous_week = 0

In [32]:
negative_transactions.customer_id = negative_transactions.groupby('week')['customer_id'].transform(np.random.permutation)

In [33]:
transactions = pd.concat([transactions, negative_transactions])
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

In [34]:
transactions.shape

(55262342, 8)
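
The permutation trick can be sketched on toy data as follows. Each positive row is copied, relabeled as a negative, and given the customer ID of another row from the same week. The IDs and the seeded generator are made up for the sketch; the cell above uses `np.random.permutation` without a seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Made-up positive examples.
positives = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d", "e"],
    "article_id": ["x", "y", "z", "x", "y"],
    "week": [1, 1, 1, 2, 2],
    "purchased": 1,
})

negatives = positives.copy()
negatives["purchased"] = 0

# Shuffle customer IDs among rows of the same week; articles and weeks stay put.
negatives["customer_id"] = negatives.groupby("week")["customer_id"].transform(
    lambda s: rng.permutation(s.to_numpy())
)

# Positives come first, so a shuffled row that recreates a real purchase is dropped.
both = pd.concat([positives, negatives], ignore_index=True)
both = both.drop_duplicates(["customer_id", "article_id", "week"])

print(both)
```

Because the shuffle happens within a week, each negative pairs an article that really sold that week with a customer who really shopped that week, just not necessarily with each other.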

Let's merge `customers` and `articles` with `transactions`.

In [35]:
articles.drop(columns=['detail_desc', 'prod_name'], inplace=True)

In [36]:
transactions = pd.merge(transactions, articles, on='article_id', how='left')
transactions = pd.merge(transactions, customers, on='customer_id', how='left')

In [37]:
transactions.shape

(55262342, 36)

In [None]:
import gc
gc.collect()

# Dataset preparation

In [38]:
first_two_weeks = transactions.week.sort_values().unique()[:2]
test_set_week = transactions.week.max()

In [39]:
transactions = transactions[~transactions.week.isin(set(first_two_weeks))]

In [40]:
test_candidates = transactions[transactions.week == test_set_week]
train_set = transactions[transactions.week != test_set_week]

In [80]:
test_candidates = test_candidates.drop(columns='purchased')

In [41]:
train_set.shape, test_candidates.shape

((382509, 36), (54879240, 36))

In [81]:
gc.collect()

5154

In [72]:
cust_ids = train_set.customer_id.unique()
print(len(cust_ids))

train_cust_ids = cust_ids[:int(0.8*cust_ids.shape[0])]
valid_cust_ids = cust_ids[int(0.8*cust_ids.shape[0]):]

train_cust_ids.shape[0], valid_cust_ids.shape[0]

1362


(1089, 273)

In [79]:
train_X = train_set[train_set.customer_id.isin(set(train_cust_ids))]
valid_X = train_set[~train_set.customer_id.isin(set(train_cust_ids))]

train_y = train_X['purchased']
valid_y = valid_X['purchased']

train_X = train_X.drop(columns='purchased')
valid_X = valid_X.drop(columns='purchased')
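
The split above takes the first 80% of customer IDs in their order of appearance in `train_set`. If that order is correlated with time or with how the rows were generated, shuffling the IDs first may give a more representative validation set. A minimal sketch under that assumption, on made-up data with a hypothetical seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed, only for reproducibility

# Made-up training rows: 10 customers with two rows each.
toy = pd.DataFrame({
    "customer_id": [f"c{i}" for i in range(10) for _ in range(2)],
    "purchased": [1, 0] * 10,
})

# Shuffle the unique customer IDs, then cut at 80%.
cust_ids = rng.permutation(toy["customer_id"].unique())
n_train = int(0.8 * len(cust_ids))
train_ids = set(cust_ids[:n_train])

# All rows of a customer end up on the same side of the split.
is_train = toy["customer_id"].isin(train_ids)
train_part, valid_part = toy[is_train], toy[~is_train]

print(train_part["customer_id"].nunique(), valid_part["customer_id"].nunique())
```

Splitting by customer rather than by row keeps one customer's examples from leaking across the train and validation sets.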

# Vectorize datasets