Download the competition data and extract it to the `data` folder.

In [1]:
ls data

articles.csv                                      submission.csv.gz
customers.csv                                     transactions_train.csv
h-and-m-personalized-fashion-recommendations.zip  transactions_train.parquet
images/                                           validation_week_purchases.pkl
sample_submission.csv


We will aim for the simplest preprocessing possible.

Yes, we might take steps to conserve memory.

But when possible we will happily accept a larger memory footprint and a longer run time in exchange for lower code complexity.

Why?

The developer is the most important component of the data processing pipeline. That is what I want to protect.

Also, I have already had a go at this competition. It is so easy to die under the weight of the preprocessing steps.

> Make it work ➤ Make it right ➤ Make it fast
>
> -- [DHH](https://twitter.com/dhh/status/600667857639309313?s=20&t=fHdTPO2BGA5RekgWA6sJbw)

Essentially, I have already died in this competition by attempting the 3rd and 2nd steps without doing justice to the 1st one. That is a recipe for disaster.

This work is heavily based on research code written for internal use by my outstanding colleague, [Gabriel Moreira](https://twitter.com/gspmoreira).

In [2]:
import os
import shutil
import glob
import cudf
import pandas as pd
import numpy as np
import nvtabular as nvt

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask
import dask_cudf
import rmm
from numba import cuda

# Setup Dask cluster

The most important file is `transactions_train.csv`. That is where the record of transactions lives.

`customers.csv` and `articles.csv` only provide secondary information.

Below, I am making sure `article_id` gets processed as an object (string) and not an int. Why? Article IDs such as `0624486001` start with a zero, which an integer parse would silently drop.

Strings require a bit more memory for every row. As we create more data (candidate transactions), this effect will compound.

But at this point I don't mind -- we first need to get it to work. Only then can we make it right, and ultimately make it fast.
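To see what parsing as int would do to an ID like `0624486001`, here is a minimal sketch using only the standard library. The two CSV rows are made up for illustration and are not taken from the competition files.

```python
import csv
import io

# Made-up sample with the same column names; not real competition rows.
sample_csv = io.StringIO(
    "customer_id,article_id\n"
    "c1,0624486001\n"
    "c2,0827487003\n"
)

rows = list(csv.DictReader(sample_csv))

# Kept as strings, the IDs survive unchanged.
ids_as_str = [row["article_id"] for row in rows]

# Parsed as integers and written back out, the leading zero is gone.
ids_round_tripped = [str(int(row["article_id"])) for row in rows]

print(ids_as_str)
print(ids_round_tripped)
```

The round-tripped IDs are one character shorter, so they would no longer match the IDs as they appear in the source files.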

In [3]:
dtypes = {
 'customer_id': 'O',
 'article_id': 'O',
 'price': 'float64',
 'sales_channel_id': 'int8'
}

In [4]:
transactions = pd.read_csv('data/transactions_train.csv', parse_dates=['t_dat'], dtype=dtypes)

In [5]:
transactions.info(memory_usage=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       object        
 2   article_id        object        
 3   price             float64       
 4   sales_channel_id  int8          
dtypes: datetime64[ns](1), float64(1), int8(1), object(2)
memory usage: 1000.4+ MB


In [6]:
transactions['week'] = 104 - (transactions.t_dat.max() - transactions.t_dat).dt.days // 7
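To make the arithmetic in the `week` expression concrete, here is a small sketch for single dates using only the standard library. `week_index` is a hypothetical helper and the dates are illustrative; `last_day` stands in for `transactions.t_dat.max()`.

```python
from datetime import date


def week_index(t_dat, last_day, last_week=104):
    """Single-date version of the week expression (hypothetical helper)."""
    # Floor-divide the age in days by 7, then count down from the last week.
    return last_week - (last_day - t_dat).days // 7


# Illustrative dates only; last_day plays the role of transactions.t_dat.max().
last_day = date(2020, 9, 22)

print(week_index(date(2020, 9, 22), last_day))  # 0 days back
print(week_index(date(2020, 9, 16), last_day))  # 6 days back
print(week_index(date(2020, 9, 15), last_day))  # 7 days back
```

This prints 104, 104 and 103: the most recent seven days all land in week 104, and each earlier block of seven days gets the next lower number.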

In [7]:
transactions.to_parquet('data/transactions_train.parquet', index=False) 

# Split data into train and val

In [8]:
!wget https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py

--2022-05-26 10:23:07--  https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1654 (1.6K) [text/plain]
Saving to: ‘average_precision.py’


2022-05-26 10:23:23 (1.13 MB/s) - ‘average_precision.py’ saved [1654/1654]



In [16]:
from average_precision import apk

In [9]:
valid = transactions[transactions['week'] == transactions['week'].max()]
train = transactions[transactions['week'].isin(set([101, 102, 103]))]

Let's output predictions using a couple of strategies.

### Bestsellers across last 3 weeks

In [10]:
validation_week_purchases = valid.groupby(['customer_id'])['article_id'].apply(list)

In [11]:
validation_week_purchases

customer_id
00039306476aaf41a07fed942884f16b30abfa83a2a8bea972019098d6406793                                         [0624486001]
0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf4672f30b3e622fec55                                         [0827487003]
000493dd9fc463df1acc2081450c9e75ef8e87d5dd17ed6396773839f6bf71a9                 [0757926001, 0788575004, 0640021019]
000525e3fe01600d717da8423643a8303390a055c578ed8a97256600baf54565                                         [0874110016]
00077dbd5c4a4991e092e63893ccf29294a9d5c46e85010e95f2fc10bf9437a4    [0903762001, 0879189005, 0158340001, 086796600...
                                                                                          ...                        
fffa67737587e52ff1afa9c7c6490b5eb7acbc439fe82bd11d746ddb223dff26                             [0874816003, 0911870004]
fffa7d7799eb390a76308454cbdd76e473d65b1497fbe44fe8cf95effea0bed7                             [0861803014, 0849886010]
fffae8eb3a282d8c43c77dd2ca0621703b71e90904df

In [12]:
pd.to_pickle(validation_week_purchases, 'data/validation_week_purchases.pkl')

In [13]:
def mean_apk(predictions, purchases):
    return np.mean([apk(actual, predicted, 12) for actual, predicted in zip(purchases, predictions)])
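For reference, here is a sketch of what `apk(actual, predicted, 12)` measures, written from the usual definition of average precision at k. `apk_sketch` and `mean_apk_sketch` are hypothetical names, and this is a re-implementation for illustration, not the downloaded file itself. The toy data is made up.

```python
def apk_sketch(actual, predicted, k=12):
    """Average precision at k for one customer (illustrative re-implementation)."""
    if not actual:
        return 0.0
    predicted = predicted[:k]  # only the first k predictions count
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted):
        # Count an item only the first time it appears in the prediction list.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    return score / min(len(actual), k)


def mean_apk_sketch(predictions, purchases, k=12):
    """Mean of the per-customer scores."""
    scores = [apk_sketch(a, p, k) for a, p in zip(purchases, predictions)]
    return sum(scores) / len(scores)


# Made-up example: two customers, one with two hits, one with none.
toy_purchases = [["a", "b"], ["c"]]
toy_predictions = [["a", "x", "b"], ["y", "z"]]
print(mean_apk_sketch(toy_predictions, toy_purchases))
```

The first toy customer scores (1/1 + 2/3) / 2 and the second scores 0, so the mean comes out at about 0.417.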

In [14]:
bestsellers_last_3_weeks = train['article_id'].value_counts().index.tolist()[:12]

In [17]:
mean_apk([bestsellers_last_3_weeks] * validation_week_purchases.shape[0], validation_week_purchases)

0.00563433927371356

### Bestsellers from the week of last purchase

In [18]:
cust_id2last_purchase_week = train.groupby('customer_id')['week'].max()
week2bestsellers = train.groupby('week')['article_id'].apply(lambda df: df.value_counts()[:12].keys().tolist())

In [19]:
predictions = []

for customer_id in validation_week_purchases.keys():
    week_of_last_purchase = cust_id2last_purchase_week.get(customer_id, None)
    if week_of_last_purchase:
        predictions.append(week2bestsellers[week_of_last_purchase])
    else:
        predictions.append(bestsellers_last_3_weeks)

In [20]:
mean_apk(predictions, validation_week_purchases)

0.006215228420780319

### Bestsellers from the last week

In [21]:
predictions = []

for customer_id in validation_week_purchases.keys():
    predictions.append(week2bestsellers[103])

In [22]:
mean_apk(predictions, validation_week_purchases)

0.008148887244205453

### Items purchased by customer most recently (from purchases in the last 3 weeks)

The logic here is as follows:
- find the week of customer's last purchase
- grab purchased items from that week
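The two steps above can be sketched on a handful of made-up rows with plain Python. The customer names, weeks and article IDs here are invented for illustration, and the item ordering (most frequent first) is an assumption matching a count-based ranking.

```python
from collections import Counter

# Toy (customer_id, week, article_id) rows; all values are made up.
toy_rows = [
    ("alice", 101, "0001"),
    ("alice", 103, "0002"),
    ("alice", 103, "0003"),
    ("alice", 103, "0002"),
    ("bob", 102, "0004"),
]

# Step 1: find the week of each customer's last purchase.
last_week = {}
for customer, week, _ in toy_rows:
    last_week[customer] = max(week, last_week.get(customer, week))

# Step 2: grab the items bought in that week and count them.
counts = {}
for customer, week, article in toy_rows:
    if week == last_week[customer]:
        counts.setdefault(customer, Counter())[article] += 1

# Most frequently bought items first.
toy_last_purchase = {
    customer: [article for article, _ in counter.most_common()]
    for customer, counter in counts.items()
}
print(toy_last_purchase)
```

Alice's week-101 purchase is ignored because her last purchase week is 103, and `0002` comes first because she bought it twice that week.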

In [23]:
%%time

cust_id2last_purchase = {}
for customer_id, grp_df in train.groupby('customer_id'):
    cust_id2last_purchase[customer_id] = grp_df[grp_df.week == grp_df.week.max()]['article_id'].value_counts().index.tolist()

CPU times: user 1min 43s, sys: 108 ms, total: 1min 44s
Wall time: 1min 44s


In [24]:
predictions = []

for customer_id in validation_week_purchases.keys():
    last_purchased_items = cust_id2last_purchase.get(customer_id, [])
    predictions.append(last_purchased_items)

In [25]:
mean_apk(predictions, validation_week_purchases)

0.0168696346048911

### Items purchased by customer most recently (from purchases in the last 3 weeks) + bestsellers from week of last purchase

In [26]:
predictions = []

for customer_id in validation_week_purchases.keys():
    last_purchased_items = cust_id2last_purchase.get(customer_id, [])
    week_of_last_purchase = cust_id2last_purchase_week.get(customer_id, 0)
    if week_of_last_purchase:
        last_purchased_items = last_purchased_items + week2bestsellers[week_of_last_purchase]
    predictions.append(last_purchased_items)

In [27]:
mean_apk(predictions, validation_week_purchases)

0.017914785044906158

### Items purchased by customer most recently (from purchases in the last 3 weeks) + bestsellers from week of last purchase + bestsellers from week 103

In [28]:
predictions = []

for customer_id in validation_week_purchases.keys():
    last_purchased_items = cust_id2last_purchase.get(customer_id, [])
    week_of_last_purchase = cust_id2last_purchase_week.get(customer_id, 0)
    if week_of_last_purchase:
        last_purchased_items = last_purchased_items + week2bestsellers[week_of_last_purchase]
    last_purchased_items += week2bestsellers[103]
    predictions.append(last_purchased_items)

In [29]:
mean_apk(predictions, validation_week_purchases)

0.022762596906868174

### Items purchased by customer most recently (from purchases in the last 3 weeks) + bestsellers from week 103

In [30]:
predictions = []

for customer_id in validation_week_purchases.keys():
    last_purchased_items = cust_id2last_purchase.get(customer_id, [])
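    # Build a new list so the entry stored in cust_id2last_purchase is not modified
    last_purchased_items = last_purchased_items + week2bestsellers[103]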
    predictions.append(last_purchased_items)
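When combining a list fetched with `dict.get` and another list, it matters whether the result is built with `+` or with `+=`. `dict.get` hands back the list stored in the dictionary, and `+=` extends that list in place, so the dictionary entry itself grows. A minimal sketch with made-up data:

```python
# Made-up stand-ins for a per-customer lookup and a bestseller list.
store = {"alice": ["0002"]}
bestsellers = ["0900", "0901"]

# `+` builds a new list; the stored entry is untouched.
items = store.get("alice", [])
items = items + bestsellers
after_plus = list(store["alice"])
print(after_plus)

# `+=` extends the stored list in place; the entry now includes the bestsellers.
items = store.get("alice", [])
items += bestsellers
after_inplace = list(store["alice"])
print(after_inplace)
```

With `+=`, running the same loop a second time would append the bestsellers again for every customer found in the dictionary.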

In [31]:
mean_apk(predictions, validation_week_purchases)

0.023122213718998495

The above solution is loosely based on [the following kernel](https://www.kaggle.com/code/hengzheng/time-is-our-best-friend-v2).

When run on weeks 101-103, it scores 0.00969 on Private LB. When run on weeks 102-104, it scores 0.02024.

That is slightly better than the original kernel. But this is not important -- the only reason I was able to implement and experiment with this logic is how simple I kept the code!

That needs to be my north star going forward. Code simplicity is where it's at. The key to ML is refactor, refactor, refactor, and simplify.

I don't see it any other way.

# Making a submission

In [358]:
sample_sub = pd.read_csv('data/sample_submission.csv')

In [397]:
predictions = []

for customer_id in sample_sub['customer_id']:
    last_purchased_items = cust_id2last_purchase.get(customer_id, [])
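    # Build a new list so the entry stored in cust_id2last_purchase is not modified
    last_purchased_items = last_purchased_items + week2bestsellers[103]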
    predictions.append(last_purchased_items)

In [367]:
sample_sub

| | customer_id | prediction |
|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 0706016001 0706016002 0372860001 0610776002 07... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 0706016001 0706016002 0372860001 0610776002 07... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 0706016001 0706016002 0372860001 0610776002 07... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 0706016001 0706016002 0372860001 0610776002 07... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 0706016001 0706016002 0372860001 0610776002 07... |
| ... | ... | ... |
| 1371975 | ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474... | 0706016001 0706016002 0372860001 0610776002 07... |
| 1371976 | ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5... | 0706016001 0706016002 0372860001 0610776002 07... |
| 1371977 | ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1... | 0706016001 0706016002 0372860001 0610776002 07... |
| 1371978 | ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38... | 0706016001 0706016002 0372860001 0610776002 07... |


In [398]:
sub = pd.DataFrame(data={'customer_id': sample_sub.customer_id, 'prediction': [' '.join(p) for p in predictions]})

In [399]:
sub.to_csv('data/submission.csv.gz', index=False)
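The prediction lists built above can be longer than 12 IDs and can repeat an article (a customer's recent item may also be a bestseller). The metric used in this notebook only scores the first 12 positions (`apk(..., 12)`), so one option, shown here as a sketch and not something the notebook does, is to dedupe and trim each list before joining it into a string. `top_k_unique` is a hypothetical helper and the IDs are made up.

```python
def top_k_unique(items, k=12):
    """Keep the first k distinct items, preserving order (hypothetical helper)."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
        if len(out) == k:
            break
    return out


# Made-up prediction list with one repeated ID.
toy_prediction = ["0002", "0900", "0002", "0901", "0902"]
toy_row = " ".join(top_k_unique(toy_prediction, k=3))
print(toy_row)
```

Here the repeated `0002` is dropped and the list is cut after three distinct IDs. With the default `k=12`, each row would carry at most 12 IDs.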

In [377]:
from IPython.lib.display import FileLink

In [400]:
FileLink('data/submission.csv.gz')

In [401]:
pd.read_csv('data/submission.csv.gz')

| | customer_id | prediction |
|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | 0568601043 0924243001 0924243002 0918522001 09... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | 0924243001 0924243002 0918522001 0923758001 08... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 0794321007 0924243001 0924243002 0918522001 09... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | 0924243001 0924243002 0918522001 0923758001 08... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 0924243001 0924243002 0918522001 0923758001 08... |
| ... | ... | ... |
| 1371975 | ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474... | 0713997002 0720125039 0740922009 0791587007 08... |
| 1371976 | ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5... | 0924243001 0924243002 0918522001 0923758001 08... |
| 1371977 | ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1... | 0762846027 0924243001 0924243002 0918522001 09... |
| 1371978 | ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38... | 0924243001 0924243002 0918522001 0923758001 08... |
