We will start with a simple baseline.

The idea here is to get a feel for what this very simple approach can achieve.

Beyond that, we want to get our feet wet with the data, learn about the problem domain, get the logistics in place for making a submission, and so forth.

# First Submission

The period of interest for this competition is the week immediately **after** the end of the train data.

I imagine trends are quite important when it comes to fashion. Let's try to verify this assumption as we work on our first submission.

On the way to our first submission we will create a validation set from the last week of the train data.

We will submit two sets of predictions to Kaggle: the most popular items purchased in the 2 weeks leading up to our validation period, and the most popular items purchased in the corresponding 2-week window roughly 3 months earlier.

We will learn a little bit about trends by following this strategy. More importantly at this point, we will also learn whether our validation set tracks the public LB.

In [12]:
import pandas as pd

In [213]:
transactions = pd.read_csv('data/transactions_train.csv', dtype={'article_id': str})
sample_sub = pd.read_csv('data/sample_submission.csv')

In [215]:
transactions.t_dat = pd.to_datetime(transactions.t_dat)

In [41]:
newest_entry = transactions.t_dat.max()
newest_entry

Timestamp('2020-09-22 00:00:00')

In [42]:
import datetime

In [44]:
# Validation window: transactions dated strictly after newest_entry - 7 days.
last_7_days = transactions.t_dat > newest_entry - datetime.timedelta(days=7)

In [48]:
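# The 2 weeks before the validation window. Both comparisons are strict, so
# transactions dated exactly newest_entry - 7 days fall in neither this window nor last_7_days.
leading_up_2_weeks = (transactions.t_dat < newest_entry - datetime.timedelta(days=7)) & (transactions.t_dat > newest_entry - datetime.timedelta(days=21))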

In [49]:
# The same 2-week window shifted back by 90 days (roughly one quarter).
a_quarter_ago_2_weeks = (transactions.t_dat < newest_entry - datetime.timedelta(days=7+30*3)) & (transactions.t_dat > newest_entry - datetime.timedelta(days=21+30*3))
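
The three masks above can also be written with a small helper that makes the window boundaries explicit. The sketch below is only an illustration on made-up dates, not the code used in this notebook: `window_mask` is a hypothetical helper, and it assumes half-open intervals `(end - days, end]` so that consecutive windows neither overlap nor leave a gap.

```python
import pandas as pd

def window_mask(dates, end, days):
    """Boolean mask for dates in the half-open interval (end - days, end]."""
    end = pd.Timestamp(end)
    start = end - pd.Timedelta(days=days)
    return (dates > start) & (dates <= end)

# Toy data: one transaction per day for 30 days (made up for illustration).
toy_dates = pd.Series(pd.date_range("2020-08-24", "2020-09-22", freq="D"))
newest = toy_dates.max()

# Last 7 days as validation, the 14 days before that as the "train" window.
validation = window_mask(toy_dates, newest, days=7)
train = window_mask(toy_dates, newest - pd.Timedelta(days=7), days=14)

print(validation.sum(), train.sum())   # number of days in each window
print((validation & train).sum())      # days shared by both windows
```

With this convention every day belongs to exactly one window, which makes it easier to reason about what the validation score is measuring.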

We have the time ranges for our validation dataset and the two "train sets". But how do we go from the transactions in the validation time period to something we can use to evaluate our predictions?

For every customer who made a purchase in the validation period, we will use their purchases, in chronological order, as our ground truth. We keep at most 12 purchases per customer; if fewer than 12 were made, we pad the list with the placeholder `'0'`.

In [220]:
cust_id = []
purchases = []

# One ground-truth row per customer: up to 12 purchased article ids, padded with '0'.
for customer_id, group in transactions[last_7_days].groupby('customer_id'):
    cust_purchases = group['article_id'].tolist()[:12]
    cust_purchases += ['0'] * (12 - len(cust_purchases))
    cust_id.append(customer_id)
    purchases.append(cust_purchases)

In [221]:
len(cust_id), len(purchases)

(68984, 68984)

In [222]:
validation_set = pd.DataFrame(data={'customer_id': cust_id, 'purchases': purchases})
validation_set.to_pickle('data/validation_set.pkl')
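
The ground-truth construction above can also be expressed without an explicit loop. The following is a minimal sketch on a made-up toy frame, assuming the rows are already in chronological order; `pad_to_k` is a hypothetical helper and the cutoff is set to 3 instead of 12 only to keep the example readable.

```python
import pandas as pd

# Toy transactions, assumed to be sorted by date (made up for illustration).
toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a", "b"],
    "article_id": ["0111", "0222", "0333", "0111", "0444"],
})

K = 3  # the notebook uses 12

def pad_to_k(items, k=K, pad="0"):
    """Keep the first k items and right-pad with a placeholder up to length k."""
    items = list(items)[:k]
    return items + [pad] * (k - len(items))

ground_truth = (
    toy.groupby("customer_id")["article_id"]
       .agg(list)          # purchases per customer, in row order
       .map(pad_to_k)      # truncate or pad to exactly K entries
       .rename("purchases")
       .reset_index()
)
print(ground_truth)
```

The result has the same two columns as `validation_set`, one row per customer.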

Now let's "calculate" our predictions.

In [223]:
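# Every customer gets the same 12 most purchased articles from the 2 weeks before the validation period.
top_12_leading_2_weeks = transactions[leading_up_2_weeks]['article_id'].value_counts().index.tolist()[:12]
predictions_leading_2_weeks = validation_set.copy()
predictions_leading_2_weeks['purchases'] = [top_12_leading_2_weeks] * len(validation_set)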
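
To make the popularity baseline easier to see in isolation, here is a small sketch on made-up article ids. `most_popular` is a hypothetical helper name; it relies on `value_counts()` returning counts in descending order.

```python
import pandas as pd

def most_popular(article_ids, k=12):
    """Return the k most frequent article ids, most frequent first."""
    return article_ids.value_counts().index.tolist()[:k]

# Toy purchases (made up): "0111" bought 3 times, "0222" twice, "0333" once.
toy_articles = pd.Series(["0111", "0222", "0111", "0333", "0222", "0111"])
top_2 = most_popular(toy_articles, k=2)

# Every customer receives the same non-personalized recommendation list.
customers = pd.DataFrame({"customer_id": ["a", "b", "c"]})
customers["purchases"] = [top_2] * len(customers)
print(customers)
```

Every row references the same list object, which is fine as long as the list is not mutated in place.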

We can grab the code to run the metric calculation directly from GitHub.

In [1]:
!wget https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py

--2022-02-14 09:58:00--  https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1654 (1.6K) [text/plain]
Saving to: ‘average_precision.py’


2022-02-14 09:58:01 (60.0 MB/s) - ‘average_precision.py’ saved [1654/1654]



In [182]:
from average_precision import apk

In [185]:
import numpy as np

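def calculate_apk(list_of_preds, list_of_gts):
    # Mean of the per-customer average precision over all validation customers.
    # Note: apk is defined as apk(actual, predicted, k=10), so the call below passes
    # the predictions in the `actual` position and uses the default cutoff of 10.
    apks = []
    for preds, gt in zip(list_of_preds, list_of_gts):
        apks.append(apk(preds, gt))
    return np.mean(apks)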
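
For reference, the `apk` function in the downloaded file has the signature `apk(actual, predicted, k=10)`, so argument order and the cutoff both matter. Below is a self-contained sketch of mean average precision at `k`, assuming the target metric is MAP@12. The function names are hypothetical, the toy ids are made up, and treating the ground truth as an unpadded set of distinct articles is an assumption of this sketch rather than something taken from the notebook.

```python
import numpy as np

def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer.

    actual: purchased article ids (treated as a set, an assumption of this sketch).
    predicted: ranked list of recommended article ids, best first.
    """
    actual = set(actual)
    if not actual:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)

def mean_average_precision_at_k(actuals, predictions, k=12):
    """Mean of AP@k over all customers."""
    return float(np.mean([average_precision_at_k(a, p, k)
                          for a, p in zip(actuals, predictions)]))

# Toy example: the first customer gets hits at ranks 1 and 3, the second gets none.
actuals = [["0111", "0222"], ["0333"]]
predictions = [["0111", "0999", "0222"], ["0999", "0888"]]
score = mean_average_precision_at_k(actuals, predictions, k=12)
print(score)
```

For the first customer the precision at the two hits is 1/1 and 2/3, which averages to 5/6; the second customer scores 0, so the mean is 5/12.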

And how did we do?

In [224]:
calculate_apk(predictions_leading_2_weeks.purchases, validation_set.purchases)

0.0036434351466822766

Not ideal but it is a start! :) 

Now, let's do the same but using data from roughly three months prior.

In [225]:
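# Same baseline, but using the 12 most purchased articles from the 2-week window a quarter earlier.
top_12_a_quarter_ago = transactions[a_quarter_ago_2_weeks]['article_id'].value_counts().index.tolist()[:12]
predictions_a_quarter_ago_2_weeks = validation_set.copy()
predictions_a_quarter_ago_2_weeks['purchases'] = [top_12_a_quarter_ago] * len(validation_set)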

In [226]:
calculate_apk(predictions_a_quarter_ago_2_weeks.purchases, validation_set.purchases)

0.00044962634537753403

As it turns out, there might be some trending/seasonality to fashion purchases!
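
One simple way to probe this further would be to measure how much two top-12 lists overlap. The sketch below uses the Jaccard index on made-up ids; `jaccard_overlap` is a hypothetical helper and the lists are not taken from the actual data.

```python
def jaccard_overlap(list_a, list_b):
    """Share of distinct items that appear in both lists (Jaccard index)."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Made-up top lists for two different time windows.
recent_top = ["0111", "0222", "0333", "0444"]
older_top = ["0333", "0555", "0666", "0111"]

overlap = jaccard_overlap(recent_top, older_top)
print(overlap)
```

A value close to 0 would suggest that the most popular items change almost completely between the two windows, while a value close to 1 would suggest stable bestsellers.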

Now let's make a submission for each of these two sets of predictions.

In [253]:
name = 'most_purchased_last_two_weeks'
sub = sample_sub.copy()
per_customer_predictions = ' '.join(transactions[leading_up_2_weeks]['article_id'].value_counts().index.tolist()[:12])
sub['prediction'] = [per_customer_predictions]*sub.shape[0]

In [239]:
!mkdir -p data/subs

In [254]:
sub.to_csv(f'data/subs/{name}.csv.gz', index=False, compression='gzip')
pd.read_csv(f'data/subs/{name}.csv.gz')

,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0751471001 0909370001 0915526001 0751471043 04...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0751471001 0909370001 0915526001 0751471043 04...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0751471001 0909370001 0915526001 0751471043 04...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0751471001 0909370001 0915526001 0751471043 04...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0751471001 0909370001 0915526001 0751471043 04...
...,...,...
1371975,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0751471001 0909370001 0915526001 0751471043 04...
1371976,ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...,0751471001 0909370001 0915526001 0751471043 04...
1371977,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,0751471001 0909370001 0915526001 0751471043 04...
1371978,ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38...,0751471001 0909370001 0915526001 0751471043 04...
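
The submission file has one row per customer, with the `prediction` column holding space-separated article ids. Since the ids shown above start with a leading zero, reading `article_id` as a string matters. The sketch below illustrates a quick sanity check before uploading; `format_prediction` and `check_submission` are hypothetical helpers, and the assumption that every id has exactly 10 digits is based only on the ids visible in the output above.

```python
import pandas as pd

def format_prediction(article_ids, k=12):
    """Join up to k article ids into a space-separated prediction string."""
    return " ".join(str(a) for a in article_ids[:k])

def check_submission(sub, k=12, id_length=10):
    """Return True if every prediction has at most k ids of the expected length."""
    def row_ok(pred):
        ids = pred.split()
        return len(ids) <= k and all(len(i) == id_length and i.isdigit() for i in ids)
    return bool(sub["prediction"].map(row_ok).all())

# Toy submission with made-up customer ids.
toy_sub = pd.DataFrame({"customer_id": ["a", "b"]})
toy_sub["prediction"] = format_prediction(["0751471001", "0909370001"])
good = check_submission(toy_sub)

# If the ids had been read as integers, the leading zero would be lost.
broken = toy_sub.assign(prediction=format_prediction([751471001, 909370001]))
bad = check_submission(broken)

print(good, bad)
```

Running a check like this before calling the Kaggle CLI can catch formatting problems without spending a submission.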


In [246]:
!kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f data/subs/{name}.csv.gz -m "{name}"

100%|██████████████████████████████████████| 50.3M/50.3M [00:25<00:00, 2.05MB/s]
Successfully submitted to H&M Personalized Fashion Recommendations

And now looking roughly a quarter back.

In [247]:
name = 'most_purchased_3_mths_ago_two_weeks'
sub = sample_sub.copy()
per_customer_predictions = ' '.join(transactions[a_quarter_ago_2_weeks]['article_id'].value_counts().index.tolist()[:12])
sub['prediction'] = [per_customer_predictions]*sub.shape[0]

In [249]:
sub.to_csv(f'data/subs/{name}.csv.gz', index=False, compression='gzip')
pd.read_csv(f'data/subs/{name}.csv.gz')

,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0817472002 0599580038 0599580052 0817472005 03...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0817472002 0599580038 0599580052 0817472005 03...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0817472002 0599580038 0599580052 0817472005 03...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0817472002 0599580038 0599580052 0817472005 03...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0817472002 0599580038 0599580052 0817472005 03...
...,...,...
1371975,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0817472002 0599580038 0599580052 0817472005 03...
1371976,ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...,0817472002 0599580038 0599580052 0817472005 03...
1371977,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,0817472002 0599580038 0599580052 0817472005 03...
1371978,ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38...,0817472002 0599580038 0599580052 0817472005 03...


In [252]:
!kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f data/subs/{name}.csv.gz -m "{name}"

100%|██████████████████████████████████████| 50.3M/50.3M [00:22<00:00, 2.30MB/s]
Successfully submitted to H&M Personalized Fashion Recommendations