The evaluation set is the "predicted items a customer will buy in the next 7-day period after the training time period".

Given this we will use a _global temporal_ split: a fixed time-point that is shared across all users, where any interactions after that point are used for testing.

We will take the last 7-day period as a test set and the 7-day period before that as validation.
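
As a minimal sketch of how the window boundaries behave, the toy example below (made-up dates, not the competition data) applies the same cut-offs with half-open windows `(start, end]`, so a transaction dated exactly on a boundary lands in one split only. The helper name `assign_split` is hypothetical.

```python
import pandas as pd

def assign_split(t_dat: pd.Series, window_days: int = 7) -> pd.Series:
    """Label each timestamp as 'train', 'dev' or 'test' using global cut-offs."""
    test_end = t_dat.max()
    test_start = test_end - pd.Timedelta(days=window_days)
    dev_start = test_start - pd.Timedelta(days=window_days)

    split = pd.Series('train', index=t_dat.index)
    # (dev_start, test_start] is the dev window, (test_start, test_end] the test window
    split[(t_dat > dev_start) & (t_dat <= test_start)] = 'dev'
    split[t_dat > test_start] = 'test'
    return split

toy_dates = pd.Series(pd.to_datetime([
    '2020-09-01', '2020-09-08', '2020-09-09',
    '2020-09-15', '2020-09-16', '2020-09-22',
]))
toy_splits = assign_split(toy_dates).tolist()
print(toy_splits)
```

Because the cut-offs depend only on the dates, every customer shares the same boundaries, which is what makes the split global.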

~2.5% of users in the sample submission are not in the training set (cold-start users). We check that our splits have similar proportions of cold-start users. Note, however, that these customers may not necessarily have made any purchases.

__Approach__:

_Training set_:
- Map `article_id` to `article_id_idx`
- Create (single-purchase) labels
- For customers _with_ purchase history, concatenate historical `article_id_idx`s into a comma-separated string
- For customers _without_ purchase history, create history value `'1'`
- Sort history and label DataFrames so they are aligned 
- Save both files separately as CSVs
- Note: for all-purchase labels, we can look at purchases within 1 week of the selected label
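
The note on all-purchase training labels is not implemented in this notebook. One possible reading, sketched below on made-up data, is to collect every purchase from the selected label up to 7 days after it. The function name `all_purchase_label` and the tuple layout are assumptions for illustration only.

```python
from datetime import date, timedelta

def all_purchase_label(purchases, label_pos, days=7):
    """Return article indices bought within `days` days of the selected label.

    purchases: list of (purchase_date, article_id_idx), sorted by date.
    label_pos: position of the randomly selected label in `purchases`.
    """
    label_date = purchases[label_pos][0]
    window_end = label_date + timedelta(days=days)
    return [
        idx for pos, (d, idx) in enumerate(purchases)
        if pos >= label_pos and d < window_end
    ]

toy_purchases = [
    (date(2020, 1, 1), 10),
    (date(2020, 1, 3), 11),   # selected label
    (date(2020, 1, 5), 12),   # inside the 7-day window
    (date(2020, 1, 20), 13),  # outside the 7-day window
]
window_label = all_purchase_label(toy_purchases, label_pos=1)
print(window_label)
```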

_Dev/Test set_:
- Need single-purchase labels for loss and all purchases for MAP
- For single-purchase labels take first purchase by `customer_id` in `dev_df`
- Save all purchase labels separately
- Same logic as above for the input: construct it from `train_df` for the dev set, and from `train_df` and `dev_df` for the test set
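
To illustrate why the all-purchase labels are needed, here is a minimal sketch of mean average precision at `k`. The cut-off `k=12` and the de-duplication of repeated purchases are assumptions, since the text above only says "MAP"; the function names are hypothetical.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: `actual` are purchased items, `predicted` is a ranked list."""
    actual_set = set(actual)  # assumption: repeated purchases count once
    if not actual_set:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    return score / min(len(actual_set), k)

def mean_average_precision_at_k(actuals, predictions, k=12):
    """Average AP@k over customers."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)

# Labels in the comma-separated format used for the all-purchase files
actuals = ['58297,3093'.split(','), '78505'.split(',')]
predictions = [['3093', '7', '58297'], ['1', '2', '3']]
map_score = mean_average_precision_at_k(actuals, predictions)
print(round(map_score, 4))
```

The single-purchase label only supports a per-example loss, whereas this metric needs the full set of items bought in the window.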

In [1]:
import os
import datetime

import pandas as pd
import numpy as np

In [2]:
os.chdir('..')

#### Map `article_id`s to numeric indices

In [3]:
articles_df = pd.read_csv('data/articles.csv', dtype={'article_id': str})  # Make sure article_id is loaded as a string
print(articles_df.shape)
articles_df.head()

(105542, 25)


Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [4]:
article_id_to_idx = dict(
    zip(
        articles_df['article_id'],
        articles_df.index + 2
    )
)

We reserve index `0` for padding and `1` for 'no history', hence the `+ 2` offset above.
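
As a small sketch of what the reserved indices imply, the toy example below builds the same kind of mapping for three article IDs, plus an inverse mapping and a vocabulary size. The constant names, `idx_to_article_id` and `vocab_size` are hypothetical and do not appear elsewhere in this notebook.

```python
PADDING_IDX = 0        # reserved for padding short histories
NO_HISTORY_IDX = 1     # reserved for customers without purchase history
FIRST_ARTICLE_IDX = 2  # real articles start here

toy_article_ids = ['108775015', '108775044', '110065001']

# Forward mapping: article_id -> index, shifted past the reserved values
toy_article_id_to_idx = {
    article_id: position + FIRST_ARTICLE_IDX
    for position, article_id in enumerate(toy_article_ids)
}

# Inverse mapping, e.g. for turning predicted indices back into article IDs
idx_to_article_id = {idx: article_id for article_id, idx in toy_article_id_to_idx.items()}

# Any lookup table over these indices needs room for the reserved values too
vocab_size = len(toy_article_id_to_idx) + FIRST_ARTICLE_IDX

print(toy_article_id_to_idx, vocab_size)
```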

#### Create splits

In [5]:
transactions_train_df = pd.read_csv('data/transactions_train.csv', dtype={'article_id': str})  # Make sure article_id is loaded as a string
print(transactions_train_df.shape)
transactions_train_df.head()

(31788324, 5)


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [6]:
transactions_train_df['t_dat'] = pd.to_datetime(transactions_train_df['t_dat'])

In [7]:
transactions_train_df.sort_values(['customer_id', 't_dat'], inplace=True)

In [8]:
transactions_train_df['article_id_idx'] = transactions_train_df['article_id'].map(article_id_to_idx)

In [9]:
test_end = transactions_train_df['t_dat'].max()
test_start = transactions_train_df['t_dat'].max() - datetime.timedelta(days=7)

test_start, test_end

(Timestamp('2020-09-15 00:00:00'), Timestamp('2020-09-22 00:00:00'))

In [10]:
dev_end = test_start
dev_start = dev_end - datetime.timedelta(days=7)

dev_start, dev_end

(Timestamp('2020-09-08 00:00:00'), Timestamp('2020-09-15 00:00:00'))

In [11]:
train_start = transactions_train_df['t_dat'].min()
train_end = dev_start

train_start, train_end

(Timestamp('2018-09-20 00:00:00'), Timestamp('2020-09-08 00:00:00'))

In [12]:
test_mask = transactions_train_df['t_dat'].between(test_start, test_end, inclusive='right')
dev_mask = transactions_train_df['t_dat'].between(dev_start, dev_end, inclusive='right')
train_mask = transactions_train_df['t_dat'].between(train_start, train_end, inclusive='both')

In [13]:
train_df = transactions_train_df[train_mask].copy()
dev_df = transactions_train_df[dev_mask].copy()
test_df = transactions_train_df[test_mask].copy()

In [14]:
assert train_df.shape[0] + dev_df.shape[0] + test_df.shape[0] == transactions_train_df.shape[0]

In [15]:
# Proportion of dev set customers not in training set
len(
    set(dev_df['customer_id'].unique()) - 
    set(train_df['customer_id'].unique())
) / dev_df['customer_id'].nunique()

0.07491078743109457

In [16]:
# Proportion of test set customers not in training set
len(
    set(test_df['customer_id'].unique()) - 
    set(train_df['customer_id'].unique())
) / test_df['customer_id'].nunique()

0.08591847384900847

#### Prepare training data

For each customer we randomly select a transaction to be their label. Any transactions after this one are discarded for training purposes.
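
The label-then-truncate step can be sketched on toy data as follows. The data are made up and the "random" draw is fixed by hand so the result is easy to follow; the back-fill trick is the same one used in the cells below.

```python
import pandas as pd

# Toy transactions, already sorted by customer and date
toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'a', 'b', 'b'],
    'article_id_idx': [10, 11, 12, 13, 20, 21],
})
toy['uid'] = toy.index

# Pretend the random draw picked uid 2 for customer a and uid 4 for customer b
toy_labels = pd.DataFrame({'customer_id': ['a', 'b'], 'uid': [2, 4], 'label': [1, 1]})
toy = toy.merge(toy_labels, on=['customer_id', 'uid'], how='left')

# Back-fill within each customer: rows up to and including the label become 1,
# rows after the label stay NaN and are dropped
toy['keep'] = toy.groupby('customer_id')['label'].bfill()
kept = toy[toy['keep'] == 1].copy()
kept['label'] = kept['label'].fillna(0).astype(int)

# History is everything kept that is not the label itself
toy_history = (
    kept[kept['label'] == 0]
    .groupby('customer_id')['article_id_idx']
    .apply(list)
    .to_dict()
)
print(toy_history)
```

Customer `b` has no entry in `toy_history` because the selected label is their first transaction, which is how 'no history' training examples arise.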

In [17]:
train_df['customer_id'].nunique()

1351314

In [18]:
train_df.reset_index(drop=True, inplace=True)

In [19]:
train_df['uid'] = train_df.index  # Create a UID as a customer may purchase the same item multiple times preventing a 1:1 join

In [20]:
%%time
labels = train_df[['customer_id', 'uid']].groupby('customer_id').sample(n=1, random_state=3)

CPU times: user 53.8 s, sys: 5.71 s, total: 59.5 s
Wall time: 59.8 s


In [21]:
labels.head()

Unnamed: 0,customer_id,uid
19,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,19
35,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,35
112,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,112
125,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,125
131,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,131


In [22]:
labels['label'] = 1

In [23]:
train_df = train_df.merge(labels, on=['customer_id', 'uid'], how='left')

In [24]:
train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_id_idx,uid,label
0,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,625548001,0.044051,1,29518,0,
1,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,176209023,0.035576,1,101,1,
2,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,627759010,0.030492,1,30329,2,
3,2019-05-02,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,697138006,0.010153,2,50726,3,
4,2019-05-25,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,568601006,0.050831,2,16005,4,


In [25]:
train_df['keep'] = train_df['label'].copy()

In [26]:
train_df['keep'] = train_df.groupby('customer_id')['keep'].bfill()

In [27]:
# Drop records for each customer after the label
train_df = train_df[train_df['keep'] == 1].copy()

In [28]:
train_df['label'] = train_df['label'].fillna(0)  # assign back; in-place fillna on a selected column may not update train_df

In [29]:
train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_id_idx,uid,label,keep
0,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,625548001,0.044051,1,29518,0,0.0,1.0
1,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,176209023,0.035576,1,101,1,0.0,1.0
2,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,627759010,0.030492,1,30329,2,0.0,1.0
3,2019-05-02,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,697138006,0.010153,2,50726,3,0.0,1.0
4,2019-05-25,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,568601006,0.050831,2,16005,4,0.0,1.0


In [30]:
train_df.drop(columns=['uid', 'keep'], inplace=True)

In [31]:
train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_id_idx,label
0,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,625548001,0.044051,1,29518,0.0
1,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,176209023,0.035576,1,101,0.0
2,2018-12-27,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,627759010,0.030492,1,30329,0.0
3,2019-05-02,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,697138006,0.010153,2,50726,0.0
4,2019-05-25,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,568601006,0.050831,2,16005,0.0


In [32]:
train_df['article_id_idx'] = train_df['article_id_idx'].astype(str)

In [33]:
train_historical_purchases = (
    train_df[train_df['label'] == 0][['customer_id', 'article_id_idx']]
    .groupby('customer_id')
    .agg({'article_id_idx': ','.join})
    .reset_index()
)

train_historical_purchases.head()

Unnamed: 0,customer_id,article_id_idx
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,"29518,101,30329,50726,16005,16005,23998,65669,..."
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,"19335,33750,33993,8218,41026,19335,42628,41026..."
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,"40181,10522,40181,18199,59460"
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,64527
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,"32249,43444,51126,54463,2183"


In [34]:
train_historical_purchases.shape

(1038868, 2)

In [35]:
train_labels = train_df.copy()[train_df['label'] == 1][['customer_id', 'article_id_idx']]
print(train_labels.shape)
train_labels.head()

(1351314, 2)


Unnamed: 0,customer_id,article_id_idx
19,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,93746
35,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,59460
112,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,1471
125,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,61177
131,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2183


In [36]:
train_combined = train_labels.merge(train_historical_purchases, on='customer_id', how='left', suffixes=('_label', '_historical'), indicator=True)
train_combined.head()

Unnamed: 0,customer_id,article_id_idx_label,article_id_idx_historical,_merge
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,93746,"29518,101,30329,50726,16005,16005,23998,65669,...",both
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,59460,"19335,33750,33993,8218,41026,19335,42628,41026...",both
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,1471,"40181,10522,40181,18199,59460",both
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,61177,64527,both
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2183,"32249,43444,51126,54463,2183",both


In [37]:
train_combined['_merge'].value_counts() / train_combined.shape[0]

both          0.768784
left_only     0.231216
right_only    0.000000
Name: _merge, dtype: float64

`left_only` represents customers with no purchase history, i.e. customers whose sampled label is their first transaction. At 23%, this is much larger than the corresponding proportion of no-history customers in the sample submission (2.5%). We therefore downsample these customers.
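
The number of no-history customers to keep follows from a short derivation. Let $B$ be the number of customers with history, $x$ the number of no-history customers to sample, and $p$ the target proportion (the symbols are introduced here for illustration):

```latex
\frac{x}{B + x} = p
\quad\Longrightarrow\quad
x = \frac{pB}{1 - p} = \frac{B}{1 - p} - B
```

With $B = 1{,}038{,}868$ and $p = 0.025$ this gives $x \approx 26{,}637.6$, which truncates to 26,637.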

In [38]:
TARGET_PROP_NO_HISTORY = 0.025

In [39]:
number_customers_both = train_combined[train_combined['_merge'] == 'both'].shape[0]
number_customers_both

1038868

In [40]:
number_to_sample = ((number_customers_both / (1-TARGET_PROP_NO_HISTORY)) - number_customers_both)
number_to_sample = int(number_to_sample)
number_to_sample

26637

In [41]:
train_combined_new = pd.concat(
    [
        train_combined[train_combined['_merge'] == 'both'],
        train_combined[train_combined['_merge'] != 'both'].sample(n=number_to_sample, random_state=3)
    ]
)

train_combined_new.head()

Unnamed: 0,customer_id,article_id_idx_label,article_id_idx_historical,_merge
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,93746,"29518,101,30329,50726,16005,16005,23998,65669,...",both
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,59460,"19335,33750,33993,8218,41026,19335,42628,41026...",both
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,1471,"40181,10522,40181,18199,59460",both
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,61177,64527,both
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2183,"32249,43444,51126,54463,2183",both


In [42]:
train_combined_new['_merge'].value_counts() / train_combined_new.shape[0]

both          0.975001
left_only     0.024999
right_only    0.000000
Name: _merge, dtype: float64

In [43]:
train_combined_new.drop('_merge', axis=1, inplace=True)

In [44]:
train_combined_new.isnull().sum()

customer_id                      0
article_id_idx_label             0
article_id_idx_historical    26637
dtype: int64

In [45]:
NO_HISTORY_ARTICLE_ID_IDX = '1'

train_combined_new['article_id_idx_historical'] = train_combined_new['article_id_idx_historical'].fillna(NO_HISTORY_ARTICLE_ID_IDX)

In [46]:
print(train_combined_new.shape)
train_combined_new.head()

(1065505, 3)


Unnamed: 0,customer_id,article_id_idx_label,article_id_idx_historical
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,93746,"29518,101,30329,50726,16005,16005,23998,65669,..."
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,59460,"19335,33750,33993,8218,41026,19335,42628,41026..."
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,1471,"40181,10522,40181,18199,59460"
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,61177,64527
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2183,"32249,43444,51126,54463,2183"


Save

In [47]:
train_combined_new.to_csv('data/splits/train_single_purchase_label.tsv', sep='\t', index=False)

#### Prepare dev data

In [47]:
dev_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_id_idx
31521960,2020-09-15,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,794321007,0.061,2,78505
31492019,2020-09-14,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,719530003,0.033881,2,58297
31492020,2020-09-14,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,448509014,0.042356,2,3093
31492021,2020-09-14,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,734592001,0.030492,1,61918
31412220,2020-09-12,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,640021012,0.050831,2,33870


Make all-purchase and single-purchase labels

In [48]:
dev_df['article_id_idx'] = dev_df['article_id_idx'].astype(str)

In [49]:
dev_labels_all_purchases = (
    dev_df[['customer_id', 'article_id_idx']]
    .groupby('customer_id')
    .agg({
        'article_id_idx': ','.join
    })
    .reset_index()
)

dev_labels_all_purchases.head()

Unnamed: 0,customer_id,article_id_idx
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,78505
1,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,"58297,3093"
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,61918
3,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,33870279079860898608100230
4,00040239317e877c77ac6e79df42eb2633ad38fcac09fc...,97668976699766897669


In [50]:
dev_labels_single_label = dev_df.copy()[['customer_id', 'article_id_idx']].groupby('customer_id').head(1)

dev_labels_single_label.head()

Unnamed: 0,customer_id,article_id_idx
31521960,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,78505
31492019,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,58297
31492021,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,61918
31412220,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,33870
31412224,00040239317e877c77ac6e79df42eb2633ad38fcac09fc...,97668


In [51]:
dev_labels_all_purchases.shape[0], dev_labels_single_label.shape[0]

(72019, 72019)

In [52]:
dev_labels = dev_labels_all_purchases.merge(dev_labels_single_label, on='customer_id', suffixes=('_all_purchases', '_single_label'))
dev_labels.head()

Unnamed: 0,customer_id,article_id_idx_all_purchases,article_id_idx_single_label
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,78505,78505
1,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,"58297,3093",58297
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,61918,61918
3,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,33870279079860898608100230,33870
4,00040239317e877c77ac6e79df42eb2633ad38fcac09fc...,97668976699766897669,97668


Merge on historical purchases from training data

In [53]:
dev_labels = dev_labels.merge(train_historical_purchases, on='customer_id', how='left').rename(columns={'article_id_idx': 'article_id_idx_historical'})
print(dev_labels.shape)
dev_labels.head()

(72019, 4)


Unnamed: 0,customer_id,article_id_idx_all_purchases,article_id_idx_single_label,article_id_idx_historical
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,78505,78505,"40181,10522,40181,18199,59460"
1,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,"58297,3093",58297,62290
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,61918,61918,"16166,16166,39929,8343,930,41146,44864,14571,1..."
3,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,33870279079860898608100230,33870,862171782966376160497679
4,00040239317e877c77ac6e79df42eb2633ad38fcac09fc...,97668976699766897669,97668,"25554,75579,60345,27440,59102,48230,8339,71896..."


In [54]:
dev_labels['article_id_idx_historical'] = dev_labels['article_id_idx_historical'].fillna(NO_HISTORY_ARTICLE_ID_IDX)

In [55]:
dev_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72019 entries, 0 to 72018
Data columns (total 4 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   customer_id                   72019 non-null  object
 1   article_id_idx_all_purchases  72019 non-null  object
 2   article_id_idx_single_label   72019 non-null  object
 3   article_id_idx_historical     72019 non-null  object
dtypes: object(4)
memory usage: 2.7+ MB


In [57]:
# All purchase label
(
    dev_labels[['customer_id', 'article_id_idx_all_purchases', 'article_id_idx_historical']]
    .rename(columns={'article_id_idx_all_purchases': 'article_id_idx_label'})
    .to_csv('data/splits/dev_all_purchase_label.tsv', sep='\t', index=False)
)

In [58]:
# Single purchase label
(
    dev_labels[['customer_id', 'article_id_idx_single_label', 'article_id_idx_historical']]
    .rename(columns={'article_id_idx_single_label': 'article_id_idx_label'})
    .to_csv('data/splits/dev_single_purchase_label.tsv',sep='\t', index=False)
)

#### Prepare test data

TODO
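
Until this section is filled in, the following is only a rough sketch of how the test inputs could mirror the dev step described in the approach: labels come from `test_df`, history from `train_df` plus `dev_df`. It runs on made-up frames; the helper `build_eval_inputs` is hypothetical, and using every earlier transaction as history is an assumption.

```python
import pandas as pd

NO_HISTORY_ARTICLE_ID_IDX = '1'

def join_ids(values: pd.Series) -> str:
    """Concatenate article indices into a comma-separated string."""
    return ','.join(values.astype(str))

def build_eval_inputs(history_df: pd.DataFrame, eval_df: pd.DataFrame) -> pd.DataFrame:
    """Attach all-purchase and single-purchase labels from eval_df to histories from history_df."""
    by_customer = eval_df.groupby('customer_id')['article_id_idx']
    labels_all = by_customer.agg(join_ids).rename('article_id_idx_all_purchases')
    labels_single = by_customer.first().astype(str).rename('article_id_idx_single_label')
    history = (
        history_df.groupby('customer_id')['article_id_idx']
        .agg(join_ids)
        .rename('article_id_idx_historical')
    )
    out = pd.concat([labels_all, labels_single], axis=1).join(history, how='left')
    # Customers unseen before the evaluation window get the 'no history' value
    out['article_id_idx_historical'] = out['article_id_idx_historical'].fillna(NO_HISTORY_ARTICLE_ID_IDX)
    return out.reset_index()

# Toy frames, each already sorted by customer and date
toy_train = pd.DataFrame({'customer_id': ['a', 'a'], 'article_id_idx': [10, 11]})
toy_dev = pd.DataFrame({'customer_id': ['a', 'b'], 'article_id_idx': [12, 20]})
toy_test = pd.DataFrame({'customer_id': ['a', 'b', 'b', 'c'], 'article_id_idx': [13, 21, 22, 30]})

toy_test_inputs = build_eval_inputs(pd.concat([toy_train, toy_dev], ignore_index=True), toy_test)
print(toy_test_inputs)
```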

#### Prepare submission data

In [58]:
submission_customers = pd.read_csv('data/sample_submission.csv', usecols=['customer_id'])
print(submission_customers.shape)
submission_customers.head()

(1371980, 1)


Unnamed: 0,customer_id
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...


In [68]:
all_historical_transactions = pd.read_csv('data/transactions_train.csv', dtype={'article_id': str})
all_historical_transactions['article_id_idx'] = all_historical_transactions['article_id'].map(article_id_to_idx)
all_historical_transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_id_idx
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,40181
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,10522
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,6389
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2,46306
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2,46307


In [69]:
all_historical_transactions.sort_values(['customer_id', 't_dat'], inplace=True)

In [70]:
all_historical_transactions['article_id_idx'] = all_historical_transactions['article_id_idx'].astype(str)

In [71]:
all_historical_transactions_by_cust = (
    all_historical_transactions[['customer_id', 'article_id_idx']]
    .groupby('customer_id')
    .agg({
        'article_id_idx': ','.join
    })
    .reset_index()
)

all_historical_transactions_by_cust.head()

Unnamed: 0,customer_id,article_id_idx
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,"29518,101,30329,50726,16005,16005,23998,65669,..."
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,"19335,33750,33993,8218,41026,19335,42628,41026..."
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,"40181,10522,40181,18199,59460,1471,1471,60255,..."
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,"64527,61177"
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,"32249,43444,51126,54463,2183,2183,20519,87478,..."


In [72]:
all_historical_transactions_by_cust.shape

(1362281, 2)

In [94]:
submission_inputs = submission_customers.merge(all_historical_transactions_by_cust, on='customer_id', how='left', indicator=True)

In [95]:
submission_inputs['_merge'].value_counts() / submission_inputs.shape[0]

both          0.992931
left_only     0.007069
right_only    0.000000
Name: _merge, dtype: float64

In [96]:
submission_inputs.head()

Unnamed: 0,customer_id,article_id_idx,_merge
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,"29518,101,30329,50726,16005,16005,23998,65669,...",both
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,"19335,33750,33993,8218,41026,19335,42628,41026...",both
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,"40181,10522,40181,18199,59460,1471,1471,60255,...",both
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,"64527,61177",both
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,"32249,43444,51126,54463,2183,2183,20519,87478,...",both


In [98]:
submission_inputs['article_id_idx'] = submission_inputs['article_id_idx'].fillna(NO_HISTORY_ARTICLE_ID_IDX)

In [99]:
submission_inputs.drop('_merge', axis=1, inplace=True)

In [100]:
submission_inputs.rename(columns={'article_id_idx': 'article_id_idx_historical'}, inplace=True)

In [101]:
# Create dummy label so Dataset can process it
submission_inputs['dummy_label'] = '999'

In [102]:
submission_inputs = submission_inputs.copy()[['customer_id', 'dummy_label', 'article_id_idx_historical']]
submission_inputs.head()

Unnamed: 0,customer_id,dummy_label,article_id_idx_historical
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,999,"29518,101,30329,50726,16005,16005,23998,65669,..."
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,999,"19335,33750,33993,8218,41026,19335,42628,41026..."
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,999,"40181,10522,40181,18199,59460,1471,1471,60255,..."
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,999,"64527,61177"
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,999,"32249,43444,51126,54463,2183,2183,20519,87478,..."


In [103]:
submission_inputs.to_csv('data/splits/submission_inputs.tsv',sep='\t', index=False)

#### Dataset

In [89]:
from itertools import cycle, islice

import torch
import torch.nn.functional as F
from torch.utils.data import IterableDataset, DataLoader

In [90]:
class FashionDatasetSingleLabel(IterableDataset):
    """Streams (input_seq, label) pairs from a TSV with columns [customer_id, label, input]."""

    def __init__(self, dataset_filepath, max_length, padding_value):

        # Store the path rather than an open file handle so that each pass
        # over the dataset (e.g. each epoch) starts from the top of the file
        self.dataset_filepath = dataset_filepath

        self.max_length = max_length

        self.padding_value = padding_value

    def process_label(self, label: str):

        return torch.tensor(int(label))

    def process_input(self, input_str: str, max_length, padding_value):

        input_tensor = torch.tensor([int(v) for v in input_str.split(',')])

        len_orig = len(input_tensor)

        if len_orig >= max_length:

            input_tensor = input_tensor[-max_length:]  # Take latest items

        else:

            # Right-pad short histories up to max_length
            num_pad = max_length - len_orig

            input_tensor = F.pad(input_tensor, (0, num_pad), value=padding_value)

        return input_tensor

    def parse_itr(self, dataset_itr):

        for line in dataset_itr:

            line_items = line.rstrip('\n').split('\t')  # [customer_id, label, input]

            label = self.process_label(line_items[1])

            input_seq = self.process_input(line_items[2], self.max_length, self.padding_value)

            yield input_seq, label

    def __iter__(self):

        # Re-open the file on every iteration and close it when exhausted
        with open(self.dataset_filepath, 'r') as dataset_itr:

            next(dataset_itr)  # skip header

            yield from self.parse_itr(dataset_itr)
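
To make the padding and truncation rule in `process_input` concrete, here is a plain-Python sketch of the same logic without tensors. The function name `pad_or_truncate` is hypothetical and is not used by the Dataset above.

```python
def pad_or_truncate(history: str, max_length: int, padding_value: int = 0):
    """Keep the latest `max_length` items, or right-pad shorter histories."""
    ids = [int(v) for v in history.split(',')]
    if len(ids) >= max_length:
        return ids[-max_length:]  # most recent purchases are at the end
    return ids + [padding_value] * (max_length - len(ids))

long_history = pad_or_truncate('40181,10522,40181,18199,59460', max_length=3)
short_history = pad_or_truncate('64527', max_length=3)
no_history = pad_or_truncate('1', max_length=3)  # '1' is the 'no history' value

print(long_history, short_history, no_history)
```

Every sequence comes out with the same length, which is what lets the `DataLoader` stack examples into a batch.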

In [67]:
dataset = FashionDatasetSingleLabel(dataset_filepath='data/splits/train_single_purchase_label_sample.tsv', max_length=1, padding_value=0)

In [68]:
loader = DataLoader(dataset, batch_size=4)

In [69]:
# pd.read_csv('data/splits/dev_single_purchase_label.tsv', sep='\t')

In [70]:
for data in loader:
 
    X, y = data
    print(X.shape)

torch.Size([4, 1])
torch.Size([4, 1])
torch.Size([2, 1])


In [39]:
for batch in islice(loader, 8):
    print(batch)