The evaluation set is defined as "predicted items a customer will buy in the next 7-day period after the training time period".

Given this, we use a _global temporal_ split: a fixed point in time, shared across all users, after which every interaction is held out for testing.

We take the last 7 days of the data as the test set and the 7 days before that as the validation (dev) set.
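
As a minimal sketch of the idea, the split can be written as one small function and tried on toy data. The function name `global_temporal_split`, its parameters, and the toy frame are illustrative assumptions, not part of this notebook; the windows are taken as half-open, `(start, end]`, so that each covers exactly 7 calendar days.

```python
import pandas as pd


def global_temporal_split(df, date_col="t_dat", window_days=7):
    """Split df into train/dev/test using cut-off dates shared by all users.

    Hypothetical helper for illustration only.
    """
    window = pd.Timedelta(days=window_days)
    test_end = df[date_col].max()
    test_start = test_end - window   # exclusive lower bound of the test window
    dev_start = test_start - window  # exclusive lower bound of the dev window

    test = df[(df[date_col] > test_start) & (df[date_col] <= test_end)]
    dev = df[(df[date_col] > dev_start) & (df[date_col] <= test_start)]
    train = df[df[date_col] <= dev_start]
    return train, dev, test


# Toy data (made up): one interaction per day over 23 consecutive days.
toy = pd.DataFrame({
    "t_dat": pd.date_range("2020-08-01", "2020-08-23", freq="D"),
    "customer_id": [f"c{i % 5}" for i in range(23)],
})
train, dev, test = global_temporal_split(toy)
```

On the toy frame every row lands in exactly one split, and the dev and test sets each hold the interactions of 7 distinct days.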

About 2% of the users in the sample submission do not appear in the training set (cold-start users). We check whether our splits contain a similar proportion of cold-start users. Note, however, that the customers in the sample submission have not necessarily made any purchases.

In [1]:
import os
import datetime

import pandas as pd
import numpy as np

In [2]:
os.chdir('..')

In [3]:
transactions_train_df = pd.read_csv('data/transactions_train.csv', dtype={'article_id': str})  # Make sure article_id is loaded as a string
print(transactions_train_df.shape)
transactions_train_df.head()

(30647963, 5)


,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0


In [4]:
transactions_train_df['t_dat'] = pd.to_datetime(transactions_train_df['t_dat'])

In [5]:
# The test window is (test_start, test_end]: the last 7 days of data
test_end = transactions_train_df['t_dat'].max()
test_start = test_end - datetime.timedelta(days=7)

test_start, test_end

(Timestamp('2020-08-16 00:00:00'), Timestamp('2020-08-23 00:00:00'))

In [6]:
dev_end = test_start
dev_start = dev_end - datetime.timedelta(days=7)

dev_start, dev_end

(Timestamp('2020-08-09 00:00:00'), Timestamp('2020-08-16 00:00:00'))

In [7]:
train_start = transactions_train_df['t_dat'].min()
train_end = dev_start

train_start, train_end

(Timestamp('2018-09-20 00:00:00'), Timestamp('2020-08-09 00:00:00'))

In [8]:
# Dev and test windows are half-open, (start, end], so no date falls into two splits
test_mask = transactions_train_df['t_dat'].between(test_start, test_end, inclusive='right')
dev_mask = transactions_train_df['t_dat'].between(dev_start, dev_end, inclusive='right')
train_mask = transactions_train_df['t_dat'].between(train_start, train_end, inclusive='both')

In [9]:
# Filter first, then copy, so the full frame is not duplicated three times
train_df = transactions_train_df[train_mask].copy()
dev_df = transactions_train_df[dev_mask].copy()
test_df = transactions_train_df[test_mask].copy()

In [10]:
assert train_df.shape[0] + dev_df.shape[0] + test_df.shape[0] == transactions_train_df.shape[0]
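
The row-count check confirms that no transaction was dropped or duplicated. A complementary check, sketched here under the assumption that each split is non-empty and has a datetime `t_dat` column, is that the splits are strictly ordered in time. The helper `check_temporal_order` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_temporal_order(train, dev, test, date_col="t_dat"):
    """Return True if every train date precedes every dev date,
    and every dev date precedes every test date.

    Hypothetical helper for illustration only.
    """
    train_before_dev = train[date_col].max() < dev[date_col].min()
    dev_before_test = dev[date_col].max() < test[date_col].min()
    return bool(train_before_dev and dev_before_test)


# Toy splits (made up) that respect the ordering.
toy_train = pd.DataFrame({"t_dat": pd.to_datetime(["2020-08-01", "2020-08-09"])})
toy_dev = pd.DataFrame({"t_dat": pd.to_datetime(["2020-08-10", "2020-08-16"])})
toy_test = pd.DataFrame({"t_dat": pd.to_datetime(["2020-08-17", "2020-08-23"])})

ordered = check_temporal_order(toy_train, toy_dev, toy_test)
# Swapping dev and test breaks the ordering.
swapped = check_temporal_order(toy_train, toy_test, toy_dev)
```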

In [11]:
# Proportion of dev set customers not in training set
len(
    set(dev_df['customer_id'].unique()) - 
    set(train_df['customer_id'].unique())
) / dev_df['customer_id'].nunique()

0.06473353579897657

In [12]:
# Proportion of test set customers not in training set
len(
    set(test_df['customer_id'].unique()) - 
    set(train_df['customer_id'].unique())
) / test_df['customer_id'].nunique()

0.06977888489098011
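
The two cells above repeat the same computation, so it could be factored into a small helper. The following is a sketch only: `cold_start_fraction` and the toy frames are illustrative names that do not appear elsewhere in this notebook, and `isin` is used in place of the set difference.

```python
import pandas as pd


def cold_start_fraction(eval_df, train_df, user_col="customer_id"):
    """Fraction of distinct users in eval_df that never appear in train_df.

    Hypothetical helper for illustration only.
    """
    eval_users = pd.Series(eval_df[user_col].unique())
    seen_in_train = eval_users.isin(train_df[user_col].unique())
    return float((~seen_in_train).mean())


# Toy data (made up): users "d" and "e" are unseen during training.
train_toy = pd.DataFrame({"customer_id": ["a", "b", "c"]})
eval_toy = pd.DataFrame({"customer_id": ["a", "a", "d", "e"]})

frac = cold_start_fraction(eval_toy, train_toy)  # 2 of 3 distinct users are new
```

With the notebook's frames, this would be called as `cold_start_fraction(dev_df, train_df)` and `cold_start_fraction(test_df, train_df)`.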

### Save splits

In [14]:
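# Create the output directory if it does not already exist
os.makedirs('data/splits', exist_ok=True)

train_df.to_parquet('data/splits/train.parquet', index=False)
dev_df.to_parquet('data/splits/dev.parquet', index=False)
test_df.to_parquet('data/splits/test.parquet', index=False)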