#### In this notebook, I try to answer two important questions: 
- Is the dataset properly sorted? 
- Do the training data and test data overlap?

Given the data size, even reading the entire test set is challenging. With RAPIDS cudf, we can process the data and get our answers in minutes. The notebook log shows the total running time of this notebook; on a computer with a faster disk, it can finish within 10 minutes!

#### TLDR
- Yes, the dataset is properly sorted by `customer_ID` and then by `S_2` (timestamp), so each customer's records occupy contiguous rows of the csv in chronological order.
- No, there is no overlap between the training and test data in terms of `customer_ID` or timestamp. Specifically, no `customer_ID` appears in both. The training data runs from `2017-03-01` to `2018-03-01`; the test data runs from `2018-04-01` to `2019-10-01`.

In [None]:
import cudf
import cuml
import cupy
from tqdm import tqdm

In [None]:
import os
path = '../input/amex-default-prediction'
os.listdir(path)

In [None]:
%%time
test = cudf.read_csv(f'{path}/test_data.csv', nrows=10)
cols = test.columns
test.head()

#### Read the file in chunks using cudf

You might wonder why reading the file takes so much fuss. It turns out that `read_csv` is less than ideal here because of its memory overhead: peak memory usage is so high that we have to read in chunks, even though we only need two columns.
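The chunk boundaries themselves are plain arithmetic. The sketch below is a CPU-only illustration, not the notebook's actual reader: `chunk_bounds` is a hypothetical helper that yields the `(skiprows, nrows)` pairs one would pass to `read_csv`, assuming `total_rows` counts the header line.

```python
def chunk_bounds(total_rows, chunk_rows):
    """Yield (skiprows, nrows) pairs that cover every data row exactly once.

    total_rows counts the header line, so the data rows are lines
    1 .. total_rows - 1 of the file.
    """
    skip = 1  # line 0 is the header, so start right after it
    while skip < total_rows:
        nrows = min(chunk_rows, total_rows - skip)
        yield skip, nrows
        skip += nrows


# Hypothetical tiny file: 1 header line + 10 data rows, read 4 rows at a time.
bounds = list(chunk_bounds(total_rows=11, chunk_rows=4))
print(bounds)  # [(1, 4), (5, 4), (9, 2)]
```

Stopping as soon as `skip` reaches `total_rows` means no empty trailing chunk is requested when the number of data rows is an exact multiple of `chunk_rows`.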

In [None]:
%%time

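def read_csv_iter(path, total_rows, chunk_rows, all_cols, use_cols):
    # total_rows counts the header line; start at 1 so the header is skipped
    sofar = 1
    ts = []
    for i in tqdm(range(total_rows//chunk_rows+1)):
        # rows still unread, capped at the chunk size
        nr = min(chunk_rows, total_rows - sofar)
        if nr <= 0:
            break  # nothing left when the data rows divide evenly into chunks
        # use the all_cols argument (not the global `cols`) to name the columns
        t = cudf.read_csv(path, header=None, names=all_cols, nrows=nr, skiprows=sofar, usecols=use_cols)
        sofar += nr
        ts.append(t)
    # ignore_index gives the result a clean 0..n-1 index across chunks
    return cudf.concat(ts, ignore_index=True)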

In [None]:
%%time

test_rows = 11363762

# please note that the total_rows here includes the header row of the csv
test = read_csv_iter(f'{path}/test_data.csv', total_rows=test_rows+1, 
                     chunk_rows=4_000_000, all_cols=cols, 
                     use_cols=['customer_ID','S_2'])
print(test.shape)
test.head()

In [None]:
%%time

train_rows = 5531451

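# please note that the total_rows here includes the header row of the csv
# `cols` was read from test_data.csv; it is reused here on the assumption
# that train_data.csv has the same columns
train = read_csv_iter(f'{path}/train_data.csv', total_rows=train_rows+1,
                      chunk_rows=4_000_000, all_cols=cols,
                      use_cols=['customer_ID','S_2'])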
print(train.shape)
train.head()

#### Check if the data is sorted

Why do we care whether the data is sorted?

Because of the data size, it is very likely that the data and the model won't fit in system or GPU memory. Hence it is desirable to process the data and train in a streaming/batching fashion. If the data is properly sorted, the records of each customer sit in contiguous rows of the csv file, which makes batching much easier.
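To illustrate why contiguous rows help, here is a minimal CPU-only sketch using the standard library. The toy `rows` and the `stream_customers` helper are hypothetical and are not part of this notebook's pipeline.

```python
from itertools import groupby
from operator import itemgetter

# Toy rows (customer_ID, S_2), already grouped by customer and ordered by date.
rows = [
    ("a", "2017-03-01"), ("a", "2017-04-01"),
    ("b", "2017-03-15"),
    ("c", "2017-03-02"), ("c", "2017-04-02"), ("c", "2017-05-02"),
]


def stream_customers(sorted_rows):
    """Yield (customer_ID, [dates]) one customer at a time, in a single pass."""
    # groupby only merges *adjacent* equal keys, so it relies on contiguity.
    for cid, group in groupby(sorted_rows, key=itemgetter(0)):
        yield cid, [s2 for _, s2 in group]


histories = list(stream_customers(rows))
for cid, dates in histories:
    print(cid, dates)
```

Because each customer's history is complete as soon as the key changes, only one customer's rows need to be held in memory at a time. If a customer's rows were scattered through the file, the same loop would emit that customer more than once with partial histories.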

So how do we check? We simply sort the dataframe in the desired order and compare the row indices before and after. If they are equal, the data is already sorted that way.
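The same idea can be sketched without a GPU. This is an illustrative stand-in rather than the cudf version: the helper name is hypothetical, customers are coded by order of first appearance (an assumption of this sketch), and dates are assumed to be ISO `YYYY-MM-DD` strings so that string order equals chronological order.

```python
def is_sorted_by_customer_then_time(rows):
    """Return True if sorting by (customer code, date) leaves row order unchanged."""
    codes = {}   # customer_ID -> integer code, assigned at first appearance
    keyed = []
    for pos, (cid, s2) in enumerate(rows):
        code = codes.setdefault(cid, len(codes))
        keyed.append((code, s2, pos))  # pos is the original row index
    # Row indices after sorting by customer code, then date.
    order = [pos for _, _, pos in sorted(keyed)]
    return order == list(range(len(rows)))


rows_ok = [("a", "2017-03-01"), ("a", "2017-04-01"), ("b", "2017-03-15")]
rows_split = [("a", "2017-03-01"), ("b", "2017-03-15"), ("a", "2017-04-01")]
rows_time = [("a", "2017-04-01"), ("a", "2017-03-01")]

print(is_sorted_by_customer_then_time(rows_ok))     # True
print(is_sorted_by_customer_then_time(rows_split))  # False: customer "a" is not contiguous
print(is_sorted_by_customer_then_time(rows_time))   # False: dates out of order
```

With first-appearance codes, the check passes exactly when each customer's rows are contiguous and in chronological order, which is the property that matters for batching.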

In [None]:
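def check_sorted(df):
    # remember each row's original position
    df['row_id'] = cupy.arange(df.shape[0])
    # map each customer_ID string to an integer code
    df['cid'],_ = df.customer_ID.factorize()
    df_sort = df.sort_values(['cid','S_2'])
    # compare by position rather than by index label, so the two Series
    # are not re-aligned on their index before the comparison
    return bool((df['row_id'].values == df_sort['row_id'].values).all())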

In [None]:
%%time

check_sorted(test)

In [None]:
%%time

check_sorted(train)

Great! Both train and test data are properly sorted.

#### Check if train and test overlap

To set up a robust validation scheme, we need to understand how train and test are split. Specifically, we want to check `customer_ID` and `S_2` (timestamp).

First, let's check whether the timestamps overlap between train and test.

In [None]:
%%time
train['S_2'] = cudf.to_datetime(train['S_2'], format='%Y-%m-%d')
test['S_2'] = cudf.to_datetime(test['S_2'], format='%Y-%m-%d')

In [None]:
%%time
train['S_2'].min(), train['S_2'].max()

In [None]:
%%time
test['S_2'].min(), test['S_2'].max()

Nope, test is the future and train is the past. So we have a forecasting problem.
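Reading the answer off the two min/max pairs amounts to an interval-overlap test. A minimal sketch, with a hypothetical `ranges_overlap` helper and the date ranges taken from the TLDR above:

```python
from datetime import date


def ranges_overlap(a_min, a_max, b_min, b_max):
    """Return True if the closed intervals [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max


train_range = (date(2017, 3, 1), date(2018, 3, 1))
test_range = (date(2018, 4, 1), date(2019, 10, 1))

print(ranges_overlap(*train_range, *test_range))  # False
```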

In [None]:
%%time
train_ids = train.customer_ID.unique()
test_ids = test.customer_ID.unique()

In [None]:
mask = train_ids.isin(test_ids)
mask.sum()/train_ids.shape[0]

In [None]:
mask = test_ids.isin(train_ids)
mask.sum()/test_ids.shape[0]

No shared customer IDs. Very challenging!
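For reference, the overlap fraction computed above can be sketched with plain Python sets. The `overlap_fraction` helper and the toy IDs are hypothetical; the cudf cells use `unique()` and `isin()` instead.

```python
def overlap_fraction(ids, other_ids):
    """Fraction of the unique values in ids that also occur in other_ids."""
    unique_ids = set(ids)
    if not unique_ids:
        return 0.0
    return len(unique_ids & set(other_ids)) / len(unique_ids)


toy_train_ids = ["a", "b", "c", "a"]
toy_test_ids = ["d", "e"]

print(overlap_fraction(toy_train_ids, toy_test_ids))  # 0.0
print(overlap_fraction(["a", "b"], ["b", "c"]))       # 0.5
```

A fraction of 0 in both directions means the two sets of customers are disjoint.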