<h1>Riiid AIEd Challenge 2020 - Part II new</h1>

Due to memory/time restrictions in this competition, work is divided into several parts (kernels):
<ul>
    <li>Part I - Memory optimization</li>
    <li>Part II - Splitting data</li>
    <li>Part III - Feature engineering</li>
    <li>Part IV - Training and validation</li>
    <li>Part V - Prediction and submission</li>
</ul>

This is Part II. In this part I'll:
<ul>
    <li>Divide the competition data into two parts. The first part, which I'll call <code>past_data</code>, will be used to create features for training. The second part will be the test dataset, designed to resemble the competition test set.</li>
    <li>Use a small part of <code>past_data</code> to create folds of training/validation data.</li>
    <li>Save everything in pickle format for use in the next part.</li>
</ul>

In [None]:
# Imports

import os
import pandas as pd
import numpy as np
import pickle
import gc

In [None]:
# Define directories used

DATA_DIR = '/kaggle/input/riiid-test-answer-prediction'
PART_I_OUTPUT_DIR = '/kaggle/input/riiid-aied-part-i'
WORKING_DIR = '/kaggle/working'

In [None]:
%%time

# Read competition data
competition_data = pd.read_pickle(os.path.join(PART_I_OUTPUT_DIR, 'competition_data.pkl'))
competition_data.head()

In [None]:
competition_data.info()

<h2>Sort competition data by date</h2>

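For this, I follow the approach in this <a href='https://www.kaggle.com/its7171/cv-strategy'>notebook</a> by tito. Timestamps in the competition data are measured from each user's first interaction, so they cannot be compared across users directly. Each user is therefore given a random start offset, which places all interactions on a common virtual timeline.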

In [None]:
%%time

# Create a dataframe with max timestamp per user
timestamps_df = competition_data.groupby('user_id')['timestamp'].max().reset_index()
timestamps_df.columns = ['user_id', 'max_timestamp']

# Calculate maximum timestamp of all users
MAX_TIMESTAMP = timestamps_df['max_timestamp'].max()

print(f'max timestamp = {MAX_TIMESTAMP}')

In [None]:
%%time

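# Give each user a random start offset between 0 and MAX_TIMESTAMP - (user's max timestamp),
# so that the user's last interaction never falls after MAX_TIMESTAMP

# Seed the generator so the offsets (and therefore the split) are reproducible
np.random.seed(42)

def random_start(max_timestamp):
    # randint's upper bound is exclusive, hence the + 1
    return np.random.randint(0, high=MAX_TIMESTAMP - max_timestamp + 1)

timestamps_df['random_start'] = timestamps_df.max_timestamp.apply(random_start)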
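
Calling <code>np.random.randint</code> once per user through <code>apply</code> can be slow when there are many users. A possible alternative, sketched here on toy data, draws all offsets in a single call with NumPy's <code>Generator.integers</code>, which accepts an array of upper bounds. The <code>toy_timestamps</code> frame, its values, and the seed are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy per-user maximum timestamps (hypothetical values, not competition data)
toy_timestamps = pd.DataFrame({
    'user_id': [1, 2, 3],
    'max_timestamp': [100, 400, 1000],
})
toy_max = toy_timestamps['max_timestamp'].max()

# Seeded generator so the offsets are the same on every run
rng = np.random.default_rng(42)

# One draw per user from [0, toy_max - max_timestamp]; the upper bound passed
# to integers() is exclusive, hence the + 1
high = (toy_max - toy_timestamps['max_timestamp'] + 1).to_numpy()
toy_timestamps['random_start'] = rng.integers(0, high)

# Last interaction of each user on the shifted timeline
toy_timestamps['virtual_end'] = (
    toy_timestamps['random_start'] + toy_timestamps['max_timestamp']
)
print(toy_timestamps)
```

The user whose history already spans the full range can only receive an offset of 0, and no user's last interaction ends up beyond the global maximum.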

In [None]:
%%time

# Join competition data with this new information about users

competition_data = competition_data.merge(timestamps_df, on='user_id', how='left')

del timestamps_df
_ = gc.collect()

In [None]:
# Calculate the virtual timestamp of every interaction

competition_data['virtual_timestamp'] = competition_data['random_start'] + competition_data['timestamp']

# Free memory
competition_data.drop(columns=['max_timestamp', 'random_start'], inplace=True)

_ = gc.collect()

In [None]:
%%time

# Sort the competition_data by virtual_timestamp

competition_data = competition_data.sort_values(by='virtual_timestamp', ascending=True).reset_index(drop=True)

_ = gc.collect()
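
To make the effect of the virtual timeline concrete, here is a minimal sketch on toy data. It shifts each user's timestamps by a fixed offset, sorts by the result, and shows that users become interleaved while each user's own interactions stay in order. The <code>toy</code> frame, the <code>row_id</code> column, and the offsets are hypothetical. The sketch also passes <code>kind='mergesort'</code> (a stable sort) so that rows sharing a virtual timestamp keep their original relative order; that is a choice of this sketch, not something the cell above does.

```python
import pandas as pd

# Toy interactions: timestamps are relative to each user's first interaction
toy = pd.DataFrame({
    'row_id':    [0, 1, 2, 3, 4],
    'user_id':   [1, 1, 1, 2, 2],
    'timestamp': [0, 50, 50, 0, 30],
})

# Hypothetical start offsets, one per user
offsets = {1: 20, 2: 60}
toy['virtual_timestamp'] = toy['user_id'].map(offsets) + toy['timestamp']

# Stable sort: ties on virtual_timestamp keep their original row order
toy_sorted = (
    toy.sort_values(by='virtual_timestamp', kind='mergesort')
       .reset_index(drop=True)
)
print(toy_sorted)
```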

<h2>Split data into past data and test data</h2>

In [None]:
# Create test set as the last 100K rows (the most recent on the virtual timeline)
test = competition_data.iloc[-100000:].copy()

# Create past_data from the remaining rows (needed for feature creation).
# test still carries its original index here, so drop() removes exactly those rows.
past_data = competition_data.drop(index=test.index).reset_index(drop=True)

# Save test data with a clean index
test = test.reset_index(drop=True)
test.to_pickle(os.path.join(WORKING_DIR, 'test.pkl'))

# Save past_data
past_data.to_pickle(os.path.join(WORKING_DIR, 'past_data.pkl'))

# Create train/validation data as the last 10M rows of past_data
train_val = past_data.iloc[-10000000:].reset_index(drop=True)

del past_data
del test
del competition_data

_ = gc.collect()
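
The split relies on the data being sorted by <code>virtual_timestamp</code>, so that "last rows" means "most recent rows". A minimal sketch of the same three-way slicing on a toy frame, with a couple of sanity checks, might look like this. The frame and the row counts <code>TEST_ROWS</code> and <code>TRAIN_VAL_ROWS</code> are hypothetical stand-ins for the real sizes.

```python
import pandas as pd

# Toy frame, already sorted by virtual_timestamp
toy = pd.DataFrame({'virtual_timestamp': range(10)})

TEST_ROWS = 3       # hypothetical stand-in for the test set size
TRAIN_VAL_ROWS = 4  # hypothetical stand-in for the train/validation size

# Positional slices: the tail is the test set, everything before it is past data
toy_test = toy.iloc[-TEST_ROWS:].reset_index(drop=True)
toy_past = toy.iloc[:-TEST_ROWS].reset_index(drop=True)

# Train/validation data is the most recent slice of the past data
toy_train_val = toy_past.iloc[-TRAIN_VAL_ROWS:].reset_index(drop=True)

# Sanity checks: no rows lost, and the test set lies strictly after the past data
assert len(toy_past) + len(toy_test) == len(toy)
assert toy_past['virtual_timestamp'].max() < toy_test['virtual_timestamp'].min()
```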

<h2>Create train/validation folds</h2>

In [None]:
# Split users in train_val into 4 groups, each one of which will be used to create a cross-validation fold.

NUM_FOLDS = 4

np.random.seed(42)

user_ids = train_val.user_id.unique()
np.random.shuffle(user_ids)
user_groups = np.array_split(user_ids, NUM_FOLDS)

for i, group in enumerate(user_groups):
    train_val_fold = train_val.loc[train_val.user_id.isin(group)]
    
    # The last 500K rows are for the validation set and the rest for the training set
    train = train_val_fold.iloc[:-500000].reset_index(drop=True)
    val = train_val_fold.iloc[-500000:].reset_index(drop=True)
    
    print(f'train_{i}.shape={train.shape}, val_{i}.shape={val.shape}')
    
    # Save everything
    train.to_pickle(os.path.join(WORKING_DIR, f'train_{i}.pkl'))
    val.to_pickle(os.path.join(WORKING_DIR, f'val_{i}.pkl'))
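
The fold construction can be checked on a small example. The sketch below, which assumes a toy frame and much smaller constants than the notebook uses, repeats the same steps: shuffle the users, split them into groups, and within each group hold out the most recent rows for validation. All names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

TOY_NUM_FOLDS = 2  # hypothetical, smaller than the notebook's 4
TOY_VAL_ROWS = 2   # hypothetical stand-in for the validation size

# Toy data, already sorted by virtual_timestamp
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'virtual_timestamp': range(12),
})

# Shuffle users with a seeded generator and split them into groups
rng = np.random.default_rng(42)
toy_user_ids = toy['user_id'].unique()
rng.shuffle(toy_user_ids)
toy_groups = np.array_split(toy_user_ids, TOY_NUM_FOLDS)

# Within each group, the last rows are validation and the rest is training
toy_folds = []
for group in toy_groups:
    fold = toy.loc[toy['user_id'].isin(group)]
    toy_folds.append((fold.iloc[:-TOY_VAL_ROWS], fold.iloc[-TOY_VAL_ROWS:]))

for i, (toy_train, toy_val) in enumerate(toy_folds):
    print(f'fold {i}: train rows={len(toy_train)}, val rows={len(toy_val)}')
```

Users are disjoint between folds, but inside a fold the same users can appear in both the training and the validation part, with the validation rows always being the later ones.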

In [None]:
train.head()

That's all, folks!