# Jane Street Market Prediction - Data Conversion
This notebook takes the data from the __[Jane Street Market Prediction](https://www.kaggle.com/c/jane-street-market-prediction)__ competition and converts it into a dataset suitable for TPUs.

In [None]:
import json
import os
from shutil import make_archive

import numpy as np
import pandas as pd
import tensorflow as tf

# create a temporary folder to create the dataset in
temp = os.path.join(os.pardir, "temp", "tempdata")
os.makedirs(temp, exist_ok=True)

Settings for the dataset.

In [None]:
# number of days available
TOTAL_DAYS = 500

# number of days to put in each tf record
# one day corresponds to ~3 MB on disk
DAYS_PER_FILE = 20

# number of folds to split the data into
FOLDS = 5

# impute missing values with this value
# EDA showed that none of the features are ever very
# negative (<= -10), so we impute missing values with -100
NAN_VALUE = -100.0
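
As a quick sanity check on these settings, the sketch below works out how many days and files each fold receives. The name `MB_PER_DAY` is introduced here only for illustration and reuses the rough ~3 MB per day figure from the comment above.

```python
TOTAL_DAYS = 500
DAYS_PER_FILE = 20
FOLDS = 5
MB_PER_DAY = 3  # hypothetical name; rough per-day size quoted in the settings

# days in each fold, and files in each fold
days_per_fold, fold_remainder = divmod(TOTAL_DAYS, FOLDS)
files_per_fold, file_remainder = divmod(days_per_fold, DAYS_PER_FILE)

# with floor division, any non-zero remainder means some days
# would never be written to a tf record
assert fold_remainder == 0 and file_remainder == 0

total_files = FOLDS * files_per_fold
approx_file_mb = DAYS_PER_FILE * MB_PER_DAY
print(days_per_fold, files_per_fold, total_files, approx_file_mb)
```

If the settings are changed so that the divisions are no longer exact, the assertion fails early instead of trailing days being dropped silently.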

Read the CSV file, convert 64-bit float columns to 32-bit floats, and impute missing values. Then store a dictionary translating column names to indices in `columns.json`. Finally, initialize the stats dictionary with an entry storing `NAN_VALUE`.

In [None]:
comp_folder = os.path.join(os.pardir, "input", "jane-street-market-prediction")
df = pd.read_csv(os.path.join(comp_folder, "train.csv"))
df = df.astype({c: np.float32 for c in df.select_dtypes(include="float64").columns})
df.fillna(NAN_VALUE, inplace=True)

columns = {col: ix for (ix, col) in enumerate(df.columns)}
with open(os.path.join(temp, "columns.json"), "w") as file:
    json.dump(columns, file)

stats = {"nan_value": NAN_VALUE}
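
To illustrate what this cell does, here is a minimal sketch on a toy frame. The frame `toy` and its two columns are made up for the example; only the conversion, imputation, and name-to-index steps mirror the cell above.

```python
import numpy as np
import pandas as pd

# toy stand-in for train.csv with one missing value
toy = pd.DataFrame({"date": [0, 0, 1], "feature_1": [0.5, np.nan, -1.25]})

# same steps as above: float64 -> float32, then impute
toy = toy.astype({c: np.float32 for c in toy.select_dtypes(include="float64").columns})
toy = toy.fillna(-100.0)

# name -> position mapping, as stored in columns.json
columns = {col: ix for ix, col in enumerate(toy.columns)}

# once rows are plain float arrays, the mapping recovers a column by name
row = toy.to_numpy(dtype=np.float32)[1]
value = row[columns["feature_1"]]
```

The mapping matters because the tf records store each row as a bare float vector, so column names are only recoverable through `columns.json`.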

Write the data into `FOLDS` directories of tf records, each record file containing `DAYS_PER_FILE` days' worth of data. As we process each fold, we calculate the number of samples, mean, and variance of the data in the remaining folds and store them in the stats dictionary.

In [None]:
days_per_fold = TOTAL_DAYS // FOLDS
files_per_fold = TOTAL_DAYS // (FOLDS * DAYS_PER_FILE)

for fold in range(FOLDS):
    # make a directory for files in this fold
    os.makedirs(os.path.join(temp, f"fold{fold}"), exist_ok=True)
    
    # split data into data for this fold and remainder
    # (boolean mask over rows; `between` is inclusive at both ends)
    fold_rows = df["date"].between(fold * days_per_fold, (fold + 1) * days_per_fold - 1)
    fold_df, rest_df = df[fold_rows], df[~fold_rows]
    
    # store the statistics of the remaining data
    stats[fold] = {"length": len(rest_df),
                   "mean": dict(rest_df.mean()),
                   "variance": dict(rest_df.var())
                  }

    # write the days for this fold into tf records
    for file in range(files_per_fold):
        first = fold * days_per_fold + file * DAYS_PER_FILE
        last = first + DAYS_PER_FILE - 1
        file_df = fold_df[fold_df["date"].between(first, last)]
                
        # convert the corresponding part of the data frame to tensor
        tensor = tf.convert_to_tensor(file_df, dtype=tf.float32)

        # convert the tensor to a TF dataset
        ds = tf.data.Dataset.from_tensor_slices(tensor)

        # serialize the tensors in the data set
        ds = ds.map(tf.io.serialize_tensor)

        # write the serialized data to TF record
        record_path = os.path.join(temp, f"fold{fold}", f"{file}.tfrec")
        record = tf.data.experimental.TFRecordWriter(record_path)
        record.write(ds)
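
Reading the records back is not part of this notebook, but a minimal sketch could look like the following. The helper names `parse_row` and `load_records` are hypothetical, and the sketch assumes each record holds one serialized float32 row whose length equals the number of entries in `columns.json`.

```python
import tensorflow as tf


def parse_row(serialized, n_columns):
    # inverse of tf.io.serialize_tensor used when writing
    row = tf.io.parse_tensor(serialized, out_type=tf.float32)
    # parse_tensor loses the static shape, so restore it
    return tf.ensure_shape(row, [n_columns])


def load_records(file_pattern, n_columns):
    # e.g. file_pattern = "fold0/*.tfrec"
    files = sorted(tf.io.gfile.glob(file_pattern))
    ds = tf.data.TFRecordDataset(files)
    return ds.map(lambda s: parse_row(s, n_columns))
```

Restoring the static shape is a design choice here: downstream batching and model code generally need a known row length.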

Write the stats dictionary to `stats.json` and compress the entire dataset into a zip archive in the working (output) directory.

In [None]:
with open(os.path.join(temp, "stats.json"), "w") as file:
    json.dump(stats, file)

_ = make_archive(os.path.join(os.curdir, "data"), "zip", temp)
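
A consumer of `stats.json` might use the stored statistics to standardize a feature, as in the sketch below. The helper `standardize` and the toy values are assumptions made for illustration. Two details follow from the code above: JSON turns the integer fold keys into strings, and the means and variances were computed after imputation, so they include the `NAN_VALUE` fill values.

```python
import json

import numpy as np


def standardize(values, column, fold_stats):
    # hypothetical helper: (x - mean) / std using the stored statistics
    mean = fold_stats["mean"][column]
    std = np.sqrt(fold_stats["variance"][column])
    return (np.asarray(values, dtype=np.float32) - mean) / std


# toy stand-in for the stats dictionary built above
toy_stats = {
    "nan_value": -100.0,
    0: {"length": 4, "mean": {"feature_1": 1.0}, "variance": {"feature_1": 4.0}},
}

# round-trip through JSON, as stats.json would be read back
loaded = json.loads(json.dumps(toy_stats))

# the integer key 0 is now the string "0"
fold_stats = loaded["0"]
scaled = standardize([1.0, 3.0], "feature_1", fold_stats)
```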