# Jane Street Market Prediction - Data Conversion
This notebook takes the data from the __[Jane Street Market Prediction](https://www.kaggle.com/c/jane-street-market-prediction)__ competition and converts it into a dataset suitable for TPUs.

In [None]:
import json
import os
from shutil import make_archive

import numpy as np
import pandas as pd
import tensorflow as tf

Decide how many days to put in each file and what percentage of files should be put in the train/valid/test folders respectively. One day corresponds to about 3 MB of data.

In [None]:
DAYS_PER_FILE = 5
SPLIT = {"train": 80, "valid": 15, "test": 5}
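
As a rough check of how these settings play out, the sketch below counts how many files land in each folder. It assumes 500 days of data in total (the figure hard-coded in the sharding cell further down) and reuses the same threshold comparisons. `TOTAL_DAYS` and `folder_for` are names introduced here for illustration only.

```python
DAYS_PER_FILE = 5
SPLIT = {"train": 80, "valid": 15, "test": 5}
TOTAL_DAYS = 500  # assumption: same value as the hard-coded 500 in the sharding cell


def folder_for(file, files, split):
    """Return the folder for a file index, using cumulative percentage thresholds."""
    if file < 0.01 * split["train"] * files:
        return "train"
    if file < 0.01 * (split["train"] + split["valid"]) * files:
        return "valid"
    return "test"


files = TOTAL_DAYS // DAYS_PER_FILE
counts = {"train": 0, "valid": 0, "test": 0}
for file in range(files):
    counts[folder_for(file, files, SPLIT)] += 1

print(files, counts)
```

Because the folders are assigned by file index, the split is chronological: the earliest days go to `train` and the latest to `test`.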

Create the temporary directories for the dataset files.

In [None]:
temp = os.path.join(os.pardir, "temp", "tempdata")
os.makedirs(os.path.join(temp, "train"), exist_ok=True)
os.makedirs(os.path.join(temp, "valid"), exist_ok=True)
os.makedirs(os.path.join(temp, "test"), exist_ok=True)

Read the training CSV file, convert the 64-bit float columns to 32-bit floats, and impute missing values with the median of each feature.

In [None]:
comp_folder = os.path.join(os.pardir, "input", "jane-street-market-prediction")
df = pd.read_csv(os.path.join(comp_folder, "train.csv"))
df = df.astype({c: np.float32 for c in df.select_dtypes(include="float64").columns})
df.fillna(df.median(), inplace=True)

Store a dictionary translating column names to indices in `columns.json`.

In [None]:
columns = {col: ix for (ix, col) in enumerate(df.columns)}
with open(os.path.join(temp, "columns.json"), "w") as file:
    json.dump(columns, file)
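
The mapping is needed later because the serialized rows carry no column names. A minimal sketch of the round trip follows, using a short made-up column list rather than the real `df.columns`; the file is written to a throwaway directory, and `example_columns` and `row` are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical stand-in for df.columns; the real list is much longer.
example_columns = ["date", "weight", "resp", "feature_0", "ts_id"]
columns = {col: ix for ix, col in enumerate(example_columns)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "columns.json")
    # write the name -> index mapping
    with open(path, "w") as fh:
        json.dump(columns, fh)
    # read it back, as a consumer of the dataset would
    with open(path) as fh:
        loaded = json.load(fh)

# a made-up row in the same column order
row = [0.0, 1.5, 0.01, -0.3, 42.0]
weight = row[loaded["weight"]]
print(weight)
```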

Split the dataframe into shards of `DAYS_PER_FILE` days and write each shard to a TFRecord file in the temporary directory structure.

In [None]:
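# the training data covers 500 days (dates 0 to 499); if DAYS_PER_FILE does
# not divide 500 evenly, the leftover days at the end are not written
files = 500 // DAYS_PER_FILE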
for file in range(files):
    # determine which days to put in this file
    days = range(DAYS_PER_FILE * file, DAYS_PER_FILE * (file + 1))
    
    # convert the corresponding part of the data frame to tensor
    tensor = tf.convert_to_tensor(df[df["date"].isin(days)], dtype=tf.float32)

    # convert the tensor to a TF dataset
    ds = tf.data.Dataset.from_tensor_slices(tensor)
    
    # serialize the tensors in the data set
    ds = ds.map(tf.io.serialize_tensor)
    
    # decide on a folder for this file
    if file < 0.01 * SPLIT["train"] * files:
        folder = "train"
    elif file < 0.01 * (SPLIT["train"] + SPLIT["valid"]) * files:
        folder = "valid"
    else:
        folder = "test"
    
    # write the serialized data to TF record
    record_path = os.path.join(temp, folder, f"{file}.tfrec")
    record = tf.data.experimental.TFRecordWriter(record_path)
    record.write(ds)
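
Each record written this way holds one serialized row tensor, so a reader has to undo the serialization with `tf.io.parse_tensor`. The sketch below shows that round trip on a tiny made-up array. It writes with `tf.io.TFRecordWriter` instead of the experimental writer used above, and the file location, `rows`, and `parse` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Made-up data standing in for one shard: 4 rows with 3 columns each.
rows = np.arange(12, dtype=np.float32).reshape(4, 3)
path = os.path.join(tempfile.mkdtemp(), "0.tfrec")

# Write one serialized tensor per record.
with tf.io.TFRecordWriter(path) as writer:
    for row in rows:
        writer.write(tf.io.serialize_tensor(tf.constant(row)).numpy())


def parse(record):
    # out_type must match the dtype used when the tensor was serialized
    return tf.io.parse_tensor(record, out_type=tf.float32)


ds = tf.data.TFRecordDataset(path).map(parse)
restored = np.stack([t.numpy() for t in ds])
print(restored.shape)
```

`tf.io.parse_tensor` does not know the row shape statically, so a training pipeline would likely need to set it (for example with `tf.ensure_shape`) before batching for a TPU.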

Compress the temporary dataset directory into a zip archive in the output directory.

In [None]:
arc = make_archive(os.path.join(os.curdir, "data"), "zip", temp)
print(f"Data written to {arc}")