![logo](https://www.usu.edu/degrees/images/large/mathematics2.jpg)

# Jane Street Data Preprocessing
The steps taken here are:
1. Read in data
2. Filter the data to use
3. Calculate mean, median, skew etc. for each feature
4. Create mean vector for use during imputation
5. Impute NaNs
6. Apply StandardScaler
7. Pickle training data and other parameters needed for training and inference

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import skew, normaltest
from sklearn.preprocessing import StandardScaler

## Read data and filter
Many notebooks filter out data from date 85 and earlier because those data show different properties than the later data. It may be worth training a model on all the data as well. Rows with zero weight are also dropped.

In [None]:
train_all = pd.read_csv('../input/jane-street-market-prediction/train.csv')
# Keep rows after date 85 with non-zero weight, then rebuild a contiguous index
train_all = train_all[(train_all.date > 85) & (train_all['weight'] != 0)].reset_index(drop=True)

## Calculate mean, median, skew etc for each feature
Here we drop all NaNs and compute a few statistical properties for each column.

In [None]:
par = []
for i in range(130):
    df = train_all['feature_'+str(i)].dropna()
    # Store |skew| (only its magnitude is used later) and the normaltest p-value
    par.append([i, np.mean(df), np.median(df), df.mode()[0], np.abs(skew(df)), normaltest(df)[1]])
dfp = pd.DataFrame(par, columns=['feature', 'mean', 'median', 'mode', 'skew', 'normaltest'])

In [None]:
dfp.describe()

The skew tells us whether the values are distributed symmetrically (skew close to zero) or asymmetrically. If the distribution is symmetric, it makes sense to use the mean during imputation (filling in the NaNs). Otherwise it may be better to use the median or the mode (the most common value).
Note: the normaltest shows that none of the features are normally distributed, as expected.
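
To illustrate why skew matters for the choice of fill value, here is a small sketch on synthetic data (the two samples are made up for illustration and are not taken from the competition data). In the symmetric sample the mean and median nearly coincide; in the right-skewed sample the long tail pulls the mean well above the median.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Symmetric sample: mean and median nearly coincide.
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Right-skewed sample: a long tail pulls the mean above the median.
skewed = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)

for name, x in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name:>9}: skew={skew(x):7.2f}  mean={np.mean(x):6.2f}  median={np.median(x):6.2f}")
```

For a strongly skewed feature, filling NaNs with the mean would place the imputed rows away from where most of the observed values lie, which is the motivation for falling back to the median.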

## Create mean vector
We can use the mean values only, or a combination of mean, median and mode values based on skew. Here we define two thresholds that select the fill value per feature: the mean if the skew is at most `MEAN_TH`, the mode if the skew exceeds `MODE_TH`, and the median in between. Strictly speaking, we should calculate these values from the training set only, after the train/test split, to avoid data leakage from the test/validation set into the mean vector.

In [None]:
MEAN_TH = 1.25
MODE_TH = 100. # must be >= MEAN_TH

def get_mmm(dfm):
    fmean = np.zeros(130)
    for i in range(130):
        if dfm['skew'][i] <= MEAN_TH:
            fmean[i] = dfm['mean'][i]  # use mean value
        elif dfm['skew'][i] > MODE_TH:
            fmean[i] = dfm['mode'][i] # use mode value
        else:
            fmean[i] = dfm['median'][i] # use median value
    return fmean

f_mean = get_mmm(dfp)
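
The leakage concern mentioned above could be addressed by computing the statistics on the training part only. The following is a minimal sketch on a toy frame, assuming a time-based split; `feature_stats`, `SPLIT_DATE` and the toy data are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def feature_stats(df, cols):
    """Per-column mean, median, mode and |skew|, ignoring NaNs."""
    rows = []
    for c in cols:
        s = df[c].dropna()
        rows.append([c, s.mean(), s.median(), s.mode()[0], abs(skew(s))])
    return pd.DataFrame(rows, columns=["feature", "mean", "median", "mode", "skew"])

# Toy frame standing in for train_all (synthetic data, two features).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": np.repeat(np.arange(100), 10),
    "feature_1": rng.normal(size=1000),
    "feature_2": rng.lognormal(sigma=1.5, size=1000),
})
toy.loc[toy.index % 7 == 0, "feature_2"] = np.nan  # inject some missing values

# Time-based split: earlier dates for fitting, later dates held out.
SPLIT_DATE = 80  # assumed cut-off, chosen only for this example
fit_part = toy[toy["date"] < SPLIT_DATE]
held_out = toy[toy["date"] >= SPLIT_DATE]

# Statistics come from the fitting part only; held_out never influences them.
stats_fit = feature_stats(fit_part, ["feature_1", "feature_2"])
print(stats_fit)
```

The resulting table has the same `mean`, `median`, `mode` and `skew` columns that `get_mmm` reads, so the fill vector could be derived from it in the same way and then applied to both parts.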

## Impute
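Once the mean vector has been created, we use it to fill in all the missing values. The vector is padded with zeros so that it lines up with every column of the DataFrame, including the non-feature columns.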

In [None]:
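# Pad with zeros for the non-feature columns: 7 before the features and 1 after
f_pad = np.concatenate(([0.,0.,0.,0.,0.,0.,0.], f_mean, [0.]))
# Index by column name so that fillna picks one fill value per column
pad = pd.Series(f_pad, index = train_all.columns)
train_all.fillna(pad, inplace=True)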
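
The mechanism used here is that `DataFrame.fillna` accepts a Series whose index holds column names and fills each column with its own value. A small sketch on a made-up frame (the column values and fill values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with a similar layout: a non-feature column plus features.
toy = pd.DataFrame({
    "weight":    [1.0, 2.0, 3.0],
    "feature_0": [1.0, -1.0, 1.0],
    "feature_1": [0.5, np.nan, 1.5],
    "feature_2": [np.nan, 4.0, np.nan],
})

# One fill value per column, keyed by column name.
fill = pd.Series({"weight": 0.0, "feature_0": 0.0, "feature_1": 1.0, "feature_2": 4.0})

filled = toy.fillna(fill)
print(filled)
```

Columns without NaNs are left unchanged, which is why padding the non-feature columns with zeros is harmless as long as those columns contain no missing values.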

## Scale
When feeding a DNN it is beneficial to scale the features to mean = 0 and std. dev. = 1. For a random forest and similar tree-based models, standard scaling has no impact (but it does not hurt either).  
Feature_0 only takes the values 1 and -1, so we leave it alone.

In [None]:
features = [c for c in train_all.columns if "feature" in c]
features = features[1:] # leave feature_0 untouched
X_train = train_all[features]

Before scaling:

In [None]:
train_all.describe()

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_t = scaler.transform(X_train)
del X_train

In [None]:
train_all[features] = X_t

After scaling:

In [None]:
train_all.describe()
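
The effect of the scaling step can also be checked on a toy frame. This is a sketch on synthetic data with hypothetical column values, following the same pattern of leaving `feature_0` untouched:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_0": rng.choice([-1.0, 1.0], size=500),
    "feature_1": rng.normal(loc=5.0, scale=3.0, size=500),
    "feature_2": rng.normal(loc=-2.0, scale=0.1, size=500),
})

cols = ["feature_1", "feature_2"]  # feature_0 is left untouched
toy_scaler = StandardScaler().fit(toy[cols])
toy[cols] = toy_scaler.transform(toy[cols])

print(toy[cols].mean().round(6))         # close to 0 for both columns
print(toy[cols].std(ddof=0).round(6))    # population std, as used by StandardScaler
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row in `describe()` is very close to, but not exactly, 1.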

Feature_1 and upwards now all have std = 1, and the other columns are unchanged. Finally, we pickle the training data for use in training notebooks.

In [None]:
train_all.to_pickle('train_data.pkl')

## Save parameters
The last thing to do is to save the scaler and the mean vector; both are required during inference.

In [None]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('./scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
np.save('feat_mmm.npy', f_mean)
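
For reference, here is a sketch of how an inference notebook might load the two files and apply them in the same order as above: impute first, then scale. The toy arrays, the temporary directory and the variable names are assumptions made for this example. In this notebook the scaler covers feature_1 onward only, whereas the toy example scales every column for brevity.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for the fitted scaler and mean vector (toy data, 3 features).
X = np.array([[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]])
toy_scaler = StandardScaler().fit(X)
toy_mean = X.mean(axis=0)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    mean_path = os.path.join(tmp, "feat_mmm.npy")

    # Save both artefacts.
    with open(scaler_path, "wb") as f:
        pickle.dump(toy_scaler, f)
    np.save(mean_path, toy_mean)

    # Load them back, as an inference notebook would.
    with open(scaler_path, "rb") as f:
        scaler_loaded = pickle.load(f)
    mean_loaded = np.load(mean_path)

# One incoming row with a missing value: impute first, then scale.
row = np.array([[2.0, np.nan, 250.0]])
row = np.where(np.isnan(row), mean_loaded, row)
row_scaled = scaler_loaded.transform(row)
print(row_scaled)
```

Keeping the order identical to preprocessing matters: the scaler was fitted on imputed data, so scaling a row that still contains NaNs would propagate them to the model input.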