In [None]:
import re

import pandas as pd
import numpy as np
import plotnine as pn
import matplotlib.pyplot as plt 
import dask.dataframe as dd
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Read a subset of the training data to keep the EDA manageable.

In [None]:
train = pd.read_csv("../input/jane-street-market-prediction/train.csv", nrows=20000)

In [None]:
train.head()

In [None]:
target = (train.weight * train.resp > 0).astype(int)
train.loc[:, 'target'] = target

Right away, `feature_0` stands out from the others: its values are 1 and -1, which would be odd for stock market prices at least.

The utility score is built up as follows. For each `date` $i$, summing over its rows $j$, we have

$$p_i = \sum_j (weight_{ij} \cdot resp_{ij} \cdot action_{ij}),$$

which we input to

$$t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{|i|}},$$

where $|i|$ is the number of unique dates in the sample. The utility score is finally

$$u = \min(\max(t, 0), 6) \cdot \sum_i p_i.$$

For comparison, the Sharpe ratio is defined as 

$$\text{Sharpe Ratio} = \frac{R_p - R_f}{\sigma_p}$$

where $R_p$ is the return of the portfolio, $R_f$ is the risk-free return, and $\sigma_p$ is the volatility of the portfolio excess return.
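
To make the comparison concrete, here is a short derivation. It is only a sketch, and the symbols $n = |i|$, $\bar{p}$ (the mean of the daily values $p_i$) and $\overline{p^2}$ (the mean of their squares) are introduced here for illustration. Since $\sum_i p_i = n\bar{p}$ and $\sum_i p_i^2 = n\overline{p^2}$,

$$t = \frac{n\bar{p}}{\sqrt{n\,\overline{p^2}}} \cdot \sqrt{\frac{250}{n}} = \sqrt{250} \cdot \frac{\bar{p}}{\sqrt{\overline{p^2}}}.$$

So $t$ resembles a Sharpe ratio of the daily values $p_i$ with $R_f = 0$ and with the root mean square of $p_i$ in place of $\sigma_p$, scaled by $\sqrt{250}$ (presumably the number of trading days in a year).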

Suppose we predict actions 0 and 1 uniformly at random.

In [None]:
action = np.random.randint(2, size=len(train.index))
y_pred = pd.DataFrame(dict(action=action), index=train.date)

The utility score of a set of predictions, computed over all dates in the data, is

In [None]:
def calculate_utility(df: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    # y_pred is indexed by date, has one `action` column, and is row-aligned with df
    unique_dates = df.date.unique()
    ps = np.zeros(len(unique_dates))  # float array holding one p_i per date
    for k, d in enumerate(unique_dates):
        t0 = df.loc[df.date == d]
        y0 = y_pred.loc[y_pred.index == d]
        # p_i = sum_j weight_ij * resp_ij * action_ij
        ps[k] = np.sum(t0.weight.to_numpy() * t0.resp.to_numpy() * y0.action.to_numpy())
    t = ps.sum() / np.sqrt((ps ** 2).sum()) * np.sqrt(250 / len(unique_dates))
    u = min(max(t, 0), 6) * ps.sum()
    return u
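
The per-date loop in `calculate_utility` can also be expressed with a groupby, which may be faster when the score is evaluated many times. The block below is a minimal sketch under stated assumptions: `utility_groupby` is a hypothetical helper, it takes the actions as a plain array aligned row by row with the data rather than a date-indexed frame, and the toy numbers are made up for illustration.

```python
import numpy as np
import pandas as pd


def utility_groupby(df: pd.DataFrame, action: np.ndarray) -> float:
    """Utility score from row-aligned actions (hypothetical helper)."""
    # Per-row contribution weight * resp * action, summed within each date.
    contrib = df["weight"].to_numpy() * df["resp"].to_numpy() * action
    p = pd.Series(contrib).groupby(df["date"].to_numpy()).sum().to_numpy()
    t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / len(p))
    return min(max(t, 0), 6) * p.sum()


# Toy data (made up for illustration): two dates, two rows each.
toy = pd.DataFrame({
    "date": [0, 0, 1, 1],
    "weight": [1.0, 2.0, 1.0, 0.5],
    "resp": [0.01, -0.02, 0.03, 0.02],
})
toy_action = np.array([1, 0, 1, 1])
toy_u = utility_groupby(toy, toy_action)
print(toy_u)
```

In this toy case the daily values are 0.01 and 0.04, $t$ exceeds the cap of 6, and the score is 6 times the total of 0.05.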

Simulate for $N$ rounds

In [None]:
from tqdm import tqdm

seed = 42
np.random.seed(seed)
N = 1000
us = []
for i in tqdm(range(N)):
    action = np.random.randint(2, size=len(train.index))
    y_pred = pd.DataFrame(dict(action=action), index=train.date)
    us.append(calculate_utility(train, y_pred))

In [None]:
plt.figure(figsize=(10, 10))
pd.Series(us).hist(bins=20)

What if we're able to predict all trades with positive return?

In [None]:
y_pred = train[["date", "target"]]
y_pred = y_pred.set_index("date")
y_pred.columns = ["action"]

In [None]:
calculate_utility(train, y_pred)

## Explore the features

In [None]:
b = train.columns.str.contains("feature")
features = train[train.columns[b]]

In [None]:
features.describe().iloc[1:]

In [None]:
df_plt = features.iloc[:500, 1:3].reset_index()
m = pd.melt(df_plt, value_vars=["feature_1", "feature_2"], id_vars="index")
(pn.ggplot(m, pn.aes(x="index", y="value", color="variable")) 
 + pn.geom_line()
 + pn.xlab("Time")
 + pn.ylab("Return"))

Returns seem to be capped for some assets.

### Correlation analysis

In [None]:
b = train.columns.str.contains("feature")
features = train[train.columns[b]]

In [None]:
cmat = features.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(cmat)

Two areas obviously need further exploration: the features in the sixties, and the large negative correlations around feature 30.

In [None]:
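# Skip the first len(features.columns) entries, assumed to be the diagonal (|corr| = 1); each remaining pair appears twice
c = cmat.abs().unstack().sort_values(ascending=False).iloc[len(features.columns):len(features.columns) + 10]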
c

Features 60-69 seem to come from the same asset class, or from classes that are strongly correlated in some sense.
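
Because the correlation matrix is symmetric, a sorted list built from the full matrix contains every pair twice. One possible way to list each pair once is to keep only the entries above the diagonal. The block below is a sketch on a made-up toy frame; the column names `a`, `b`, `c` and the variables `toy_features`, `toy_cmat` and `top_pairs` are placeholders, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy feature frame (made up): b is a noisy copy of a, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_features = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

toy_cmat = toy_features.corr().abs()
# Boolean mask that is True strictly above the diagonal, so each pair
# appears once and the self-correlations are dropped.
upper = np.triu(np.ones(toy_cmat.shape, dtype=bool), k=1)
top_pairs = (
    toy_cmat.where(upper)   # everything outside the mask becomes NaN
    .stack()
    .dropna()               # explicit, in case stack() keeps the NaNs
    .sort_values(ascending=False)
)
print(top_pairs.head())
```

The same masking should work on `cmat`, assuming it is a square frame as returned by `features.corr()`.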

In [None]:
sixties = ["feature_" + str(i) for i in range(60, 70)]
f = features.columns.isin(sixties)
sdf = features[features.columns[f]]

In [None]:
# Avoid the name `plt` for this helper: it would shadow matplotlib.pyplot
def plot_lines(df_plt: pd.DataFrame, value_vars: list):
    m = pd.melt(df_plt.reset_index(), value_vars=value_vars, id_vars="index")
    return (
        pn.ggplot(m, pn.aes(x="index", y="value", color="variable"))
        + pn.geom_line()
        + pn.xlab("Time")
    )
df_plt = sdf.iloc[:1000, :]
plot_lines(df_plt, sixties)

In [None]:
# Select features 60-65 by matching the column suffix explicitly
ldf = sdf.filter(regex="_6[0-5]$")
lower_sixties = list(ldf.columns)
plot_lines(ldf.iloc[1000:2000, :], lower_sixties)

What are those monotonically increasing segments in feature 64? A trend?

In [None]:
f64 = features[["feature_64"]]
plot_lines(f64, ["feature_64"])

Feature summary:

* Features 60-65

    * Feature 64 seems to be piecewise, made up of monotonically increasing segments (almost, see around time 15000)
    * Features 61-63 and 65 seem to be capped from above, and only have "negative returns". Is that shorting behaviour, and should we analyse the absolute values?
    

## Missingness Handling

In [None]:
(features.isnull().sum() / len(features.index)).sort_values(ascending=False)

How are missing values distributed over time?

In [None]:
features.isnull().sum(axis=1).plot()

Missing values seem to be clustered, with regular spikes and hill-shaped stretches.
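
One way to probe the regular spikes is to check whether missing values concentrate at particular positions within a date, for example at the start of each day. That is an assumption to test, not something the plot establishes. The block below is a sketch on a made-up toy frame; `pos_in_day`, `n_missing` and `missing_by_pos` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): NaNs only in the first rows of each date.
toy = pd.DataFrame({
    "date": [0, 0, 0, 1, 1, 1],
    "feature_1": [np.nan, 1.0, 2.0, np.nan, 0.5, 0.1],
    "feature_2": [np.nan, np.nan, 3.0, np.nan, 0.2, 0.3],
})
feature_cols = [c for c in toy.columns if c.startswith("feature")]

# Position of each row within its date (0 = first row of the day).
toy["pos_in_day"] = toy.groupby("date").cumcount()
# Number of missing feature values per row, averaged by position in the day.
toy["n_missing"] = toy[feature_cols].isnull().sum(axis=1)
missing_by_pos = toy.groupby("pos_in_day")["n_missing"].mean()
print(missing_by_pos)
```

If the averages on the real data drop quickly with `pos_in_day`, the spikes would be consistent with features that need some history within the day before they are defined.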

## Full dataset

In [None]:
train = dd.read_csv("../input/jane-street-market-prediction/train.csv")

In [None]:
ncols = train.shape[1]
nrows = train.shape[0].compute()

In [None]:
print("Columns: {ncols}\nRows: {nrows}".format(ncols=ncols, nrows=nrows))

In [None]:
train.head()