# Jane Street Exploratory Data Analysis

Welcome to my notebook for the Jane Street Market Prediction competition, where I explore the data and think about how to approach the forecasting problem. The last time-series competition I took part in (M5) had the largest shakeup of any competition, with people moving up and down by ~5500 places. Since we do not want something similar here, our models need to be incredibly robust.

### TOC:

<ol>
    <li><a href="#introduction">Introduction</a></li>
    <li><a href="#exp">Data exploration</a></li>
   </ol>

<h1 id="introduction">Introduction</h1>

First, we import the libraries used in this notebook (numpy, pandas, plotly for interactive plots, etc.) and read in the data.

In [None]:
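# Core data handling
import numpy as np
import pandas as pd

# Plotting: plotly for interactive figures, matplotlib/seaborn for static ones
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
import matplotlib.pyplot as plt
import seaborn as sns

# Missing-value visualisation
import missingno as msno

# Training data and the feature-to-tag metadata
train = pd.read_csv('../input/jane-street-market-prediction/train.csv')
feats = pd.read_csv('../input/jane-street-market-prediction/features.csv')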

Now we have a look at the first few rows of the training data:

In [None]:
train.head()

<h1 id="exp">Exploration</h1>

It seems the response columns (those whose names contain `resp`) are our labels, and the features will be used to predict them. Let's check how the response distributions change over time (thank you to @xhlulu for his wonderful kernel detailing this: https://www.kaggle.com/xhlulu/jane-street-animated-and-interactive-plots):

In [None]:
t_start = 0
t_end = 30

fig = px.histogram(
    train[(train.date >= t_start) & (train.date <= t_end)], 
    x=['resp_1', 'resp_2', 'resp_3', 'resp_4'], 
    facet_col='variable', animation_frame='date', template="plotly_white")
fig.show()

In [None]:
x, cl = 1, 2
t_start, t_end = 0, 5
resp = 'resp_1'

fig = px.scatter(
    train[(train.date >= t_start) & (train.date <= t_end)], 
    x=f'feature_{x}', 
    y=resp, 
    color=f'feature_{cl}',
    animation_frame='date',
    template="plotly_white"
)
fig.show()
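
The animated plots can be complemented with a plain per-day summary table. What follows is only a minimal sketch, assuming the `date` and `resp_*` column names used above; `daily_resp_summary` is a hypothetical helper, and the toy frame stands in for `train` with made-up values.

```python
import numpy as np
import pandas as pd

def daily_resp_summary(df, resp_cols):
    """Mean and standard deviation of each response column, per date."""
    return df.groupby("date")[resp_cols].agg(["mean", "std"])

# Tiny synthetic frame standing in for `train` (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": np.repeat([0, 1, 2], 4),
    "resp_1": rng.normal(size=12),
    "resp_2": rng.normal(size=12),
})

# One row per date, one (column, statistic) pair per output column.
summary = daily_resp_summary(toy, ["resp_1", "resp_2"])
print(summary)
```

On the real data this would presumably be called as `daily_resp_summary(train, ['resp_1', 'resp_2', 'resp_3', 'resp_4'])`.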

This shows how the targets evolve over time, and it's interesting to observe how they fluctuate. Next, let's look at the missing values to account for the NaNs we saw earlier:

In [None]:
# missingno was already imported as msno at the top; plot the first 50,000 rows
msno.matrix(df=train.head(50_000), figsize=(20, 14), color=(0.42, 0.1, 0.05));

The missing values apparently occur in a very regular pattern across the data, which is definitely worth a closer look. A reminder to *clean your data* here before moving on.
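To put numbers on the pattern seen in the matrix plot, one option is to compute the share of missing entries per column. This is a rough sketch rather than a cleaning recipe: `missing_fraction` is a hypothetical helper, and the toy frame (with made-up column names and values) stands in for `train`.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Fraction of NaN entries per column, largest first.

    Columns without any missing values are dropped from the result.
    """
    frac = df.isna().mean()
    return frac[frac > 0].sort_values(ascending=False)

# Toy frame with made-up values; on the real data, pass `train` instead.
toy = pd.DataFrame({
    "feature_a": [np.nan, np.nan, 1.0, 2.0],
    "feature_b": [0.5, np.nan, 0.1, 0.2],
    "feature_c": [1.0, 2.0, 3.0, 4.0],
})

missing = missing_fraction(toy)
print(missing)
```

Columns with a high missing share are natural candidates for a closer look before deciding how to fill or drop them.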

Let's have a quick look at the feature metadata provided in `features.csv`:

In [None]:
feats.head()

Let's quickly check the counts of each tag, since it looks like the tags correspond to the features discussed earlier.

In [None]:
# Tag columns are the ones with 'tag' in their name
bin_col = [col for col in feats.columns if 'tag' in col]

# Count True and False entries per tag column
t_l = [(feats[col] == True).sum() for col in bin_col]
f_l = [(feats[col] == False).sum() for col in bin_col]
trace1 = go.Bar(
    x=bin_col,
    y=t_l ,
    name='True count'
)
trace2 = go.Bar(
    x=bin_col,
    y=f_l,
    name='False count'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title='Count of True and False in tags',
    template="plotly_white"
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='stacked-bar')

In every tag, we seem to have very few True values and many False values.
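The same imbalance can be read off as numbers instead of bars. The sketch below assumes the tag columns are boolean and that their names start with `tag`; `tag_summary` is a hypothetical helper, and the toy frame with made-up values stands in for `feats`.

```python
import pandas as pd

def tag_summary(feats_df):
    """Return (share of True per tag, number of tags set per feature row)."""
    tag_cols = [c for c in feats_df.columns if c.startswith("tag")]
    tags = feats_df[tag_cols].astype(bool)
    true_share = tags.mean()              # per tag: fraction of rows that are True
    tags_per_feature = tags.sum(axis=1)   # per row: how many tags are True
    return true_share, tags_per_feature

# Toy stand-in for `feats` (values are made up).
toy = pd.DataFrame({
    "feature": ["feature_0", "feature_1", "feature_2", "feature_3"],
    "tag_0": [False, True, False, False],
    "tag_1": [False, False, True, True],
})

true_share, tags_per_feature = tag_summary(toy)
print(true_share)
print(tags_per_feature)
```

A low `true_share` for every tag would match the stacked bar chart, and `tags_per_feature` shows whether any feature carries no tag at all.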

## WIP