# Introduction

In this competition, the goal is to build a model that captures the good investment opportunities and leaves out the bad ones.

It is a time series competition, so the cross-validation framework won't be straightforward. In this notebook, I want to get a good sense of the data and the kind of information we are working with, and hopefully derive some useful insights for modelling later on. Then, we'll establish a baseline using LightGBM, along with a crude cross-validation framework.

Let's start by importing some libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import plotly as py
import plotly.express as px
import seaborn as sns
import gc

pd.options.display.max_columns = 999

For reading the train data, I am using the feather format. All formats are available in this notebook: https://www.kaggle.com/pedrocouto39/fast-reading-w-pickle-feather-parquet-jay/output. This saves a lot of time and gets the code running faster.

Using this, we can read the data in about 5 seconds and use only about 600 MB!

In [None]:
dtypes = {
    'date': 'int16',
    'weight': 'float16',
    'resp_1': 'float16',
    'resp_2': 'float16',
    'resp_3': 'float16',
    'resp_4': 'float16',
    'resp': 'float16',
    'feature_0': 'int8',
    'ts_id': 'int32',
}
# feature_1 .. feature_129 are all stored as float16
dtypes.update({f'feature_{i}': 'float16' for i in range(1, 130)})

In [None]:
train = pd.read_feather('../input/fast-reading-w-pickle-feather-parquet-jay/jane_street_train.feather')
features = pd.read_csv('../input/jane-street-market-prediction/features.csv')
train = train.astype(dtypes)
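
One thing to keep in mind with `float16` is that the memory saving comes at the cost of precision. The sketch below is only an illustration on a tiny made-up frame (the `demo` values are assumptions, not rows read from the dataset); it shows the 4x size reduction and how a large value gets rounded.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for the example.
demo = pd.DataFrame({"weight": [0.0, 1.0, 167.293]})
small = demo.astype({"weight": "float16"})

# Bytes used by the column itself (index excluded).
bytes_before = demo["weight"].memory_usage(index=False)
bytes_after = small["weight"].memory_usage(index=False)
print(bytes_before, bytes_after)

# float16 keeps roughly 3 significant decimal digits, so large values are rounded.
print(float(small["weight"].iloc[2]))
```

If exact values matter for a column (for example when computing the final score), casting just that column to `float32` is a reasonable compromise.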

In [None]:
# Old reduce_mem_usage. Directly use dtypes dict from now on to improve speed.
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
# train = reduce_mem_usage(train)

Display Train data

In [None]:
train

Really quick small points:
- There seem to be null values in some of the columns. When exploring the data in depth, we should make sure to check the percentage of null values and drop any high null value features
- Also, Feature_0 seems to be a binary variable. We'll look into that later.
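
The null check mentioned above can be sketched in a couple of lines. This is a minimal example on a toy frame standing in for `train`; the helper name `null_share` and the 10% cutoff are my own assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def null_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy stand-in for `train` (hypothetical values, for illustration only).
demo = pd.DataFrame({
    "feature_1": [0.1, np.nan, 0.3, np.nan],
    "feature_2": [1.0, 2.0, 3.0, 4.0],
})

shares = null_share(demo)
high_null = shares[shares > 0.10].index.tolist()  # 0.10 is an assumed cutoff
print(shares)
print(high_null)
```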

Let's just find some basic statistics for the train data

In [None]:
train.describe()

Quick points:
- From the look of it, null values don't seem to be a concern. Most columns seem to have almost all values.
- It looks like most features have a median at or very close to 0. Similar story with the mean. It seems that the features may have been normalized as well. We'll look into that later.

# Features Metadata

Now, let's look at the features metadata file.

In [None]:
features

So each feature has 29 boolean flags (tags) that describe the nature of the feature. Before we do any detailed analysis, let's see the distribution of True and False for each tag. From what I see, there is an overwhelming majority of False.

In [None]:
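# Collect every tag column so that all 29 tags are plotted (a 7x4 grid only fits 28).
tag_cols = [c for c in features.columns if c.startswith('tag_')]
fig, ax = plt.subplots(8, 4, figsize=(15, 18))
ax = ax.ravel()
labels = False, True
for i, tag in enumerate(tag_cols):
    counts = [(features[tag] == False).sum(), (features[tag] == True).sum()]
    ax[i].pie(counts, labels=labels, autopct='%1.1f%%')
    ax[i].axis('equal')
    ax[i].set_title('Distribution of ' + tag)
for unused in ax[len(tag_cols):]:
    unused.axis('off')  # hide the empty panels at the end of the grid

plt.show()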

There is a wide range in True/False distribution, but for all tags, an overwhelming majority of them have False. The distribution ranges from only 1% True up to 36.9% True.

Now, I think that many features will have the exact same, or very close metadata (since there are only 29 boolean flags). Let's see if any features are similar, but to do this, I will first use PCA to compress the tags into 2 dimensions.


In [None]:
from sklearn.decomposition import PCA
cols = list(features.columns)
cols.remove('feature')
X = features[cols]
X = X.astype(int)
pca = PCA(n_components=2)
y = pca.fit_transform(X)
plt.figure(figsize=(15,15))
plt.scatter(y[:, 0], y[:, 1])
for i in range(130):
    plt.annotate("Feature " + str(i), y[i])
plt.title("Dimensionality reduction of feature metadata")
plt.show()

I was right! Some features have the exact same metadata, so they fall right on top of each other after PCA. Let's isolate the sets of overlapping features. Maybe features that have the same metadata have the same distribution in train as well?

In [None]:
df = pd.DataFrame(y)
df = df.round(decimals=2)
df = df[df.duplicated(keep=False)]

df = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (df)

The above is a list of all features very close to each other (within 0.01 of PCA features). Now, for each set of similar features, let's plot their distribution of train data. If the distributions are similar, then it means that this metadata is directly linked to values in train data.
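
Rounding PCA coordinates is an indirect way to find duplicates. As a cross-check, features with exactly identical tags can also be grouped on the tag columns directly. Here is a minimal sketch on a toy metadata frame; the helper name `identical_tag_groups` and the toy rows are hypothetical, and the column layout is only assumed to match `features`.

```python
import pandas as pd

def identical_tag_groups(meta, id_col="feature"):
    """Return lists of ids whose tag columns are exactly equal."""
    tag_cols = [c for c in meta.columns if c != id_col]
    groups = meta.groupby(tag_cols)[id_col].apply(list)
    # Keep only groups that contain more than one feature.
    return [g for g in groups if len(g) > 1]

# Toy metadata (made-up tags, for illustration only).
demo = pd.DataFrame({
    "feature": ["feature_0", "feature_1", "feature_2", "feature_3"],
    "tag_0": [True, False, False, True],
    "tag_1": [False, True, True, True],
})

tag_groups = identical_tag_groups(demo)
print(tag_groups)
```

Unlike the PCA route, this only reports exact matches, so near-duplicates that differ in a single tag will not show up.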

In [None]:
fig, ax = plt.subplots(5, 4, figsize=(20,20))
count = 0
for i in df:
    for col in i:
        sns.distplot(train['feature_' + str(col)], label='feature_' + str(col), ax=ax[int(count / 4), count%4])
        leg = ax[int(count / 4), count%4].legend()
    count += 1

How disappointing...
Features with the same metadata don't have the same distribution in train. However, there is some similarity. For example, in the plot for feature_60, 61, 62, 63, 65, etc. (3rd row, 3rd column), all features have a similar distribution, which is different from the rest of the plots.

Also, we shouldn't think that the metadata can explain the median and mode for the features. If you look closely, all features are centered around 0, so metadata has nothing to do with this. The shift is only due to any outliers present in features, causing the plot horizon to change.

I think this metadata can only explain the variance of the features, but not really the peak or anything else. Let's just verify by calculating the correlation between two features in the same bucket.

Side Notes:
- The features have a wide variety of distributions. Most of them look roughly normal, which supports the hypothesis that the features were normalized.
- There is one very interesting set of exceptions (feature 60..). It seems to be multimodal, but the left mode has a much higher frequency count than the other one.

In [None]:
del cols, df
gc.collect()

In [None]:
np.corrcoef(train['feature_83'], train['feature_95'])

Yes, the correlation is very poor, so looks like the metadata is not helpful. Even though features may have similar standard deviation, for every individual datapoint in train, features in same bucket are not correlated. Let's move on to see the train data.

# Train data
Printing data once again:

In [None]:
train

There are 2390491 rows and 138 columns in train data.

Let's go feature by feature, starting with weight

In [None]:
plt.hist(train['weight'], bins=50)
plt.title("Distribution of Weight")
plt.show()

In [None]:
max(train['weight']), min(train['weight'])

Looks like there are a large number of 0s. One critical thing is that rows with 0 weight are only present in train, and in test data, there won't be any 0s. I think that the ones with 0 weight should only be used for training, and any non-0 weights could be used for cross validation. Also, the max weight for a single datapoint is 167.293. When modelling, it's important to take special care for these high weight outliers. Making the right decision here would give a huge score boost, compared to getting 10-15 low weight ones right.
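
To make the two points above concrete, here is a small sketch on made-up rows (the `demo` frame is an assumption standing in for `train`). It separates the zero-weight rows and measures how much of the total weighted return comes from the single largest row.

```python
import pandas as pd

# Made-up rows standing in for `train`, for illustration only.
demo = pd.DataFrame({
    "weight": [0.0, 1.5, 0.0, 3.0, 0.5],
    "resp": [0.01, -0.02, 0.03, 0.005, -0.01],
})

zero_weight = demo["weight"] == 0
zero_share = zero_weight.mean()      # fraction of rows with weight 0
scored_rows = demo[~zero_weight]     # rows that can contribute to the score

# Absolute weighted return per row, and the share held by the largest one.
contribution = (scored_rows["weight"] * scored_rows["resp"]).abs()
top_row_share = contribution.max() / contribution.sum()
print(zero_share, len(scored_rows), top_row_share)
```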

Moving on to the resp columns. These are returns on investment, so let's plot them as time series. I will use ts_id instead of date on the x-axis, because date repeats across rows, while ts_id is a unique chronological id, so it won't mess with our plot.
The red line is the average return for each resp.

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(15,15))
j = 0
for i in range(1, 6):
    if i == 5:
        ax[int(j / 2), j%2].plot(train['ts_id'], train['resp'], c='y')
        ax[int(j / 2), j%2].plot(train['ts_id'], [train['resp'].mean()]*len(train), c='r')
        ax[int(j / 2), j%2].set_title('Distribution of Resp')
        break
            
    ax[int(j / 2), j%2].plot(train['ts_id'], train['resp_' + str(i)])
    ax[int(j / 2), j%2].plot(train['ts_id'], [train['resp_' + str(i)].mean()]*len(train), c='r')
    ax[int(j / 2), j%2].set_title('Distribution of Resp_' + str(i))
    j+= 1
    
plt.show()

So it seems that each resp column has a slightly different distribution, which is expected since they are from different time horizons.
It looks like Resp_1-3 have a similar distribution. 1 & 2 are more alike than 3, but even 3 follows more or less the same pattern. Resp_4 and the final Resp also have very similar distributions. Since they are also more volatile, maybe these are from a longer time horizon? Returns measured over a longer window usually vary more. The resp columns won't be present in test data, so we have to be careful with how we use them, especially in a time series competition.

Now let's move onto the features. We already saw a few things before, but let's go through it in more detail.

In [None]:
cols = list(train.columns)
for removeCol in ['date', 'weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp', 'ts_id']:
    cols.remove(removeCol)

X = train[cols]
X = X.fillna(0)



Before PCA, I just want to explore the column feature_0. It seems to be binary and different from the rest.

In [None]:
train['feature_0'].unique()

So it is binary, taking only the values 1 and -1. Let's look at the distribution.

In [None]:
labels = 1, -1
plt.pie([len(train[train['feature_0'] == 1]), len(train[train['feature_0'] == -1])], labels = labels, autopct='%1.1f%%')
plt.axis('equal')
plt.title("Feature 0")
plt.show()

Wow, almost 50%. Let's also see if this correlates to Resp

In [None]:
np.corrcoef(train['feature_0'], train['resp']),np.corrcoef(train['feature_0'], train['resp']>0) 

The first correlation is for whether feature 0 is directly correlated to Resp, and the second set is if feature 0 is correlated to whether resp is positive or negative. It seems like there is nothing directly significant.

First, PCA to reduce the dimension to something more understandable. We have already seen a lot of distributions in the feature metadata section, so here it is more about the usefulness of the features.

In [None]:
pca = PCA(n_components=2)
y = pca.fit_transform(X)
y

Plotting only first 200,000 and last 200,000, but hopefully that is representative of the data. They have similar shapes so I think it shouldn't be hiding too much information.

In [None]:
# Pass only the arrays: a data frame whose length differs from the sliced arrays can raise an error.
fig = px.scatter_3d(x=y[0:200000, 0], y=y[0:200000, 1], z=train['resp'].values[0:200000])
fig.show()

In [None]:
# Pass only the arrays: a data frame whose length differs from the sliced arrays can raise an error.
fig = px.scatter_3d(x=y[-200000:, 0], y=y[-200000:, 1], z=train['resp'].values[-200000:])
fig.show()

With the PCA, I can't really see any direct clear trend. The market really does seem to be volatile. Maybe there are some complex interactions between features that help a model make a better decision. For this reason, I am going to establish the baseline with Gradient Boosting Decision Trees, as they can take into account feature interaction quite well.

Let's see if any features are correlated with each other.

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(train.corr())

It seems that mostly features are not correlated to each other, but there are gaps where suddenly there is a very high positive or negative correlation. Also, it seems that these gaps are mostly near the central diagonal line, which means that features with similar numbers generally have a high correlation, so would have a similar distribution.
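
The heatmap is hard to read with this many columns, so it can help to list the strongly correlated pairs explicitly. Below is a minimal sketch on synthetic data; the helper name `top_correlated_pairs`, the 0.9 threshold and the toy columns are my own assumptions.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr()
    # Upper triangle without the diagonal, so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() >= threshold]
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)

# Synthetic columns: feature_2 is almost a multiple of feature_1, feature_3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "feature_1": base,
    "feature_2": 2 * base + 0.01 * rng.normal(size=200),
    "feature_3": rng.normal(size=200),
})

pairs = top_correlated_pairs(demo)
print(pairs)
```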

Interesting thing to note again is that resp is not strongly correlated to any other feature at all. This makes this task extremely difficult. Feature engineering will also be quite interesting, and would be a more numerical / brute forced approach than any domain knowledge.

A small animation that I liked to explore distribution of all features:

In [None]:
# Taken from this notebook: https://www.kaggle.com/blurredmachine/jane-street-market-eda-viz-prediction

date = 0
n_features = 130

cols = [f'feature_{i}' for i in range(1, n_features)]
hist = px.histogram(
    train[train["date"] == date], 
    x=cols, 
    animation_frame='variable', 
    range_y=[0, 600], 
    range_x=[-7, 7]
)

hist.show()

# Modelling

Here's how we are going to set up the problem statement:
1. Right now, I will only be using Resp column, not resp_1,2,3,4.
2. I create a new column called action. If Resp is more than a threshold, action = 1, else action = 0

I was going to set this threshold above 0 (update: I set it to 0 now, as it seems more appropriate). The logic for a positive threshold is that for very small returns, it is not worth the risk to take the opportunity, as there are error bars in our predictions. Better to miss out on small returns than to make a negative return.

I am going to create this column beforehand, rather than making LightGBM predict Resp and then applying the threshold, as it makes the task simpler.

Cross-validation strategy (very crude):
1. Train on the first 90% of datapoints.
2. For cross-validation, make predictions on the last 10% and calculate the p variable (in the evaluation section of the competition description).
3. Train on last 10%, and use model for test predictions.
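
A row-count split can cut a trading day in half, so part of a day lands in training and the rest in validation. One possible variant, not the split used in this notebook, is to cut on a date boundary instead. A minimal sketch, assuming dates are integer day ids like the `date` column; the helper name `date_holdout_mask` is hypothetical.

```python
import numpy as np

def date_holdout_mask(dates, valid_frac=0.1):
    """Boolean mask marking rows that fall in the last `valid_frac` of unique dates."""
    unique_dates = np.unique(dates)  # sorted ascending
    n_valid = max(1, int(round(len(unique_dates) * valid_frac)))
    cutoff = unique_dates[-n_valid]
    return dates >= cutoff

# Toy day ids, several rows per day (illustrative only).
dates = np.array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9])
valid_mask = date_holdout_mask(dates, valid_frac=0.2)

train_dates = dates[~valid_mask]
valid_dates = dates[valid_mask]
print(train_dates.max(), valid_dates.min())
```

With this mask, every validation day is strictly later than every training day, which keeps the holdout closer to how the test set is served.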


In [None]:
del X, features, y
gc.collect()

In [None]:

y = train['resp']
train.drop(['date', 'weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp', 'ts_id'], axis=1, inplace=True)

These parameters are just a hunch. I haven't tuned them, but plan to do so in future iterations.

In [None]:
import lightgbm as lgb

params = {'objective': 'binary',
          'max_depth': -1,
          'learning_rate': 0.3,
          "boosting_type": "gbdt",
          "random_state" : 42,
          'device': 'cpu'
}


This is the threshold for the action I mentioned earlier. It is tunable, so it is essentially a hyperparameter. The training data is the first 90% of rows and the validation data is the last 10%. The hard-coded index 2151442 is simply 90% of the row count.

In [None]:
threshold = 0.00
train = train.values
y = y.values
y = y>threshold
y = y.astype(int)
dTrain = lgb.Dataset(train[0:2151442], y[0:2151442])

In [None]:
validationSet = lgb.Dataset(train[2151442:2390492], y[2151442:2390492])

Finally Training the Model. I will do it for 2500 estimators, with an early stopping if there is no improvement with 100 additional estimators. The validation set is the last 10% of rows, and score will be printed after every 50 estimators. Letting it run:

In [None]:
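# Newer LightGBM releases (4.0+) removed the early_stopping_rounds and verbose_eval arguments; callbacks do the same job.
model = lgb.train(params, dTrain, 2500, valid_sets=[validationSet],
                  callbacks=[lgb.early_stopping(100), lgb.log_evaluation(50)])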

Okay, now before making predictions and finding cross-validation score, let's first just plot the feature importances. I will only plot the top 100 features.

In [None]:
lgb.plot_importance(model, max_num_features = 100, figsize=(25, 25))

Seems like feature 0 isn't really important. It was binary, with a 50-50 distribution, so it is not surprising. In the next iteration, it might make sense to train only on the top 50-100 features and leave the rest.

Now for the predictions. Our model outputs the probability that we should take the action. Since we are already a bit conservative in the resp conversion, I will take the opportunity whenever we are more than 50% confident. This is another hyperparameter, and it is super easy to tune: just change this value and see the out-of-fold output below.

In [None]:
confThreshold = 0.5
def transformPred(pred):
    pred = pred>confThreshold
    pred = pred.astype(int)
    return pred

In [None]:
pred = model.predict(train[2151442:2390492])
newpred = transformPred(pred)

In [None]:
del train, y, dTrain, validationSet, pred
gc.collect()

Great, now we have our predictions. Just need to read in the train data again to find all of the returns.

In [None]:
train = pd.read_feather('../input/fast-reading-w-pickle-feather-parquet-jay/jane_street_train.feather')
train = train.astype(dtypes)
train

This just defines the metric used for evaluation in the competition. We'll use it for cross-validation.

In [None]:
def calcUtility(p, i):
    t = (p.sum() / np.sqrt((p*p).sum()) ) * np.sqrt((250/i))
    u = min(max(t, 0), 6) * p.sum()
    return u

We will use resp to calculate the return. Also, i is the number of unique days, which is what the last line computes.
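
One detail worth spelling out: on the competition's evaluation page, p is defined per date, as the sum of weight * resp * action over that day's rows, and the utility is then computed from those daily sums. Here is a self-contained sketch of a date-aware version on toy numbers; the function name `utility_by_date` and the toy values are my own assumptions.

```python
import numpy as np
import pandas as pd

def utility_by_date(date, weight, resp, action):
    """Utility score with p aggregated per date before applying the formula."""
    row_return = (np.asarray(weight, dtype=float)
                  * np.asarray(resp, dtype=float)
                  * np.asarray(action, dtype=float))
    daily = pd.Series(row_return).groupby(np.asarray(date)).sum()  # one p per day
    denom = np.sqrt((daily ** 2).sum())
    if denom == 0:
        return 0.0  # no trades taken, or all returns exactly zero
    t = daily.sum() / denom * np.sqrt(250 / len(daily))
    return float(min(max(t, 0), 6) * daily.sum())

# Toy data: two days with two rows each (illustrative values only).
u = utility_by_date(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.01, -0.02, 0.03, 0.01],
    action=[1, 0, 1, 1],
)
u_no_trades = utility_by_date([0, 1], [1.0, 1.0], [0.01, 0.02], [0, 0])
print(u, u_no_trades)
```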

In [None]:
resp = train["resp"]
w = train['weight']
resp = resp[2151442:2390492]
i = len(train[2151442:2390492].date.unique())

Finally. Now let's first calculate the unweighted score.

In [None]:
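# The metric defines p per day, so sum the row-level returns within each date first.
p = (newpred * resp).groupby(train['date'][2151442:2390492]).sum()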
print("Unweighted utility:  " + str(calcUtility(p, i)))

Now the real weighted utility.

In [None]:
# Weighted Score:
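# The metric defines p per day, so sum the weighted row-level returns within each date first.
p = (w[2151442:2390492] * newpred * resp).groupby(train['date'][2151442:2390492]).sum()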
print("Weighted utility:  " + str(calcUtility(p, i)))

Great. Not too bad for such a crude model. Now let's make the predictions and submit.

In [None]:
del newpred, resp, w, p, i
del train
gc.collect()

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:

for (test_df, sample_prediction_df) in iter_test:
#     print(test_df.values.shape)
    row = test_df.drop(['date', 'weight'], axis=1)
    
    pred = model.predict(row.values.reshape(1, -1))
    sample_prediction_df.action = transformPred(pred)[0] #make your 0/1 prediction here
    env.predict(sample_prediction_df)

Thank you for going through my notebook. I hope it helped you get a sense for the data, and a basic way to set up the problem statement. It would be great if you could give me feedback, and I'll try to incorporate it into future versions.

Please do Upvote if you Liked it!