# Regression instead of Classification  
I approach this problem primarily as a regression rather than a classification. What if we change the primary target and predict a return instead? For an arbitrageur in high-frequency trading, what matters most is not the decision to trade or not in itself, but the return impact of each trade.

With a predicted return per trade, deciding on the action becomes much easier. In this notebook my aim is to give a **brief** description of another way to look at the same problem. If you enjoy it, upvote and comment below!
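
To make the idea concrete, here is a minimal sketch on made-up numbers. The weights, responses and predicted returns below are illustrative assumptions, not values from the competition data. It computes the per-trade return as `weight * resp` and then derives the trade/no-trade decision from the sign of a predicted return.

```python
import numpy as np

# Illustrative values only (not taken from the competition data)
weight = np.array([0.5, 2.0, 1.0, 3.0])
resp = np.array([0.01, -0.02, 0.03, 0.005])

# Return contributed by each trade if it is taken
trade_return = weight * resp

# Hypothetical model output: predicted return per trade
predicted_return = np.array([0.004, -0.03, 0.02, -0.001])

# Decision rule: trade only when the predicted return is positive
action = (predicted_return > 0).astype(int)

# Return realized by the trades we chose to take
realized = float((trade_return * action).sum())
print(action, realized)
```

The decision rule is a by-product of the regression output, so the same predictions can later be used with a different threshold without retraining.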

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #plotting
import seaborn as sns #plotting 
import warnings

from sklearn.model_selection import train_test_split #split training set
from sklearn import metrics as ms  # MSE and accuracy
from sklearn.preprocessing import StandardScaler #standardizing features
from lightgbm import LGBMRegressor #Light Gradient Boost Regressor

warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.max_columns', 200)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# A very brief EDA

In [None]:
df = pd.read_csv('../input/jane-street-market-prediction/train.csv')

In [None]:
df.shape

Let's see the number of null values for each feature.

In [None]:
nulls = pd.DataFrame(df.iloc[:, 8:].isnull().sum(), columns = ['missing_values'])
nulls.sort_values(by='missing_values', ascending=False).T

With a total of 2.4 million samples, I don't think that imputing the features with more than 10% missing values (with the median or something similar) would improve our analysis much. So I decided to drop some of these features.

In [None]:
df = df.drop(['feature_28', 'feature_17', 'feature_27', 'feature_18', 'feature_7',
              'feature_8', 'feature_108', 'feature_114', 'feature_90', 'feature_96',
              'feature_102', 'feature_78', 'feature_72', 'feature_84'], axis = 1)
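
The list above is written out by hand. As a rough sketch, assuming a 10% cut-off on the share of missing values is the selection rule, the same kind of list could be derived programmatically. The toy frame and its column names below are hypothetical and only stand in for the real training data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are illustrative)
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, np.nan, 4.0, 5.0],   # 40% missing
    "feature_b": [1.0, 2.0, 3.0, 4.0, 5.0],         # 0% missing
    "feature_c": [np.nan, 2.0, 3.0, 4.0, 5.0],      # 20% missing
    "feature_d": [1.0, 2.0, 3.0, 4.0, 5.0],         # 0% missing
})

threshold = 0.10  # assumed cut-off: drop features with more than 10% missing

# Share of missing values per column, then the columns above the cut-off
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Deriving the list from a threshold makes the rule explicit and easy to change, at the cost of recomputing the missing shares on the full data.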

Now, the complete train set is full of weight-zero rows. Their trade return is zero by construction, so they contribute nothing to our return target. Let's get rid of them by filtering df. After that, I'll create a trade_return column that equals $\text{resp} \times \text{weight}$.

In [None]:
df = df[df.weight != 0].copy() # filtering our df; .copy() avoids chained-assignment issues below
df['trade_return'] = df.weight * df.resp # trade_return column
df['action'] = (df.resp > 0).astype(int) # this will be my "second" target

In [None]:
# I'll also drop the ts_id column, since I think it's not useful for a first study
df.drop('ts_id', axis = 1, inplace = True)

In [None]:
df.shape # now we have about 1.9 million samples and 125 columns

It's a good idea to have a look at the correlations between features.

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.drop(['action', 'date'], axis = 1).corr(), cmap='inferno');

No single feature correlates strongly with trade_return. That is to be expected: if arbitrage could be done successfully with only one feature, everyone would succeed at investing.
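
A heatmap of this size is hard to read for one specific column. As a hedged sketch, the correlation of each feature with the target can also be ranked directly. The data below is synthetic and the column names are hypothetical; on the real frame the same calls would be applied to the feature columns and `trade_return`.

```python
import numpy as np
import pandas as pd

# Synthetic data: one weakly informative feature and one pure-noise feature
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "feature_x": rng.normal(size=n),
    "feature_y": rng.normal(size=n),
})
toy["trade_return"] = 0.1 * toy["feature_x"] + rng.normal(size=n)

feature_cols = ["feature_x", "feature_y"]

# Pearson correlation of every feature with the target
corr_with_target = toy[feature_cols].corrwith(toy["trade_return"])

# Rank features by absolute correlation, strongest first
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```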

# Splitting the train set

In [None]:
X = df.iloc[:, 8:-2] # Select only feature columns
y = df.iloc[:, -2:] # return per trade and action column
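
The positional slice `df.iloc[:, 8:-2]` depends on the exact column order, so any column sitting before position 8 is left out even if it is a feature. A minimal sketch of a name-based alternative follows, assuming every feature column starts with the prefix `feature_`. The toy frame is hypothetical, and on the real data this may select a different set of columns than the slice above.

```python
import pandas as pd

# Toy frame with the same kind of layout (values are illustrative)
toy = pd.DataFrame({
    "date": [0, 0], "weight": [1.0, 2.0], "resp": [0.01, -0.02],
    "feature_0": [1, -1], "feature_1": [0.3, 0.1], "feature_2": [1.2, -0.7],
    "trade_return": [0.01, -0.04], "action": [1, 0],
})

# Select feature columns by name instead of by position
feature_cols = [c for c in toy.columns if c.startswith("feature_")]
X_toy = toy[feature_cols]
y_toy = toy[["trade_return", "action"]]
print(feature_cols)
```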

In [None]:
# Action is based on resp, and resp*weight gives the return. As said before, I'll work with the return target instead of action
Xtrain, Xval, ytrain, yval = train_test_split(X, y.trade_return, test_size = .2, random_state=0)

I did a first preprocessing step when I dropped some of the features above, but I have not imputed any missing values yet. I'll do it now using the median, as I think the remaining features won't be affected much by median imputation.

In [None]:
Xtrain = Xtrain.fillna(value = Xtrain.median()) # the median is more robust to outliers than the mean
Xval = Xval.fillna(value = Xval.median()) # each split uses its own median, so no statistic is shared between train and val
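
The cell above imputes each split with its own median. A common alternative, sketched here on a hypothetical toy example, is to compute the medians on the training split only and reuse them for the validation split. This mirrors prediction time, when only training statistics are available. It is an option to compare, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy splits (values are illustrative)
train = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 10.0], "f2": [0.0, np.nan, 4.0, 2.0]})
val = pd.DataFrame({"f1": [np.nan, 3.0], "f2": [5.0, np.nan]})

# Medians are learned from the training split only (NaNs are skipped by default)
train_medians = train.median()

# The same training medians fill the gaps in both splits
train_filled = train.fillna(train_medians)
val_filled = val.fillna(train_medians)
print(val_filled)
```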

Let's get a visual sense of the features in the training set. Remember that EDA should be done on the train set; the validation set should only be used after we're done with the training one.

In [None]:
Xtrain.shape

In [None]:
# Creating a distribution plot for each of the variables
plt.figure(figsize=(30, 70))
for i, col in enumerate(Xtrain.columns, start=1):
    plt.subplot(58, 2, i)
    # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
    sns.histplot(Xtrain[col].iloc[:5000], bins=70, kde=True, stat='density') # using only first 5000 rows

plt.subplots_adjust(hspace=1.25, wspace=.2);

Almost all of the features seem to have a mean close to zero, but not all of them. Also, some features have different ranges of values, which may be a problem when fitting regressions without standardizing them. Let's standardize after creating a baseline model.

# Constructing a baseline

In [None]:
baseline = np.full(len(yval), ytrain.mean()) # np.array with the same length as yval, every element equal to the ytrain mean

# Any model with an RMSE worse than the baseline should be ignored
print ('Baseline RMSE:', np.sqrt(ms.mean_squared_error(yval, baseline))*100,'%')
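
For reference, the baseline can be worked through by hand on a few made-up numbers. This sketch assumes nothing beyond NumPy; the arrays are illustrative. It predicts the training mean for every validation row and computes the RMSE as the square root of the mean squared error.

```python
import numpy as np

# Illustrative targets (not taken from the competition data)
y_train = np.array([0.02, -0.01, 0.03, 0.00])
y_val = np.array([0.01, -0.02, 0.04])

# Constant prediction: the mean of the training target
baseline_pred = np.full(y_val.shape, y_train.mean())

# RMSE = sqrt(mean((y - y_hat)^2))
rmse = np.sqrt(np.mean((y_val - baseline_pred) ** 2))
print(rmse)
```

Any regressor that cannot beat this number is doing no better than always predicting the average return.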

# Standardizing data

In [None]:
scaler = StandardScaler()
scaler.fit(Xtrain)

Xtrain_std = scaler.transform(Xtrain)
Xval_std = scaler.transform(Xval)
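
As a small illustration of what the scaler does, the sketch below reproduces the standardization with plain NumPy on a toy array. `StandardScaler` centers each column on its training mean and divides by the population standard deviation (`ddof=0`). The arrays here are illustrative assumptions.

```python
import numpy as np

# Toy splits with two features (values are illustrative)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
val = np.array([[2.0, 30.0]])

# Statistics are learned from the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0)

train_std = (train - mu) / sigma
val_std = (val - mu) / sigma  # validation reuses the training statistics
print(train_std.mean(axis=0), train_std.std(axis=0))
```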

# LightGBM Regressor  
I think LGBM is a good start, as it's pretty fast and robust.

In [None]:
lgb = LGBMRegressor(num_leaves=30, n_estimators=400, max_depth=10) # Without any hyperparameter opt.

In [None]:
lgb.fit(Xtrain_std, ytrain)
predlgb = lgb.predict(Xval_std)
print ('Light GBM RMSE:', np.sqrt(ms.mean_squared_error(yval, predlgb))*100,'%')

# Now, classification  
Since we already have return predictions, we can classify the action target from the sign of trade_return: if it's greater than zero, action = 1; otherwise, action = 0.

In [None]:
predictions = pd.DataFrame() # blank dataframe
predictions['pred_lgb'] = predlgb # column containing the return predictions made before

df1 = pd.DataFrame(yval)
df1['action'] = (df1.trade_return > 0).astype(int) # with positive weights, trade_return > 0 exactly when resp > 0
df1.reset_index(drop=True, inplace=True) # align the index with predictions

predictions = pd.concat([df1, predictions], axis = 1) # concatenating both dataframes

Now let's create action predictions based on the LGB return predictions and compare them to the baseline.

In [None]:
predictions['lgb_action'] = (predictions.pred_lgb > 0).astype(int)
predictions['pred_baseline'] = 1 # because the average return from ytrain is > 0

predictions.head()

Finally, let's evaluate their accuracy. Remember that you can also try different regression models, hyperparameter optimization on LGB, XGBoost or another model, or even run many models and build a final ensemble of them, using stacking, average voting or mode voting!

In [None]:
print ('Baseline Accuracy:', ms.accuracy_score(predictions.action, predictions.pred_baseline)*100,'%')
print ('-'*50)
print ('Light Gradient Boosting Accuracy:', ms.accuracy_score(predictions.action, predictions.lgb_action)*100,'%')

We can see that the LGBM model performed very well, considering that little EDA was done beforehand and the hyperparameters weren't optimized. Be creative and try other ways to deal with this problem!
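
As one starting point for the ensembling ideas mentioned earlier, here is a minimal sketch of average voting and mode voting. The three prediction arrays are hypothetical outputs of three different regressors, not results from this notebook.

```python
import numpy as np

# Hypothetical return predictions from three different regressors
pred_a = np.array([0.02, -0.001, 0.005, -0.03])
pred_b = np.array([0.01, 0.02, -0.002, -0.01])
pred_c = np.array([-0.005, -0.002, 0.004, -0.02])
preds = np.vstack([pred_a, pred_b, pred_c])

# Average voting: average the predicted returns, then threshold at zero
avg_action = (preds.mean(axis=0) > 0).astype(int)

# Mode (majority) voting: threshold each model first, then take the majority
votes = (preds > 0).astype(int)
mode_action = (votes.sum(axis=0) > preds.shape[0] / 2).astype(int)

print(avg_action, mode_action)
```

The two rules can disagree: in the second trade, one confident positive prediction outweighs two slightly negative ones under averaging, but loses the majority vote.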