“Buy low, sell high.” It sounds so easy….

In reality, trading for profit has always been a difficult problem to solve, even more so in today’s fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time.

In a perfectly efficient market, buyers and sellers would have all the agency and information needed to make rational trading decisions. As a result, products would always remain at their “fair values” and never be undervalued or overpriced. However, financial markets are not perfectly efficient in the real world.

Developing trading strategies to identify and take advantage of inefficiencies is challenging. Even if a strategy is profitable now, it may not be in the future, and market volatility makes it impossible to predict the profitability of any given trade with certainty. As a result, it can be hard to distinguish good luck from having made a good trading decision.

In the first three months of this challenge, you will build your own quantitative trading model to maximize returns using market data from a major global stock exchange. Next, you’ll test the predictiveness of your models against future market returns and receive feedback on the leaderboard.

Your challenge will be to use the historical data, mathematical tools, and technological tools at your disposal to create a model that gets as close to certainty as possible. You will be presented with a number of potential trading opportunities, which your model must choose whether to accept or reject.

In general, if one is able to generate a highly predictive model which selects the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to “fair” values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and difficulty of coming up with a proper mathematical formulation.

Jane Street has spent decades developing their own trading models and machine learning solutions to identify profitable opportunities and quickly decide whether to execute trades. These models help Jane Street trade thousands of financial products each day across 200 trading venues around the world.

Admittedly, this challenge far oversimplifies the depth of the quantitative problems Jane Streeters work on daily, and Jane Street is happy with the performance of its existing trading model for this particular question. However, there’s nothing like a good puzzle, and this challenge will hopefully serve as a fun introduction to a type of data science problem that a Jane Streeter might tackle on a daily basis. Jane Street looks forward to seeing the new and creative approaches the Kaggle community will take to solve this trading challenge.

In [None]:
import time
import numpy as np
import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# install datatable
!pip install datatable > /dev/null
import datatable as dt

In [None]:
# fast load the data
train_datatable = dt.fread('../input/jane-street-market-prediction/train.csv')
train = train_datatable.to_pandas()
del train_datatable

example_test_datatable = dt.fread('../input/jane-street-market-prediction/example_test.csv')
example_test = example_test_datatable.to_pandas()
del example_test_datatable

features = pd.read_csv('../input/jane-street-market-prediction/features.csv', index_col=0)

In [None]:
#Function to reduce memory usage.
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

train = reduce_mem_usage(train)
# example_test = reduce_mem_usage(example_test)
features = reduce_mem_usage(features)

In [None]:
# Display the data
print('train:')
display(train)

print()
print('features:')
display(features)

In [None]:
# A better way to visualize the feature database.
features_T = (features*1).T
display(features_T.style.background_gradient(cmap='Blues'))
del features_T

This model will be scored with a utility function that multiplies weight and resp. Therefore, observations with weight equal to zero do not count toward the score. However, they may still carry important information for our model.\
We will analyze the distributions of the variables on the whole dataset and on only the observations with nonzero weight.
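
As a minimal sketch of why zero-weight rows drop out of the score, assume each trade contributes `weight * resp * action` to a daily profit, where `action` is 1 to accept and 0 to reject. The `action` column and the toy values below are hypothetical, and any further scaling applied by the competition's full utility function is not reproduced here.

```python
import pandas as pd

# Toy data; the values and the `action` column are illustrative assumptions.
toy = pd.DataFrame({
    'date':   [0, 0, 0, 1, 1],
    'weight': [0.0, 2.0, 1.0, 0.0, 4.0],
    'resp':   [0.05, 0.01, -0.02, -0.03, 0.005],
    'action': [1, 1, 0, 1, 1],
})

# Per-trade contribution: rows with weight == 0 always contribute exactly 0.
toy['contribution'] = toy['weight'] * toy['resp'] * toy['action']

# Daily profit is the sum of the contributions of that day.
daily_profit = toy.groupby('date')['contribution'].sum()
print(daily_profit)
```

Whatever the model predicts for a zero-weight row, its contribution stays at zero, which is why such rows can only help as training information.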

The only information we have about the weight variable is that observations with weight equal to zero will not be considered. Nonetheless, we do not know what it really means. Maybe these observations come from long-only funds and the weight variable is the asset's weight in the portfolio. However, this does not make much sense, because our goal is to predict a portfolio with a positive return, not the assets that a manager decided to buy and that had a positive return. In other words, if weight expresses the weight of an asset in a portfolio, our predicted portfolio will not be able to hold any asset that the fund manager decided not to buy. In my opinion, the weight variable is related to the liquidity of the asset. Under that reading, zero-weight assets are excluded from our utility function because there were no transactions in them.

There are 5 resp variables that express the return over different time horizons. However, we do not know what these time horizons are.

As for the date variable, the days considered are probably only business days (that is, excluding weekends and holidays). It would be interesting to have the exact calendar day, because the risk of carrying an asset over a weekend is higher. For example, if a terrorist attack occurs on a Saturday, the manager will not be able to sell the assets until the market reopens and may lose more money than on an ordinary business day.

As for the features, we do not have any information whatsoever. The features dataset attributes tags to each feature, but we do not know what these tags mean.

The ts_id is the id of the observation, not the id of an asset. Therefore, this variable carries no information; it is just an index. The lack of a per-asset id is unfortunate, because we will not be able to calculate the intrinsic risk of an asset.

In example_test, we have only 3 days and only 15,219 observations, which is less than 1% of our 2,390,491 training observations.

In [None]:
train_null_features = pd.DataFrame(data = train.isnull().sum() / len(train))
display(train_null_features.sort_values(by = 0, ascending = False).head(50))
del train_null_features

13 features have almost 15% of missing values.\
16 features have almost 3% of missing values.\
For all the rest, we have less than 1% of missing values.

In the best scenario, we will be able to remove these 13 variables with 15% of missing values when we do a correlation study. For the variables with missing values that we cannot exclude, we will either fill in the mean or, if we decide to work with categorical variables, use a missing-value flag. The flag is more reliable because it keeps the information that the value was missing.
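
The two options can also be combined. The sketch below fills missing values with the column mean and adds a 0/1 flag column; the helper name `impute_with_flag`, the `_missing` suffix, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature columns (values are made up).
toy = pd.DataFrame({
    'feature_a': [1.0, np.nan, 3.0, np.nan],
    'feature_b': [0.5, 0.7, np.nan, 0.9],
})

def impute_with_flag(df):
    """Fill NaNs with the column mean and add a 0/1 '<col>_missing' flag."""
    out = df.copy()
    for col in df.columns:
        if df[col].isnull().any():
            out[col + '_missing'] = df[col].isnull().astype(np.int8)
            out[col] = df[col].fillna(df[col].mean())
    return out

imputed = impute_with_flag(toy)
print(imputed)
```

In a real pipeline, the means should be computed on the training data only and reused at prediction time.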

In [None]:
# A quick description of the variables.
train.describe()

We cannot get a clear description of the variables using describe. We can only see that we have a binary variable, feature_0, and several continuous variables that can take negative and positive values.

Let's start by studying the date variable.

In [None]:
# Show most frequent dates.
train['date'].value_counts(dropna = False).head(60)

In [None]:
# Show least frequent dates.
train['date'].value_counts(dropna = False).tail(60)

In [None]:
train['date'].value_counts(dropna = False).describe()

In total there are 500 days, and we can see that there are no missing values in date.\
The day with the most observations has almost 20,000 asset returns, and the day with the fewest has only 29.\
The mean number of observations per day is about 4,800 (2,390,491 observations over 500 days).\
Let's plot a histogram to see it better.

In [None]:
fig = plt.figure(figsize=(15,10))

ax = train['date'].value_counts(dropna = False).hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)  # the `b` keyword was removed from recent matplotlib releases
plt.xlabel('Observations per day')

The number of observations per day is positively skewed.\
It is hard to believe that in only 500 days we once had 18,000 stocks and later only 2,500. This dataset probably contains derivatives such as options, which can be easily created and sold by any investor and have a short lifespan.\
This is a problem because it is almost impossible to analyze the price of an option without knowing the underlying stock. We could still study the Greeks of an option, but we do not know which features represent the Greeks. Moreover, studying an option with the Greeks alone would be a really poor study.\
The ideal would be to find a way to identify the assets that disappear from our dataset and model only the 2,500 assets that have observations on all days.
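
The skew in the number of observations per day can be quantified with the sample skewness. This is a small sketch on a made-up `date` column, not on the competition data.

```python
import pandas as pd

# Hypothetical 'date' column: day 0 is much busier than the others.
toy_dates = pd.Series([0] * 9 + [1] * 2 + [2] * 2 + [3] * 1 + [4] * 1, name='date')

# Number of observations per day.
obs_per_day = toy_dates.value_counts()
print(obs_per_day.describe())

# Sample skewness > 0 means a long right tail (a few very busy days).
skewness = obs_per_day.skew()
print('skewness:', skewness)
```

On the real data, the same two calls would be applied to `train['date'].value_counts()`.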

Let's now study the weight variable.

In [None]:
train['weight'].value_counts(dropna = False)

In [None]:
train['weight'].describe(percentiles = [.1, .25, .5, .571, .6, .75, .8, .85, .9, .95])

In [None]:
fig = plt.figure(figsize=(15,10))

ax = train[~np.in1d(train['weight'], 0)]['weight'].hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('weight')
plt.title('Weight distribution without weight = 0')

In [None]:
# Close up 
fig = plt.figure(figsize=(15,10))

ax = train[(~np.in1d(train['weight'], 0)) & (train['weight'] < 1.366877e+01)]['weight'].hist(density = True,
                histtype = 'bar', bins = 14, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('weight')
plt.title('Weight distribution with weight in (0, 13.67)')

In [None]:
# Close up 
fig = plt.figure(figsize=(15,10))

ax = train[(~np.in1d(train['weight'], 0)) & (train['weight'] < 2)]['weight'].hist(density = True,
                histtype = 'bar', bins = 14, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('weight')
plt.title('Weight distribution with weight in (0, 2)')

There are no negative and no missing values in the weight variable.\
Since there are no negative values, this variable can hardly express the position of an asset in a portfolio. In particular, if we are working with derivatives, this is probably a hedge fund, and a hedge fund with no short positions at all would be really strange.\
On the other hand, the highest value is 167.29. Therefore, this weight variable cannot represent the traded volume of an asset as a share of the total traded volume of the day.\
One possibility is that this variable represents a trading limit for an asset. In other words, if the weight is zero, the risk management area considers the asset too risky and will not allow any trade in it. If the weight is higher than 1, we can buy the asset and have an exposure higher than our total assets. Under this reading, the weight tells us something about the risk of investing in the asset: a low weight means high risk, and a high weight means low risk.

Let's filter out the observations with weight equal to zero and see if the number of observations per day becomes more stable.

In [None]:
train_wo_weight0 = train[train['weight'] != 0]  # same rows as ~np.in1d(train['weight'], 0)
display(train_wo_weight0['date'].value_counts(dropna = False).describe())
print()

fig = plt.figure(figsize=(15,10))

ax = train_wo_weight0['date'].value_counts(dropna = False).hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('Observations per day')

In [None]:
# The number of observations per day is more stable, but it is far from enough.
# Let's drop the assets with a low weight, i.e. the riskiest ones.
train_wo_weight025 = train[train['weight'] > 0.25]
display(train_wo_weight025['date'].value_counts(dropna = False).describe())
print()

fig = plt.figure(figsize=(15,10))

ax = train_wo_weight025['date'].value_counts(dropna = False).hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('Observations per day with weight > 0.25')

del train_wo_weight025

We can play with the weight condition to reduce the variance in the number of observations per day. However, we will not be able to find a magic condition that makes it stable.\
It is a good idea to study how the variance of the returns differs with weight. If the return distribution is too different for observations with weight equal to zero, it is best to drop these observations from our training. However, we can still use them to build new variables.
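
One way to run that comparison is a grouped summary of resp. The sketch below uses synthetic data in which zero-weight rows are deliberately made noisier, so the numbers only illustrate the mechanics; the `group` column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-in for train: zero-weight rows are given noisier returns
# on purpose, purely to illustrate the comparison.
weight = np.where(rng.random(n) < 0.2, 0.0, rng.exponential(1.0, n))
resp = rng.normal(0.0, np.where(weight == 0, 0.04, 0.02))
toy = pd.DataFrame({'weight': weight, 'resp': resp})

# Label each row by its weight group, then summarize resp per group.
toy['group'] = np.where(toy['weight'] == 0, 'weight == 0', 'weight > 0')
summary = toy.groupby('group')['resp'].agg(['count', 'mean', 'std'])
print(summary)
```

If the standard deviations of the two groups differ a lot on the real data, that would support dropping the zero-weight rows from training.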

In [None]:
# Now lets study the resp and resp 1 to 4 variables.
print('Missing:', train[['resp','resp_1','resp_2','resp_3','resp_4']].isnull().sum())
train[['resp','resp_1','resp_2','resp_3','resp_4']].describe()

In [None]:
#Correlation
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

corr_matrix = train[['resp','resp_1','resp_2','resp_3','resp_4']].corr()
corr_table = corr_matrix.abs().unstack()
labels_to_drop = get_redundant_pairs(train[['resp','resp_1','resp_2','resp_3','resp_4']])
corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)

size_x = 10    
size_y = 10
plt.figure(figsize = (size_x, size_y))
sns.set(font_scale = 1.5)

ax = sns.heatmap(corr_matrix, annot = False, linewidth = 0.2, cmap='coolwarm')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

In [None]:
plot_list = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4'] 
fig = make_subplots(rows=3, cols=2)

traces = [
    go.Histogram(
        x=train[col], 
        nbinsx=100, 
#         name=col,
        histnorm = 'probability',
    ) for col in plot_list
]

for i in range(len(traces)):
    fig.append_trace(
        traces[i], 
        (i // 2) + 1, 
        (i % 2) + 1
    )

fig.update_layout(
    title_text='Resp distributions', 
    height = 900,
    width = 800,
)

for i in range(len(traces)):
    fig.update_xaxes(title_text = plot_list[i], range=[-0.5, 0.5], row = (i // 2) + 1, col = (i % 2) + 1)
    fig.update_yaxes(range=[0, 0.5], row = (i // 2) + 1, col = (i % 2) + 1)

fig.show()

If shorter periods of time had a higher variance, resp_1 would represent the longest period of time, resp_4 the shortest, and resp would fall between resp_3 and resp_4. On the other hand, if we study the minimum and maximum returns, resp_1 should represent the shortest period, resp_4 the longest, and resp should again fall between resp_3 and resp_4. The second reading is the one consistent with the usual random-walk assumption, under which the variance of returns grows with the horizon. Finally, studying the correlation among the resp variables, resp is closest to resp_4. Having said that, we should keep in mind that our model will be used to make daily decisions, so resp is probably the daily return.\
If 1, 2, 3 and 4 represent periods of time, we can assume that resp is between resp_3 and resp_4.\
Alternatively, it may be a given that if we decide to buy an asset, we must carry it for at least a month. In that case, resp would represent the monthly return.
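
The relationship between horizon and variance can be checked with a small simulation. The sketch below assumes i.i.d. per-step log returns (a random-walk assumption that is not taken from the competition data); under it, the variance of the summed return grows roughly in proportion to the horizon.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumption: i.i.d. per-step log returns with a 1% standard deviation.
steps = rng.normal(0.0, 0.01, size=(100_000, 16))

# Return over a horizon of h steps = sum of the first h per-step returns.
horizons = [1, 2, 4, 8, 16]
variances = {h: steps[:, :h].sum(axis=1).var() for h in horizons}

for h, v in variances.items():
    print(f'horizon={h:2d}  variance={v:.6f}  ratio_to_1_step={v / variances[1]:.2f}')
```

Under this assumption, ordering the resp variables by variance would order them from the shortest to the longest horizon.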

Let's repeat this study adding a filter on weight.

In [None]:
display(train_wo_weight0[['resp','resp_1','resp_2','resp_3','resp_4']].describe())
print()

corr_matrix = train_wo_weight0[['resp','resp_1','resp_2','resp_3','resp_4']].corr()
corr_table = corr_matrix.abs().unstack()
labels_to_drop = get_redundant_pairs(train[['resp','resp_1','resp_2','resp_3','resp_4']])
corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)

size_x = 10    
size_y = 10
plt.figure(figsize = (size_x, size_y))
sns.set(font_scale = 1.5)

ax = sns.heatmap(corr_matrix, annot = False, linewidth = 0.2, cmap='coolwarm')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

Our conclusions stay the same with a filter on weight.

Let's study the resp by making an equal-weighted index of the assets.

In [None]:
train_resp = pd.DataFrame(data = train[['date', 'resp']].groupby(['date'])['resp'].mean())
train_resp['resp_1'] = train[['date', 'resp_1']].groupby(['date'])['resp_1'].mean()
train_resp['resp_2'] = train[['date', 'resp_2']].groupby(['date'])['resp_2'].mean()
train_resp['resp_3'] = train[['date', 'resp_3']].groupby(['date'])['resp_3'].mean()
train_resp['resp_4'] = train[['date', 'resp_4']].groupby(['date'])['resp_4'].mean()

# Creates several time periods for resp_
train_resp['resp_0_lag_1'] = train_resp['resp'].shift(1)
train_resp['resp_1_lag_1'] = train_resp['resp_1'].shift(1)
train_resp['resp_2_lag_1'] = train_resp['resp_2'].shift(1)
train_resp['resp_3_lag_1'] = train_resp['resp_3'].shift(1)

train_resp['resp_0_2'] = train_resp.apply(lambda row: (1 + row['resp'])*(1 + row['resp_0_lag_1']) - 1, axis=1)
train_resp['resp_1_2'] = train_resp.apply(lambda row: (1 + row['resp_1'])*(1 + row['resp_1_lag_1']) - 1, axis=1)
train_resp['resp_2_2'] = train_resp.apply(lambda row: (1 + row['resp_2'])*(1 + row['resp_2_lag_1']) - 1, axis=1)
train_resp['resp_3_2'] = train_resp.apply(lambda row: (1 + row['resp_3'])*(1 + row['resp_3_lag_1']) - 1, axis=1)

train_resp

In [None]:
train_resp.groupby('date')[['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']].sum().cumsum().plot(figsize=(20,15))
plt.title('Cumulative Sum of Different RESP\'s', fontsize = 20)
plt.xlabel('Date', fontsize = 15)
plt.legend(fontsize = 15, ncol = 3, loc = 2)

It is hard to believe that the only difference between the resp variables is the time horizon. If that were the case, we would be able to accumulate the return of one resp variable and obtain another resp variable as a result. Moreover, looking at the graph, it does not even look like these returns come from the same security. If we were to analyze just this graph, we would say that resp_1 is the cumulative return of a treasury bond and resp_4 the cumulative return of a successful hedge fund.
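
For reference, this is what accumulating returns means here. The sketch compounds hypothetical one-period returns of a single security into two-period and cumulative returns with vectorized pandas operations; the series `r` and its values are made up.

```python
import pandas as pd

# Hypothetical one-period returns for a single security.
r = pd.Series([0.01, -0.02, 0.03, 0.005], name='r_1')

# Two-period return ending at each step: (1 + r_t) * (1 + r_{t-1}) - 1.
r_2 = (1 + r) * (1 + r.shift(1)) - 1

# Cumulative compounded return from the start of the series.
cumulative = (1 + r).cumprod() - 1

print(pd.DataFrame({'r_1': r, 'r_2': r_2, 'cumulative': cumulative}))
```

If the resp variables differed only in horizon, a compounded short-horizon series should track a longer-horizon one; the first value of `r_2` is missing because there is no previous period to compound with.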

Let's analyze the feature variables.\
We already know that most features have a normal distribution, and feature_0 is a binary variable. One good notebook to see this is https://www.kaggle.com/isaienkov/jane-street-market-prediction-fast-understanding. \
We do not have any information about the feature variables. We have some tag variables that were supposed to give us some information about the features, but they are not of much help. An example of how these tag variables might work:\
assume that tag_1 is true if a feature considers 10 days;\
tag_2 is true if a feature considers 30 days;\
tag_3 is true if a feature calculates the volume.\
Therefore, if feature_1 is the volume in 10 days and feature_2 is the volume in 30 days, we would have tags 1 and 3 true for feature_1 and tags 2 and 3 true for feature_2.
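
The hypothetical example above can be written as a small boolean table. The tag meanings in the comments are the made-up ones from the example, not the real meaning of the competition's tags.

```python
import pandas as pd

# Hypothetical tag table mirroring the example in the text.
tags = pd.DataFrame(
    {'tag_1': [True, False],   # "considers 10 days"
     'tag_2': [False, True],   # "considers 30 days"
     'tag_3': [True, True]},   # "calculates the volume"
    index=['feature_1', 'feature_2'],
)

# Tags that are true for each feature.
tags_per_feature = {
    f: list(tags.columns[row.to_numpy(dtype=bool)]) for f, row in tags.iterrows()
}
print(tags_per_feature)

# Features that share a given tag, e.g. all "volume" features.
volume_features = list(tags.index[tags['tag_3']])
print(volume_features)
```

Grouping features by shared tags in this way is one option for building families of related features even without knowing what the tags mean.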

In [None]:
# Filter only feature variables.
mask = train.columns.str.contains('feature_')
features_only = train.loc[:,mask]

# Missing
train_null_features = pd.DataFrame(data = features_only.isnull().sum() / len(train))
train_null_features = train_null_features.sort_values(by = 0, ascending = False)
train_null_features.columns = ['%_missing']

display(train_null_features.head(30))
print()
display(train_null_features.tail(43))

13 features have almost 15% of missing values.\
16 features have almost 3% of missing values.\
For all the rest, we have less than 1% of missing values.

We have only 42 features without any missing values.

Let's start by studying feature_0 (which has no missing values).

In [None]:
train_feature_0_bar_graph = pd.DataFrame(train['feature_0'].value_counts())

ax = train_feature_0_bar_graph.plot(kind='bar', figsize=(15,10), width = 0.58, rot = 0,
                                           align='center', color = 'LightGray', edgecolor = None)

total = 0
for bars in ax.patches:
    total += bars.get_height()

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height/total:.2%}', (x + width/2, y + 100 + height), ha = 'center')

This did not help. Let's see whether the distribution changes when considering only observations with nonzero weight.

In [None]:
train_feature_0_wo_weight0_bar_graph = pd.DataFrame(train_wo_weight0['feature_0'].value_counts())

ax = train_feature_0_wo_weight0_bar_graph.plot(kind='bar', figsize=(15,10), width = 0.58, rot = 0,
                                           align='center', color = 'LightGray', edgecolor = None)

total = 0
for bars in ax.patches:
    total += bars.get_height()

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height/total:.2%}', (x + width/2, y + 100 + height), ha = 'center')

We can see that the distribution does not change much if we disregard observations with weight equal to zero. This suggests that feature_0 is not strongly related to whether the weight is zero.\
Let's see if the distribution of resp changes when feature_0 equals one and minus one.
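
The relation between feature_0 and a zero weight can also be measured directly instead of being read off two bar charts. This sketch uses synthetic data in which the two variables are independent by construction; the `zero_weight` column is a hypothetical 0/1 helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Synthetic stand-in: feature_0 and the zero-weight flag are drawn
# independently, so the two shares should be close by construction.
toy = pd.DataFrame({
    'feature_0': rng.choice([-1, 1], size=n),
    'zero_weight': (rng.random(n) < 0.17).astype(int),
})

# Share of zero-weight rows within each value of feature_0.
share = pd.crosstab(toy['feature_0'], toy['zero_weight'], normalize='index')
print(share)

# Correlation between the two binary variables (phi coefficient).
phi = toy['feature_0'].corr(toy['zero_weight'])
print('phi:', round(phi, 4))
```

On the real data, `zero_weight` would be `(train['weight'] == 0).astype(int)`; a phi coefficient near zero would back up the reading of the charts.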

In [None]:
train_feature_0_minus_one = train[train['feature_0'] == -1]
train_feature_0_one = train[train['feature_0'] == 1]

fig = plt.figure(figsize=(15,15))

ax0 = fig.add_subplot(3,1,1)
ax0 = train['resp'].hist(density = True,
                histtype = 'bar', bins = 100, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('resp')
plt.xlim(-0.2, 0.2)
plt.ylim(top = 40)


ax1 = fig.add_subplot(3,1,2)
ax1 = train_feature_0_minus_one['resp'].hist(density = True,
                histtype = 'bar', bins = 100, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('feature_0 = -1')
plt.xlim(-0.2, 0.2)
plt.ylim(top = 40)

ax2 = fig.add_subplot(3,1,3)
ax2 = train_feature_0_one['resp'].hist(density = True,
                histtype = 'bar', bins = 100, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('feature_0 = 1')
plt.xlim(-0.2, 0.2)
plt.ylim(top = 40)

In [None]:
print('resp describe:')
display(train['resp'].describe())
print()
print('resp describe considering feature_0 = -1:')
display(train_feature_0_minus_one['resp'].describe())
print()
print('resp describe considering feature_0 = 1:')
display(train_feature_0_one['resp'].describe())

It appears that feature_0 does not have a significant impact on resp, but when feature_0 equals 1 the resp variable is more concentrated around zero and has a lower variance.

Now let's see how important each feature is to our resp. Let's study the correlations.

In [None]:
#Correlation
corr_matrix = train.corr()
corr_table = corr_matrix['resp'].abs()
corr_table = corr_table.sort_values(ascending = False)
print('Features most correlated with resp:')
display(corr_table.head(10))
print()
print('Features least correlated with resp:')
display(corr_table.tail(10))

Unfortunately, our features have a low correlation with our resp.\
Let's see if we can get better correlations by disregarding observations with weight equal to zero.

In [None]:
corr_matrix = train_wo_weight0.corr()
corr_table = corr_matrix['resp'].abs()
corr_table = corr_table.sort_values(ascending = False)
print('Features most correlated with resp:')
display(corr_table.head(10))
print()
print('Features least correlated with resp:')
display(corr_table.tail(10))

This did not help.\
Let's see if feature_0 causes any changes in the correlations.

In [None]:
corr_matrix_one = train_feature_0_one.corr()
corr_table_one = corr_matrix_one['resp'].abs()
corr_table_one = corr_table_one.sort_values(ascending = False)
print('Features most correlated with resp (considering feature_0 = 1):')
display(corr_table_one.head(10))
print()
print('Features least correlated with resp (considering feature_0 = 1):')
display(corr_table_one.tail(10))

In [None]:
corr_matrix_minus_one = train_feature_0_minus_one.corr()
corr_table_minus_one = corr_matrix_minus_one['resp'].abs()
corr_table_minus_one = corr_table_minus_one.sort_values(ascending = False)
print('Features most correlated with resp (considering feature_0 = -1):')
display(corr_table_minus_one.head(10))
print()
print('Features least correlated with resp (considering feature_0 = -1):')
display(corr_table_minus_one.tail(10))

In [None]:
print('Features with correlation higher than 1%:', len(corr_table[corr_table > 0.01]))
print('Features with correlation higher than 1% (considering feature_0 = 1):', len(corr_table_one[corr_table_one > 0.01]))

Unfortunately, we have low correlations with resp. However, we can see that when feature_0 equals one, our correlations increase a little.\
Let's study the correlation between features.

In [None]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def correlation_variables(db, features, n, correlation_cut):
    
    db_variables = db[features]

    #Correlation
    corr_matrix = db_variables.corr()
    corr_table = corr_matrix.abs().unstack()
    labels_to_drop = get_redundant_pairs(db_variables)
    corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)


    print("Top Absolute Correlations")
    print(corr_table[: n])
    print()
    print("Least Absolute Correlations")
    print(corr_table[len(corr_table) - n: ])
    print()
    print("Variables with correlation higher than", correlation_cut, ':', len(corr_table[abs(corr_table) > correlation_cut]))
    print("Percent:", round(len(corr_table[abs(corr_table) > correlation_cut]) / len(corr_table), 4))
    print()

    size_x = 20     #This is a good size to visualise the heatmap saved as .png
    size_y = 20
    plt.figure(figsize = (size_x, size_y))
    sns.set(font_scale = 1.5)

    ax = sns.heatmap(corr_matrix, annot = False, linewidth = 0.2, cmap='coolwarm')
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)

    plt.tight_layout()
    plt.savefig('correlation.png')

    plt.show() 

In [None]:
features_corr_higher_one = corr_table[corr_table > 0.01].index
features_corr_higher_one = features_corr_higher_one.drop(['resp_1', 'resp_2', 'resp_3', 'resp_4'])
correlation_variables(train, features_corr_higher_one, 20, 0.9)

In [None]:
print('Correlation considering feature_0 = 1')
features_corr_higher_one_f01 = corr_table_one[corr_table_one > 0.01].index
features_corr_higher_one_f01 = features_corr_higher_one_f01.drop(['resp_1', 'resp_2', 'resp_3', 'resp_4'])
correlation_variables(train_feature_0_one, features_corr_higher_one_f01, 35, 0.9)  # use the feature_0 = 1 subset announced above

Not only are our features weakly correlated with resp, but they are also highly correlated with each other.
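
One common way to act on this is to drop one feature from every highly correlated pair. The sketch below is one possible approach on synthetic data; the helper `drop_highly_correlated` and the 0.9 cut are assumptions, and which member of a pair to keep is a modeling choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1_000)

# Synthetic features: f_b is almost a copy of f_a, f_c is independent.
toy = pd.DataFrame({
    'f_a': base,
    'f_b': base + rng.normal(scale=0.01, size=1_000),
    'f_c': rng.normal(size=1_000),
})

def drop_highly_correlated(df, cut=0.9):
    """Drop the later column of every pair with |correlation| above `cut`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cut).any()]
    return df.drop(columns=to_drop), to_drop

reduced, dropped = drop_highly_correlated(toy, cut=0.9)
print('dropped:', dropped)
```

The upper-triangle mask plays the same role as `get_redundant_pairs`: it avoids counting the diagonal and each pair twice.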

Let's see what happens if we consider only the observations with feature_17 missing (our feature with the most missing values).

In [None]:
train_feature_17_missing = train[train['feature_17'].isnull()]
mask = train_feature_17_missing.columns.str.contains('feature_')
features_only_17_missing = train_feature_17_missing.loc[:,mask]

# Missing
train_null_features_feature_17 = pd.DataFrame(data = features_only_17_missing.isnull().sum() / len(features_only_17_missing))
train_null_features_feature_17 = train_null_features_feature_17.sort_values(by = 0, ascending = False)
train_null_features_feature_17.columns = ['%_missing']

print('Missings considering when feature_17 is missing:')
display(train_null_features_feature_17.head(30))
print()
display(train_null_features_feature_17.tail(45))

In [None]:
print('Distribution of observations per day:')

fig = plt.figure(figsize=(15,10))

ax = train_feature_17_missing['date'].value_counts(dropna = False).hist(density = True,
                histtype = 'bar', bins = 100, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('Observations per day')

train_feature_17_missing['date'].value_counts(dropna = False).describe()

In [None]:
print('Distribution of resp when feature_17 is missing:')

fig = plt.figure(figsize=(15,10))

ax = train_feature_17_missing['resp'].hist(density = True,
                histtype = 'bar', bins = 100, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('resp')

# Describe resp itself; value_counts() on a continuous variable is not informative.
train_feature_17_missing['resp'].describe()

With feature_17 missing, the number of observations per day becomes more stable. Therefore, we have probably excluded some type of asset, and it may be a good idea to build a separate model. However, we still have the problem of many missing values in our dataset. \
Considering that missing values can carry important information about the asset type, it is a good idea to work with categorical variables.
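
A minimal sketch of that idea: bin a continuous feature into quantiles and give missing values their own category instead of imputing them. The toy series, the choice of quartiles, and the `-1` code for missing are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous feature with missing values.
feature = pd.Series([-1.2, 0.3, np.nan, 2.5, -0.4, np.nan, 0.9, 1.7])

# Bin the observed values into quartiles (integer codes 0..3, NaN stays NaN).
binned = pd.qcut(feature, q=4, labels=False)

# Give missing values their own category (-1) instead of imputing them.
categorical = binned.fillna(-1).astype(int)
print(categorical.value_counts().sort_index())
```

This keeps "missing" as a signal the model can use, at the cost of losing the ordering within each bin.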

In conclusion:\
we can drop observations with weight = 0;\
we can build different models depending on the value of feature_0 or on whether some feature is missing;\
we can drop several variables that are uncorrelated with resp; and\
we can drop some variables that are highly correlated with another variable.