Table of contents:
- [1. ENVIRONMENT CREATION AND DATA IMPORT](#section-1)


- [2. DATA EXPLORATION, CLEANING AND FEATURE SELECTION](#section-2)
    - [Evaluation features](#section-2.1)
    - [Anonymized features](#section-2.2)
    - [Date features](#section-2.3)
    - [Features transformation](#section-2.4)

<a id="section-1"></a>
# 1. ENVIRONMENT CREATION AND DATA IMPORT #

I am using the datatable trick to speed up reading the train CSV into pandas. Source: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

In [None]:
import contextlib
import time

@contextlib.contextmanager
def timer():
    start = time.time()
    
    yield

    end = time.time()
    runtime = 'Runtime: {:.2f}s \n'.format(end - start)
    print(runtime)

In [None]:
pip install datatable #==0.11.0 > /dev/null

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datatable as dt

%matplotlib inline

#train = pd.read_csv("/kaggle/input/jane-street-market-prediction/train.csv")
train_datatable = dt.fread('../input/jane-street-market-prediction/train.csv')
train = train_datatable.to_pandas()

features = pd.read_csv("/kaggle/input/jane-street-market-prediction/features.csv")
example_test = pd.read_csv("/kaggle/input/jane-street-market-prediction/example_test.csv")
example_sample_submission = pd.read_csv("/kaggle/input/jane-street-market-prediction/example_sample_submission.csv")

## Imported datasets overview ##

In [None]:
train.shape

In [None]:
train.head(10)

The train dataset is very large, with 2,390,491 rows and 138 columns. 130 of those columns are anonymized features (their metadata is in the features dataframe).
The remaining 8 columns represent:
- date - the day of trading opportunity
- weight - the trading opportunity magnitude (the higher the value, the more it contributes to the final result)
- resp_x - trading opportunity gain/loss for different time horizons
- resp - trading opportunity final gain/loss - target_feature
- ts_id - time ordering, it is also the id_feature

In [None]:
features.shape

In [None]:
features.head(10)

For every anonymized feature, there are 29 boolean columns of metadata.

In [None]:
example_test.shape

In [None]:
example_test.head(10)

The example test set has the same columns as train, except for the resp columns. This confirms that we are predicting resp values. If the value is greater than 0, the trading opportunity should be accepted, otherwise declined.

The resp value is continuous, meaning regression models will have to be used.

<a id="section-2"></a>
# 2. DATA EXPLORATION, CLEANING AND FEATURE SELECTION #

First I am categorizing features into logical groups.

In [None]:
#Feature categorization
id_feature = "ts_id"
target_feature = "resp"
evaluation_features = ["weight", "resp_1", "resp_2", "resp_3", "resp_4", "resp"]
anonymized_features = [x for x in train.columns if "feature" in x]
datetime_features = ["date"]

The train dataframe is very large, so data exploration is performed on a sample of it. I then run my custom function for a general data overview.
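
A minimal sketch of the sampling step on toy data, assuming the row index plays the role of the time ordering (as ts_id does in train). The names `toy`, `shuffled` and `ordered` are made up for illustration. It shows why the sample is sorted by index afterwards: `sample` returns rows in random order, while the cumulative sums used later only make sense in chronological order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy stand-in for train: the index is the time ordering
toy = pd.DataFrame({"resp": rng.normal(size=1000)})

# sample() draws rows without replacement and returns them in random order
shuffled = toy.sample(n=100, random_state=1)

# sort_index() restores the chronological order of the sampled rows
ordered = shuffled.sort_index()

print(shuffled.index[:5].tolist())
print(ordered.index[:5].tolist())
```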

In [None]:
train_sample = train.sample(n=100000, random_state=1).sort_index()

In [None]:
from scipy.stats import kendalltau, pearsonr, spearmanr
import scipy.stats as stats
from sklearn.feature_selection import chi2
from sklearn.metrics import mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score

#Features overview function

def kendall_pval(x,y):
    return kendalltau(x,y)[1]

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

def spearmanr_pval(x,y):
    return spearmanr(x,y)[1]

def generate_features_overview(df):
    df_info = pd.DataFrame()
    df_info["type"] = df.dtypes
    df_info["missing_count"] = df.isna().sum()
    df_info["missing_perc"] = df_info["missing_count"] / len(df)
    df_info["unique"] = df.nunique()
    df_info["skew"] = df.skew()
    df_info["kurt"] = df.kurt()
    df_info["corr"] = df.corrwith(df[target_feature], method="spearman")
    df_info["corr_p_value"] = df.corrwith(df[target_feature], method=spearmanr_pval)
    df_info = pd.concat([df_info, df.describe().T], axis=1)
    
    return df_info

In [None]:
#Generate overview dataframe
df_info = generate_features_overview(train_sample)

Checking the features data types.

In [None]:
df_info["type"].value_counts()

In [None]:
df_info[df_info["type"] == "int64"].index

Most of the features are floats, but there are also 3 integers. According to the competition documentation, "date" represents a day and "ts_id" represents time ordering.
However the third one is interesting and should be inspected further.

In [None]:
train_sample["feature_0"].value_counts()

feature_0 is not merely an integer, it is binary!

<a id="section-2.1"></a>
## 2.1 Evaluation features ##

The evaluation metric is based on the product of weight and resp. Therefore, we should take only the opportunities where resp is greater than 0, otherwise we will make a loss.
On the other hand, weight magnifies the gain/loss: if the weight is 0, we can neither gain nor lose.
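
To make the role of weight and resp concrete, here is a hedged sketch of the score on toy data. It assumes the utility definition from the competition's evaluation page as I recall it: daily profit p_i = sum(weight * resp * action), t = (sum p_i / sqrt(sum p_i^2)) * sqrt(250 / number of days), score = min(max(t, 0), 6) * sum p_i. This notebook only states that the metric is based on weight times resp, so treat the exact formula, the function name `utility_score` and the toy values as assumptions.

```python
import numpy as np
import pandas as pd

def utility_score(df):
    # p_i: profit of day i, summed over that day's accepted trades
    daily = (df["weight"] * df["resp"] * df["action"]).groupby(df["date"]).sum()
    total = daily.sum()
    denom = np.sqrt((daily ** 2).sum())
    if denom == 0:
        return 0.0  # nothing traded, nothing gained or lost
    # Annualised Sharpe-like ratio (assumed formula, see the text above)
    t = total / denom * np.sqrt(250 / len(daily))
    return float(min(max(t, 0.0), 6.0) * total)

# Toy data, not taken from the competition files
toy = pd.DataFrame({
    "date": [0, 0, 1, 1],
    "weight": [1.0, 0.0, 2.0, 0.5],
    "resp": [0.01, -0.02, 0.03, -0.01],
})

# Accept a trade only when resp is positive
toy["action"] = (toy["resp"] > 0).astype(int)

score = utility_score(toy)
print(score)
```

The second row has weight 0, so it contributes nothing whichever action is chosen.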

In [None]:
eval_info = df_info.loc[evaluation_features]
eval_info

All of the features are continuous:
- Weight ranges between 0 and 162 with the median of 0.55 (I am using median because of high skewness)
- Resp features are distributed around 0

In [None]:
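# distplot is deprecated in newer seaborn releases; histplot with a KDE is the closest replacement
sns.histplot(train_sample["weight"], kde=True, stat="density")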

In [None]:
train_sample["weight"].value_counts()

The weight feature is extremely right skewed, since 17,072 values (approx. 17% of the sample) are 0. Its kurtosis of 72 is also extremely high, meaning there are lots of outliers.
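
As an illustration of how such numbers arise, here is a small sketch on synthetic data (not the competition data). It assumes a zero-inflated, log-normal-like variable as a stand-in for weight; the parameters are made up. Note that pandas reports excess kurtosis, so a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 17% exact zeros plus a long right tail (hypothetical parameters)
weights = pd.Series(
    np.concatenate([np.zeros(170), rng.lognormal(mean=0.0, sigma=1.5, size=830)])
)

summary = {
    "zero_share": (weights == 0).mean(),
    "median": weights.median(),
    "mean": weights.mean(),
    "skew": weights.skew(),   # > 0 means a right tail
    "kurt": weights.kurt(),   # excess kurtosis, 0 for a normal distribution
}
print(summary)
```

The mean sits well above the median here, which is why the median is the safer summary for such a feature.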

I am creating cumulative sums on resp values to visualise the trend.

In [None]:
cumsum_df = pd.DataFrame()
cumsum_df["resp_1"] = train_sample['resp_1'].cumsum()
cumsum_df["resp_2"] = train_sample['resp_2'].cumsum()
cumsum_df["resp_3"] = train_sample['resp_3'].cumsum()
cumsum_df["resp_4"] = train_sample['resp_4'].cumsum()
cumsum_df["resp"] = train_sample['resp'].cumsum()

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
cumsum_df["resp"].plot()

In [None]:
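# distplot is deprecated in newer seaborn releases; histplot with a KDE is the closest replacement
sns.histplot(train_sample["resp"], kde=True, stat="density")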

The resp trend is strongly positive. Its distribution has low skewness and high kurtosis. The other resp features should be very similar. Let's compare them.

In [None]:
fig, ax = plt.subplots(2,2,figsize=(15,10))

# Reassign instead of dropping in place, eval_info is a selection from df_info
eval_info = eval_info.drop("weight")

sns.barplot(x=eval_info.index, y="mean", data=eval_info, ax=ax[0, 0])
sns.barplot(x=eval_info.index, y="std", data=eval_info, ax=ax[1, 0])
sns.barplot(x=eval_info.index, y="skew", data=eval_info, ax=ax[0, 1])
sns.barplot(x=eval_info.index, y="kurt", data=eval_info, ax=ax[1, 1])

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
cumsum_df["resp_1"].plot()
cumsum_df["resp_2"].plot()
cumsum_df["resp_3"].plot()
cumsum_df["resp_4"].plot()
cumsum_df["resp"].plot()
plt.legend(loc="upper left")

The mean and standard deviation charts look very similar. The rising mean tells us resp is growing over time, but the std is growing as well, which means higher risk.
Skewness stays within the range of a normal-like distribution, and the drop in kurtosis means fewer outliers over time; however, the value is still high, so we should expect outliers.

Comparing all the resp values indicates that resp_4 and resp should be swapped to make the resp values chronological.

I will create a new feature named "action" for easier data visualisation: 1 if resp > 0, otherwise 0.

In [None]:
# Vectorized comparison, same result as applying a lambda row by row
train_sample["action"] = (train_sample["resp"] > 0).astype(int)

<a id="section-2.2"></a>
## 2.2 Anonymized features ##

In [None]:
anony_info = df_info.loc[anonymized_features]
anony_info

There are 130 numerical features and, as already mentioned, the first one is binary.

First I will plot the values ranges.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(20,7))
sns.barplot(x=anony_info.index, y="min", data=anony_info)
sns.barplot(x=anony_info.index, y="max", data=anony_info)

ax.set_xticklabels(range(0,130), rotation=90)
ax.set_ylabel('min / max')

The majority of features stay between -100 and 100, however a few go far beyond those margins. Features 55 to 59 seem similar.

Now let's take a look at the number of unique values and missing values.

In [None]:
#Number of unique values
fig, ax = plt.subplots(1,1,figsize=(20,5))
sns.barplot(x=anony_info.index, y="unique", data=anony_info)

ax = ax.set_xticklabels(range(0,130), rotation=90)

Most features have lots of unique values, which indicates that they are continuous, however some of them stand out. Those are features 0, 43, 51, 52, 53 and 69, and they should be inspected further.
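
A small sketch of how such low-cardinality columns can be flagged automatically instead of reading them off the chart. The data is synthetic, and the cut-off of 10 unique values and the column names are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one binary, one with a few levels, one continuous
toy = pd.DataFrame({
    "binary": rng.choice([-1, 1], size=500),
    "few_levels": rng.integers(0, 5, size=500),
    "continuous": rng.normal(size=500),
})

# nunique() counts distinct non-missing values per column
unique_counts = toy.nunique()

# Hypothetical threshold: at most 10 distinct values counts as "discrete"
low_cardinality = unique_counts[unique_counts <= 10].index.tolist()
print(low_cardinality)
```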

In [None]:
#Percentage of missing values
fig, ax = plt.subplots(1,1,figsize=(20,5))
sns.barplot(x=anony_info.index, y="missing_perc", data=anony_info)

ax = ax.set_xticklabels(range(0,130), rotation=90)

In [None]:
high_missing = anony_info[anony_info["missing_perc"] > 0.10].index
high_missing

Most of the features have less than 4% of missing values, but there are also some with more than 16%.

We can also see a clear pattern here indicating that some features could be grouped.

Possible feature groups:
- Feature 7 - 36 (30 features)
- Feature 72 - 119 (48 features)
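
One way to check such groups beyond eyeballing the bar chart is to compare the actual missing-value masks. The sketch below groups columns whose NaN positions are identical. It runs on a toy frame with made-up column names, and `group_by_missing_pattern` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: a/b share one NaN pattern, c/d another, e has no NaN
toy = pd.DataFrame({
    "feature_a": [np.nan, 1.0, 2.0, 3.0],
    "feature_b": [np.nan, 0.5, 0.1, 0.2],
    "feature_c": [1.0, np.nan, np.nan, 4.0],
    "feature_d": [2.0, np.nan, np.nan, 1.0],
    "feature_e": [1.0, 2.0, 3.0, 4.0],
})

def group_by_missing_pattern(df):
    """Group column names by the exact positions of their missing values."""
    groups = {}
    for col in df.columns:
        pattern = tuple(df[col].isna())  # hashable NaN mask of the column
        groups.setdefault(pattern, []).append(col)
    return list(groups.values())

missing_groups = group_by_missing_pattern(toy)
print(missing_groups)
```

On the 100,000-row sample the tuple keys are large but still workable. Equal missing percentages alone do not prove that the same rows are missing, which is what this check adds.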

Next I will take a look at the distribution moments.

In [None]:
#Mean and standard deviation
fig, ax = plt.subplots(2,1,figsize=(20,10))

g1 = sns.barplot(x=anony_info.index, y="mean", data=anony_info, ax=ax[0])
g2 = sns.barplot(x=anony_info.index, y="std", data=anony_info, ax=ax[1])

g2.set_xticklabels(range(0,130), rotation=90)
g1.set(xlabel=None)
g1.set(xticklabels=[])

Mean values are very close to zero, but there are some groups forming in the chart (features 72 to 83).
Some standard deviations are above 6 (features 55 - 59).

In [None]:
#Skewness and kurtosis
fig, ax = plt.subplots(2,1,figsize=(20,10))
g1 = sns.barplot(x=anony_info.index, y="skew", data=anony_info, ax=ax[0])
g2 = sns.barplot(x=anony_info.index, y="kurt", data=anony_info, ax=ax[1])
g2.set_xticklabels(range(0,130), rotation=90)
g1.set(xlabel=None)
g1.set(xticklabels=[])

The skewness of most features is far outside +/- 0.5 (the threshold for a normal-like distribution). There are also many high kurtosis values, meaning we will have to deal with outliers.
Again we can see the group of features 55-59, which should be inspected further as a group.

But first I will take a look at the correlation to the target feature.

In [None]:
fig, ax = plt.subplots(2,1,figsize=(20,10))
g1 = sns.barplot(x=anony_info.index, y="corr", data=anony_info, ax=ax[0])
g2 = sns.barplot(x=anony_info.index, y="corr_p_value", data=anony_info, ax=ax[1])
g2.set_xticklabels(range(0,130), rotation=90)
#g1.set_xticklabels(range(0,130), rotation=90)
g1.set(xlabel=None)
g1.set(xticklabels=[])

There seems to be no correlation with the target feature, since the highest coefficient is around 0.04. Again we can see some feature groups forming.
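
A word of caution, sketched on synthetic data: the Spearman coefficient only measures monotonic association, so a value near zero does not rule out other kinds of dependence. In the toy example below y is fully determined by x, yet the rank correlation is zero. Spearman is computed here as the Pearson correlation of the ranks, which is its definition; the variable names are made up.

```python
import numpy as np
import pandas as pd

# Symmetric x values, y depends on x but not monotonically
x = pd.Series(np.arange(-100, 101) / 100.0)
y = x ** 2

# Spearman correlation = Pearson correlation of the ranks (ties get average ranks)
spearman = x.rank().corr(y.rank())
print(round(spearman, 6))
```

Whether something similar happens in the anonymized features is an open question; the scatter plots further down are one way to look for it.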

Next I will take a look at the correlation between the anonymized features themselves, also using the features dataframe.
But first I am filling the missing values with the mean for faster calculation.

In [None]:
#Fill missing values
for i in range(0, 130):
    feat = "feature_{}".format(i)
    # Assign the result back; an inplace fillna on a column selection may act on a copy
    train_sample[feat] = train_sample[feat].fillna(train_sample[feat].mean())

## Correlations of anonymous features ##

In [None]:
#Correlation function

def show_corr_heatmap(df, method="pearson", width=10, calc_corr=False, annot=True):
    
    if calc_corr == True:
        if method == "MI":
            corr = MI_correlations(df)
        else:
            corr = df.corr(method)
    else:
        corr = df
        
    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(width, width))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=annot, fmt=".2f")
    
    if calc_corr == True:
        return corr


def MI_correlations(df):
    corrs = {}
    for col_init in df.columns:
        corrs[col_init] = {}
        for col_corr in df.columns:
            if col_init != col_corr:
                corrs[col_init][col_corr] = calc_MI(df[col_init], df[col_corr])

    return pd.DataFrame(corrs)

def calc_MI(col_init, col_corr):
    
    # np.object was removed from NumPy, the builtin object works in all versions
    if col_init.dtype == object:
        col_init = col_init.astype('category').cat.codes
    elif col_init.dtype.name == "category":
        col_init = col_init.cat.codes
        
    if col_corr.dtype == object:
        col_corr = col_corr.astype('category').cat.codes
    elif col_corr.dtype.name == "category":
        col_corr = col_corr.cat.codes

    # Mutual information between the two label vectors
    mi = mutual_info_score(col_init, col_corr)

    return mi

In [None]:
#Calculate correlation matrix using custom function
corr_matrix = show_corr_heatmap(train_sample[anonymized_features], method="spearman", width=30, calc_corr=True, annot=False)

In [None]:
#Features dataframe heatmap
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(features.iloc[:,1:])

Correlation matrix is really large and confusing but there are clearly some patterns. I will cut it in parts for easier understanding and compare it to features dataframe.

### Feature 0 - 40 ###

According to the correlation matrix and the features dataframe, this seems like a good first step. My goal is to decode the feature tags.

In [None]:
show_corr_heatmap(corr_matrix.iloc[0:41, 0:41], width=30)

In [None]:
fig, ax = plt.subplots(figsize=(20, 12))
sns.heatmap(features.iloc[:41,1:])

Highlights:
- Feature 0 is binary and is strongly correlated with features 17-40.
- Odd features from 1 to 39 are strongly positively correlated with even features 2 to 40 - tag 9 could be indicating this
- Feature groups 17-26 & 37-38 and 27-36 & 39-40 are each strongly positively correlated internally, but the two groups are strongly negatively correlated with each other
- Only one feature in the range 17-40 should be used due to mutual correlations
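
The "use only one feature per correlated group" idea can be sketched as a greedy filter. This is an illustration on synthetic data, not the selection procedure used in this notebook (the selection below is done by hand). `drop_correlated` and the 0.9 threshold are hypothetical. The absolute value is taken so that strongly negatively correlated features are treated as redundant too.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Keep a column only if it is not strongly correlated with one already kept."""
    corr = df.corr().abs()
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=500)

toy = pd.DataFrame({
    "f_a": base,
    "f_b": base + rng.normal(scale=0.01, size=500),   # near copy of f_a
    "f_c": -base + rng.normal(scale=0.01, size=500),  # near mirror image of f_a
    "f_d": rng.normal(size=500),                      # unrelated
})

kept = drop_correlated(toy)
print(kept)
```

The filter keeps the first column of each group, so the column order decides which representative survives.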

Tags Decoding:
- Tags 0 to 4 could represent different resp values
- The features in tag 4 also have high percentage of missing values
- Alternating tag 9 could be representing the strong odd/even features correlation, therefore we should use one or another
- Tag 6 could be representing first main category of features
- Tags 7, 8, 10, 11, 12 and 13 could be the subcategory of 6 due to their features mutual correlation

In [None]:
#Possible feature groups, add 1 to get even values

g1_feat = [1, [3, 5]]
g1_feat_1 = [7, [17, 27]] #add 2 to get other resp values

### Feature 41 - 71 ###

This is my second group according to the correlation matrix and the lack of correlation between features 41-71 to 17-40.

In [None]:
show_corr_heatmap(corr_matrix.iloc[41:72, 41:72], width=30)

In [None]:
fig, ax = plt.subplots(figsize=(20, 12))
sns.heatmap(features.iloc[41:72,1:])

Highlights:
- High correlation between features 46 - 52, grouped in tag 19
- High correlation between features 55 - 59, which are also categorized in tags 0 - 4 (resp?) and grouped by tag 21. They have much higher std/skew/kurtosis than the other columns.
- Feature 44 has only one good correlation in this group - feature 45, could be related to tag 15
- There is a clear pattern between features 60 - 68, they are all tagged with 12 and 13, which were also positive correlated in the previous group. The exception is feature 64.
- Good correlation between features 69 - 71 grouped in tag 20. Feature 70 and 71 could have alternating tag 9 like in previous group

In [None]:
#Possible feature groups

g2_feat = [[41, 42, 43, 53], [44, 45], [46, 47, 48, 49, 50, 51, 52], 54, [60, 61], [62, 63], 64, [65, 66], [67, 68], [69, 70, 71]]
g2_feat_1 = [55] #add 1 to get other resp values

### Feature 72 - 119 ###

In [None]:
show_corr_heatmap(corr_matrix.iloc[72:120, 72:120], width=30)

In [None]:
fig, ax = plt.subplots(figsize=(20, 12))
sns.heatmap(features.iloc[72:120,1:])

Highlights:
- Tag 23 is the category for this group of features
- Tags 24, 25, 26 and 27 are subcategories, the correlation also indicates this split
- Tags 15 and 17 also split those subcategories
- Tags 0 - 4 could be indicating resp
- Features groups 72-77, 78-83, 84-89, 90-95, 96-101, 102-107, 108-113, 114-119 have the same pattern in missing values - they could show the same data
- High correlation between features 95, 107, 119 and 89, 101, 113 - only one combo of tag 5 / tag 15 and tag 17 / tag 24, 25 and 26 should be used
- Tag 15 and 17 features are also in a strong correlation, meaning only features in one of those tags should be used

In [None]:
#Possible feature groups

g3_feat = [[77, 83], [89, 95, 101], [107, 113, 119]] #last 2 elements can be grouped
g3_feat_1 = [[72, 78], [84, 96, 108], [90, 102, 114]] #add 1 at first element to get other resp values, last 2 elements can be grouped

### Feature 120 - 129 ###

In [None]:
show_corr_heatmap(corr_matrix.iloc[120:130, 120:130])

In [None]:
fig, ax = plt.subplots(figsize=(20, 12))
sns.heatmap(features.iloc[120:130,1:])

Highlights:
- Even features 120, 122, 124, 126, 128 and odd features 121, 123, 125, 127, 129 are strongly positively correlated.
- The pattern gives us one more proof that tag 0 to 4 features are very similar and we should use only one of them

In [None]:
#Possible feature groups
g4_feat_1 = [120, 121] #add 2 to get other resp values

In [None]:
# All features groups
# g1_feat = [1, [3, 5]] #add 1 to get even values
# g2_feat = [[41, 42, 43, 53], [44, 45], [46, 47, 48, 49, 50, 51, 52], 54, [60, 61], [62, 63], 64, [65, 66], [67, 68], [69, 70, 71]]
# g3_feat = [[77, 83], [89, 95, 101], [107, 113, 119]] #last 2 elements can be grouped

# g1_feat_1 = [7, [17, 27]] #add 2 to get other resp values, add 1 to get even values
# g2_feat_1 = [55] #add 1 to get other resp values
# g3_feat_1 = [[72, 78], [84, 96, 108], [90, 102, 114]] #add 1 to get other resp values, last 2 elements can be grouped
# g4_feat_1 = [120, 121] #add 2 to get other resp values

I will now plot cumulative sums for resp related features.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_7"].cumsum().plot()
train_sample["feature_9"].cumsum().plot()
train_sample["feature_11"].cumsum().plot()
train_sample["feature_13"].cumsum().plot()
train_sample["feature_15"].cumsum().plot()
plt.legend(loc="upper left")

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_17"].cumsum().plot()
train_sample["feature_19"].cumsum().plot()
train_sample["feature_21"].cumsum().plot()
train_sample["feature_23"].cumsum().plot()
train_sample["feature_25"].cumsum().plot()
plt.legend(loc="upper left")

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_55"].cumsum().plot()
train_sample["feature_56"].cumsum().plot()
train_sample["feature_57"].cumsum().plot()
train_sample["feature_58"].cumsum().plot()
train_sample["feature_59"].cumsum().plot()
plt.legend(loc="upper left")

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_84"].cumsum().plot()
train_sample["feature_85"].cumsum().plot()
train_sample["feature_86"].cumsum().plot()
train_sample["feature_87"].cumsum().plot()
train_sample["feature_88"].cumsum().plot()
plt.legend(loc="upper left")

Comparing the charts above to the resp cumulative sums, we can see that the purple line best corresponds to our resp value. Therefore we can make the following assumptions:
- Tag_0 = resp_4
- Tag_1 = resp
- Tag_2 = resp_3
- Tag_3 = resp_2
- Tag_4 = resp_1
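
To work with such tag assumptions it helps to look up features by tag. The sketch below assumes the metadata has a "feature" column plus boolean "tag_N" columns, as the features dataframe appears to. The toy rows and the helper `features_with_tags` are made up for illustration and do not reproduce the real tag assignments.

```python
import pandas as pd

# Toy metadata, NOT the real tag values from features.csv
toy_features = pd.DataFrame({
    "feature": ["feature_7", "feature_8", "feature_9", "feature_10"],
    "tag_0": [True, True, False, False],
    "tag_1": [False, False, True, True],
    "tag_9": [False, True, False, True],
})

def features_with_tags(meta, *tags):
    """Return the names of features that carry all of the given tags."""
    mask = meta[list(tags)].all(axis=1)
    return meta.loc[mask, "feature"].tolist()

tag0 = features_with_tags(toy_features, "tag_0")
tag1_and_9 = features_with_tags(toy_features, "tag_1", "tag_9")
print(tag0, tag1_and_9)
```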

Now I will go through all the features and select the most relevant.

All odd features in group 1 have their twins in even features, this is also indicated by tag_9. I will now plot some pairs, to get better insight into data.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_1"].cumsum().plot()
train_sample["feature_2"].cumsum().plot()

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_3"].cumsum().plot()
train_sample["feature_4"].cumsum().plot()

In [None]:
fig, ax = plt.subplots(1,1,figsize=(15,5))
train_sample["feature_7"].cumsum().plot()
train_sample["feature_8"].cumsum().plot()

It looks like the odd and even features act as some kind of boundaries, where odd features are the upper boundary and even features the lower boundary.

In [None]:
#Select between features 3 - 6
df_info.loc[["feature_3", "feature_4", "feature_5", "feature_6"]]

In [None]:
fig, ax = plt.subplots(1,4,figsize=(20,5))

sns.scatterplot(x="feature_3", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_4", y="resp", hue="action", data=train_sample, ax=ax[1])
sns.scatterplot(x="feature_5", y="resp", hue="action", data=train_sample, ax=ax[2])
sns.scatterplot(x="feature_6", y="resp", hue="action", data=train_sample, ax=ax[3])

The feature 3 / 4 pair is a much better choice since it is more compact and has fewer outliers.

In [None]:
#Select between features 17 - 36
df_info.loc[["feature_25", "feature_26","feature_35", "feature_36"]]

In [None]:
fig, ax = plt.subplots(1,4,figsize=(20,5))

sns.scatterplot(x="feature_25", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_26", y="resp", hue="action", data=train_sample, ax=ax[1])
sns.scatterplot(x="feature_35", y="resp", hue="action", data=train_sample, ax=ax[2])
sns.scatterplot(x="feature_36", y="resp", hue="action", data=train_sample, ax=ax[3])

I will take feature 35 since it has the lowest kurtosis, however it has relatively high skewness, therefore it will have to be transformed later.
From the first group we have selected the features [1, 3, 15, 35].

Now to the second group, which is much more complex.

In [None]:
selected_group_1 = [1, 3, 15, 35]

In [None]:
#Select between features 41 - 43, 53
df_info.loc[["feature_41", "feature_42","feature_43", "feature_53"]]

In [None]:
fig, ax = plt.subplots(1,4,figsize=(20,5))

sns.scatterplot(x="feature_41", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_42", y="resp", hue="action", data=train_sample, ax=ax[1])
sns.scatterplot(x="feature_43", y="resp", hue="action", data=train_sample, ax=ax[2])
sns.scatterplot(x="feature_53", y="resp", hue="action", data=train_sample, ax=ax[3])

In [None]:
#Select between features 44 - 45
df_info.loc[["feature_44", "feature_45"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_44", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_45", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 46 - 52
df_info.loc[["feature_46", "feature_47", "feature_48", "feature_49", "feature_50", "feature_51", "feature_52"]]

In [None]:
fig, ax = plt.subplots(2,4,figsize=(20,10))
sns.scatterplot(x="feature_46", y="resp", hue="action", data=train_sample, ax=ax[0, 0])
sns.scatterplot(x="feature_47", y="resp", hue="action", data=train_sample, ax=ax[0, 1])
sns.scatterplot(x="feature_48", y="resp", hue="action", data=train_sample, ax=ax[0, 2])
sns.scatterplot(x="feature_49", y="resp", hue="action", data=train_sample, ax=ax[0, 3])
sns.scatterplot(x="feature_50", y="resp", hue="action", data=train_sample, ax=ax[1, 0])
sns.scatterplot(x="feature_51", y="resp", hue="action", data=train_sample, ax=ax[1, 1])
sns.scatterplot(x="feature_52", y="resp", hue="action", data=train_sample, ax=ax[1, 2])

In [None]:
#Select between features 60 - 61
df_info.loc[["feature_60", "feature_61"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_60", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_61", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 62 - 63
df_info.loc[["feature_62", "feature_63"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_62", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_63", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 65 - 66
df_info.loc[["feature_65", "feature_66"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_65", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_66", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 67 - 68
df_info.loc[["feature_67", "feature_68"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_67", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_68", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 69 - 71
df_info.loc[["feature_69", "feature_70", "feature_71"]]

In [None]:
fig, ax = plt.subplots(1,3, figsize=(15,5))

sns.scatterplot(x="feature_69", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_70", y="resp", hue="action", data=train_sample, ax=ax[1])
sns.scatterplot(x="feature_71", y="resp", hue="action", data=train_sample, ax=ax[2])

Some of the chart pairs show almost the same data, which is also indicated by the 100% correlation between those features. I will use the first item from those pairs.
Group 2 selected features: [43, 44, 52, 54, 59, 60, 62, 64, 65, 67, 70]

Next we have group 3.

In [None]:
selected_group_2 = [43, 44, 52, 54, 59, 60, 62, 64, 65, 67, 70]

In [None]:
#Select between features 77 - 83
df_info.loc[["feature_77", "feature_83"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_77", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_83", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 89, 95, 101, 107, 113, 119
df_info.loc[["feature_89", "feature_95", "feature_101", "feature_107", "feature_113", "feature_119"]]

In [None]:
fig, ax = plt.subplots(2,3, figsize=(15,10))

sns.scatterplot(x="feature_89", y="resp", hue="action", data=train_sample, ax=ax[0, 0])
sns.scatterplot(x="feature_101", y="resp", hue="action", data=train_sample, ax=ax[0, 1])
sns.scatterplot(x="feature_113", y="resp", hue="action", data=train_sample, ax=ax[0, 2])

sns.scatterplot(x="feature_95", y="resp", hue="action", data=train_sample, ax=ax[1, 0])
sns.scatterplot(x="feature_107", y="resp", hue="action", data=train_sample, ax=ax[1, 1])
sns.scatterplot(x="feature_119", y="resp", hue="action", data=train_sample, ax=ax[1, 2])

In [None]:
#Select between features 76, 82
df_info.loc[["feature_76", "feature_82"]]

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.scatterplot(x="feature_76", y="resp", hue="action", data=train_sample, ax=ax[0])
sns.scatterplot(x="feature_82", y="resp", hue="action", data=train_sample, ax=ax[1])

In [None]:
#Select between features 88, 100, 112, 94, 106, 118
df_info.loc[["feature_88", "feature_100", "feature_112", "feature_94", "feature_106", "feature_118"]]

In [None]:
fig, ax = plt.subplots(2,3, figsize=(15,10))

sns.scatterplot(x="feature_88", y="resp", hue="action", data=train_sample, ax=ax[0, 0])
sns.scatterplot(x="feature_100", y="resp", hue="action", data=train_sample, ax=ax[0, 1])
sns.scatterplot(x="feature_112", y="resp", hue="action", data=train_sample, ax=ax[0, 2])

sns.scatterplot(x="feature_94", y="resp", hue="action", data=train_sample, ax=ax[1, 0])
sns.scatterplot(x="feature_106", y="resp", hue="action", data=train_sample, ax=ax[1, 1])
sns.scatterplot(x="feature_118", y="resp", hue="action", data=train_sample, ax=ax[1, 2])

Again we can see some very similar values. Here is my selection: [76, 83, 88, 107]. In group 4 we need to select only the correct resp features: [128, 129].

In [None]:
selected_group_3 = [76, 83, 88, 107]
selected_group_4 = [128, 129]

I will group those features in one list and create another list with actual column names.

In [None]:
selected_features = selected_group_1 + selected_group_2 + selected_group_3 + selected_group_4
print(selected_features)

In [None]:
selected_features_names = ["feature_" + str(index) for index in selected_features]
df_info.loc[selected_features_names]

In [None]:
print(selected_features_names)

<a id="section-2.3"></a>
## 2.3 Date features ##

In [None]:
datetime_info = df_info.loc[datetime_features]
datetime_info

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,5))
# kdeplot draws the density curve only, like distplot(hist=False), and is not deprecated
sns.kdeplot(train_sample[train_sample["action"] == 0]["date"], label="action_0")
sns.kdeplot(train_sample[train_sample["action"] == 1]["date"], label="action_1")
plt.legend(loc="lower center")

In [None]:
# Median weight per date; selecting the column first keeps the result single-level
group = train_sample.groupby("date")["weight"].median().reset_index()

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,5))
sns.lineplot(x="date", y="weight", data=group)

We can see 2 peaks in the date distribution. The first peak has more trading opportunities where we should take action; the second peak is the other way around.
Weight values seem to be rising over time, therefore we can gain more in the second peak, but it is riskier.

<a id="section-2.4"></a>
## 2.4 Features transformation ##

After the feature analysis we should take care of skewness and kurtosis for each feature. Skewness should be between -0.5 and 0.5 for a feature to be approximately normally distributed. To take care of kurtosis I will remove outliers with a z-score above 6 or below -6 (the cut-off used in the code below). I am trying to get the kurtosis between -2 and +2.
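
The effect of these two steps can be sketched on a synthetic right-skewed variable. The log-normal sample and its parameters are assumptions for illustration. The z-score limit of 6 and the use of arcsinh are taken from the transform function further down.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (hypothetical stand-in for a real column)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Step 1: drop rows whose absolute z-score reaches the limit
z_limit = 6
z = (raw - raw.mean()) / raw.std()
trimmed = raw[z.abs() < z_limit]

# Step 2: arcsinh compresses large values, works for zero and negative inputs
squashed = np.arcsinh(trimmed)

print("skew before:", raw.skew(), "after:", squashed.skew())
print("kurt before:", raw.kurt(), "after:", squashed.kurt())
```

Unlike a log transform, arcsinh is defined for zero and negative values, which matters for features centred around 0.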

Finally I will downcast the features for faster calculations.
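
Assuming that "downgrading" here means converting float64 columns to float32, a minimal sketch of the memory effect looks like this. The toy frame and its column names are made up, and the transform function below does not include this step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame of float64 features
toy = pd.DataFrame(
    rng.normal(size=(1000, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)

before = toy.memory_usage(deep=True).sum()

# float32 halves the memory per value at the cost of precision
downcast = toy.astype(np.float32)
after = downcast.memory_usage(deep=True).sum()

# Largest absolute rounding error introduced by the conversion
max_error = (toy - downcast).abs().to_numpy().max()
print(before, after, max_error)
```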

In [None]:
selected_features_names = ['feature_0', 'feature_1', 'feature_5', 'feature_15', 'feature_35', 'feature_41', 'feature_43', 'feature_44', 'feature_45', 'feature_52', 'feature_59', 'feature_60', 'feature_62', 'feature_64', 'feature_65', 'feature_67', 'feature_70', 'feature_76', 'feature_83', 'feature_107', 'feature_128']

In [None]:
#Select anonymized features above the kurtosis / skewness thresholds
features_to_remove_outliers = []
features_to_scale = []

#anonymized_features
for feat in selected_features_names:
    if df_info.loc[feat, "kurt"] > 2:
        features_to_remove_outliers.append(feat)

    if abs(df_info.loc[feat, "skew"]) > 0.5:
        features_to_scale.append(feat)

In [None]:
from scipy import stats

def transform(df_temp):
    
    df = df_temp.copy()
        
    # Remove rows with weight = 0
    with timer():
        df = df[df["weight"] != 0].copy()
        
    # Calculate z-scores and remove outliers
    with timer():
        zscore_cols = []
        for feat in features_to_remove_outliers:
            feat_zscore = feat + '_zscore'
            df[feat_zscore] = (df[feat] - df[feat].mean())/df[feat].std()
            zscore_cols.append(feat_zscore)
        
        # Select the z-score columns by name; skip the filter if there are none
        if zscore_cols:
            df["max_feat_zscore"] = df[zscore_cols].abs().max(axis=1)
            df = df[df["max_feat_zscore"] < 6].copy()

    # Use arcsinh on features with high skewness
    with timer():
        for feat in features_to_scale:
            df[feat] = np.arcsinh(df[feat])

    # Fill missing values with the previous valid value (forward fill)
    with timer():
        df[selected_features_names] = df[selected_features_names].ffill()

    # Standardization: zero mean and unit standard deviation per feature
    with timer():
        df[selected_features_names]=(df[selected_features_names] - df[selected_features_names].mean()) / df[selected_features_names].std()
    
    return df

In [None]:
%%time
train_transformed = transform(train)[selected_features_names]