In [None]:
import pandas as pd 
import numpy as np
# Show every row and column when displaying DataFrames
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

Dataset from: [https://www.kaggle.com/c/jane-street-market-prediction/overview](https://www.kaggle.com/c/jane-street-market-prediction/overview)

From the problem description we get:
    
*In general, if one is able to generate a highly predictive model which selects the right trades to execute, 
they would also be playing an important role in sending the market signals that push prices closer to “fair” values. 
That is, a better model will mean the market will be more efficient going forward. However, developing good 
models will be challenging for many reasons, including a very low signal-to-noise ratio, <b>potential redundancy, 
strong feature correlation</b>, and difficulty of coming up with a proper mathematical formulation.*
    
Since the description calls out redundancy and strong feature correlation, I thought it would be interesting to find and remove some of the highly correlated features in the dataset.

In [None]:
# Read data
df = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv')


### Features Correlation

In [None]:
# check overall correlations
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

### Find features with correlation higher than 0.9

In [None]:
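# Use absolute values: a strong negative correlation is as redundant as a strong positive one
corr_values = corr.abs()
# Keep only the strict upper triangle (k=1 excludes the diagonal) so each pair appears once.
# The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
upper_triangle_matrix = corr_values.where(np.triu(np.ones(corr_values.shape, dtype=bool), k=1))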

# Flag every column whose absolute correlation with any earlier column is greater than 0.9
cols_to_drop = [column for column in upper_triangle_matrix.columns if any(upper_triangle_matrix[column] > 0.9)]

cols_to_drop
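To make the selection rule concrete, here is a minimal sketch on a toy DataFrame. The data and the names `toy`, `toy_upper` and `toy_drop` are made up for illustration and are not part of the competition dataset. Because only the upper triangle is inspected, the later column of each highly correlated pair is flagged and the earlier one is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a_copy" is a near-duplicate of "a", "b" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# Same steps as above: absolute correlations, then the strict upper triangle
toy_corr = toy.corr().abs()
toy_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
toy_upper = toy_corr.where(toy_mask)

# A column is flagged when it correlates above 0.9 with any column to its left
toy_drop = [c for c in toy_upper.columns if (toy_upper[c] > 0.9).any()]
print(toy_drop)  # ['a_copy']
```

Note that the result depends on column order: had "a_copy" come before "a", then "a" would have been flagged instead.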

In [None]:
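# We don't want to drop 'ts_id' or 'resp'. Filter them out by name rather than
# by position, since their position in the list depends on the column order.
cols_to_drop = [col for col in cols_to_drop if col not in ('ts_id', 'resp')]
cols_to_drop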

### Drop columns

In [None]:
# Drop features 
df = df.drop(cols_to_drop, axis=1)

# Show new dataframe
df.head()

### Check new feature correlation matrix

In [None]:
# check overall correlations
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
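Besides inspecting the heatmap by eye, the result can be checked numerically. The sketch below defines a hypothetical helper, `max_offdiag_corr` (not part of pandas), that returns the largest absolute correlation between two distinct columns, and tries it on a small made-up frame.

```python
import numpy as np
import pandas as pd

def max_offdiag_corr(frame: pd.DataFrame) -> float:
    """Largest absolute pairwise correlation between two distinct columns."""
    c = frame.corr().abs()
    # Strict upper triangle: each pair once, diagonal excluded
    upper = c.where(np.triu(np.ones(c.shape, dtype=bool), k=1))
    # max() skips the NaN entries left by the mask
    return float(upper.max().max())

# Hypothetical check data: "y" is an exact multiple of "x"
check = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, -1, 2, -2, 0],
})

before = max_offdiag_corr(check)                     # x and y are perfectly correlated
after = max_offdiag_corr(check.drop(columns=["y"]))  # only the x/z pair remains
print(round(before, 3), round(after, 3))  # 1.0 0.3
```

On the reduced `df`, `max_offdiag_corr(df)` should stay at or below 0.9 unless the pair involves one of the columns that were deliberately kept ('ts_id' and 'resp').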