# Jane Street - EDA focused on “features”

The given dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data.

Because of the large number of variables, we might consider selecting a subset of them or compressing them with PCA and similar techniques. Before doing that, I was curious about what kinds of relationships exist between the variables, so I ran this analysis.

## Contents

1. [Loading and overviewing dataset](#1)
1. [Analysis with similarity matrix](#2)
1. [Analysis with clustering method](#3)
1. [Compressing](#4)

<a id="1"></a> <br>
# <div class="alert alert-block alert-success">Loading and overviewing dataset</div>

### Load libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
from sklearn.decomposition import PCA
import umap

### Load dataset

In [None]:
!ls ../input/jane-street-market-prediction

In [None]:
train = pd.read_csv("../input/jane-street-market-prediction/train.csv")
feature = pd.read_csv("../input/jane-street-market-prediction/features.csv")

train.csv contains historical data and returns.

In [None]:
train.head()

features.csv includes metadata pertaining to the anonymized features.

In [None]:
feature.head()

### Preprocess

To prepare for the analysis, I'll preprocess the metadata dataframe.

In [None]:
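# Keep the feature names so the same columns can be selected from train later.
feature_col = feature["feature"]
tag_col = [col for col in feature.columns if col != "feature"]
# Index the tag table by feature name and keep only the tag columns.
feature = feature.rename(index=feature["feature"])[tag_col]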

<a id="2"></a> <br>
# <div class="alert alert-block alert-info">Analysis with similarity matrix</div>

I create the following matrices and visualize them as heatmaps.

- cosine similarity matrix for features.csv

- Euclidean distance matrix for features.csv

- correlation matrix for train.csv

## Cosine similarity matrix for features.csv

First, I'll check the similarity between features using features.csv.

### Calculate cosine similarity matrix

I treat each row of the dataframe as a vector representation of a feature over tag_{0...28}, and compute the cosine similarity for every pair of features.
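
To make the idea concrete, here is a minimal sketch of cosine similarity on a tiny, made-up tag table. The three toy features and the helper name `cosine_similarity_matrix` are assumptions for illustration only; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

# Hypothetical toy metadata: 3 features described by 4 boolean tags.
tags = np.array([
    [1, 1, 0, 0],   # feature_a
    [1, 1, 0, 0],   # feature_b: same tags as feature_a
    [0, 0, 1, 0],   # feature_c: shares no tag with a or b
], dtype=float)

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # A row with no tags has zero norm; dividing by 1 instead keeps it a
    # zero vector, so its similarity to every row is 0 rather than NaN.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = x / safe_norms
    return unit @ unit.T

sim = cosine_similarity_matrix(tags)
print(sim)
```

Features with identical tags get a similarity of 1, and features that share no tag get 0, which is what the heatmap blocks encode.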

In [None]:
cos_matrix = cosine_similarity(feature, feature)
cos_matrix

### Visualize heatmap

Next, I'll visualize the matrix.

In [None]:
plt.figure(figsize=(15, 15))
g = sns.heatmap(data=cos_matrix)
g.set_title("Cosine similarity matrix of features' metadata", fontsize=15)

The heatmap shows that some groups of features are highly similar to each other while others are not. For example, feature_{0...40} are more similar to one another than they are to feature_{41...54}.

## Euclidean distance matrix for features.csv

In [None]:
distance_matrix = pairwise_distances(feature, feature, metric='euclidean')
distance_matrix

In [None]:
plt.figure(figsize=(15, 15))
g = sns.heatmap(data=distance_matrix)
g.set_title("Euclidean distance matrix of features' metadata", fontsize=15)

## Correlation matrix for train.csv

I'll also check the similarity between features using train.csv.

### Calculate correlation matrix

From the data in train.csv, we can calculate the correlation matrix of the features.

In [None]:
train_feature = train[feature_col]
train_feature_corr = train_feature.corr()
train_feature_corr

### Visualize heatmap

As in the previous example, we can visualize the matrix as a heatmap.

In [None]:
plt.figure(figsize=(15, 15))
g = sns.heatmap(data=train_feature_corr)
g.set_title("Correlation matrix of features", fontsize=15)

The brighter the color, the higher the correlation. The plot is dense, but some highly correlated blocks are visible. For example, feature_{84...120} are more strongly correlated with one another than they are with feature_{18...26}.
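
Reading blocks off a dense heatmap is hard, so one option is to compare the mean correlation inside a block with the mean correlation between two blocks. The sketch below uses a made-up 4x4 matrix; the helper names `within_block_mean` and `block_mean` are hypothetical and not part of the original notebook.

```python
import numpy as np

def block_mean(matrix, rows, cols):
    """Mean of the sub-matrix selected by two lists of positions."""
    return matrix[np.ix_(rows, cols)].mean()

def within_block_mean(matrix, idx):
    """Mean of the off-diagonal entries inside one block."""
    sub = matrix[np.ix_(idx, idx)]
    mask = ~np.eye(len(idx), dtype=bool)  # drop the trivial diagonal of ones
    return sub[mask].mean()

# Hypothetical correlation matrix with two obvious blocks: {0, 1} and {2, 3}.
corr = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])
block_a, block_b = [0, 1], [2, 3]

within_a = within_block_mean(corr, block_a)
between = block_mean(corr, block_a, block_b)
print(within_a, between)
```

On the real data, something like `within_block_mean(train_feature_corr.to_numpy(), list(range(84, 121)))` could be compared with `block_mean(train_feature_corr.to_numpy(), list(range(84, 121)), list(range(18, 27)))`, assuming the columns are ordered feature_0 to feature_129.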

## Comparison of heatmaps 

Let's compare the three heatmaps side by side.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15,6), gridspec_kw=dict(wspace=0.1, hspace=0.6))
fig.suptitle("Comparison of the heatmaps", fontsize=15)

g_1 = sns.heatmap(data=cos_matrix, ax=axes[0])
g_1.set_title("Cosine similarity matrix of features' metadata")

g_2 = sns.heatmap(data=distance_matrix, ax=axes[1])
g_2.set_title("Euclidean distance matrix of features' metadata")

g_3 = sns.heatmap(data=train_feature_corr, ax=axes[2])
g_3.set_title("Correlation matrix of features")

Interestingly, the heatmaps show similar patterns: blocks of high similarity sit along the diagonal both in the metadata-based matrices and in the correlation matrix. This suggests that features with similar metadata also tend to be highly correlated with each other.
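
The visual impression can be backed by a number. A simple way, sketched here under the assumption that both matrices are square and share the same feature order, is to correlate their off-diagonal entries. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def upper_triangle(matrix):
    """Return the entries strictly above the diagonal as a 1-D array."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def matrix_agreement(a, b):
    """Pearson correlation between the off-diagonal entries of two square matrices."""
    return np.corrcoef(upper_triangle(a), upper_triangle(b))[0, 1]

# Toy stand-ins for the metadata similarity and the data correlation.
meta_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
data_corr = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])

score = matrix_agreement(meta_sim, data_corr)
print(score)
```

A score close to 1 means that pairs that look similar in the metadata are also strongly correlated in the data. In the notebook this could be called as `matrix_agreement(cos_matrix, train_feature_corr.to_numpy())`.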

<a id="3"></a> <br>
# <div class="alert alert-block alert-info">Analysis with clustering method</div>

I also check whether the features form groups, using a clustering method. Treating each feature's tag vector as a point in space, I estimate cluster labels with k-means. I then project the features into two dimensions with UMAP and check whether features with the same label end up close together. The data for this step is features.csv.

I estimate a label for each feature with k-means. Note that I set n_clusters=3 because I had visualized the data with UMAP beforehand and saw that it splits into three clusters.
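
The choice of three clusters here comes from visual inspection. As a possible cross-check, which is not what the notebook does, the silhouette score can be compared across several values of k. The sketch below runs on synthetic points standing in for the tag matrix; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the tag matrix: three well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest score is the best-separated clustering by this metric.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Replacing `points` with the `feature` dataframe would apply the same check to the real metadata; whether it also favors three clusters there is not something this sketch establishes.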

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(feature)

I'll visualize the data with UMAP.

In [None]:
reducer = umap.UMAP()
embedding = reducer.fit_transform(feature)

In [None]:
fig = go.Figure(data=go.Scatter(x=embedding[:, 0],
                                y=embedding[:, 1],
                                mode='markers',
                                marker_color=kmeans.labels_))
fig.update_layout(title='features with kmeans labels')
fig.show()

The features split roughly into three groups, so there appear to be three groups of features with similar metadata.

In [None]:
feature["kmeans_label"] = kmeans.labels_
feature[["kmeans_label"]]

<a id="4"></a> <br>
# <div class="alert alert-block alert-info">Compressing</div>

I'll try PCA to see how many components are needed to represent the training set.

In [None]:
pca = PCA().fit(train[feature_col].dropna())

In [None]:
# Reference: https://www.kaggle.com/kushal1506/deciding-n-components-in-pca

fig, ax = plt.subplots()
y = np.cumsum(pca.explained_variance_ratio_)
xi = np.arange(1, len(y) + 1)  # 1-based component counts (1..130 here)

plt.ylim(0.0, 1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(y) + 1, step=10))  # one tick every 10 components
plt.ylabel('Cumulative explained variance (ratio)')
plt.title('The number of components needed to explain variance')

# Horizontal line marking the 95% explained-variance threshold.
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color='red', fontsize=16)

ax.grid(axis='x')
plt.show()

Roughly 30 principal components explain 95% of the variance.
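
Instead of reading the number off the plot, it can be computed from the cumulative explained variance. This is a minimal sketch; the helper name `components_for_variance` and the toy ratios are made up for illustration.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first position whose cumulative value is >= threshold;
    # adding 1 turns that 0-based position into a component count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the total slightly below the threshold.
    return min(count, len(cumulative))

# Toy explained-variance ratios for five components.
toy_ratio = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
n_needed = components_for_variance(toy_ratio, threshold=0.95)
print(n_needed)
```

With the fitted model from this notebook, the call would be `components_for_variance(pca.explained_variance_ratio_)`.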