# Jane Street Market Prediction EDA
## Action Threshold and Feature Communities

Because the training set does not include an `action` label, we have to decide how that label is derived from the data used to train our models. In this notebook, we set an arbitrary initial `action_threshold` for `weight * resp`; rows above the threshold form the positive `action` class. Then we'll assess raw feature similarity by building a graph (nodes and edges) from the features and their tags. Finally, we'll group the features according to the community structure of the graph and look at pairplots of each group, colored by our `action` label.

TODO: determine methods that will optimize the `action_threshold` value for creating the positive class.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv')

train = train.astype({col: np.float32 for col in train.select_dtypes('float64').columns})
train = train.astype({col: np.int32 for col in train.select_dtypes('int64').columns})

In [None]:
class TrainData():
    
    def __init__(self, df, action_threshold=0.0):
        self.train = df.copy()
        self.action_threshold = action_threshold
        
    
    def add_weight_resp(self):
        """Calculates weight * resp for new column weight_resp."""
        self.train['weight_resp'] = self.train['weight'] * self.train['resp']
        self.train['weight_resp'] = self.train['weight_resp'].astype(np.float32)
        
    def add_action(self):
        """Adds action column: 1 if weight_resp > action_threshold, else 0."""
        self.train['action'] = np.where(
            self.train['weight_resp'] > self.action_threshold, 1, 0)
        self.train['action'] = self.train['action'].astype(np.int32)

## Training Data
For initial analysis, we are going to arbitrarily set the `action_threshold` parameter in our `TrainData` object. 

The `add_action` method will add the `action` column with a value of 1 if `weight * resp > action_threshold`, else 0.

In [None]:
act_thresh = 0.1

td = TrainData(df=train, action_threshold=act_thresh)
td.add_weight_resp()
td.add_action()
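
As a rough way to see how sensitive the positive class is to `action_threshold`, the sketch below sweeps a few candidate thresholds and reports the share of rows that would be labeled 1. The helper `positive_rate_by_threshold`, the candidate values, and the synthetic `weight`/`resp` arrays are assumptions for illustration only; they are not part of the competition data or of `TrainData`.

```python
import numpy as np
import pandas as pd


def positive_rate_by_threshold(weight, resp, thresholds):
    """Share of rows with weight * resp > t, for each candidate t (hypothetical helper)."""
    weight_resp = np.asarray(weight, dtype=float) * np.asarray(resp, dtype=float)
    return pd.Series(
        {t: float((weight_resp > t).mean()) for t in thresholds},
        name='positive_rate',
    )


# Synthetic stand-ins for the real columns (assumption: shapes only, not real values).
rng = np.random.default_rng(0)
toy_weight = rng.exponential(scale=3.0, size=10_000)
toy_resp = rng.normal(loc=0.0, scale=0.02, size=10_000)

rates = positive_rate_by_threshold(toy_weight, toy_resp, [0.0, 0.05, 0.1, 0.2])
print(rates)
```

With the real data, `td.train['weight']` and `td.train['resp']` could be passed in place of the toy arrays.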

## Plots
Let's take a look at a few distributions with the arbitrarily set `action_threshold`...

In [None]:
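# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement.
sns.histplot(td.train['weight_resp'], bins=100, kde=True, stat='density')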
plt.title('Distribution: weight * resp')

In [None]:
plt.hist(td.train['action'])
plt.title(f'Distribution: action\nwith threshold = {act_thresh}')

The distribution of our dependent variable above shows a significant class imbalance. Oversampling may be a good idea here.
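
One simple form of oversampling is to resample the minority class with replacement until the classes are the same size. The sketch below shows that idea on a toy frame; the helper `oversample_minority` and the toy data are assumptions for illustration, and other approaches (class weights, synthetic sampling) may suit this problem better.

```python
import pandas as pd


def oversample_minority(df, label_col='action', random_state=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    counts = df[label_col].value_counts()
    target_n = counts.max()
    parts = []
    for cls, n in counts.items():
        cls_rows = df[df[label_col] == cls]
        if n < target_n:
            # Draw the missing rows with replacement from the same class.
            extra = cls_rows.sample(n=target_n - n, replace=True,
                                    random_state=random_state)
            cls_rows = pd.concat([cls_rows, extra])
        parts.append(cls_rows)
    # Shuffle so the duplicated rows are not grouped together.
    return (pd.concat(parts)
            .sample(frac=1.0, random_state=random_state)
            .reset_index(drop=True))


# Toy frame with an 8:2 imbalance (assumption: not the real training data).
toy = pd.DataFrame({'feature_1': range(10), 'action': [1, 1] + [0] * 8})
balanced = oversample_minority(toy)
print(balanced['action'].value_counts())
```

If this is used for modeling, it should be applied only to the training split, so that duplicated rows do not leak into validation data.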

The following scatterplots show `weight` vs. `resp`. The first one is colored by `weight * resp`. The second one is colored by `action` and was used to come up with the `action_threshold` value. Note the outliers.

In [None]:
fig = plt.figure(figsize=(24, 16))
sns.scatterplot(x='weight', y='resp', hue='weight_resp', data=td.train, palette='icefire')
plt.title('weight vs. resp', fontsize=20)

In [None]:
fig = plt.figure(figsize=(24, 16))
sns.scatterplot(x='weight', y='resp', hue='action', data=td.train, palette='icefire')
plt.title('weight vs. resp', fontsize=20)

## Comparing `feature_0` to `action`
Here, I'm using Seaborn's pairplot to produce bivariate scatterplots for a handful of features. In particular, I was curious to see how `feature_0` - a feature with only two values - compared to the `action` variable derived from our `action_threshold`.

In [None]:
sns.pairplot(td.train.iloc[:, 7:12], hue='feature_0')

In [None]:
sns.pairplot(pd.concat([td.train.iloc[:, 8:12], td.train.loc[:, 'action']], axis=1), hue='action')

Interestingly, the positive `action` class appears to be centered within each scatter plot, and rows where `feature_0` equals 1 occupy a similar region, with a bit of skew. That makes me wonder whether the positive `action` rows are a subset of the `feature_0 = 1` rows...

In [None]:
for val in td.train['feature_0'].unique():
    subdf = td.train.loc[td.train['feature_0'] == val]
    print(f"***\nfeature_0 = {val}\n{subdf['action'].value_counts()}\n")

Nope.
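
The same subset check can be read off a single cross-tabulation: if the positive class were contained in the `feature_0 = 1` rows, every other `feature_0` row of the table would have a zero in the `action = 1` column. The sketch below uses a toy frame; the values -1 and 1 for `feature_0` are an assumption made for the example.

```python
import pandas as pd

# Toy data (assumption): feature_0 takes two values, here -1 and 1.
toy = pd.DataFrame({
    'feature_0': [1, 1, -1, -1, 1, -1],
    'action':    [1, 0,  1,  0, 0,  0],
})

# Rows: feature_0 values; columns: action values; cells: row counts.
table = pd.crosstab(toy['feature_0'], toy['action'])
print(table)

# Positive-action rows that fall outside feature_0 == 1.
positives_outside = int(table.drop(index=1)[1].sum())
is_subset = positives_outside == 0
print(is_subset)
```

On the real data, `pd.crosstab(td.train['feature_0'], td.train['action'])` would give the corresponding table.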

## Tags
Each feature in our training set has a set of boolean tags associated with it, specified in `features.csv`. To find similarities among the features, let's create a bipartite graph whose nodes are the features and tags, with an edge between a feature and a tag wherever the tag value is `True`. Then, let's apply the `best_partition` function (Louvain community detection from the python-louvain package) to bring out the community structure.

In [None]:
import networkx as nx
import community  # provided by the python-louvain package
import matplotlib.cm as cm
from matplotlib.colors import Normalize

In [None]:
feats = pd.read_csv(
    '/kaggle/input/jane-street-market-prediction/features.csv')
feats = feats.astype({col: np.int32 for col in feats.select_dtypes('bool').columns})

In [None]:
class FeatureGraph():
    
    def __init__(self, df):
        self.G = nx.Graph()
        self.data = df.copy()
        self.partition = None
        
        
    def add_feature_nodes(self):
        """Adds nodes for features."""
        for feat in self.data['feature'].unique():
            self.G.add_nodes_from([feat], color='green')
            
            
    def add_tag_nodes(self):
        """Adds nodes for tags."""
        for col in self.data.columns:
            if 'tag' in col:
                self.G.add_nodes_from([col], color='red')
                
                
    def add_edges(self):
        """Adds edges between features and tags if value == 1."""
        for row in range(self.data.shape[0]):
            source_node = self.data.loc[row, 'feature']

            for col in range(1, self.data.shape[1]):
                target_node = self.data.columns[col]
                if self.data.iloc[row, col] == 1:
                    self.G.add_edge(source_node, target_node)

    
    def create_graph(self):
        """Creates graph object."""
        self.add_feature_nodes()
        self.add_tag_nodes()
        self.add_edges()
        
        
    def create_partition(self):
        """Partitions the graph and adds a partition attribute to each node."""
        self.partition = community.best_partition(self.G)
        nx.set_node_attributes(self.G, self.partition, 'partition')
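
The row-by-column loop in `add_edges` can also be expressed by reshaping the tag table into long form and keeping only the pairs whose flag is 1. The sketch below is an alternative construction on a toy table, not the class above; `toy_feats` and its tag values are assumptions for illustration.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for features.csv after the bool -> int cast (assumption).
toy_feats = pd.DataFrame({
    'feature': ['feature_0', 'feature_1', 'feature_2'],
    'tag_0': [0, 1, 1],
    'tag_1': [0, 0, 1],
})

# One row per (feature, tag) pair, then keep the pairs where the tag is set.
long_form = toy_feats.melt(id_vars='feature', var_name='tag', value_name='flag')
edges = long_form.loc[long_form['flag'] == 1, ['feature', 'tag']]

G = nx.Graph()
G.add_nodes_from(toy_feats['feature'], color='green')
G.add_nodes_from(toy_feats.columns.drop('feature'), color='red')
G.add_edges_from(edges.itertuples(index=False, name=None))

print(sorted(G.edges()))
```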

In [None]:
fg = FeatureGraph(df=feats)
fg.create_graph()

Below, we have the features colored in green and the tags colored in red. We definitely see some community structure within the network.

In [None]:
fig = plt.figure(figsize=(24, 16))
pos = nx.spring_layout(fg.G)
col = list(nx.get_node_attributes(fg.G, 'color').values())  # list, so matplotlib receives a plain sequence of colors
nx.draw(fg.G, pos=pos, font_size=8, with_labels=True, node_size=100, node_color=col)

Note that node `feature_0` has no edges. 
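
Nodes without edges can also be listed programmatically rather than spotted in the drawing. The sketch below uses `nx.isolates` on a small toy graph; the node names and edges are assumptions chosen to mirror the situation above.

```python
import networkx as nx

# Toy bipartite graph (assumption): feature_0 is deliberately left unconnected.
G = nx.Graph()
G.add_nodes_from(['feature_0', 'feature_1', 'feature_2'], color='green')
G.add_nodes_from(['tag_0', 'tag_1'], color='red')
G.add_edges_from([
    ('feature_1', 'tag_0'),
    ('feature_2', 'tag_0'),
    ('feature_2', 'tag_1'),
])

# nx.isolates yields every node with degree zero.
isolated = list(nx.isolates(G))
print(isolated)  # ['feature_0']
```

On the notebook's graph, the equivalent call would be `list(nx.isolates(fg.G))`.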

Now, let's color the nodes according to the communities determined by the `best_partition` method. We will also create `partition_dict` to look at scatter plots for each partition.

In [None]:
fg.create_partition()

cmap = cm.viridis
norm = Normalize(vmin=min(fg.partition.values()), 
                 vmax=max(fg.partition.values()))

# Map each community id to the list of feature nodes it contains.
partition_dict = {k: [] for k in set(fg.partition.values())}

# One color per node, in the same order as fg.G.nodes().
partition_colors = []

for node, attrs in fg.G.nodes(data=True):
    k = attrs['partition']
    partition_colors.append(cmap(norm(k)))
    if 'feature' in node:
        partition_dict[k].append(node)

In [None]:
fig = plt.figure(figsize=(24, 16))
pos = nx.spring_layout(fg.G)
nx.draw(fg.G, pos=pos, font_size=8, with_labels=True, node_size=100, node_color=partition_colors)

In [None]:
partition_dict
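
To get a feel for what a community partition looks like and how its quality can be scored, the sketch below runs Louvain community detection on a toy graph and computes the modularity of the result. Note that it uses networkx's own `louvain_communities` (assumed available, networkx 2.8 or later) rather than the python-louvain `best_partition` call used in this notebook, and the barbell graph is an assumption for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph (assumption): two 4-node cliques joined by a single edge.
G = nx.barbell_graph(4, 0)

# A list of node sets, one per detected community.
communities = louvain_communities(G, seed=0)

# Modularity compares within-community edge density to a random baseline;
# higher values indicate stronger community structure.
score = modularity(G, communities)

# Same node -> community id mapping shape that best_partition returns.
partition = {node: i for i, comm in enumerate(communities) for node in comm}

print(communities)
print(round(score, 3))
```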

## Pairplots by Partition
Now, let's take a look at the pairplots for a few of the smaller partitions, colored by `action`.

In [None]:
# Partition keys ordered from fewest to most features. Sorting the keys
# directly avoids assuming that positions in a list of lengths match the keys.
keys_by_size = sorted(partition_dict, key=lambda k: len(partition_dict[k]))

In [None]:
# Second-smallest partition.
plot_list = partition_dict[keys_by_size[1]] + ['action']
sns.pairplot(td.train.loc[:, plot_list], hue='action')

In [None]:
# Third-smallest partition.
plot_list = partition_dict[keys_by_size[2]] + ['action']
sns.pairplot(td.train.loc[:, plot_list], hue='action')

In [None]:
# plot_list = partition_dict[keys_by_size[3]] + ['action']
# sns.pairplot(td.train.loc[:, plot_list], hue='action')

In [None]:
# plot_list = partition_dict[keys_by_size[4]] + ['action']
# sns.pairplot(td.train.loc[:, plot_list], hue='action')