# Temporal feature inspection (Features 60-69)

- Credit goes to @nanomathias for starting the discussion on these features and for writing the code that displays their relationships in his notebook: https://www.kaggle.com/nanomathias/feature-0-beyond-feature-0

Features 60-69 appear to represent intraday temporal features. This notebook gives you actionable insights into these features that you can use in your own feature engineering.

In [None]:
!pip install seaborn --upgrade --quiet

In [None]:
from typing import List

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:

# Load the first 50,000 rows of train.csv, kept in their original order
ordered_subset = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv', nrows=50000)

### Feature 64, and how we can use it with other features

Feature 64 appears to represent the time of day of a trade. This is a guess based on discussion from the community, information from the competition organizers, and a simple sanity check. The rows in our training data set are ordered in time: `ts_id` increases monotonically, with every new row representing the next timestep. Feature 64 shares this property within a day, only ever decreasing when a new day begins.

In [None]:
# ts_ids at which feature 64 is lower than on the previous row
decr_64_ts_id = ordered_subset.loc[ordered_subset['feature_64'].diff() < 0, 'ts_id']

# Compare the date just before, at, and just after each decrease.
# This assumes the default row index equals ts_id, which holds for rows read in order from the top of train.csv.
date_ts_subset = ordered_subset[['ts_id', 'date']]
for ts_id in decr_64_ts_id:
    dates = (date_ts_subset.loc[ts_id-1,'date'], date_ts_subset.loc[ts_id,'date'], date_ts_subset.loc[ts_id+1,'date'])
    print(f'Date BEFORE DECREASE in 64: {dates[0]}, date AT DECREASE in 64: {dates[1]}, date AFTER DECREASE in 64: {dates[2]}')
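
The loop above prints the dates around each decrease for manual inspection. The same sanity check can also be written as a single vectorized test. The sketch below runs on toy data and is only an illustration: `toy` and `decreases_only_at_date_changes` are hypothetical names, and it assumes rows are already in time order.

```python
import pandas as pd

# Toy stand-in for the training data (an assumption, not real rows):
# feature_64 rises within each day and resets when the date changes.
toy = pd.DataFrame({
    "date":       [0, 0, 0, 1, 1, 1],
    "feature_64": [-1.0, 0.2, 1.5, -0.9, 0.1, 1.4],
})

def decreases_only_at_date_changes(df: pd.DataFrame) -> bool:
    """Return True if every decrease in feature_64 falls on a row where the date changes."""
    decreases = df["feature_64"].diff() < 0    # first row compares NaN < 0, which is False
    date_changes = df["date"].diff() != 0      # first row compares NaN != 0, which is True
    # "decrease implies date change" is violated only where a decrease
    # happens on a row whose date equals the previous row's date.
    return bool((~decreases | date_changes).all())

print(decreases_only_at_date_changes(toy))
```

On the real data, the call would take `ordered_subset` in place of `toy`.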


In [None]:
# Feature 64 over time; magenta lines mark the rows where it decreases
plt.plot(ordered_subset['ts_id'], ordered_subset['feature_64'])
for ts_id in decr_64_ts_id:
    plt.axvline(ts_id, color='m')
plt.xlabel('ts_id')
plt.ylabel('feature_64')


#### Timeskip in feature 64

We also see a peculiar skip in time: a gap in the data where `feature_64` > 0.7 and `feature_64` < 1.3. We'll continue our visual investigation on day 0, and shift our data to account for this skip.
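
A skip like this can also be located programmatically rather than by eye. The sketch below is a hypothetical helper on toy data, not the method used in this notebook: it finds the largest jump in `feature_64` between consecutive rows of a single day, then shifts every later `ts_id` by a fixed offset so the skip appears as a visible gap in plots. The names `day0`, `find_largest_gap`, and `shift_ts_after`, and the offset of 500, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy single-day data (an assumption): feature_64 rises in even steps
# except for one jump from 0.7 to 1.3, mimicking the skip described above.
before = np.linspace(-2.0, 0.7, 28)
after = np.linspace(1.3, 3.0, 18)
day0 = pd.DataFrame({
    "ts_id": np.arange(before.size + after.size),
    "feature_64": np.concatenate([before, after]),
})

def find_largest_gap(df: pd.DataFrame, col: str = "feature_64"):
    """Return (last ts_id before the gap, value before the gap, value after the gap)."""
    steps = df[col].diff().to_numpy()
    pos = int(np.nanargmax(steps))  # position of the row just after the largest jump
    return (
        int(df["ts_id"].iloc[pos - 1]),
        float(df[col].iloc[pos - 1]),
        float(df[col].iloc[pos]),
    )

def shift_ts_after(df: pd.DataFrame, last_ts_id: int, offset: int) -> pd.DataFrame:
    """Return a copy in which every ts_id after last_ts_id is moved right by offset."""
    out = df.copy()
    out.loc[out["ts_id"] > last_ts_id, "ts_id"] += offset
    return out

last_ts, lower, upper = find_largest_gap(day0)
shifted = shift_ts_after(day0, last_ts, offset=500)
print(last_ts, lower, upper)
```

This only makes sense within one day, because `feature_64` also jumps (downward) at day boundaries.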

### Features 60-68

These features all share tag 22, and show interesting dynamics. For each feature in this range, we fit and plot a Kernel Density Estimate (KDE), with accompanying scatter plots. We do the same for each feature adjusted by feature 64, that is, feature 64 minus the feature. The KDEs are split by feature 0, which significantly affects the distributions of these features.


In [None]:
def plotFeatureSplits(df: pd.DataFrame, feature_list: List[int]) -> None:
    # Work on a copy so the helper columns added below do not leak into the caller's frame
    df = df.copy()
    for i in feature_list:
        if i == 64:
            # Feature 64 is the reference; adjusting it by itself is uninformative
            continue

        # Side-by-side scatter plots: the raw feature, and feature 64 minus the feature
        _, axes = plt.subplots(1, 2, figsize=(15, 5))

        # Original timeseries
        axes[0].scatter(df['ts_id'], df[f'feature_{i}'], s=0.2)
        axes[0].set_title(f'Feature {i}')
        axes[0].set_ylabel(f'Feature {i}')
        axes[0].set_xlabel('Trade ID')

        # Feature adjusted by feature 64
        adj_col = f'feature_64-feature_{i}'
        df[adj_col] = df['feature_64'] - df[f'feature_{i}']
        axes[1].scatter(df['ts_id'], df[adj_col], s=0.5, alpha=0.8)
        axes[1].set_title(f'Feature {i}, adj by feature 64')
        axes[1].set_ylabel(f'Feature 64 - feature {i}')
        axes[1].set_xlabel('Trade ID')
        plt.show()

        # 2D KDEs of the raw and adjusted feature over time, split by feature 0
        sns.displot(data=df, x='ts_id', y=f'feature_{i}', kind='kde', hue='feature_0')
        sns.displot(data=df, x='ts_id', y=adj_col, kind='kde', hue='feature_0')
        plt.show()

# Binary target: 1 when the response is positive
ordered_subset['action'] = (ordered_subset['resp']>0).astype('int')
# Shift every ts_id after 3255 by 500 so the timeskip shows up as a gap in the plots
new = ordered_subset.copy()
new.loc[new['ts_id']>3255,'ts_id']+=500


#### Features 60-64
The unadjusted plot of feature 60 seems to show discrete values, suggesting feature 60 is binned, much like an open-high-low-close (OHLC) data set may be binned. This behavior is repeated for features 61, 62, and 63. We also begin to see patterns emerging. The block of time that is missing from our dataset, and that we adjusted our plots for, seems to propagate into these features, creating a streak with no points. The KDE plots show the result, with the data falling into 3 groups: trades before the removed block, trades after the removed block and above some threshold, and trades after the removed block and below some threshold.
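
The binning impression from the scatter plots can be checked with a rough heuristic: a binned column has very few distinct values relative to its row count. The sketch below uses toy data, and `unique_ratio` is a hypothetical helper rather than part of the original analysis. A low ratio is only suggestive of binning, not proof.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Toy columns (assumptions for illustration only)
toy = pd.DataFrame({
    # Values drawn from a 9-point grid: looks "binned"
    "binned": rng.choice(np.linspace(-1.0, 1.0, 9), size=n),
    # Continuous values: almost every row is distinct
    "continuous": rng.normal(size=n),
})

def unique_ratio(s: pd.Series, decimals: int = 6) -> float:
    """Fraction of distinct values (after rounding) among the non-null rows."""
    vals = s.dropna().round(decimals)
    return vals.nunique() / len(vals)

ratios = {col: unique_ratio(toy[col]) for col in toy.columns}
print(ratios)
```

On the real data, the same helper could be applied to `ordered_subset['feature_60']` through `ordered_subset['feature_63']` and compared against feature 65.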

In [None]:
plotFeatureSplits(new[:5587], np.arange(60, 65))

#### Features 65-68

Feature 65 doesn't have the same binned look as features 60-64. Instead, points seem to gravitate towards where feature 64 may be during the day. This may indicate tick data. We also see changes in the KDE distributions when accounting for feature 0. The same missing-data propagation phenomenon occurs, but in a slightly different manner. While I don't have much evidence for this, the difference may indicate that we have logarithmic data rather than tick data. There are other possibilities, but we may be able to garner new insights without prodding the anonymous tags too much. The same 3 groupings are present as in features 60-64. All of these points also apply, to varying degrees, to features 66, 67, and 68.

In [None]:
plotFeatureSplits(new[:5587], np.arange(65, 69))

### Conclusion

We may be able to increase our model's predictive ability by incorporating the hidden information in features 60-68. Creating new features from symbolic combinations of these features, such as the differences from feature 64 plotted above, may provide valuable information to our neural nets, boosted trees, and simple regressors.
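
As one concrete starting point, the feature 64 differences used in the plots can be materialized as model inputs. This is a minimal sketch on toy data: `add_feature_64_differences` and the `feature_64_minus_{i}` column names are hypothetical, and whether these columns actually help a model is untested here.

```python
import numpy as np
import pandas as pd

def add_feature_64_differences(df: pd.DataFrame, features=range(60, 69)) -> pd.DataFrame:
    """Return a copy of df with feature_64 - feature_i added for each i except 64."""
    out = df.copy()
    for i in features:
        if i == 64:
            continue  # the difference with itself is always zero
        out[f"feature_64_minus_{i}"] = out["feature_64"] - out[f"feature_{i}"]
    return out

# Toy frame with the same column names as the competition data (values are random)
rng = np.random.default_rng(0)
toy = pd.DataFrame({f"feature_{i}": rng.normal(size=5) for i in range(60, 69)})

engineered = add_feature_64_differences(toy)
print(engineered.filter(like="feature_64_minus_").columns.tolist())
```

The function returns a copy, so the input frame is left unchanged and the new columns can be dropped or extended without side effects.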