# Selecting features exploring the temporal nature of the dataset

This is a very basic notebook focusing only on the feature selection part. If you like it, you can adapt it to the models you're using or simply use the stable feature set we find here.

I describe this approach on [this blog post](https://lgmoneda.github.io/2020/12/07/temporal-feature-selection-with-shap-values.html), which contains another practical example. 

## Motivation

It's common to see performance drop over time. Our models very likely exploit correlations that don't hold in out-of-distribution data, which comes from contexts slightly different from the ones we learned from:

![trend](https://lgmoneda.github.io/images/temporal-feature-selection/model_degrade.jpg)

The idea here is to look for the features that are consistently important across the many time periods we have available, hoping they are the most robust and keep their predictive power on future unseen data.


## Results summary

Models for comparison:  
- Challenger: default LightGBM with features selected using temporal SHAP
- Benchmarks:
  1. Default LightGBM with all features (benchmark I)
  2. Default LightGBM with the top features by SHAP importance pooled over all in-time periods, keeping the same number of features as the challenger (benchmark II)
 
 
 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import shap

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import janestreet
import warnings

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12, 4)
warnings.filterwarnings('ignore')

In [None]:
TARGET_THRESHOLD = 0
TIME_SPLIT = 400
TIME_COLUMN = "date"
TARGET = "action"

In [None]:
data = pd.read_csv("/kaggle/input/jane-street-market-prediction/train.csv")
# If you run into resource restrictions, uncomment the next line to work on a 10% sample
#data = data.sample(frac=0.1)

In [None]:
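# Binary target: take the trade when its return (resp) is above the threshold
data[TARGET] = data["resp"] > TARGET_THRESHOLD
# Use every column whose name contains "feature" as a predictor
features = [col for col in data.columns if "feature" in col]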

## Split the data

We're going to create two temporal sets, `in time` and `out of time`. 
Then, we'll split the `in time` into train and test.

![split](https://lgmoneda.github.io/images/temporal-feature-selection/holdout_split.jpg)


In [None]:
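# Copy the slices so that prediction columns can be added to them later without touching `data`
in_time = data[data[TIME_COLUMN] <= TIME_SPLIT].copy()
out_of_time = data[data[TIME_COLUMN] > TIME_SPLIT].copy()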

In [None]:
train, test = train_test_split(in_time, 
                               test_size=0.2, 
                               random_state=42)

## Train a full model to extract importance

In [None]:
model = LGBMClassifier()
model.fit(train[features], train[TARGET])

## Storing pooled feature importance for the benchmark

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test[features])

In [None]:
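# Older shap versions return a list with one array per class; keep the positive class in that case
positive_class_shap = shap_values[1] if isinstance(shap_values, list) else shap_values
pooled_shap_importance = np.abs(positive_class_shap).mean(axis=0)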
pooled_shap_importance = pd.DataFrame(pooled_shap_importance)
pooled_shap_importance.index = features
pooled_shap_importance.sort_values(by=0, ascending=False, inplace=True)
pooled_shap_importance

## Extracting the importance for every period using the in time test set

In [None]:
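importance = []
# Only loop over periods that actually have rows in the in-time test set
for period in sorted(test[TIME_COLUMN].unique()):
    test_period = test[test[TIME_COLUMN] == period]
    # pred_contrib=True returns per-row SHAP contributions; the last column is the expected value, so drop it
    contrib = model.predict(test_period[features], pred_contrib=True)[:, :-1]
    shap_contrib = np.abs(contrib).mean(axis=0)
    df = pd.DataFrame(shap_contrib.reshape(1, len(features)), columns=features)
    df["period"] = period
    importance.append(df)

importance = pd.concat(importance)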

In [None]:
importance
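
The loop above can also be written as a single group-by once the per-row contributions are available. The sketch below uses a small random matrix in place of the `pred_contrib` output, so `contrib`, `periods`, `feature_names` and `toy_importance` are stand-ins for illustration, not objects from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature_names = ["feature_0", "feature_1", "feature_2"]

# Stand-in for model.predict(X, pred_contrib=True)[:, :-1]:
# one row per sample, one column per feature
contrib = rng.normal(size=(12, len(feature_names)))
# Stand-in for the time column: 3 periods with 4 rows each
periods = np.repeat([0, 1, 2], 4)

# Mean absolute contribution of each feature within each period
toy_importance = (
    pd.DataFrame(np.abs(contrib), columns=feature_names)
    .groupby(periods)
    .mean()
)
print(toy_importance)
```

Computing the contributions once for the whole test set and grouping afterwards avoids calling `predict` once per period, at the cost of holding the full contribution matrix in memory.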

In [None]:
average = importance.drop(columns=["period"]).mean()

In [None]:
average.sort_values()

In [None]:
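# Instability: coefficient of variation (standard deviation across periods divided by the average)
deviation = importance.drop(columns=["period"]).std() / average

In [None]:
# A feature with zero average contribution yields 0/0 = NaN; set it to 0
deviation.fillna(0, inplace=True)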
deviation
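
To see why the standard deviation is divided by the average, consider a toy importance table. The numbers and the feature names `steady` and `erratic` are made up for illustration. Both features have the same mean importance, but only one of them keeps it across periods.

```python
import pandas as pd

# Hypothetical per-period importance for two features
toy = pd.DataFrame({
    "period": [0, 1, 2, 3],
    "steady": [0.10, 0.11, 0.09, 0.10],
    "erratic": [0.30, 0.02, 0.06, 0.02],
})

# Same computation as in the notebook: mean importance and
# standard deviation relative to that mean
toy_average = toy.drop(columns=["period"]).mean()
toy_instability = toy.drop(columns=["period"]).std() / toy_average

print(toy_average)
print(toy_instability)
```

Ranking by the average alone would not separate the two features. Dividing by the average also makes the instability comparable between features whose importance lives on very different scales.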

## Stable definition

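Here you can change the thresholds and play with the definition of "stable". The cell below keeps features whose average contribution is at or above the 50th percentile (the median) and whose instability is at or below the 80th percentile.

To use the median directly instead of `np.percentile`, do: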

```
threshold_average_contrib = average.median()
```

In [None]:
median_average_contrib = average.median()
median_std_contrib = deviation.median()

threshold_average_contrib = np.percentile(average, 50)
threshold_std_contrib = np.percentile(deviation, 80)

## Visualize how features are distributed regarding contribution and instability

We want to select the ones with high contribution and low instability.

In [None]:
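# Label each artist so the legend has something to show
plt.scatter(average, deviation, label="Features")
xmin, xmax, ymin, ymax = plt.axis()
plt.hlines(threshold_std_contrib, xmin, xmax, linestyle="dotted", color="red", label="Instability threshold")
plt.vlines(threshold_average_contrib, ymin, ymax, linestyle="dotted", color="red", label="Contribution threshold")
plt.legend(bbox_to_anchor=(1.05, 1.0))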
plt.title("Contribution x Instability")
plt.ylabel("Instability")
plt.xlabel("Contribution")
plt.show()

In [None]:
aggregate_importance = pd.DataFrame()
aggregate_importance["average"] = average.values
aggregate_importance["instability"] = deviation.values
aggregate_importance.index = average.index

aggregate_importance

In [None]:
mask = (aggregate_importance["average"] >= threshold_average_contrib) & (aggregate_importance["instability"] <= threshold_std_contrib)
stable_features = aggregate_importance[mask].index

Just a sanity check on the share of features selected. If you play with the stable definition, this share is going to change.

The higher the contribution threshold and the lower the instability threshold, the more restrictive the definition, so this proportion should decrease.
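
A quick way to see this effect is to wrap the selection rule in a function and sweep the percentiles. The helper `select_stable` and the random `toy_average` and `toy_instability` series below are hypothetical. They only mirror the rule used in this notebook and do not use its data.

```python
import numpy as np
import pandas as pd

def select_stable(average, instability, contrib_percentile=50, instability_percentile=80):
    """Keep features with contribution at or above one percentile
    and instability at or below another."""
    min_contrib = np.percentile(average, contrib_percentile)
    max_instability = np.percentile(instability, instability_percentile)
    mask = (average >= min_contrib) & (instability <= max_instability)
    return average[mask].index

# Synthetic importance statistics for 100 made-up features
rng = np.random.default_rng(42)
names = [f"feature_{i}" for i in range(100)]
toy_average = pd.Series(rng.gamma(2.0, 0.01, size=100), index=names)
toy_instability = pd.Series(rng.gamma(2.0, 0.2, size=100), index=names)

# Tighten the definition step by step and print the share of features kept
for contrib_pct, instab_pct in [(50, 80), (70, 80), (70, 50)]:
    selected = select_stable(toy_average, toy_instability, contrib_pct, instab_pct)
    print(contrib_pct, instab_pct, len(selected) / len(names))
```

Each stricter setting can only remove features from the previous selection, so the printed share never increases from one row to the next.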

In [None]:
fraction_selected = len(stable_features) / len(aggregate_importance)
fraction_selected

Here's the list of stable features. If you want, you can just pick them and test how they perform with your approach.

Of course, the stable definition here depends on both the approach and the thresholds, so you might want to change the base model used for importance to match the one you're using.

In [None]:
stable_features

## Checking the performance in the out-of-time set

In [None]:
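def utility_score(date, weight, resp, action):
    # Daily profit: sum of weight * resp over the trades taken on each date
    Pi = np.bincount(date, weight * resp * action)
    # bincount also creates empty bins for dates missing from this subset,
    # so count only the dates that are actually present
    n_days = len(np.unique(date))
    t = np.sum(Pi) / np.sqrt(np.sum(Pi ** 2)) * np.sqrt(250 / n_days)
    u = min(max(t, 0), 6) * np.sum(Pi)
    return u

def jane_utility(data, action_column="action"):
    return utility_score(data["date"].values, 
                         data["weight"].values, 
                         data["resp"].values, 
                         data[action_column].values)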
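
For intuition, here is the same utility computed step by step on four hypothetical trades spread over two days. All numbers are made up. The daily profit is the sum of `weight * resp * action`, `t` is the ratio between total profit and the root of the summed squared daily profits scaled by `sqrt(250 / number of days)`, and the utility is `t` clipped to the range 0 to 6 times the total profit.

```python
import numpy as np

# Four made-up trades over two days (dates 0 and 1)
date = np.array([0, 0, 1, 1])
weight = np.array([1.0, 2.0, 1.0, 1.0])
resp = np.array([0.01, -0.02, 0.03, 0.01])
action = np.array([1, 0, 1, 1])

# Daily profit: only trades with action == 1 contribute
daily_profit = np.bincount(date, weight * resp * action)

# Annualized Sharpe-like ratio over the days in the sample
t = daily_profit.sum() / np.sqrt((daily_profit ** 2).sum()) * np.sqrt(250 / len(daily_profit))

# Clip t to [0, 6] and scale by the total profit
utility = min(max(t, 0), 6) * daily_profit.sum()
print(daily_profit, t, utility)
```

The daily profits are 0.01 and 0.04. With only two profitable days `t` is far above the cap of 6, so the utility is 6 × 0.05 = 0.3. Skipping the losing trade on day 0 is what keeps that day's profit positive.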

### Benchmark I

In [None]:
model = LGBMClassifier()
model.fit(train[features], train["action"])

test["benchmark_1"] = model.predict_proba(test[features])[:, 1]
out_of_time["benchmark_1"] = model.predict_proba(out_of_time[features])[:, 1]

test["benchmark_1_action"] = model.predict(test[features])
out_of_time["benchmark_1_action"] = model.predict(out_of_time[features])

print("Test AUC (in time): {:.6f}".format(roc_auc_score(test[TARGET], test["benchmark_1"])))
print("Out of time AUC: {:.6f}".format(roc_auc_score(out_of_time[TARGET], out_of_time["benchmark_1"])))
print("-----------")
print("Test Jane Utility (in time): {:.2f}".format(jane_utility(test, "benchmark_1_action")))
print("Out of time Jane Utility: {:.2f}".format(jane_utility(out_of_time, "benchmark_1_action")))

### Benchmark II

In [None]:
# Keep as many pooled-importance features as the challenger uses
n_top_features = int(len(features) * fraction_selected)
top_features = pooled_shap_importance.index[:n_top_features]

benchmark_model = LGBMClassifier()
benchmark_model.fit(train[top_features], train["action"])

test["benchmark_2"] = benchmark_model.predict_proba(test[top_features])[:, 1]
out_of_time["benchmark_2"] = benchmark_model.predict_proba(out_of_time[top_features])[:, 1]

test["benchmark_2_action"] = benchmark_model.predict(test[top_features])
out_of_time["benchmark_2_action"] = benchmark_model.predict(out_of_time[top_features])

print("Test AUC (in time): {:.6f}".format(roc_auc_score(test[TARGET], test["benchmark_2"])))
print("Out of time AUC: {:.6f}".format(roc_auc_score(out_of_time[TARGET], out_of_time["benchmark_2"])))
print("-----------")
print("Test Jane Utility (in time): {:.2f}".format(jane_utility(test, "benchmark_2_action")))
print("Out of time Jane Utility: {:.2f}".format(jane_utility(out_of_time, "benchmark_2_action")))

### Challenger

In [None]:
stable_model = LGBMClassifier()
stable_model.fit(train[stable_features], train["action"])

test["challenger"] = stable_model.predict_proba(test[stable_features])[:, 1]
out_of_time["challenger"] = stable_model.predict_proba(out_of_time[stable_features])[:, 1]

test["challenger_action"] = stable_model.predict(test[stable_features])
out_of_time["challenger_action"] = stable_model.predict(out_of_time[stable_features])

print("Test AUC (in time): {:.6f}".format(roc_auc_score(test[TARGET], test["challenger"])))
print("Out of time AUC: {:.6f}".format(roc_auc_score(out_of_time[TARGET], out_of_time["challenger"])))
print("-----------")
print("Test Jane Utility (in time): {:.2f}".format(jane_utility(test, "challenger_action")))
print("Out of time Jane Utility: {:.2f}".format(jane_utility(out_of_time, "challenger_action")))

## Visualizing comparison

In [None]:
evaluation = pd.concat([test, out_of_time])

def rolling_daily_return(action_column, window=60):
    # Sum of weight * resp over the trades taken each day, smoothed with a moving average
    daily = evaluation.groupby("date").apply(lambda x: np.sum(x["resp"] * x["weight"] * x[action_column]))
    return daily.rolling(window).mean()

rolling_daily_return("benchmark_1_action").plot(label="All features (benchmark I)", color="purple")
rolling_daily_return("benchmark_2_action").plot(label="Pooled SHAP features (benchmark II)")
rolling_daily_return("challenger_action").plot(label="Temporal selected features (challenger)", color="green")

xmin, xmax, ymin, ymax = plt.axis()
plt.vlines(TIME_SPLIT, ymin, ymax, linestyle="dotted", color="red", label="Out of time split")
plt.legend(bbox_to_anchor=(1.05, 1.0))
plt.title("60-period moving average of performance for the test and out-of-time periods", pad=16)
plt.ylabel("sum(Weight * Resp * Action)")
plt.xlabel("Date")
plt.show()

## Is there any pattern on the stable features regarding their nature?

To answer this question, we're going to take a look into the tags: 

In [None]:
feature_tags = pd.read_csv("/kaggle/input/jane-street-market-prediction/features.csv")

In [None]:
tags = [col for col in feature_tags if "tag" in col]
# Flag each feature as stable or not
feature_tags["Stable"] = feature_tags["feature"].isin(stable_features)
feature_tags.groupby("Stable")[tags].mean().transpose().plot(kind="bar")
plt.title("Proportion of features with a certain tag considering stable and unstable", fontsize=16, pad=16)
plt.show()

## Submission

We retrain with the full dataset and the stable features only. 

In [None]:
model = LGBMClassifier()
model.fit(data[stable_features], data["action"])

In [None]:
### Share of positive targets, shown for reference only.
### model.predict below applies the default 0.5 cutoff, which is what the benchmark submission uses
threshold = data["action"].mean()
threshold

In [None]:
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set
for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df["action"] = model.predict(test_df[stable_features]).astype(int)
    env.predict(sample_prediction_df)