![](https://numer.ai/homepage/img/Numerai-Logo-Side-Black.png)

(UPDATE)
There has been some confusion about this method, mainly regarding how to perform feature neutralization on the test data, so I have added a section about it. Also have a look at [Avoid Overfitting by Feature Neutralization](https://www.kaggle.com/c/jane-street-market-prediction/discussion/215305).


It has been fun to overfit the public LB (historical data) with whatever NN you like (MLP, 1DCNN, ResNet...you name it), but I feel it is time to consider how to avoid overfitting so that we perform well on the private test set (future data).

There are several ways to proceed. The easy ones are:

- increase regularization in a NN model (increase dropout, reduce batch size, add noise, etc)
- increase the action threshold (from 0.500 to somewhat higher)

In this notebook, I introduce a very effective approach widely used in the [Numerai Tournament](https://numer.ai/): **Feature Neutralization**.

The Numerai Tournament is very similar to this Jane Street competition: participants submit model predictions on the weekly tournament data and get paid according to their performance. So whatever works in the Numerai Tournament should hopefully work here.

So what is feature neutralization? How can it help us avoid overfitting?


# Feature Neutralization in a Nutshell

![](https://forum.numer.ai/uploads/default/original/1X/e1d39e8f38ae51c1189ee85fd9578a2a555a3c06.jpeg)

Feature neutralization is a technique that keeps as much information in a vector as possible while removing the part that depends linearly on another vector. Essentially we **take the residuals of a linear regression** of one vector on the other.
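
In symbols (the notation here is my own shorthand, not something taken from the Numerai code): write $x$ for the vector to be neutralized, $y$ for the vector we neutralize against, and $p \in [0, 1]$ for the proportion of neutralization, which plays the role of the `proportion` argument in the code further down. A rough sketch of the idea:

$$
(\hat{a}, \hat{b}) = \arg\min_{a,\,b} \lVert x - (a\,y + b\,\mathbf{1}) \rVert^2,
\qquad
\tilde{x} = x - p\,(\hat{a}\,y + \hat{b}\,\mathbf{1})
$$

With $p = 1$, $\tilde{x}$ is the ordinary least-squares residual, so its sample correlation with $y$ is zero. With $p < 1$, only part of the linear dependence is removed.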

How is it useful?

It is useful because by neutralizing the features with respect to the target, we get a set of features that retain as much of the original information as possible but are decorrelated from the target.

Imagine a situation where one feature is very correlated with the target (which is often the case with ML in finance). When you train a model, it latches onto that feature and ignores everything else. This is fine as long as you are validating your model on historical data; you may even get a very high utility score.

However, when you deploy the model for forecasting, there may come a time when that very strong feature becomes useless. Think of a pandemic, a war, political events, you name it. What if that happens?

Maybe the sign of the strong feature flips. 

Then **the sign of your model prediction also flips**...sell when you should buy, buy when you should sell.  

Obviously this is a catastrophic scenario for an investor...but it is often the consequence of using an overfitted model, simply because no feature is perfect in a financial market.

**You want a model that does not depend on any single feature!**

Feature neutralization helps you to have that kind of model.

In [None]:
# code to feature neutralize

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm.auto import tqdm

def neutralize_series(series : pd.Series, by : pd.Series, proportion=1.0):
    """
    neutralize pandas series (originally from the Numerai Tournament)
    """
    # column vectors: the series to neutralize and the series to neutralize against
    scores = series.values.reshape(-1, 1)
    exposures = by.values.reshape(-1, 1)
    # constant column so that the regression also fits an intercept
    exposures = np.hstack((exposures, np.array([np.mean(series)] * len(exposures)).reshape(-1, 1)))
    # least-squares fit of scores on exposures; subtract `proportion` of the fitted part
    correction = proportion * (exposures.dot(np.linalg.lstsq(exposures, scores, rcond=None)[0]))
    corrected_scores = scores - correction
    neutralized = pd.Series(corrected_scores.ravel(), index=series.index)
    return neutralized

By using the above script, you can neutralize your feature vector by another one (e.g. target).
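
To get a feel for what this does, here is a small self-contained sketch on synthetic data. Everything in it is made up for illustration: the helper `residualize` is a hypothetical, simplified stand-in for the function above (it uses a column of ones for the intercept), and the `feature` and `target` arrays are random numbers, not competition data.

```python
import numpy as np

def residualize(x, y, proportion=1.0):
    """Hypothetical helper: remove the part of x that is linearly explained by y."""
    # design matrix: y plus a column of ones for the intercept
    design = np.column_stack([y, np.ones_like(y)])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - proportion * (design @ coef)

rng = np.random.default_rng(0)
target = rng.normal(size=5000)
# synthetic feature that is partly driven by the target
feature = 0.3 * target + rng.normal(size=5000)

neutral = residualize(feature, target)

corr_before = np.corrcoef(feature, target)[0, 1]
corr_after = np.corrcoef(neutral, target)[0, 1]
corr_with_original = np.corrcoef(neutral, feature)[0, 1]

print(f"corr to target before neutralization: {corr_before:.3f}")
print(f"corr to target after neutralization:  {corr_after:.3f}")
print(f"corr between old and new feature:     {corr_with_original:.3f}")
```

The neutralized feature should end up (numerically) uncorrelated with the target while staying close to the original feature.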

# Is our model prediction strongly correlated with a particular feature?
First, let's check this. I use a model prediction from [[JaneStreet] simple LGB with GroupKFold](https://www.kaggle.com/code1110/janestreet-simple-lgb-with-groupkfold?scriptVersionId=52979104).

In [None]:
# load competition data
train = pd.read_feather('../input/janestreet-save-as-feather/train.feather')
train = train.query('weight > 0').reset_index(drop=True)
print(train.shape)
train.head()

In [None]:
# load model prediction
oof = np.load('../input/janestreet-simple-lgb-with-groupkfold/oof.npy')
print(oof.shape)

In [None]:
# add oof prediction to train
train['oof'] = oof

# fillna
features = train.columns[train.columns.str.startswith('feature')].values.tolist()
for f in features:
    train[f] = train[f].fillna(train[f].mean())

To see how strongly our model prediction is correlated with a particular feature, we compute **feature exposure**.
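
Written out (the symbols are my own shorthand): let $\hat{y}$ be the column of interest (the model prediction here), $f_1, \dots, f_K$ the feature columns, and $\rho_k$ the Pearson correlation between $\hat{y}$ and $f_k$. The two summary numbers computed by the helper functions in the next cell are then:

$$
\rho_k = \operatorname{corr}(\hat{y}, f_k),
\qquad
\text{max feature exposure} = \max_k |\rho_k|,
\qquad
\text{feature exposure} = \sqrt{\frac{1}{K}\sum_{k=1}^{K} \rho_k^2}
$$

Both are zero only when the column is uncorrelated with every feature, and the max version is driven by the single most correlated feature.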

In [None]:
from scipy.stats import spearmanr

def feature_exposures(df, prediction_name='resp'):
    feature_names = [f for f in df.columns
                     if f.startswith("feature")]
    exposures = []
    for f in feature_names:
#         fe = spearmanr(df[prediction_name], df[f])[0]
        fe = np.corrcoef(df[prediction_name], df[f])[0, 1]
        exposures.append(fe)
    return np.array(exposures)


def max_feature_exposure(df):
    return np.max(np.abs(feature_exposures(df)))


def feature_exposure(df):
    return np.sqrt(np.mean(np.square(feature_exposures(df))))

In [None]:
# compute feature exposure
fes = feature_exposures(train, 'oof')
fes

In [None]:
print('Max feature exposure of the LGB model = {}'.format(np.max(np.abs(fes))))

In [None]:
# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
sns.set_context("talk")
style.use('fivethirtyeight')
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

fes_df = pd.DataFrame()
fes_df['features'] = np.array(features)
fes_df['features'] = fes_df['features'].apply(lambda x : str(x).split('_')[-1])
fes_df['corr_to_oof'] = fes

fig, ax = plt.subplots(1, 1, figsize=(20, 5))
sns.barplot(x='features', y='corr_to_oof', data=fes_df, ax=ax)

Wow, although the LGB model is not that strong, the prediction is highly correlated with some features! So the LGB model may fail miserably if those features become useless.

How does the model overfit? Likely because the provided features are highly correlated with the target in the first place.

In [None]:
fes = feature_exposures(train, 'resp')

fes_df = pd.DataFrame()
fes_df['features'] = np.array(features)
fes_df['features'] = fes_df['features'].apply(lambda x : str(x).split('_')[-1])
fes_df['corr_to_resp'] = fes


fig, ax = plt.subplots(1, 1, figsize=(20, 5))
sns.barplot(x='features', y='corr_to_resp', data=fes_df, ax=ax)

Exactly. The LGB model just learned the correlation pattern between the features and targets. So if a strong feature flips its sign for whatever reason, we are doomed.


Let's perform feature neutralization to avoid such doom. This can be done as follows.

In [None]:
# feature neutralization
for f in tqdm(features):
    # neut
    neut = neutralize_series(train[f], train['resp'], proportion=1.0)
    
    # verify
    original_corr = np.corrcoef(train[f].values, train['resp'].values)[0, 1]
    neut_corr = np.corrcoef(neut, train['resp'].values)[0, 1]
    f_corr = np.corrcoef(train[f].values, neut)[0, 1]
    print('{}: original corr to target={:.3f}, corr to target after neut={:.3f}, corr with old and neut feat={:.3f}'.format(f,
        original_corr, neut_corr, f_corr))
    
    # assign
    train[f] = neut.values

So **by feature neutralization we can effectively decrease the correlation to the target while keeping the correlation to the original feature**.

How does the correlation of each feature with the target change?

In [None]:
fes = feature_exposures(train, 'resp')

fes_df = pd.DataFrame()
fes_df['features'] = np.array(features)
fes_df['features'] = fes_df['features'].apply(lambda x : str(x).split('_')[-1])
fes_df['corr_to_resp'] = fes

fig, ax = plt.subplots(1, 1, figsize=(20, 5))
sns.barplot(x='features', y='corr_to_resp', data=fes_df, ax=ax)

If you compare this figure to the previous one in terms of the y-axis, you realize that after the feature neutralization **the correlation of each feature to the target is decreased a lot**!

If you think completely neutralizing all the features to the target is too much, you can:

- decrease the proportion of neutralization
- change the sets of features to be neutralized
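
As a rough illustration of the first option, the sketch below subtracts only a fraction of the fitted linear part and prints how the correlation with the target shrinks as that fraction grows. It runs on synthetic data; the arrays `feature` and `target` and the chosen proportions are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
target = rng.normal(size=n)
# synthetic feature with a fairly strong linear link to the target
feature = 0.5 * target + rng.normal(size=n)

# least-squares fit of the feature on the target (with intercept)
design = np.column_stack([target, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, feature, rcond=None)
fitted = design @ coef

corrs = {}
for proportion in (0.0, 0.25, 0.5, 0.75, 1.0):
    # remove only `proportion` of the linearly explained part
    partial = feature - proportion * fitted
    corrs[proportion] = np.corrcoef(partial, target)[0, 1]
    print(f"proportion={proportion:.2f} -> corr to target={corrs[proportion]:.3f}")
```

For the second option, the same loop would simply be run over a chosen subset of feature columns instead of all of them.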

# How can we use feature neutralization in the test data?
This is of course what everyone is interested in. Initially I did not want to disclose how to do it, but you can find the answer anyway on the Numerai forum:

[Model Diagnostics: Feature Exposure](https://forum.numer.ai/t/model-diagnostics-feature-exposure/899)

Essentially you need to **neutralize the prediction by the features** so that your prediction is not solely dependent on a few strong features.

The following function is what Numeraiers are using every weekend:D

In [None]:
def neutralize(df, target="resp", by=None, proportion=1.0):
    if by is None:
        by = [x for x in df.columns if x.startswith('feature')]

    scores = df[target].copy()  # copy so the in-place update below does not modify df
    exposures = df[by].values

    # constant column to make sure the series is completely neutral to exposures
    exposures = np.hstack((exposures, np.array([np.mean(scores)] * len(exposures)).reshape(-1, 1)))

    scores -= proportion * (exposures @ (np.linalg.pinv(exposures) @ scores.values))
    return scores / scores.std()
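
Here is a minimal sketch of how this could look on a batch of predictions, using synthetic data. The function `neutralize_predictions` is a hypothetical variant that takes the prediction column name explicitly and uses a column of ones for the intercept, and the column names `pred` and `pred_neutral` are assumptions of mine. Since the fake prediction below is purely linear in the features plus noise, full neutralization leaves only the noise; with a real nonlinear model, the part of the prediction that is not a linear function of the features would survive.

```python
import numpy as np
import pandas as pd

def neutralize_predictions(df, pred_col, feature_cols, proportion=1.0):
    """Hypothetical variant: residualize df[pred_col] on df[feature_cols]."""
    scores = df[pred_col].to_numpy(dtype=float)
    exposures = df[feature_cols].to_numpy(dtype=float)
    # column of ones so the result is also neutral to a constant offset
    exposures = np.hstack([exposures, np.ones((len(df), 1))])
    # subtract `proportion` of the least-squares projection onto the features
    neutral = scores - proportion * (exposures @ (np.linalg.pinv(exposures) @ scores))
    # rescale to unit standard deviation
    return pd.Series(neutral / neutral.std(), index=df.index)

def max_exposure(frame, col, feature_cols):
    """Largest absolute Pearson correlation between `col` and any feature."""
    return max(abs(np.corrcoef(frame[col], frame[f])[0, 1]) for f in feature_cols)

rng = np.random.default_rng(2)
n = 2000
feature_cols = [f"feature_{i}" for i in range(5)]
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=feature_cols)
# synthetic prediction that leans heavily on feature_0
df["pred"] = 0.8 * df["feature_0"] + 0.1 * df["feature_1"] + rng.normal(size=n)

before = max_exposure(df, "pred", feature_cols)
df["pred_neutral"] = neutralize_predictions(df, "pred", feature_cols)
after = max_exposure(df, "pred_neutral", feature_cols)

print(f"max feature exposure before: {before:.3f}")
print(f"max feature exposure after:  {after:.3f}")
```

Passing a `proportion` below 1.0 or a shorter `feature_cols` list would give a milder neutralization.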

In conclusion, if you want to avoid overfitting, you might consider feature neutralization. 

Enjoy!