# TL;DR
This is the second part in a series of tutorials showcasing how to automate much of feature engineering.

In [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), we illustrated *how to estimate the best performance you can achieve consistently (i.e. in-sample and out-of-sample) using a set of features, in a single line of code, and without training any model.*

In Part II (this tutorial), we explain why, when it comes to features, sometimes less is more. Then, building on [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), ***we illustrate how to filter out redundant and non-informative features in a model-free fashion.***

In [Part III](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-iii), we will build on this tutorial to show you how to seamlessly add shrinkage/feature selection to any regressor in Python.

*Please upvote and share if you found it useful.*

# Table of Contents

- **I. [Background](#background)**
 - **I.1 [When It Comes to Features, Sometimes Less Is More](#less-is-more)**
 - **I.2 [Why Model-Free Feature Selection Matters](#model-free-is-important)** 
 - **I.3 [A Simple Model-Free Feature Selection Algorithm](#solution)** 
- **II. [Application](#application)**
 - **II.1 [Getting Started](#setup)**
 - **II.2 [The Set of Candidate Features](#features)**
 - **II.3 [Model-Free Feature Selection in a Single Line of Code](#one-liner)**
   - **II.3.a [Model-free feature selection for cross-sectional predictions](#cross-sectional)**
   - **II.3.b [Model-free feature selection for single-asset predictions](#single-asset)**

# I. Background <a name="background"></a>
In [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i) we argued that a good feature engineering pipeline should achieve two seemingly conflicting objectives. 

First, it should turn raw inputs $x$ into a feature vector $z$ that is related to the target $y$ in a way that is consistent with the types of patterns that models in our toolbox can actually learn. 

Second, the pipeline should ensure that features $z$ remain as insightful about the target $y$ as raw inputs $x$ are. By the [data processing inequality](https://en.wikipedia.org/wiki/Data_processing_inequality), any feature transformation is bound to lose some *juice* that was in $x$ for predicting $y$, but a good feature transformation should keep this loss of juice to a minimum, while making the *juice* easier to extract using models in our toolbox.
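As a small illustration of the data processing inequality, the sketch below computes mutual information exactly for a toy discrete example. The distributions, the transformation `z = x // 2`, and the helper `mutual_information` are all made up for illustration; they are not part of the ``kxy`` package.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint pmf given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Toy setup (assumption): x is uniform on {0, 1, 2, 3}; y equals x with
# probability 0.7 and each of the three other values with probability 0.1.
p_y_given_x = np.full((4, 4), 0.1)
np.fill_diagonal(p_y_given_x, 0.7)
joint_xy = 0.25 * p_y_given_x               # rows index x, columns index y

# Feature z = x // 2 merges x in {0, 1} and x in {2, 3}.
joint_zy = np.vstack([joint_xy[:2].sum(axis=0), joint_xy[2:].sum(axis=0)])

mi_xy = mutual_information(joint_xy)
mi_zy = mutual_information(joint_zy)
print(f"I(x; y) = {mi_xy:.3f} nats, I(z; y) = {mi_zy:.3f} nats")
```

Because `z` is a non-invertible function of `x`, it carries less information about `y` than `x` does, which is the loss of *juice* referred to above.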

Given a vector of candidate features $z$, we showed in [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i) how to use the ``kxy`` package to compute the best performance achievable consistently (i.e. in-sample and out-of-sample) using ***all*** candidate features $z$ to predict the target $y$. However, when it comes to features, oftentimes less is more. 

In this tutorial, we will show you how to select the best features in a candidate set, without compromising on predictive power.

## I.1 The Issue With Too Many Features <a name="less-is-more"></a>

Large feature sets often include features that are redundant or non-informative. 

**Non-informative features** are features that cannot contribute to improving the performance of a regressor. Formally, these are features that are statistically *(unconditionally)* independent from the target,  ***and*** statistically independent from the target *conditional on/given **any other feature(s)** in the set*.

**Redundant features**, on the other hand, are features that might be informative about the target but cannot contribute to improving the performance of a regressor when used in conjunction with *some other features* in the set of candidates. Formally, these are features for which there exists *a subset of other features conditional on/given which* they are statistically independent from the target.
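To make the two definitions concrete, here is a minimal sketch on synthetic data. It uses an ordinary least-squares fit as a stand-in for "the best a regressor can do", which is only appropriate because the toy target is linear in the features; the variable names and the helper `ols_r2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)              # informative feature
x2 = 2.0 * x1 + 1.0                  # redundant: a deterministic function of x1
x3 = rng.normal(size=n)              # non-informative: independent of x1 and of the target
y = x1 + 0.5 * rng.normal(size=n)    # the target depends on x1 only

def ols_r2(features, target):
    """In-sample R-squared of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - residuals.var() / target.var()

r2_informative_only = ols_r2([x1], y)
r2_all_features = ols_r2([x1, x2, x3], y)
print(round(r2_informative_only, 3), round(r2_all_features, 3))
```

In this toy setting, adding the redundant and the non-informative feature leaves the fit essentially unchanged.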

Working with redundant or useless features presents several challenges.

* **Less effective:** Some models perform poorly in the presence of redundant features (e.g. OLS with linearly dependent features). Additionally, while non-informative features do not affect the theoretical best performance achievable, in practice, working with non-informative features may degrade model performance. Not only will model training be more prone to overfitting in the presence of non-informative features, but in general it would tend to give non-zero *weights* to features that shouldn't have any. 
* **Less efficient:** memory and compute resources needed to train or run a production model increase with the number of features the model uses.
* **Harder to explain:** The more features a model uses, the harder it is to explain what it does, especially when some features are redundant or non-informative.
* **More outages/downtime:** Different features could be served in production by different pipelines (e.g. the raw inputs might come from different databases or tables, processing could be done by different processes etc.). In such a case, the more features a deployed model uses, the more feature delivery system outages the model will be exposed to, which will eventually increase downtime as any feature outage will likely take down prediction capabilities.
* **Faster drift:** Data distributions drift over time, causing production models to be retrained. The higher the number of features, the quicker the drift will arise, and the more often production models will need to be retrained.

When engineering features, care should always be taken to avoid non-informative and redundant features.
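The remark about OLS and linearly dependent features can be checked numerically. The sketch below, on made-up data, builds a feature that is an exact linear combination of two others and inspects the matrix that the OLS normal equations need to invert.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                        # redundant: exact linear combination of x1 and x2

design = np.column_stack([x1, x2, x3])
gram = design.T @ design            # matrix inverted by the OLS normal equations

rank = np.linalg.matrix_rank(design)
condition_number = np.linalg.cond(gram)
print(rank, "independent columns out of", design.shape[1])
print("condition number of the Gram matrix:", condition_number)
```

The design matrix has fewer independent columns than features, so the Gram matrix is numerically singular and the OLS weights are not uniquely determined.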

## I.2 The Importance of Model-Free Feature Selection <a name="model-free-is-important"></a>

Because the choice of features (in particular the inclusion or not of redundant or non-informative features) affects model performance, feature selection is best done before and independently from model training. 

Tying feature selection to model training leaves a few questions unanswered and open to guesses. Say for instance that the trained model didn't perform as well as the theoretical-best (as per [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i)). Why could that be? Is it because of a poor choice of hyper-parameters or model-class? Or did the presence of redundant and/or non-informative features give the optimizer a hard time?

Popular (model-based) feature selection methods can hardly answer these questions. 

Let's review a few of these approaches so as to illustrate why.

- **LASSO:** Features to which LASSO gives big weights are features that are linearly related to the target. LASSO can easily give a non-informative feature a bigger weight than a very informative feature, simply because that very informative feature is in a non-linear relationship with the target. Thus, while LASSO works great to learn linear patterns between features and the target (including in the presence of redundant and/or non-informative features), it does not extend to non-linear relationships.

- **PCA:** PCA can be used to linearly transform a feature set into a new set of features that are mutually decorrelated (i.e. not linearly related). New features are constructed so as to maximize the amount of variance of the original feature set they account for, so that we may only retain $q < d$ of the $d$ new features while accounting for almost all the variance of the original set. However, decorrelation of features does not rule out feature redundancy. Two features could be decorrelated but mutually redundant. E.g. if $x$ is a standard Gaussian, $x$ and $x^2$ are decorrelated, but clearly $x^2$ is redundant when $x$ is known. As a more practical example, a longitude (resp. a latitude) intuitively ought to be statistically independent from a house number, but knowing both the longitude and the latitude makes the house number redundant for geo-localization. Moreover, preserving the variance of the original feature set does not imply preserving its predictive power for forecasting the target. Thus, PCA can neither reliably remove redundant features, nor can it remove non-informative features or preserve the theoretical-best performance achievable by the initial set of features.

- **Recursive Feature Elimination (RFE):** RFE selects features by starting with all candidate features and removing features one at a time based on a model-specific feature importance score. This approach can be resource-intensive and, depending on the feature importance score used, it can also present additional limitations. For instance, most methods (e.g. *tree-based/impurity-based feature importance*, *permutation-based feature importance*, and *SHAP*) do not detect non-informative features. Instead, they detect features that the specific trained model to which they are applied does not rely on. However, the fact that a trained model does not rely on a feature does not make that feature intrinsically non-informative. The model could be poorly trained, or the model class could be inadequate. These methods aren't capable of detecting redundant features either. Let's take the extreme example of identical features. A model trained with two identical features might expect them to remain identical out-of-sample. Applying permutation-based feature importance on these two features will likely underestimate their importance to the trained model overall, but more importantly both will be given the same importance, even though one is redundant relative to the other. SHAP will also give both features the same importance. In fact, theoretically, SHAP cannot account for redundant features at all. The second most important feature as per SHAP could very well be totally useless when the most important feature is known.
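The $x$ versus $x^2$ example used in the PCA discussion, and the LASSO point about non-linear relationships, can both be checked with a few lines of NumPy. This is a sketch on simulated data; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)        # standard Gaussian feature
x_squared = x ** 2                  # fully determined by x, hence redundant given x

# Pearson correlation between x and x^2 is close to zero,
# so a decorrelation-based method sees nothing to remove.
correlation = np.corrcoef(x, x_squared)[0, 1]

# Linear view of a non-linear target: if y = x^2, the best linear weight
# on x is close to zero even though x determines y exactly.
y = x_squared
linear_weight = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"corr(x, x^2) = {correlation:.3f}, linear weight of x = {linear_weight:.3f}")
```

Both quantities come out near zero, which is why neither decorrelation nor a linear weight can be relied on to flag redundancy or informativeness in non-linear settings.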



## I.3 A Simple Model-Free Feature Selection Algorithm <a name="solution"></a>
Building on the estimation of the theoretical-best performance achievable using an input vector to predict a target, which we covered in [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), [[1]](#paper1) proposed the following simple model-free variable selection algorithm.

#### Algorithm (Model-Free Variable Selection)
 - **Inputs**: 
     - $z \in \mathbb{R}^d$: features
     - $y \in \mathbb{R}$: target
     - $u \to \bar{\rho}(P_{y, u})$: function calculating the highest Pearson's correlation between prediction and target achievable using features $u$ to predict target $y$.
 - **Outputs**: 
     - Selection order $S$: indices of features selected, in decreasing order of importance
     - Running achievable performance $P$: the highest Pearson correlation achievable using the features selected so far, in the order given by $S$.
 - **Initialization**: $z^\prime = []; S = []; P = [];$
 - For $i$ from $1$ to $d$: 
     - $s_i = \underset{k \in [1, d], ~ k \notin S}{\text{argmax}} ~~~ \bar{\rho}\left(P_{y, z^\prime + [z[k]]}\right)$
     - $S = S + [s_i]$ 
     - $z^\prime = z^\prime + [z[s_i]]$
     - $P = P + \left[\bar{\rho}\left(P_{y, z^\prime}\right)\right]$
     
In plain English, the first feature selected is the feature that, when used by itself, has the highest achievable performance; in this case the highest achievable Pearson correlation between the prediction and the target. More generally, the $i$-th feature selected is the one, out of all $(d-i+1)$ features not yet selected, such that when it is used in conjunction with all $(i-1)$ previously selected features, it yields the highest achievable performance. The algorithm returns features in the order they were selected, from the most important to the least important, as well as the highest performance achievable using the top-$i$ features for any $i \in [1, d]$. 

Note that, by construction, at each step, this algorithm gives no importance to features that are either non-informative, or redundant with respect to previously selected features.
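The greedy loop above can be sketched in a few lines of Python. Important caveat: the scoring function below, `ols_correlation`, is a hypothetical stand-in that uses an in-sample least-squares fit; it is **not** the model-free estimator $\bar{\rho}$ from [[1]](#paper1) or from the ``kxy`` package, and it only captures linear relationships. The function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ols_correlation(features, target):
    """Stand-in score (assumption): correlation between an OLS prediction and the target."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.corrcoef(design @ coef, target)[0, 1])

def greedy_selection(z, y, score=ols_correlation):
    """Return the selection order S and the running achievable performance P."""
    n_features = z.shape[1]
    order, running = [], []
    for _ in range(n_features):
        remaining = [k for k in range(n_features) if k not in order]
        # Score each remaining feature jointly with those already selected.
        scores = [score(z[:, order + [k]], y) for k in remaining]
        best = int(np.argmax(scores))
        order.append(remaining[best])
        running.append(scores[best])
    return order, running

# Synthetic example: z0 and z1 drive the target, z2 is a noisy copy of z0, z3 is noise.
rng = np.random.default_rng(0)
n = 5_000
z0, z1, z3 = rng.normal(size=(3, n))
z2 = z0 + 0.5 * rng.normal(size=n)
y = 2.0 * z0 + z1 + 0.1 * rng.normal(size=n)
z = np.column_stack([z0, z1, z2, z3])

order, running = greedy_selection(z, y)
print(order)
print([round(p, 3) for p in running])
```

In this example the two features that drive the target are selected first, and the running performance barely moves once the redundant and non-informative features are added.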

In Part III, we will see how this algorithm can be used to turn any regressor into a boosted-regressor that avoids redundant and non-informative features.



**References:**

- [1]<a name="paper1"></a> Samo, Y.L.K., 2021. LeanML: A Design Pattern To Slash Avoidable Wastes in Machine Learning Projects. arXiv preprint arXiv:2107.08066.
- [2]<a name="paper2"></a> Samo, Y.L.K., 2021, March. Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach. In International Conference on Artificial Intelligence and Statistics (pp. 2242-2250). PMLR.



# II. Application <a name="application"></a>

## II.1 Getting Started <a name="setup"></a>
We will use the ``kxy`` package. It requires an API key. See [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i) on how to get yours.

In [None]:
!pip install kxy -U

In [None]:
import os
import numpy as np
import pandas as pd
import pprint as pp
import kxy

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
kxy_api_key = user_secrets.get_secret('KXY_API_KEY')
os.environ['KXY_API_KEY'] = kxy_api_key

## II.2 The Set of Candidate Features We Will Be Using <a name="features"></a>

Here we generate a set of 37 candidate features, temporal and cross-sectional.

In [None]:
TRAIN_CSV = '/kaggle/input/g-research-crypto-forecasting/train.csv'
df_train = pd.read_csv(TRAIN_CSV)

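def nanmaxmmin(a, axis=None, out=None):
    ''' Range (max minus min) of a along an axis, ignoring NaNs. '''
    # Write the difference into `out` once; passing the same `out` to both
    # nanmax and nanmin would make them overwrite each other.
    return np.subtract(np.nanmax(a, axis=axis), np.nanmin(a, axis=axis), out=out)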


def get_features(df):
    ''' 
    An example function generating a candidate list of features.
    '''
    features = df[['Count', 'Open', 'High', 'Low', 'Close', \
                                    'Volume', 'VWAP','timestamp', 'Target', 'Asset_ID']].copy()
    # Upper shadow
    features['UPS'] = (df['High']-np.maximum(df['Close'], df['Open']))
    features['UPS'] = features['UPS'].astype(np.float16)
    
    # Lower shadow
    features['LOS'] = (np.minimum(df['Close'], df['Open'])-df['Low'])
    features['LOS'] = features['LOS'].astype(np.float16)
    
    # High-Low range
    features['RNG'] = ((features['High']-features['Low'])/features['VWAP'])
    features['RNG'] = features['RNG'].astype(np.float16)
    
    # Open-to-close move of the bar
    features['MOV'] = ((features['Close']-features['Open'])/features['VWAP'])
    features['MOV'] = features['MOV'].astype(np.float16)
    
    # Close vs. VWAP
    features['CLS'] = ((features['Close']-features['VWAP'])/features['VWAP'])
    features['CLS'] = features['CLS'].astype(np.float16)
    
    # Log-volume
    features['LOGVOL'] = np.log(1.+features['Volume'])
    features['LOGVOL'] = features['LOGVOL'].astype(np.float16)
    
    # Log-count
    features['LOGCNT'] = np.log(1.+features['Count'])
    features['LOGCNT'] = features['LOGCNT'].astype(np.float16)
    
    # Drop raw inputs
    features.drop(columns=['Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', \
                           'Count'], errors='ignore', inplace=True)
    
    # Previous target !!WARNING: THIS FEATURE IS NOT TRADEABLE !!
    features['PREVTARGET'] = features.groupby('Asset_ID')['Target'].shift(1)
    
    # Enrich the features dataframe with some temporal features 
    # (specifically, some stats on the last hour worth of bars)
    features = features.kxy.temporal_features(max_lag=3, \
                exclude=['LOGVOL', 'LOGCNT', 'timestamp', 'Target'],
                groupby='Asset_ID')
    
    # Enrich the features dataframe with context around the time
    # (e.g. hour, day of the week, etc.)
    time_features = features.kxy.process_time_columns(['timestamp'])
    features.drop(columns=['timestamp'], errors='ignore', inplace=True)
    features = pd.concat([features, time_features], axis=1)
    
    return features
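To make the bar-level features in `get_features` easier to read, here is the same arithmetic applied to a single hypothetical minute bar. The prices are made up for illustration.

```python
import numpy as np

# One hypothetical minute bar (values made up for illustration).
open_, high, low, close, vwap = 100.0, 103.0, 98.0, 101.0, 100.5

upper_shadow = high - np.maximum(close, open_)   # UPS: 103 - 101
lower_shadow = np.minimum(close, open_) - low    # LOS: 100 - 98
bar_range = (high - low) / vwap                  # RNG: high-low range relative to VWAP
bar_move = (close - open_) / vwap                # MOV: open-to-close move relative to VWAP
close_vs_vwap = (close - vwap) / vwap            # CLS: close relative to VWAP

print(upper_shadow, lower_shadow, bar_range, bar_move, close_vs_vwap)
```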

In [None]:
try:
    # Reading candidate features from disk
    training_features = pd.read_parquet('../input/cross-asset-featuresparquet/cross_asset_features.parquet')
    
except Exception:
    # Randomly select 10% of days to speed-up features generation
    df_train['DAYSTR'] = pd.to_datetime(df_train['timestamp'], unit='s').apply(lambda x: x.strftime("%Y%m%d"))
    all_days = list(set([_ for _ in df_train['DAYSTR'].values]))
    selected_days = np.random.choice(all_days, size=int(len(all_days)/10), replace=False)
    df_train = df_train[df_train['DAYSTR'].isin(selected_days)]
    df_train.drop(columns=['DAYSTR'], errors='ignore', inplace=True)
    # Generating candidate features
    training_features = get_features(df_train)
    # Saving to disk
    to_save = training_features.astype(np.float32)
    to_save.to_parquet('cross_asset_features.parquet')
    del to_save
training_features

In [None]:
# Printing all features
all_features = sorted([_ for _ in training_features.columns])
pp.pprint(all_features)

## II.3 Model-Free Feature Selection in a Single Line of Code <a name="one-liner"></a>
The syntax is ``features_df.kxy.variable_selection(target_column, problem_type='regression')`` and it works on any pandas DataFrame object, so long as you import the ``kxy`` package.

### II.3.a Case 1: For Cross-Asset Predictions <a name="cross-sectional"></a>
This is how to select features using ``kxy`` when you intend to build a single model to trade all cryptocurrencies.

In [None]:
cs_variable_selection_df = training_features.kxy.variable_selection('Target', problem_type='regression')

In [None]:
cs_variable_selection_df['Running Achievable Pearson Correlation'] = \
    cs_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
cs_variable_selection_df
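The cell above converts an R-squared into a Pearson correlation by taking a square root. One setting in which this identity holds exactly is a least-squares fit with an intercept, where the R-squared equals the squared correlation between prediction and target. The sketch below checks this on simulated data; whether ``kxy`` defines its achievable R-squared in exactly this way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5_000, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=5_000)

# Least-squares fit with an intercept.
design = np.column_stack([np.ones(len(y)), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
prediction = design @ coef

r_squared = 1.0 - np.sum((y - prediction) ** 2) / np.sum((y - y.mean()) ** 2)
pearson = np.corrcoef(prediction, y)[0, 1]

print(f"sqrt(R^2) = {np.sqrt(r_squared):.6f}, Pearson correlation = {pearson:.6f}")
```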

### II.3.b Case 2: For Single-Asset Predictions <a name="single-asset"></a>

This is how to select features using ``kxy`` when you intend to build one model per cryptocurrency.

In [None]:
ASSET_CSV = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'
asset_details = pd.read_csv(ASSET_CSV)
asset_details.set_index(['Asset_ID'], inplace=True)

#### Bitcoin Models

In [None]:
asset_id = asset_details[asset_details['Asset_Name']=='Bitcoin'].index[0]
df = training_features[training_features['Asset_ID']==asset_id]
btc_variable_selection_df = df.kxy.variable_selection('Target', problem_type='regression')

In [None]:
btc_variable_selection_df['Running Achievable Pearson Correlation'] = \
    btc_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
btc_variable_selection_df

#### Ethereum Models

In [None]:
asset_id = asset_details[asset_details['Asset_Name']=='Ethereum'].index[0]
df = training_features[training_features['Asset_ID']==asset_id]
eth_variable_selection_df = df.kxy.variable_selection('Target', problem_type='regression')

In [None]:
eth_variable_selection_df['Running Achievable Pearson Correlation'] = \
    eth_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
eth_variable_selection_df

#### Litecoin Models

In [None]:
asset_id = asset_details[asset_details['Asset_Name']=='Litecoin'].index[0]
df = training_features[training_features['Asset_ID']==asset_id]
ltc_variable_selection_df = df.kxy.variable_selection('Target', problem_type='regression')

In [None]:
ltc_variable_selection_df['Running Achievable Pearson Correlation'] = \
    ltc_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
ltc_variable_selection_df

#### Dogecoin Models

In [None]:
asset_id = asset_details[asset_details['Asset_Name']=='Dogecoin'].index[0]
df = training_features[training_features['Asset_ID']==asset_id]
doge_variable_selection_df = df.kxy.variable_selection('Target', problem_type='regression')

In [None]:
doge_variable_selection_df['Running Achievable Pearson Correlation'] = \
    doge_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
doge_variable_selection_df

#### Maker Models

In [None]:
asset_id = asset_details[asset_details['Asset_Name']=='Maker'].index[0]
df = training_features[training_features['Asset_ID']==asset_id]
mkr_variable_selection_df = df.kxy.variable_selection('Target', problem_type='regression')

In [None]:
mkr_variable_selection_df['Running Achievable Pearson Correlation'] = \
    mkr_variable_selection_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
mkr_variable_selection_df

### Remark

As can be seen above, **PREVTARGET** and related features seem to contribute a great deal to the highest achievable performance. This is not surprising because they have look-ahead bias. Indeed, trading bars in this competition are minute bars, but the target is calculated using log-returns over the next $15$ minutes. So, the previous target actually peeks $14$ minutes into the future.
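The overlap can be illustrated with a simulated random walk. The sketch below uses a simplified target, the plain log-return over the next 15 bars, which is an assumption made for illustration. Even though the simulated price increments are independent, and hence unpredictable, the previous target is strongly correlated with the current one purely because the two windows share 14 bars.

```python
import numpy as np

rng = np.random.default_rng(0)
horizon = 15

# Simulated random-walk log-prices sampled every minute (made-up data).
log_price = np.cumsum(rng.normal(scale=1e-3, size=50_000))

# Simplified target: target[t] = log_price[t + 15] - log_price[t].
target = log_price[horizon:] - log_price[:-horizon]

current_target = target[1:]    # target at time t
prev_target = target[:-1]      # target at time t - 1, which uses prices up to t + 14

overlap_correlation = np.corrcoef(prev_target, current_target)[0, 1]
print(f"corr(previous target, current target) = {overlap_correlation:.3f}")
```

With independent increments, the theoretical correlation between two such overlapping windows is $14/15$.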

Here's the cross-asset variable selection analysis without **PREVTARGET** features.

In [None]:
clean_features = training_features.drop(
    columns=[_ for _ in training_features.columns if 'PREVTARGET' in _], errors='ignore')
clean_feature_names = sorted([_ for _ in clean_features.columns])
pp.pprint(clean_feature_names)
clean_ca_data_valuation_df = clean_features.kxy.variable_selection(
    'Target', problem_type='regression')
clean_ca_data_valuation_df['Running Achievable Pearson Correlation'] = \
    clean_ca_data_valuation_df['Running Achievable R-Squared'].apply(lambda x: '%.2f' % np.sqrt(float(x)))
clean_ca_data_valuation_df