# TL;DR
This is the third part in a series of tutorials showcasing how to automate much of feature engineering.

- In [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), we illustrated *how to estimate the best performance you can achieve consistently (i.e. in-sample and out-of-sample) using a set of features, in a single line of code, and without training any model.*

- In [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii), we explained why, when it comes to features, sometimes less is more. Then, building on [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), *we illustrated how to filter out redundant and non-informative features in a model-free fashion.*

- In Part III (this tutorial), ***we illustrate how*** the model-free feature selection algorithm of [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii) may be used ***to shrink any model towards avoiding redundant and non-informative features.***

*Please upvote and share if you found it useful.*

# Table of Contents

- **I. [Background](#background)**
 - **I.1 [The Limits of Recursive Feature Elimination](#rfe)**
 - **I.2 [The Key To Effective Feature Selection](#key)** 
 - **I.3 [How To Add Shrinkage To Any Regressor](#solution)** 
- **II. [Application](#application)**
 - **II.1 [Getting Started With KXY](#setup)**
 - **II.2 [Experimental Setup](#exp)**
     - **II.2.a [The set of candidate features](#features)**
     - **II.2.b [How we construct training/testing sets](#split)**
 - **II.3 [How to Add Effective Feature Selection to Any Regressor Class in Python](#one-liner)**
     - **II.3.a [Training  and predicting with KXY feature selection](#train-pred)**
     - **II.3.b [Visualizing the learned features ranking](#viz)**
 - **II.4 [Effectiveness Assessment](#effectiveness)**
    - **II.4.a [High compression rate](#number)**
    - **II.4.b [Improved performance](#performance)**
    - **II.4.c [Slower performance decay](#decay)**
- **III. [Conclusion](#conclusion)**
    
    
# I. Background <a name="background"></a>

In [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii), we argued that using too many features might degrade model performance, and make models costlier to build, harder to explain, and costlier to maintain. This is because a large number of features will often include **non-informative features** and/or **redundant features**. 

**Non-informative features** are features that we can safely remove from a set of candidate features without incurring any performance loss, no matter what other features are in the set. 

**Redundant features**, on the other hand, are features that we can safely remove from a set of candidate features without incurring any performance loss, ***so long as*** some other features in the candidate set are kept.


## I.1 The Limits of RFE (*Recursive Feature Elimination*) <a name="rfe"></a>

While feature selection isn't really a new topic, commonly used approaches, such as the celebrated *Recursive Feature Elimination (RFE)*, are far from satisfactory.

A major limitation of RFE is that it cannot detect redundant features, even when seemingly powerful feature importance scores such as SHAP and permutation-based scores are used. 

As an illustration, suppose we duplicate the most important feature (as per SHAP) in a set of features and retrain our model. Provided training does not degenerate because of the dependent features, the duplicate will likely rank among the new top-2 most important features (again as per SHAP). As such, SHAP-based RFE will have a hard time eliminating the duplicate, even though it is completely redundant.

Had we used a permutation-based importance score instead, the situation would have been worse. In effect, the model would have been trained with the two features having identical values. Randomly shuffling values of one of the two features, and not values of the other, would result in a test distribution that is not representative of the true data generating distribution, which would render RFE unsound.
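To make the permutation argument concrete, here is a minimal sketch. It swaps in an ordinary least-squares model (fitted with NumPy) for the tree-based models and SHAP scores discussed above, and it uses synthetic data; both are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)

# Design matrix holding the informative feature and an exact duplicate of it
X = np.column_stack([x, x])

# Minimum-norm least squares splits the weight equally between the two copies
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(features):
    """Mean squared error of the fitted two-feature model on the given inputs."""
    return float(np.mean((features @ coef - y) ** 2))

baseline_mse = mse(X)

# Permutation importance of the duplicate: shuffle column 1 only
X_perm = X.copy()
X_perm[:, 1] = rng.permutation(X_perm[:, 1])
permutation_importance = mse(X_perm) - baseline_mse

# Dropping the duplicate and refitting costs nothing
coef_single, *_ = np.linalg.lstsq(X[:, :1], y, rcond=None)
single_mse = float(np.mean((X[:, :1] @ coef_single - y) ** 2))

print('Coefficients:', coef)
print('Permutation importance of the duplicate:', permutation_importance)
print('MSE with both copies vs. one copy:', baseline_mse, single_mse)
```

In this toy setting the shuffled rows, in which the two copies disagree, never occur in the training distribution. The duplicate therefore receives a large permutation importance even though removing it leaves the error unchanged.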


## I.2 The Key to Effective Feature Selection <a name="key"></a>

To understand what it would take to effectively trim off redundant and non-informative features, we recall two important facts.

First, as discussed in [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i), the highest performance achievable when using features vector $z=(z_1, \dots, z_d)$ to predict target $y$ is an increasing function of the mutual information $I(y; z)$.

Second, by the *mutual information chain rule*, for $d>1$ and for any permutation $\{\pi_1, \dots, \pi_d\}$ of $\{1, \dots, d \}$: $$\begin{align}I(y; z_1, \dots, z_d) &= I\left(y; z_1\right) + \sum_{i=2}^d I\left(y; z_i | z_1, \dots, z_{i-1}\right) \\ &= I\left(y; z_{\pi_1}\right) + \sum_{i=2}^d I\left(y; z_{\pi_i} | z_{\pi_1}, \dots, z_{\pi_{i-1}}\right)\end{align}.$$ 

The mutual information term $I(y; z_{\pi_1})$ reflects the highest performance that may be achieved when using feature $z_{\pi_1}$ by itself to predict target $y$.

The conditional mutual information term $I\left(y; z_{\pi_i} | z_{\pi_1}, \dots, z_{\pi_{i-1}}\right) := I\left(y; z_{\pi_1}, \dots, z_{\pi_i}\right)-I\left(y; z_{\pi_1}, \dots, z_{\pi_{i-1}}\right)$ reflects the increase in the highest performance achievable that is due *solely* to adding feature $z_{\pi_i}$ to features $(z_{\pi_1}, \dots, z_{\pi_{i-1}})$.

These two facts suggest that trimming off redundant and non-informative features amounts to looking for a special permutation $\{s_1, \dots, s_d\}$ of $\{1, \dots, d\}$ such that $$I(y; z) \approx I\left( y; z_{s_1}, \dots, z_{s_q} \right) = I\left(y; z_{s_1}\right) + \sum_{i=2}^q I\left(y; z_{s_i} | z_{s_1}, \dots, z_{s_{i-1}} \right)$$ for a small $q < d$. 
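As a sanity check on these definitions, the sketch below estimates mutual information from empirical frequencies on synthetic binary data. The plug-in estimator and the toy variables `z1`, `z2`, `z3` are assumptions for illustration; they are not the estimator used by the ``kxy`` package.

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Empirical Shannon entropy (in nats) of the joint distribution of the columns."""
    counts = np.array(list(Counter(zip(*columns)).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log(probs)))

def mutual_information(y, *features):
    """I(y; features) = H(y) + H(features) - H(y, features)."""
    return entropy(y) + entropy(*features) - entropy(y, *features)

rng = np.random.default_rng(0)
n = 20000
z1 = rng.integers(0, 2, size=n)
z2 = z1.copy()                     # redundant: exact duplicate of z1
z3 = rng.integers(0, 2, size=n)    # non-informative: independent of y
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - z1, z1)     # y is z1 with roughly 10% of values flipped

mi_z1 = mutual_information(y, z1)
mi_z2 = mutual_information(y, z2)
# Conditional mutual information as a difference, as in the definition above
cmi_z2_given_z1 = mutual_information(y, z1, z2) - mi_z1
mi_z3 = mutual_information(y, z3)

print('I(y; z1)      =', mi_z1)
print('I(y; z2)      =', mi_z2)
print('I(y; z2 | z1) =', cmi_z2_given_z1)
print('I(y; z3)      =', mi_z3)
```

The duplicate `z2` is as informative as `z1` on its own, yet its conditional mutual information given `z1` is zero, which is what makes it redundant. The independent feature `z3` has mutual information close to zero by itself, which is what makes it non-informative.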

The quantity $dI := I(y; z) - I\left( y; z_{s_1}, \dots, z_{s_q} \right) = I\left(y; z_{s_{q+1}}, \dots, z_{s_d} \vert z_{s_1}, \dots, z_{s_q} \right)$ reflects the *loss of juice* resulting from feature selection, while $1-\frac{q}{d}$ represents the associated compression rate. 

For a given *loss of juice* $dI$, the smaller $q$ (i.e. the higher the compression rate), the more effective the permutation $\{s_1, \dots, s_d\}$.
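The trade-off between the two quantities can be tabulated directly. The sketch below uses made-up cumulative mutual information values $I(y; z_{s_1}, \dots, z_{s_q})$ for $d=5$; the numbers and the helper name `selection_tradeoff` are hypothetical.

```python
def selection_tradeoff(cumulative_mi):
    """
    For each q, return (q, loss of juice dI, compression rate 1 - q/d),
    where cumulative_mi[q-1] = I(y; z_{s_1}, ..., z_{s_q}).
    """
    d = len(cumulative_mi)
    total = cumulative_mi[-1]
    return [(q, total - cumulative_mi[q - 1], 1.0 - q / d) for q in range(1, d + 1)]

# Hypothetical values, for illustration only
tradeoff = selection_tradeoff([0.30, 0.42, 0.47, 0.48, 0.48])
for q, loss, compression in tradeoff:
    print('q=%d  dI=%.2f  compression=%.0f%%' % (q, loss, 100 * compression))
```

With these numbers, keeping three of the five features gives up only about 0.01 nats while discarding 40% of the features.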

## I.3 How To Add Shrinkage To Any Regressor <a name="solution"></a>

Given a number of features $q$ or compression rate, finding the features subset $\{s_1, \dots, s_q\}$ that will yield the smallest *loss of juice* has combinatorial complexity $O(C_d^q)$! 

To appreciate how bad this is, when $d=50$ and $q=10$, we would need to estimate more than ten billion mutual information terms to find the best subset of $10$ features. 

What's worse is that, in practice, we would not know what compression rate we would afford without giving up too much performance. Searching for the right $q$ to use will add to the computational intractability.
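These counts are easy to check with the standard library (``math.comb`` requires Python 3.8 or later):

```python
import math

d, q = 50, 10

# Number of subsets of exactly q features out of d
n_subsets_q10 = math.comb(d, q)

# Number of non-empty subsets when q is also unknown
n_subsets_any_q = sum(math.comb(d, k) for k in range(1, d + 1))

print(f"{n_subsets_q10:,}")     # 10,272,278,170
print(f"{n_subsets_any_q:,}")   # 1,125,899,906,842,623
```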

To circumvent this combinatorial complexity, and inspired by the chain rule $I(y; z_1, \dots, z_d) = I\left(y; z_{\pi_1}\right) + \sum_{i=2}^d I\left(y; z_{\pi_i} | z_{\pi_1}, \dots, z_{\pi_{i-1}}\right)$, we squeeze as much mutual information as possible into the terms on the right-hand side, sequentially, one term at a time. 

The resulting greedy algorithm proceeds as follows:
 $$s_1 := \underset{k \in [1, d]}{\text{argmax}} I(y; z_k),$$ $$\forall i > 1, ~ s_i := \underset{k \in [1, d], ~ k \notin \{s_1, \dots, s_{i-1}\}}{\text{argmax}} I(y; z_k | z_{s_1}, \dots, z_{s_{i-1}}).$$

This is essentially what the model-free variable selection approach described in [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii) does. It learns the permutation $\{s_1, \dots, s_d\}$, and returns the highest performances achievable using $(z_{s_1}, \dots, z_{s_i})$ to predict $y$, namely $\bar{\rho}_{s_i} := \bar{\rho}\left( P_{y; z_{s_1}, \dots, z_{s_i}} \right)$, for all $s_i$.
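The greedy recursion above can be sketched in a few lines. The sketch below assumes that the features and target are jointly Gaussian, so that $I(y; z_S) = -\frac{1}{2}\log(1-R_S^2)$ can be computed from the sample correlation matrix. This closed form is a stand-in chosen for illustration; it is not the model-free estimator used in [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii), and the function names are hypothetical.

```python
import numpy as np

def gaussian_mi(corr, target, subset):
    """I(y; z_subset) in nats under a joint Gaussian assumption."""
    if not subset:
        return 0.0
    c_sy = corr[subset, target]
    c_ss = corr[np.ix_(subset, subset)]
    r2 = float(c_sy @ np.linalg.solve(c_ss, c_sy))
    return -0.5 * np.log(1.0 - min(r2, 1.0 - 1e-12))

def greedy_selection(z, y):
    """Returns the greedy ordering s_1, ..., s_d and the cumulative mutual information."""
    d = z.shape[1]
    corr = np.corrcoef(np.column_stack([z, y]), rowvar=False)
    selected, cumulative_mi, current = [], [], 0.0
    for _ in range(d):
        remaining = [k for k in range(d) if k not in selected]
        # Conditional mutual information of each candidate given the selected features
        gains = [gaussian_mi(corr, d, selected + [k]) - current for k in remaining]
        selected.append(remaining[int(np.argmax(gains))])
        current = gaussian_mi(corr, d, selected)
        cumulative_mi.append(current)
    return selected, cumulative_mi

rng = np.random.default_rng(0)
n = 5000
z0 = rng.normal(size=n)
z1 = z0 + 0.1 * rng.normal(size=n)   # near-duplicate of z0 (redundant)
z2 = rng.normal(size=n)              # informative and independent of z0
z3 = rng.normal(size=n)              # pure noise (non-informative)
y = z0 + 0.5 * z2 + 0.5 * rng.normal(size=n)

order, cumulative_mi = greedy_selection(np.column_stack([z0, z1, z2, z3]), y)
print('Greedy order:', order)
print('Cumulative mutual information:', np.round(cumulative_mi, 3))
```

On this synthetic data the second pick is `z2` rather than the near-duplicate, because the near-duplicate adds almost nothing once its twin has been selected. The last two features in the ordering barely move the cumulative mutual information.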

Once we have the result of the model-free variable selection algorithm of [Part II](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-ii), we can add feature selection to any regressor in such a way that strikes the right balance between keeping the number of features $q$ small and model performance. 

First we initialize $q$ to the smallest number of features required to achieve at least a certain fraction $\alpha$ (e.g. $0.9$) of the overall achievable performance, i.e. the smallest $q$ such that $\bar{\rho}\left( P_{y; z_{s_1}, \dots, z_{s_q}} \right) \geq \alpha \bar{\rho}\left( P_{y; z_{s_1}, \dots, z_{s_d}} \right).$ 

Then we train our regressor using $(z_{s_1}, \dots, z_{s_q})$ and we increase $q$, one feature at a time, until doing so no longer increases model performance on a validation set, or we hit a hard cap on the number of features we could use.
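The two steps above can be sketched as follows. The function names, the achievable performances and the validation scores are all hypothetical; in practice `validation_score(q)` would train the regressor on the first `q` selected features and score it on a validation set.

```python
def initial_q(achievable, alpha=0.9):
    """Smallest q such that achievable[q-1] >= alpha * achievable[-1]."""
    threshold = alpha * achievable[-1]
    for q, perf in enumerate(achievable, start=1):
        if perf >= threshold:
            return q
    return len(achievable)

def grow_until_no_gain(validation_score, q_start, d, max_features=None):
    """Increase q one feature at a time while the validation score improves."""
    cap = d if max_features is None else min(d, max_features)
    q = min(q_start, cap)
    best = validation_score(q)
    while q < cap:
        candidate = validation_score(q + 1)
        if candidate <= best:
            break
        q, best = q + 1, candidate
    return q, best

# Hypothetical highest performances achievable with the first 1, ..., 6 features
achievable = [0.30, 0.45, 0.52, 0.55, 0.56, 0.56]
q0 = initial_q(achievable, alpha=0.9)

# Hypothetical validation scores of the trained regressor for each q
scores = {3: 0.40, 4: 0.43, 5: 0.42, 6: 0.41}
q_final, best_score = grow_until_no_gain(scores.get, q0, d=len(achievable))
print(q0, q_final, best_score)
```

Here the search starts at three features, accepts the fourth because it improves the validation score, and stops when the fifth does not.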





**References:**

- [1]<a name="paper1"></a> Samo, Y.L.K., 2021. LeanML: A Design Pattern To Slash Avoidable Wastes in Machine Learning Projects. arXiv preprint arXiv:2107.08066.
- [2]<a name="paper2"></a> Samo, Y.L.K., 2021, March. Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach. In International Conference on Artificial Intelligence and Statistics (pp. 2242-2250). PMLR.



# II. Application <a name="application"></a>

## II.1 Getting Started with KXY <a name="setup"></a>
We will use the ``kxy`` package. It requires an API key. See [Part I](https://www.kaggle.com/leanboosting/automating-feature-engineering-part-i) on how to get yours.

In [None]:
!pip install kxy -U

In [None]:
import gc
import os
import numpy as np
import pandas as pd
import pprint as pp
import pylab as plt
import kxy

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
kxy_api_key = user_secrets.get_secret('KXY_API_KEY')
os.environ['KXY_API_KEY'] = kxy_api_key

## II.2 Experimental Setup <a name="exp"></a>
### II.2.a  The set of candidate features we will be using <a name="features"></a>

Here we generate a set of 56 candidate features, temporal and cross-sectional.

In [None]:
TRAIN_CSV = '/kaggle/input/g-research-crypto-forecasting/train.csv'

def nanmaxmmin(a, axis=None, out=None):
    ''' Range (max minus min) of an array along an axis, ignoring NaNs. '''
    return np.nanmax(a, axis=axis, out=out)-np.nanmin(a, axis=axis, out=out)


def get_features(df):
    ''' 
    An example function generating a candidate list of features.
    '''
    features = df[['Count', 'Open', 'High', 'Low', 'Close', \
        'Volume', 'VWAP','timestamp', 'Target', 'Asset_ID']].copy()
    # Upper shadow
    features['UPS'] = (df['High']-np.maximum(df['Close'], df['Open']))
    features['UPS'] = features['UPS'].astype(np.float32)
    
    # Lower shadow
    features['LOS'] = (np.minimum(df['Close'], df['Open'])-df['Low'])
    features['LOS'] = features['LOS'].astype(np.float32)
    
    # High-Low range
    features['RNG'] = ((features['High']-features['Low'])/features['VWAP'])
    features['RNG'] = features['RNG'].astype(np.float32)
    
    # Bar move (close vs. open, relative to VWAP)
    features['MOV'] = ((features['Close']-features['Open'])/features['VWAP'])
    features['MOV'] = features['MOV'].astype(np.float32)
    
    # Close vs. VWAP
    features['CLS'] = ((features['Close']-features['VWAP'])/features['VWAP'])
    features['CLS'] = features['CLS'].astype(np.float32)
    
    # Log-volume
    features['LOGVOL'] = np.log(1.+features['Volume'])
    features['LOGVOL'] = features['LOGVOL'].astype(np.float32)
    
    # Log-count
    features['LOGCNT'] = np.log(1.+features['Count'])
    features['LOGCNT'] = features['LOGCNT'].astype(np.float32)
    
    # Volume/Count
    features['VOL2CNT'] = features['Volume']/(1.+features['Count'])
    features['VOL2CNT'] = features['VOL2CNT'].astype(np.float32)
    
    # Drop raw inputs
    features.drop(columns=['Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', \
        'Count'], errors='ignore', inplace=True)
    
    # Enrich the features dataframe with some temporal features 
    # (specifically, some stats on the last hour worth of bars)
    features = features.kxy.temporal_features(max_lag=14, \
                exclude=['timestamp', 'Target'],
                groupby='Asset_ID')
    
    # Enrich the features dataframe with context around the time
    # (e.g. hour, day of the week, etc.)
    time_features = features.kxy.process_time_columns(['timestamp'])
    features.drop(columns=['timestamp'], errors='ignore', inplace=True)
    features = pd.concat([features, time_features], axis=1)
    
    return features

In [None]:
try:
    # Reading candidate features as external data
    PATH = '/kaggle/input/gresearch-features-v2' # '/kaggle/input/d/leanboosting/gresearch-features'
    training_features = pd.read_parquet('%s/2018Q1.parquet' % PATH)
except Exception:
    PATH = '/kaggle/output'
    try:
        # Reading candidate features from disk
        training_features = pd.read_parquet('%s/2018Q1.parquet' % PATH)
    except Exception:
        # Regenerating candidate features
        df_train = pd.read_csv(TRAIN_CSV)
        training_features = get_features(df_train)
        del df_train
        # Saving to disk
        months = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
        for year in range(2018, 2022):
            year_df = training_features[training_features['timestamp.YEAR()']==year]
            for q in range(1, 5):
                df = year_df[year_df['timestamp.MONTH()'].isin(months[q-1])]
                df.to_parquet('%s/%dQ%d.parquet' % (PATH, year, q))

#### Complete list of all candidate features

In [None]:
# Printing all features
all_features = sorted(training_features.columns)
pp.pprint(all_features)

### II.2.b How we construct training/testing sets <a name="split"></a>

To evaluate our approach, we build rolling training/testing sets. 

The first training set we use is the first quarter of 2018, and we use it to predict the second quarter of 2018. 

Thereafter, we build a model to predict moves in quarter $k+1$ using a training set obtained by sampling 80% of data from quarter $k$ and 20% of the training set we previously used to predict moves in quarter $k$.

This AR(1)-style of constructing training sets allows us to keep historical data going far back in the past, while avoiding sample size explosion and while giving more importance/weight to recent observations.
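To see how quickly older quarters fade under this scheme, the sketch below computes the share of the training set coming from each quarter. It assumes that all quarters contain the same number of rows, which is a simplification and does not hold exactly for the real data; the function name `quarter_weights` is hypothetical.

```python
def quarter_weights(n_updates, keep_frac=0.2, new_frac=0.8):
    """
    Share of the training set coming from each quarter (oldest first),
    assuming equally sized quarters.
    """
    weights = [1.0]  # the initial training set is one full quarter
    for _ in range(n_updates):
        # Keep 20% of the previous training set and add 80% of the new quarter
        weights = [w * keep_frac for w in weights] + [new_frac]
    total = sum(weights)
    return [w / total for w in weights]

weights = quarter_weights(n_updates=3)
print(['%.3f' % w for w in weights])
```

Under this assumption the most recent quarter always accounts for 80% of the training set, and the weight of each older quarter is divided by five with every update, while the total size of the training set stays constant.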


In [None]:
bitcoin_asset_id = 1
quarters = ['%dQ%d' % (year, quarter) for year in range(2018, 2022) \
            for quarter in range(1, 5)]
quarters.remove('2021Q4')
training_data = pd.read_parquet('%s/2018Q1.parquet' % PATH)
training_data = training_data.astype(np.float32)

for i in range(1, len(quarters)-3):  # Fold in quarters up to and including 2020Q4
    testing_data = pd.read_parquet('%s/%s.parquet' % (PATH, quarters[i]))
    testing_data = testing_data.astype(np.float32)
    # Update the training set with 20% of the previous training set and 80% of the latest quarter
    training_data = training_data.sample(frac=0.2, random_state=28)
    testing_data = testing_data.sample(frac=0.8, random_state=28)
    training_data = pd.concat([training_data, testing_data], axis=0)
    gc.collect()
    
months_df = training_data['timestamp.YEAR()'].astype(int).astype(str) + \
    training_data['timestamp.MONTH()'].astype(int).astype(str).apply(lambda x: x.zfill(2))
months_df = pd.DataFrame(months_df, columns=["Month"])
training_data.drop(columns=['timestamp.MONTH()', 'timestamp.YEAR()'], inplace=True)

# Test set
testing_data = pd.read_parquet('%s/%s.parquet' % (PATH, quarters[-3]))
testing_data = testing_data.astype(np.float32)
testing_data.drop(columns=['timestamp.MONTH()', 'timestamp.YEAR()'], inplace=True)    
testing_x = testing_data.drop(columns=['Target'])
testing_y = testing_data['Target']

bitcoin_training_data = training_data[training_data['Asset_ID']==bitcoin_asset_id]
bitcoin_testing_data = testing_data[testing_data['Asset_ID']==bitcoin_asset_id]
bitcoin_testing_x = bitcoin_testing_data.drop(columns=['Target'])
bitcoin_testing_y = bitcoin_testing_data['Target']

# Decay sets
q2_2021_data = pd.read_parquet('%s/%s.parquet' % (PATH, quarters[-2]))
q2_2021_data = q2_2021_data.astype(np.float32)
q2_2021_data.drop(columns=['timestamp.MONTH()', 'timestamp.YEAR()'], inplace=True)
    
q2_2021_x = q2_2021_data.drop(columns=['Target'])
q2_2021_y = q2_2021_data['Target']

q3_2021_data = pd.read_parquet('%s/%s.parquet' % (PATH, quarters[-1]))
q3_2021_data = q3_2021_data.astype(np.float32)
q3_2021_data.drop(columns=['timestamp.MONTH()', 'timestamp.YEAR()'], inplace=True)
    
q3_2021_x = q3_2021_data.drop(columns=['Target'])
q3_2021_y = q3_2021_data['Target']

bitcoin_q2_2021_data = q2_2021_data[q2_2021_data['Asset_ID']==bitcoin_asset_id]
bitcoin_q2_2021_x = bitcoin_q2_2021_data.drop(columns=['Target'])
bitcoin_q2_2021_y = bitcoin_q2_2021_data['Target']

bitcoin_q3_2021_data = q3_2021_data[q3_2021_data['Asset_ID']==bitcoin_asset_id]
bitcoin_q3_2021_x = bitcoin_q3_2021_data.drop(columns=['Target'])
bitcoin_q3_2021_y = bitcoin_q3_2021_data['Target']

In [None]:
to_plot = months_df.groupby("Month").size()
to_plot = to_plot/to_plot.sum()
to_plot.plot(kind="bar", figsize=(15, 10))
axis = plt.gca()
_ = axis.set_ylabel('Proportion of Observations', fontsize=15)
_ = axis.set_xlabel('Month', fontsize=15)
_ = axis.set_title('Composition of Training Data', fontsize=18)


## II.3 How to Add Effective Feature Selection to Any Regressor Class in Python <a name="one-liner"></a>

The approach described in the background section is implemented in the ``kxy`` package. 

#### Training 
The syntax is ``training_data_df.kxy.fit(target_column, learner_func, problem_type='regression')`` and it works on any pandas DataFrame object, so long as you import the ``kxy`` package. 

``learner_func`` is a function that takes a single optional parameter ``n_vars``, and returns an instance of a regressor class expecting ``n_vars`` features, and following the ``sklearn`` API (i.e. ``m.fit(x, y)`` to fit the instance, and ``m.predict(x)`` to make predictions).
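As an illustration of the interface just described, here is a minimal sketch of a ``learner_func`` built around a hand-written least-squares regressor. The class `LeastSquaresRegressor` is hypothetical and only mirrors the ``fit``/``predict`` convention stated above; we have not verified it against the ``kxy`` internals, and in practice the helper functions in ``kxy.learning`` are the safer route.

```python
import numpy as np

class LeastSquaresRegressor:
    """Hypothetical minimal regressor following the sklearn fit/predict convention."""

    def __init__(self, n_vars=None):
        self.n_vars = n_vars
        self.coef_ = None

    def fit(self, x, y):
        x = np.asarray(x, dtype=float)
        design = np.column_stack([np.ones(len(x)), x])  # add an intercept column
        self.coef_, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return np.column_stack([np.ones(len(x)), x]) @ self.coef_

def learner_func(n_vars=None):
    """Returns a fresh, unfitted regressor expecting n_vars features."""
    return LeastSquaresRegressor(n_vars=n_vars)

# Usage on synthetic data
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 1.0 + x @ np.array([2.0, -1.0])
m = learner_func(n_vars=2).fit(x, y)
predictions = m.predict(x)
```

Returning a new instance on every call matters because the number of features changes as $q$ grows, so each candidate feature set needs its own unfitted model.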

#### Predictions 
The syntax to make predictions is ``target_predictions_df = training_data_df.kxy.predict(testing_data_df)``, where ``testing_data_df`` is a dataframe with test features (and no target values), and ``target_predictions_df`` is the dataframe with a single column (whose name is the same as the target) containing predictions, and sharing the same index as ``testing_data_df``.


#### Working with existing ML libraries

The ``kxy`` package contains a range of utility functions that allow you to use any supervised learner in ``sklearn``, ``xgboost``, ``lightgbm``, ``tensorflow`` and many more frameworks as base learner. 

The code snippet below shows you how to use ``lightgbm`` regressors as base learners. 

For other libraries see ``kxy.learning.base_learners``.



In [None]:
from kxy.learning import get_lightgbm_learner_learning_api, get_sklearn_learner
# Constructing a LightGBM base regressor
params = {
    'objective': 'rmse',  
    'boosting_type': 'gbdt',
    'n_jobs': -1,
    'learning_rate': 0.1,
    'verbose': -1,
}
lightgbm_regressor_cls = get_lightgbm_learner_learning_api(params, num_boost_round=2000, \
    early_stopping_rounds=5, split_random_seed=28)
lasso_regressor_cls = get_sklearn_learner('sklearn.linear_model.LassoCV')
lr_regressor_cls = get_sklearn_learner('sklearn.linear_model.LinearRegression')


#### Experimental setup

We focus on cross-sectional predictions (i.e. we want to train a single model to predict all crypto-currencies), and bitcoin predictions for brevity.

We run our training/testing split logic until Q4 2020. We use Q1 2021 as testing set, and we use Q2 and Q3 2021 to compare performance decay with and without feature selection.

### II.3.a Training  and predicting with KXY feature selection <a name="train-pred"></a>

In [None]:
ASSET_CSV = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'
asset_details = pd.read_csv(ASSET_CSV)

# Scoring function
def gresearch_score(true_df, pred_df):
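    ''' 
    Per-asset Pearson correlation between true and predicted targets, 
    and their average weighted by asset weight.
    '''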
    res = {}
    total_weight = 0.0
    score = 0.0
    for asset in asset_details.to_dict(orient='records'):
        try:
            true_df_ = true_df[true_df['Asset_ID']==asset['Asset_ID']][['Target']]
            if not true_df_.empty:
                pred_df_ = pred_df[true_df['Asset_ID']==asset['Asset_ID']]
                w = asset['Weight']
                t = true_df_['Target'].values-np.nanmean(true_df_['Target'].values)
                p = pred_df_['Target'].values-np.nanmean(pred_df_['Target'].values)
                corr = np.nanmean(t*p)/(np.nanstd(t)*np.nanstd(p))
                if not np.isnan(corr):
                    score += corr*w
                    total_weight += w
                    res[asset['Asset_Name']] = '%.4f' % corr
        except Exception:
            continue
    score = score/total_weight
    res['Overall (Weighted)'] = '%.4f' % score
    return res

In [None]:
# Training with effective feature selection
lr_training_data = training_data.copy()
shrinkage_results = training_data.kxy.fit('Target', lightgbm_regressor_cls, \
    problem_type='regression', missing_value_imputation=True, \
    train_frac=0.8)
pp.pprint(shrinkage_results)
lr_shrinkage_results = lr_training_data.kxy.fit('Target', lr_regressor_cls, \
    problem_type='regression', missing_value_imputation=True, \
    train_frac=0.8)
pp.pprint(lr_shrinkage_results)

lr_bitcoin_training_data = bitcoin_training_data.copy()
bitcoin_shrinkage_results = bitcoin_training_data.kxy.fit('Target', 
    lightgbm_regressor_cls, problem_type='regression', \
    missing_value_imputation=True, train_frac=0.8)
pp.pprint(bitcoin_shrinkage_results)
lr_bitcoin_shrinkage_results = lr_bitcoin_training_data.kxy.fit('Target', 
    lr_regressor_cls, problem_type='regression', \
    missing_value_imputation=True, train_frac=0.8)
pp.pprint(lr_bitcoin_shrinkage_results)

# Predictions
shrinkage_predicted_y = training_data.kxy.predict(testing_x)
shrinkage_perf = gresearch_score(testing_data, shrinkage_predicted_y)

shrinkage_predicted_q2_2021 = training_data.kxy.predict(q2_2021_x)
shrinkage_predicted_q3_2021 = training_data.kxy.predict(q3_2021_x)

lr_shrinkage_predicted_y = lr_training_data.kxy.predict(testing_x)
lr_shrinkage_perf = gresearch_score(testing_data, lr_shrinkage_predicted_y)

lr_shrinkage_predicted_q2_2021 = lr_training_data.kxy.predict(q2_2021_x)
lr_shrinkage_predicted_q3_2021 = lr_training_data.kxy.predict(q3_2021_x)

bitcoin_shrinkage_predicted_y = bitcoin_training_data.kxy.predict(bitcoin_testing_x)
bitcoin_shrinkage_perf = gresearch_score(bitcoin_testing_data, \
                                         bitcoin_shrinkage_predicted_y)

bitcoin_shrinkage_predicted_q2_2021 = bitcoin_training_data.kxy.predict(bitcoin_q2_2021_x)
bitcoin_shrinkage_predicted_q3_2021 = bitcoin_training_data.kxy.predict(bitcoin_q3_2021_x)

lr_bitcoin_shrinkage_predicted_y = lr_bitcoin_training_data.kxy.predict(bitcoin_testing_x)
lr_bitcoin_shrinkage_perf = gresearch_score(bitcoin_testing_data, \
                                         lr_bitcoin_shrinkage_predicted_y)

lr_bitcoin_shrinkage_predicted_q2_2021 = lr_bitcoin_training_data.kxy.predict(bitcoin_q2_2021_x)
lr_bitcoin_shrinkage_predicted_q3_2021 = lr_bitcoin_training_data.kxy.predict(bitcoin_q3_2021_x)

### II.3.b Visualizing the learned permutations $\{s_1, \dots, s_q\}$ <a name="viz"> </a>

#### Learned $\{s_1, \dots, s_q\}$ for cross-sectional LightGBM models

In [None]:
training_data.kxy.predictor.variable_selection_results

####  Learned $\{s_1, \dots, s_q\}$ for bitcoin LightGBM models

In [None]:
bitcoin_training_data.kxy.predictor.variable_selection_results

## II.4 Effectiveness Assessment <a name="effectiveness"></a>

#### Training and testing using all 56 features


In [None]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Training using all features
m_all_features = lightgbm_regressor_cls()
feature_columns = [_ for _ in training_data.columns if _ != 'Target']
x_train = training_data[feature_columns].values
y_train = training_data['Target'].values
x_train[np.isnan(x_train)] = 0.0
y_train[np.isnan(y_train)] = 0.0
x_test = testing_data[feature_columns].values
y_test = testing_data['Target'].values
x_test[np.isnan(x_test)] = 0.0
y_test[np.isnan(y_test)] = 0.0
q2_2021_x_test = q2_2021_x[feature_columns].values
q3_2021_x_test = q3_2021_x[feature_columns].values
m_all_features.fit(x_train, y_train)
m_lasso_all_features = lasso_regressor_cls()
m_lasso_all_features.fit(x_train, y_train)
m_lr_all_features = lr_regressor_cls()
m_lr_all_features.fit(x_train, y_train)

bitcoin_all_features = lightgbm_regressor_cls()
feature_columns = [_ for _ in bitcoin_training_data.columns if _ != 'Target']
bitcoin_x_train = bitcoin_training_data[feature_columns].values
bitcoin_y_train = bitcoin_training_data['Target'].values
bitcoin_x_test = bitcoin_testing_data[feature_columns].values
bitcoin_y_test = bitcoin_testing_data['Target'].values
bitcoin_q2_2021_x_test = bitcoin_q2_2021_x[feature_columns].values
bitcoin_q3_2021_x_test = bitcoin_q3_2021_x[feature_columns].values
bitcoin_all_features.fit(bitcoin_x_train, bitcoin_y_train)
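# Fit the Bitcoin linear models on Bitcoin data (NaNs set to 0, as for the cross-sectional arrays)
bitcoin_lasso_all_features = lasso_regressor_cls()
bitcoin_lasso_all_features.fit(np.nan_to_num(bitcoin_x_train, nan=0.0), \
                               np.nan_to_num(bitcoin_y_train, nan=0.0))
bitcoin_lr_all_features = lr_regressor_cls()
bitcoin_lr_all_features.fit(np.nan_to_num(bitcoin_x_train, nan=0.0), \
                            np.nan_to_num(bitcoin_y_train, nan=0.0))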

# Prediction using all features
all_predicted_y = m_all_features.predict(x_test)
all_predicted_y = pd.DataFrame(all_predicted_y, index=testing_data.index, \
                               columns=['Target'])
all_features_perf = gresearch_score(testing_data, all_predicted_y)
lasso_all_predicted_y = m_lasso_all_features.predict(x_test)
lasso_all_predicted_y = pd.DataFrame(lasso_all_predicted_y, index=testing_data.index, \
                               columns=['Target'])
lasso_all_features_perf = gresearch_score(testing_data, lasso_all_predicted_y)
lr_all_predicted_y = m_lr_all_features.predict(x_test)
lr_all_predicted_y = pd.DataFrame(lr_all_predicted_y, index=testing_data.index, \
                               columns=['Target'])
lr_all_features_perf = gresearch_score(testing_data, lr_all_predicted_y)

all_predicted_q2_2021 = m_all_features.predict(q2_2021_x_test)
all_predicted_q3_2021 = m_all_features.predict(q3_2021_x_test)
all_predicted_q2_2021 = pd.DataFrame(all_predicted_q2_2021, \
                            index=q2_2021_data.index, columns=['Target'])
all_predicted_q3_2021 = pd.DataFrame(all_predicted_q3_2021, \
                            index=q3_2021_data.index, columns=['Target'])

lasso_all_predicted_q2_2021 = m_lasso_all_features.predict(q2_2021_x_test)
lasso_all_predicted_q3_2021 = m_lasso_all_features.predict(q3_2021_x_test)
lasso_all_predicted_q2_2021 = pd.DataFrame(lasso_all_predicted_q2_2021, \
                            index=q2_2021_data.index, columns=['Target'])
lasso_all_predicted_q3_2021 = pd.DataFrame(lasso_all_predicted_q3_2021, \
                            index=q3_2021_data.index, columns=['Target'])

lr_all_predicted_q2_2021 = m_lr_all_features.predict(q2_2021_x_test)
lr_all_predicted_q3_2021 = m_lr_all_features.predict(q3_2021_x_test)
lr_all_predicted_q2_2021 = pd.DataFrame(lr_all_predicted_q2_2021, \
                            index=q2_2021_data.index, columns=['Target'])
lr_all_predicted_q3_2021 = pd.DataFrame(lr_all_predicted_q3_2021, \
                            index=q3_2021_data.index, columns=['Target'])

bitcoin_all_predicted_y = bitcoin_all_features.predict(bitcoin_x_test)
bitcoin_all_predicted_y = pd.DataFrame(bitcoin_all_predicted_y, \
                                       index=bitcoin_testing_data.index, \
                                       columns=['Target'])
bitcoin_all_features_perf = gresearch_score(bitcoin_testing_data, \
                                            bitcoin_all_predicted_y)

bitcoin_lasso_all_predicted_y = bitcoin_lasso_all_features.predict(bitcoin_x_test)
bitcoin_lasso_all_predicted_y = pd.DataFrame(bitcoin_lasso_all_predicted_y, \
                                       index=bitcoin_testing_data.index, \
                                       columns=['Target'])
bitcoin_lasso_all_features_perf = gresearch_score(bitcoin_testing_data, \
                                            bitcoin_lasso_all_predicted_y)

bitcoin_lr_all_predicted_y = bitcoin_lr_all_features.predict(bitcoin_x_test)
bitcoin_lr_all_predicted_y = pd.DataFrame(bitcoin_lr_all_predicted_y, \
                                       index=bitcoin_testing_data.index, \
                                       columns=['Target'])
bitcoin_lr_all_features_perf = gresearch_score(bitcoin_testing_data, \
                                            bitcoin_lr_all_predicted_y)

bitcoin_all_predicted_q2_2021 = bitcoin_all_features.predict(bitcoin_q2_2021_x_test)
bitcoin_all_predicted_q2_2021 = pd.DataFrame(bitcoin_all_predicted_q2_2021, \
                            index=bitcoin_q2_2021_data.index, columns=['Target'])
bitcoin_all_predicted_q3_2021 = bitcoin_all_features.predict(bitcoin_q3_2021_x_test)
bitcoin_all_predicted_q3_2021 = pd.DataFrame(bitcoin_all_predicted_q3_2021, \
                            index=bitcoin_q3_2021_data.index, columns=['Target'])

bitcoin_lasso_all_predicted_q2_2021 = bitcoin_lasso_all_features.predict(bitcoin_q2_2021_x_test)
bitcoin_lasso_all_predicted_q2_2021 = pd.DataFrame(bitcoin_lasso_all_predicted_q2_2021, \
                            index=bitcoin_q2_2021_data.index, columns=['Target'])
bitcoin_lasso_all_predicted_q3_2021 = bitcoin_lasso_all_features.predict(bitcoin_q3_2021_x_test)
bitcoin_lasso_all_predicted_q3_2021 = pd.DataFrame(bitcoin_lasso_all_predicted_q3_2021, \
                            index=bitcoin_q3_2021_data.index, columns=['Target'])

bitcoin_lr_all_predicted_q2_2021 = bitcoin_lr_all_features.predict(bitcoin_q2_2021_x_test)
bitcoin_lr_all_predicted_q2_2021 = pd.DataFrame(bitcoin_lr_all_predicted_q2_2021, \
                            index=bitcoin_q2_2021_data.index, columns=['Target'])
bitcoin_lr_all_predicted_q3_2021 = bitcoin_lr_all_features.predict(bitcoin_q3_2021_x_test)
bitcoin_lr_all_predicted_q3_2021 = pd.DataFrame(bitcoin_lr_all_predicted_q3_2021, \
                            index=bitcoin_q3_2021_data.index, columns=['Target'])

In [None]:
# Training with RFE using the same number of features as KXY
def rfe(x, y, n_vars, x_columns):
    '''
    Recursive feature elimination: repeatedly fit a LightGBM regressor and
    drop the feature with the lowest split-based importance until n_vars remain.
    Returns the last fitted model, its importances, and the surviving columns.
    '''
    x_ = x.copy()
    x_columns_ = list(x_columns)

    while True:
        m = lightgbm_regressor_cls()
        m.fit(x_, y)
        importances = list(m._model.feature_importance(importance_type='split'))
        if x_.shape[1] <= n_vars:
            break
        # Drop the least important feature, then refit
        least_important = int(np.argmin(importances))
        x_ = np.delete(x_, least_important, axis=1)
        x_columns_.pop(least_important)

    importance_df = pd.DataFrame(sorted(zip(importances, x_columns_)), \
                                 columns=['Importance', 'Feature'])
    importance_df = importance_df.sort_values('Importance', ascending=False)

    return m, importance_df, x_columns_

In [None]:
# LightGBM + RFE (cross-sectional models)
n_features = len(training_data.kxy.predictor.selected_variables)
rfe_m, rfe_importances, rfe_selected_columns = rfe(x_train, y_train, n_features, feature_columns)

In [None]:
# LightGBM + RFE (Bitcoin models)
bitcoin_n_features = len(bitcoin_training_data.kxy.predictor.selected_variables)
bitcoin_rfe_m, bitcoin_rfe_importances, bitcoin_rfe_selected_columns = rfe(
        bitcoin_x_train, bitcoin_y_train, bitcoin_n_features, feature_columns)

In [None]:
x_test_rfe = testing_data[rfe_selected_columns].values
rfe_predicted_y = rfe_m.predict(x_test_rfe)
rfe_predicted_y = pd.DataFrame(rfe_predicted_y, \
        index=testing_data.index, columns=['Target'])
rfe_features_perf = gresearch_score(testing_data, rfe_predicted_y)

rfe_q2_2021_x_test = q2_2021_x[rfe_selected_columns].values
rfe_q3_2021_x_test = q3_2021_x[rfe_selected_columns].values

rfe_predicted_q2_2021 = rfe_m.predict(rfe_q2_2021_x_test)
rfe_predicted_q3_2021 = rfe_m.predict(rfe_q3_2021_x_test)
rfe_predicted_q2_2021 = pd.DataFrame(rfe_predicted_q2_2021, \
                            index=q2_2021_data.index, columns=['Target'])
rfe_predicted_q3_2021 = pd.DataFrame(rfe_predicted_q3_2021, \
                            index=q3_2021_data.index, columns=['Target'])

x_test_bitcoin_rfe = testing_data[bitcoin_rfe_selected_columns].values
bitcoin_rfe_predicted_y = bitcoin_rfe_m.predict(x_test_bitcoin_rfe)
bitcoin_rfe_predicted_y = pd.DataFrame(bitcoin_rfe_predicted_y, \
        index=testing_data.index, columns=['Target'])
bitcoin_rfe_features_perf = gresearch_score(testing_data, bitcoin_rfe_predicted_y)

### II.4.a High compression rate <a name="number"></a>

In [None]:
all_features = {'Bitcoin Model': x_train.shape[1], 'Cross-Sectional Model': x_train.shape[1]}

lightgbm_compression_rate = {\
    'Bitcoin Model': 1.-len(bitcoin_training_data.kxy.predictor.selected_variables)/x_train.shape[1],\
    'Cross-Sectional Model': 1.-len(training_data.kxy.predictor.selected_variables)/x_train.shape[1]}

lr_compression_rate = {\
    'Bitcoin Model': 1.-len(lr_bitcoin_training_data.kxy.predictor.selected_variables)/x_train.shape[1],\
    'Cross-Sectional Model': 1.-len(lr_training_data.kxy.predictor.selected_variables)/x_train.shape[1]}

compression_rate = pd.DataFrame([lightgbm_compression_rate, lr_compression_rate], \
                                index=['LightGBM', 'Linear Regression']).astype(float)
compression_rate.T.plot.bar(figsize=(15, 10), fontsize=15, color=['b', 'g'])
axis = plt.gca()
_ = axis.set_ylabel(r'''$1-\frac{q}{d}$''', fontsize=15)
_ = axis.set_title('KXY Feature Selection Compression Rate', fontsize=18)
for item in axis.get_legend().get_texts():
    item.set_fontsize(15)
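
The quantity plotted above is $1-\frac{q}{d}$, where $q$ is the number of features selected and $d$ the number of candidate features. As a minimal sketch of that computation, the hypothetical helper `compression_rate` below (not part of the notebook or of ``KXY``) reproduces the formula on made-up counts; the numbers are for illustration only and are not results from this tutorial.

```python
def compression_rate(n_selected, n_candidates):
    """Fraction of candidate features discarded: 1 - q/d."""
    if n_candidates <= 0:
        raise ValueError('n_candidates must be positive')
    if not 0 <= n_selected <= n_candidates:
        raise ValueError('n_selected must be between 0 and n_candidates')
    return 1.0 - n_selected / n_candidates


# Hypothetical counts, for illustration only
example_rate = compression_rate(n_selected=25, n_candidates=100)
print(example_rate)
```

A rate close to 1 means most candidate features were discarded; a rate of 0 means every feature was kept.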

### II.4.b Improved performance <a name="performance"></a>

In [None]:
lightgbm_all_results = {'Bitcoin Model': bitcoin_all_features_perf['Bitcoin'], \
               'Cross-Sectional Model': all_features_perf['Overall (Weighted)']}

lasso_results = {'Bitcoin Model': bitcoin_lasso_all_features_perf['Bitcoin'], \
               'Cross-Sectional Model': lasso_all_features_perf['Overall (Weighted)']}

lr_results = {'Bitcoin Model': bitcoin_lr_all_features_perf['Bitcoin'], \
               'Cross-Sectional Model': lr_all_features_perf['Overall (Weighted)']}

lightgbm_selected_results = {'Bitcoin Model': bitcoin_shrinkage_perf['Bitcoin'], \
    'Cross-Sectional Model': shrinkage_perf['Overall (Weighted)']}

lr_selected_results = {'Bitcoin Model': lr_bitcoin_shrinkage_perf['Bitcoin'], \
    'Cross-Sectional Model': lr_shrinkage_perf['Overall (Weighted)']}

lightgbm_rfe_results = {'Bitcoin Model': bitcoin_rfe_features_perf['Bitcoin'], \
    'Cross-Sectional Model': rfe_features_perf['Overall (Weighted)']}



The figure below illustrates that, when using linear models, ``KXY``'s feature selection algorithm is competitive with LASSO.

In [None]:
cross_sectional_results = pd.DataFrame([lr_results, lasso_results, lr_selected_results], \
                    index=['Linear Regression Using All Features', \
                        'Linear Regression With LASSO Feature Selection', \
                        'Linear Regression With KXY Feature Selection']).astype(float)
cross_sectional_results.T.plot.bar(figsize=(15, 10), fontsize=15, \
                                   color=[ 'r', 'y', 'g'])
axis = plt.gca()
_ = axis.set_ylabel('Pearson Correlation', fontsize=15)
_ = axis.set_title('Effect of Feature Selection in Linear Models on Testing Performance\n'
                   '(Training Data Until Q4 2020, Testing Q1 2021)', fontsize=18)
for item in axis.get_legend().get_texts():
    item.set_fontsize(15)

The figure below illustrates that ``KXY``'s feature selection algorithm is more effective than RFE when using LightGBM, and that it yields better performance than not using any feature selection.

In [None]:
cross_sectional_results = pd.DataFrame([lightgbm_all_results, lightgbm_rfe_results, \
                                        lightgbm_selected_results], \
                    index=['LightGBM Using All Features', \
                        'LightGBM With RFE Feature Selection', \
                        'LightGBM With KXY Feature Selection']).astype(float)
cross_sectional_results.T.plot.bar(figsize=(15, 10), fontsize=15, \
                                   color=['r', 'y', 'g'])
axis = plt.gca()
_ = axis.set_ylabel('Pearson Correlation', fontsize=15)
_ = axis.set_title('Effect of Feature Selection in LightGBM on Testing Performance\n'
                   '(Training Data Until Q4 2020, Testing Q1 2021)', fontsize=18)
for item in axis.get_legend().get_texts():
    item.set_fontsize(15)
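
The scores compared above are read from the `'Overall (Weighted)'` entry returned by `gresearch_score` and plotted as Pearson correlations. Assuming that entry is a weighted Pearson correlation between targets and predictions (an assumption based on the key name and axis label, not on the function's code), the sketch below shows how such a score could be computed. The helper `weighted_pearson` is hypothetical and is not the notebook's scoring function.

```python
import math


def weighted_pearson(y_true, y_pred, weights):
    """Weighted Pearson correlation between two equally long sequences."""
    total = float(sum(weights))
    # Weighted means
    mean_t = sum(w * a for w, a in zip(weights, y_true)) / total
    mean_p = sum(w * b for w, b in zip(weights, y_pred)) / total
    # Weighted covariance and variances
    cov = sum(w * (a - mean_t) * (b - mean_p)
              for w, a, b in zip(weights, y_true, y_pred)) / total
    var_t = sum(w * (a - mean_t) ** 2 for w, a in zip(weights, y_true)) / total
    var_p = sum(w * (b - mean_p) ** 2 for w, b in zip(weights, y_pred)) / total
    if var_t == 0 or var_p == 0:
        raise ValueError('correlation is undefined for a constant input')
    return cov / math.sqrt(var_t * var_p)


# Toy inputs, for illustration only: predictions are an exact linear
# function of the targets, so the correlation is (numerically) 1.
print(weighted_pearson([1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 1, 2]))
```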


### II.4.c Slower performance decay <a name="decay"></a>

In [None]:
# Q2 2021 Performance
shrinkage_perf_q2_2021 = gresearch_score(q2_2021_data, shrinkage_predicted_q2_2021)
all_perf_q2_2021 = gresearch_score(q2_2021_data, all_predicted_q2_2021)
lasso_all_perf_q2_2021 = gresearch_score(q2_2021_data, lasso_all_predicted_q2_2021)
lr_all_perf_q2_2021 = gresearch_score(q2_2021_data, lr_all_predicted_q2_2021)
lr_shrinkage_perf_q2_2021 = gresearch_score(q2_2021_data, lr_shrinkage_predicted_q2_2021)
rfe_features_q2_2021_perf = gresearch_score(q2_2021_data, rfe_predicted_q2_2021)

# Q3 2021 Performance
shrinkage_perf_q3_2021 = gresearch_score(q3_2021_data, shrinkage_predicted_q3_2021)
all_perf_q3_2021 = gresearch_score(q3_2021_data, all_predicted_q3_2021)
lasso_all_perf_q3_2021 = gresearch_score(q3_2021_data, lasso_all_predicted_q3_2021)
lr_all_perf_q3_2021 = gresearch_score(q3_2021_data, lr_all_predicted_q3_2021)
lr_shrinkage_perf_q3_2021 = gresearch_score(q3_2021_data, lr_shrinkage_predicted_q3_2021)
rfe_features_q3_2021_perf = gresearch_score(q3_2021_data, rfe_predicted_q3_2021)

In [None]:
# Cross-section models
lightgbm_all_results = {'2021Q1': all_features_perf['Overall (Weighted)'], \
               '2021Q2': all_perf_q2_2021['Overall (Weighted)'], \
               '2021Q3': all_perf_q3_2021['Overall (Weighted)']}

lightgbm_selected_results = {'2021Q1': shrinkage_perf['Overall (Weighted)'], \
               '2021Q2': shrinkage_perf_q2_2021['Overall (Weighted)'], \
               '2021Q3': shrinkage_perf_q3_2021['Overall (Weighted)']}

lasso_results = {'2021Q1': lasso_all_features_perf['Overall (Weighted)'], \
               '2021Q2': lasso_all_perf_q2_2021['Overall (Weighted)'], \
               '2021Q3': lasso_all_perf_q3_2021['Overall (Weighted)']}

lr_results = {'2021Q1': lr_all_features_perf['Overall (Weighted)'], \
               '2021Q2': lr_all_perf_q2_2021['Overall (Weighted)'], \
               '2021Q3': lr_all_perf_q3_2021['Overall (Weighted)']}

lr_selected_results = {'2021Q1': lr_shrinkage_perf['Overall (Weighted)'], \
               '2021Q2': lr_shrinkage_perf_q2_2021['Overall (Weighted)'], \
               '2021Q3': lr_shrinkage_perf_q3_2021['Overall (Weighted)']}

lightgbm_rfe_results = {'2021Q1': rfe_features_perf['Overall (Weighted)'], \
               '2021Q2': rfe_features_q2_2021_perf['Overall (Weighted)'], \
               '2021Q3': rfe_features_q3_2021_perf['Overall (Weighted)']}

The figure below illustrates that, when using linear models, the *juice* in features selected by ``KXY`` lasts longer than the *juice* in features selected by (cross-validated) LASSO.

In [None]:
cross_sectional_results = pd.DataFrame([lr_results, lasso_results, lr_selected_results], \
            index=['No Feature Selection', \
                    'LASSO Feature Selection', \
                    'KXY Feature Selection']).astype(float)
cross_sectional_results.T.plot.bar(figsize=(15, 10), fontsize=15, color=['r', 'y', 'g'])
axis = plt.gca()
_ = axis.set_ylabel('Pearson Correlation', fontsize=15)
_ = axis.set_title(\
    'Effect of Feature Selection in Linear Models on Performance Decay\n(Training Data Until Q4 2020)',\
    fontsize=18)
for item in axis.get_legend().get_texts():
    item.set_fontsize(15)

The figure below illustrates that LightGBM with ``KXY``'s feature selection significantly outperforms both LightGBM with RFE and LightGBM without feature selection, two quarters in a row after model training.

In [None]:
cross_sectional_results = pd.DataFrame([lightgbm_all_results, lightgbm_rfe_results, \
                                        lightgbm_selected_results], \
            index=['No Feature Selection', \
                    'RFE Feature Selection', \
                    'KXY Feature Selection']).astype(float)
cross_sectional_results.T.plot.bar(figsize=(15, 10), fontsize=15, color=['r', 'y', 'g'])
axis = plt.gca()
_ = axis.set_ylabel('Pearson Correlation', fontsize=15)
_ = axis.set_title(\
    'Effect of Feature Selection in LightGBM on Performance Decay\n(Training Data Until Q4 2020)',\
    fontsize=18)
for item in axis.get_legend().get_texts():
    item.set_fontsize(15)
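
One simple way to put a number on performance decay is to express each quarter's score as a fraction of the first out-of-sample quarter's score. The sketch below does this for a dictionary shaped like `lightgbm_selected_results`. The helper `retention_vs_first` is hypothetical, and the scores are placeholders chosen for easy arithmetic, not results from this notebook.

```python
def retention_vs_first(scores_by_period):
    """Ratio of each period's score to the first period's score.

    Relies on dictionaries preserving insertion order (Python 3.7+).
    """
    periods = list(scores_by_period)
    base = scores_by_period[periods[0]]
    if base == 0:
        raise ValueError('the first period score must be non-zero')
    return {period: scores_by_period[period] / base for period in periods}


# Placeholder scores, for illustration only
hypothetical_scores = {'2021Q1': 0.08, '2021Q2': 0.04, '2021Q3': 0.02}
print(retention_vs_first(hypothetical_scores))
```

A model whose ratios stay close to 1 decays slowly. The ratio is only meaningful when the first-quarter score is positive.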

# III. Conclusion <a name="conclusion"></a>

*We propose an approach to seamlessly add effective feature selection to any regressor in Python in a single line of code.*

We show that, in this competition, our approach is competitive with LASSO when the regressor is linear, and achieves a 20% reduction in the number of features.

We find that tree-based regressors (LightGBM to be specific) have the potential to outperform linear models in this competition. However, to unlock this potential, it is crucial to mitigate overfitting with effective feature selection.

We find that the ``KXY`` feature selection approach significantly outperforms RFE, and that RFE performs similarly to not including any feature selection (in this competition).

A major downside of non-linear models in this competition is that they tend to decay faster than linear models. However, thanks to ``KXY``'s feature selection, LightGBM consistently outperforms linear models up to 6 months after model training.

## III.1 Preparing Submission

In [None]:
# Construct training data
submission_training_data = pd.read_parquet('%s/2018Q1.parquet' % PATH)
submission_training_data = submission_training_data.astype(np.float32)

for i in range(1, len(quarters)-1):
    testing_data = pd.read_parquet('%s/%s.parquet' % (PATH, quarters[i]))
    testing_data = testing_data.astype(np.float32)
    # Update the training set: keep 20% of the accumulated rows and add 80% of this quarter's rows
    submission_training_data = submission_training_data.sample(frac=0.2, random_state=28)
    testing_data = testing_data.sample(frac=0.8, random_state=28)
    submission_training_data = pd.concat([submission_training_data, testing_data], axis=0)
    gc.collect()
    
months_df = submission_training_data['timestamp.YEAR()'].astype(int).astype(str) + \
    submission_training_data['timestamp.MONTH()'].astype(int).astype(str).apply(lambda x: x.zfill(2))
months_df = pd.DataFrame(months_df, columns=["Month"])
submission_training_data.drop(columns=['timestamp.MONTH()', 'timestamp.YEAR()'], inplace=True)
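
The loop above keeps a 20% sample of the accumulated rows and an 80% sample of each new quarter, so older quarters are down-weighted geometrically. The sketch below computes the expected fraction of each quarter's rows that survives the whole loop. The helper `expected_retention` is hypothetical (not part of the notebook), and the fractions are expectations: the actual row counts depend on how `sample` rounds.

```python
def expected_retention(n_quarters_used, keep_old=0.2, keep_new=0.8):
    """Expected fraction of each quarter's rows kept by the resampling loop.

    n_quarters_used counts the initial quarter (loaded in full) plus every
    quarter visited by the loop. Entries are ordered oldest to most recent.
    """
    fractions = [1.0]  # the initial quarter starts out fully included
    for _ in range(1, n_quarters_used):
        # Older rows are subsampled, then the new quarter is appended
        fractions = [keep_old * f for f in fractions]
        fractions.append(keep_new)
    return fractions


# With 4 quarters: roughly 0.8%, 3.2%, 16% and 80% of each quarter's rows
print(expected_retention(4))
```

Under these sampling fractions, the most recent quarter dominates the training set, and each earlier quarter contributes about five times fewer rows than the one after it (the very first quarter being the exception, since it is never subsampled at 80%).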

In [None]:
# Fit a LightGBM regressor with KXY feature selection
submission_results = submission_training_data.kxy.fit('Target', lightgbm_regressor_cls, \
    problem_type='regression', missing_value_imputation=True, \
    train_frac=0.8)

In [None]:
# Save selected variables and the trained model
import lightgbm
import pickle as pkl
# Use a context manager so the file is flushed and closed after writing
with open('selected_variables.sav', 'wb') as f:
    pkl.dump(submission_training_data.kxy.predictor.selected_variables, f)
model = submission_training_data.kxy.predictor.models[0]._model
model.save_model('lightgbm_regressor.sav', num_iteration=model.best_iteration)
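
At submission time, the two files saved above need to be loaded back. The sketch below shows one way to do so; the helper names `load_selected_variables` and `load_lightgbm_model` are hypothetical. It assumes the model file was written by LightGBM's `save_model`, as in the cell above, and relies on `lightgbm.Booster(model_file=...)` to restore it.

```python
import pickle as pkl


def load_selected_variables(path='selected_variables.sav'):
    """Load the list of selected feature names saved with pickle."""
    with open(path, 'rb') as f:
        return pkl.load(f)


def load_lightgbm_model(path='lightgbm_regressor.sav'):
    """Load a LightGBM booster previously written by save_model."""
    import lightgbm  # imported lazily so the helper above works without it
    return lightgbm.Booster(model_file=path)
```

A prediction would then look like `load_lightgbm_model().predict(df[load_selected_variables()].values)`, where `df` is a hypothetical DataFrame holding the same engineered features. Any preprocessing applied during training (for example, missing value imputation) would need to be reproduced before this call.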