# Feature selection with Boruta-SHAP to increase your score

![immagine.png](attachment:941cf5aa-8564-475b-ac05-5e2505a85605.png)

#### This competition has quite a few features. Is there a way to eliminate the useless ones?

Feature selection can improve your results in two ways:
 1. it lets you train new, useful models that others don't have in their ensembles
 2. it makes your models run better

Gradient boosting performs a form of feature selection on its own, since the trees split only on significant features (or at least they should). In practice this is not always true, as noisy, irrelevant splits sometimes appear in the trees. Moreover, carrying useless features slows down training.

More generally, the widely recognized benefits of feature selection are:

* simpler models that are easier to interpret
* shorter training times
* avoiding the curse of dimensionality
* better generalization by reducing overfitting (reduction of variance)

Boruta-SHAP is a package combining Boruta (https://github.com/scikit-learn-contrib/boruta_py), a feature selection method based on repeated tests of the importance of a feature in a model, with the interpretability method SHAP (https://christophm.github.io/interpretable-ml-book/shap.html).

Boruta-SHAP, developed by Eoghan Keany (https://github.com/Ekeany/Boruta-Shap), is extremely simple to use: take your best model, let Boruta-SHAP run on it for a while, and evaluate the results!

P.S. You can read more about Boruta-SHAP in this Medium article by the author: https://medium.com/analytics-vidhya/is-this-the-best-feature-selection-algorithm-borutashap-8bc238aa1677
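
To make the idea behind Boruta more concrete, here is a minimal sketch of its core loop on toy data. It is not the package's implementation: the data, the feature names, and the `toy_importance` function (absolute correlation with the target, standing in for the SHAP values of a fitted model) are all assumptions made for illustration. Each trial shuffles every column to build "shadow" features, and a real feature scores a hit when it beats the best shadow. The hit counts are then compared against a Binomial(n, 0.5) distribution, which is how Boruta-style methods typically decide which features to keep.

```python
from math import comb

import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): two informative columns and one pure-noise column.
n_rows = 500
X = rng.normal(size=(n_rows, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows) > 0).astype(float)
feature_names = ["signal_strong", "signal_weak", "noise"]


def toy_importance(features, target):
    """Stand-in importance: absolute Pearson correlation with the target.

    Boruta-SHAP would use SHAP values from a fitted model at this step.
    """
    centered_x = features - features.mean(axis=0)
    centered_y = target - target.mean()
    cov = centered_x.T @ centered_y
    denom = np.sqrt((centered_x ** 2).sum(axis=0) * (centered_y ** 2).sum())
    return np.abs(cov / denom)


def binom_upper_tail(n_hits, n):
    """P(at least n_hits successes in n fair coin flips)."""
    return sum(comb(n, k) for k in range(n_hits, n + 1)) / 2 ** n


n_trials = 50
n_features = X.shape[1]
hits = np.zeros(n_features, dtype=int)

for _ in range(n_trials):
    # Shadow features: every column shuffled on its own, which destroys any
    # link with the target while keeping the column's distribution.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    importances = toy_importance(np.hstack([X, shadows]), y)
    real, shadow = importances[:n_features], importances[n_features:]
    # A real feature scores a hit when it beats the best shadow feature.
    hits += real > shadow.max()

for name, count in zip(feature_names, hits):
    p_value = binom_upper_tail(int(count), n_trials)
    print(f"{name}: {count}/{n_trials} hits, upper-tail p-value = {p_value:.3g}")
```

A small p-value means the feature beat the shadows far more often than chance would allow. Real implementations also correct for multiple testing and use the opposite tail to reject features outright, which this sketch leaves out.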

#### Let's start by loading packages and data

In [None]:
!pip install BorutaShap

In [None]:
!pip install scikit-learn -U

In [None]:
# Importing core libraries
import numpy as np
import pandas as pd
from time import time
import pprint
import joblib
from functools import partial

# Suppressing warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# Feature selection
from BorutaShap import BorutaShap
from xgboost import XGBClassifier

# Validation
from sklearn.model_selection import KFold, StratifiedKFold

In [None]:
# Derived from the original script https://www.kaggle.com/gemartin/load-data-reduce-memory-usage 
# by Guillaume Martin

def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns in place to smaller dtypes that still cover their value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
## Loading data 
X_train = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv").set_index('id')
X_test = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv").set_index('id')

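# Feature engineering
# Columns with fewer than 10 distinct values in the first 1,000 rows are
# treated as categorical; every other test column is treated as numeric.
unique_values = X_train.iloc[:1000].nunique()
categoricals = [col for col in unique_values.index[unique_values < 10] if col != 'target']
numeric = [col for col in X_test.columns if col not in categoricals]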

X_train['mean_numeric'] = X_train[numeric].mean(axis=1)
X_train['std_numeric'] = X_train[numeric].std(axis=1)
X_train['min_numeric'] = X_train[numeric].min(axis=1)
X_train['max_numeric'] = X_train[numeric].max(axis=1)
X_train['sum_categoricals'] = X_train[categoricals].sum(axis=1)

X_test['mean_numeric'] = X_test[numeric].mean(axis=1)
X_test['std_numeric'] = X_test[numeric].std(axis=1)
X_test['min_numeric'] = X_test[numeric].min(axis=1)
X_test['max_numeric'] = X_test[numeric].max(axis=1)
X_test['sum_categoricals'] = X_test[categoricals].sum(axis=1)

# Subsample the training set to speed things up
X_train = X_train.sample(n=200_000, random_state=0)

# target
y_train = X_train.target
X_train = X_train.drop('target', axis='columns')

#### Now we pick our best model and let Boruta-SHAP run a number of trials (50 are usually enough) before collecting the results.

#### We cross-validate the experiments to make sure that we are indeed picking the right variables.

#### Once the results are ready, we can plot them to visualize the Z-score intervals of our features. These intervals tell us how confident the algorithm is in selecting or rejecting each feature.

#### Please notice that the last two features in the plot are noisy features used by Boruta-SHAP to figure out which features are important. Clearly, they are non-significant.

#### Cross-validation takes time. Meanwhile, we can grab a cup of coffee and relax while Boruta-SHAP does all the heavy lifting.

![immagine.png](attachment:8530c9d8-3db0-4a8e-b4a0-6d5e5c1cf18f.png)

In [None]:
folds = 5
kf = KFold(n_splits=folds,
           shuffle=True, 
           random_state=0)

selected_columns = list()
    
for k, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
    
    print(f"FOLD {k+1}/{folds}")
    
    model = XGBClassifier(
        colsample_bytree= 0.50, 
        subsample= 0.50, 
        learning_rate= 0.012, 
        max_depth= 3, 
        min_child_weight= 252,
        n_estimators= 1000,
        random_state=0,
        use_label_encoder=False,
        objective='binary:logistic',
        eval_metric='auc',
        tree_method='gpu_hist',
        gpu_id=0,
        predictor='gpu_predictor'
     )
    
    Feature_Selector = BorutaShap(model=model,
                                  importance_measure='shap', 
                                  classification=True)

    Feature_Selector.fit(X=X_train.iloc[train_idx, :], y=y_train.iloc[train_idx], 
                         n_trials=50, random_state=0)
    
    Feature_Selector.plot(which_features='all', figsize=(24,12))
    
    selected_columns.append(sorted(Feature_Selector.Subset().columns))
    
    print(f"Selected features at fold {k+1} are: {selected_columns[-1]}")

#### Here we finally have a good set of features to use in this competition: the union of the features selected in each fold (at least for XGBoost; it is better to test other algorithms separately)

In [None]:
final_selection = sorted({item for selection in selected_columns for item in selection})
print(final_selection)
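
The cell above keeps every feature that was selected in at least one fold. If you prefer a stricter rule, one possible alternative is to count in how many folds each feature was selected and keep only those chosen by a majority (or by all) of the folds. The sketch below assumes a list shaped like `selected_columns`; the feature names `f1` to `f4` are hypothetical placeholders, not columns from the competition data.

```python
from collections import Counter

# Hypothetical per-fold selections, shaped like `selected_columns` above.
selected_columns = [
    ["f1", "f2", "f3"],
    ["f1", "f2"],
    ["f1", "f3", "f4"],
    ["f1", "f2", "f3"],
    ["f1", "f2"],
]

n_folds = len(selected_columns)

# Number of folds in which each feature was selected.
votes = Counter(col for selection in selected_columns for col in selection)

# Same result as the set comprehension used in the notebook.
union = sorted(votes)
# Features selected in more than half of the folds.
majority = sorted(col for col, n in votes.items() if n > n_folds / 2)
# Features selected in every fold.
unanimous = sorted(col for col, n in votes.items() if n == n_folds)

print("union:    ", union)
print("majority: ", majority)
print("unanimous:", unanimous)
```

Which rule works best is an empirical question: the union is the most permissive choice, while the stricter rules trade a smaller feature set for the risk of dropping something useful.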

## Happy Kaggling!