Hello everyone,

Following the notebook [Comprehensive Guide on Feature Selection](https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection) with a very detailed explanation of various feature selection procedures, here I show how we can simplify the feature selection procedure utilising a relatively new open source Python library called [Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html).

The latest version of [Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html) features several methods to [select features](https://feature-engine.readthedocs.io/en/latest/selection/index.html) that are not available in other libraries at the moment.

[Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html) classes preserve Scikit-learn functionality with the methods **fit** and **transform** to first learn the parameters from the data, and then transform the data utilizing those parameters.
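To make the **fit** / **transform** idea concrete, here is a minimal sketch of the pattern using plain pandas. The class `DropAllZeroColumns` is a hypothetical toy selector written only for this illustration; it is not part of Feature-engine.

```python
import pandas as pd


class DropAllZeroColumns:
    """Hypothetical toy selector (not a Feature-engine class)."""

    def fit(self, X, y=None):
        # learn the parameters: which columns contain only zeros in X
        self.features_to_drop_ = [c for c in X.columns if (X[c] == 0).all()]
        return self

    def transform(self, X):
        # apply the learned parameters to any dataset
        return X.drop(columns=self.features_to_drop_)


train = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})
test = pd.DataFrame({"a": [4, 5], "b": [0, 7]})

selector = DropAllZeroColumns().fit(train)
train_t = selector.transform(train)
test_t = selector.transform(test)

print(selector.features_to_drop_)  # ['b']
print(list(test_t.columns))  # ['a']
```

Note that `b` is also removed from the test set, even though it is not all zeros there: the decision is learned from the train set only and then applied as is.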

By selecting features, we can build simpler, faster and more interpretable machine learning models.
 

## Table of Contents

- Remove constant and quasi-constant features
- Remove duplicated features
- Remove correlated features with a brute force approach or selecting features smartly
- Select important features by feature shuffling
- Select features based on a univariate model performance
- Select features recursively
- Build an entire machine learning pipeline followed by a machine learning model

I hope you find this kernel useful and if you do, your **UPVOTES** will be highly appreciated.


## Additional Resources

- [Feature Selection for Machine Learning](https://www.courses.trainindata.com/p/feature-selection-for-machine-learning) - Online Course
- [Feature Selection for Machine Learning: A Comprehensive Overview](https://trainindata.medium.com/feature-selection-for-machine-learning-a-comprehensive-overview-bd571db5dd2d) - Article
- [Feature Selection with Feature-engine](https://feature-engine.readthedocs.io/en/latest/selection/index.html) - Package Documentation


In [None]:
# let's install Feature-engine

!pip install feature-engine

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# import selection classes from Feature-engine

from feature_engine.selection import (
    DropConstantFeatures,
    DropDuplicateFeatures,
    DropCorrelatedFeatures,
    SmartCorrelatedSelection,
    SelectByShuffling,
    SelectBySingleFeaturePerformance,
    RecursiveFeatureElimination,
)

In [None]:
# load the Santander customer satisfaction dataset

data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

In [None]:
data.head()

In [None]:
# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [None]:
# check if there is missing data (this dataset does not contain NAs,
# as we will see in the empty list output)

[x for x in X_train.columns if X_train[x].isnull().sum() > 0]

## Remove constant features

Constant features are those which contain only 1 value for all the observations.

[DropConstantFeatures](https://feature-engine.readthedocs.io/en/latest/selection/DropConstantFeatures.html)
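As a rough idea of what "constant" means in practice, here is a small pandas sketch on made-up data. It only illustrates the definition above and is not the library's implementation.

```python
import pandas as pd

# toy data: "b" holds a single value in every row
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [7, 7, 7, 7], "c": [0, 0, 1, 0]})

# a column is constant when it has exactly one distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

print(constant_cols)  # ['b']
```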

In [None]:
# with tol=1 we tell the transformer to remove constant features
constant = DropConstantFeatures(tol=1)

# finds the constant features on the train set
constant.fit(X_train)

In [None]:
# the constant features can be found in the attribute
# features_to_drop_

len(constant.features_to_drop_)

In [None]:
# show the names of the first 3 constant features

constant.features_to_drop_[0:3]

In [None]:
# check that the feature is indeed constant (that is,
# it has only 1 value in all the observations)

X_train['ind_var2_0'].unique()

In [None]:
# remove constant features - transform method

print('Number of variables before removing constant: ', X_train.shape[1])

X_train = constant.transform(X_train)
X_test = constant.transform(X_test)

print('Number of variables after removing constant: ', X_train.shape[1])

## Remove Quasi-constant features

Quasi-constant features are those that show the same value in most of the observations in the dataset.
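The same idea can be sketched with pandas by looking at the share of rows taken by the most frequent value of each column. The data and the threshold `tol = 0.8` below are made up for the illustration, and the strict "more than" comparison is an assumption of this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "mostly_zero": [0] * 9 + [1],  # the value 0 appears in 90% of the rows
    "balanced": [0, 1] * 5,        # each value appears in 50% of the rows
})

tol = 0.8  # hypothetical threshold, chosen for this toy example

# share of rows taken by the most frequent value of each column
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# columns where one value shows in more than `tol` of the rows
quasi_constant_cols = top_share[top_share > tol].index.tolist()

print(quasi_constant_cols)  # ['mostly_zero']
```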

In [None]:
# with tol=0.998 we tell the transformer that we want to remove
# all features that show the same value in more than 99.8% of the
# observations in the dataset

quasi_constant = DropConstantFeatures(tol=0.998)

# find quasi-constant features in the train set
quasi_constant.fit(X_train)

In [None]:
# the quasi-constant features can be found in the attribute
# features_to_drop_

len(quasi_constant.features_to_drop_)

In [None]:
# show the names of the first 3 quasi-constant features

quasi_constant.features_to_drop_[0:3]

In [None]:
# we can evaluate the percentage of observations that show
# each value

X_train['imp_op_var40_efect_ult1'].value_counts() / len(X_train)

We can see that most of the observations show the value 0.0. Very few take a different value.

In [None]:
# remove quasi-constant features - transform method

print('Number of variables before removing quasi-constant: ', X_train.shape[1])

X_train = quasi_constant.transform(X_train)
X_test = quasi_constant.transform(X_test)

print('Number of variables after removing quasi-constant: ', X_train.shape[1])

## Remove duplicated features

Duplicated features are those that contain identical values in every observation.

[DropDuplicateFeatures](https://feature-engine.readthedocs.io/en/latest/selection/DropDuplicateFeatures.html)
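One simple way to spot identical columns with pandas alone is to transpose the dataframe and look for repeated rows. This is a sketch on toy data, not how the transformer works internally, and transposing can be slow on wide datasets.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 0, 1, 1],
    "x_copy": [1, 0, 1, 1],  # identical to "x"
    "y": [3, 2, 1, 0],
})

# transpose so that columns become rows, then flag rows that repeat an earlier one
is_duplicate = toy.T.duplicated(keep="first")

duplicated_cols = is_duplicate[is_duplicate].index.tolist()
deduplicated = toy.drop(columns=duplicated_cols)

print(duplicated_cols)  # ['x_copy']
```

The first column of each group of duplicates is kept and the later copies are dropped.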

In [None]:
duplicates = DropDuplicateFeatures()

# find duplicated features in the train set
duplicates.fit(X_train)

In [None]:
# the groups of identical variables can be seen in the
# attribute duplicated_feature_sets_

duplicates.duplicated_feature_sets_

In [None]:
# we can go ahead and check that these variables are indeed identical
# take for example the first pair in the above cell

X_train['ind_var26'].equals(X_train['ind_var26_0'])

In [None]:
# inspect the values of some observations

X_train[['ind_var26','ind_var26_0']].head()

In [None]:
# in the attribute features_to_drop_ we find the variables
# from the groups of duplicates that will be dropped

# the transformer only leaves 1 variable per group and removes
# the rest.

duplicates.features_to_drop_

In [None]:
# remove duplicates - transform method

print('Number of variables before removing duplicates: ', X_train.shape[1])

X_train = duplicates.transform(X_train)
X_test = duplicates.transform(X_test)

print('Number of variables after removing duplicates: ', X_train.shape[1])

## Drop Correlated features

Brute force approach. We will use a class that removes correlated features on a first-come, first-served basis.

[DropCorrelatedFeatures](https://feature-engine.readthedocs.io/en/latest/selection/DropCorrelatedFeatures.html)
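To illustrate what "first-come, first-served" can look like, here is a sketch on toy data: we walk through the columns in order and drop every later column that is correlated with one we have already kept. The use of absolute correlation values and the exact scanning order are assumptions of this sketch, not a description of the library's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "a_scaled": [3, 5, 7, 9, 11, 13, 15, 17],  # 2 * a + 1, perfectly correlated with "a"
    "b": [1, -1, -1, 1, 1, -1, -1, 1],         # uncorrelated with "a"
})

threshold = 0.8
corr = toy.corr(method="pearson").abs()

to_drop = []
for i, col in enumerate(corr.columns):
    if col in to_drop:
        continue  # already removed, so it cannot "claim" other features
    # compare the retained feature only with the features that come after it
    for other in corr.columns[i + 1:]:
        if other not in to_drop and corr.loc[col, other] > threshold:
            to_drop.append(other)

print(to_drop)  # ['a_scaled']
```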

In [None]:
# if variables is set to None, the transformer will examine all variables
# we can choose the correlation method to use (pearson, spearman or kendall)
# and the correlation threshold

correlated = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.8)

# find correlated variables in the train set
correlated.fit(X_train)

In [None]:
# in the attribute correlated_feature_sets_ we find the 
# variables that are correlated with each other

# note that several variables can be correlated with each other

correlated.correlated_feature_sets_

In [None]:
# let's plot a correlation heat map for the following group:
# (the first one in the sets above)

corrmat = X_train[[
    'imp_op_var39_comer_ult1',
    'imp_op_var39_comer_ult3',
    'imp_op_var41_comer_ult1',
    'imp_op_var41_comer_ult3']].corr(method='pearson')

# we can make a heatmap with the package seaborn
# and customise the colours of seaborn's heatmap
cmap = sns.diverging_palette(220, 20, as_cmap=True)

# some more parameters for the figure
fig, ax = plt.subplots()
fig.set_size_inches(5,5)

# and now plot the correlation matrix
sns.heatmap(corrmat, cmap=cmap)

We can see that indeed all those variables show a correlation coefficient higher than 0.8 with each other.

In [None]:
# in the features_to_drop_ the transformer stores all the
# variables that will be dropped. 

# the transformer selects 1 variable per group of correlated ones
# and drops the rest on a first-come, first-served basis

len(correlated.features_to_drop_)

In [None]:
# remove correlated variables

print('Number of variables before removing correlated: ', X_train.shape[1])

X_train = correlated.transform(X_train)
X_test = correlated.transform(X_test)

print('Number of variables after removing correlated: ', X_train.shape[1])

## Drop Correlated Features Smartly

With this class, one feature from each correlated group is retained, based on one of the following characteristics:

- the number of missing values
- the variance
- the cardinality
- the importance derived from a machine learning model

The transformer will retain the feature with the fewest missing values, or with the highest variance, cardinality or model performance, depending on what we choose for the selection_method parameter.

[SmartCorrelatedSelection](https://feature-engine.readthedocs.io/en/latest/selection/SmartCorrelatedSelection.html)
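As a small illustration of the "highest variance" criterion, the sketch below picks one feature from a correlated group on toy data. The group is written by hand here; in the notebook, the transformer finds the groups for us.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "a_times_10": [10, 20, 30, 40, 50, 60, 70, 80],
    "b": [1, -1, -1, 1, 1, -1, -1, 1],
})

# a correlated group, written by hand for this illustration
group = ["a", "a_times_10"]

# keep the member of the group with the highest variance and drop the rest
variances = toy[group].var()
keep = variances.idxmax()
to_drop = [col for col in group if col != keep]

print(keep, to_drop)  # a_times_10 ['a']
```

Swapping `var()` for `nunique()` or `isnull().sum()` would give the cardinality and missing-values criteria in the same spirit.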

In [None]:
smart_corr = SmartCorrelatedSelection(
    variables=None, # examines all variables
    method="pearson", # the correlation method
    threshold=0.7, # the correlation coefficient threshold
    missing_values="ignore",
    selection_method="model_performance", # how to select the features
    estimator=RandomForestClassifier(n_estimators=10, random_state=1), # the model from which to derive the importance
)

# find correlated features and select the best from each group

# the method builds a random forest using each single feature from the correlated feature group
# and retains the feature from the group with the best performance

smart_corr.fit(X_train, y_train)

In [None]:
# the correlated feature groups

smart_corr.correlated_feature_sets_

In [None]:
# let's examine the performance of a random forest based on
# each feature from the fifth group from above, to understand
# what the transformer is doing

# select fifth group of correlated features
group = smart_corr.correlated_feature_sets_[4]

# build random forest with cross validation for
# each feature

for f in group:
    
    model = cross_validate(
        RandomForestClassifier(n_estimators=10, random_state=1),
        X_train[f].to_frame(),
        y_train,
        cv=3,
        return_estimator=False,
        scoring='roc_auc',
    )

    print(f, model["test_score"].mean())

The variable **num_var30_0** returns the highest performing random forest, therefore this one will be retained and the other ones removed.

In [None]:
# this variable, which shows the best performance will be retained
# and thus is not in the features_to_drop_ attribute

'num_var30_0' in smart_corr.features_to_drop_

In [None]:
# this variable will be dropped, and thus it is in the features_to_drop_ attribute

'ind_var12_0' in smart_corr.features_to_drop_

In [None]:
# this variable will be dropped, and thus it is in the features_to_drop_ attribute

'ind_var24_0' in smart_corr.features_to_drop_

In [None]:
# remove correlated variables

print('Number of variables before removing correlated: ', X_train.shape[1])

X_train = smart_corr.transform(X_train)
X_test = smart_corr.transform(X_test)

print('Number of variables after removing correlated: ', X_train.shape[1])

## Select features in a pipeline

We can perform all feature selection procedures in 1 step using a Pipeline from Scikit-learn.

In [None]:
# load data again
data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [None]:
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)), # drops constant and quasi-constant altogether
    ('duplicated', DropDuplicateFeatures()), # drops duplicates
    ('correlation', SmartCorrelatedSelection( # drops correlated
        threshold=0.8,
        selection_method="model_performance",
        estimator=RandomForestClassifier(n_estimators=10, random_state=1),
    )),
])

# find features to remove

pipe.fit(X_train, y_train)

In [None]:
# remove variables

print('Number of original variables: ', X_train.shape[1])

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

print('Number of variables after selection: ', X_train.shape[1])

We can see how, in a single cell, we reduced the number of features from 369 to 81.

## Select features by Shuffling

This class builds a model with all features, then shuffles each feature, one at a time, and determines the drop in model performance. If the feature is important, we should see a big drop. Otherwise, the drop will be small, and we could remove the feature.

[SelectByShuffling](https://feature-engine.readthedocs.io/en/latest/selection/SelectByShuffling.html)
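The shuffling idea can be sketched by hand with scikit-learn on synthetic data. For brevity, this sketch trains and evaluates on the same data and uses a logistic regression; the notebook uses cross-validation and a random forest instead. The feature names `signal` and `noise` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the target
    "noise": rng.normal(size=n),   # unrelated to the target
})
y = (X["signal"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# model trained with all features and its baseline performance
model = LogisticRegression().fit(X, y)
baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])

# shuffle one feature at a time and record the drop in roc-auc
drifts = {}
for col in X.columns:
    X_shuffled = X.copy()
    X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
    score = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
    drifts[col] = baseline - score

print(drifts)
```

Shuffling `signal` breaks its link with the target, so the drop should be large, while shuffling `noise` should barely change the score.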

In [None]:
# load data again
data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

In [None]:
# let's remove constant, quasi-constant and duplicates to speed things up

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)), # drops constant and quasi-constant altogether
    ('duplicated', DropDuplicateFeatures()),
])

# find features to remove
pipe.fit(X_train, y_train)

# remove variables

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [None]:
shuffle = SelectByShuffling(
    estimator = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1), # the model
    scoring="roc_auc", # the metric to determine model performance
    cv=3, # the cross-validation fold
)

shuffle.fit(X_train, y_train)

In [None]:
# this is the performance of the model (roc-auc) using all the features

shuffle.initial_model_performance_

In [None]:
# in the attribute performance_drifts_ we can find the 
# performance drift caused by shuffling each feature

shuffle.performance_drifts_

In [None]:
pd.Series(shuffle.performance_drifts_).plot.bar(figsize=(20,5))
plt.ylabel('Performance drift after shuffling')
plt.show()

In [None]:
# here we find the attributes that will be dropped

len(shuffle.features_to_drop_)

In [None]:
# remove variables

print('Number of variables before removing non important: ', X_train.shape[1])

X_train = shuffle.transform(X_train)
X_test = shuffle.transform(X_test)

print('Number of variables after removing non important: ', X_train.shape[1])

In [None]:
# we can go ahead and train a random forest using the selected features and evaluate
# its performance

rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1)

rf.fit(X_train, y_train)

pred = rf.predict_proba(X_train)
print('Train roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))

pred = rf.predict_proba(X_test)
print('Test roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

We see that the model with fewer features shows better performance than the model with all the features. It is also much simpler and easier to interpret for those who will actually use the model.

## Select features by univariate model performance

This selection procedure builds 1 model per feature, and selects those features that return models with a performance above a certain threshold.

[SelectBySingleFeaturePerformance](https://feature-engine.readthedocs.io/en/latest/selection/SelectBySingleFeaturePerformance.html)
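Here is a hand-made sketch of the same procedure with scikit-learn on synthetic data: one cross-validated model per feature, followed by a threshold on the score. The logistic regression, the feature names and the threshold of 0.7 are assumptions chosen for the toy example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the target
    "noise": rng.normal(size=n),   # unrelated to the target
})
y = (X["signal"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# train 1 model per feature and store its mean cross-validated roc-auc
performance = {}
for col in X.columns:
    scores = cross_val_score(
        LogisticRegression(), X[[col]], y, cv=3, scoring="roc_auc"
    )
    performance[col] = scores.mean()

threshold = 0.7  # hypothetical performance threshold for this toy example
selected = [col for col, score in performance.items() if score > threshold]

print(selected)
```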

In [None]:
# load data again
data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

# let's remove constant, quasi-constant and duplicates to speed things up

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)), # drops constant and quasi-constant altogether
    ('duplicated', DropDuplicateFeatures()),
])

# find features to remove
pipe.fit(X_train, y_train)

# remove variables

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [None]:
sel = SelectBySingleFeaturePerformance(
    estimator = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1), # the model
    scoring="roc_auc", # the metric to determine model performance
    cv=3, # the cross-validation fold,
    threshold=None, # the performance threshold
)

sel.fit(X_train, y_train)

In [None]:
# the univariate performance of the features

sel.feature_performance_

In [None]:
pd.Series(sel.feature_performance_).plot.bar(figsize=(20,5))
plt.ylabel('roc-auc')
plt.show()

In [None]:
# the features that will be dropped

len(sel.features_to_drop_)

In [None]:
# when we leave the threshold as None, the selector selects the features whose
# performance is greater than the mean performance of all features

sel.threshold

In [None]:
# remove variables

print('Number of variables before removing non important: ', X_train.shape[1])

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

print('Number of variables after removing non important: ', X_train.shape[1])

## Select Features Recursively

- This method starts by building a model with all features
- Then it ranks the features by importance, derived from the model
- Then it removes the least important feature
- It trains a new model without that feature and determines its performance
- If the drop in performance is big, it retains the feature, otherwise it removes it
- It repeats steps 3-5 until all features have been examined.

[RecursiveFeatureElimination](https://feature-engine.readthedocs.io/en/latest/selection/RecursiveFeatureElimination.html)
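The steps above can be sketched by hand with scikit-learn on synthetic data. The helper `cv_auc`, the feature names and the threshold value are made up for this illustration. Updating the reference performance after each removal is an assumption of this sketch, not necessarily what the library does.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "signal": rng.normal(size=n),   # drives the target
    "noise_1": rng.normal(size=n),  # unrelated to the target
    "noise_2": rng.normal(size=n),  # unrelated to the target
})
y = (X["signal"] > 0).astype(int)


def cv_auc(features):
    # hypothetical helper: mean cross-validated roc-auc using `features`
    model = RandomForestClassifier(n_estimators=20, max_depth=2, random_state=1)
    return cross_val_score(model, X[features], y, cv=3, scoring="roc_auc").mean()


# steps 1-2: train a model with all features and rank them, least important first
full_model = RandomForestClassifier(n_estimators=20, max_depth=2, random_state=1)
full_model.fit(X, y)
ranking = pd.Series(full_model.feature_importances_, index=X.columns).sort_values()

threshold = 0.04  # hypothetical maximum drop in roc-auc that we tolerate
selected = list(X.columns)
reference = cv_auc(selected)

# steps 3-6: try removing each feature in turn
for feature in ranking.index:
    if len(selected) == 1:
        break  # always keep at least 1 feature
    candidate = [f for f in selected if f != feature]
    score = cv_auc(candidate)
    if reference - score > threshold:
        continue  # big drop in performance: retain the feature
    selected = candidate  # small drop: remove the feature
    reference = score

print(selected)
```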

In [None]:
# load data again
data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

# let's remove constant, quasi-constant and duplicates to speed things up

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)), # drops constant and quasi-constant altogether
    ('duplicated', DropDuplicateFeatures()),
])

# find features to remove
pipe.fit(X_train, y_train)

# remove variables

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [None]:
rfe = RecursiveFeatureElimination(
    estimator = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1), # the model
    scoring="roc_auc", # the metric to determine model performance
    cv=3, # the cross-validation fold
    threshold = 0.04, 
)

rfe.fit(X_train, y_train)

In [None]:
# the feature importance derived from the first model, trained
# using all the features

rfe.feature_importances_

In [None]:
# plot of feature importance, derived from the Random Forests
pd.Series(rfe.feature_importances_).plot.bar(figsize=(20,5))
plt.ylabel('Feature importance derived from the random forests')
plt.show()

The selector removes features one by one, starting with those on the left and moving towards those on the right.

In [None]:
# the performance of the random forest trained on all features

rfe.initial_model_performance_

In [None]:
# the drop in performance caused when removing each feature

rfe.performance_drifts_

In [None]:
# same as above in a plot

pd.Series(rfe.performance_drifts_).sort_values().plot.bar(figsize=(20,5))
plt.ylabel('change in performance when removing feature')
plt.show()

In [None]:
# the number of features that will be dropped
len(rfe.features_to_drop_)

In [None]:
# remove variables

print('Number of variables before removing non important: ', X_train.shape[1])

X_train = rfe.transform(X_train)
X_test = rfe.transform(X_test)

print('Number of variables after removing non important: ', X_train.shape[1])

## Feature Selection and Machine Learning Pipeline

Now we will select features and train a machine learning model altogether in 1 pipeline.

In [None]:
# load data again
data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

# separate dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

In [None]:
pipe = Pipeline([
    # ======== FEATURE SELECTION =======
    ('constant', DropConstantFeatures(tol=0.998)), # drops constant and quasi-constant altogether
    ('duplicated', DropDuplicateFeatures()), # drop duplicated
    ('shuffle', SelectByShuffling( # select by feature shuffling
        estimator = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1), # the model
        scoring="roc_auc", # the metric to determine model performance
        cv=3, # the cross-validation fold
    )),
    
    # =====  the machine learning model ====
    ('random_forest', RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1)),
])

# find features to remove and train the model on the remaining ones
pipe.fit(X_train, y_train)

In [None]:
# the pipeline takes in the raw data, removes all unwanted features and then
# makes the prediction with the model trained on the final subset of variables

# obtain predictions and determine model performance

pred = pipe.predict_proba(X_train)
print('Train roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))

pred = pipe.predict_proba(X_test)
print('Test roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

That is all for now. I hope you find this notebook and this library useful. If you do, please upvote the notebook :)


## References and further reading


- [Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html), Python open-source library
- [Feature Selection for Machine Learning](https://www.courses.trainindata.com/p/feature-selection-for-machine-learning), Online Course
- [Comprehensive Guide on Feature Selection](https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection), Kaggle notebook