In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import make_classification

# Boruta Feature Selection motivation and BorutaPy use

In this notebook we briefly review classical feature selection techniques (which are taught but not much used). Then we see how "SelectKBest with model feature_importances" naturally leads to Boruta. Finally, we use BorutaPy to see how it is used in practice.

In [2]:
n_features = 20

X, y = make_classification(n_samples=1000,
                            n_features=n_features,
                            n_informative=2,
                            n_redundant=2,
                            n_classes=2,
                            flip_y=0.1,
                            shuffle=False,
                            random_state=42)

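Because `shuffle=False`, the first 4 features are the important ones: column_1 and column_2 are informative, while column_3 and column_4 are redundant (built from the informative ones).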
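As a quick sanity check, the sketch below regenerates the same data and tests whether columns 3 and 4 can be rebuilt as linear combinations of columns 1 and 2. The names `X_check`, `informative` and `redundant` are only illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

X_check, y_check = make_classification(n_samples=1000, n_features=20,
                                       n_informative=2, n_redundant=2,
                                       n_classes=2, flip_y=0.1,
                                       shuffle=False, random_state=42)

informative = X_check[:, :2]   # column_1 and column_2
redundant = X_check[:, 2:4]    # column_3 and column_4

# Least-squares fit of the redundant columns on the informative ones.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_abs_error = np.abs(informative @ coef - redundant).max()
print(f"largest reconstruction error: {max_abs_error:.2e}")
```

A reconstruction error at the level of floating-point noise means the redundant columns carry no information beyond the first two.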

In [3]:
X = pd.DataFrame(X, columns=[f'column_{i}' for i in range(1, n_features+1)])
X.head()

Unnamed: 0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20
0,-1.050478,-1.323568,0.912474,1.009796,0.829475,-0.193826,-0.264515,-2.003862,0.635418,-1.239258,0.059933,0.277377,1.360659,-1.30882,-3.019512,0.18385,1.800511,1.238946,0.209659,-0.491636
1,-1.580834,-2.747104,1.777419,1.85043,0.807123,-0.973546,0.476358,0.50547,1.06021,2.75966,0.392416,-0.508964,-0.025574,-1.769076,-0.694713,-0.409282,-0.524088,0.152355,-0.82242,1.121031
2,-0.885704,-0.6146,0.501004,0.631813,0.000207,-0.0093,-0.327895,0.155191,0.825098,-0.86713,-0.658116,-0.303726,-1.345871,-0.819258,-0.476221,0.874389,0.262561,0.19359,0.850898,-0.137372
3,-1.525438,-2.967793,1.884777,1.92441,0.390465,-0.103222,0.265362,-0.582759,-2.438817,-0.134279,1.422748,0.926215,0.965397,1.236131,0.088658,0.197316,-0.617652,-0.316073,0.615771,1.203884
4,-1.076826,-1.014619,0.752233,0.885267,-0.139446,-0.450189,0.000528,0.601207,-1.443855,-2.296181,-0.550537,-1.220712,-0.50814,-0.14778,-0.453248,1.452468,0.326745,0.300474,0.622207,-1.138833


## Sequential Selector

```python
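from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Any scikit-learn estimator works here; a random forest is just an example.
estimator = RandomForestClassifier(random_state=42)

# direction can be "forward" (add features) or "backward" (remove features)
sfs = SequentialFeatureSelector(estimator, direction="forward", n_features_to_select=3)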
sfs.fit(X, y)
feature_mask = sfs.get_support()
X_selected_features = sfs.transform(X)
```

The Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion (one feature at a time).

Cons: even with the greedy approach, it is expensive, because every remaining candidate feature needs its own cross-validated model fit at every step.
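To put a number on "expensive", here is a back-of-the-envelope count of model fits for forward selection. It assumes 5-fold cross-validation (the scikit-learn default) and one cross-validated score per remaining candidate at each step; `forward_selection_fits` is a hypothetical helper written for this note, not a library function.

```python
def forward_selection_fits(n_features, n_to_select, cv_folds=5):
    # At step i there are (n_features - i) candidates left, and each one
    # is scored with cv_folds model fits.
    return sum(cv_folds * (n_features - i) for i in range(n_to_select))

fits_for_3 = forward_selection_fits(n_features=20, n_to_select=3)
fits_for_10 = forward_selection_fits(n_features=20, n_to_select=10)
print(fits_for_3)   # 5 * (20 + 19 + 18) = 285
print(fits_for_10)  # 5 * (20 + 19 + ... + 11) = 775
```

So even for our small 20-feature dataset, picking 3 features already costs a few hundred fits of the base estimator.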

## Variance Selector

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold()
selector.fit(X)
X_selected_features = selector.transform(X)
```

This selector removes all low-variance features. With the default threshold, only constant features are removed. Note that it never looks at the target.

Cons: which threshold to pick?
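Looking at the variances before choosing makes the threshold less of a guess. A minimal sketch on a toy matrix (the three columns and the 0.05 cutoff are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
toy = np.column_stack([
    np.ones(200),                        # constant column: variance 0
    np.r_[np.ones(2), np.zeros(198)],    # rare flag: variance about 0.0099
    rng.normal(size=200),                # standard normal: variance about 1
])

print(toy.var(axis=0))  # inspect the variances first

default_mask = VarianceThreshold().fit(toy).get_support()
strict_mask = VarianceThreshold(threshold=0.05).fit(toy).get_support()
print(default_mask)  # only the constant column is dropped
print(strict_mask)   # the rare flag is dropped as well
```

Whether dropping the rare flag is a good idea depends on the problem: a rare event can still be very predictive, and variance alone cannot tell.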

## K-Best Selector

SelectKBest is probably the most common technique. We simply select the k features with the highest scores (some measure of feature importance).

For instance, you can take the features most "correlated" with the target:
```python
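from sklearn.feature_selection import SelectKBest, mutual_info_classif

# y is a class label, so use the classification variant of mutual information.
# k must be smaller than the number of columns (20 here) to drop anything.
X_selected_features = SelectKBest(mutual_info_classif, k=4).fit_transform(X, y)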
```

In practice, we normally use it with a model-based measure of feature importance.

In [4]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42).fit(X, y)

In [5]:
k = 4

(pd.DataFrame(list(zip(X.columns, rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
 .head(k)
 .feature_name
 .to_list()
)

['column_2', 'column_3', 'column_4', 'column_1']

## How to choose k?

The choice of k in the last example looks quite ad hoc.

Idea: create a noise variable that we know is not useful. Then we can keep the columns that scored better than it.

In [6]:
noised_X = (X.assign(noise_column = np.random.RandomState(42).normal(size=X.shape[0])))
noised_X[noised_X.columns[::-1]].head()

Unnamed: 0,noise_column,column_20,column_19,column_18,column_17,column_16,column_15,column_14,column_13,column_12,...,column_10,column_9,column_8,column_7,column_6,column_5,column_4,column_3,column_2,column_1
0,0.496714,-0.491636,0.209659,1.238946,1.800511,0.18385,-3.019512,-1.30882,1.360659,0.277377,...,-1.239258,0.635418,-2.003862,-0.264515,-0.193826,0.829475,1.009796,0.912474,-1.323568,-1.050478
1,-0.138264,1.121031,-0.82242,0.152355,-0.524088,-0.409282,-0.694713,-1.769076,-0.025574,-0.508964,...,2.75966,1.06021,0.50547,0.476358,-0.973546,0.807123,1.85043,1.777419,-2.747104,-1.580834
2,0.647689,-0.137372,0.850898,0.19359,0.262561,0.874389,-0.476221,-0.819258,-1.345871,-0.303726,...,-0.86713,0.825098,0.155191,-0.327895,-0.0093,0.000207,0.631813,0.501004,-0.6146,-0.885704
3,1.52303,1.203884,0.615771,-0.316073,-0.617652,0.197316,0.088658,1.236131,0.965397,0.926215,...,-0.134279,-2.438817,-0.582759,0.265362,-0.103222,0.390465,1.92441,1.884777,-2.967793,-1.525438
4,-0.234153,-1.138833,0.622207,0.300474,0.326745,1.452468,-0.453248,-0.14778,-0.50814,-1.220712,...,-2.296181,-1.443855,0.601207,0.000528,-0.450189,-0.139446,0.885267,0.752233,-1.014619,-1.076826


In [7]:
noised_rfc = RandomForestClassifier(random_state=42).fit(noised_X, y)

In [8]:
(pd.DataFrame(list(zip(noised_X.columns, noised_rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
 .query(f"feature_importance > {noised_rfc.feature_importances_[-1]}")
 .feature_name
 .to_list()
)

['column_2',
 'column_3',
 'column_4',
 'column_1',
 'column_6',
 'column_10',
 'column_14']

## Statistical Significance

Different random states, or a different distribution for the random variable we add, can make us select different features.

In [9]:
noised2_X = (X.assign(noise_column = np.random.RandomState(42).exponential(size=X.shape[0])))
noised2_rfc = RandomForestClassifier(random_state=0).fit(noised2_X, y)

(pd.DataFrame(list(zip(noised2_X.columns, noised2_rfc.feature_importances_)),
              columns=['feature_name', 'feature_importance'])
 .sort_values(by='feature_importance', ascending=False)
 .query(f"feature_importance > {noised2_rfc.feature_importances_[-1]}")
 .feature_name
 .to_list()
)

['column_2',
 'column_3',
 'column_4',
 'column_1',
 'column_14',
 'column_6',
 'column_10',
 'column_9']

Sometimes a bad feature makes the cut, and other times good features are unlucky and land below the noise column, by chance.
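One way to see this instability is to repeat the noise-column experiment with different seeds and count how often each feature beats the noise. The sketch below does that on a smaller dataset so it runs quickly; the sizes (8 features, 20 runs, 30 trees) and names such as `wins` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

n_features, n_runs = 8, 20
X_small, y_small = make_classification(n_samples=300, n_features=n_features,
                                       n_informative=2, n_redundant=2,
                                       flip_y=0.1, shuffle=False,
                                       random_state=42)

wins = np.zeros(n_features, dtype=int)
for seed in range(n_runs):
    rng = np.random.RandomState(seed)
    noise = rng.normal(size=(X_small.shape[0], 1))    # fresh noise column
    model = RandomForestClassifier(n_estimators=30, random_state=seed)
    model.fit(np.hstack([X_small, noise]), y_small)
    importances = model.feature_importances_
    # The noise column is last; count the features that beat it in this run.
    wins += importances[:-1] > importances[-1]

for name, count in zip([f"column_{i}" for i in range(1, n_features + 1)], wins):
    print(f"{name}: beat the noise in {count}/{n_runs} runs")
```

Features that win in nearly every run are safe bets, features that almost never win can go, and for anything in between chance is doing most of the work.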

## Boruta main ideas

- Boruta tries to resolve this inconsistency by repeating the process many times.

- At each repetition, we record whether each feature beat the noise feature or not (in the sense of having a higher feature importance).

- For each feature, we then apply a statistical test of the hypothesis: *"does this feature have a 50% chance of being better than a noise feature?"*.

- The result of this test gives us 3 regions: the features we are confident are better than randomness, the features we are confident are just bad, and the features we are not confident enough to put in either of the other classes.

- PS: to be fair, Boruta creates the noise features differently than we did in this example. Instead of creating them from scratch with a new random variable, it shuffles the values within each column of the original dataframe. In the Boruta literature they are called *shadow variables* instead of *noised* variables.
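The two ingredients in the list above, shadow features and the binomial test, can be sketched in a few lines. This is a simplified illustration and not BorutaPy's actual implementation: `make_shadows`, `binom_tail` and `classify` are hypothetical helpers, the hit counts are made up, and no multiple-testing correction is applied.

```python
import numpy as np
from math import comb

def make_shadows(X, rng):
    # Shuffle each column independently: every column keeps its own
    # distribution, but any relationship with the target is destroyed.
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def binom_tail(hits, n_trials, upper=True):
    # P(H >= hits) if upper else P(H <= hits), for H ~ Binomial(n_trials, 0.5)
    ks = range(hits, n_trials + 1) if upper else range(0, hits + 1)
    return sum(comb(n_trials, k) for k in ks) / 2 ** n_trials

def classify(hits, n_trials, alpha=0.05):
    if binom_tail(hits, n_trials, upper=True) < alpha:
        return "confirmed"    # beats the shadows far more often than chance
    if binom_tail(hits, n_trials, upper=False) < alpha:
        return "rejected"     # beats the shadows far less often than chance
    return "tentative"        # not enough evidence either way

rng = np.random.RandomState(42)
X_demo = rng.normal(size=(100, 3))
shadows = make_shadows(X_demo, rng)

for hits in (19, 10, 2):      # made-up hit counts out of 20 trials
    print(hits, classify(hits, n_trials=20))
```

With 20 trials, 19 hits lands in the "confirmed" region, 2 hits in the "rejected" region, and 10 hits stays "tentative".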

The discussion above covers the ideas you need to understand Boruta in detail. You can dive deeper now with this [excellent blog post](https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a).

## Using Boruta

The [post](https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a) shows a neat way of using the BorutaPy library. I'm just adding some comments.

In [10]:
from boruta import BorutaPy

### Initialize model we want to use as base estimator

- Note that we can add hyper-parameters we find relevant, such as `class_weight`.

- When using tree ensembles (let's be honest, always), deeper trees change the feature importances only slightly and just take longer to compute. In practice, setting `max_depth` to an int is a time saver with little loss in selection quality, because the time saved lets us run more Boruta trials. By default, random forest trees are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples (the default is 2), which is computationally expensive.
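To get a rough sense of how much work `max_depth` saves, the sketch below compares the deepest tree and the total node count of a forest with and without the cap. It regenerates the same synthetic data; `n_estimators=50` and the helper `forest_size` are illustrative choices, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_d, y_d = make_classification(n_samples=1000, n_features=20,
                               n_informative=2, n_redundant=2,
                               flip_y=0.1, shuffle=False, random_state=42)

def forest_size(forest):
    # Deepest tree and total number of nodes across all trees.
    max_tree_depth = max(tree.get_depth() for tree in forest.estimators_)
    total_nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
    return max_tree_depth, total_nodes

full = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_d, y_d)
capped = RandomForestClassifier(n_estimators=50, max_depth=7,
                                random_state=42).fit(X_d, y_d)

full_depth, full_nodes = forest_size(full)
capped_depth, capped_nodes = forest_size(capped)
print(f"default:     depth {full_depth}, {full_nodes} nodes")
print(f"max_depth=7: depth {capped_depth}, {capped_nodes} nodes")
```

Fewer nodes means less work per tree, and that saving is multiplied by the number of trees and by the number of Boruta trials.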

In [11]:
forest = RandomForestClassifier(max_depth=7, random_state=42)

### Set Boruta object and fit it

- Boruta's `n_estimators` overrides the estimator's `n_estimators`. By default, it is set to 1000. If `'auto'`, it is determined automatically based on the size of the dataset.
- `alpha` and `perc` are parameters you may want to tune a little.
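To build some intuition for `perc`: as far as I understand BorutaPy, a feature scores a "hit" in a trial when its importance exceeds a percentile of the shadow importances, with `perc=100` meaning the maximum. `alpha` is then the significance level applied to the hit counts. The sketch below illustrates the percentile idea with plain NumPy; the importance values are made up and `hits_for_perc` is a hypothetical helper, so treat it as an approximation of the behaviour rather than the library's code.

```python
import numpy as np

# Made-up importances from a single trial (illustrative values only).
real_importances = np.array([0.30, 0.22, 0.08, 0.05, 0.025])
shadow_importances = np.array([0.02, 0.04, 0.06, 0.03, 0.01])

def hits_for_perc(perc):
    # Threshold = chosen percentile of the shadow importances.
    threshold = np.percentile(shadow_importances, perc)
    return threshold, real_importances > threshold

for perc in (100, 80, 50):
    threshold, hits = hits_for_perc(perc)
    print(f"perc={perc}: threshold={threshold:.3f}, hits={hits.astype(int)}")
```

Lowering `perc` makes the threshold easier to beat, so more features get hits: selection becomes less strict, at the risk of letting in more false positives.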

In [12]:
boruta = BorutaPy(
   estimator = forest,
   max_iter = 100, # number of trials to perform
   random_state = 42
)

### fit Boruta (it accepts np.array, not pd.DataFrame)
boruta.fit(np.array(X), np.array(y))

BorutaPy(estimator=RandomForestClassifier(max_depth=7, n_estimators=1000,
                                          random_state=RandomState(MT19937) at 0x1CE2687EDB0),
         random_state=RandomState(MT19937) at 0x1CE2687EDB0)

### Get the selected features and the ones we are not sure we can safely drop

In [13]:
green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()

In [14]:
green_area

['column_1', 'column_2', 'column_3', 'column_4', 'column_6', 'column_10']

In [15]:
blue_area

['column_9']
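What to do with the blue area is a judgment call. A minimal sketch of the two obvious options, with the masks rebuilt by hand from the outputs above (in the notebook they would simply be `boruta.support_` and `boruta.support_weak_`); the names `strict` and `lenient` are illustrative:

```python
import numpy as np

columns = np.array([f"column_{i}" for i in range(1, 21)])

# Hand-built stand-ins for boruta.support_ and boruta.support_weak_.
support = np.isin(columns, ["column_1", "column_2", "column_3",
                            "column_4", "column_6", "column_10"])
support_weak = np.isin(columns, ["column_9"])

strict = columns[support]                   # confirmed features only
lenient = columns[support | support_weak]   # confirmed plus tentative

print(list(strict))
print(list(lenient))
```

The strict set is the safer default; the lenient set is worth trying when a downstream validation score can tell whether the tentative feature helps.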

___