# Feature Selection - Recursive Feature Elimination (RFE) Using Tree and Gradient Based Estimators

### Recursive Feature Elimination (RFE)

As its name suggests, RFE eliminates features recursively. It first trains the model on all the features and computes their importances, then removes the least important feature, retrains the model on the remaining features, and recomputes the importances. This cycle repeats until only the requested number of features is left.
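
The elimination loop described above can be sketched by hand. This is a minimal illustration under stated assumptions, not scikit-learn's actual implementation: `manual_rfe` is a hypothetical helper, and the synthetic dataset from `make_classification` is used only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def manual_rfe(X, y, n_features_to_select):
    """Illustrative elimination loop (hypothetical helper, not sklearn's code)."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_select:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X[:, remaining], y)
        # Drop the feature the current model considers least important.
        weakest = int(np.argmin(model.feature_importances_))
        del remaining[weakest]
    return remaining


# Synthetic data, used only to keep the sketch self-contained.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
kept = manual_rfe(X, y, n_features_to_select=3)
print("Kept column indices:", kept)
```

Each pass refits the model on the surviving columns, so the importances are always relative to the current feature set rather than the original one.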


Scikit-learn does most of the heavy lifting: import RFE from sklearn.feature_selection and pass an estimator that exposes feature_importances_ or coef_ (such as the classifiers used below) to RFE(), along with the number of features to select. Following the familiar scikit-learn syntax, the .fit() method must then be called.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
data = load_breast_cancer()

In [None]:
data.keys()

In [None]:
print(data.DESCR)

In [None]:
x = pd.DataFrame(data.data, columns=data.feature_names)
x.head()

In [None]:
y = data.target

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

### Feature Selection by Feature Importance Using Random Forest Classifier (RFC)

In [None]:
sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
)
sel.fit(x_train, y_train)

In the boolean mask returned by get_support(), True marks a selected feature and False marks an ignored one.

In [None]:
np.mean(sel.estimator_.feature_importances_)

In [None]:
sel.estimator_.feature_importances_

In [None]:
# Features whose importance is at or above the mean importance are selected.
sel.get_support()
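
To make the mean-importance rule concrete, the sketch below rebuilds the mask by hand and compares it with what the selector reports. It assumes the default `threshold` of `SelectFromModel` with a random forest, and it fits on the full dataset (not the train split used in this notebook) purely to stay self-contained; the names `selector` and `manual_mask` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
# Rebuild the mask by hand: keep features at or above the mean importance.
manual_mask = importances >= importances.mean()

print("Threshold used by the selector:", selector.threshold_)
print("Masks agree:", np.array_equal(manual_mask, selector.get_support()))
```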

In [None]:
len(sel.get_support())

In [None]:
x_train.columns

In [None]:
# Select the columns marked True from the training dataset.
features = x_train.columns[sel.get_support()]

In [None]:
features, len(features)

In [None]:
x_train_rfc = sel.transform(x_train)
x_test_rfc = sel.transform(x_test)

In [None]:
def run_random_forest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy on test set: ", accuracy_score(y_test, y_pred))
    print()

In [None]:
%%time
# After processing the data.
run_random_forest(x_train_rfc, x_test_rfc, y_train, y_test)

In [None]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

Here we can see that the accuracy decreased after feature selection.

### Recursive Feature Elimination (RFE)

In [None]:
from sklearn.feature_selection import RFE

In [None]:
sel = RFE(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    n_features_to_select=15,
)
sel.fit(x_train, y_train)

In [None]:
sel.get_support()

In [None]:
features = x_train.columns[sel.get_support()]
features

In [None]:
len(features)
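
Beyond the boolean mask, a fitted RFE object also exposes a `ranking_` attribute that records the order of elimination. The sketch below is a self-contained illustration that fits on the full dataset with a smaller forest than the notebook uses (an assumption made to keep it quick); the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

rfe = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=15,
)
rfe.fit(X, data.target)

# ranking_ is 1 for every selected feature; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)
```

Features ranked just above 1 were the last to be dropped, which makes them natural candidates if you later decide to keep a few more features.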

In [None]:
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)

In [None]:
%%time
# After processing the data.
run_random_forest(x_train_rfe, x_test_rfe, y_train, y_test)

In [None]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

Here we can see that the accuracy increased after feature selection.

### Feature Selection Using Gradient Boosting Tree Importance

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
sel = RFE(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    n_features_to_select=12,
)
sel.fit(x_train, y_train)

In [None]:
sel.get_support()

In [None]:
features = x_train.columns[sel.get_support()]
features

In [None]:
len(features)

In [None]:
x_train_gra = sel.transform(x_train)
x_test_gra = sel.transform(x_test)

In [None]:
%%time
# After processing the data.
run_random_forest(x_train_gra, x_test_gra, y_train, y_test)

In [None]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

### How to find the "n_features_to_select" value?

In [None]:
for index in range(1, 31):
    sel = RFE(
        GradientBoostingClassifier(n_estimators=100, random_state=0),
        n_features_to_select=index,
    )
    sel.fit(x_train, y_train)
    x_train_gra = sel.transform(x_train)
    x_test_gra = sel.transform(x_test)
    print("Number of Selected Features: ", index)
    run_random_forest(x_train_gra, x_test_gra, y_train, y_test)
    features = x_train.columns[sel.get_support()]
    print("Selected Features Names: ", features)
    print()

As we can see, the maximum accuracy was obtained with 6 features, so our n_features_to_select should be 6.
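
The loop above picks the feature count by comparing test-set accuracy. A different technique, swapped in here only as an illustration, is scikit-learn's `RFECV`, which chooses the count by cross-validation on the training split alone. The settings below (`n_estimators=30`, `step=2`, `cv=3`) are assumptions chosen to keep the run short, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validate on the training split only, so the test set stays untouched.
selector = RFECV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    step=2,  # drop two features per round to keep the run short
    cv=3,
    scoring="accuracy",
)
selector.fit(X_train, y_train)

print("Number of features chosen by cross-validation:", selector.n_features_)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```

Because the test split plays no part in the choice, the accuracy measured on it afterwards remains an honest estimate.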

### Let's find the "n_features_to_select" using RandomForestClassifier

In [None]:
for index in range(1, 31):
    sel = RFE(
        RandomForestClassifier(n_estimators=100, random_state=0),
        n_features_to_select=index,
    )
    sel.fit(x_train, y_train)
    # Features here are selected with a random forest, so use the rfe suffix.
    x_train_rfe = sel.transform(x_train)
    x_test_rfe = sel.transform(x_test)
    print("Number of Selected Features: ", index)
    run_random_forest(x_train_rfe, x_test_rfe, y_train, y_test)
    features = x_train.columns[sel.get_support()]
    print("Selected Features Names: ", features)
    print()