# Feature Selection - Recursive Feature Elimination (RFE) Using Tree and Gradient Based Estimators

### Recursive Feature Elimination (RFE)

As its name suggests, Recursive Feature Elimination (RFE) removes features recursively. It first trains a model on all the features and computes the importance of each one. It then drops the least important feature, retrains the model on the remaining features, and recomputes their importances. This cycle repeats until only the desired number of features is left.
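
To make the elimination loop concrete, here is a minimal hand-rolled sketch of the idea. It is not how scikit-learn implements RFE internally. The helper name `manual_rfe`, the 25 trees, and the target of 10 features are illustrative assumptions, and the full dataset is used only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier


def manual_rfe(X, y, n_features_to_keep):
    """Hypothetical helper: drop the least important feature until n remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_keep:
        model = RandomForestClassifier(n_estimators=25, random_state=0)
        model.fit(X[:, remaining], y)
        # Position (within `remaining`) of the weakest feature in this round.
        weakest = int(np.argmin(model.feature_importances_))
        del remaining[weakest]
    return remaining


X, y = load_breast_cancer(return_X_y=True)
kept = manual_rfe(X, y, n_features_to_keep=10)
print("Kept column indices:", kept)
```

In practice the `RFE` class does the same job with less code and also records the order in which features were eliminated.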


Scikit-learn does most of the heavy lifting: import RFE from sklearn.feature_selection and pass an estimator to RFE() together with the number of features to select. The estimator must expose feature importances after fitting, through either a coef_ or a feature_importances_ attribute. As with other scikit-learn objects, the .fit() method must then be called.
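
As a quick illustration of that API, the sketch below runs RFE on a small synthetic dataset and inspects the fitted `support_` and `ranking_` attributes. The synthetic data, the decision tree estimator, and the choice of 3 features are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem: 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=3)
selector.fit(X, y)

# support_ is a boolean mask of the kept features.
print("Mask of kept features:", selector.support_)
# ranking_ is 1 for kept features; larger values were eliminated earlier.
print("Elimination ranking:  ", selector.ranking_)

X_reduced = selector.transform(X)
print("Reduced shape:", X_reduced.shape)
```

Features with rank 1 are the ones that survive; the highest rank belongs to the feature that was dropped first.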

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

In [4]:
from sklearn.datasets import load_breast_cancer

In [5]:
data = load_breast_cancer()

In [6]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [7]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [8]:
x = pd.DataFrame(data.data, columns=data.feature_names)
x.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [9]:
y = data.target

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [11]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((455, 30), (114, 30), (455,), (114,))

### Feature Selection by Feature Importance Using Random Forest Classifier (RFC)

In [12]:
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sel.fit(x_train, y_train)
sel.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

Features marked True are selected; features marked False are dropped.

In [18]:
np.mean(sel.estimator_.feature_importances_)

0.03333333333333334

In [20]:
sel.estimator_.feature_importances_

array([0.03699612, 0.01561296, 0.06016409, 0.0371452 , 0.0063401 ,
       0.00965994, 0.0798662 , 0.08669071, 0.00474992, 0.00417092,
       0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
       0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
       0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
       0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304])

In [21]:
# Features whose importance is at least the mean importance are selected.
sel.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])
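
The relationship between the importances, their mean, and the selection mask can be checked directly. The sketch below refits the same kind of selector and compares its mask with one built by hand from the fitted `threshold_` attribute. Variable names such as `manual_mask` are illustrative, and `n_jobs` is left at its default here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_train, y_train)

importances = selector.estimator_.feature_importances_
# With the default threshold, a feature is kept when its importance
# is greater than or equal to the mean importance.
manual_mask = importances >= selector.threshold_

print("Threshold used:     ", selector.threshold_)
print("Mean importance:    ", importances.mean())
print("Masks are identical:", np.array_equal(manual_mask, selector.get_support()))
```

Random forest importances sum to 1, so with 30 features the mean is 1/30, which is the 0.0333 value printed earlier. A stricter or looser cut-off can be requested through the `threshold` argument of `SelectFromModel`, for example `threshold="median"`.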

In [13]:
len(sel.get_support())

30

In [14]:
x_train.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [15]:
# Select the columns marked True from the training dataset.
features = x_train.columns[sel.get_support()]

In [17]:
features, len(features)

(Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
        'mean concave points', 'area error', 'worst radius', 'worst perimeter',
        'worst area', 'worst concave points'],
       dtype='object'),
 10)

In [22]:
x_train_rfc = sel.transform(x_train)
x_test_rfc = sel.transform(x_test)
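
One detail worth knowing about `transform`: with scikit-learn's default output settings it returns a plain NumPy array, so the column names are lost. If the names are needed later, the boolean mask can be used to slice the DataFrame instead. This is a minimal sketch under that assumption; it fits on the full dataset only for brevity, and names like `x_frame` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel.fit(x, y)

# transform() gives an array without column labels (default configuration).
x_array = sel.transform(x)

# Slicing with the support mask keeps the labels.
selected_columns = x.columns[sel.get_support()]
x_frame = x.loc[:, selected_columns]

print(type(x_array).__name__, x_array.shape)
print(list(x_frame.columns))
print("Same values:", np.allclose(x_array, x_frame.to_numpy()))
```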

In [46]:
def run_random_forest(x_train, x_test, y_train, y_test):
    """Fit a random forest on the training data and print its test accuracy."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy on test set: ", accuracy_score(y_test, y_pred))
    print()

In [24]:
%%time
# Reduced feature set chosen by SelectFromModel.
run_random_forest(x_train_rfc, x_test_rfc, y_train, y_test)

Accuracy on test set: 
0.9473684210526315
CPU times: user 354 ms, sys: 67.4 ms, total: 421 ms
Wall time: 331 ms


In [25]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

Accuracy on test set: 
0.9649122807017544
CPU times: user 435 ms, sys: 54.2 ms, total: 490 ms
Wall time: 403 ms


Here we can see that after feature selection the test accuracy dropped from about 0.965 with all 30 features to about 0.947 with the 10 selected features.

### Recursive Feature Elimination (RFE)

In [26]:
from sklearn.feature_selection import RFE

In [28]:
sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1), n_features_to_select = 15)
sel.fit(x_train, y_train)

RFE(estimator=RandomForestClassifier(n_jobs=-1, random_state=0),
    n_features_to_select=15)

In [29]:
sel.get_support()

array([ True,  True,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True,  True, False,  True,
        True,  True, False])

In [30]:
features = x_train.columns[sel.get_support()]
features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean concavity', 'mean concave points', 'area error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
       'worst concavity', 'worst concave points', 'worst symmetry'],
      dtype='object')

In [31]:
len(features)

15

In [32]:
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)

In [33]:
%%time
# Reduced feature set chosen by RFE with a random forest.
run_random_forest(x_train_rfe, x_test_rfe, y_train, y_test)

Accuracy on test set: 
0.9736842105263158
CPU times: user 378 ms, sys: 62 ms, total: 440 ms
Wall time: 358 ms


In [34]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

Accuracy on test set: 
0.9649122807017544
CPU times: user 376 ms, sys: 49.5 ms, total: 426 ms
Wall time: 354 ms


Here we can see that with the 15 features selected by RFE the test accuracy rose from about 0.965 to about 0.974.
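
Because the test set has only 114 rows, it helps to translate the accuracies reported so far into counts of misclassified samples. The calculation below is plain arithmetic on the numbers printed above; the dictionary labels are illustrative.

```python
n_test = 114  # rows in x_test, from the shapes printed earlier

# Test accuracies copied from the outputs above.
accuracies = {
    "all 30 features": 0.9649122807017544,
    "SelectFromModel (10 features)": 0.9473684210526315,
    "RFE with random forest (15 features)": 0.9736842105263158,
}

# Convert each accuracy into a number of misclassified test samples.
errors = {name: round((1 - acc) * n_test) for name, acc in accuracies.items()}

for name, n_errors in errors.items():
    print(f"{name}: {n_errors} misclassified out of {n_test}")
```

The differences between the three runs come down to one to three test samples, so they should be read as a rough indication on this split rather than as a firm ranking of the methods.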

### Feature Selection Using Gradient Boosted Tree Importance

In [35]:
from sklearn.ensemble import GradientBoostingClassifier

In [37]:
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select = 12)
sel.fit(x_train, y_train)

RFE(estimator=GradientBoostingClassifier(random_state=0),
    n_features_to_select=12)

In [38]:
sel.get_support()

array([False,  True, False, False,  True, False, False,  True,  True,
       False, False, False, False,  True, False, False,  True, False,
       False, False,  True,  True,  True,  True, False, False,  True,
        True, False, False])

In [39]:
features = x_train.columns[sel.get_support()]
features

Index(['mean texture', 'mean smoothness', 'mean concave points',
       'mean symmetry', 'area error', 'concavity error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [40]:
len(features)

12

In [42]:
x_train_gra = sel.transform(x_train)
x_test_gra = sel.transform(x_test)

In [43]:
%%time
# Reduced feature set chosen by RFE with gradient boosting.
run_random_forest(x_train_gra, x_test_gra, y_train, y_test)

Accuracy on test set: 
0.9736842105263158
CPU times: user 334 ms, sys: 70.9 ms, total: 405 ms
Wall time: 310 ms


In [44]:
%%time
# Original data.
run_random_forest(x_train, x_test, y_train, y_test)

Accuracy on test set: 
0.9649122807017544
CPU times: user 396 ms, sys: 47 ms, total: 443 ms
Wall time: 377 ms


### How to find the "n_features_to_select" value?

In [47]:
for index in range(1, 31):
    sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select = index)
    sel.fit(x_train, y_train)
    x_train_gra = sel.transform(x_train)
    x_test_gra = sel.transform(x_test)
    print("Selected Features Index: ", index)
    run_random_forest(x_train_gra, x_test_gra, y_train, y_test)
    features = x_train.columns[sel.get_support()]
    print("Selected Features Names: ", features)
    print()

Selected Features:  1
Accuracy on test set:  0.8771929824561403

Index(['worst concave points'], dtype='object')

Selected Features:  2
Accuracy on test set:  0.9035087719298246

Index(['mean concave points', 'worst concave points'], dtype='object')

Selected Features:  3
Accuracy on test set:  0.9649122807017544

Index(['mean concave points', 'worst area', 'worst concave points'], dtype='object')

Selected Features:  4
Accuracy on test set:  0.9736842105263158

Index(['mean concave points', 'worst texture', 'worst area',
       'worst concave points'],
      dtype='object')

Selected Features:  5
Accuracy on test set:  0.9649122807017544

Index(['mean concave points', 'worst texture', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')

Selected Features:  6
Accuracy on test set:  0.9912280701754386

Index(['mean concave points', 'area error', 'worst texture', 'worst perimeter',
       'worst area', 'worst concave points'],
      dtype='object')

Sel

As we can see, the highest test accuracy (about 0.991) was obtained with 6 features, so for this train/test split our n_features_to_select should be 6.
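
Choosing the feature count by looking at test accuracy can make the final score look better than it really is, because the test set has then been used for tuning. One alternative is scikit-learn's `RFECV`, which picks the count by cross-validation on the training data only. The sketch below is one possible setup, not the method used above: the 30 trees, `step=3`, and 3 folds are assumptions chosen to keep it fast, so the selected count may differ from 6.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

selector = RFECV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    step=3,  # drop 3 features per round (assumption, for speed)
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(x_train, y_train)  # the test set is never touched here

print("Number of features chosen by cross-validation:", selector.n_features_)
print("Chosen features:", list(data.feature_names[selector.support_]))

# The held-out test set can then be used once, for the final evaluation.
x_test_selected = selector.transform(x_test)
print("Reduced test shape:", x_test_selected.shape)
```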

### Let's find the "n_features_to_select" using RandomForestClassifier

In [48]:
for index in range(1, 31):
    # Select `index` features with RFE, then score a random forest on them.
    sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0), n_features_to_select = index)
    sel.fit(x_train, y_train)
    x_train_rfe = sel.transform(x_train)
    x_test_rfe = sel.transform(x_test)
    print("Selected Features Index: ", index)
    run_random_forest(x_train_rfe, x_test_rfe, y_train, y_test)
    features = x_train.columns[sel.get_support()]
    print("Selected Features Names: ", features)
    print()

Selected Features Index:  1
Accuracy on test set:  0.8947368421052632

Selected Features Names:  Index(['worst perimeter'], dtype='object')

Selected Features Index:  2
Accuracy on test set:  0.9298245614035088

Selected Features Names:  Index(['mean concave points', 'worst perimeter'], dtype='object')

Selected Features Index:  3
Accuracy on test set:  0.9473684210526315

Selected Features Names:  Index(['mean concave points', 'worst perimeter', 'worst concave points'], dtype='object')

Selected Features Index:  4
Accuracy on test set:  0.9649122807017544

Selected Features Names:  Index(['mean concave points', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')

Selected Features Index:  5
Accuracy on test set:  0.9649122807017544

Selected Features Names:  Index(['mean concave points', 'worst radius', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')

Selected Features Index:  6
Accuracy on test set:  0.95614035
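
The loops above refit RFE from scratch for every candidate count. With `step=1` and a fixed `random_state`, the elimination path is the same each time, so a single RFE run down to one feature already contains every subset in its `ranking_` attribute. The sketch below checks this for one value of `k`. The helper `make_forest`, the 25 trees, and `k = 5` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)


def make_forest():
    """Hypothetical helper: a small, deterministic forest."""
    return RandomForestClassifier(n_estimators=25, random_state=0)


# One run all the way down to a single feature records the full path.
full_path = RFE(make_forest(), n_features_to_select=1, step=1)
full_path.fit(X_train, y_train)

k = 5
# Features ranked 1..k are the last k survivors of the elimination.
mask_from_ranking = full_path.ranking_ <= k

# Direct run that stops at k features, for comparison.
direct = RFE(make_forest(), n_features_to_select=k, step=1)
direct.fit(X_train, y_train)

print("Same subset:", np.array_equal(mask_from_ranking, direct.support_))
```

With this approach, the loop over candidate counts only needs to slice columns with `ranking_ <= k` and score the classifier, which avoids repeating the elimination itself.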