### Step backward feature selection

Step Backward Feature Selection starts by fitting a model using all features in the data set and determining its performance. 

Then, it trains models on all possible combinations of all features -1, and removes the feature that returns the model with the lowest performance.

In the third step it trains models in all possible combinations of the features remaining from step 2 -1 feature, and removes the feature that produced the lowest performing model.

The algorithm stops on a criteria determined by the user. This criteria could be that the model performance does not decrease beyond a certain threshold, or alternatively, as in the mlxtend implementation, when we reach a certain number of selected features.

The evaluation metric can be the roc_auc for classification or the r squared for regression for example, and is determined by the user.

Step Backward Feature Selection is called greedy, because it evaluates all possible n, and then n-1 and n-2 and so on feature combinations. Therefore, it is very computationally expensive, and sometimes, if feature space is big, even unfeasible.

There is a special package in Python that implements this type of feature selection: mlxtend.
http://rasbt.github.io/mlxtend/

In the mlxtend implementation of the Step Backward Feature Selection, the stopping criteria is an arbitrarily set number of features. So the search will finish when we reach the desired number of selected features. 

This is somewhat arbitrary, we might be selecting a sub-opimal number of features, or likewise, a high number of features. But, by looking at the performance metric returned by the algorithm as it selects the features, we can have a view, if more features do add value, or not. 


**Note**
If we wanted to stop the search by using another criteria, we would have to code the algorithm ourselves, unfortunately :(

Here I will use the Step Backward Feature Selection algorithm from mlxtend in a classification and regression dataset.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score, r2_score

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

## Classification

In [24]:
# load dataset

data = pd.read_csv('C:/Users/RAJENDRA REDDY/Downloads/Genre0.csv')
data.shape

(204, 36)

In [25]:
data.head()

Unnamed: 0,chroma_stft_min,chroma_stft_max,chroma_cqt_min,chroma_cqt_max,chroma_cens_min,chroma_cens_max,melspectogram_min,melspectogram_max,mfcc_min,mfcc_max,...,zero_crossing_rate_min,zero_crossing_rate_max,tempogram_min,tempogram_max,delta_mfcc_min,delta_mfcc_max,mel_to_stft_min,mel_to_stft_max,class,song
0,0.000465,1,0.0155,1,0.0,0.896673,4.63e-06,8115.6733,-179.931,152.82954,...,0.017578,0.510742,-3.41e-16,1,-22.53457,24.518091,0,19.284609,0,Sai Aaye Ghar Mere_shortened.wav
1,0.000995,1,0.055937,1,0.015298,0.7114,4.67e-07,911.08636,-205.9167,153.3341,...,0.044922,0.393555,-2.9e-16,1,-24.84063,25.185534,0,10.810534,0,Sai Baba De Kol_shortened.wav
2,0.002606,1,0.045407,1,0.0,0.748225,1.75e-06,4857.339,-153.78363,138.30722,...,0.023438,0.501953,-3.06e-16,1,-22.603357,29.282093,0,17.607744,0,Sai Baba Humko_shortened.wav
3,0.001447,1,0.041263,1,0.0,0.782758,3.11e-07,3757.0784,-194.9471,146.71315,...,0.020508,0.225586,-3.02e-16,1,-23.918428,24.815857,0,15.681977,0,Sai Baba Ji Kar Do_shortened.wav
4,0.002157,1,0.040596,1,0.001198,0.717803,5.21e-06,4824.614,-189.73987,157.58157,...,0.02002,0.328613,-3.98e-16,1,-19.628017,24.007666,0,16.976337,0,Sai Baba Mujhe Gale Se_shortened.wav


**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. And this is to avoid overfit.

In [30]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['mel_to_stft_max','class','song'], axis=1),
    data['mel_to_stft_max'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((142, 33), (62, 33))

### Remove correlated features

Step Backward Feature Selection takes a long time to run, so to speed it up we will reduce the feature space by removing correlated features first.

In [31]:
# remove correlated features to reduce the feature space

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )

correlated features:  7


In [32]:
# removed correlated  features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((142, 26), (62, 26))

### Step Backward Feature Selection

For the Step Backward feature selection algorithm, we are going to use the class SFS from MLXtend:
http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

In [33]:
# within the SFS we indicate:

# 1) the algorithm we want to create, in this case RandomForests
# (note that I use few trees to speed things up)

# 2) the stopping criteria: want to select 50 features

# 3) wheter to perform step forward or step backward

# 4) the evaluation metric: in this case the roc_auc
# 5) the want cross-validation

# this is going to take a while, do not despair

sfs = SFS(RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0),
          k_features=10, # the lower the features we want, the longer this will take
          forward=False,
          floating=False,
          verbose=2,
          scoring='roc_auc',
          cv=2)

sfs = sfs.fit(np.array(X_train), y_train)

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
Traceback (most recen

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 331, in fit
    y, expanded_class_weight = self._validate_y_class_weight(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\ensemble\_forest.py", line 559, in _validate_y_class_weight
    check_classification_targets(y)
  File "c:\users\rajendra reddy\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Traceback (most recent call last):
  File "c:\users\rajendra reddy\appdata\local\programs\python\pyth

In the previous log, we see that the performance does not decrease, so we could continue removing features. If you have time and patience, why don't you try that?

### Compare performance of feature subsets

In [34]:
# function to train random forests and evaluate the performance

def run_randomForests(X_train, X_test, y_train, y_test):
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [35]:
selected_feat= X_train.columns[list(sfs.k_feature_idx_)]
selected_feat

Index(['chroma_stft_min', 'chroma_stft_max', 'chroma_cqt_min',
       'chroma_cqt_max', 'chroma_cens_min', 'chroma_cens_max',
       'melspectogram_min', 'melspectogram_max', 'mfcc_min', 'mfcc_max'],
      dtype='object')

In [16]:
selected_feat= X_train.columns[list(sfs.k_feature_idx_)]
X_train = X_train[selected_feat]
X_test =  X_test[selected_feat]
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('Ada Boost roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))


clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
  max_depth=1, random_state=0).fit(X_train, y_train)

clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('GradientBoostingClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))


clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(
     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
     voting='soft')

params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X_train,y_train)
y_pred = grid.predict_proba(X_train)
print('Voting Classifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf = DecisionTreeClassifier(criterion="entropy", max_depth=9)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Decision Tree Accuracy:",roc_auc_score(y_train, y_pred,multi_class="ovo"))

clf = BaggingClassifier(base_estimator=SVC(),
                        n_estimators=10, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('BaggingClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf = KNeighborsClassifier(n_neighbors = 5)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)

print("KNN {}nn score: {}",roc_auc_score(y_train, y_pred,multi_class="ovo"))

clf = GaussianNB()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Accuracy of Naive Bayes Algo: ", roc_auc_score(y_train, y_pred,multi_class="ovo"))


clf = MLPClassifier(random_state=1, max_iter=600).fit(X_train, y_train)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Accuracy of MLPClassifier",roc_auc_score(y_train, y_pred,multi_class="ovo"))



Ada Boost roc-auc: 0.767770930311323
GradientBoostingClassifier roc-auc: 0.9240862195550715


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

Voting Classifier roc-auc: 0.9859618182260258
Decision Tree Accuracy: 0.9659604806592382
BaggingClassifier roc-auc: 0.6223067822038812
KNN {}nn score: {} 0.8570653175140605
Accuracy of Naive Bayes Algo:  0.7189843535578462
Accuracy of MLPClassifier 0.6437039054514578


In [36]:
# evaluate performance of algorithm built
# using selected features

run_randomForests(X_train[selected_feat],
                  X_test[selected_feat],
                  y_train, y_test)

ValueError: Unknown label type: 'continuous'

In [12]:
# and for comparison, we train random forests using
# all features (except the correlated ones, which we removed already)

run_randomForests(X_train,
                  X_test,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.7119921185820277
Test set
Random Forests roc-auc: 0.6957598691250635


Performance, as expected is roughly the same.

## Regression

Let's now repeat the process but in the context of regression. With the house prices dataset from Kaggle, the aim is to predict the continuous target: House Price.

In [13]:
# load dataset

data = pd.read_csv('../houseprice.csv')
data.shape

(1460, 81)

In [14]:
# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how deterministic they are of the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)

In [15]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 37), (438, 37))

### Remove correlated features

In [16]:
# find and remove correlated features

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )

correlated features:  3


In [17]:
# removed correlated features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((1022, 34), (438, 34))

In [18]:
X_train.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

### Step Backward Feature Selection

In [19]:
# step backward feature selection algorithm

sfs = SFS(RandomForestRegressor(n_estimators=10, n_jobs=4, random_state=10), 
           k_features=20, 
           forward=False, 
           floating=False, 
           verbose=2,
           scoring='r2',
           cv=2)

sfs = sfs.fit(np.array(X_train), y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  34 out of  34 | elapsed:    2.6s finished

[2020-09-22 17:30:21] Features: 33/20 -- score: 0.825434533342885[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  33 out of  33 | elapsed:    2.5s finished

[2020-09-22 17:30:23] Features: 32/20 -- score: 0.8269182238540728[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    2.5s finished

[2020-09-22 17:30:26] Features: 31/20 -- score: 0.8321203993856869[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done

In [20]:
sfs.k_feature_idx_

(0, 3, 4, 5, 6, 7, 9, 12, 14, 16, 18, 20, 22, 23, 24, 26, 27, 28, 31, 32)

In [21]:
X_train.columns[list(sfs.k_feature_idx_)]

Index(['Id', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtFinSF1', 'TotalBsmtSF', '2ndFlrSF', 'GrLivArea',
       'BsmtHalfBath', 'HalfBath', 'KitchenAbvGr', 'Fireplaces', 'GarageCars',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'MiscVal', 'MoSold'],
      dtype='object')

### Compare performance of feature subsets

In [22]:
# function to train random forests and evaluate the performance

def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestRegressor(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict(X_train)
    print('Random Forests roc-auc: {}'.format(r2_score(y_train, pred)))
    
    print('Test set')
    pred = rf.predict(X_test)
    print('Random Forests roc-auc: {}'.format(r2_score(y_test, pred)))

In [23]:
selected_feat = X_train.columns[list(sfs.k_feature_idx_)]

In [24]:
# evaluate performance of algorithm built
# using selected features

run_randomForests(X_train[selected_feat],
                  X_test[selected_feat],
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.8702345974928545
Test set
Random Forests roc-auc: 0.8240294353877188


In [25]:
# and for comparison, we train random forests using
# all features (except the correlated ones, which we removed already)

run_randomForests(X_train,
                  X_test,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.8699152317492538
Test set
Random Forests roc-auc: 0.8190809813112794


This is all for this lecture. I hope you enjoyed it, and see you in the next one!