Exhaustive feature selection
-------
Like sequential feature selection, exhaustive feature selection is a wrapper method that reduces an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. Unlike the sequential algorithms, which search greedily, it evaluates every candidate subset.

In exhaustive feature selection, the best subset of features is selected from all possible feature subsets by optimizing a specified performance metric for a given machine learning algorithm. For example, if the classifier is a logistic regression and the dataset consists of 4 features, the algorithm will evaluate all 15 feature combinations as follows:

- all possible combinations of 1 feature (4 subsets)
- all possible combinations of 2 features (6 subsets)
- all possible combinations of 3 features (4 subsets)
- the single combination of all 4 features

It then selects the combination that results in the best performance (e.g., classification accuracy) of the logistic regression classifier.
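
The enumeration for the 4-feature example can be sketched with the standard library alone. The feature names `f1` to `f4` are placeholders for illustration, not columns from any dataset used in this notebook.

```python
from itertools import combinations

# Hypothetical feature names, used only for illustration.
features = ["f1", "f2", "f3", "f4"]

# Collect every non-empty subset, from size 1 up to all features.
subsets = []
for k in range(1, len(features) + 1):
    subsets.extend(combinations(features, k))

print(len(subsets))  # 15
for subset in subsets:
    print(subset)
```

Each tuple printed here is one candidate that an exhaustive search would have to train and score.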

This is a brute-force search rather than a greedy one, as it evaluates all possible feature combinations. It is computationally expensive, because d features yield 2^d - 1 non-empty subsets, and it becomes infeasible when the feature space is large.
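
To make that cost concrete, the number of candidate subsets can be counted directly. The helper `n_subsets` is a hypothetical name introduced for this sketch; 57 is the number of columns that remain after the correlation filter later in this notebook.

```python
from math import comb

def n_subsets(n_features, min_features=1, max_features=None):
    """Count feature subsets whose size lies between the two bounds (inclusive)."""
    if max_features is None:
        max_features = n_features
    return sum(comb(n_features, k) for k in range(min_features, max_features + 1))

# Unbounded search: every non-empty subset.
for d in (4, 5, 10, 20, 57):
    print(d, n_subsets(d))

# Bounded search: only subsets of 1 to 4 features out of 57.
print(n_subsets(57, min_features=1, max_features=4))
```

Every counted subset is fitted once per cross-validation fold, so capping the subset size is usually the only way to keep the search tractable.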

The mlxtend package for Python implements this type of feature selection.

In the mlxtend implementation of exhaustive feature selection, the search is bounded by a minimum and a maximum subset size (min_features and max_features) that we set by hand. The search finishes once every combination within those bounds has been evaluated.

This is somewhat arbitrary because we may end up selecting a suboptimal number of features, or more features than we actually need.
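
The effect of a hand-picked size limit can be illustrated with a minimal sketch of a size-bounded exhaustive search. This is not mlxtend's implementation: `exhaustive_search`, `score_fn`, and `toy_score` are hypothetical names, and the toy scoring function stands in for a cross-validated metric such as ROC-AUC.

```python
from itertools import combinations

def exhaustive_search(n_features, score_fn, min_features=1, max_features=None):
    """Return (best_subset, best_score) over all subsets within the size bounds.

    score_fn is assumed to map a tuple of feature indices to a score
    where higher is better.
    """
    if max_features is None:
        max_features = n_features
    best_subset, best_score = None, float("-inf")
    for k in range(min_features, max_features + 1):
        for subset in combinations(range(n_features), k):
            score = score_fn(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy score: features 0 and 2 are useful, every extra feature costs a little.
def toy_score(subset):
    useful = {0, 2}
    return len(useful & set(subset)) - 0.1 * len(subset)

best_subset, best_score = exhaustive_search(4, toy_score, min_features=1, max_features=3)
print(best_subset, best_score)

# With a tighter bound the search can no longer reach the best pair.
capped_subset, capped_score = exhaustive_search(4, toy_score, min_features=1, max_features=1)
print(capped_subset, capped_score)
```

The second call shows the arbitrariness in practice: the winning pair of features is never evaluated when the maximum size is set to 1.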

Here I will use the exhaustive feature selection algorithm from mlxtend on a classification dataset (Paribas) and a regression dataset (House Price).

Reference - http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

In [4]:
warnings.filterwarnings(action='ignore')

In [6]:
file_path = '/Users/wontaek/Documents/Lecture_dataset/BNP_Paribas_Cardif_claims/train.csv'
data = pd.read_csv(file_path, nrows=50000)
data.shape

# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how predictive they are of the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape


# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target', 'ID'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape



# find and remove correlated features
# in order to reduce the feature space a bit
# so that the algorithm runs faster

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

corr_features = correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )


# remove correlated features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

correlated features:  55


((35000, 57), (15000, 57))

In [7]:
X_train.columns[0:10]

Index(['v1', 'v2', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11'], dtype='object')

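This method selects features by building and evaluating combinations of them.
- You choose the minimum and the maximum number of features per combination.
- The more features there are, the more combinations there are, so the search takes longer.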

In [24]:
# exhaustive feature selection
# the search is bounded by min_features and max_features,
# and the subsets are ranked by their cross-validated roc_auc

# in order to shorten the search time for the demonstration
# I will ask the algorithm to try all possible combinations of
# 1 to 5 features from the first 5 columns (31 subsets in total)

# if you have access to a multicore or distributed computer
# system you can try larger searches

efs1 = EFS(RandomForestClassifier(n_jobs=4, random_state=0), 
           min_features=1,
           max_features=5, 
           scoring='roc_auc',
           print_progress=True,
           cv=2)

efs1 = efs1.fit(np.array(X_train[X_train.columns[0:5]].fillna(0)), y_train)

Features: 31/31

In [13]:
def run_randomForests(X_train, X_test, y_train, y_test):
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)
    print('Train set')
    
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    print('Test set')
    
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [18]:
efs1.subsets_

{0: {'feature_idx': (0,),
  'cv_scores': array([0.50410602, 0.5043581 ]),
  'avg_score': 0.5042320625850624,
  'feature_names': ('0',)},
 1: {'feature_idx': (1,),
  'cv_scores': array([0.50496613, 0.49612179]),
  'avg_score': 0.5005439617005987,
  'feature_names': ('1',)},
 2: {'feature_idx': (2,),
  'cv_scores': array([0.50457252, 0.50092121]),
  'avg_score': 0.5027468671542781,
  'feature_names': ('2',)},
 3: {'feature_idx': (3,),
  'cv_scores': array([0.5050246, 0.5037485]),
  'avg_score': 0.5043865498028958,
  'feature_names': ('3',)},
 4: {'feature_idx': (0, 1),
  'cv_scores': array([0.5022019 , 0.51371453]),
  'avg_score': 0.5079582156770654,
  'feature_names': ('0', '1')},
 5: {'feature_idx': (0, 2),
  'cv_scores': array([0.51318981, 0.51302453]),
  'avg_score': 0.5131071659204627,
  'feature_names': ('0', '2')},
 6: {'feature_idx': (0, 3),
  'cv_scores': array([0.50586411, 0.5099452 ]),
  'avg_score': 0.5079046578999561,
  'feature_names': ('0', '3')},
 7: {'feature_idx': (1, 2

In [14]:
efs1.best_idx_

(0, 1, 2)
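
The winning subset is the entry of `subsets_` with the highest `avg_score`. A minimal sketch of that lookup follows, using a hand-copied dictionary shaped like the `subsets_` output above. It holds only a few of the printed entries, with scores rounded, so its result is not expected to match `best_idx_` from the full search.

```python
# Hand-copied, rounded entries shaped like the `subsets_` output above
# (illustrative only, not the full search result).
subsets = {
    0: {"feature_idx": (0,), "avg_score": 0.5042},
    3: {"feature_idx": (3,), "avg_score": 0.5044},
    4: {"feature_idx": (0, 1), "avg_score": 0.5080},
    5: {"feature_idx": (0, 2), "avg_score": 0.5131},
    6: {"feature_idx": (0, 3), "avg_score": 0.5079},
}

# Pick the key whose average cross-validation score is highest.
best_key = max(subsets, key=lambda k: subsets[k]["avg_score"])
best_idx = subsets[best_key]["feature_idx"]
print(best_key, best_idx)
```

The indices refer to column positions in the array passed to `fit`, which is why the next cell maps them back to column names through `X_train.columns`.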

In [15]:
selected_feat= X_train.columns[list(efs1.best_idx_)]
selected_feat

Index(['v1', 'v2', 'v4'], dtype='object')

In [16]:
# evaluate performance of classifier using selected features

run_randomForests(X_train[selected_feat].fillna(0),
                  X_test[selected_feat].fillna(0),
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.5433561866210962
Test set
Random Forests roc-auc: 0.5253970921093112


The regression case proceeds in the same way.