# Feature Selection/Ranking Methods

Selecting good features can lead to:
- less overfitting, by reducing the number of parameters
- less computation time
- improved accuracy (especially on high-dimensional data)

We will explore the important features of the 'Gender Recognition by Voice' dataset using the following methods:
1. Model Based Ranking
2. Regularization (L1 and L2)
3. Univariate Feature Selection
4. Recursive Feature Elimination
5. Random Forest Feature Importance
6. Stability Selection


We won't cover dimensionality reduction techniques like Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA); instead, we focus on common feature selection techniques.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

**Data preprocessing**

In [None]:
data = pd.read_csv('../input/voice.csv')

In [None]:
data.sample(5)

In [None]:
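# 'label' is still a string column at this point, so correlate only the numeric columns
sns.heatmap(data.select_dtypes('number').corr())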

In [None]:
data.info()

In [None]:
data.isnull().any()

In [None]:
data.label.value_counts()

We see that there are 3168 instances, 20 features, and a class label.
There are no missing values in any column.
The classes are balanced, with 1584 female and 1584 male instances.

In [None]:
# convert class label into binary number, 1: female, 0: male
data.label = np.where(data.label.values == 'female', 1, 0)
data.label.value_counts()

In [None]:
X = data.drop('label', axis = 1)
y = data.label

In [None]:
X.shape, y.shape

In [None]:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=333)

# 1. Model Based Ranking

We can fit a classifier to each feature individually and rank the features by predictive power.
This method identifies the features that are most powerful on their own but ignores the predictive power of features in combination.

Random Forest Classifier is used in this case because it is robust, nonlinear, and doesn't require scaling.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators = 50, max_depth = 4)

scores = []
num_features = len(X.columns)
for i in range(num_features):
    col = X.columns[i]
    score = np.mean(cross_val_score(clf, X[col].values.reshape(-1,1), y, cv=10))
    scores.append((int(score*100), col))

print(sorted(scores, reverse = True))



In [None]:
def print_best_worst (scores):
    scores = sorted(scores, reverse = True)
    
    print("The 5 best features selected by this method are :")
    for i in range(5):
        print(scores[i][1])
    
    print ("The 5 worst features selected by this method are :")
    for i in range(5):
        print(scores[len(scores)-1-i][1])

'meanfun', 'IQR', 'Q25', 'sd' are among the most important features, and 
'modindx', 'minfun', 'maxfun', 'Q75' are among the least important features.

# 2-1 L1-regularization (Lasso)

L1-regularization adds an L1 penalty on the parameters, which forces more of them to exactly zero as the regularization strength increases. Weak features should therefore end up with zero coefficients.


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# {'logisticregression__C': [1, 10, 100, 1000]
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}
# The L1 penalty needs a solver that supports it, such as liblinear
pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty = 'l1', solver = 'liblinear'))
grid = GridSearchCV(pipe, param_grid, cv = 10)
grid.fit(X, y)
print(grid.best_params_)

A pipeline was used so that the scaler is fit only on the training folds, which avoids data leakage, and grid search was used to find the best C.
A high C means weak regularization and a low C means strong regularization.
The best C chosen by the grid search was 0.1.
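
The relationship between C and sparsity can be checked directly. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the voice dataset); the names `X_demo`, `y_demo`, and `nonzero_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice data: 20 features, only some of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=5,
                                     n_redundant=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Count the non-zero coefficients for each regularization strength
nonzero_counts = {}
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X_demo, y_demo)
    nonzero_counts[C] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

The count of surviving coefficients should generally grow as C increases, since a larger C weakens the penalty.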

In [None]:
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 0.1)
clf.fit(X_scaled,y)

In [None]:
zero_feat = []
nonzero_feat = []
# type(clf.coef_)
for i in range(num_features):
    coef = clf.coef_[0,i]
    if coef == 0:
        zero_feat.append(X.columns[i])
    else:
        nonzero_feat.append((coef, X.columns[i]))
        
print ('Features that have a coefficient of 0 are: ', zero_feat)
print ('Features that have non-zero coefficients are:')
print (sorted(nonzero_feat, reverse = True))
        

Features 'meanfreq', 'sd', 'median', 'Q25', 'kurt', 'sp.ent', 'centroid', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange' were zeroed out by lasso.

Features 'meanfun', 'skew', 'sfm', 'modindx', 'mode', 'Q75', 'minfun', 'IQR' survived.

# 2-2 L2-regularization (Ridge)

L2-regularization is numerically more stable and produces more consistent coefficients than L1. However, it does not induce sparsity (coefficients are shrunk but not zeroed out).

In [None]:
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty = 'l2'))     
grid = GridSearchCV(pipe, param_grid, cv = 10)
grid.fit(X, y)
print(grid.best_params_)

The best C chosen by the grid search was 1.

In [None]:
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty = 'l2', C = 1)
clf.fit(X_scaled,y)

In [None]:
abs_feat = []
for i in range(num_features):
    coef = clf.coef_[0,i]
    abs_feat.append((abs(coef), X.columns[i]))
        
print (sorted(abs_feat, reverse = True))

In [None]:
print_best_worst(abs_feat)

# 3. Univariate Feature Selection

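Univariate feature selection scores each feature independently with a univariate statistical test, such as the chi-squared test, the ANOVA F-test, or mutual information, and keeps the highest-scoring features. Note that the chi-squared test requires non-negative feature values.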
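
The cells in this section use `chi2` and `mutual_info_classif`. As a complement, here is a minimal sketch of the F-test variant using scikit-learn's `f_classif`, which also accepts negative feature values. It runs on synthetic data (an assumption, not the voice dataset), and `X_demo`, `y_demo`, `selector`, `f_scores`, and `selected` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which 3 are informative
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)

# Score every feature with the ANOVA F-test and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X_demo, y_demo)

f_scores = selector.scores_                     # one F-statistic per feature
selected = selector.get_support(indices=True)   # column indices of the kept features

print(f_scores)
print(selected)
```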




In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif

test = SelectKBest(score_func=chi2, k=2)
test.fit(X, y)

In [None]:
scores = []
for i in range(num_features):
    score = test.scores_[i]
    scores.append((score, X.columns[i]))
        
print (sorted(scores, reverse = True))

In [None]:
print_best_worst(scores)

In [None]:
test = SelectKBest(score_func = mutual_info_classif, k=2)
test.fit(X, y)

In [None]:
scores = []
for i in range(num_features):
    score = test.scores_[i]
    scores.append((score, X.columns[i]))
        
print (sorted(scores, reverse = True))

In [None]:
print_best_worst(scores)

# 4. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) fits an estimator, removes the least important feature(s) according to a built-in attribute such as coefficients or feature importances, and repeats on the remaining features until the desired number is left. Hence the result depends heavily on which estimator we use.
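
One detail worth keeping in mind when reading RFE output: in scikit-learn, `ranking_` assigns 1 to the selected feature(s) and larger numbers to features eliminated earlier, so a smaller rank is better. A minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, `rfe_demo`, and `order` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)

# Eliminate one feature per iteration until a single feature remains
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe_demo.fit(X_demo, y_demo)

# Sort ascending by rank: the first entry is the last surviving (best) feature
order = sorted(zip(rfe_demo.ranking_, range(X_demo.shape[1])))
print(order)
```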



In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


rfe = RFE(LogisticRegression(), n_features_to_select=1)
rfe.fit(X,y)

In [None]:
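scores = []
for i in range(num_features):
    # ranking_ is 1 for the best feature; negate it because print_best_worst sorts in descending order
    scores.append((-rfe.ranking_[i],X.columns[i]))
    
print_best_worst(scores)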

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(RandomForestClassifier(), n_features_to_select = 1)
rfe.fit(X,y)

In [None]:
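scores = []
for i in range(num_features):
    # ranking_ is 1 for the best feature; negate it because print_best_worst sorts in descending order
    scores.append((-rfe.ranking_[i],X.columns[i]))
    
print_best_worst(scores)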

# 5. Random Forest Feature Importance


# 5-1. Mean Decrease Impurity

Ensemble tree classifiers such as random forests or extra trees consist of many decision trees, each grown by choosing splits that decrease impurity, measured by criteria like information gain or Gini impurity. Each time a node splits on a feature, the impurity decreases, and that decrease is credited to the feature. The impurity decrease of each feature can then be averaged over all trees in the forest.
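
The averaging step can be made concrete. The sketch below, on synthetic data (an assumption, not the voice dataset), recomputes the forest-level importances by averaging the per-tree `feature_importances_` and compares the result with scikit-learn's own attribute; `per_tree` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree exposes its own normalized impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over trees, then renormalize so the importances sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, forest.feature_importances_))
```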


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X,y)


In [None]:
scores = []
for i in range(num_features):
    scores.append((clf.feature_importances_[i],X.columns[i]))
        
print_best_worst(scores)

# 5-2. Mean Decrease Accuracy

For mean decrease accuracy, we measure the decline in accuracy when we shuffle or exclude a single feature.
Shuffling or excluding an important feature should result in a drop in accuracy.
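
A related variant is available in scikit-learn as `sklearn.inspection.permutation_importance`. Unlike the cell in this section, which re-runs cross-validation on the shuffled data, it shuffles each feature on held-out data and scores an already fitted model. A minimal sketch on synthetic data (an assumption; all variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test split and record the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Print features from largest to smallest mean accuracy drop
for idx in result.importances_mean.argsort()[::-1]:
    print(idx, round(result.importances_mean[idx], 3))
```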


In [None]:
from sklearn.model_selection import cross_val_score

scores = []
clf = RandomForestClassifier()
score_normal = np.mean(cross_val_score(clf, X, y, cv = 10))

# X_shuffled = X.copy()
# np.random.shuffle(X_shuffled[X.columns[i]])

# X_shuffled.meanfreq
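for i in range(num_features):
    col = X.columns[i]
    X_shuffled = X.copy()
    scores_shuffle = []
    for j in range(3):
        np.random.seed(j*3)
        # Assign a permuted copy of the column; shuffling the Series in place
        # is not guaranteed to write back to the DataFrame
        X_shuffled[col] = np.random.permutation(X[col].values)
        score = np.mean(cross_val_score(clf, X_shuffled, y, cv = 10))
        scores_shuffle.append(score)
        
    # Importance = drop in mean CV accuracy after shuffling this feature
    scores.append((score_normal - np.mean(scores_shuffle), col))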
    

In [None]:
scores,score_normal

In [None]:
print_best_worst(scores)

# 6. Stability Selection

The stability selection method uses randomized lasso for regression and randomized logistic regression for classification.
It randomly subsamples instances and features, selects good features on each subset, and aggregates the results.
It is straightforward to implement.
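
To illustrate the idea without relying on a dedicated estimator, here is a simplified sketch: it fits an L1-penalized logistic regression on repeated random half-samples of the rows and records how often each feature keeps a non-zero coefficient. This is an assumption-laden simplification that subsamples instances only and omits the feature perturbation; the data is synthetic, and `n_rounds`, `selected_counts`, and `selection_freq` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

rng = np.random.RandomState(0)
n_rounds = 50
selected_counts = np.zeros(X_demo.shape[1])

for _ in range(n_rounds):
    # Draw a random half of the rows without replacement
    rows = rng.choice(len(X_demo), size=len(X_demo) // 2, replace=False)
    model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
    model.fit(X_demo[rows], y_demo[rows])
    # Count the features whose coefficient survived the L1 penalty
    selected_counts += (model.coef_[0] != 0)

# Fraction of rounds in which each feature was selected
selection_freq = selected_counts / n_rounds
print(selection_freq)
```

Features with a selection frequency close to 1 are selected consistently across subsamples, which is the sense in which the selection is "stable".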


In [None]:
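# Note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this cell requires an older scikit-learn version
from sklearn.linear_model import RandomizedLogisticRegression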

clf = RandomizedLogisticRegression()
clf.fit(X,y)


In [None]:
zero_feat = []
nonzero_feat = []
# type(clf.coef_)
for i in range(num_features):
    coef = clf.scores_[i]
    if coef == 0:
        zero_feat.append(X.columns[i])
    else:
        nonzero_feat.append((coef, X.columns[i]))
        
print ('Features that have a score of 0 are: ', zero_feat)
print ('Features that have non-zero scores are:')
print (sorted(nonzero_feat, reverse = True))

# Conclusion


For feature selection, combining several of these techniques and cross-validating the outcome should give reliable results.
For feature ranking, treat the output of each method with caution. It is recommended to take multiple subsets of your data and aggregate the results to ensure the rankings are stable.
