# Feature Selection
Feature selection automatically picks the features that contribute most to predicting the target label (y). Its benefits are (1) reduced overfitting, (2) improved accuracy, and (3) reduced training time.

In [1]:
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier



In [2]:
# these examples use the Pima Indian diabetes dataset
url = "pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

In [3]:
# separate array into features (X) and label (y) parts
X = array[:,0:8]
y = array[:,8]

## Univariate Selection
Univariate selection applies a statistical test to each feature independently and keeps the features with the strongest relationship to the output variable. This example uses the chi-squared test, which requires non-negative feature values. The test scores how strongly each feature depends on the class label; a higher score indicates a stronger relationship.

In [4]:
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

# see the scores
numpy.set_printoptions(precision=3)
print(fit.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]


In [5]:
# the selected features (see the first 5 rows)
features = fit.transform(X)
print(features[0:5,:])

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
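The transformed array has no column names, so it helps to map the kept columns back to feature names. A minimal sketch, assuming a small made-up array rather than the Pima data (`toy_X`, `toy_y`, and `toy_names` are illustrative names, not part of the notebook). `get_support()` returns a boolean mask over the input columns.

```python
import numpy
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 6 samples, 3 non-negative features.
# 'a' tracks the label closely, 'b' is constant, 'c' is only weakly related.
toy_names = numpy.array(['a', 'b', 'c'])
toy_X = numpy.array([
    [10, 5, 3],
    [12, 5, 4],
    [11, 5, 3],
    [1, 5, 4],
    [2, 5, 3],
    [1, 5, 4],
])
toy_y = numpy.array([1, 1, 1, 0, 0, 0])

# Keep only the single best-scoring feature
selector = SelectKBest(score_func=chi2, k=1).fit(toy_X, toy_y)

# Boolean mask over the input columns: True where the column was kept
mask = selector.get_support()
selected_names = toy_names[mask]
print(selected_names)
```

Applied to this notebook, indexing `numpy.array(names[0:8])` with `fit.get_support()` should name the four columns shown in the output above.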


## Recursive Feature Elimination
Recursive Feature Elimination (RFE) repeatedly fits a model, removes the weakest attribute, and refits on the remaining attributes until the requested number is left. It ranks attributes by the fitted model's coefficients (or feature importances), not directly by model accuracy. This example uses RFE with logistic regression to select the top 3 features.

In [6]:
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)  # keyword form; recent scikit-learn releases reject the positional argument
fit = rfe.fit(X, y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
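The elimination loop can also be written out by hand to show what RFE does. This is a rough sketch on synthetic data from `make_classification`, not the Pima dataset; the names `demo_X`, `demo_y`, and `remaining` are illustrative, and `max_iter=1000` is an assumption made only to help the solver converge.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 6 features, 3 of them informative
demo_X, demo_y = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)

# Manual elimination: refit, drop the feature with the smallest
# absolute coefficient, repeat until 3 features remain.
remaining = list(range(demo_X.shape[1]))
while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000).fit(demo_X[:, remaining], demo_y)
    weakest = int(numpy.argmin(numpy.abs(model.coef_[0])))
    del remaining[weakest]

# The same selection using RFE (one feature removed per step by default)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(demo_X, demo_y)

print("Manual loop kept:", remaining)
print("RFE kept:        ", numpy.flatnonzero(rfe.support_).tolist())
```

Because each round refits the model, a feature's coefficient can change once another feature is removed, which is why the result can differ from simply keeping the three largest coefficients of a single fit.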


## Principal Component Analysis
PCA uses linear algebra to transform the dataset into a compressed form (i.e., it is a data reduction technique). Rather than picking a subset of the original columns, it builds new features (principal components) that are linear combinations of the original columns, ordered by how much of the dataset's variance each one captures. Visual explanation: http://setosa.io/ev/principal-component-analysis

You can choose the number of dimensions (principal components) in the transformed result. This example uses PCA to keep 3 principal components.

In [7]:
pca = PCA(n_components=3)
fit = pca.fit(X)

print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]


In [8]:
# print the first 5 rows, projected onto the 3 principal components
# (the PCA is already fitted above, so transform is enough; no refit needed)
components = fit.transform(X)
print(components[0:5,:])

[[-7.571e+01 -3.595e+01 -7.261e+00]
 [-8.236e+01  2.891e+01 -5.497e+00]
 [-7.463e+01 -6.791e+01  1.946e+01]
 [ 1.108e+01  3.490e+01 -5.302e-02]
 [ 8.974e+01 -2.747e+00  2.521e+01]]


More info: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
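The linear algebra behind PCA can be sketched with NumPy alone: center the data, take a singular value decomposition, and read the components and explained variance off the result. The data below is randomly generated for illustration (an assumption, not the Pima dataset), and the variable names are hypothetical. The results should agree with scikit-learn's `PCA` up to the sign of each component.

```python
import numpy

rng = numpy.random.default_rng(0)
# Hypothetical data: 100 samples, 4 features on very different scales
demo = rng.normal(size=(100, 4)) * numpy.array([10.0, 5.0, 1.0, 0.1])

# 1. Center each column on its mean
centered = demo - demo.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = numpy.linalg.svd(centered, full_matrices=False)

# 3. Variance captured by each direction, as a fraction of the total
explained_variance = S**2 / (len(demo) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# 4. Keep the first 3 components and project the data onto them
n_components = 3
components = Vt[:n_components]
projected = centered @ components.T

print(explained_variance_ratio[:n_components])
print(projected[0:5, :])
```

PCA is sensitive to feature scale: a column with a large numeric range dominates the first component. In the notebook output above, the first component loads almost entirely on `test` (0.993), so standardizing the features first (for example with `sklearn.preprocessing.StandardScaler`) may give a more balanced result.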

## Feature Importance
Random Forest and Extra Trees can be used to estimate the importance of features. This example uses the ExtraTreesClassifier algorithm.

In [9]:
model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)

[0.117 0.222 0.106 0.079 0.077 0.135 0.126 0.139]


The higher the score, the more important the attribute.

In [10]:
# to see the features ordered from most important to least
feature_names = numpy.array(names[0:8])  # keep `names` intact so the cell can be re-run
scores = model.feature_importances_
ind = scores.argsort()[::-1]  # indices sorted from most to least important
most_important = feature_names[ind]

# show the top 3 most important features
most_important[0:3]

array(['plas', 'age', 'mass'], dtype='<U4')
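To see each name next to its score, the two can be zipped and sorted together. A small sketch using the importances printed earlier in this section, copied in as literals; `ExtraTreesClassifier` is randomized, so a fresh run without a fixed `random_state` will give somewhat different numbers.

```python
import numpy

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Importances copied from the printed output above (illustrative values)
scores = numpy.array([0.117, 0.222, 0.106, 0.079, 0.077, 0.135, 0.126, 0.139])

# Pair each name with its score and sort by score, highest first
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:5s} {score:.3f}")
```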