### Feature Selection:
Feature selection is a process where you automatically select those features in your data that
contribute most to the prediction variable or output in which you are interested. Having
irrelevant features in your data can decrease the accuracy of many models, especially linear
algorithms like linear and logistic regression. Three benefits of performing feature selection
before modeling your data are:
<ul>
    <li>Reduces Overfitting: Less redundant data means less opportunity to make decisions
        based on noise.
    </li>
    <li>
        Improves Accuracy: Less misleading data means modeling accuracy improves.
    </li>
    <li>
        Reduces Training Time: Less data means that algorithms train faster.
    </li>
</ul> 
    

### 1.0 Univariate Selection

In [4]:
import pandas as pd
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load Data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima.csv', names = names)
data.head()

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1


In [7]:
# Preprocess Data
array = data.values
X = array[:, 0:8]
y = array[:, 8]

In [15]:
# Select Features
test = SelectKBest(score_func = chi2, k = 4)
fit = test.fit(X, y)

# Summarize Scores
set_printoptions(precision = 3)
print(fit.scores_)
features = fit.transform(X)

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]


In [16]:
print(features[0:5,:])

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
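
The raw score array above is easier to interpret once each score is paired with its column name. The following is a minimal sketch that hard-codes the printed scores rather than re-running `SelectKBest`; the names `feature_names`, `scores`, `ranked`, and `top_k` are illustrative assumptions and do not appear in the original notebook.

```python
# Feature names in the same column order as X (the first eight columns of the data).
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Chi-squared scores copied from the output above (assumed to match fit.scores_).
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Sort (name, score) pairs from highest to lowest score.
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)

# Keep the k best names, mirroring k = 4 in the SelectKBest call.
k = 4
top_k = [name for name, _ in ranked[:k]]

for name, score in ranked:
    print(f"{name:>5}: {score:10.3f}")
print("Top", k, "features:", top_k)
```

The four highest-scoring names are test, plas, age and mass. These match the four columns of the transformed array printed above, which keeps the original column order (plas, test, mass, age).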


### 1.1 Recursive Feature Elimination

In [24]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

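model = LogisticRegression()
# n_features_to_select must be passed by keyword in current scikit-learn releases
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)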

print("Num Features: ", fit.n_features_)
print("Selected Features: ", fit.support_) 
print("Feature Ranking: ", fit.ranking_)

Num Features:  3
Selected Features:  [ True False False False False  True  True False]
Feature Ranking:  [1 2 3 5 6 1 1 4]
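
The boolean mask and the ranking array can be translated back into feature names. The sketch below hard-codes the two arrays printed above instead of calling `RFE` again; `feature_names`, `selected`, and `elimination_order` are hypothetical names introduced for illustration. It relies on the RFE convention that selected features receive rank 1 and that larger ranks mean the feature was eliminated earlier.

```python
# Feature names in the same column order as X.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Values copied from the output above (assumed to match fit.support_ and fit.ranking_).
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# Features kept by RFE are those flagged True (equivalently, those with rank 1).
selected = [name for name, keep in zip(feature_names, support) if keep]

# Eliminated features, ordered from the last one eliminated (rank 2)
# to the first one eliminated (highest rank).
eliminated = sorted(pair for pair in zip(ranking, feature_names) if pair[0] > 1)
elimination_order = [name for _, name in eliminated]

print("Selected:", selected)
print("Eliminated (last to first):", elimination_order)
```

Read this way, the output above says RFE kept preg, mass and pedi, and that test was the first feature to be dropped.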


### 1.2 Principal Component Analysis:
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a
compressed form. Generally this is called a data reduction technique. Unlike the selection
methods above, PCA builds new features as linear combinations of the original ones rather than
keeping a subset of them. A property of PCA is that you can choose the number of dimensions or
principal components in the transformed result. In the example below, we use PCA and select 3
principal components. Learn more about the PCA class in scikit-learn by reviewing the API.

In [45]:
from sklearn.decomposition import PCA
pca  = PCA(n_components = 3)
fit = pca.fit(X)

# summarize components
print("Explained Variance: ", fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance:  [ 0.889  0.062  0.026]
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]
 [ -2.246e-02   1.434e-01  -9.225e-01  -3.070e-01   2.098e-02  -1.324e-01
   -6.400e-04  -1.255e-01]]
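
A common way to read the explained variance ratios is as a running total, which shows how much of the dataset's variance is retained as components are added. This is a small sketch using the three ratios printed above as hard-coded values; the names `explained_variance_ratio` and `cumulative` are assumptions made for illustration.

```python
from itertools import accumulate

# Ratios copied from the output above (assumed to match fit.explained_variance_ratio_).
explained_variance_ratio = [0.889, 0.062, 0.026]

# Running total of the variance captured as each component is added.
cumulative = [round(total, 3) for total in accumulate(explained_variance_ratio)]

for i, (ratio, total) in enumerate(zip(explained_variance_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.3f} of variance, {total:.3f} cumulative")
```

With these values, the three components together retain roughly 97.7% of the variance. Note also that the first row of the components matrix weights the fifth column (test) at about 0.993, so the first component is close to the test variable on its own. This is likely because PCA is applied here to unscaled data, where columns with large numeric ranges dominate.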


### 1.3 Feature Importance:
Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the
importance of features. In the example below we construct an ExtraTreesClassifier for the Pima
Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class
in the scikit-learn API.

You can see that we are given an importance score for each attribute where the larger the
score, the more important the attribute. The scores suggest the importance of plas, mass
and age. Because the trees are built with randomization, the exact scores vary from run to run.

In [47]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)

[ 0.122  0.237  0.097  0.077  0.078  0.137  0.123  0.128]
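
One rough way to judge which importances stand out is to compare each score with the value every feature would have if all were equally important, namely 1 divided by the number of features. This comparison is an illustrative heuristic, not a scikit-learn feature, and the sketch below hard-codes the scores printed above; `feature_names`, `baseline`, and `above_baseline` are hypothetical names.

```python
# Feature names in the same column order as X.
feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Importances copied from the output above (assumed to match model.feature_importances_).
importances = [0.122, 0.237, 0.097, 0.077, 0.078, 0.137, 0.123, 0.128]

# If every feature mattered equally, each would score 1 / n_features.
baseline = 1 / len(importances)

# Keep the features whose importance exceeds the uniform baseline.
above_baseline = [
    name for name, score in zip(feature_names, importances) if score > baseline
]

print(f"Uniform baseline: {baseline:.3f}")
print("Above baseline:", above_baseline)
```

For this particular run, only plas, mass and age exceed the baseline of 0.125, which agrees with the reading given in the text. Several other scores sit just below it, so a different random run could shift the boundary.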
