# SelectKBest, Pipeline, GridSearchCV

In this module, we will look at three scikit-learn tools that help with feature selection, streamlining model building, and hyperparameter optimization.

## Pipeline
Instead of running the fitting and transformation steps separately for the training and test datasets, we can chain transformers such as StandardScaler and PCA with a final estimator (a RandomForestClassifier in this module) in a single pipeline.

In [14]:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [6]:
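# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = datasets.load_boston()
X = pd.DataFrame(boston['data'], columns=boston['feature_names'])

# Drop the binary CHAS column, then turn the target into a binary label:
# 1 if the value is above the mean, 0 otherwise.
del X['CHAS']
y = boston['target']
y = np.array(y > y.mean()).astype(int)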

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

The Pipeline object takes a list of tuples as input. The first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline, as we will see later in this module, and the second value is a scikit-learn transformer or estimator.

The intermediate steps in a pipeline are scikit-learn transformers, and the last step is an estimator. In the next notebook cell, we build a pipeline with two intermediate steps, a StandardScaler and a PCA transformer, and a random forest classifier as the final estimator. When we call the fit method on the pipeline, the StandardScaler performs fit and transform on the training data and passes the transformed data to the next object in the pipeline, the PCA. PCA likewise performs fit and transform on the scaled input and passes the result to the final element of the pipeline, the estimator. There is no limit to the number of intermediate steps in a pipeline.
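
The fit sequence described above can be sketched by hand and compared with a pipeline. This is a minimal illustration, not the notebook's own code: it assumes synthetic data from `make_classification` in place of the Boston data, and the names `X_demo`, `pipe_demo`, and `manual_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Boston data).
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Pipeline version: a single fit call runs every step in order.
pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=3, svd_solver='full')),
                      ('rf', RandomForestClassifier(n_estimators=10, random_state=0))])
pipe_demo.fit(Xd_train, yd_train)

# Manual version: fit_transform each transformer on the training data,
# fit the estimator last, then only transform (never refit) the test data.
scl = StandardScaler()
pca = PCA(n_components=3, svd_solver='full')
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(pca.fit_transform(scl.fit_transform(Xd_train)), yd_train)
manual_pred = rf.predict(pca.transform(scl.transform(Xd_test)))

# Both paths apply the same operations, so the predictions should agree.
same_predictions = np.array_equal(pipe_demo.predict(Xd_test), manual_pred)
print(same_predictions)
```

The point of the comparison is that the pipeline removes the bookkeeping: the test data is transformed with statistics learned on the training data without any extra code.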

In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('scl', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('rf', RandomForestClassifier())])

pipeline.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipeline.score(X_test, y_test))

Test Accuracy: 0.783


## SelectKBest

SelectKBest is typically used together with Pipeline. A Pipeline lets you apply a series of transformations ahead of a final estimator object. You can tell that an object is a transformer because it has a .transform() method. SelectKBest is a transformer. Another example is sklearn.preprocessing.StandardScaler, which standardizes your features.

Note that the last object in the Pipeline argument should be the classifier / estimator you're going to use. 

### How does SelectKBest work?

SelectKBest applies a univariate analysis, scoring each available feature against the label independently of the other features. The k highest-scoring features are kept. Calling SelectKBest.transform() on your data reduces it to those k columns, so the downstream classifier / estimator only sees the k best features. You can specify the scoring function that SelectKBest uses through its score_func argument (the default is f_classif). The most common options are:

+ __f_classif__ - ANOVA F-value between label/feature for classification tasks.
+ __f_regression__ - F-value between label/feature for regression tasks.
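
To make the scoring step concrete, here is a small sketch that calls `f_classif` directly and compares it with what `SelectKBest` keeps. The synthetic data and the names `X_demo`, `selector`, and `top3` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 8 features, of which 3 carry signal.
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# f_classif scores every column independently against the label
# and returns one F-value and one p-value per feature.
f_scores, p_values = f_classif(X_demo, y_demo)

# SelectKBest stores the same scores and keeps the k highest.
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X_demo, y_demo)

# Column indices of the three largest F-values, computed by hand.
top3 = np.sort(np.argsort(f_scores)[-3:])
selected = selector.get_support(indices=True)

print(X_reduced.shape)
print(selected, top3)
```

Swapping `f_classif` for `f_regression` follows the same pattern when the label is continuous.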

In [29]:
from sklearn.feature_selection import SelectKBest

# Create the SelectKBest() object (the default score_func is f_classif)
kbest = SelectKBest(k=5)

# Create the Pipeline() object, and then fit it against the data set.
pipeline = Pipeline([('scl', StandardScaler()),
    ('kbest', kbest),
    ('rf', RandomForestClassifier(random_state=1))])
  
pipeline.fit(X_train, y_train)

# Let's see how our SelectKBest pipeline performed
print('\nTest Accuracy: %.3f' % pipeline.score(X_test, y_test))


Test Accuracy: 0.822


In [31]:
# This shows which fields were selected
print('SelectKBest Feature Selections:')
print(50 * '-')
for f, p in zip(X.columns, pipeline.named_steps['kbest'].get_support()):
    print('%s >  %s' % (f, p))

SelectKBest Feature Selections:
--------------------------------------------------
CRIM >  False
ZN >  False
INDUS >  True
NOX >  False
RM >  True
AGE >  True
DIS >  False
RAD >  False
TAX >  False
PTRATIO >  True
B >  False
LSTAT >  True


In [32]:
# This shows each field's score
print('SelectKBest Feature Scores:')
print(50 * '-')
for f, p in zip(X.columns, pipeline.named_steps['kbest'].scores_):
    print('%s >  %s' % (f, p))

SelectKBest Feature Scores:
--------------------------------------------------
CRIM >  28.4901903277
ZN >  56.8925775276
INDUS >  117.814262359
NOX >  80.6977615043
RM >  149.87988113
AGE >  93.1370582776
DIS >  33.2661738285
RAD >  48.1654844612
TAX >  82.3217957654
PTRATIO >  118.054422134
B >  23.034010459
LSTAT >  269.476769308


In [33]:
# This shows each field's p-value
print('SelectKBest Feature p-values:')
print(50 * '-')
for f, p in zip(X.columns, pipeline.named_steps['kbest'].pvalues_):
    print('%s >  %s' % (f, p))

SelectKBest Feature p-values:
--------------------------------------------------
CRIM >  1.69129059729e-07
ZN >  3.95945854788e-13
INDUS >  7.20495038008e-24
NOX >  1.62493444336e-17
RM >  5.94154805009e-29
AGE >  1.04745354572e-19
DIS >  1.76075304334e-08
RAD >  1.89190751618e-11
TAX >  8.33824390776e-18
PTRATIO >  6.58028085655e-24
B >  2.35622743573e-06
LSTAT >  2.27483786418e-45


## GridSearchCV

Unlike model parameters, which are learned from the data during fitting, hyperparameters are set before the model is fit and therefore need to be optimized separately. Grid search is a hyperparameter optimization technique that evaluates every combination of the candidate values you list, using cross-validation, and keeps the combination that scores best.
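
As a rough sketch of what "every combination" means for a pipeline, the snippet below expands a small grid with `ParameterGrid`. The grid values and the names `pipe_demo` and `demo_grid` are hypothetical; the only convention relied on is that pipeline hyperparameters are addressed as the step name, two underscores, and the argument name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('kbest', SelectKBest()),
                      ('rf', RandomForestClassifier(random_state=1))])

# Hypothetical grid: two hyperparameters, addressed as <step name>__<argument>.
demo_grid = {'kbest__k': [2, 4, 6],
             'rf__n_estimators': [10, 50]}

# Every key must be a parameter name that the pipeline exposes.
valid_names = all(name in pipe_demo.get_params() for name in demo_grid)

# Grid search evaluates each combination: 3 * 2 = 6 candidates here.
candidates = list(ParameterGrid(demo_grid))
print(valid_names, len(candidates))
for params in candidates:
    print(params)
```

Because the number of candidates is the product of the list lengths, and each candidate is fit once per cross-validation fold, grids grow expensive quickly.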

In [50]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
# Create a grid_search_dict. Using these with a pipeline can be a little tricky - note that
#    you have to specify which pipeline component you want to grid search by using the alias
#    you provided when you created the Pipeline object + '__' + [argument name]. 
grid_search_dict = {'kbest__k': range(1, len(X.columns)+1)}     

grid_search = GridSearchCV(pipeline, param_grid=grid_search_dict, scoring='accuracy', cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Let's see how our SelectKBest/GridSearch pipeline performed
print('SelectKBest/GridSearch - Accuracy:  %s' % accuracy_score(y_test, grid_search.predict(X_test)))

 SelectKBest/GridSearch - Accuracy:  0.828947368421


In [51]:
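# GridSearchCV works on clones, so `pipeline` itself no longer shows which fields were selected.
# Inspect grid_search.best_estimator_.named_steps['kbest'] for the selection instead.
# The full optimized pipeline is available through the following attribute
grid_search.best_estimator_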

Pipeline(steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=10, score_func=<function f_classif at 0x000000000BE123C8>)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_...estimators=10, n_jobs=1, oob_score=False, random_state=1,
            verbose=0, warm_start=False))])

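Question: do we need to redefine the model using the `best_estimator_` attribute and then refit and recalculate the score? No: with the default `refit=True`, GridSearchCV already refits the best pipeline on the full training set, and `grid_search.predict` and `grid_search.score` delegate to it. That is why the next cell gives the same answer as above.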

In [52]:
# best_estimator_ is the pipeline that GridSearchCV already refit on the training data
best_pipeline = grid_search.best_estimator_
best_pipeline.fit(X_train, y_train)

# Let's see how the best pipeline performed
print('SelectKBest/GridSearch - Accuracy:  %s' % best_pipeline.score(X_test, y_test))

SelectKBest/GridSearch - Accuracy:  0.828947368421
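
The refit question above can also be checked on a small self-contained example. This sketch assumes synthetic data and a hypothetical three-value grid; `search`, `best`, and `mask` are illustrative names. It compares the score from the search object with the score from `best_estimator_`, and reads the selected-feature mask from the refit pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Boston data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('kbest', SelectKBest()),
                      ('rf', RandomForestClassifier(n_estimators=10, random_state=1))])

# refit=True is the default: the best candidate is refit on all training data.
search = GridSearchCV(pipe_demo, param_grid={'kbest__k': [2, 5, 10]},
                      scoring='accuracy', cv=5)
search.fit(Xd_train, yd_train)

# Scoring through the search object and through best_estimator_ uses the same model.
best = search.best_estimator_
score_via_search = search.score(Xd_test, yd_test)
score_via_best = best.score(Xd_test, yd_test)

# The selected-feature mask is still reachable through the refit pipeline.
mask = best.named_steps['kbest'].get_support()
print(search.best_params_)
print(np.isclose(score_via_search, score_via_best))
print(int(mask.sum()) == search.best_params_['kbest__k'])
```

Refitting `best_estimator_` by hand is therefore only needed if the search was created with `refit=False`.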


## References
- Raschka, Sebastian. _Python Machine Learning_. Packt Publishing, 2015, Birmingham, UK.
- sklearn - Univariate Feature Selection: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
- sklearn.feature_selection.SelectKBest: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
- sklearn.pipeline.Pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- sklearn.grid_search.GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
- Alternative example on everything covered here: https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/