
# Feature Selection
Often we have more features than we know what to do with. Generally speaking, the more features you use in your model, the more data you need to cover the various combinations of feature values. This is referred to as the "curse of dimensionality" (Bellman) and is described as:

> In machine learning problems that involve ... data samples in a **high-dimensional feature space** with each feature having a number of possible values, **an enormous amount of training data is required to ensure that there are several samples with each combination of values**. With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as Hughes phenomenon (named after Gordon F. Hughes).

Source for both the term and the quote: https://en.m.wikipedia.org/wiki/Curse_of_dimensionality
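To get a feel for the numbers in the quote, here is a minimal sketch. The helper `required_samples` and the choice of 3 values per feature and 5 samples per combination are hypothetical, picked only for illustration.

```python
def required_samples(n_features, values_per_feature, samples_per_combination):
    """Samples needed so that every combination of feature values is seen
    `samples_per_combination` times (hypothetical helper for illustration)."""
    n_combinations = values_per_feature ** n_features
    return n_combinations * samples_per_combination


# Assumed setup: each feature takes 3 distinct values and we want
# 5 samples for every combination of values.
for n_features in (2, 5, 10, 20):
    print(n_features, required_samples(n_features, 3, 5))
```

The required sample count grows exponentially with the number of features, which is why dropping uninformative features pays off so quickly.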

Finding a way to quickly identify features that don't offer predictive power lets us _iterate faster_ with _less data_, which in turn lets us experiment more.

## Some approaches to feature selection

* Removing features with _low variance_ (`VarianceThreshold`)
* Univariate feature selection (`SelectKBest`)
* Recursive feature elimination (`RFE`)
* Using a simple, sparser model to identify which features have predictive power (`SelectFromModel`)

## VarianceThreshold

This approach computes the variance of each feature and removes any feature whose variance falls below a threshold. Example: Let's say we're analyzing which bikes get rented at a bicycle shop. The number of seats on a bicycle is almost always `1`, so it doesn't help us decide anything.

The intuition is that a feature that barely changes from sample to sample can't tell samples apart, so it has little predictive power.

In [2]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], 
     [0, 1, 0], 
     [1, 0, 0], 
     [0, 1, 1], 
     [0, 1, 0], 
     [0, 1, 1]]
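# A boolean feature has variance p * (1 - p); this threshold drops features
# that take the same value in more than 80% of the samples.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))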
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
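To see why the first column was dropped, the following sketch recomputes the per-column variances by hand and compares them with the threshold. It assumes the same `X` as above; the variable names `p`, `variances`, and `kept` are introduced here only for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

threshold = 0.8 * (1 - 0.8)   # 0.16

# For a 0/1 feature, the variance is p * (1 - p), where p is the fraction of ones.
p = X.mean(axis=0)
variances = p * (1 - p)

sel = VarianceThreshold(threshold=threshold).fit(X)
kept = sel.get_support()      # boolean mask of the columns that survive

print(variances)              # per-column variance
print(kept)
```

The first column is `0` in five of six rows, so its variance (5/36, about 0.139) is below 0.16 and it is removed. The other two columns have variances of 2/9 and 1/4 and are kept.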

## SelectKBest
`SelectKBest` removes all but the $k$ highest-scoring features. "Highest scoring" is configurable using any function of `(X, y)` that returns a set of scores per feature.

Predefined scoring functions include `chi2` (the $\chi^2$ test, which measures how strongly each non-negative feature depends on the class label), `f_classif` (the ANOVA F-test, where ANOVA stands for analysis of variance), and mutual information metrics for both classification and regression problems (`mutual_info_classif`, `mutual_info_regression`). User-defined scoring functions can be passed as well.
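As a sketch of what a user-defined scoring function might look like, the example below scores each feature by the absolute Pearson correlation between that feature and the target. The function `abs_correlation` is hypothetical and not part of scikit-learn, and treating the iris class labels as numbers is a crude simplification made only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest


def abs_correlation(X, y):
    """Hypothetical scorer: |Pearson correlation| of each column with y.

    Returns a single array with one score per feature, which is one of the
    return shapes SelectKBest accepts from a score function.
    """
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])


X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=abs_correlation, k=2).fit(X, y)
X_new = selector.transform(X)

print(selector.scores_)   # one score per original feature
print(X_new.shape)
```

Any callable with the same signature can be swapped in, so the selection criterion can be tailored to the problem without changing the rest of the pipeline.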

In [3]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
chi2(X, y)  
#  Returns two arrays:  chi2 scores and p-values, each per _feature_.

(array([ 10.81782088,   3.59449902, 116.16984746,  67.24482759]),
 array([4.47651499e-03, 1.65754167e-01, 5.94344354e-26, 2.50017968e-15]))

In [4]:
X.shape

(150, 4)

In [5]:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape

(150, 2)

Note that each sample now has 2 features instead of 4: the two with the highest $\chi^2$ scores above (the third and fourth columns).
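`fit_transform` returns only the reduced array, so it doesn't say which columns survived. One way to recover that, sketched here under the assumption that we keep the fitted selector around (the names `selector`, `mask`, and `kept_names` are illustrative), is the selector's `get_support()` mask:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Keep a reference to the fitted selector instead of only its output.
selector = SelectKBest(chi2, k=2).fit(X, y)
mask = selector.get_support()   # True for each column that was kept

kept_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept_names)   # ['petal length (cm)', 'petal width (cm)']
```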

## RFE
Recursive feature elimination (RFE) takes an external estimator that assigns weights to features (e.g. the coefficients of a linear model) and selects features by recursively considering smaller and smaller sets of features.

It does this by training the external estimator and then dropping the least important features, as measured by the estimator's `.coef_` or `.feature_importances_` attribute.

It then repeats this procedure on the remaining features until the desired number of features is reached.
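The loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it assumes a `LogisticRegression` estimator on the iris data, drops one feature per round, and summarizes the per-class coefficients of each feature by their sum of squares, which is one reasonable choice among several.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
n_features_to_select = 2           # assumed target, chosen for illustration
remaining = list(range(X.shape[1]))  # column indices still in play

while len(remaining) > n_features_to_select:
    # 1. Train the estimator on the features that are still in play.
    est = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # 2. Score each feature; coef_ has shape (n_classes, n_features) here,
    #    so collapse the class axis with a sum of squares.
    importance = (est.coef_ ** 2).sum(axis=0)
    # 3. Drop the least important feature and repeat.
    weakest = int(np.argmin(importance))
    remaining.pop(weakest)

print(remaining)   # indices of the surviving columns
```

In practice `RFE` handles this bookkeeping and also records the order in which features were eliminated.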

In [6]:
%matplotlib inline

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

print(rfe.ranking_)

[64 50 31 23 10 17 34 51 57 37 30 43 14 32 44 52 54 41 19 15 28  8 39 53
 55 45  9 18 20 38  1 59 63 42 25 35 29 16  2 62 61 40  5 11 13  6  4 58
 56 47 26 36 24  3 22 48 60 49  7 27 33 21 12 46]
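The flat ranking is easier to read when laid out on the 8x8 pixel grid of a digit image. The sketch below refits the same `RFE` setup so that it runs on its own, then prints the grid and the positions of the ten best-ranked pixels; the cutoff of ten and the names `ranking_grid` and `top10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Same setup as above: eliminate one pixel at a time until one is left,
# so every pixel ends up with its own rank (1 = most important).
rfe = RFE(estimator=SVC(kernel="linear", C=1), n_features_to_select=1, step=1)
rfe.fit(X, y)

# Lay the 64 ranks out in the shape of a digit image.
ranking_grid = rfe.ranking_.reshape(digits.images[0].shape)
print(ranking_grid)

# (row, column) positions of the ten best-ranked pixels.
top10 = np.argwhere(ranking_grid <= 10)
print(top10)
```

The same `ranking_grid` array could also be passed to `plt.matshow` to view the ranking as an image.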


## SelectFromModel

`SelectFromModel` is similar to `RFE`, but instead of fixing how many features to retain and removing features until we reach that number, `SelectFromModel` keeps every feature whose `coef_` or `feature_importances_` value meets a threshold.

In [7]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape) 

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape

(150, 4)


(150, 3)
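The threshold can also be set explicitly. As a sketch, the example below assumes a `RandomForestClassifier` (so the scores come from `feature_importances_`) and passes `threshold="median"`, which keeps the features whose importance is at least the median importance. The estimator, its settings, and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Assumed estimator for illustration; any estimator exposing coef_ or
# feature_importances_ after fitting could be used instead.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median").fit(X, y)

importances = selector.estimator_.feature_importances_
X_new = selector.transform(X)

print(importances)          # one importance per original feature
print(selector.threshold_)  # the numeric threshold that was applied
print(X_new.shape)
```

Unlike `RFE`, the number of surviving features is not fixed in advance; it depends on how many scores clear the threshold.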

In all of these cases, we're selecting which features _from the original set_ to keep. We didn't compute any new features; we simply _selected_ from the available set.

An alternative approach called _dimensionality reduction_ lets us synthesize new features from the existing features.  We'll discuss it next.