# Lab 5 - Feature selection

In the previous lab we introduced the concept of _data dimensionality_, that is, the number of variables, or _features_, used to represent our data. Think of features as... well, features: the measurable characteristics of each data point.

For instance, in a gene expression microarray dataset from a case-control association study, the input data $X$ will be an $n \times d$ matrix where each _row_ is a vector of length $d$ containing the expression levels of $d$ genes for a single individual.

In some fields, such as the analysis of biomolecular data, $d$ may be on the order of tens or hundreds of thousands, or even millions. The point is, these features are not all equally important. On the contrary, it is safe to assume that _only a small subset_ of the available features is actually _relevant_, that is, plays a role in the relationship between input and output.

For this reason, a crucial step is to _reduce the dimensionality of the data_ by identifying which are the relevant variables.

Main concepts:

 * Data dimensionality
 * Feature selection
 




## Taxonomy of variable selection methods

Roughly, feature selection methods are divided into three groups, based on how they select the features:

 * **Filter methods**: Select subsets of variables as a pre-processing step, independently of the chosen learning machine. Example: univariate statistical tests (t-test, etc.).
 * **Wrapper methods**: Use the learning machine of interest as a black box to score subsets of variables according to their predictive power. Example: Recursive Feature Elimination (RFE).
 * **Embedded methods**: Perform variable selection _as part_ of the training process. Examples: Lasso, Elastic Net.
 
In this tutorial we will see different kinds of feature selection algorithms. However, because of how they are used in the learning pipeline, they will all _behave_ like filter methods.
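
To make the three families above more concrete, here is a minimal sketch of one representative of each, as exposed by `scikit-learn`. The toy dataset, the value of `k`, and the choice of an L1-penalized linear SVM as a stand-in for Lasso in a classification setting are all assumptions made for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.svm import LinearSVC

# Toy data (sizes are illustrative assumptions, not the lab's dataset)
X_toy, y_toy = make_classification(n_samples=100, n_features=200,
                                   n_informative=10, n_redundant=0,
                                   shuffle=False, random_state=0)
k = 10  # number of features to keep (assumed)

# Filter: rank features with a univariate ANOVA F-test, no classifier involved
filter_sel = SelectKBest(score_func=f_classif, k=k).fit(X_toy, y_toy)

# Wrapper: RFE repeatedly fits the classifier and drops the weakest features
wrapper_sel = RFE(LinearSVC(max_iter=10000),
                  n_features_to_select=k, step=0.1).fit(X_toy, y_toy)

# Embedded: the L1 penalty drives some coefficients to zero during training
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000)
embedded_sel = SelectFromModel(l1_svm).fit(X_toy, y_toy)

for name, sel in [("filter", filter_sel),
                  ("wrapper", wrapper_sel),
                  ("embedded", embedded_sel)]:
    mask = sel.get_support()  # boolean mask over the 200 columns
    print("{}: {} features kept".format(name, mask.sum()))
```

All three objects expose the same `fit` / `get_support` / `transform` interface, which is what allows them to be dropped into the same position of a learning pipeline.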

## Imports and setup

The usual stuff:

 * Magic command `%matplotlib inline` so that plots are displayed correctly in the notebook.
 * `matplotlib` followed by `seaborn`, in order to have fancy plots.
 * `numpy` _et similia_ for number crunching.
 
Additional libraries will be imported whenever they are needed.

In [1]:
%matplotlib inline

import matplotlib
from matplotlib import pyplot as plt

import seaborn

import numpy as np

## Dataset generation

We generate a synthetic classification dataset again using the `make_classification` function from the `scikit-learn` library. However, this time we choose higher values for $d$, and also set the number of _informative_ (i.e., relevant) variables to be much smaller than $d$.

Notice that from now on we will no longer be able to plot the datasets, since the data dimensionality is too high.

<img style="float: left;" src="warning.png" width="20px"> &nbsp; **Warning**: by setting the argument `shuffle=False` we force the first `d_informative` columns of the data matrix to be the ones corresponding to the relevant variables.
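
A side effect of `shuffle=False` is that the rows are not shuffled either, so the samples come out grouped by class. The sketch below illustrates this on a small stand-in dataset; the sizes, the `random_state` values and `flip_y=0.0` (used here so that no label is randomly flipped) are assumptions chosen for illustration. This is why the rows are reshuffled manually before splitting the data in half.

```python
import numpy as np
from sklearn.datasets import make_classification

# Small stand-in dataset; flip_y=0.0 so that no label is flipped at random
X_demo, y_demo = make_classification(n_samples=100, n_features=50,
                                     n_informative=5, n_redundant=0,
                                     n_repeated=0, n_classes=2,
                                     n_clusters_per_class=1, flip_y=0.0,
                                     shuffle=False, random_state=0)

# Count the samples of class 1 in each half, before any shuffling
print("Class 1 in first half: {}".format(y_demo[:50].sum()))
print("Class 1 in second half: {}".format(y_demo[50:].sum()))

# Permute the rows only: the informative columns stay in the first positions
rng = np.random.RandomState(0)
idx = rng.permutation(len(y_demo))
X_shuffled, y_shuffled = X_demo[idx], y_demo[idx]
print("Class 1 in first half after shuffling: {}".format(y_shuffled[:50].sum()))
```

Without the row shuffle, a half/half split would put a single class in the training set.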


In [2]:
from sklearn.datasets import make_classification

### Set the number of samples and dimensions
n, d = 100, 5000
d_informative = 10

np.random.seed(2)

X, y = make_classification(n_samples=n, 
                           n_features=d, 
                           n_informative=d_informative, 
                           n_redundant=0, 
                           n_repeated=0, 
                           n_classes=2, 
                           n_clusters_per_class=1, 
                           flip_y=0.02, 
                           class_sep=1.0, 
                           shuffle=False,
                           )

### shuffle=False also leaves the samples ordered by class, so we shuffle the rows manually
idx = np.arange(n)
np.random.shuffle(idx)
X = X[idx, :]
y = y[idx]

# X = X[:, :d_informative]

Now split the dataset in half (first half for training, second half for testing) and fit a linear SVM (use `sklearn.svm.LinearSVC`) on the training set, using K-fold cross-validation (KCV) to select the best model.
Then predict the labels for the test set and compute the prediction accuracy.

<img style="float: left;" src="info.png" width="20px"> &nbsp; **Hint**: I included a "suggested" range for the `C` parameter of the SVM.
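
As a reference, here is one possible way to structure this step; it is a sketch, not the official solution. It runs on its own stand-in data (with a smaller `d` than the lab's, to keep it quick), and the names `X_s`, `y_s`, the `random_state` values and `cv=5` are assumptions, so the resulting accuracy will not match the one reported below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Stand-in data: same settings as the lab, but with a smaller d
n, d, d_informative = 100, 1000, 10
X_s, y_s = make_classification(n_samples=n, n_features=d,
                               n_informative=d_informative, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1, flip_y=0.02,
                               class_sep=1.0, shuffle=False, random_state=2)
idx = np.random.RandomState(2).permutation(n)
X_s, y_s = X_s[idx], y_s[idx]

# First half for training, second half for testing
half = n // 2
X_train, y_train = X_s[:half], y_s[:half]
X_test, y_test = X_s[half:], y_s[half:]

C_range = np.logspace(-4, 1, 20)

# K-fold cross-validation over C (the number of folds is an assumption)
grid_search = GridSearchCV(LinearSVC(), param_grid={"C": C_range}, cv=5)
grid_search.fit(X_train, y_train)

# By default GridSearchCV refits the best model on the whole training set
y_pred = grid_search.predict(X_test)
acc_score = accuracy_score(y_test, y_pred)

print("Best C: {}".format(grid_search.best_params_["C"]))
print("Accuracy score: {}".format(acc_score))
```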

In [3]:
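from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

C_range = np.logspace(-4, 1, 20)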

### BEGIN STUDENTS ###

### END STUDENTS ###

acc_score = accuracy_score(y_test, y_pred)

print("Accuracy score: {}".format(acc_score))

Accuracy score: 0.64


Now reduce the dataset to the relevant variables only. They are the first `d_informative` columns of $X$, so slice the matrix using the `:` operator as seen before.

Of course we can do this only with a synthetic dataset where we know in advance _which_ are the relevant variables.
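
As a quick reminder of the slicing syntax, the sketch below keeps all the rows and the first `d_informative` columns of a placeholder matrix. The random matrix `X_full` is only a stand-in (an assumption for illustration) for the dataset generated earlier.

```python
import numpy as np

n, d, d_informative = 100, 5000, 10

# Placeholder matrix standing in for the lab's data
rng = np.random.RandomState(0)
X_full = rng.randn(n, d)

# All rows, first d_informative columns
X_relevant = X_full[:, :d_informative]

print("{} -> {}".format(X_full.shape, X_relevant.shape))
```

Basic slicing returns a view on the original array rather than a copy, so this step costs no extra memory. The rest of the pipeline (split, grid search, prediction) stays the same.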

In [4]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score

C_range = np.logspace(-4,1, 20)

### BEGIN STUDENTS ###

### END STUDENTS ###

acc_score = accuracy_score(y_test, y_pred)

print("Accuracy score: {}".format(acc_score))

Accuracy score: 0.88


You should notice that the accuracy has greatly improved, as the noise introduced by those extra, useless variables is no longer in the dataset.

However, in real-world scenarios we do not know in advance which variables to keep and which ones to throw away.

### Enter variable selection!

Now repeat the experiment above, this time adding a variable selection step to the mix.

Use Recursive Feature Elimination (RFE, docs [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)) from `sklearn.feature_selection`, with a Linear SVM classifier (the same one as before) as estimator.

<img style="float: left;" src="info.png" width="20px"> &nbsp; **Hint**: Use the suggested step `0.1`, otherwise the feature selection process will take too long.

<img style="float: left;" src="warning.png" width="20px"> &nbsp; **Warning**: Because of how RFE is implemented, parameters must be passed to the grid search object in a slightly different way. Here is an example:

```python
...
grid_search = GridSearchCV(
                        rfe_est, 
                        param_grid={
                            'n_features_to_select':[5, 10, 20], 
                            'estimator__C':C_range # <- look here!
                        }, 
                        cv=4)
...
```

Notice how the values for the `C` parameter of the inner estimator have to be passed by prepending `estimator__` to the key in the dictionary.
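
For reference, here is a complete sketch of how the pieces could fit together. It is one possible arrangement, not the official solution. To keep the running time short it uses a reduced-size stand-in dataset (`d = 200`) and a coarser grid `C_small`; both, along with the variable names and `random_state` values, are assumptions made for illustration, so the accuracy will differ from the one reported below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Reduced-size stand-in data (d=200 instead of 5000) so that this runs quickly
n, d, d_informative = 100, 200, 10
X_r, y_r = make_classification(n_samples=n, n_features=d,
                               n_informative=d_informative, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1, flip_y=0.02,
                               class_sep=1.0, shuffle=False, random_state=2)
idx = np.random.RandomState(2).permutation(n)
X_r, y_r = X_r[idx], y_r[idx]

half = n // 2
X_train, y_train = X_r[:half], y_r[:half]
X_test, y_test = X_r[half:], y_r[half:]

C_small = np.logspace(-4, 1, 5)  # coarser than the suggested C_range

# step=0.1 removes 10% of the original features at each RFE iteration
rfe_est = RFE(LinearSVC(max_iter=5000), step=0.1)

grid_search = GridSearchCV(
    rfe_est,
    param_grid={
        "n_features_to_select": [5, 10, 20],
        "estimator__C": C_small,  # parameter of the inner LinearSVC
    },
    cv=4)
grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
acc_score = accuracy_score(y_test, y_pred)

# Column indices retained by the best RFE model
selected = np.flatnonzero(grid_search.best_estimator_.support_)
print("Best parameters: {}".format(grid_search.best_params_))
print("Selected columns: {}".format(selected))
print("Accuracy score: {}".format(acc_score))
```

Since `shuffle=False` keeps the informative variables in the first `d_informative` columns, inspecting `selected` gives a quick check of how many truly relevant variables were recovered.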



In [5]:
from sklearn.feature_selection import RFE

### BEGIN STUDENTS ###


### END STUDENTS ###

acc_score = accuracy_score(y_test, y_pred)

print("Accuracy score: {}".format(acc_score))

Accuracy score: 0.88
