<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Feature selection & cross validation </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>February 29, 2018</b></p>

<hr style="height:5px;border:none" />

# 1. Feature selection
<hr style="height:1px;border:none" />

We talked about the curse of dimensionality in our previous class. To get around it, we can reduce the dimensionality of the data (e.g., with PCA). Another approach is to eliminate features that are not associated with the target and to retain only those features that likely contribute to the classification of a data set, a process known as **feature selection**. There are a number of approaches to feature selection. The ones I present here are based on statistical principles and may be familiar to most of you.

## Example: cryotherapy data

To demonstrate feature selection, we will examine the cryotherapy data again (**`Cryotherapy.csv`**). As you recall, there are 6 features in this data set, of which two are categorical (**`Sex`** and **`Type`**) and four are continuous (**`Age`**, **`Time`**, **`NumWarts`**, and **`Area`**). Here, we load the data and separate categorical and continuous features. 

`<CryoFeatures.py>`

In [2]:
%matplotlib inline

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2, f_classif

# loading the data
CryoData = pd.read_csv('Cryotherapy.csv')

# features, categorical and continuous
xCat = CryoData[['Sex','Type']]
xCont = CryoData[['Age','Time','NumWarts','Area']]
y = CryoData.Success

### Categorical features

The association between a categorical feature and the target (a categorical variable) can be assessed by a $\chi^2$ test. The function **`chi2`** in **`sklearn.feature_selection`** performs a $\chi^2$ test between each feature and the target. The `chi2` function requires two input parameters: the feature data array and the target labels. It returns two arrays: the first contains the $\chi^2$ test statistics, and the second contains the corresponding p-values, with one entry per feature.

In [3]:
# categorical features
chiStat, chiP = chi2(xCat,y)
print(chiP)

[0.73684818 0.00149219]


Here, you can see that the feature `Type` is significantly associated with the target (p ≈ 0.0015), whereas `Sex` is not (p ≈ 0.74).
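To see what `chi2` responds to, here is a minimal sketch on made-up data (not the cryotherapy data). The arrays `assoc` and `noise` are hypothetical binary features, constructed so that the first one tracks the target and the second one does not.

```python
import numpy as np
from sklearn.feature_selection import chi2

# hypothetical target: 20 observations in each of two classes
y_toy = np.array([0] * 20 + [1] * 20)

# feature associated with the target: mostly 0 in class 0, mostly 1 in class 1
assoc = np.array([1] * 4 + [0] * 16 + [1] * 16 + [0] * 4)
# feature unrelated to the target: the same 0/1 pattern in both classes
noise = np.tile([0, 1], 20)

X_toy = np.column_stack([assoc, noise])
toyStat, toyP = chi2(X_toy, y_toy)
print(toyStat)   # one test statistic per feature
print(toyP)      # one p-value per feature
```

The first p-value should come out small and the second close to 1, mirroring the `Type` versus `Sex` pattern in the output above.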

### Continuous features

The association between a continuous feature and the target (a categorical variable) can be assessed by an ANOVA. In particular, an ANOVA F-test examines whether the mean of the feature of interest differs between target classes. The function **`f_classif`** in **`sklearn.feature_selection`** performs an ANOVA F-test between each feature and the target. The `f_classif` function requires two input parameters: the feature data array and the target labels. It returns two arrays: the first contains the ANOVA F-test statistics, and the second contains the corresponding p-values, with one entry per feature.


In [5]:
# continuous features
fStat, fP = f_classif(xCont,y)
print(fP)

[3.26472884e-08 2.72305388e-12 4.63372617e-01 7.45913301e-02]


So it looks like only `Age` and `Time` are significantly associated with the target; `Area` (p ≈ 0.075) and `NumWarts` (p ≈ 0.46) are not.
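The two sets of p-values can be turned into a list of retained features with a simple threshold. The sketch below reuses the p-values printed above. The cutoff `alpha = 0.05` is an assumption (the notes do not state a threshold), and no correction for multiple tests is applied.

```python
import numpy as np

# feature names and p-values copied from the outputs printed above
names = np.array(['Sex', 'Type', 'Age', 'Time', 'NumWarts', 'Area'])
pvals = np.array([0.73684818, 0.00149219,
                  3.26472884e-08, 2.72305388e-12,
                  4.63372617e-01, 7.45913301e-02])

alpha = 0.05  # assumed significance threshold, not given in the notes
selected = names[pvals < alpha]   # keep features with p-values below the cutoff
print(selected)
```

With this cutoff, `Type`, `Age`, and `Time` are retained.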

# 2. Cross validation
<hr style="height:1px;border:none" />

## What is cross validation?

We have used a training data set to generate a classifier and a testing data set to evaluate the performance of the resulting classifier. But how can we be sure that the classification results are consistent regardless of which training and testing data sets are used? One way to verify this is to generate multiple training and testing data sets and evaluate classification performance multiple times. **Cross validation** is one such approach. In a **k-fold** cross validation, the data set is divided into k segments of equal size. In the first iteration, the first of the k segments is used as the testing data set, while the remaining k-1 segments are used as the training data set. In the second iteration, the second segment is used as the testing data set, and so on. Here is a schematic of 5-fold cross validation.

<img style="width: 500px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/CV_5fold.png?raw=true" alt="5-fold cross validation"/>

As you can see, a k-fold cross validation lets us evaluate the classification performance k times, each time on a different testing data set.
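The splitting scheme described above can be made concrete with the `KFold` class in `sklearn.model_selection`. This is only a sketch on 10 hypothetical observations (`X_toy` is made up); it prints which observation indices land in the training and testing sets in each iteration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical observations with a single feature
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)   # 5 folds, taken in order (no shuffling)
test_folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    print('Iteration', i + 1, '- train:', train_idx, 'test:', test_idx)
    test_folds.append(test_idx)
```

Each observation appears in exactly one testing set. Note that when `cross_val_score` is given an integer `cv` and a classifier, it uses stratified folds (`StratifiedKFold`), which additionally keep the class proportions similar across folds.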

## Example: iris data

Let's perform a 5-fold cross validation on the iris data. 


`<IrisCV.py>`

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


# Loading the iris data
iris = datasets.load_iris()
X = iris.data[:,[0,3]]  # sepal length and petal width only
y = iris.target

First, we need to define a classifier object to be examined by the cross validation. Here, we use a k nearest neighbor (kNN) classifier.

In [2]:
# defining the nearest neighbor classifier
kNN = KNeighborsClassifier(5, weights='uniform')

As for the actual cross validation, we can use the **`cross_val_score`** function in **`sklearn.model_selection`**. In `cross_val_score`, we need to provide the classifier object as an input parameter, as well as the data matrix for the features and the target variable. The number of *folds* can be specified by the parameter **`cv`**. Then `cross_val_score` splits the data into k folds and performs a classifier analysis (building and evaluating a classifier) k times automatically. For a classifier, the results are returned as **accuracy** scores, one per fold. The **accuracy** is defined as the proportion of observations correctly classified out of all observations in the testing data. In terms of a confusion matrix, it is the total number of observations along the main diagonal divided by the total number of observations in the testing data.

In [4]:
# 5-fold cross validation
scores = cross_val_score(kNN, X, y, cv=5)
print(scores)
print(scores.mean())

[0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
0.9533333333333334


As you can see, the performance of this classifier seems consistent regardless of training & testing data. 
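The definition of accuracy given above can be checked by hand. The sketch below uses made-up true and predicted labels (`y_true` and `y_pred` are hypothetical, not output from the iris classifier) and compares the diagonal-based calculation with `accuracy_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true and predicted labels for 8 testing observations
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)
# observations on the main diagonal divided by all observations
acc_from_cm = np.trace(cm) / cm.sum()
acc_sklearn = accuracy_score(y_true, y_pred)
print(cm)
print(acc_from_cm, acc_sklearn)
```

Six of the eight observations are classified correctly, so both calculations give 0.75.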

## Example: cryotherapy data

We also perform a 5-fold cross validation on the cryotherapy data. Here, we only focus on two continuous features, namely **`Age`** and **`Time`**.

`<CryoCV.py>`

In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


# loading the data
CryoData = pd.read_csv('Cryotherapy.csv')

# Creating the data set
X = np.array(CryoData.loc[:,['Age','Time']])
y = np.array(CryoData.Success)
targetNames = ['Failure', 'Success']

Classification on this data set is somewhat tricky because the features have to be standardized before the analysis. The standardization must be estimated anew from each training set, and the same transformation must then be applied to the corresponding testing data set. To implement this, we combine the standardization transformation object and the classifier object into a single object using the function **`make_pipeline`** in **`sklearn.pipeline`**.

In [6]:
# A pipeline of standardization and kNN classifier
kNN = make_pipeline(StandardScaler(), 
                    KNeighborsClassifier(15, weights='uniform'))



Here, a `StandardScaler` transformation object is defined first, followed by the kNN classifier with k=15 and **`weights='uniform'`**. The resulting pipeline object can be passed to **`cross_val_score`** (in `sklearn.model_selection`) in place of a plain classifier.

In [7]:
# 5-fold cross validation
scores = cross_val_score(kNN, X, y, cv=5)
print(scores)

[0.94736842 0.84210526 1.         0.82352941 0.88235294]


As you can see, the accuracy scores vary noticeably across folds, ranging from about 0.82 to 1.00.
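To make explicit what the pipeline does inside each fold, here is a sketch that performs the same steps by hand and compares the result with `cross_val_score`. It uses the iris data rather than the cryotherapy data so that it runs without the CSV file; the names `X_demo`, `y_demo`, and `folds` are introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_demo = iris.data[:, [0, 3]]   # sepal length and petal width only
y_demo = iris.target

folds = StratifiedKFold(n_splits=5)   # the same splits are used both ways

# by hand: standardize with training-set statistics only
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    clf = KNeighborsClassifier(15, weights='uniform')
    clf.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # the testing data are transformed with the training-set mean and SD
    manual_scores.append(clf.score(scaler.transform(X_demo[test_idx]),
                                   y_demo[test_idx]))

# with a pipeline: the same steps, handled by cross_val_score
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(15, weights='uniform'))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=folds)

print(np.array(manual_scores))
print(pipe_scores)
```

The two sets of scores should match, since the pipeline refits the scaler on each training set and reuses it on the corresponding testing set.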

### Exercise
1. **`Seed data, cross validation`**. For this exercise, features and target labels can be found in **`SeedCV.py`**. Perform a 5-fold cross validation using a kNN classifier with k=15. Print out the scores resulting from this classifier.

# 3. Choosing parameters
<hr style="height:1px;border:none" />

