<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Feature selection & cross validation </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>February 29, 2018</b></p>

<hr style="height:5px;border:none" />

# 1. Feature selection
<hr style="height:1px;border:none" />

We talked about the curse of dimensionality in our previous class. One way around it is to reduce the dimensionality of the data (e.g., with PCA). Another approach is to eliminate features that are not associated with the target and to retain only those features that are likely to contribute to the classification of a data set, a process known as **feature selection**. There are a number of approaches to feature selection. The ones I present here are based on statistical principles and may be familiar to most of you.

## Example: cryotherapy data

To demonstrate feature selection, we will examine the cryotherapy data again (**`Cryotherapy.csv`**). As you recall, there are six features in this data set, of which two are categorical (**`Sex`** and **`Type`**) and four are continuous (**`Age`**, **`Time`**, **`NumWarts`**, and **`Area`**). Here, we load the data and separate the categorical and continuous features.

`<CryoFeatures.py>`

In [2]:
%matplotlib inline

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2, f_classif

# loading the data
CryoData = pd.read_csv('Cryotherapy.csv')

# features, categorical and continuous
xCat = CryoData[['Sex','Type']]
xCont = CryoData[['Age','Time','NumWarts','Area']]
y = CryoData.Success

### Categorical features

The association between a categorical feature and the target (also a categorical variable) can be assessed with a $\chi^2$ test. The function **`chi2`** in **`sklearn.feature_selection`** performs a $\chi^2$ test between each feature and the target. It takes two inputs, the feature data array and the target labels, and returns two arrays: the first contains the $\chi^2$ test statistics and the second contains the corresponding p-values, with one entry per feature. Note that `chi2` requires non-negative feature values.

In [3]:
# categorical features
chiStat, chiP = chi2(xCat,y)
print(chiP)

[0.73684818 0.00149219]


Here, you can see that the feature `Type` is significantly associated with the target (p ≈ 0.0015), whereas `Sex` is not (p ≈ 0.74).
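To see how `chi2` behaves when the answer is known in advance, here is a minimal sketch on a small synthetic data set rather than the cryotherapy data. The arrays `related` and `unrelated` are hypothetical features made up for this illustration: the first is an exact copy of the target, and the second alternates between 0 and 1 so that it is perfectly balanced within each class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# synthetic target: 20 observations in each of two classes (made-up data)
y = np.repeat([0, 1], 20)

# 'related' copies the target; 'unrelated' alternates 0/1 within each class
related = y.copy()
unrelated = np.tile([0, 1], 20)
X = np.column_stack([related, unrelated])

# chi-square test between each feature (column) and the target
chiStat, chiP = chi2(X, y)
for name, stat, p in zip(['related', 'unrelated'], chiStat, chiP):
    print(f'{name}: chi2 = {stat:.2f}, p = {p:.4g}')
```

The feature that tracks the target should come out with a very small p-value, while the balanced feature should show no evidence of association.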

### Continuous features

The association between a continuous feature and the target (a categorical variable) can be assessed with an ANOVA. In particular, an ANOVA F-test examines whether the mean of the feature of interest differs between target classes. The function **`f_classif`** in **`sklearn.feature_selection`** performs an ANOVA F-test between each feature and the target. It takes two inputs, the feature data array and the target labels, and returns two arrays: the first contains the ANOVA F-test statistics and the second contains the corresponding p-values, with one entry per feature.
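As a reminder of what this test computes, the one-way ANOVA F statistic for a single feature can be written as follows. The notation is introduced here only for illustration: $K$ is the number of target classes, $n_k$ the number of observations in class $k$, $N$ the total number of observations, $x_{ki}$ the $i$-th observation in class $k$, $\bar{x}_k$ the mean of class $k$, and $\bar{x}$ the grand mean.

$$F = \frac{\sum_{k=1}^{K} n_k \left(\bar{x}_k - \bar{x}\right)^2 / (K-1)}{\sum_{k=1}^{K} \sum_{i=1}^{n_k} \left(x_{ki} - \bar{x}_k\right)^2 / (N-K)}$$

The numerator measures how far the class means are from the grand mean, and the denominator measures the spread of observations within each class. A large $F$ therefore indicates that the class means differ by more than the within-class variability would suggest. The p-value is obtained from the F distribution with $K-1$ and $N-K$ degrees of freedom.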


In [5]:
# continuous features
fStat, fP = f_classif(xCont,y)
print(fP)

[3.26472884e-08 2.72305388e-12 4.63372617e-01 7.45913301e-02]


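So it looks like only `Age` and `Time` are significantly associated with the target at the 0.05 level. `Area` falls just short (p ≈ 0.075), and `NumWarts` shows no evidence of association (p ≈ 0.46).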
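One simple way to turn p-values like these into a selected feature set is to keep the features whose p-value falls below a chosen threshold. The sketch below does this on synthetic data, not the cryotherapy data. The feature names `bigShift`, `smallShift`, and `noShift` and the threshold `alpha = 0.05` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# synthetic target: 20 observations in each of two classes (made-up data)
y = np.repeat([0, 1], 20)

# identical within-class spread for both classes
spread = np.tile(np.linspace(-1, 1, 20), 2)

# three hypothetical continuous features with different class-mean shifts
X = np.column_stack([
    spread + 3.0 * y,   # large mean difference between classes
    spread + 1.5 * y,   # smaller mean difference between classes
    spread,             # no mean difference between classes
])
names = np.array(['bigShift', 'smallShift', 'noShift'])

# ANOVA F-test between each feature (column) and the target
fStat, fP = f_classif(X, y)

# keep only the features whose p-value is below the threshold
alpha = 0.05
selected = names[fP < alpha]
print(selected)
```

Thresholding p-values is only one possible rule. scikit-learn also provides `SelectKBest`, which keeps a fixed number of top-scoring features instead.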

# 2. Cross validation
<hr style="height:1px;border:none" />



# 3. Choosing parameters
<hr style="height:1px;border:none" />

