# Introduction to Feature Selection

Feature selection is the process, used in machine learning and statistics, of selecting a subset of the available features to use in modeling and implementation. Conventional wisdom says that the more features and data we have, the better the model we can create. However, there are several reasons why we might choose not to use the full feature set.

First, by selecting a subset, models become simpler and easier for analysts or researchers to interpret. Along the same lines, if there is a cost associated with collecting data (e.g. acquiring and setting up sensors, or running surveys), you would only want to collect data for the features that best predict the output.

Second, reducing the number of inputs to the model can reduce the time and computational resources needed to train it. Additionally, some models (like ordinary least squares linear regression) have no unique solution when you have more features than examples.

Finally, feature selection is important in creating models that generalize from training data to testing data. By combining feature selection with cross validation, we can estimate the right level of complexity for the model and choose the number of features accordingly.
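
As a rough sketch of that idea, the snippet below scores every possible number of features with 5-fold cross validation and keeps the best one. The data is synthetic (generated with make_classification), and the choice of SelectKBest with f_classif as the selector is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the wine data used below)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Selection lives inside the pipeline so it is refit on each training fold,
# which keeps the validation fold out of the selection step
cv_scores = {}
for k in range(1, X.shape[1] + 1):
    model = make_pipeline(SelectKBest(f_classif, k=k), SVC(kernel="linear"))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best number of features by 5-fold CV: %d" % best_k)
```

Putting the selector inside the pipeline matters: selecting features on the full data first and cross validating afterwards would leak information from the validation folds.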

Now that we know <b>why</b> we need feature selection, let's focus on the <b>how</b>. We will cover three algorithms from the sklearn package and write a simple function of our own for the fourth method:
<li> Variance Threshold </li>
<li> Univariate Selection </li>
<li> Recursive Feature Elimination </li>
<li> Greedy Forward Selection </li>

To complete the tutorial, we will use data from the Center for Machine Learning and Intelligent Systems to predict wine quality, with 0 being a low quality wine and 1 being a high quality wine. The available features in this set are: total sulfur dioxide, free sulfur dioxide, fixed acidity, residual sugar, alcohol, citric acid, volatile acidity, sulphates, pH, chlorides, and density.

Below we import the necessary packages, load the dataset, and separate it into training and test sets.


In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import svm

###randomly separate test and train
np.random.seed(12345)
wine_data = pd.read_csv('winequality-red2.csv')
msk = np.random.rand(len(wine_data)) < 0.7
train = wine_data[msk]
test =  wine_data[~msk]

###this is the outcome variable from the CSV (0 and 1)
outcome_train = train["outcome-good?"]
outcome_test = test["outcome-good?"]

##create training set (features only: drop the raw quality score and the label)
input_train = train.drop(["quality", "outcome-good?"], axis=1)

##create test set the same way
input_test = test.drop(["quality", "outcome-good?"], axis=1)

##get number of features
print("Full Set of Features :  {}".format(input_train.shape[1]))
print(input_train.columns.values)


Full Set of Features :  11
['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
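
The random mask above gives a split of roughly 70/30. As an alternative sketch, sklearn's train_test_split fixes the split size and can keep the class balance the same in both parts. The small data frame here is a hypothetical stand-in whose column names only mimic the wine CSV.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame; the values are random, not real wine data
rng = np.random.RandomState(12345)
toy = pd.DataFrame({
    "alcohol": rng.rand(100),
    "pH": rng.rand(100),
    "outcome-good?": np.repeat([0, 1], 50),
})

features = toy.drop("outcome-good?", axis=1)
labels = toy["outcome-good?"]

# test_size sets the share of rows held out; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=12345)

print("train rows: %d, test rows: %d" % (len(X_train), len(X_test)))
```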


Although this tutorial is not about classification, we do need some means of comparing the selected features using a model's performance. For this, we will use sklearn's SVM classifier with a linear kernel, and use accuracy to assess the quality of the features selected. Other performance metrics like AUC, precision, or recall could also be used here.

In [2]:
def classify(train_data, data_output):
    ##fit a linear-kernel SVM on the given features and labels
    clf = sklearn.svm.SVC(kernel = "linear")
    clf.fit(train_data, data_output)
    return clf

def evaluate_classifier(classifier, X_validation, y_validation):
    ##accuracy = fraction of predictions that match the true labels
    predicted_values = classifier.predict(X_validation)
    return float(np.mean(predicted_values == np.asarray(y_validation)))
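
For comparison, the other metrics mentioned above are available in sklearn.metrics. The sketch below uses made-up labels, predictions, and decision scores purely for illustration. Note that AUC is computed from scores (for example the output of a classifier's decision_function), not from hard 0/1 predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels, predictions, and decision scores, made up for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [-1.2, 0.3, 0.8, 1.5, -0.1, -0.7, 2.0, -0.4]

accuracy = accuracy_score(y_true, y_pred)    # fraction of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)         # ranking quality of the scores

print("accuracy=%.4f precision=%.4f recall=%.4f auc=%.4f"
      % (accuracy, precision, recall, auc))
```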

Using the train and test data sets that we created above, we can start by fitting a model with all of the features. This will serve as a baseline to compare each of the algorithms with.

Using the 11 features, we get ~75% accuracy when determining whether a wine is good enough. For reference, the classes are approximately balanced, so a default classifier would score about 50%, and this represents a ~50% relative lift.

In [3]:
clf_full = classify(input_train, outcome_train)

print("train accuracy :  {}".format(evaluate_classifier(clf_full, input_train, outcome_train)))
print("test accuracy :  {}".format(evaluate_classifier(clf_full, input_test, outcome_test)))

train accuracy :  0.757668711656
test accuracy :  0.777777777778
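
To make the "default classifier" comparison concrete, here is a minimal sketch using sklearn's DummyClassifier on hypothetical, perfectly balanced labels. The 0.75 model accuracy is a rounded value taken from the text above, and the lift is computed relative to the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical, perfectly balanced labels; the dummy model ignores the features
y = np.array([0, 1] * 50)
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

# Relative lift of a ~75% accurate model over that baseline
model_accuracy = 0.75
lift = (model_accuracy - baseline_accuracy) / baseline_accuracy
print("baseline accuracy: %.2f, relative lift: %.0f%%" % (baseline_accuracy, 100 * lift))
```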


## Removing Features with Low Variance

One of the simplest feature selection techniques within sklearn is VarianceThreshold, which removes features with low variance. Intuitively, this makes sense: if our output varies, then any input that has a relationship with the output must also vary.

Below, I have implemented this in two steps. First, I use the package just to view the variances. Then I select a cut-off threshold of 1.0, which leaves me with the top 5 features.


In [4]:
from sklearn.feature_selection import VarianceThreshold

## To start, I use a threshold of 0.0 so that I can just view all of the variances.
sel = VarianceThreshold(threshold=0.0)
sel.fit(input_train)

##view the variances in a *pretty* dataframe
df = pd.DataFrame()
df["features"]= list(input_train.columns.values)
df["variance"]= sel.variances_


df = df.sort_values(by ="variance", ascending= False)
df

| | features | variance |
|---|---|---|
| 6 | total sulfur dioxide | 1005.299714 |
| 5 | free sulfur dioxide | 114.102707 |
| 0 | fixed acidity | 2.611355 |
| 3 | residual sugar | 2.48365 |
| 10 | alcohol | 1.110811 |
| 2 | citric acid | 0.037475 |
| 1 | volatile acidity | 0.033972 |
| 9 | sulphates | 0.033288 |
| 8 | pH | 0.023422 |
| 4 | chlorides | 0.002583 |


In [5]:
## since I know I want the top five features, I can choose a threshold that cuts off everything below them.
sel = VarianceThreshold(threshold=1.0)
var_features = sel.fit_transform(input_train)
var_features_test = sel.transform(input_test)
var_feature_names = input_train.columns.values[sel.get_support(indices = True)]
print("number of features remaining =  {}".format(var_features.shape[1]))

## now plug the reduced var_features data frame into the classifier and report the accuracy
classifier = classify(var_features, outcome_train)
train_accuracy_var = evaluate_classifier(classifier, var_features, outcome_train)
test_accuracy_var = evaluate_classifier(classifier, var_features_test, outcome_test)
print("accuracy:  {}".format(train_accuracy_var))
print("test accuracy:  {}".format(test_accuracy_var))



number of features remaining =  5
accuracy:  0.717353198948
test accuracy:  0.736897274633


Given how simple this method is, it actually does a pretty good job selecting features that predict the outcome. 

For full disclosure, in practice you probably wouldn't see anyone using this method to select the k best features from a large feature set. However, if your reason for feature selection is computational resources, this can be a fast and inexpensive way to remove excess features before you begin additional feature selection or modeling.
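
One reason for that caution is that variance depends on the units a feature is measured in. The toy sketch below (hypothetical data, not the wine set) stores the same measurement in grams and in kilograms next to a constant column. With a threshold of 1.0, only the gram version survives, even though both versions carry identical information.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one measurement in two units, plus a constant column
grams = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([grams, grams / 1000.0, np.ones(4)])  # grams, kilograms, constant

sel = VarianceThreshold(threshold=1.0).fit(X)

# variances_ holds the per-column variance; get_support marks the kept columns
print("variances: %s" % (sel.variances_,))
print("kept columns: %s" % (sel.get_support(),))
```

Rescaling features to a common range before applying the threshold is one way to make the comparison fairer, although the right choice depends on the data.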

## Univariate Feature Selection

Univariate feature selection is another very simple but widely used technique for selecting features. It works by performing a univariate statistical test on each feature individually with respect to the output. The scoring function you pass to SelectKBest depends on the research question:
<li> <u> Classification </u> : f_classif, chi2, mutual_info_classif </li>
<li> <u> Regression </u> : f_regression, mutual_info_regression </li>

SelectFpr, SelectFdr, SelectFwe, and GenericUnivariateSelect are not scoring functions. They are alternative selector classes in sklearn that take the same scoring functions but apply a different selection rule than "keep the k best".

The second parameter SelectKBest accepts is k, which is the number of features it should return. Below I've implemented univariate feature selection using f_classif and k=5. If we changed f_classif to a different statistical test, such as chi2, we could get a different set of features.

In [6]:
from sklearn.feature_selection import SelectKBest, f_classif

##Fit
univariate_sel = SelectKBest(f_classif, k=5).fit(input_train, outcome_train)

##Transform both test and train according to train
univariate_features = univariate_sel.transform(input_train) 
univariate_features_test = univariate_sel.transform(input_test) 
univariate_feature_names = input_train.columns.values[univariate_sel.get_support(indices = True)]

##Report how many features remain and their column names
print("remaining features : {}".format(univariate_features.shape[1]))
print(univariate_feature_names)

remaining features : 5
['fixed acidity' 'volatile acidity' 'citric acid' 'sulphates' 'alcohol']


In [7]:
##Use same code to evaluate the features selected
classifier = classify(univariate_features, outcome_train)
train_accuracy_uni = evaluate_classifier(classifier, univariate_features, outcome_train)
test_accuracy_uni = evaluate_classifier(classifier, univariate_features_test, outcome_test)


print("train accuracy:  {}".format(train_accuracy_uni))
print("test accuracy:  {}".format(test_accuracy_uni))

train accuracy:  0.737510955302
test accuracy:  0.769392033543


Even though this is such a straightforward and simple method, we get results close to those of the full feature set. Additionally, univariate selection is one of the least computationally expensive methods we will look at.

Univariate feature selection does have its limitations, however. In practice, we often have large, correlated feature sets, but univariate selection tests each feature independently. Therefore, you could end up with 5 highly correlated features, where 4 of them provide duplicate information.
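
The sketch below illustrates this limitation on hypothetical toy data. One column is a near copy of another, and a third column is an independent but noisier view of the label. Because each column is scored on its own, SelectKBest with k=2 keeps the original and its near copy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data, constructed so that two columns carry the same signal
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
signal = y + rng.normal(scale=0.5, size=200)           # strongly related to y
near_copy = signal + rng.normal(scale=0.01, size=200)  # almost a duplicate of signal
weaker = y + rng.normal(scale=2.0, size=200)           # independent, noisier view of y
X = np.column_stack([signal, near_copy, weaker])

sel = SelectKBest(f_classif, k=2).fit(X, y)
chosen = sel.get_support(indices=True)

# Each column is scored in isolation, so the duplicate is not penalized
print("F-scores: %s" % (sel.scores_,))
print("selected column indices: %s" % (chosen,))
```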

## Recursive Feature Elimination

The next feature selection algorithm we will look at is Recursive Feature Elimination (RFE). RFE starts by training a linear model on the full feature set. Using the weight of each feature, it removes the n features with the smallest absolute weights, refits the model on the remaining features, and repeats until the desired number of features is reached.

RFE is a _greedy_ algorithm, meaning that the exhaustive set of possibilities is never explored. At each step the algorithm removes the features with the smallest weights, irrespective of how different combinations might perform together after other features are removed.

RFE from sklearn accepts more parameters than the previous algorithms we looked at. First, the algorithm has to get the weights for each feature. Below you can see that I used a Support Vector Machine with a linear kernel to get the weights. If we were instead solving a regression problem, we would probably want to use a linear regression model for the weights. One limitation of this method is that the model you fit has to expose a weight or importance for each feature (coef_ or feature_importances_ in sklearn). A more complex, non-linear SVM kernel, or any other algorithm without such weights, won't work.

After creating the model object, you pass it to RFE along with the number of features you want to retain and the number of features to remove at each step. RFE fits the model itself.

In [8]:
from sklearn.feature_selection import RFE

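##linear model to fit
##note: SVR is the regression variant of the SVM; RFE only uses its linear weights (coef_) to rank features
svc = sklearn.svm.SVR(kernel="linear", C=1)

##create rfe object (keep 5 features, remove 1 per step)
rfe = RFE(svc, n_features_to_select=5, step=1)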

##fit the rfe with our training data
fit = rfe.fit(input_train, outcome_train)

##transform train and test data based on the fit
##note: this is why we call fit and not fit_transform
rfe_features = rfe.transform(input_train)
rfe_features_test = rfe.transform(input_test)

##get feature names
rfe_features_names = input_train.columns.values[fit.get_support(indices = True)]

print("remaining features : {}".format(rfe_features.shape[1]))
print(rfe_features_names)


remaining features : 5
['volatile acidity' 'citric acid' 'chlorides' 'pH' 'sulphates']


In [None]:
classifier = classify(rfe_features, outcome_train)
train_accuracy_rfe = evaluate_classifier(classifier, rfe_features, outcome_train)
test_accuracy_rfe = evaluate_classifier(classifier, rfe_features_test, outcome_test)
print("train accuracy {}".format(train_accuracy_rfe))
print("test accuracy {}".format(test_accuracy_rfe))

train accuracy 0.708150744961
test accuracy 0.731656184486
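
Beyond the selected features, a fitted RFE object also records the order in which features were eliminated. The sketch below, on synthetic data (an assumption, not the wine set), uses step=2 so that two features are dropped per round, then inspects support_ and ranking_.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the wine data)
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# step=2 drops two features per round instead of one: 8 -> 6 -> 4
rfe_demo = RFE(SVC(kernel="linear"), n_features_to_select=4, step=2).fit(X, y)

# support_ flags the kept features; ranking_ is 1 for kept features and
# larger for features that were eliminated in earlier rounds
print("kept: %s" % (rfe_demo.support_,))
print("ranking: %s" % (rfe_demo.ranking_,))
```

A larger step makes RFE cheaper because fewer models are fit, at the cost of a coarser ranking.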


## Greedy Forward Selection
Finally, one of the most common feature selection techniques in practice is Greedy Forward Feature Selection (GFFS). This is the only method in this tutorial that we write ourselves rather than take from sklearn.

Greedy Forward Selection works exactly as it sounds. Beginning with the single feature that best predicts the outcome, in each step you add the feature that, combined with the features already selected, best predicts the outcome. The evaluation of "next best" is somewhat up for debate. I chose accuracy because that is what I have shown above, but you could just as easily use AUC, recall, precision, etc. For a regression task, you might use something like R-squared, mean squared error, or mean absolute error. These decisions depend on the application or research question.


In [None]:
def forward_select(input_train, outcome_train, number_of_features):
    ###initialize dataframes of selected features and remaining features
    sel_feat = pd.DataFrame()
    remain = input_train.copy()
    for i in range(number_of_features):
        ###target measure (training accuracy in this case) for each remaining candidate
        acc_list = []
        for feature in remain: 
            ###try the already selected features plus this one candidate
            considered = sel_feat.copy(deep = True)
            considered[feature] = remain[feature]
            model =  classify(considered, outcome_train)
            acc_list.append(evaluate_classifier(model, considered, outcome_train))
        ###keep the candidate with the highest accuracy and remove it from the pool
        index_max = np.argmax(acc_list)
        selected_feature = remain.columns.values[index_max]
        sel_feat[selected_feature] = remain[selected_feature]
        del remain[selected_feature]
    print("features selected :  {}".format(sel_feat.columns.values))
    return sel_feat
    
    
FS_features = forward_select(input_train, outcome_train, 5)
FS_test_features = input_test[FS_features.columns.values]

classifier = classify(FS_features, outcome_train)
train_accuracy_fw = evaluate_classifier(classifier, FS_features, outcome_train)
test_accuracy_fw = evaluate_classifier(classifier, FS_test_features, outcome_test)
print("train accuracy {}".format(train_accuracy_fw))
print("test accuracy {}".format(test_accuracy_fw))
        

features selected :  ['volatile acidity' 'alcohol' 'fixed acidity' 'free sulfur dioxide'
 'total sulfur dioxide']


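In the output shown here, forward selection shares three of its five features with univariate selection (fixed acidity, volatile acidity, and alcohol) and four with the variance threshold. Note that the forward-selected set includes both sulfur dioxide metrics, which are likely correlated, while the univariate set includes neither. With such a small desired feature set, these differences may not mean much.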
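
Newer releases of sklearn (0.24 and later) also ship a forward selector, SequentialFeatureSelector. The sketch below shows it on synthetic data (an assumption, not the wine set). Unlike the hand-written function above, which scores candidates on training accuracy, it scores each candidate with cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the wine data)
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# direction="forward" starts from no features and adds one per round;
# each candidate is scored by 3-fold cross-validated accuracy
sfs = SequentialFeatureSelector(SVC(kernel="linear"), n_features_to_select=3,
                                direction="forward", scoring="accuracy", cv=3)
sfs.fit(X, y)

print("selected column indices: %s" % (sfs.get_support(indices=True),))
```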

## Bringing it all together

The code below shows which features were selected by each algorithm and the performance of a model with those features. In practice, feature selection tends to be a highly supervised process, so you can learn a lot about your data and models from looking at the results of these algorithms.

The wine data set that I chose was somewhat limited in that there wasn't a huge difference in the model performance between algorithms. However, the key take-aways from this lesson should be two-fold. 

First, we achieved <b>comparable performance using fewer than 50%</b> of the available features. Secondly, <b>no two methods selected the exact same features</b>.


In [None]:
def display_df(full_features, selected_features):
    xlist = []
    for feature in full_features:
        if feature in selected_features:
            xlist.append("x")
        else:
            xlist.append("")
    return xlist

variance_threshold = display_df(input_train, var_feature_names)
univariate = display_df(input_train, univariate_feature_names)
rfe = display_df(input_train, rfe_features_names)
forward_selection = display_df(input_train, FS_features)

df=pd.DataFrame()
df["Features"] = pd.Series(list(input_train.columns.values) + ["train acc", "test acc"])
df["Variance Threshold"] = pd.Series(variance_threshold + [train_accuracy_var, test_accuracy_var])
df["Univariate Selection"] = pd.Series(univariate + [train_accuracy_uni, test_accuracy_uni])
df["Recursive Feature Elim"] = pd.Series(rfe + [train_accuracy_rfe, test_accuracy_rfe])
df["Forward Selection"] = pd.Series(forward_selection + [train_accuracy_fw, test_accuracy_fw])
df
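
To put a number on how different the selections are, the sketch below counts the features shared by each pair of methods. The feature sets are copied from the outputs printed earlier in this tutorial. The dictionary keys are just labels chosen for this example.

```python
from itertools import combinations

# Feature sets copied from the outputs printed earlier in this tutorial
selected = {
    "variance threshold": {"total sulfur dioxide", "free sulfur dioxide",
                           "fixed acidity", "residual sugar", "alcohol"},
    "univariate": {"fixed acidity", "volatile acidity", "citric acid",
                   "sulphates", "alcohol"},
    "rfe": {"volatile acidity", "citric acid", "chlorides", "pH", "sulphates"},
    "forward selection": {"volatile acidity", "alcohol", "fixed acidity",
                          "free sulfur dioxide", "total sulfur dioxide"},
}

# Count shared features for every pair of methods
overlap = {}
for a, b in combinations(sorted(selected), 2):
    overlap[(a, b)] = len(selected[a] & selected[b])
    print("%s vs %s: %d shared" % (a, b, overlap[(a, b)]))
```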
        

## Further Reading

Feature selection is an active area of research in machine learning and statistics. Two methods that I did not cover in this tutorial, but that are widely used for certain applications, are Tree-Based Selection and Orthogonal Matching Pursuit.

Tree-Based Learning uses the results of random forests to calculate and rank feature performance based on how many trees the feature is selected in. Orthogonal Matching Pursuit is typically used for video and audio files and can reliably recover a signal using random linear measurements of that signal.
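
As a minimal sketch of the tree-based idea, the snippet below ranks features with a random forest and keeps those with above-average importance. It runs on synthetic data, and it relies on sklearn's impurity-based feature_importances_, which is one possible importance measure and an assumption about which one is meant here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption; not the wine data)
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is impurity-based in sklearn and sums to 1
importances = forest.feature_importances_

# Keep the features whose importance is at least the mean importance
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_reduced = selector.transform(X)
print("kept %d of %d features" % (X_reduced.shape[1], X.shape[1]))
```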

More info here: 

Tree-Based Learning: 
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/

Orthogonal Matching Pursuit:
http://www.stat.yale.edu/~snn7/courses/stat679fa13/references/omptrogil.pdf
http://math.mit.edu/~liewang/OMP.pdf


