# <center>Introduction to Feature Selection</center>

You have all seen datasets. Some are small, but many are tremendously large, and a dataset that is large enough to cause a processing bottleneck becomes very difficult to work with.

So, what makes these datasets so large? Often, it is their features: the more features a dataset has, the larger it tends to be. That is not always the case, though; you will find datasets with a very large number of features but relatively few instances. But that is not the point of discussion here. So, you might wonder how to process these kinds of datasets on a commodity computer without beating around the bush.

Often, a high-dimensional dataset contains features that are irrelevant, insignificant or unimportant. These features usually contribute far less to predictive modeling than the important features, and they may contribute nothing at all. They cause a number of problems that in turn prevent efficient predictive modeling:

- Unnecessary resource allocation for these features.
- These features act as noise, which can make the machine learning model perform terribly.
- The machine learning model takes more time to train.

So, what's the solution here? The most economical solution is <b>Feature Selection</b>.

Feature Selection is the process of selecting the most significant features from a given dataset. In many cases, Feature Selection can enhance the performance of a machine learning model as well.

Sounds interesting, right?

You got an informal introduction to Feature Selection and its importance in the world of Data Science and Machine Learning. This post will cover: 

- Introduction to feature selection and understanding its importance
- Difference between feature selection and dimensionality reduction
- Different types of feature selection methods
- Implementation of different feature selection methods with <b>scikit-learn</b>

## Introduction to feature selection

Feature selection is also known as <b>Variable selection</b> or <b>Attribute selection</b>.

Essentially, it is the automatic selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on.


## Understanding the importance of feature selection

Feature selection methods aid you in your mission to create an accurate predictive model. They help you by choosing features that will give you as good or better accuracy whilst requiring less data.

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.

Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

<i>"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."</i>

-<a href = "http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf">An Introduction to Variable and Feature Selection</a>

Now let's understand the difference between <b><i>dimensionality reduction</i></b> and feature selection.

Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.

Examples of dimensionality reduction methods include Principal Component Analysis, Singular Value Decomposition and Sammon’s Mapping.
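The difference can be seen in a few lines of NumPy. This is only an illustrative sketch: the random data, the hand-picked columns in `keep` and the SVD-based projection (which is how Principal Component Analysis can be computed) are assumptions made for the example, not part of the case study later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 samples, 4 original features

# Feature selection: keep a subset of the original columns unchanged.
keep = [0, 2]                        # hypothetical choice of columns
X_selected = X[:, keep]

# Dimensionality reduction (PCA via SVD): project the data onto new axes
# that are linear combinations of all the original columns.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T    # first 2 principal components

print(X_selected.shape, X_reduced.shape)            # (6, 2) (6, 2)
print(np.array_equal(X_selected[:, 0], X[:, 0]))    # True: the column is untouched
```

Both results have two columns, but only the selected ones can still be read as the original attributes.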

Let me summarize the importance of feature selection for you:

- It enables the machine learning algorithm to train faster.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.

In the next section, you will study the different types of general feature selection methods: filter methods, wrapper methods and embedded methods.

## Filter methods

The following image best describes filter-based feature selection methods:


![FilterMethods](https://www.analyticsvidhya.com/wp-content/uploads/2016/11/Filter_1.png)


<b>Image Source: Analytics Vidhya</b>

Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. Correlation is a loose term here: the appropriate statistic depends on the types of the feature and of the outcome. You can refer to the following table for choosing a correlation coefficient for different types of data (in this case continuous and categorical).

![DifferentCorrelationCoefficient](https://www.analyticsvidhya.com/wp-content/uploads/2016/11/FS1.png)

<b>Image Source: Analytics Vidhya</b>

Some examples of filter methods include the <b><i>Chi squared test</i></b>, <b><i>information gain</i></b> and <b><i>correlation coefficient scores</i></b>.
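As a rough sketch of how a filter method works, the snippet below scores each feature on its own by its absolute Pearson correlation with the target and keeps the top `k`. The synthetic data, the feature layout and the value of `k` are assumptions chosen for illustration; no model is trained at any point.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([noise_a, informative, noise_b])
y = 2.0 * informative + rng.normal(scale=0.5, size=n)

# Score each feature independently: absolute Pearson correlation with y.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Rank the features from highest to lowest score and keep the top k.
k = 1
ranking = np.argsort(scores)[::-1]
selected = ranking[:k]
print(scores.round(3), selected)
```

Only the middle column (index 1) is related to the target here, so it should receive by far the highest score.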

Next you will see Wrapper methods.

## Wrapper methods

As with filter methods, here is a similar infographic that will help you understand wrapper methods better:

![WrapperMethod](https://www.analyticsvidhya.com/wp-content/uploads/2016/11/Wrapper_1.png)

<b>Image Source: Analytics Vidhya</b>

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes to add and remove features.

Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

- <b>Forward Selection</b>: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves its performance.
- <b>Backward Elimination</b>: In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, that is, the one whose removal most improves the performance of the model. We repeat this until no improvement is observed on removal of features.
- <b>Recursive Feature elimination</b>: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and sets aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.
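The forward selection loop described above can be sketched in a few lines. This is a minimal, hand-written version under some assumptions: the data is synthetic (`make_classification`), the model is a logistic regression, and the helper name `forward_selection` is made up for this example rather than taken from a library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

def forward_selection(X, y, model, cv=5):
    """Greedily add the feature that most improves the cross-validated score."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every candidate feature when added to the current subset.
        candidates = [
            (cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean(), j)
            for j in remaining
        ]
        score, j = max(candidates)
        if score <= best_score:      # stop when no candidate improves the score
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score

selected, score = forward_selection(X, y, LogisticRegression(max_iter=1000))
print(selected, round(score, 3))
```

Every candidate subset requires training and cross-validating a model, which is why wrapper methods get expensive quickly. scikit-learn also provides `SequentialFeatureSelector` for forward and backward selection, which is likely a better choice than a hand-rolled loop in practice.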

One of the best ways of implementing feature selection with wrapper methods is to use the <a href ="http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/"><b>Boruta</b></a> package, which finds the importance of a feature by creating shadow features.

It works in the following steps:

- Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
- Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.
- At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z-score than the maximum Z-score of its shadow features) and constantly removes features which are deemed highly unimportant.
- Finally, the algorithm stops either when all features get confirmed or rejected or it reaches a specified limit of random forest runs.
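To make the shadow-feature idea more tangible, here is a simplified sketch of a single iteration. It is not the Boruta package's API: it uses scikit-learn's impurity-based `feature_importances_` instead of Z-scores of Mean Decrease Accuracy, runs the forest only once, and works on synthetic data in which the first three columns are the informative ones. All of these are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in the first 3 columns.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
rng = np.random.default_rng(0)

# Shadow features: each column shuffled independently, which destroys any
# relationship with the target while keeping the column's distribution.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_extended, y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A "hit" means the real feature beat the best shadow feature in this run.
hits = real_importance > shadow_max
print(hits)
```

The full algorithm repeats this many times and uses a statistical test on the accumulated hits to confirm or reject each feature.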

Good enough!

Now let's study embedded methods.

## Embedded methods

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection method is the regularization method.

Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients).

Examples of regularization algorithms are the <b>LASSO</b>, <b>Elastic Net</b> and <b>Ridge Regression</b>. Note that the L1 penalty used by LASSO and Elastic Net can shrink coefficients to exactly zero, which removes the corresponding features, whereas Ridge Regression only shrinks coefficients towards zero.
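To illustrate how a regularization penalty can act as a feature selector, here is a minimal sketch that fits LASSO and Ridge on the same data. The synthetic dataset (only the first two of ten features influence the target) and the `alpha` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("LASSO coefficients:", lasso.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

LASSO typically sets the coefficients of the irrelevant features to exactly zero, so the selection happens during training. Ridge keeps every feature and only makes the irrelevant coefficients small.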

## Difference between filter and wrapper methods

Here are a few more points comparing filter and wrapper methods, so that you don't miss out on anything.


- Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
- Filter methods are much faster than wrapper methods as they do not involve training models. Wrapper methods, on the other hand, are computationally very expensive.
- Filter methods use statistical methods to evaluate a subset of features, while wrapper methods use <b>cross validation</b>.
- Filter methods might fail to find the best subset of features on many occasions, whereas wrapper methods usually find a better subset because they evaluate candidates with the model itself. Only an exhaustive search over all subsets is guaranteed to find the best one.
- Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.


So far, you have studied the importance of feature selection and understood how it differs from dimensionality reduction. You also covered various types of feature selection methods. So far, so good!

Now, let's see some traps that you may get into while selecting features:

## Important consideration

Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.

It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models, which can result in overfitting.

For example, you must include feature selection within the inner-loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.
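One way to keep feature selection inside the cross-validation loop in scikit-learn is to make the selector a step of a `Pipeline`. The sketch below assumes synthetic data, an ANOVA F-test (`f_classif`) as the scoring function and `k=5`; these choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# The selector is a pipeline step, so for every fold it is fitted on the
# training portion only; the held-out portion never influences the choice.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(round(scores.mean(), 3))
```

Calling `SelectKBest(...).fit_transform(X, y)` on the full data first and cross-validating afterwards would be the mistake described above.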

Enough theory! Let's get straight to some coding now.

## Case study in Python

For this case study, you will use the Pima Indians Diabetes dataset. The description of the dataset can be found <a href = "https://www.kaggle.com/uciml/pima-indians-diabetes-database">here</a>.

The dataset corresponds to a classification problem: given the 8 features in the dataset, you need to predict whether a person has diabetes.

There are a total of 768 observations in the dataset. Your first task is to load the dataset so that you can proceed. But before that, let's import the basic dependencies you are going to need. You will import the other ones as you go along.

In [2]:
import pandas as pd
import numpy as np

Now that the dependencies are imported, let's load the Pima Indians dataset into a DataFrame object with the help of the pandas library.

In [3]:
data = pd.read_csv("diabetes.csv")

The dataset is successfully loaded into the DataFrame object <i>data</i>. Now, let's take a look at the data.

In [8]:
data.head()

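| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |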


So you can see 8 different features, with each observation labeled with an outcome of 1 or 0, where 1 means that the person has diabetes and 0 means that the person does not. The dataset is known to have missing values. Specifically, there are missing observations for some columns that are marked as a zero value. You can corroborate this with the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g., a zero for body mass index or blood pressure is invalid.
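As a small sketch of how such zero-coded missing values could be surfaced with pandas, the snippet below rebuilds the five rows displayed above and marks zeros as missing. Treating a zero as missing in every one of these columns is an assumption made for the illustration; in the full dataset a zero is perfectly valid for some columns, such as Pregnancies.

```python
import numpy as np
import pandas as pd

# The five rows displayed above, restricted to measurement columns.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Replace zeros with NaN so that pandas treats them as missing values.
as_missing = sample.replace(0, np.nan)

# Count the missing values per column.
print(as_missing.isna().sum())
```

Even in these five rows, Insulin is missing three times and SkinThickness once.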

This <a href = "https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models">DataCamp article</a> discusses handling the dataset's missing values in detail. You should definitely refer to it. But for this tutorial, you will directly use the preprocessed version of the dataset.

In [4]:
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)

You loaded the data in a DataFrame object called <i>dataframe</i> now. 

Let's convert the DataFrame object to a NumPy array to ease the computations. Also, let's split the data into the input features (X) and the target (Y).

In [5]:
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

You have prepared the data. 

First, you will implement the <b><i>Chi-Squared</i></b> statistical test for non-negative features to select 4 of the best features from the dataset. As you saw earlier, the Chi-Squared test belongs to the class of filter methods. If you are curious about the internals of Chi-Squared, <a href = "https://www.youtube.com/watch?v=VskmMgXmkMQ">this video</a> does an excellent job of explaining them.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. In this case, the test is Chi-Squared.

In [12]:
# Import the necessary libraries first

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

You imported the libraries to run the experiments. Now, let's see it in action. 

In [14]:
# Feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]


You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age. These scores will help you further in choosing the best features for training your model.

<b>P.S.: The first array holds the Chi-Squared score of each feature, in the same order as the names list (preg, plas, pres, skin, test, mass, pedi, age). The second array shows the first five rows of the 4 selected features.</b>
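If you want to check which names go with the highest scores, a short sketch like the following works on the numbers printed above. The scores are copied by hand from that output, and the variable names are only illustrative.

```python
import numpy as np

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Chi-Squared scores copied from the output above.
scores = np.array([111.52, 1411.887, 17.605, 53.108,
                   2175.565, 127.669, 5.393, 181.304])

# Indices of the 4 highest scores, put back into column order.
top4 = np.sort(np.argsort(scores)[-4:])
print([names[i] for i in top4])   # ['plas', 'test', 'mass', 'age']
```

With the fitted selector at hand, `fit.get_support()` returns the same information as a boolean mask over the columns.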

Next, you will implement <b><i>Recursive Feature Elimination</i></b>, which is a type of wrapper feature selection method.

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

In scikit-learn, it uses the fitted model's coefficients or feature importances to identify which attributes contribute the least to predicting the target attribute, and removes those first.

You can learn more about the RFE class in the <a href = "http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE">scikit-learn documentation</a>.

In [15]:
# Import your necessary dependencies

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

You will use RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [21]:
# Feature extraction

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)  # keyword argument required by recent scikit-learn versions
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]


You can see that RFE chose the top 3 features as preg, mass and pedi.

These are marked True in the <i>support_</i> array and marked with a choice “1” in the <i>ranking_</i> array.
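The two arrays can be turned into feature names with a little NumPy indexing. In this sketch the arrays are copied by hand from the output above rather than read from a fitted `RFE` object.

```python
import numpy as np

names = np.array(['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'])
# Arrays copied from the RFE output above.
support = np.array([True, False, False, False, False, True, True, False])
ranking = np.array([1, 2, 3, 5, 6, 1, 1, 4])

# Boolean mask indexing keeps only the selected features.
print(names[support].tolist())    # ['preg', 'mass', 'pedi']

# Rank 1 marks a selected feature; the highest rank was eliminated first.
order = np.argsort(ranking, kind="stable")
print(list(zip(names[order].tolist(), ranking[order].tolist())))
```

Read from the end of the second list, the features appear in the order in which RFE discarded them, starting with test.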

Next up, you will use <b><i>Ridge regression</i></b>, which is basically a regularization technique and is used here as an embedded feature selection technique as well.

<a href="https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/#three">This article</a> gives you an excellent explanation on Ridge regression. Be sure to check it out. 

In [1]:
# First things first

from sklearn.linear_model import Ridge

Next, you will fit a Ridge regression model and inspect the coefficients it assigns to each feature.

Also, <a href = "http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">check scikit-learn's official documentation on Ridge regression</a>. 

In [8]:
ridge = Ridge(alpha=1.0)
ridge.fit(X,Y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In order to better understand the results of Ridge regression, you will implement a little helper function that prints the results in a better way so that you can interpret them easily.

In [10]:
# A helper method for pretty-printing the coefficients

def pretty_print_coefs(coefs, names = None, sort = False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst,  key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                                   for coef, name in lst)

Next, you will pass Ridge model's coefficient terms to this little function and see what happens.

In [12]:
print ("Ridge model:", pretty_print_coefs(ridge.coef_))

Ridge model: 0.021 * X0 + 0.006 * X1 + -0.002 * X2 + 0.0 * X3 + -0.0 * X4 + 0.013 * X5 + 0.145 * X6 + 0.003 * X7


You can see each coefficient next to its feature variable. This will again help you to choose the most important features. Below are some points that you should keep in mind while applying Ridge regression:

- It is also known as <b>L2-Regularization</b>.
- Correlated features tend to get similar coefficients.
- Features whose coefficients are close to zero don't contribute much; the sign of a coefficient only gives the direction of the effect, and magnitudes are only comparable when the features are on similar scales. In a more complex scenario where you are dealing with lots of features, these coefficients will definitely help you in the ultimate feature selection decision-making process.
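One caveat when reading Ridge coefficients as importance scores is that their magnitude depends on the scale of each feature. The sketch below illustrates this with two synthetic features that matter equally but live on very different scales; the data and the pipeline are assumptions made for the example and are not part of the case study above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
small_scale = rng.normal(size=n)                 # standard deviation around 1
large_scale = rng.normal(scale=1000.0, size=n)   # standard deviation around 1000
X = np.column_stack([small_scale, large_scale])
# Both features matter equally once their scale is accounted for.
y = 1.0 * small_scale + 0.001 * large_scale + rng.normal(scale=0.1, size=n)

raw = Ridge(alpha=1.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

print("raw coefficients:   ", raw.coef_.round(3))
print("scaled coefficients:", scaled.named_steps["ridge"].coef_.round(3))
```

On the raw data the second feature looks almost irrelevant, while after standardization both coefficients come out at a similar size. Standardizing the features first is therefore a sensible step before comparing coefficients.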

Well, that concludes the case study section. The methods that you implemented in the above section will definitely help you to understand the features of a particular dataset in a comprehensive manner. Let me give you some important points on these techniques:

- Feature selection is essentially a part of data preprocessing, which is considered to be the most time-consuming part of any machine learning pipeline.
- These techniques will help you to approach it in a more systematic and machine learning-friendly way. You will be able to interpret the features more accurately.

In this post, you covered one of the most well-studied and well-researched topics in statistics and machine learning, i.e. feature selection. You also got familiar with its different variants and used them to see which features in a dataset are important.

Brilliant! 

Following are some resources if you would like to dig deeper into this topic:

- <a href = "http://machinelearningmastery.com/an-introduction-to-feature-selection/">An introduction to feature selection</a>
- <a href = "http://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/">Feature Selection to Improve Accuracy and Decrease Training Time</a>
- <a href = "http://www.amazon.com/dp/079238198X?tag=inspiredalgor-20">Feature Selection for Knowledge Discovery and Data Mining</a>
- <a href = "http://www.amazon.com/dp/3540341374?tag=inspiredalgor-20">Subspace, Latent Structure and Feature Selection: Statistical and Optimization Perspectives Workshop</a>
- <a href = "https://www.youtube.com/watch?v=y2Jsa4sgD5w&t=1073s">Feature Selection : Problem statement and Uses</a>

Below are the references that I used in order to write this tutorial.

- <a href = "https://www.elsevier.com/books/data-mining-concepts-and-techniques/han/978-0-12-381479-1">Data Mining: Concepts and Techniques; Jiawei Han Micheline Kamber Jian Pei</a>.
- <a href = "https://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/">Selecting good features – Part II</a>
- <a href = "https://www.datacamp.com/courses/hierarchical-and-mixed-effects-models">Hierarchical and Mixed Model - DataCamp course</a>

Be sure to post your doubts in the comments section if you have any!