# Introduction to Feature Selection

-----

Feature selection supports the goal of building an accurate predictive model: it identifies a subset of features that gives comparable or better accuracy while requiring less data.

The key benefits of performing feature selection on the data are:

- Reduces Overfitting: Less redundant data means less chance to make decisions based on noise.
- Improves Accuracy: Less misleading data means improvements in modeling accuracy.
- Reduces Training Time: Less data means algorithms train faster.
- Improves Interpretability: Less complexity of a model makes it easier to interpret.

In some cases, a domain expert can indicate which features have the most predictive power and which features can be ignored. When this is not possible (and in some cases even when it is possible), we can employ algorithmic feature selection to automatically quantify the importance of features so that a threshold can be used to identify the best features for a particular application.

Broadly speaking, there are three general classes of feature selection algorithms:

- Filter methods
- Wrapper methods
- Embedded methods

The scikit-learn library provides a number of [feature selection algorithms][skfs] that implement these techniques. The rest of this notebook explores them in more detail.

-----

[skfs]: http://scikit-learn.org/stable/modules/feature_selection.html

## Table of Contents

[Data](#Data)

[Filter Methods](#Filter-Methods)
- [Statistical Tests](#Statistical-Tests)

- [Univariate Techniques](#Univariate-Techniques)

[Wrapper methods](#Wrapper-methods)
- [Recursive Feature Elimination](#Recursive-Feature-Elimination)

[Embedded Methods](#Embedded-Methods)

-----

Before proceeding with the rest of this notebook, we first have our standard notebook setup code.

-----

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the output clean (this hides several Pandas warnings)
import warnings
warnings.filterwarnings("ignore")

# Set global figure properties
import matplotlib as mpl
mpl.rcParams.update({'axes.titlesize' : 20,
                     'axes.labelsize' : 18,
                     'legend.fontsize': 16})

# Set default seaborn plotting style
sns.set_style('white')

-----
[[Back to TOC]](#Table-of-Contents)

## Data

To perform feature selection, we need representative data. In this section we introduce two data sets that we use to perform feature selection within this notebook.


### Iris Data

The first data set we use to perform feature selection is the [Iris data][id]. Previously, we used seaborn to load the Iris data into a DataFrame. In this notebook we use the scikit-learn library, which loads the Iris data into an object. The object has `data` and `target` attributes, which contain the training features and target labels as NumPy arrays. These data contain four features: sepal length, sepal width, petal length, and petal width, for three different Iris varieties. There are fifty examples of each type of Iris, for 150 total instances in the data set. **To increase the challenge, we will occasionally add random _noise_ features to these data in order to test if a feature selection technique can distinguish between signal and noise.**

-----

[id]: http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

In [2]:
import sklearn.datasets as ds

# Load Iris Data
iris = ds.load_iris()

# Extract features & labels
features = iris.data
labels = iris.target

print(f'Feature names:{iris.feature_names}')
# Output examples of each class
print(f'Feature {features[0]}: Label {labels[0]}')
print(f'Feature {features[50]}: Label {labels[50]}')
print(f'Feature {features[100]}: Label {labels[100]}')

Feature names:['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature [5.1 3.5 1.4 0.2]: Label 0
Feature [7.  3.2 4.7 1.4]: Label 1
Feature [6.3 3.3 6.  2.5]: Label 2


### Adult Income Data
The second data set we use throughout this notebook is the [Adult income prediction task][uciad]. These data were extracted by Barry Becker from the 1994 Census database and consist of the following columns: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, and salary. Of these, six are continuous features: age, fnlwgt, education-num (named `EducationLevel` in the CSV file used below), capital-gain, capital-loss, and hours-per-week; the others are discrete features. The last column, salary, is discrete and contains one of two strings to indicate if the salary was below or above $50,000. This is the column we will use to make our label.

The following Code cell prepares the data:

1. Load the data (we use a subset of the original data).
2. Create the label from the `Salary` column.
3. Encode the categorical features that have string values.
4. Combine the numerical features and the encoded categorical features.

-----
[uciad]: https://archive.ics.uci.edu/ml/datasets/Adult

In [3]:
from sklearn.preprocessing import LabelEncoder

# Read CSV data
adult_data = pd.read_csv('data/adult_income.csv')

# Create label column, one for >50K, zero otherwise.
adult_data['Label'] = adult_data['Salary'].map(lambda x : 1 if '>50K' in x else 0)

# Generate categorical features
categorical_features = adult_data[['Workclass', 'Education', 'MaritalStatus', 
               'Occupation', 'Relationship', 'Race', 'Sex', 'NativeCountry']]

# Encode categorical features
categorical_features = categorical_features.apply(LabelEncoder().fit_transform)

# Extract numerical features
numerical_features = adult_data[['Age', 'FNLWGT', 'EducationLevel', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek']]

adult_features = pd.concat([numerical_features, categorical_features], axis=1)

adult_label = adult_data['Label']
adult_features.head()

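|   | Age | FNLWGT | EducationLevel | CapitalGain | CapitalLoss | HoursPerWeek | Workclass | Education | MaritalStatus | Occupation | Relationship | Race | Sex | NativeCountry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 68268 | 9 | 0 | 0 | 40 | 2 | 11 | 2 | 14 | 0 | 4 | 1 | 36 |
| 1 | 50 | 215990 | 9 | 0 | 0 | 40 | 4 | 11 | 2 | 3 | 0 | 4 | 1 | 36 |
| 2 | 36 | 185405 | 9 | 0 | 0 | 50 | 4 | 11 | 2 | 1 | 0 | 4 | 1 | 36 |
| 3 | 64 | 258006 | 10 | 0 | 0 | 40 | 4 | 15 | 6 | 1 | 1 | 4 | 0 | 5 |
| 4 | 28 | 39388 | 11 | 0 | 0 | 60 | 6 | 8 | 2 | 5 | 0 | 4 | 1 | 36 |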


-----
[[Back to TOC]](#Table-of-Contents)

## Filter Methods
Filter methods typically involve the application of a statistical measure to score the different features. This score allows the features to be ranked, and this ranking is used to determine which features to keep and which can be removed from the data. Generally each feature is considered on its own (i.e., a univariate test).

### Statistical Tests

One of the simplest techniques for algorithmically selecting features is to measure the variance of each feature. The reasoning is that a feature with large variance can carry significant information, whereas a feature with small variance is tightly bunched and offers little discriminative power. (Decision trees built for regression rely on a related idea: they choose the splits that most reduce the variance of the target.) As an extreme example, a feature that has zero variance provides no descriptive power (since all instances have the same value) and can be removed from the analysis without affecting the predictive performance of an algorithm.

Formally, this technique is known as [variance thresholding][vt], which is implemented in the scikit-learn library by the [`VarianceThreshold`][skvt] selector. The following two Code cells demonstrate this technique on the Iris data. First, the technique is applied directly to the Iris data, which provides a ranking of feature importance (via the variance measures). 

`VarianceThreshold` takes one argument, `threshold`, and removes all features whose variance falls below it. The default value of `threshold` is 0, which keeps every feature with non-zero variance and removes only features that have the same value in every sample. No Iris feature is constant, so with the default `threshold` all features are kept and we can inspect the variance of each one. Once the `VarianceThreshold` selector is fit to the feature data, we can retrieve the feature variances from the selector's `variances_` attribute. We then zip the variances with the feature names and display each feature's variance.

However, because the original features are not scaled, the variance comparison is misleading. Some features have a naturally larger spread simply because of the sizes of the petals and sepals being measured. Thus, the second Code cell scales these features to the same zero-to-one range and then performs variance thresholding. Notice how the results change such that the petal width now has a larger variance than the petal length. This example emphasizes the importance of ensuring that statistical tests are performed in a uniform manner in order to avoid biasing the results.
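The effect of scaling can be checked directly. Min-max scaling divides each feature by its range, so its variance is divided by the square of that range. The following is a minimal sketch of that arithmetic on the Iris data; the variable names are illustrative, and the population variance (`ddof=0`) is assumed because it matches the values printed in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris_demo = load_iris()
X_raw = iris_demo.data

# Population variance (ddof=0) of each raw feature
raw_var = X_raw.var(axis=0)

# Range (max - min) of each raw feature
value_range = X_raw.max(axis=0) - X_raw.min(axis=0)

# Dividing a feature by its range divides its variance by range squared
predicted_var = raw_var / value_range**2

# Variance measured after actually applying min-max scaling
scaled_var = MinMaxScaler().fit_transform(X_raw).var(axis=0)

for name, p, s in zip(iris_demo.feature_names, predicted_var, scaled_var):
    print(f'{name:>18}: predicted = {p:.3f}, measured = {s:.3f}')
```

The petal length has the largest raw variance but also the largest range, which is why its scaled variance can drop below that of the petal width.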

The major disadvantage of variance thresholding is that it doesn't take the target feature into consideration when calculating the score. This is especially problematic with an unbalanced dataset. For example, in a cancer screening dataset, assume only 10% of the data has a positive label. There could be a feature that is highly correlated with the label, i.e., it has small values for negative labels and large values for positive labels. About 90% of the values of this feature are then small, which makes the feature's variance low. Yet because the feature is highly correlated with the label, it has great predictive power and is a very important feature.
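This scenario can be imitated with synthetic data. The sketch below is an illustration under assumed values, not real screening data: `marker` is a hypothetical feature that tracks a label with 10% positives, and `noise` is a uniform random feature unrelated to the label.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng_demo = np.random.RandomState(23)
n_samples = 1000

# Unbalanced label: 10% positive, 90% negative
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:100] = 1

# Hypothetical marker: near 0 for negatives, near 1 for positives
marker = y_demo + rng_demo.normal(0.0, 0.05, size=n_samples)

# Pure noise, unrelated to the label
noise = rng_demo.uniform(0.0, 1.0, size=n_samples)

# Scale both features to [0, 1] before comparing variances
X_demo = MinMaxScaler().fit_transform(np.column_stack([marker, noise]))
variances = X_demo.var(axis=0)

# Absolute correlation of each feature with the label
correlations = [abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1]) for j in range(2)]

for name, v, c in zip(['marker', 'noise'], variances, correlations):
    print(f'{name:>8}: variance = {v:.3f}, |correlation with label| = {c:.3f}')
```

A ranking by variance alone would place the noise feature above the marker, even though only the marker is related to the label.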

On the other hand, since variance thresholding doesn't need a target feature, this technique can be used to select features for unsupervised learning, which will be introduced in future lessons.
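Because no labels are required, the selector can be fit on the features alone. The sketch below uses a small made-up array (an assumption for illustration) to show how the `threshold` argument changes which columns survive; `get_support` returns the boolean mask of kept features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: a constant column, a balanced binary column,
# and a column that barely varies
X_toy = np.array([[1.0, 0.0, 0.1],
                  [1.0, 1.0, 0.1],
                  [1.0, 0.0, 0.2],
                  [1.0, 1.0, 0.1]])

# Default threshold (0): only the constant column is removed
default_mask = VarianceThreshold().fit(X_toy).get_support()

# Larger threshold: the nearly constant column is removed as well
strict_selector = VarianceThreshold(threshold=0.01)
X_reduced = strict_selector.fit_transform(X_toy)
strict_mask = strict_selector.get_support()

print(f'Default mask: {default_mask}')
print(f'Strict mask:  {strict_mask}')
print(f'Reduced shape: {X_reduced.shape}')
```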

-----

[vt]: http://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance
[skvt]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

In [4]:
# Perform variance thresholding on raw features
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()

# Compute and display variances
vt.fit_transform(features)
feature_variances = vt.variances_
for var, name in zip(feature_variances, iris.feature_names):
    print(f'{name:>10} variance = {var:.3f}')

sepal length (cm) variance = 0.681
sepal width (cm) variance = 0.189
petal length (cm) variance = 3.096
petal width (cm) variance = 0.577


In [5]:
# Scale features and then perform variance thresholding
from sklearn.preprocessing import MinMaxScaler

# Normalize data to [0, 1] range
features_ss = MinMaxScaler().fit_transform(features)

# Compute and display variances
vt.fit_transform(features_ss)
for var, name in zip(vt.variances_, iris.feature_names):
    print(f'{name:>10} variance = {var:.3f}')

sepal length (cm) variance = 0.053
sepal width (cm) variance = 0.033
petal length (cm) variance = 0.089
petal width (cm) variance = 0.100


----

As an additional example, we can apply variance thresholding to the adult income data set. In this case, we display feature variances in descending order.

The feature variances show that `HoursPerWeek` is among the features with the lowest variance. However, we would expect `HoursPerWeek` to be an important factor in income. On the other hand, a binary categorical feature like `Sex` has a variance close to 0.25 when the classes are balanced (50% male and 50% female). This tells us that variance thresholding may not be a reliable indicator of feature importance when used alone.

----

In [6]:
# Normalize data to [0, 1] range
adult_features_ss = MinMaxScaler().fit_transform(adult_features)

# Compute and display variances
vt.fit_transform(adult_features_ss)

# Sort by variance (the first tuple element), largest first
for var, name in sorted(zip(vt.variances_, adult_features.columns), key=lambda x: x[0], reverse=True):
    print(f'{name:>18} variance = {var:.3f}')

               Sex variance = 0.220
      Relationship variance = 0.102
        Occupation variance = 0.092
         Education variance = 0.069
     MaritalStatus variance = 0.063
              Race variance = 0.046
               Age variance = 0.036
         Workclass variance = 0.034
     NativeCountry variance = 0.033
    EducationLevel variance = 0.029
      HoursPerWeek variance = 0.016
       CapitalLoss variance = 0.008
            FNLWGT variance = 0.007
       CapitalGain variance = 0.006


-----
[[Back to TOC]](#Table-of-Contents)

### Univariate Techniques

Another technique for identifying the features that encode the majority of the signal in a data set is to examine each feature individually to determine the strength of the relationship of the feature with the target feature.

The two main techniques for performing this type of feature selection are [`SelectKBest`][skb] and [`SelectPercentile`][sp]. The former selects the **k** best features, while the latter selects the best percentage of features. Each of these techniques accepts a `score_func` that implements the statistical measure. Provided measures include the following:
- `f_classif`: the default; computes the ANOVA F-value between each feature and the label, used for classification.
- `mutual_info_classif`: computes the mutual information between each feature and a discrete label, used for classification.
- `chi2`: computes the chi-squared statistic between each non-negative feature and the label, used for classification.
- `f_regression`: computes the F-statistic of a univariate linear regression between each feature and the label, used for regression.
- `mutual_info_regression`: computes the mutual information between each feature and a continuous label, used for regression.

Several other specific techniques are also provided by the scikit-learn library, but they are beyond the scope of this notebook. The [online documentation][skut] provides more details on all of these methods.

To demonstrate these techniques, we will start with the original Iris data set. We use the `SelectKBest` technique to compute the scores for all features, and we use the default scoring function, which is [`f_classif`][fc]. This statistic measures the degree of linear dependence between each feature and the label. We indicate that all features should be kept by setting `k='all'` so that we can print the scores of all features. If we set `k` to a number _n_, then only the best _n_ features are kept.

The results indicate that the petal features are most important, which agrees with the feature importance results we saw in earlier notebooks.
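For comparison with `k='all'`, the following is a minimal sketch of how `SelectKBest` and `SelectPercentile` reduce the Iris features when asked to keep only some of them. The choices `k=2` and `percentile=50` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

iris_demo = load_iris()
X_iris, y_iris = iris_demo.data, iris_demo.target

# Keep the two highest-scoring features
kbest = SelectKBest(score_func=f_classif, k=2).fit(X_iris, y_iris)

# Keep the top 50% of features by score
top_half = SelectPercentile(score_func=f_classif, percentile=50).fit(X_iris, y_iris)

# get_support() returns a boolean mask over the original features
kbest_names = [n for n, keep in zip(iris_demo.feature_names, kbest.get_support()) if keep]
top_half_names = [n for n, keep in zip(iris_demo.feature_names, top_half.get_support()) if keep]

# transform() returns only the selected columns
X_selected = kbest.transform(X_iris)

print(f'SelectKBest kept:      {kbest_names}')
print(f'SelectPercentile kept: {top_half_names}')
print(f'Reduced shape: {X_selected.shape}')
```

With four features, keeping the top 50% and keeping the best two select the same columns, so both selectors retain the two petal features here.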

-----
[skut]: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
[skb]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
[sp]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html
[fc]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html

In [7]:
from sklearn.feature_selection import SelectKBest

skb = SelectKBest(k='all')

skb.fit(features, labels)
for var, name in zip(skb.scores_, iris.feature_names):
    print(f'{name:>18} score = {var:.3f}')


 sepal length (cm) score = 119.265
  sepal width (cm) score = 49.160
 petal length (cm) score = 1180.161
  petal width (cm) score = 960.007


-----

To test these techniques more thoroughly, we can add random noise features into the analysis. To do this, the following Code cell generates ten new features (called _Noise XX_, where XX is replaced by the two-digit index of the new feature, starting from 00) that contain values uniformly sampled from the range zero to one. We combine these new noise features with our Iris data, which have been normalized to the same range, by using the NumPy `hstack` function, and we also create a new list of feature names that aligns with our new set of features.

Next, we again perform feature selection by using the `SelectKBest` technique. Now, however, we display the features, sorted by their relative importance. In this case, the real features are identified with higher importance.

-----

In [8]:
# Number of noise features to add
num_nf = 10

# Set random state
rng = np.random.RandomState(23)

# Create noise features
noise = rng.uniform(0., 1.0, size=(len(iris.data), num_nf))

# Features plus noise
features_pn = np.hstack((features_ss, noise))

# Feature names
f_names = iris.feature_names.copy()
for i in range(noise.shape[1]):
    f_names.append(f'Noise {i:0>2}')
    
# Fit features plus noise
skb.fit(features_pn, labels)

# Display scores for features and noise
for var, name in sorted(zip(skb.scores_, f_names), key=lambda x: x[0], reverse=True):
    print(f'{name:>18} score = {var:.3f}')

 petal length (cm) score = 1180.161
  petal width (cm) score = 960.007
 sepal length (cm) score = 119.265
  sepal width (cm) score = 49.160
          Noise 03 score = 7.669
          Noise 09 score = 3.705
          Noise 00 score = 1.888
          Noise 04 score = 1.447
          Noise 08 score = 0.944
          Noise 02 score = 0.554
          Noise 05 score = 0.417
          Noise 06 score = 0.209
          Noise 01 score = 0.105
          Noise 07 score = 0.020


-----

To provide additional insight into these techniques, we now switch to the adult income data set. In the following example, we change the `score_func` to `mutual_info_classif`, which quantifies the dependency between each feature and the label. When a feature and the label are independent, this statistic is zero, and as the dependency increases the statistic also increases.

Notice that we get a completely different order of importance compared with the result from variance thresholding. Because the adult data is an unbalanced dataset (25% high income and 75% low income) and `SelectKBest` takes the label into account, it is more reliable than `VarianceThreshold` for this dataset.
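To make the behavior of mutual information concrete, the sketch below scores three synthetic binary features (all assumptions made for illustration): an exact copy of the label, a copy that agrees with the label about 80% of the time, and an independent feature. Setting `discrete_features=True` tells `mutual_info_classif` to treat the columns as discrete.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng_demo = np.random.RandomState(23)
n_samples = 2000

# Roughly balanced binary label
y_demo = rng_demo.randint(0, 2, size=n_samples)

# Fully dependent feature: an exact copy of the label
exact_copy = y_demo.copy()

# Partially dependent feature: agrees with the label about 80% of the time
agrees = rng_demo.uniform(size=n_samples) < 0.8
noisy_copy = np.where(agrees, y_demo, 1 - y_demo)

# Independent feature: unrelated coin flips
independent = rng_demo.randint(0, 2, size=n_samples)

X_demo = np.column_stack([exact_copy, noisy_copy, independent])
mi_scores = mutual_info_classif(X_demo, y_demo, discrete_features=True)

for name, score in zip(['exact copy', 'noisy copy', 'independent'], mi_scores):
    print(f'{name:>12} score = {score:.3f}')
```

The scores decrease as the dependency weakens, and the independent feature scores close to zero.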

-----


In [9]:
from sklearn.feature_selection import mutual_info_classif

skb = SelectKBest(mutual_info_classif, k='all')
skb.fit(adult_features, adult_label)

# Display scores for features
for var, name in sorted(zip(skb.scores_, adult_features.columns), key=lambda x: x[0], reverse=True):
    print(f'{name:>18} score = {var:.3f}')

      Relationship score = 0.111
     MaritalStatus score = 0.105
       CapitalGain score = 0.074
    EducationLevel score = 0.066
               Age score = 0.064
        Occupation score = 0.054
         Education score = 0.052
               Sex score = 0.037
      HoursPerWeek score = 0.032
       CapitalLoss score = 0.026
            FNLWGT score = 0.019
         Workclass score = 0.006
              Race score = 0.000
     NativeCountry score = 0.000


-----
[[Back to TOC]](#Table-of-Contents)

## Wrapper methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. Since we must train a model for each feature combination, this approach is much more expensive than a filter method. 
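The search described above can be sketched in its simplest, exhaustive form. This is an illustration under assumptions (a decision tree as the model, five-fold cross-validated accuracy as the score, the four original Iris features), not a specific scikit-learn selector.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_iris, y_iris = iris_demo.data, iris_demo.target
n_features = X_iris.shape[1]

# Evaluate every non-empty combination of features
results = []
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        model = DecisionTreeClassifier(random_state=23)
        # Mean five-fold cross-validated accuracy for this subset
        score = cross_val_score(model, X_iris[:, list(subset)], y_iris, cv=5).mean()
        results.append((score, subset))

# Pick the highest-scoring subset
best_score, best_subset = max(results)
best_names = [iris_demo.feature_names[i] for i in best_subset]

print(f'Evaluated {len(results)} subsets')
print(f'Best subset: {best_names} (accuracy = {best_score:.3f})')
```

Even with four features this requires 15 subsets and 75 model fits. The number of subsets doubles with each added feature, which is why practical wrapper methods search greedily rather than exhaustively.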

One popular wrapper method is the recursive feature elimination algorithm.


### Recursive Feature Elimination

[Recursive Feature Elimination (RFE)][rfe] works by recursively removing attributes and building a model from the remaining attributes. At each step, the fitted model's feature weights (coefficients or feature importances) identify the least important attributes, which are removed before the model is refit; the order of removal ranks how much each attribute contributes to predicting the target attribute. The `RFE` implementation provided by the scikit-learn library is in the `feature_selection` module.

There are two key arguments for constructing an `RFE` selector:
- estimator : A supervised learning estimator with a ``fit`` method that provides information about feature importance either through a ``coef_`` attribute or through a ``feature_importances_`` attribute.
- n_features_to_select : int or None (default=None). The number of features to select. If `None`, half of the features are selected.

For details on the other `RFE` arguments, please refer to the help documentation (`help(RFE)`).
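The elimination loop can also be written by hand to show the idea. The following is a simplified sketch, not scikit-learn's implementation: it assumes a logistic regression model, scores each feature by the sum of its squared coefficients across classes, and removes one feature per iteration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

iris_demo = load_iris()
X_iris = MinMaxScaler().fit_transform(iris_demo.data)
y_iris = iris_demo.target

remaining = list(range(X_iris.shape[1]))  # indices of features still in play
manual_ranking = {}                       # feature name -> rank (1 = best)

while len(remaining) > 1:
    model = LogisticRegression(max_iter=1000).fit(X_iris[:, remaining], y_iris)

    # coef_ has shape (n_classes, n_remaining_features)
    importance = (model.coef_ ** 2).sum(axis=0)

    # Drop the weakest feature; it receives the worst rank still available
    weakest = remaining[int(np.argmin(importance))]
    manual_ranking[iris_demo.feature_names[weakest]] = len(remaining)
    remaining.remove(weakest)

# The last surviving feature is ranked first
manual_ranking[iris_demo.feature_names[remaining[0]]] = 1

for name, rank in sorted(manual_ranking.items(), key=lambda x: x[1]):
    print(f'{name:>18} rank = {rank}')
```

Because the model is refit after every removal, a feature's weight is always judged relative to the features that remain, which is what separates RFE from a single ranking of coefficients.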

In the next two cells, we employ RFE to determine the most important features for both the Iris and adult income data. The first Code cell below uses a linear support vector classifier to perform RFE. In this case, we analyze the Iris data set plus ten _noise_ features. We set `n_features_to_select` to 1 so that the result of this operation identifies the single most important feature, but we can still retrieve the ranks of all features via the selector's `ranking_` attribute. Notice how three of the _real_ features are the top three ranked features, but next are several _noise_ features, suggesting that, for this model, the remaining _real_ feature contributes less than some random features.

-----

[rfe]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html



In [10]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

# Create classifier
svc = LinearSVC(random_state=23)

# Perform RFE, select top feature (but rank all)
rfe = RFE(estimator=svc, n_features_to_select=1)

# Fit features plus noise
rfe.fit(features_pn, labels)
    
# Display feature ranking
for var, name in sorted(zip(rfe.ranking_, f_names), key=lambda x: x[0]):
    print(f'{name:>18} rank = {var}')

  petal width (cm) rank = 1
 petal length (cm) rank = 2
  sepal width (cm) rank = 3
          Noise 03 rank = 4
          Noise 09 rank = 5
          Noise 00 rank = 6
 sepal length (cm) rank = 7
          Noise 07 rank = 8
          Noise 05 rank = 9
          Noise 06 rank = 10
          Noise 02 rank = 11
          Noise 08 rank = 12
          Noise 04 rank = 13
          Noise 01 rank = 14


-----

We now transition to the adult income data set. The following Code cell takes a little longer to run because RFE trains the selected model (`RandomForestClassifier` in this case) repeatedly to rank the features. This can take a while when the dataset is large.

-----

In [11]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=23)
# Create RFE model with only one feature
rfe = RFE(estimator=rfc, n_features_to_select=1)
rfe.fit(adult_features, adult_label)

# Display feature ranking
for var, name in sorted(zip(rfe.ranking_, adult_features.columns), key=lambda x: x[0]):
    print(f'{name:>18} rank = {var}')

            FNLWGT rank = 1
               Age rank = 2
      Relationship rank = 3
    EducationLevel rank = 4
       CapitalGain rank = 5
      HoursPerWeek rank = 6
        Occupation rank = 7
     MaritalStatus rank = 8
         Workclass rank = 9
         Education rank = 10
       CapitalLoss rank = 11
     NativeCountry rank = 12
               Sex rank = 13
              Race rank = 14


-----

<font color='red' size = '5'> Student Exercise </font>

Now that you have run the previous cells, try making changes to the notebook:

1. Try using a different classifier, such as a decision tree or logistic regression.

-----

[[Back to TOC]](#Table-of-Contents)

## Embedded Methods
Embedded Methods perform feature selection directly in the model construction. Some algorithms, such as the decision tree and the ensemble techniques based on a decision tree, provide access to measures of the feature importance. This extra information can be used to rank features for use with these models. For example, the [Random Forest Classifier (RFC)][rfc], as an ensemble method, builds models by randomly selecting features when building each tree. In this process, RFC computes the overall importance of each feature in building the final model. By extracting the feature importances from the final model, we obtain a ranked ordering of the features used to build the model.

In the following Code cells, we apply a random forest classifier to the Iris dataset (with noise added) and to the adult income dataset. We then display the feature importances for each dataset by using the model's `feature_importances_` attribute.

Among the machine learning algorithms we have learned so far, both the Decision Tree and the Random Forest models have a `feature_importances_` attribute.
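As a small illustration of the same attribute on a single tree, the sketch below fits a decision tree to the original Iris data and keeps the features whose importance is at least the mean importance. The mean cutoff is an assumption chosen for illustration, not a rule from this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()

# Fit a single decision tree; importances are computed during training
tree_demo = DecisionTreeClassifier(random_state=23)
tree_demo.fit(iris_demo.data, iris_demo.target)

tree_importances = tree_demo.feature_importances_

# Illustrative cutoff: keep features at or above the mean importance
keep_mask = tree_importances >= tree_importances.mean()

for name, val, keep in zip(iris_demo.feature_names, tree_importances, keep_mask):
    status = 'keep' if keep else 'drop'
    print(f'{name:>18}: {val:.2%} ({status})')
```

The importances are normalized to sum to one, so each value can be read as that feature's share of the tree's total impurity reduction.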

-----

[rfc]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [12]:
from sklearn.ensemble import RandomForestClassifier

# Build model
rfc = RandomForestClassifier(random_state=23)
rfc.fit(features_pn, labels)

# Display scores for features and noise
print(f'{"Label":18s}: Importance')
print(26*'-')
for val, name in sorted(zip(rfc.feature_importances_, f_names), 
                        key=lambda x: x[0], reverse=True):
    print(f'{name:>18}: {val:.2%}')

Label             : Importance
--------------------------
  petal width (cm): 31.63%
 petal length (cm): 28.89%
 sepal length (cm): 12.09%
  sepal width (cm): 7.35%
          Noise 03: 3.44%
          Noise 04: 2.63%
          Noise 09: 2.50%
          Noise 02: 2.00%
          Noise 00: 1.95%
          Noise 05: 1.76%
          Noise 01: 1.64%
          Noise 07: 1.62%
          Noise 08: 1.26%
          Noise 06: 1.23%


-----

We now use a random forest classifier to determine feature importance in the adult income data.

-----
[sfm]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [13]:
rfc = RandomForestClassifier(random_state=23)
rfc.fit(adult_features, adult_label)

# Display scores for features
print(f'{"Label":18s}: Importance')
print(26*'-')
for val, name in sorted(zip(rfc.feature_importances_, adult_features.columns), 
                        key=lambda x: x[0], reverse=True):
    print(f'{name:>18}: {val:.2%}')

Label             : Importance
--------------------------
            FNLWGT: 14.72%
               Age: 14.20%
       CapitalGain: 11.84%
    EducationLevel: 10.30%
      Relationship: 9.54%
      HoursPerWeek: 8.55%
        Occupation: 7.34%
     MaritalStatus: 7.33%
         Workclass: 4.31%
         Education: 3.83%
       CapitalLoss: 3.15%
     NativeCountry: 1.86%
               Sex: 1.52%
              Race: 1.51%


-----

<font color='red' size = '5'> Student Exercise </font>

Now that you have run the previous cells, try making changes to the notebook:

1. Try changing some hyperparameters (e.g., `max_depth`) for the RFC estimator. How do these changes affect the feature importance?



-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. Series of blog articles on feature selection in Python: [Part I][2a], [Part II][2b], [Part III][2c], and [Part IV][2d]
2. An introduction to [feature selection][3]


-----

[1]: http://adataanalyst.com/machine-learning/comprehensive-guide-feature-engineering/
[2a]: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
[2b]: http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
[2c]: http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
[2d]: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/

[3]: https://machinelearningmastery.com/an-introduction-to-feature-selection/

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode