## 1 Thresholding Numerical Feature Variance
### Problem
You have a set of numerical features and want to remove those with low variance (i.e.,
likely containing little information).
### Solution
Select a subset of features with variances above a given threshold:

In [1]:
# Load libraries
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# Load the iris dataset
iris = datasets.load_iris()
# Create features and target
features = iris.data
target = iris.target
# Create thresholder
thresholder = VarianceThreshold(threshold=.5)
# Create high variance feature matrix
features_high_variance = thresholder.fit_transform(features)
# View high variance feature matrix
features_high_variance[0:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

In [2]:
# View variances
thresholder.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])

Finally, if the features have been standardized (to mean zero and unit variance), then
variance thresholding will not work correctly: every feature then has a variance of 1,
so no threshold can distinguish between them.
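The effect of standardization can be checked directly. The following is a minimal sketch that assumes the same iris features as above and uses scikit-learn's `StandardScaler`; the variable names `features_std` and `variances_std` are illustrative only.

```python
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Load the iris features
features = datasets.load_iris().data

# Standardize each feature to mean zero and unit variance
features_std = StandardScaler().fit_transform(features)

# Fit a thresholder with the default threshold and inspect the variances
selector = VarianceThreshold()
variances_std = selector.fit(features_std).variances_

# Every variance is (numerically) 1, so thresholding cannot rank the features
print(np.round(variances_std, 6))
```

If thresholding is part of a pipeline that also standardizes, it would need to run before the scaling step for the threshold to be meaningful.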

## 2 Thresholding Binary Feature Variance
### Problem
You have a set of binary categorical features and want to remove those with low variance (i.e., likely containing little information).
### Solution
Select a subset of features with a Bernoulli random variable variance above a given
threshold:

In [3]:
# Load library
from sklearn.feature_selection import VarianceThreshold
# Create feature matrix with:
# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]
# Run threshold by variance
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])

For a binary feature (a Bernoulli random variable), the variance is

$$\operatorname{Var}(x) = p(1 - p)$$

where $p$ is the proportion of observations of class 1. Therefore, by setting $p$, we can
remove features where the vast majority of observations are one class.
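To see how the threshold relates to the data, the Bernoulli variance of each column can be computed by hand. This is a small sketch using NumPy on the same feature matrix as the solution; the names `p`, `bernoulli_variances`, and `keep` are illustrative.

```python
import numpy as np

# Same binary feature matrix as in the solution
features = np.array([[0, 1, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [0, 1, 1],
                     [1, 0, 0]])

# Proportion of observations of class 1 in each column
p = features.mean(axis=0)

# Bernoulli variance p(1 - p) for each column
bernoulli_variances = p * (1 - p)

# Threshold corresponding to a 75% / 25% class split
threshold = 0.75 * (1 - 0.75)

# Keep only the columns whose variance exceeds the threshold
keep = bernoulli_variances > threshold

print("p:", p)
print("variances:", bernoulli_variances)
print("threshold:", threshold)
print("keep:", keep)
```

The first two features have an 80/20 split, so their variance (about 0.16) falls below the threshold of 0.1875 and only the third feature survives, which matches the single column returned by `VarianceThreshold` above.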

## 3 Handling Highly Correlated Features
### Problem
You have a feature matrix and suspect some features are highly correlated.
### Solution
Use a correlation matrix to check for highly correlated features. If highly correlated
features exist, consider dropping one of the correlated features:

In [4]:
# Load libraries
import pandas as pd
import numpy as np
# Create feature matrix with two highly correlated features
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])
# Convert feature matrix into DataFrame
dataframe = pd.DataFrame(features)
# Create correlation matrix
corr_matrix = dataframe.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape),
                                  k=1).astype(bool))
# Find the labels of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
dataframe.drop(columns=to_drop)

   0  2
0  1  1
1  2  0
2  3  1
3  4  0
4  5  1
5  6  0
6  7  1
7  8  0
8  9  1
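The steps in this recipe can be wrapped in a reusable function. The sketch below is one possible arrangement, not a library API: the function name `drop_highly_correlated` and its `threshold` parameter are hypothetical. For each pair of columns whose absolute correlation exceeds the threshold, it drops the later column of the pair.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(dataframe, threshold=0.95):
    """Return a copy of dataframe without highly correlated columns.

    Hypothetical helper: for each pair of columns whose absolute
    correlation exceeds threshold, the later column is dropped.
    """
    # Absolute pairwise correlations
    corr_matrix = dataframe.corr().abs()
    # Boolean mask selecting the strict upper triangle (each pair once)
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    # Column labels that correlate too strongly with an earlier column
    to_drop = [column for column in upper.columns
               if (upper[column] > threshold).any()]
    return dataframe.drop(columns=to_drop), to_drop


# Same feature matrix as in the solution
features = np.array([[1, 1, 1], [2, 2, 0], [3, 3, 1],
                     [4, 4, 0], [5, 5, 1], [6, 6, 0],
                     [7, 7, 1], [8, 7, 0], [9, 7, 1]])
reduced, dropped = drop_highly_correlated(pd.DataFrame(features))
print("Dropped columns:", dropped)
print(reduced.head(3))
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.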


## 4 Removing Irrelevant Features for Classification
### Problem
You have a categorical target vector and want to remove uninformative features.
### Solution
If the features are categorical, calculate a chi-square (χ²) statistic between each feature
and the target vector:

In [5]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
# Load data
iris = load_iris()
features = iris.data
target = iris.target
# Convert to categorical data by converting data to integers
features = features.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2
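The output above only reports how many features were kept. To see which ones were selected and how each feature scored, the fitted selector's `scores_` attribute and `get_support` method can be inspected. This is a minimal sketch that repeats the setup above; `selected` and `selected_names` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load data and cast to integers, as in the solution
iris = load_iris()
features = iris.data.astype(int)
target = iris.target

# Fit the selector without transforming, so it can be inspected
chi2_selector = SelectKBest(chi2, k=2).fit(features, target)

# Chi-squared statistic for every feature
scores = chi2_selector.scores_

# Column indices and names of the features that were kept
selected = chi2_selector.get_support(indices=True)
selected_names = [iris.feature_names[i] for i in selected]

for name, score in zip(iris.feature_names, scores):
    print(f"{name}: {score:.2f}")
print("Selected:", selected_names)
```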


If the features are quantitative, compute the ANOVA F-value between each feature
and the target vector:

In [6]:
# Select two features with highest F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


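Instead of selecting a fixed number of features, use SelectPercentile to keep the top
n percent of features:

In [7]: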
# Load library
from sklearn.feature_selection import SelectPercentile
# Select top 75% of features with highest F-values
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 3


## 5 Recursively Eliminating Features
### Problem
You want to automatically select the best features to keep.
### Solution
Use scikit-learn’s RFECV to conduct recursive feature elimination (RFE) using
cross-validation (CV). That is, repeatedly train a model, each time removing a feature, until
model performance (e.g., accuracy) becomes worse. The remaining features are the
best:

In [8]:
# Load libraries
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import linear_model
# Suppress an annoying but harmless warning
warnings.filterwarnings(action="ignore", module="scipy",
                        message="^internal gelsd")
# Generate features matrix and target vector
features, target = make_regression(n_samples = 10000,
                                   n_features = 100,
                                   n_informative = 2,
                                   random_state = 1)
# Create a linear regression
ols = linear_model.LinearRegression()
# Recursively eliminate features
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)

array([[ 0.00850799, -0.28547464,  0.7031277 ],
       [-1.07500204, -0.8689623 ,  2.56148527],
       [ 1.37940721, -0.14714771, -1.77039484],
       ...,
       [-0.80331656, -1.030216  , -1.60648007],
       [ 0.39508844, -0.91553464, -1.34564911],
       [-0.55383035, -0.69804472,  0.82880112]])

In [9]:
# Number of best features
rfecv.n_features_

3

In [10]:
# Rank features best (1) to worst
rfecv.ranking_


array([80, 98, 39, 86, 31,  1,  8, 74, 58, 33, 30, 32,  4, 20, 13, 71, 24,
       12, 96, 60, 62, 26, 23,  9, 36, 82, 19, 84, 10, 21, 17, 61,  1, 73,
       16, 14, 65, 63, 11,  1, 40, 92, 37, 91, 94, 85, 29, 15, 67, 57, 66,
       56, 83, 18, 46, 38, 41, 45, 54, 68, 89, 59, 81, 75, 42, 78, 88, 22,
       25,  2, 44, 34,  6, 48, 50,  5, 72, 93, 77, 43, 76, 27,  7, 64, 69,
       70,  3, 97, 87, 90, 47, 53, 95, 49, 28, 55, 52, 51, 35, 79])
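The ranking array is hard to read at a glance. The selected columns can be recovered more directly from the fitted selector's `support_` mask or `get_support(indices=True)`. The sketch below assumes a smaller synthetic dataset than the one above (500 samples, 20 features) purely to keep it quick to run, so the selected indices will differ from the output shown in this recipe.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Smaller synthetic problem (an assumption made for speed)
features, target = make_regression(n_samples=500,
                                   n_features=20,
                                   n_informative=2,
                                   random_state=1)

# Recursively eliminate features with cross-validation
rfecv = RFECV(estimator=LinearRegression(), step=1,
              scoring="neg_mean_squared_error")
rfecv.fit(features, target)

# Boolean mask and column indices of the features that were kept
selected_mask = rfecv.support_
selected_indices = rfecv.get_support(indices=True)

print("Number of selected features:", rfecv.n_features_)
print("Selected column indices:", selected_indices)
```

Every selected feature has rank 1 in `ranking_`, so the mask and the ranking array carry the same information about which features were kept.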