# Dimensionality Reduction Using Feature Selection

**Feature selection** means keeping high-quality, informative features and dropping less useful ones. There are three types of feature selection methods:

Filter : Select the best features by examining their statistical properties.

Wrapper : Use trial and error to find the subset of features that produce models with the highest quality prediction.

Embedded : Select the best feature subset as part or as an extension of a learning algorithm's training process.

## Thresholding Numerical Feature Variance

You have a set of numerical features and want to remove those with low variance (i.e., those likely to contain little information).

The motivation is that features with low variance are likely less interesting (and less useful) than features with high variance. The first step is to calculate the variance of each feature; the second is to drop every feature whose variance does not meet the selected threshold.

Things to keep in mind: 

1-. The variance is not centered; it is expressed in the squared units of the feature itself. Variance thresholding therefore does not work when features are measured in different units (e.g., time and money).

2-. The variance threshold is selected manually, either by our own judgement or by using a model selection technique.
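
To make the first point concrete, here is a minimal sketch (the duration values are made up for illustration) showing that the same measurements have very different variances depending on the unit. This is why a single threshold is hard to compare across features measured in different units:

```python
import numpy as np

# Hypothetical durations, expressed in minutes and in seconds
durations_min = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
durations_sec = durations_min * 60

var_min = durations_min.var()  # variance in minutes squared
var_sec = durations_sec.var()  # variance in seconds squared

# Rescaling a feature by a factor c multiplies its variance by c squared
print(var_min, var_sec, var_sec / var_min)
```

The information in the feature is identical in both cases; only the unit changed, yet the variance grows by a factor of 60 squared.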

In [143]:
# Load Libraries

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

In [144]:
# Load Data

iris = datasets.load_iris()

In [145]:
# Create feature and target

features = iris.data
target = iris.target

In [146]:
# Create Thresholder

thresholder = VarianceThreshold(threshold=.5)

In [147]:
# Create high variance feature matrix

features_high_variance = thresholder.fit_transform(features)

In [148]:
# View high variance feature matrix

features_high_variance[0:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

In [149]:
# View variances
thresholder.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])

Finally, if the features have been standardized (mean = 0, variance = 1), variance thresholding will not work: every feature then has the same variance, so no threshold can separate them.

In [150]:
# Load library

from sklearn.preprocessing import StandardScaler

In [151]:
# Standardize feature matrix

scaler = StandardScaler()
features_std = scaler.fit_transform(features)

In [152]:
# Calculate the variance of each feature

selector = VarianceThreshold()
selector.fit(features_std).variances_

array([1., 1., 1., 1.])

## Thresholding Binary Feature Variance

You have a set of binary categorical features and want to remove those with low variance (i.e., those likely to contain little information).

The solution is to select the subset of features whose Bernoulli random variable variance is above a given threshold.

In [153]:
# Load library 

from sklearn.feature_selection import VarianceThreshold

In [154]:
# Create feature matrix with: 

#Feature 0 : 80% class 0
#Feature 1 : 80% class 1 
#Feature 2 : 60% class 0, 40 % class 1 

features = [[0,1,0],[0,1,1],[0,1,0],[0,1,1],[1,0,0]]

In [155]:
# Run threshold by variance 
thresholder = VarianceThreshold(threshold=(.75 * (1- .75)))
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])

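#### Discussion

One strategy for selecting highly informative binary categorical features is to examine their variances. For a binary feature (a Bernoulli random variable) the variance is Var(x) = p(1 - p), where p is the proportion of observations of class 1. Therefore, by setting p we can remove features where the vast majority of observations belong to one class. In the example above, p = 0.75 gives a threshold of 0.75 * 0.25 = 0.1875, which removes features 0 and 1 (variance 0.8 * 0.2 = 0.16) and keeps feature 2 (variance 0.6 * 0.4 = 0.24).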
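
As a rough check of the formula, the sketch below recomputes the Bernoulli variance by hand with NumPy for the same binary feature matrix used in the example above. The variable names (`p`, `bernoulli_var`, `keep`) are illustrative and not part of scikit-learn:

```python
import numpy as np

# Same binary feature matrix as in the example above
features = np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0]])

p = features.mean(axis=0)      # proportion of class 1 in each feature
bernoulli_var = p * (1 - p)    # Var(x) = p(1 - p)
threshold = 0.75 * (1 - 0.75)  # same threshold as above

keep = bernoulli_var > threshold  # True for features that pass the threshold
print(bernoulli_var, keep)
```

Only the third feature exceeds the threshold, which agrees with the single column returned by the thresholder.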

## Handling Highly Correlated Features

We suspect that some features in the feature matrix are highly correlated. We can check this with a correlation matrix and, for each highly correlated pair, consider dropping one of the two features.

In [156]:
# Load libraries

import pandas as pd
import numpy as np

In [157]:
# Create a feature matrix with two highly correlated features

features = np.array([[1,1,1],[2,2,0],[3,3,1],[4,4,0],[5,5,1],[6,6,0],[7,7,1],[8,7,0],[9,7,1]])

In [158]:
# Convert feature matrix into DataFrame
df = pd.DataFrame(features)

In [159]:
# Create Correlation Matrix

corr_matrix = df.corr().abs()
print("CORRELATION MATRIX : " ) 
print(corr_matrix)

CORRELATION MATRIX : 
          0         1         2
0  1.000000  0.976103  0.000000
1  0.976103  1.000000  0.034503
2  0.000000  0.034503  1.000000


In [160]:
# Look at the upper triangle of the correlation matrix to identify pairs of highly correlated features.

upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
print("UPPER TRIANGLE : " ) 
print(upper)

UPPER TRIANGLE : 
    0         1         2
0 NaN  0.976103  0.000000
1 NaN       NaN  0.034503
2 NaN       NaN       NaN


In [161]:
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [162]:
# Remove one correlated feature
df_final = df.drop(df.columns[to_drop], axis=1).head()
print("FINAL DATAFRAME: ")
print(df_final)

FINAL DATAFRAME: 
   0  2
0  1  1
1  2  0
2  3  1
3  4  0
4  5  1
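
The steps above can be gathered into a small reusable helper. This is only a sketch: the function name `drop_highly_correlated` and its `threshold` parameter are hypothetical and not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr_matrix = df.corr().abs()
    # Keep only the upper triangle so each pair of features is examined once
    mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    upper = corr_matrix.where(mask)
    to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
    return df.drop(columns=to_drop)

features = np.array([[1, 1, 1], [2, 2, 0], [3, 3, 1], [4, 4, 0], [5, 5, 1],
                     [6, 6, 0], [7, 7, 1], [8, 7, 0], [9, 7, 1]])
df_reduced = drop_highly_correlated(pd.DataFrame(features))
print(list(df_reduced.columns))
```

Note that the choice of which feature in a pair to drop is arbitrary here: the helper always keeps the one that appears first.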


# Removing Irrelevant Features for Classification

You have a categorical target vector and want to remove uninformative features.

When the features are categorical, a good solution is to calculate a chi-square statistic between each feature and the target vector.

## Chi-Square Statistic

The chi-square statistic examines the independence of two categorical vectors.

It represents the difference between the observed number of observations in each class of a categorical feature and the number we would expect if that feature were independent of the target vector (i.e., no relationship).

By calculating the chi-square statistic between a feature and the target vector, we obtain a measurement of the independence between the two.

If the target is independent of the feature, the feature is irrelevant for our purposes because it contains no useful information for classification. On the other hand, if the two variables are highly dependent, the feature is likely very informative for training our model.

The chi-square statistic can only be calculated between two categorical vectors, and all values need to be non-negative.
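
To illustrate the "observed vs. expected" idea, here is a minimal sketch of the textbook Pearson chi-square statistic on a contingency table, using made-up binary vectors. It is meant as an illustration only; the scores reported by scikit-learn's `chi2` are computed from per-class feature sums and may differ from this hand calculation.

```python
import numpy as np

# Hypothetical binary feature and binary target (illustrative values)
feature = np.array([0, 0, 0, 0, 1, 1, 1, 1])
target = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Observed counts: rows are feature classes, columns are target classes
observed = np.zeros((2, 2))
np.add.at(observed, (feature, target), 1)

# Expected counts if feature and target were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Sum of squared differences, scaled by the expected counts
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(observed, expected, chi2_stat)
```

The larger the statistic, the further the observed counts are from what independence would predict.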

In [163]:
# Load libraries

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

In [164]:
# Load data

iris = load_iris()
features = iris.data
target = iris.target

In [165]:
# Convert to categorical data by converting data to integers

features = features.astype(int)

1-. We calculate the chi-square statistic between each feature and the target vector.

2-. SelectKBest keeps the features with the best statistics (k = number of features we want to keep).


In [166]:
# Select two features with highest chi-squared statistics

chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features,target)
features_kbest[0:3]

array([[1, 0],
       [1, 0],
       [1, 0]])

In [167]:
# Show Result

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


## ANOVA F-Value

If the features are quantitative, we can calculate the ANOVA F-value between each feature and the target vector. In scikit-learn, f_classif computes this statistic.

The F-value is a comparison of means: it tells us whether the mean of a feature differs significantly between the groups defined by the target (e.g., the mean score for women vs. the mean score for men).

H0: The group means are similar.

H1: The group means are different.

If we fail to reject H0, the feature does not help to predict the target.

If we reject H0 in favor of H1, the feature is useful.
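
The mean comparison can be sketched by hand for one feature and two groups. The scores below are invented for illustration, and `anova_f` is a hypothetical helper, not a scikit-learn function; it computes the usual one-way ANOVA ratio of between-group to within-group variability.

```python
import numpy as np

def anova_f(x, y):
    """One-way ANOVA F-value of feature x across the groups given by y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Variability of the group means around the grand mean
    ss_between = sum(len(x[y == c]) * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    # Variability of the observations around their own group mean
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores: group 0 (e.g., women) and group 1 (e.g., men)
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
groups = np.array([0, 0, 0, 1, 1, 1])

f_value = anova_f(scores, groups)
print(f_value)
```

A large F-value means the group means are far apart relative to the spread inside each group, which is evidence against H0.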

In [168]:
# Select two features with highest F-Values

fvalue_selector = SelectKBest(f_classif, k = 2)
features_kbest = fvalue_selector.fit_transform(features,target)

In [169]:
# Show Result

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


## Percentile Selection

Instead of selecting a specific number of features we can also select the top n percent of features.

In [170]:
# Load Library

from sklearn.feature_selection import SelectPercentile

In [171]:
# Select top 75% of features with highest F-Values

fvalue_selector = SelectPercentile(f_classif, percentile = 75) # Percentile 67 is the edge, 3 features detected
features_kbest = fvalue_selector.fit_transform(features, target)

In [172]:
# Show Result

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 3
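
Beyond counting the features that remain, it can be useful to see which ones were kept and how they scored. The sketch below assumes the original (non-truncated) iris measurements rather than the integer-converted matrix used above, and relies on the fitted selector's `scores_` attribute and `get_support()` method:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

iris = load_iris()

# Keep the top 75% of features ranked by ANOVA F-value
selector = SelectPercentile(f_classif, percentile=75)
selector.fit(iris.data, iris.target)

support = selector.get_support()  # boolean mask of the kept features
for name, score, kept in zip(iris.feature_names, selector.scores_, support):
    print(name, round(score, 2), "kept" if kept else "dropped")
```

The same two members are available on a fitted SelectKBest selector.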


# Recursively Eliminating Features

You want to automatically select the best features to keep.

RFE: Recursive Feature Elimination.

RFECV: Conducts RFE using cross-validation.

The idea is to train a model repeatedly, removing a feature each time, until model performance becomes worse. The remaining features are the best.

RFE: Repeatedly train a model that has parameters (weight coefficients), such as a linear regression or a support vector machine. The first time we train the model we include all the features. Then we find the feature with the smallest absolute parameter (assuming the features are standardized, so the parameters are comparable) and remove it, because it is the least important.
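
The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it stops at a fixed number of features (`n_to_keep`, a hypothetical parameter) instead of using cross-validation to decide when to stop, and it uses a small synthetic dataset with a fixed `random_state` so the result is reproducible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small synthetic problem; coef=True also returns the true coefficients
X, y, true_coef = make_regression(n_samples=200, n_features=10, n_informative=2,
                                  coef=True, random_state=0)

n_to_keep = 2                        # assumed stopping point for this sketch
remaining = list(range(X.shape[1]))  # indices of the features still in the model

while len(remaining) > n_to_keep:
    model = LinearRegression().fit(X[:, remaining], y)
    # Position (within remaining) of the feature with the smallest absolute weight
    weakest = int(np.argmin(np.abs(model.coef_)))
    remaining.pop(weakest)

print(sorted(remaining), np.flatnonzero(true_coef))
```

Because the synthetic target has no noise, the two surviving features should be exactly the informative ones.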

In [173]:
# Load Libraries

import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model

In [174]:
# Suppress an annoying but harmless warning

warnings.filterwarnings(action = "ignore", module = "scipy", message = "^internal gelsd")

In [175]:
# Generate feature matrix and target vector

features, target = make_regression(n_samples = 10000, n_features = 100, n_informative = 2)

In [176]:
# Create a linear regression

ols = linear_model.LinearRegression()

In [181]:
# Recursively eliminate features

# estimator: the model to fit (here ols; an SVM would also work)
# step: number of features removed per iteration
# scoring: quality metric used during cross-validation
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)

array([[-0.44765157,  0.49282359,  1.2183224 ,  1.01099846],
       [-1.67831485, -0.24767319,  0.37487546,  1.27089437],
       [ 1.63634543, -1.11006511,  1.13698511, -0.93085913],
       ...,
       [ 0.06514231, -0.1330187 , -0.81817928, -1.19043446],
       [-0.29483262, -0.98992622, -0.50707802, -0.21564811],
       [ 0.65336141,  1.21830774,  1.36632941,  0.11158976]])

In [182]:
# Once we have conducted RFE we can see the number of features we should keep:
rfecv.n_features_

4

In [183]:
# We can also see which of those features we should keep:

rfecv.support_

array([False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False])

In [184]:
# We can also view the ranking of the features (1 = best, larger = worse):

rfecv.ranking_

array([69, 82, 34, 43, 94, 39, 25, 65, 23,  1, 41,  6,  2, 76, 59, 53,  5,
       36, 19, 14, 46, 57, 81, 31, 55, 72,  7,  1, 83, 20, 90, 74, 64, 67,
       30, 37, 11, 35, 61, 50, 47, 17, 18, 40, 70, 44,  1,  8, 95, 97, 16,
       79, 22,  4, 56, 28, 27, 73, 88, 12, 24, 51, 58, 26, 85, 21, 92, 77,
       66, 87, 86, 33, 75, 29, 13, 10, 63, 49, 84, 48, 38, 68, 15, 32, 78,
        9, 60, 54, 62, 71, 45,  1, 96, 52,  3, 91, 80, 89, 42, 93])