# Feature Selection - Filter Methods - Constant, Quasi-Constant, and Duplicate Feature Removal

## Filtering method

Unnecessary and redundant features slow down training and can also hurt the predictive performance of a model.

There are several advantages of performing feature selection before training machine learning models:

 - Models with fewer features are easier to explain.
 - Machine learning models with fewer features are easier to implement.
 - Fewer features can improve generalization by reducing overfitting.
 - Feature selection removes redundant data.
 - Models with fewer features usually take less time to train.
 - Models with fewer features are less prone to errors.

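## What is a filter method?
Filter methods select features based on statistical properties of the data, independently of any machine learning model. Features selected using filter methods can therefore be used as input to any machine learning model.

 - Univariate -> Fisher score, mutual information, variance, etc.
 - Multivariate -> Pearson correlation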

Univariate filter methods rank individual features according to a specific criterion and then select the top N features. Common ranking criteria include the Fisher score, mutual information, and the variance of the feature.
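
As a minimal sketch of univariate ranking, the snippet below scores each feature on its own and keeps the single best one. It assumes scikit-learn's `SelectKBest` with the ANOVA F-score (`f_classif`) as the ranking criterion, which stands in here for the Fisher score. The data is synthetic, and the names `demo`, `target`, `signal`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic binary target and three candidate features.
target = rng.integers(0, 2, size=200)
demo = pd.DataFrame(
    {
        "signal": target + 0.1 * rng.normal(size=200),  # closely tracks the target
        "noise_a": rng.normal(size=200),  # unrelated to the target
        "noise_b": rng.normal(size=200),  # unrelated to the target
    }
)

# Score every feature independently and keep the top N (here N = 1).
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(demo, target)

top_features = list(demo.columns[selector.get_support()])
print(top_features)
```

Because each feature is scored in isolation, this kind of ranking cannot tell that two high-scoring features carry the same information.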

Multivariate filter methods are capable of removing redundant features from the data since they take the mutual relationship between the features into account.
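
The following sketch illustrates one possible multivariate filter based on the Pearson correlation: for every pair of highly correlated features, one of the two is dropped. The 0.9 cutoff, the synthetic data, and the names `demo`, `to_drop`, and `demo_reduced` are assumptions made for this example, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is independent.
a = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + 0.01 * rng.normal(size=200),
        "c": rng.normal(size=200),
    }
)

# Absolute Pearson correlation between every pair of features.
corr = demo.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a feature if it is highly correlated with an earlier feature.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
demo_reduced = demo.drop(columns=to_drop)
print(to_drop)
```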

## Filter methods covered in this lesson
 - Constant Removal
 - Quasi Constant Removal
 - Duplicate Feature Removal

Download the data files: https://github.com/laxmimerit/Data-Files-for-Feature-Selection

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
from sklearn.ensemble import RandomForestClassifier

# VarianceThreshold - Feature selector that removes all low-variance features.
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("data/santander.csv", nrows=20000)
data.head()

In [None]:
x = data.drop("TARGET", axis=1)  # Features
y = data["TARGET"]  # Outcome

x.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

## Constant Feature Removal

In [None]:
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(x_train)

In [None]:
# Number of features remaining after removing the constant features
constant_filter.get_support().sum()

In [None]:
# get_support() is True for the features that are kept.
# Inverting it gives True for every constant feature.
constant_list = [not temp for temp in constant_filter.get_support()]
constant_list

In [None]:
# Names of all the constant features (the filter was fitted on x_train)
x_train.columns[constant_list]
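
The mask inversion above can also be written with NumPy's `~` operator, since `get_support()` returns a boolean array. A small sketch on a toy DataFrame (the names `toy`, `const`, and `varied` are hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data: "const" never changes, "varied" does.
toy = pd.DataFrame({"const": [1, 1, 1, 1], "varied": [0, 1, 2, 3]})

toy_filter = VarianceThreshold(threshold=0).fit(toy)

# True for kept features; ~ flips it to True for the removed (constant) ones.
keep_mask = toy_filter.get_support()
constant_columns = list(toy.columns[~keep_mask])
print(constant_columns)
```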

In [None]:
# Remove all constant features from the training and test sets.
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
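
Note that `transform` returns a NumPy array here, so the column names are lost. If the names are needed later, one option is to wrap the result in a DataFrame using the support mask. This is a sketch on toy data; `toy`, `toy_filter`, and `toy_filtered` are hypothetical names.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({"const": [5, 5, 5], "f1": [1, 2, 3], "f2": [0, 1, 0]})
toy_filter = VarianceThreshold(threshold=0).fit(toy)

# Rebuild a DataFrame, reusing the names of the columns that were kept.
toy_filtered = pd.DataFrame(
    toy_filter.transform(toy),
    columns=toy.columns[toy_filter.get_support()],
    index=toy.index,
)
print(toy_filtered.columns.tolist())
```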

In [None]:
# Compare the shapes of the original data and the data after removing the constant features
x_train.shape, x_test.shape, x_train_filter.shape, x_test_filter.shape

## Quasi-Constant Feature Removal

In [None]:
quasi_constant_filter = VarianceThreshold(threshold=0.01)

In [None]:
quasi_constant_filter.fit(x_train_filter)
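
To get a feel for what `threshold=0.01` means: a binary feature in which a fraction p of the values is 1 has variance p(1 - p), so a threshold of 0.01 removes binary features where roughly 99% or more of the rows share the same value. The toy columns `rare` and `common` below are made up to illustrate this. Keep in mind that variance depends on the scale of a feature, so the same threshold behaves differently for non-binary columns.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame(
    {
        "rare": [1] + [0] * 199,  # p = 0.005 -> variance = 0.005 * 0.995
        "common": [1] * 20 + [0] * 180,  # p = 0.1 -> variance = 0.1 * 0.9
    }
)

toy_filter = VarianceThreshold(threshold=0.01).fit(toy)

# Variances computed by the filter, and the features whose variance exceeds 0.01.
toy_variances = dict(zip(toy.columns, toy_filter.variances_))
kept = list(toy.columns[toy_filter.get_support()])
print(toy_variances, kept)
```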

In [None]:
quasi_constant_filter.get_support().sum()

In [None]:
x_train_quasi_filter = quasi_constant_filter.transform(x_train_filter)
x_test_quasi_filter = quasi_constant_filter.transform(x_test_filter)

In [None]:
# Compare the shapes of the original data, the data after removing constants,
# and the data after removing quasi-constants
x_train.shape, x_test.shape, x_train_filter.shape, x_test_filter.shape, x_train_quasi_filter.shape, x_test_quasi_filter.shape

## Duplicate Feature Removal

In [None]:
x_train_T = x_train_quasi_filter.T
x_test_T = x_test_quasi_filter.T

In [None]:
# The data is a NumPy array at this point, because VarianceThreshold.transform returned an array rather than a pandas DataFrame.
type(x_train_T)

In [None]:
# Convert the NumPy arrays back to pandas DataFrames so that duplicated() can be used
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)

In [None]:
# After the transpose, the rows have become columns and the columns have become rows.
x_train_T.shape, x_test_T.shape

In [None]:
# Number of duplicated features
x_train_T.duplicated().sum()

In [None]:
duplicated_features = x_train_T.duplicated()
duplicated_features

# True marks a row (i.e. a feature) that duplicates an earlier row; False marks a unique row or a first occurrence.

In [None]:
# Build the mask of features to keep by inverting the duplicated flags:
# True becomes False and False becomes True.
features_to_keep = [not index for index in duplicated_features]
features_to_keep

In [None]:
# Final datasets after removing constant, quasi-constant, and duplicate features.
# The mask computed on the training data is applied to both sets,
# and each result is transposed back to the original orientation.
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
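
The transpose trick is easier to follow on a tiny example. In the sketch below (toy data, hypothetical names), column `b` is an exact copy of column `a`, so it is flagged once the columns have been turned into rows:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [3, 2, 1]})

# After transposing, each feature is a row, so duplicated() compares features.
is_duplicate = toy.T.duplicated()

# Keep the first occurrence of each feature and drop the later copies.
toy_unique = toy.loc[:, ~is_duplicate]
print(toy_unique.columns.tolist())
```

`duplicated()` keeps the first occurrence by default, which is why `a` survives and `b` is dropped.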

In [None]:
x_train.shape, x_test.shape, x_train_unique.shape, x_test_unique.shape

## Build a Model and Compare Performance Before and After Removal

In [None]:
def run_random_forest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print("Accuracy on test set: ")
    print(accuracy_score(y_test, y_pred))

In [None]:
%%time
# Run on final data.
run_random_forest(x_train_unique, x_test_unique, y_train, y_test)

In [None]:
%%time
# Run on original data.
run_random_forest(x_train, x_test, y_train, y_test)

As the timings show, training takes less time after removing the constant, quasi-constant, and duplicate features than it does on the original data.

Removing constant, quasi-constant, and duplicate features does not degrade the accuracy; if anything, it improves it.
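
As an alternative to applying each filter by hand, the two variance-based steps could be chained with the classifier in a scikit-learn `Pipeline`, so that the filters are always fitted on the training data only. The sketch below uses synthetic data instead of the Santander file. The variable names and the small `n_estimators=10` are choices made for this example, and the duplicate-removal step is left out because scikit-learn has no built-in transformer for it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Five informative-looking features, one constant and one quasi-constant column.
informative = rng.normal(size=(300, 5))
constant_col = np.ones((300, 1))
rare_col = np.zeros((300, 1))
rare_col[0, 0] = 1.0
features = np.hstack([informative, constant_col, rare_col])
labels = (informative[:, 0] + informative[:, 1] > 0).astype(int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

pipe = Pipeline(
    [
        ("constant", VarianceThreshold(threshold=0)),
        ("quasi_constant", VarianceThreshold(threshold=0.01)),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)
pipe.fit(f_train, l_train)

# Number of features that reach the classifier, and the test accuracy.
n_kept = pipe[:-1].transform(f_test).shape[1]
score = pipe.score(f_test, l_test)
print(n_kept, score)
```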