# Feature Selection - Filter Method

* Variables are selected independently of the ML model; the selection relies only on the characteristics of the data.

## Pros

* Less computationally expensive.
* Well suited for a quick removal of irrelevant features.
* Model agnostic: the selected feature subset can be used with any ML model.

## Cons

* Usually lower predictive performance than wrapper methods.


## Steps in filter method

A filter method consists of two steps:

1. Rank features based on certain criteria.
2. Select the highest ranking features.

Each feature is ranked independently of the rest of the feature space. Because the relationships between features are not considered, the selected subset may contain redundant variables.
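The two steps can be sketched on synthetic data. This is a minimal illustration, not the notebook's dataset: the data comes from `make_classification`, the ANOVA F-score (`f_classif`) is an assumed ranking criterion, and `k = 3` is an arbitrary cutoff. A copy of feature 0 is appended to show that a duplicate receives the same score as the original, so univariate ranking alone cannot flag it as redundant.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
# Append an exact copy of feature 0 to mimic a redundant variable.
X = np.hstack([X, X[:, [0]]])

# Step 1: score every feature on its own against the target.
scores, _ = f_classif(X, y)

# Step 2: keep the k highest-scoring features (k is an arbitrary choice here).
k = 3
top_k = np.argsort(scores)[::-1][:k]

print("scores:", np.round(scores, 2))
print("selected feature indices:", top_k)
```

Feature 0 and its copy (index 6) get identical scores, which is why a separate duplicate or correlation check is normally run alongside the ranking.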

**Ranking criteria** rely on statistical tests and measures such as:

* Chi-square | Fisher score
* Univariate parametric tests
* Mutual information
* Variance
    * Constant features
    * Quasi-constant features
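Several of these criteria are available as scoring functions in scikit-learn. The sketch below computes them side by side on the bundled iris dataset, which is an assumption chosen only because it ships with scikit-learn and has non-negative features (a requirement of `chi2`). The ANOVA F-test (`f_classif`) stands in for the univariate parametric tests; the Fisher score is not part of `sklearn.feature_selection` and is left out.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

# Small bundled dataset, used here purely for illustration.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

chi2_scores, _ = chi2(X, y)        # chi-square statistic (needs non-negative X)
f_scores, _ = f_classif(X, y)      # ANOVA F-statistic (univariate parametric test)
mi_scores = mutual_info_classif(X, y, random_state=0)  # mutual information estimate

ranking = pd.DataFrame(
    {
        "chi2": chi2_scores,
        "anova_f": f_scores,
        "mutual_info": mi_scores,
        "variance": X.var(ddof=0),  # population variance per feature
    },
    index=X.columns,
)

# Higher is better for every column; sort by whichever criterion you rank on.
print(ranking.sort_values("mutual_info", ascending=False))
```

The criteria do not always agree on the order, so the choice of criterion is part of the method.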

## General usage

Filter methods help to remove:

1. Constant features
2. Quasi-constant features
3. Duplicated features, which may arise after one-hot encoding of categorical variables.
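A minimal sketch of all three removals on a toy DataFrame follows. The column names are hypothetical, and the `0.01` variance cutoff for "quasi-constant" is an assumed value, not one prescribed by this document.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "constant": [1] * 100,              # same value in every row
    "quasi_constant": [0] * 99 + [1],   # 99% of rows share one value
    "informative": list(range(100)),
    "informative_copy": list(range(100)),  # exact duplicate column
})

# Constant and quasi-constant features: variance not above a small threshold.
# threshold=0.01 is an assumed cutoff; tune it for your data.
selector = VarianceThreshold(threshold=0.01)
selector.fit(df)
low_variance = df.columns[~selector.get_support()].tolist()

# Duplicated features: transpose so columns become rows, then flag repeated rows.
reduced = df.drop(columns=low_variance)
dup_mask = reduced.T.duplicated().to_numpy()
duplicated = reduced.columns[dup_mask].tolist()

print(low_variance)  # ['constant', 'quasi_constant']
print(duplicated)    # ['informative_copy']
```

Transposing a wide DataFrame can be slow and memory-hungry, so on large datasets the duplicate check may need a different approach.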

### 1. Constant features:

* Constant features show just one value for all the observations of the dataset, i.e. the same value in every row.
* They provide no information that helps an ML model predict the target.
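As a quick check that also covers non-numeric columns, constant features can be found with pandas alone by counting distinct values per column. The DataFrame below is a made-up example with hypothetical column names.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "country": ["ES"] * 5,         # constant, non-numeric
    "flag": [0, 0, 0, 0, 0],       # constant, numeric
    "age": [23, 35, 41, 35, 52],   # varies
})

# A column is constant when it holds exactly one distinct value.
# dropna=False counts NaN as a value, so "all 1 except some NaN" is not constant.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

print(constant_cols)  # ['country', 'flag']
```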

#### Example 1 - Customer transaction prediction

Let's remove constant features. We start by loading a dataset and checking it for null data.

References:

* https://www.kaggle.com/raviprakash438/filter-method-feature-selection/notebook

In [38]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

"""
Load the first 10,000 rows of the dataset
"""
data = pd.read_csv("./_datasets_/customer-transaction.csv", nrows=10000)
print(data.shape)

"""
Check for missing data: collect the columns that contain at least one null value.
Nothing is removed here; an empty list means no column needs cleaning.
"""
null_cols = [col for col in data.columns if data[col].isnull().any()]

"""
Split the datasets into training set (70%) and test set (30%)
"""
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0
)

print(X_train.shape, X_test.shape)

(10000, 371)
(7000, 370) (3000, 370)


In [41]:
"""
- Sklearn's VarianceThreshold is a simple baseline approach for feature selection.
- It removes all features whose variance doesn't meet some threshold.
- By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

This feature selection algorithm looks only at the features (X), not the desired outputs (y).

Reference:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
"""
f_sel = VarianceThreshold(threshold=0)

f_sel.fit(X_train)

VarianceThreshold(threshold=0)

In [47]:
"""
- get_support() returns a boolean mask with one True/False value per feature.
- True: not a constant feature (it is kept)
- False: constant feature (it contains the same value in all samples)
"""
s = f_sel.get_support()
print(s)

"""
Find total number of constant and non-constant features
"""
import collections
print(collections.Counter(s))

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True False False
  True  True  True  True  True  True  True  True  True  True  True  True
  True False False  True  True  True  True  True False False  True  True
  True  True  True  True  True  True  True  True  True False False False
 False  True  True  True  True  True  True  True  True  True  True  True
 False False  True  True  True  True  True  True  True False  True  True
  True False False  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True False False  True  True  True
  True  True False False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
 False False False False  True  True  True  True  True  True  True  True
  True  True False False  True  True  True  True  True  True  True  True
 False  True  True  True  True  True False False  T

In [56]:
"""
- There are 86 features/columns with a constant value, i.e. the same value in all samples.
- Let's print the constant feature names.
"""
# Invert the support mask to pick the columns that were flagged as constant.
constCol = X_train.columns[~s].tolist()
print(constCol)

print(X_train['ind_var2_0'].value_counts())
print(X_train['ind_var13_medio'].value_counts())

['ind_var2_0', 'ind_var2', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var18_0', 'ind_var18', 'ind_var27_0', 'ind_var28_0', 'ind_var28', 'ind_var27', 'ind_var34_0', 'ind_var34', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var13_medio_0', 'num_var13_medio', 'num_var18_0', 'num_var18', 'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27', 'num_var34_0', 'num_var34', 'num_var41', 'num_var46_0', 'num_var46', 'saldo_var13_medio', 'saldo_var18', 'saldo_var28', 'saldo_var27', 'saldo_var34', 'saldo_var41', 'saldo_var46', 'delta_imp_amort_var18_1y3', 'delta_imp_amort_var34_1y3', 'delta_imp_reemb_var17_1y3', 'delta_imp_reemb_var33_1y3', 'delta_imp_trasp_var17_out_1y3', 'delta_imp_trasp_var33_out_1y3', 'delta_num_reemb_var17_1y3', 'delta_num_reemb_var33_1y3', 'delta_num_trasp_var17_out_1y3', 'delta_num_trasp_var33_out_1y3', 'imp_amort_var18_hace3', 'imp_amort_var18_ult1', 'imp_amort_var34_hace3', 'imp_amort_var34_ult1', 'imp_var7_emit_ult1', 'imp_reemb_var13_hace3', 'imp_reemb_var17_hace3', 

In [57]:
"""
Constant features play no role in predicting the result,
so we remove them from the training set and the test set.
"""
print('Shape before drop: ', X_train.shape, X_test.shape)

"""
transform() would also remove the constant columns from the training set and test set,
but we do not use it here because by default it returns a NumPy array instead of a DataFrame.
"""
# X_train = f_sel.transform(X_train)
# X_test = f_sel.transform(X_test)

# columns= already implies axis=1. Reassigning avoids modifying, in place,
# the frames returned by train_test_split.
X_train = X_train.drop(columns=constCol)
X_test = X_test.drop(columns=constCol)

print('Shape after drop: ', X_train.shape, X_test.shape)

Shape before drop:  (7000, 370) (3000, 370)
Shape after drop:  (7000, 284) (3000, 284)
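If `transform` is preferred over dropping columns by name, its NumPy output can be wrapped back into a DataFrame using the support mask. The sketch below uses small stand-in frames (`X_train_demo`, `X_test_demo`, hypothetical names) rather than the customer-transaction data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Small stand-ins for the real training and test sets.
X_train_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [7, 7, 7, 7], "c": [0, 1, 0, 1]})
X_test_demo = pd.DataFrame({"a": [5, 6], "b": [7, 7], "c": [1, 1]})

# Fit on the training set only, so the test set never influences the selection.
selector = VarianceThreshold(threshold=0).fit(X_train_demo)
kept = X_train_demo.columns[selector.get_support()]

# transform() returns a NumPy array; restore the column names and row index.
X_train_sel = pd.DataFrame(
    selector.transform(X_train_demo), columns=kept, index=X_train_demo.index
)
X_test_sel = pd.DataFrame(
    selector.transform(X_test_demo), columns=kept, index=X_test_demo.index
)

print(X_train_sel.columns.tolist())  # ['a', 'c']
print(X_test_sel.shape)              # (2, 2)
```

Column `c` happens to be constant in the stand-in test set but is still kept, because the decision is based on the training data alone.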
