# Feature Selection

In this notebook, we will go through some of the basic techniques for feature selection. This is a companion workbook for the 365 Data Science course on ML Process. This notebook focuses only on implementation. Check out the course or the documentation for in-depth explanations of each approach.

We will cover:

- Filter Methods/Uni-variate Selection Methods
- Wrapper Methods

## Import Libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load Data

We'll be working with a Bank Churn dataset. Churn Prediction is a common problem for **all** companies (not just tech companies). Having a healthy business is a result of minimizing churn as much as possible. 


In [10]:
df = pd.read_csv("./data/BankChurners.csv")
df.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,9.3e-05,0.99991
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,5.7e-05,0.99994
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,2.1e-05,0.99998
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,0.000134,0.99987
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,2.2e-05,0.99998


In [7]:
df.shape

(10127, 23)

## Pre-processing

The first thing we need to do is pre-process our dataset. Here, we need to turn `Attrition_Flag` into dummy variables of 0 and 1. Otherwise, we won't be able to make a proper prediction.

In [11]:
## Create the target variable
def make_target(df, column):
    # One-hot encode the given column and append the dummy columns to df
    target_dummies = pd.get_dummies(df[column])
    df = pd.concat([df, target_dummies], axis = 1)
    return df

## To keep things simple, we'll just use the int columns as the feature columns. 
def get_int_columns(df, dtype):
    features = []
    for col, t in zip(df.columns, list(df.dtypes)):
        if t == dtype:
            features.append(col)
    return features

df = make_target(df, column = 'Attrition_Flag')

target = 'Attrited Customer'
features = get_int_columns(df, dtype='int64')

y = df[target]
X = df[features]
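
The same preparation can be written more compactly. The following is a minimal sketch on a tiny made-up frame: the three rows are invented for illustration and only the column names come from the dataset above. It encodes the target with a boolean comparison and picks the integer columns with `select_dtypes`. Dropping `CLIENTNUM` is our assumption, not something the cell above does: it is an identifier rather than a behavioral feature, so you may prefer to exclude it.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, column names mirror the dataset.
toy = pd.DataFrame({
    "CLIENTNUM": [100000001, 100000002, 100000003],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

# 1 for churned customers, 0 otherwise
y_toy = (toy["Attrition_Flag"] == "Attrited Customer").astype(int)

# Keep integer columns only, then drop the identifier column (our assumption)
X_toy = toy.select_dtypes(include="int64").drop(columns=["CLIENTNUM"])

print(y_toy.tolist())
print(X_toy.columns.tolist())
```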

## Filter Methods (Univariate Feature Selection)

### Correlation/ANOVA

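The first method we'll go over is correlation/ANOVA. Here we simply select the features that have the highest absolute correlation with the target variable. Since we're predicting churn, the target is binary, so the Pearson correlation computed below is the point-biserial correlation. For a two-class target it ranks features the same way as a one-way ANOVA F-test, which is the usual way to relate a numeric feature to a categorical variable.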

We wrote our own implementation that you can use. Here the arguments can be defined as: 

`df`: dataframe input

`features`: list of features within the dataframe you want to evaluate

`target`: the target variable you want to run the correlation against

`threshold`: the correlation threshold. If you set it to 0.10, it'll return all features whose absolute correlation with the target is greater than 0.10.

In [12]:
correlation_threshold = 0.10

def correlation_selection(df,
                          features, 
                          target,
                          threshold):
    
    correlations = df[features + [target]].corr()[target]
    selected_features = correlations[abs(correlations)>threshold]
    
    remove_target = selected_features.index[selected_features.index != target]
    return selected_features[remove_target]

selected = correlation_selection(df,
                                 features,
                                 target,
                                 threshold = correlation_threshold)

print(selected)

Total_Relationship_Count   -0.150005
Months_Inactive_12_mon      0.152449
Contacts_Count_12_mon       0.204491
Total_Revolving_Bal        -0.263053
Total_Trans_Amt            -0.168598
Total_Trans_Ct             -0.371403
Name: Attrited Customer, dtype: float64
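
To make the link between correlation and ANOVA concrete, here is a small sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up and are not the bank dataset). With a two-class target, the ANOVA F-value that sklearn's `f_classif` reports should equal `r^2 * (n - 2) / (1 - r^2)`, where `r` is the Pearson correlation, so both measures rank features identically.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 500
y_demo = rng.integers(0, 2, size=n)   # binary target
X_demo = rng.normal(size=(n, 3))
X_demo[:, 0] += 1.0 * y_demo          # strongly related to the target
X_demo[:, 1] += 0.3 * y_demo          # weakly related; column 2 is pure noise

# Pearson correlation of each feature with the binary target
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(3)])

# One-way ANOVA F-value of each feature across the two classes
F, _ = f_classif(X_demo, y_demo)

# With two classes, F is a monotonic function of |r|
F_from_r = r**2 * (n - 2) / (1 - r**2)
print(np.allclose(F, F_from_r))
```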


### Chi-Squared, ANOVA, F-Test, Mutual Info Gain

In addition to correlation, there are a number of other univariate methods. Sklearn has an implementation called `SelectKBest`, which selects the k features with the highest scores. It supports several scoring functions:

`chi2`: Chi-squared statistic between each non-negative feature and a categorical target.

`f_regression`: F-statistic from a univariate linear regression of a continuous target on each feature.

`f_classif`: ANOVA F-value between each feature and a categorical target.

`r_regression`: Pearson correlation between each feature and the target. Similar to the previous cell.

`mutual_info_classif`: Mutual information for a discrete target.

`mutual_info_regression`: Mutual information for a continuous target.

In [13]:
from sklearn.feature_selection import (
    SelectKBest, 
    chi2, 
    f_classif, 
    f_regression,
    r_regression,
    mutual_info_classif,
    mutual_info_regression
)

kb = SelectKBest(chi2, k=4)
X_new = kb.fit_transform(X,y)
X_new = pd.DataFrame(X_new)
X_new.columns = kb.get_feature_names_out()

X_new

Unnamed: 0,CLIENTNUM,Total_Revolving_Bal,Total_Trans_Amt,Total_Trans_Ct
0,768805383,777,1144,42
1,818770008,864,1291,33
2,713982108,0,1887,20
3,769911858,2517,1171,20
4,709106358,0,816,28
...,...,...,...,...
10122,772366833,1851,15476,117
10123,710638233,2186,8764,69
10124,716506083,0,10291,60
10125,717406983,0,8395,62
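
Note that `CLIENTNUM`, an identifier, is among the columns `chi2` selected above. One likely reason is that sklearn's `chi2` treats feature values as counts, so its score grows with the magnitude of the feature. The sketch below illustrates this on synthetic data (made up, not the bank dataset): multiplying the features by 1000 multiplies the `chi2` scores by 1000, while the `f_classif` scores stay the same.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
X_counts = rng.integers(0, 50, size=(200, 3)).astype(float)  # non-negative features
y_bin = rng.integers(0, 2, size=200)                          # binary target

chi2_raw, _ = chi2(X_counts, y_bin)
chi2_scaled, _ = chi2(X_counts * 1000, y_bin)

f_raw, _ = f_classif(X_counts, y_bin)
f_scaled, _ = f_classif(X_counts * 1000, y_bin)

# chi2 scores depend on the feature's units; ANOVA F-values do not
print(np.allclose(chi2_scaled, 1000 * chi2_raw))
print(np.allclose(f_scaled, f_raw))
```

If scale sensitivity is a concern for your data, `f_classif` or `mutual_info_classif` may be a safer default score function for `SelectKBest`.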


## Wrapper Methods

### Forward Stepwise

Next, we'll use forward stepwise. We can use sklearn's implementation as well, called `SequentialFeatureSelector`. We just need to specify these parameters:

`estimator`: This is the estimator/model we want to evaluate the features on. You can input random forest or logistic regression or any model you'd use for your problem. 

`n_features_to_select`: The number of features we want to select. 

`direction`: Denote `'forward'` for forward stepwise. Denote `'backward'` for backward stepwise. 



In [14]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

kb = SequentialFeatureSelector(LogisticRegression(),
                               n_features_to_select=4,
                              direction = 'forward')
X_new = kb.fit_transform(X,y)
X_new = pd.DataFrame(X_new)
X_new.columns = kb.get_feature_names_out()

X_new

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,Customer_Age,Dependent_count,Months_on_book,Contacts_Count_12_mon
0,45,3,39,3
1,49,5,44,2
2,51,3,36,0
3,40,4,34,1
4,40,3,21,0
...,...,...,...,...
10122,50,2,40,3
10123,41,2,25,3
10124,44,1,36,4
10125,30,2,36,3
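
The convergence warning above suggests scaling the data or raising `max_iter`. One way to do that, sketched here on synthetic data from `make_classification` (a stand-in we chose for illustration, not the bank dataset), is to wrap the scaler and the model in a pipeline and pass the pipeline as the estimator. The names `X_syn`, `y_syn`, and `sfs` are ours.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y: 8 numeric features, binary target
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
X_syn[:, 0] *= 1e6  # mimic a column on a much larger scale

# Scaling inside the pipeline means each CV fold is standardized
# using only its own training split
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sfs = SequentialFeatureSelector(model, n_features_to_select=4, direction="forward")
sfs.fit(X_syn, y_syn)
print(sfs.get_support())  # boolean mask of the selected features
```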


### Backward Stepwise

For backward stepwise, we'll just change the argument `direction` to `'backward'`:

In [15]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

kb = SequentialFeatureSelector(LogisticRegression(),
                               n_features_to_select=4,
                              direction = 'backward')
X_new = kb.fit_transform(X,y)
X_new = pd.DataFrame(X_new)
X_new.columns = kb.get_feature_names_out()

X_new

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,Total_Relationship_Count,Months_Inactive_12_mon,Total_Revolving_Bal,Total_Trans_Amt
0,5,1,777,1144
1,6,1,864,1291
2,4,1,0,1887
3,3,4,2517,1171
4,5,1,0,816
...,...,...,...,...
10122,3,2,1851,15476
10123,4,2,2186,8764
10124,5,3,0,10291
10125,4,3,0,8395


### Recursive Feature Elimination

Recursive Feature Elimination (RFE) first trains the model on all features and prunes the least important ones. It then refits the model on the remaining features and prunes again, repeating until the desired number of features is left.

Recursive Feature Elimination and sequential selection are very similar. The difference is that `SequentialFeatureSelector` uses the cross-validation score to decide which feature to add or remove, while RFE uses the importance or weight of each feature (the fitted model's `coef_` or `feature_importances_`) to decide on removal:

In [16]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

kb = RFE(LogisticRegression(), n_features_to_select=4)
X_new = kb.fit_transform(X,y)
X_new = pd.DataFrame(X_new)
X_new.columns = kb.get_feature_names_out()

X_new

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Unnamed: 0,Dependent_count,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon
0,3,5,1,3
1,5,6,1,2
2,3,4,1,0
3,4,3,4,1
4,3,5,1,0
...,...,...,...,...
10122,2,3,2,3
10123,2,4,2,3
10124,1,5,3,4
10125,2,4,3,3
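
The prune-and-refit loop can be sketched by hand. The version below is a simplified illustration on synthetic, standardized data (not the bank dataset); `manual_rfe` is a hypothetical helper of ours that drops the feature with the smallest absolute coefficient one at a time. With one feature removed per round it should agree with sklearn's `RFE`, which is run alongside for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

def manual_rfe(X, y, n_keep):
    """Repeatedly fit the model and drop the feature with the smallest |coef|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        remaining.pop(weakest)
    return remaining

manual = manual_rfe(X_syn, y_syn, n_keep=4)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X_syn, y_syn)

print(manual)
print(np.flatnonzero(rfe.support_).tolist())
```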


### Exhaustive Stepwise

Exhaustive Stepwise is the most rigorous version of the previous techniques. Rather than adding or removing features one step at a time, exhaustive stepwise tries every combination of features whose size is between `min_features` and `max_features`. We found an implementation using `Mlxtend`:

[Link to Mlxtend Docs](http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/)


In [41]:
# # Install a pip package in the current Jupyter kernel
# import sys
# !{sys.executable} -m pip install mlxtend

In [18]:
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

lr = LogisticRegression()

efs1 = EFS(lr, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

efs1 = efs1.fit(X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best accuracy score: 0.87
Best subset (indices): (4, 5, 7, 8)
Best subset (corresponding names): ('Total_Relationship_Count', 'Months_Inactive_12_mon', 'Total_Revolving_Bal', 'Total_Trans_Amt')


In [19]:
efs1.best_feature_names_

('Total_Relationship_Count',
 'Months_Inactive_12_mon',
 'Total_Revolving_Bal',
 'Total_Trans_Amt')
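
Conceptually, exhaustive search is a loop over every feature subset in the allowed size range, scoring each one with cross-validation. The sketch below shows that idea with `itertools.combinations` on synthetic data (not the bank dataset, and not mlxtend's actual implementation). It also shows why the method gets expensive: the number of subsets is a sum of binomial coefficients.

```python
from itertools import combinations
from math import comb

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_syn, y_syn = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)

min_features, max_features = 1, 3
n_total = X_syn.shape[1]

# Score every subset of size min_features..max_features
results = {}
for k in range(min_features, max_features + 1):
    for subset in combinations(range(n_total), k):
        scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X_syn[:, list(subset)], y_syn,
                                 cv=5, scoring="accuracy")
        results[subset] = scores.mean()

n_expected = sum(comb(n_total, k) for k in range(min_features, max_features + 1))
best_subset = max(results, key=results.get)

print(len(results), "subsets evaluated, expected", n_expected)
print("Best subset (indices):", best_subset)
```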

### Bi-Directional Elimination

Bi-Directional Elimination combines forward and backward stepwise. First, it performs a step of forward stepwise, adding features that are significant. Then, it performs a backward elimination, removing any previously added feature that is no longer significant. In mlxtend, this corresponds to `SequentialFeatureSelector` with `forward=True` and `floating=True`.

In [20]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

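sbs = SFS(LogisticRegression(),
         k_features=4,
         forward=True,   # start from an empty set and add features
         floating=True,  # allow conditional removal after each addition
         cv=0)           # no cross-validation; score on the training data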
sbs.fit(X, y)
sbs.k_feature_names_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

('Total_Relationship_Count',
 'Contacts_Count_12_mon',
 'Total_Revolving_Bal',
 'Total_Trans_Amt')

In [21]:
sbs.k_feature_names_

('Total_Relationship_Count',
 'Contacts_Count_12_mon',
 'Total_Revolving_Bal',
 'Total_Trans_Amt')
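
To show the add-then-conditionally-remove idea in code, here is a simplified sketch of floating forward selection on synthetic data. It is our own illustration under stated assumptions, not mlxtend's exact logic: `cv_score` and `floating_forward` are hypothetical helpers, scoring uses 3-fold cross-validated accuracy, and a feature is dropped only when the smaller subset beats the best score seen so far for that size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_syn, y_syn = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)

_cache = {}

def cv_score(subset):
    """Mean 3-fold CV accuracy for a feature subset, cached by subset."""
    key = frozenset(subset)
    if key not in _cache:
        _cache[key] = cross_val_score(LogisticRegression(max_iter=1000),
                                      X_syn[:, sorted(key)], y_syn, cv=3).mean()
    return _cache[key]

def floating_forward(n_features, k):
    selected, best_by_size = [], {}
    while len(selected) < k:
        # Forward step: add the feature that helps the most
        candidates = [f for f in range(n_features) if f not in selected]
        selected.append(max(candidates, key=lambda f: cv_score(selected + [f])))
        size = len(selected)
        best_by_size[size] = max(best_by_size.get(size, -np.inf), cv_score(selected))
        # Conditional backward steps: drop a feature only if the smaller
        # subset beats the best score recorded so far for that size
        while len(selected) > 2:
            drop = max(selected, key=lambda f: cv_score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if cv_score(reduced) > best_by_size[len(reduced)]:
                selected = reduced
                best_by_size[len(reduced)] = cv_score(reduced)
            else:
                break
    return sorted(selected)

chosen = floating_forward(X_syn.shape[1], k=4)
print(chosen)
```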

## Variance Threshold

The last method here is Variance Threshold. This method removes features whose variance does not exceed a threshold. With the default `threshold=0.0` used below, only constant (zero-variance) features are removed:

In [22]:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold()
selector.fit_transform(X)


array([[768805383,        45,         3, ...,       777,      1144,
               42],
       [818770008,        49,         5, ...,       864,      1291,
               33],
       [713982108,        51,         3, ...,         0,      1887,
               20],
       ...,
       [716506083,        44,         1, ...,         0,     10291,
               60],
       [717406983,        30,         2, ...,         0,      8395,
               62],
       [714337233,        43,         2, ...,      1961,     10294,
               61]])
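
The effect of the threshold is easier to see on a small matrix. The sketch below uses made-up data in which two columns are constant. The default selector drops only those, while a higher `threshold` (0.5 here, an arbitrary value chosen for illustration) also drops a low-variance column.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Small made-up matrix: columns 0 and 3 are constant
X_small = np.array([[0, 2, 0, 3],
                    [0, 1, 4, 3],
                    [0, 1, 1, 3]])

default_sel = VarianceThreshold().fit(X_small)              # threshold=0.0
strict_sel = VarianceThreshold(threshold=0.5).fit(X_small)  # stricter cut-off

print(default_sel.get_support())  # mask of columns kept by the default
print(strict_sel.get_support())   # mask of columns kept with threshold=0.5
```

Because variance depends on a feature's units, it usually makes sense to pick the threshold with the scale of each column in mind.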

## Conclusion

In this notebook, we covered the following techniques:
- Univariate Selection Techniques
- Forward/Backward Stepwise
- Recursive Feature Elimination
- Exhaustive Stepwise
- Bidirectional Elimination
- Variance Threshold