# Feature Selection
Src: https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2

https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-filter-methods-f248e0436ce5

https://pbiecek.github.io/ema/doItYourselfWithPython.html

## Why do we need Feature Selection?
1. Curse of Dimensionality - Overfitting

As the number of features (or dimensions) grows, the amount of data we need to generalize accurately grows exponentially.

If the number of features exceeds the number of samples, the model can fit the training data perfectly but fail to generalize to new samples (overfitting).

2. Explainability

We want the models to be simple and explainable.

3. Garbage information

We want to remove irrelevant or redundant features, which add noise without adding information.
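
Point 1 can be illustrated with a minimal sketch on synthetic data. The sizes, seed, and variable names below are assumptions chosen for illustration and are unrelated to the dataset used in this notebook. With 50 random features and only 20 samples, an ordinary least-squares model fits pure noise almost perfectly on the training set, yet has no predictive power on fresh samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # more features than samples (illustrative sizes)

# Features and target are independent noise: there is nothing real to learn
X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)
X_test = rng.normal(size=(200, n_features))
y_test = rng.normal(size=200)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"train R^2: {train_r2:.3f}")  # close to 1: the noise is memorized
print(f"test R^2:  {test_r2:.3f}")   # poor: nothing generalizes
```

Reducing the number of features relative to the number of samples is one way to limit this kind of memorization.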

## Methods
- **Filter-based**: features are filtered according to a statistical metric (e.g., correlation, chi-square)

- **Wrapper-based**: feature selection is treated as a search problem (e.g., recursive feature elimination)

- **Embedded**: algorithms with built-in feature selection are used (e.g., lasso and random forest)

Import libraries

In [1]:
import pandas as pd
import glob
import numpy as np
root = '../'

Read dataset

In [2]:
import os

path = root + "CSV/TabNet/Rates/"
all_files = glob.glob(path + "*.csv")
risk = pd.read_csv(root + 'CSV/SatScan/muncod_risk.csv', index_col=0)

years = ["08", "09", "10", "11", "12", "13", "14", "15", "16", "17", "18"]

# One DataFrame per year; concatenated once after the loop
yearly_frames = []

for year in years:
    col_year = "RATE_" + year
    year_df = risk
    for file in all_files:
        # os.path.basename handles the path separator of the host OS,
        # unlike splitting on "\\", which only works on Windows
        file_name = os.path.basename(file)
        disease = file_name.split("_RATE")[0]

        disease_df = pd.read_csv(path + disease + '_RATE_08_18.csv', sep=',', index_col=0)
        # Keep only this year's rate and rename the column after the disease
        disease_df = disease_df[[col_year, "MUNCOD"]]
        disease_df = disease_df.rename(columns={col_year: disease})

        year_df = pd.merge(disease_df, year_df, on="MUNCOD")

    year_df = year_df.drop("MUNCOD", axis=1)
    yearly_frames.append(year_df)

final_df = pd.concat(yearly_frames)
final_df.head()

| | TRAUMATISMO_INTRACRANIANO | TRANSTORNOS_MENTAIS | TECIDO_MOLE | OSTEOPOROSE | INSUFICIENCIA_RENAL | INSUFICIENCIA_CARDIACA | HIV | HIPERTENSAO | ESCLEROSE_MULTIPLA | EPILEPSIA | ENXAQUECA | DPOC | DORSOPATIAS | DOENCA_DE_PARKINSON | DOENCA_CARDIACA | DIABETES_MELLITUS | CANCER | ASMA | RISK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.894983 | 12.206535 | 0.0 | 4.068845 | 4.068845 | 248.199536 | 0.0 | 537.087521 | 0.0 | 16.275379 | 0.0 | 248.199536 | 162.753794 | 0.0 | 28.481914 | 219.717622 | 12.206535 | 777.149367 | 1 |
| 1 | 6.988853 | 0.0 | 0.0 | 0.0 | 0.0 | 45.427543 | 3.494426 | 213.16001 | 0.0 | 6.988853 | 0.0 | 87.36066 | 41.933117 | 0.0 | 20.966558 | 69.888528 | 48.921969 | 66.394101 | 1 |
| 2 | 27.800945 | 35.214531 | 1.853396 | 0.0 | 9.266982 | 129.737744 | 1.853396 | 42.628116 | 0.0 | 18.533963 | 0.0 | 83.402836 | 9.266982 | 0.0 | 20.38736 | 155.685293 | 51.895098 | 87.109628 | 1 |
| 3 | 88.521954 | 5.419711 | 9.032852 | 1.80657 | 65.036538 | 274.598716 | 7.226282 | 167.107771 | 0.0 | 18.065705 | 0.0 | 122.846794 | 46.067548 | 0.903285 | 40.647836 | 200.529325 | 74.972676 | 127.36322 | 1 |
| 4 | 16.154219 | 13.461849 | 8.077109 | 0.0 | 16.154219 | 105.002423 | 2.69237 | 204.620107 | 0.0 | 18.846589 | 5.38474 | 210.004846 | 21.538959 | 0.0 | 24.231328 | 145.38797 | 45.770287 | 110.387163 | 1 |


Defining X and y

In [6]:
X = final_df.drop(columns="RISK")
y = final_df["RISK"]
num_feat = 10

## 1. Filter Based

### 1.1. Correlation Feature Selection

In [7]:
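def cor_feature_selector(X, y, n):
    """Return a boolean mask and the names of the n features most correlated with y."""
    cor_list = []
    for col in X.columns:
        # Pearson correlation between one feature and the target
        cor = np.corrcoef(X[col], y)[0, 1]
        # A constant column yields NaN; treat it as uncorrelated
        cor_list.append([col, 0 if np.isnan(cor) else cor])
    # Rank by absolute correlation, strongest first
    cor_ranking = sorted(cor_list, key=lambda a: abs(a[1]), reverse=True)
    cor_feature = [x[0] for x in cor_ranking[:n]]
    cor_support = [col in cor_feature for col in X.columns]
    return cor_support, cor_feature
cor_support, cor_feature = cor_feature_selector(X, y, num_feat)
print(len(cor_feature), 'selected features')
print(cor_feature)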

10 selected features
['TRANSTORNOS_MENTAIS', 'CANCER', 'DPOC', 'INSUFICIENCIA_CARDIACA', 'INSUFICIENCIA_RENAL', 'DOENCA_CARDIACA', 'DIABETES_MELLITUS', 'DORSOPATIAS', 'TECIDO_MOLE', 'OSTEOPOROSE']
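
As a hedged alternative, the same ranking could be computed without an explicit loop using pandas' `DataFrame.corrwith`. The function name `cor_feature_selector_vectorized` and the toy data below are hypothetical, and the sketch assumes `X` and `y` share a unique, aligned index (which may not hold for a frame built by concatenating yearly tables without resetting the index).

```python
import numpy as np
import pandas as pd

def cor_feature_selector_vectorized(X, y, n):
    # Pearson correlation of every column with y; constant columns give NaN
    cors = X.corrwith(y).fillna(0)
    # Names of the n columns with the largest absolute correlation
    top = cors.abs().sort_values(ascending=False).index[:n]
    support = X.columns.isin(top).tolist()
    return support, top.tolist()

# Toy data (illustrative only)
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy_X = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=200),
    "weak": signal + 1.0 * rng.normal(size=200),
    "noise": rng.normal(size=200),
    "constant": np.ones(200),
})
toy_y = pd.Series((signal > 0).astype(int))

toy_support, toy_features = cor_feature_selector_vectorized(toy_X, toy_y, 2)
print(toy_features)
print(toy_support)
```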


### 1.2. Chi-Squared Feature Selection

In [8]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
# chi2 requires non-negative inputs, so scale every feature to [0, 1] first
X_norm = MinMaxScaler().fit_transform(X)
# Keep the num_feat features with the highest chi-squared statistic
chi_selector = SelectKBest(chi2, k=num_feat)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(len(chi_feature), 'selected features')
print(chi_feature)

10 selected features
['TRANSTORNOS_MENTAIS', 'TECIDO_MOLE', 'OSTEOPOROSE', 'INSUFICIENCIA_RENAL', 'INSUFICIENCIA_CARDIACA', 'DPOC', 'DORSOPATIAS', 'DOENCA_DE_PARKINSON', 'DIABETES_MELLITUS', 'CANCER']
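
To see why a feature is kept or dropped, the per-feature statistics can be inspected through the fitted selector's `scores_` and `pvalues_` attributes. The sketch below uses synthetic count data; the column names, sample size, and seed are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)

# One feature whose mean depends on the class, two that ignore it
X_toy = pd.DataFrame({
    "informative": rng.poisson(lam=2 + 6 * y_toy),
    "noise_a": rng.poisson(lam=4, size=300),
    "noise_b": rng.poisson(lam=4, size=300),
})

X_toy_norm = MinMaxScaler().fit_transform(X_toy)
toy_selector = SelectKBest(chi2, k=1).fit(X_toy_norm, y_toy)

# Higher score (and lower p-value) means stronger dependence on the class
report = pd.DataFrame(
    {"chi2": toy_selector.scores_, "p_value": toy_selector.pvalues_},
    index=X_toy.columns,
).sort_values("chi2", ascending=False)
print(report)
```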


### 1.3. Mutual Information
Mutual information is a measure of the mutual dependence between two variables. It quantifies how much information is obtained about one variable by observing the other.
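
Unlike Pearson correlation, mutual information can also pick up non-linear dependence. The following minimal sketch uses synthetic data (the variable names, sample size, and seed are illustrative assumptions): the target is fully determined by `|x|`, so the linear correlation is close to zero while the estimated mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
# The class depends on the magnitude of x, not on its sign
y_toy = (np.abs(x) > 0.5).astype(int)

pearson = np.corrcoef(x, y_toy)[0, 1]
# mutual_info_classif expects a 2-D feature matrix; the estimate is in nats
mi = mutual_info_classif(x.reshape(-1, 1), y_toy, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```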

In [9]:
from sklearn.feature_selection import mutual_info_classif, SelectKBest

mi_selector = SelectKBest(mutual_info_classif, k=num_feat)
mi_selector.fit(X, y)
mi_support = mi_selector.get_support()
mi_feature = np.array(X.columns)[mi_support]
print(str(len(mi_feature)), 'selected features')
print(mi_feature)

10 selected features
['TRANSTORNOS_MENTAIS' 'TECIDO_MOLE' 'OSTEOPOROSE'
 'INSUFICIENCIA_CARDIACA' 'ESCLEROSE_MULTIPLA' 'DPOC' 'DORSOPATIAS'
 'DOENCA_CARDIACA' 'DIABETES_MELLITUS' 'CANCER']


## 2. Wrapper-based

### 2.1. Recursive Feature Elimination
"The goal of recursive feature elimination (RFE) is to select features by **recursively considering smaller and smaller sets of features**. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached." (`sklearn` documentation)

In [10]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
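# step=10 allows up to 10 features to be dropped per iteration, so with 18 features
# a single pruning pass reaches num_feat; step=1 would eliminate one feature at a time
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_feat, step=10, verbose=5)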
rfe_selector.fit(X_norm, y)
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')
print(rfe_feature)

Fitting estimator with 18 features.
10 selected features
['TRAUMATISMO_INTRACRANIANO', 'TRANSTORNOS_MENTAIS', 'INSUFICIENCIA_RENAL', 'HIV', 'ESCLEROSE_MULTIPLA', 'DPOC', 'DORSOPATIAS', 'DOENCA_DE_PARKINSON', 'DOENCA_CARDIACA', 'CANCER']
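
The pruning loop described in the quote above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the helper `manual_rfe` is hypothetical, it removes exactly one feature per iteration, it uses the squared logistic-regression coefficient as the importance, and it runs on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, y, n_features_to_select):
    """Drop the least important feature until n_features_to_select remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit on the surviving columns only
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        importance = model.coef_[0] ** 2
        # Remove the feature with the smallest importance
        del remaining[int(np.argmin(importance))]
    return remaining

X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

manual_selected = manual_rfe(X_toy, y_toy, 3)
sklearn_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
sklearn_selected = np.flatnonzero(sklearn_rfe.fit(X_toy, y_toy).get_support()).tolist()

print("manual loop:", manual_selected)
print("sklearn RFE:", sklearn_selected)
```

Printing both results side by side makes it easy to check whether the hand-written loop and `RFE` with `step=1` agree on a given dataset.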


## 3. Embedded
### 3.1. Lasso: SelectFromModel
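This is an embedded method: a logistic regression with an L1 (lasso) penalty shrinks the coefficients of uninformative features to zero, and `SelectFromModel` keeps only the features whose coefficients remain non-zero.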

In [20]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver='liblinear'))
embeded_lr_selector.fit(X_norm, y)

embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')
print(embeded_lr_feature)

16 selected features
['TRAUMATISMO_INTRACRANIANO', 'TRANSTORNOS_MENTAIS', 'TECIDO_MOLE', 'OSTEOPOROSE', 'INSUFICIENCIA_RENAL', 'HIV', 'ESCLEROSE_MULTIPLA', 'EPILEPSIA', 'ENXAQUECA', 'DPOC', 'DORSOPATIAS', 'DOENCA_DE_PARKINSON', 'DOENCA_CARDIACA', 'DIABETES_MELLITUS', 'CANCER', 'ASMA']
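
How many features survive depends on the regularization strength `C` of the logistic regression (smaller `C` means a stronger penalty). The sketch below illustrates this on synthetic data; the dataset, the `C` values, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_toy = MinMaxScaler().fit_transform(X_toy)

n_selected = {}
for C in (0.01, 1.0, 100.0):
    # The L1 penalty sets some coefficients exactly to zero
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    toy_selector = SelectFromModel(model).fit(X_toy, y_toy)
    n_selected[C] = int(toy_selector.get_support().sum())
    print(f"C={C}: {n_selected[C]} selected features")
```

If the default `C` keeps almost every feature, as in the cell above, lowering `C` is one way to obtain a sparser selection.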


### 3.2. Tree-Based: SelectFromModel

In [18]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
print(embeded_rf_feature)

6 selected features
['TRAUMATISMO_INTRACRANIANO', 'TRANSTORNOS_MENTAIS', 'INSUFICIENCIA_CARDIACA', 'DPOC', 'DOENCA_CARDIACA', 'CANCER']
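
For a random forest, `SelectFromModel` with its default threshold keeps the features whose importance is at least the mean importance, which is why the number of selected features is not fixed in advance. A minimal sketch on synthetic data (the dataset and variable names are illustrative assumptions) reproduces that selection by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

toy_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0)
).fit(X_toy, y_toy)

# Impurity-based importances of the fitted forest (they sum to 1)
importances = toy_selector.estimator_.feature_importances_
# Reproduce the mask: keep features at or above the fitted threshold
manual_support = importances >= toy_selector.threshold_

print("threshold:      ", toy_selector.threshold_)
print("mean importance:", importances.mean())
print("selected:       ", np.flatnonzero(manual_support))
```

Passing `threshold` or `max_features` to `SelectFromModel` is one way to control how many features are kept.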


## All together

In [16]:
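# One boolean column per selection method, one row per feature
feature_selection_df = pd.DataFrame({'Feature':X.columns, 'Pearson':cor_support, 'Chi-2':chi_support,'RFE':rfe_support, 'Logistics':embeded_lr_support,
                                    'Random Forest':embeded_rf_support})
# Count votes over the boolean columns only; including the string 'Feature'
# column in the row sum can raise a TypeError in pandas 2.x
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# Most-voted features first
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feat+5)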

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | Total |
|---|---|---|---|---|---|---|---|
| 1 | TRANSTORNOS_MENTAIS | True | True | True | True | True | 5 |
| 2 | DPOC | True | True | True | True | True | 5 |
| 3 | CANCER | True | True | True | True | True | 5 |
| 4 | INSUFICIENCIA_RENAL | True | True | True | True | False | 4 |
| 5 | DORSOPATIAS | True | True | True | True | False | 4 |
| 6 | DOENCA_CARDIACA | True | False | True | True | True | 4 |
| 7 | INSUFICIENCIA_CARDIACA | True | True | False | False | True | 3 |
| 8 | DOENCA_DE_PARKINSON | False | True | True | True | False | 3 |
| 9 | TRAUMATISMO_INTRACRANIANO | False | False | True | True | False | 2 |
| 10 | TECIDO_MOLE | True | True | False | False | False | 2 |
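
The vote count generalizes to any number of selectors. For example, the mutual information mask from section 1.3 could be added as one more column. The helper `tally_votes` below is a hypothetical sketch, shown on toy masks rather than on the notebook's results, and it assumes each mask is a plain list or array in the same order as the feature names.

```python
import pandas as pd

def tally_votes(feature_names, supports):
    """Count how many selectors kept each feature.

    supports maps a method name to a boolean mask aligned with feature_names.
    """
    votes = pd.DataFrame(supports, index=pd.Index(feature_names, name="Feature"))
    votes = votes.astype(bool)
    # True counts as 1, so the row sum is the number of votes
    votes["Total"] = votes.sum(axis=1)
    return votes.sort_values("Total", ascending=False, kind="stable")

# Toy masks (illustrative only)
toy_votes = tally_votes(
    ["a", "b", "c"],
    {
        "Pearson": [True, True, False],
        "Chi-2": [True, False, False],
        "Mutual Information": [True, True, False],
    },
)
print(toy_votes)
```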
