# Feature Selection
Sources:

- https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2

- https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-filter-methods-f248e0436ce5

- https://pbiecek.github.io/ema/doItYourselfWithPython.html

## Why do we need Feature Selection?
1. Curse of Dimensionality - Overfitting

As the number of features (or dimensions) grows, the amount of data we need to generalize accurately grows exponentially.

If there are more features than samples, a model can fit the training data perfectly yet fail to generalize to new samples (overfitting).

2. Explainability

We want the models to be simple and explainable.

3. Garbage information

We want to remove features that carry no useful information about the target.
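The overfitting claim in point 1 can be checked with a small experiment. This is a minimal sketch on synthetic data (the sizes, seed, and variable names are assumptions chosen for illustration, not part of the analysis in this notebook): the target is pure noise, yet a linear model with more features than samples reproduces the training targets almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 20, 50  # more features than samples
X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)  # pure noise: there is nothing to learn
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)

# Minimum-norm least-squares fit; with n_features > n_samples it interpolates
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ coef - y_train) ** 2)
test_mse = np.mean((X_test @ coef - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2e}")
```

The training error is numerically zero while the test error is not, which is the behaviour feature selection tries to guard against.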

## Methods
- **Filter-based**: features are scored with a statistic computed independently of any model (e.g. correlation, chi-square) and the best-scoring ones are kept

- **Wrapper-based**: feature selection is treated as a search problem in which candidate subsets are evaluated by training a model (e.g. recursive feature elimination)

- **Embedded**: algorithms with built-in feature selection are used (e.g. lasso and random forest)
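As a rough illustration of the three families, the sketch below applies one scikit-learn selector of each kind to a synthetic regression problem. The dataset, the choice of `f_regression`, `LinearRegression`, and `Lasso(alpha=1.0)`, and the `*_demo` names are assumptions made for the example; the notebook itself uses different estimators further down.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# Filter: score each feature on its own with a univariate F-test
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)

# Wrapper: repeatedly refit a model and drop the weakest feature
wrapper_sel = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# Embedded: the L1 penalty drives some coefficients to zero during training
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X_demo, y_demo)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, np.flatnonzero(sel.get_support()))
```

Each selector exposes `get_support()`, which returns a boolean mask over the columns; the rest of this notebook relies on the same method.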

Import libraries

In [1]:
import pandas as pd
import glob
import numpy as np
root = '../'

In [2]:
suicide = pd.read_csv(root +'CSV/Suicide/suicide_rates_08_18.csv', index_col=0)
suicide

Unnamed: 0,MUNCOD,RATE_08,RATE_09,RATE_10,RATE_11,RATE_12,RATE_13,RATE_14,RATE_15,RATE_16,RATE_17,RATE_18
0,110001,20.344224,8.212203,8.189337,4.127456,12.464166,7.773632,3.898332,11.728829,7.841292,11.793844,4.316485
1,110002,9.458389,2.338060,4.427031,4.368243,9.703818,1.974938,4.860976,4.789226,5.665936,9.315758,1.883807
2,110003,0.000000,14.936520,0.000000,0.000000,0.000000,15.396459,0.000000,31.471282,15.900779,0.000000,18.389114
3,110004,5.110972,7.626311,2.544497,1.266480,5.042229,1.164646,5.776607,6.878683,10.241588,5.649271,9.432516
4,110005,0.000000,0.000000,11.743981,0.000000,11.868028,0.000000,0.000000,11.119760,11.136478,5.576001,6.081245
...,...,...,...,...,...,...,...,...,...,...,...,...
5376,522200,7.874636,0.000000,15.937525,15.817779,15.702285,0.000000,7.494566,0.000000,0.000000,7.312614,0.000000
5377,522205,0.000000,0.000000,0.000000,0.000000,13.199578,12.605572,0.000000,12.238404,0.000000,0.000000,0.000000
5378,522220,67.249496,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5379,522230,0.000000,18.660198,0.000000,0.000000,0.000000,18.315018,0.000000,0.000000,0.000000,0.000000,0.000000


Read dataset

In [3]:
disease = ""
path = root + "CSV/TabNet/Rates/"
all_files = glob.glob(path + "*.csv")
suicide = pd.read_csv(root +'CSV/Suicide/suicide_rates_08_18.csv', index_col=0)

years = ["08", "09", "10", "11", "12", "13", "14", "15", "16", "17", "18"]

final_df = pd.DataFrame()


for year in years:
    col_year = "RATE_" + year
    year_df = suicide[[col_year, "MUNCOD"]]
    year_df = year_df.rename(columns={col_year: "RATE"})
    for file in all_files:
        # Take the file name regardless of the OS path separator ("\\" or "/")
        file_name = file.replace("\\", "/").split("/")[-1]
        disease = file_name.split("_RATE")[0]
        
        disease_df = pd.read_csv(path + disease + '_RATE_08_18.csv', sep=',', index_col=0)
        disease_df = disease_df[[col_year, "MUNCOD"]]
        disease_df = disease_df.rename(columns={col_year: disease})
        
        
        year_df = pd.merge(disease_df, year_df, left_on="MUNCOD", right_on="MUNCOD")
        
    year_df = year_df.drop("MUNCOD", axis=1)
    final_df = pd.concat([final_df, year_df])

final_df = final_df[final_df["RATE"] > 0]
final_df.head()        

Unnamed: 0,TRAUMATISMO_INTRACRANIANO,TRANSTORNOS_MENTAIS,TECIDO_MOLE,OSTEOPOROSE,INSUFICIENCIA_RENAL,INSUFICIENCIA_CARDIACA,HIV,HIPERTENSAO,ESCLEROSE_MULTIPLA,EPILEPSIA,ENXAQUECA,DPOC,DORSOPATIAS,DOENCA_DE_PARKINSON,DOENCA_CARDIACA,DIABETES_MELLITUS,CANCER,ASMA,RATE
0,52.894983,12.206535,0.0,4.068845,4.068845,248.199536,0.0,537.087521,0.0,16.275379,0.0,248.199536,162.753794,0.0,28.481914,219.717622,12.206535,777.149367,20.344224
2,27.800945,35.214531,1.853396,0.0,9.266982,129.737744,1.853396,42.628116,0.0,18.533963,0.0,83.402836,9.266982,0.0,20.38736,155.685293,51.895098,87.109628,1.853396
3,88.521954,5.419711,9.032852,1.80657,65.036538,274.598716,7.226282,167.107771,0.0,18.065705,0.0,122.846794,46.067548,0.903285,40.647836,200.529325,74.972676,127.36322,10.839423
4,16.154219,13.461849,8.077109,0.0,16.154219,105.002423,2.69237,204.620107,0.0,18.846589,5.38474,210.004846,21.538959,0.0,24.231328,145.38797,45.770287,110.387163,16.154219
6,23.207608,18.460597,6.329348,2.373505,15.559646,26.108559,22.416439,10.548913,0.0,15.823369,0.0,12.394972,4.483288,0.0,32.965352,35.338857,57.491574,28.482064,4.483288


Defining X and y

In [4]:
X = final_df.drop(columns="RATE")
y = final_df["RATE"]
num_feat = 10

## 1. Filter Based

### 1.1. Correlation Feature Selection

In [5]:
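def cor_feature_selector(X, y, n):
    """Return a boolean mask and the names of the n features most correlated with y."""
    cor_list = []
    for col in X.columns:
        # Pearson correlation between one feature and the target
        cor = np.corrcoef(X[col], y)[0, 1]
        # A constant column yields NaN; treat it as uncorrelated
        cor_list.append([col, 0.0 if np.isnan(cor) else cor])
    # Rank by absolute correlation, strongest first
    cor_ranking = sorted(cor_list, key=lambda a: abs(a[1]), reverse=True)
    cor_feature = [x[0] for x in cor_ranking[:n]]
    cor_support = [col in cor_feature for col in X.columns]
    return cor_support, cor_feature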
cor_support, cor_feature = cor_feature_selector(X,y,num_feat)
print(str(len(cor_feature)), 'selected features')
print(cor_feature)

10 selected features
['DPOC', 'CANCER', 'TRANSTORNOS_MENTAIS', 'INSUFICIENCIA_CARDIACA', 'DIABETES_MELLITUS', 'TECIDO_MOLE', 'DOENCA_DE_PARKINSON', 'DOENCA_CARDIACA', 'OSTEOPOROSE', 'INSUFICIENCIA_RENAL']
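The same ranking can probably be obtained without an explicit loop. The sketch below is one possible vectorized variant using `DataFrame.corrwith`; the function name `cor_feature_selector_vectorized` and the demo data are hypothetical, and it assumes `X` and `y` share the same index, since `corrwith` aligns on it.

```python
import numpy as np
import pandas as pd


def cor_feature_selector_vectorized(X, y, n):
    """Hypothetical variant: top-n columns of X by absolute Pearson correlation with y."""
    # Correlate every column with the target at once; drop NaN (constant columns)
    abs_corr = X.corrwith(y).abs().dropna()
    top = abs_corr.nlargest(n).index.tolist()
    # Boolean mask in the original column order
    support = X.columns.isin(top).tolist()
    return support, top


# Small synthetic check: column "b" is built to dominate the target
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
demo_y = 3 * demo_X["b"] + rng.normal(scale=0.1, size=100)

demo_support, demo_top = cor_feature_selector_vectorized(demo_X, demo_y, 2)
print(demo_top)
print(demo_support)
```

Note that Pearson correlation only captures linear association, so a feature with a strong non-linear relationship to the target can still rank low under this filter.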


## 2. Wrapper-based

### 2.1. Recursive Feature Elimination
"The goal of recursive feature elimination (RFE) is to select features by **recursively considering smaller and smaller sets of features**. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached." (`sklearn` documentation)

In [6]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
rfe_selector = RFE(estimator=RandomForestRegressor(), n_features_to_select=num_feat, step=10, verbose=5)
rfe_selector.fit(X, y)
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')
print(rfe_feature)

Fitting estimator with 18 features.
10 selected features
['TRAUMATISMO_INTRACRANIANO', 'TRANSTORNOS_MENTAIS', 'TECIDO_MOLE', 'OSTEOPOROSE', 'INSUFICIENCIA_RENAL', 'EPILEPSIA', 'DPOC', 'DOENCA_CARDIACA', 'CANCER', 'ASMA']
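To make the procedure quoted above concrete, here is a hand-written version of the elimination loop. It is a simplified sketch, not scikit-learn's implementation: `manual_rfe` is a hypothetical helper, it ranks features by the absolute value of `coef_` from a `LinearRegression` rather than by random-forest importances, and it runs on synthetic data where the informative columns (0, 2 and 5) are known.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def manual_rfe(X, y, n_features_to_select, step=1):
    """Hypothetical helper: hand-rolled recursive feature elimination."""
    remaining = np.arange(X.shape[1])  # column indices still in play
    while len(remaining) > n_features_to_select:
        # 1. Train the estimator on the current feature set
        model = LinearRegression().fit(X[:, remaining], y)
        importance = np.abs(model.coef_)
        # 2. Prune the least important features, never going below the target
        n_drop = min(step, len(remaining) - n_features_to_select)
        keep = np.sort(np.argsort(importance)[n_drop:])
        remaining = remaining[keep]
    return remaining


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (
    5 * X_demo[:, 0] - 3 * X_demo[:, 2] + 2 * X_demo[:, 5]
    + rng.normal(scale=0.1, size=200)
)

manual_selected = manual_rfe(X_demo, y_demo, n_features_to_select=3)
sklearn_selected = np.flatnonzero(
    RFE(LinearRegression(), n_features_to_select=3, step=1)
    .fit(X_demo, y_demo)
    .get_support()
)
print("manual: ", manual_selected)
print("sklearn:", sklearn_selected)
```

The `step` argument controls how many features are removed per iteration. In the cell above, `step=10` with 18 features and a target of 10 means the 8 surplus features are most likely dropped in a single pass, which is consistent with the single "Fitting estimator with 18 features." line in the output; a smaller `step` re-evaluates importances more often at the cost of more model fits.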


## 3. Embedded

### 3.1. Tree-Based: SelectFromModel

In [7]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

embeded_rf_selector = SelectFromModel(RandomForestRegressor())
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
print(embeded_rf_feature)

4 selected features
['OSTEOPOROSE', 'INSUFICIENCIA_RENAL', 'DPOC', 'CANCER']
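Only 4 features are selected here, not `num_feat`, because `SelectFromModel` was given no threshold: for an estimator without an L1 penalty, scikit-learn falls back to the mean feature importance and keeps the features at or above it. The sketch below illustrates this on synthetic data and shows one way to request a fixed number of features instead, via `threshold=-np.inf` together with `max_features`. The data, seeds, and `*_demo` / `*_sel` names are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
# Only the first two columns influence the target
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Default behaviour: keep features whose importance is >= the mean importance
default_sel = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)
importances = default_sel.estimator_.feature_importances_
print("threshold used: ", default_sel.threshold_)
print("mean importance:", importances.mean())
print("kept by default:", default_sel.get_support().sum())

# Fixed-size selection: disable the threshold and cap the number of features
top_k_sel = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=5,
).fit(X_demo, y_demo)
print("kept with max_features=5:", top_k_sel.get_support().sum())
```

Using `max_features=num_feat` in the notebook cell would make the embedded method return as many features as the other two selectors, at the price of possibly keeping features with very low importance.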


## All together

In [8]:
feature_selection_df = pd.DataFrame({'Feature':X.columns, 'Pearson':cor_support, 'RFE':rfe_support,
                                    'Random Forest':embeded_rf_support})
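# Count how many methods selected each feature (sum only the boolean columns,
# since summing across the string 'Feature' column fails on recent pandas)
feature_selection_df['Total'] = feature_selection_df[['Pearson', 'RFE', 'Random Forest']].sum(axis=1)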
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feat+5)

Unnamed: 0,Feature,Pearson,RFE,Random Forest,Total
1,OSTEOPOROSE,True,True,True,3
2,INSUFICIENCIA_RENAL,True,True,True,3
3,DPOC,True,True,True,3
4,CANCER,True,True,True,3
5,TRANSTORNOS_MENTAIS,True,True,False,2
6,TECIDO_MOLE,True,True,False,2
7,DOENCA_CARDIACA,True,True,False,2
8,TRAUMATISMO_INTRACRANIANO,False,True,False,1
9,INSUFICIENCIA_CARDIACA,True,False,False,1
10,EPILEPSIA,False,True,False,1
