# Feature Selection
Src: https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2

https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-filter-methods-f248e0436ce5

https://pbiecek.github.io/ema/doItYourselfWithPython.html

https://www.machinelearningplus.com/machine-learning/feature-selection/#1boruta

## Why do we need Feature Selection?
1. Curse of Dimensionality - Overfitting

As the number of features (or dimensions) grows, the amount of data we need to generalize accurately grows exponentially.

If the number of features is larger than the number of samples, a model can fit the training data perfectly but will fail to generalize to new samples (overfitting).

2. Explainability

We want the models to be simple and explainable.

3. Garbage information

We want to remove irrelevant or redundant features, which add noise without improving predictions.
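
To make the first point concrete, here is a minimal sketch (not taken from the sources above) that fits least squares to pure noise with more features than training samples. The sizes, the seed, and all variable names are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# More features (50) than training samples (20).
n_train, n_test, n_features = 20, 200, 50

# Features and target are independent noise: there is nothing real to learn.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Least-squares fit; lstsq returns the minimum-norm solution
# when the system is underdetermined.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ coef - y_train) ** 2)
test_mse = np.mean((X_test @ coef - y_test) ** 2)

print(f"train MSE: {train_mse:.2e}")  # essentially zero: the noise is memorized
print(f"test MSE:  {test_mse:.2f}")   # no better than predicting the mean
```

The training error collapses to numerical zero even though the target is random, while the error on fresh samples stays large.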

## Methods
- **Filter-based**: filter features based on some metric (e.g., correlation, chi-square)

- **Wrapper-based**: the selection of features is treated as a search problem (e.g., recursive feature elimination)

- **Embedded**: use algorithms that have built-in feature selection (e.g., lasso and random forest)

# Example
We will be using the **football player dataset** to understand which features make a player great.

This is a classification problem where we consider that a great player is one that has an Overall Score >= 87.

Import libraries

In [2]:
import pandas as pd
import numpy as np

Read dataset

In [3]:
df = pd.read_csv('data.csv', index_col=0)
#numerical = [k for k in df.dtypes.keys() if df.dtypes[k] in ['int64', 'float64']]
#categorical = [k for k in df.dtypes.keys() if df.dtypes[k] == 'object']

Choose categorical and numerical features

In [4]:
numerical = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
categorical = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

Generate dummy variables for the categorical features and drop rows with null values

In [5]:
df = pd.concat([df[numerical], pd.get_dummies(df[categorical])],axis=1)
df = df.dropna()
df.head()

| | Overall | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 94 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 94 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 92 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 91 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 91 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Defining X, y and num_feat (number of features we're trying to select).

As seen above, we have 224 columns, and we're looking to find the 30 most relevant ones.

In [6]:
X = df.drop(columns="Overall")
y = df["Overall"] >= 87
num_feat = 30

## 1. Filter Based

### 1.1. Correlation Feature Selection

Correlation measures the strength and direction of the relationship between two variables.

#### Pearson correlation
Assumptions:
- Both variables should be **normally distributed**.
- A **straight-line relationship** between the two variables.
- The spread of the data around the regression line is roughly **constant** (homoscedasticity).
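
As a reminder of what the coefficient actually computes, here is a minimal sketch of Pearson's r written out by hand: the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson_r` and the toy data are assumptions for illustration; they are not part of the dataset used in this notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )

# Hypothetical toy data: a skill score and a boolean "great player" flag.
skill = [55, 60, 62, 70, 78, 85, 90, 93]
is_great = [False, False, False, False, False, True, True, True]

r = pearson_r(skill, is_great)
print(round(r, 3))
```

With a boolean target, as in this notebook, the result is the point-biserial correlation, which is the same formula applied to a 0/1 variable.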

In [7]:
def cor_feature_selector(X, y, n):
    """Rank features by absolute Pearson correlation with y and keep the top n."""
    cor_list = []
    for i in X.columns:
        cor = np.corrcoef(X[i], y)[0, 1]
        # Constant columns produce NaN; treat them as uncorrelated so sorting is well defined
        cor_list.append([i, 0.0 if np.isnan(cor) else cor])
    cor_ranking = sorted(cor_list, key=lambda a: abs(a[1]), reverse=True)
    cor_feature = [x[0] for x in cor_ranking[:n]]
    # Boolean mask aligned with X.columns
    cor_support = [i in cor_feature for i in X.columns]
    return cor_support, cor_feature
cor_support, cor_feature = cor_feature_selector(X, y, num_feat)
print(len(cor_feature), 'selected features')
print(cor_feature)

30 selected features
['Reactions', 'Body Type_C. Ronaldo', 'Body Type_Messi', 'Body Type_Neymar', 'Body Type_Courtois', 'Body Type_PLAYER_BODY_TYPE_25', 'Position_LF', 'Position_RF', 'ShortPassing', 'Volleys', 'LongPassing', 'FKAccuracy', 'BallControl', 'Finishing', 'LongShots', 'ShotPower', 'Dribbling', 'Nationality_Belgium', 'Crossing', 'Agility', 'Weak Foot', 'Stamina', 'Nationality_Slovenia', 'Nationality_Gabon', 'Strength', 'SprintSpeed', 'Acceleration', 'Nationality_Uruguay', 'Position_LAM', 'Nationality_Costa Rica']


### 1.2. Chi-Squared Feature Selection

- Suited for **categorical features and a categorical target** (binary in this example).
- The feature values must be **non-negative** and are typically **booleans, frequencies, or counts**.
- Compares the observed distribution of each feature across the target classes with the distribution expected if the feature and the target were independent.
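
The "observed versus expected" comparison can be written out by hand for one binary feature and a binary target. This is a minimal sketch of the textbook Pearson chi-squared statistic on a 2x2 contingency table; the function name and the toy arrays are assumptions. scikit-learn's `chi2` scorer works from per-class sums of the feature values, so its numbers may differ from this statistic.

```python
import numpy as np

def chi_square_statistic(feature, target):
    """Pearson chi-squared statistic for two binary (0/1) arrays."""
    feature = np.asarray(feature, dtype=int)
    target = np.asarray(target, dtype=int)

    # Observed 2x2 contingency table: rows = feature value, columns = target value.
    observed = np.zeros((2, 2))
    for f, t in zip(feature, target):
        observed[f, t] += 1

    # Expected counts if feature and target were independent.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    return ((observed - expected) ** 2 / expected).sum()

target = [1, 1, 1, 0, 0, 0, 0, 1]
dependent_dummy = [1, 1, 1, 1, 0, 0, 0, 0]    # mostly agrees with the target
independent_dummy = [1, 1, 0, 0, 0, 1, 1, 0]  # evenly spread over both classes

print(chi_square_statistic(dependent_dummy, target))    # 2.0
print(chi_square_statistic(independent_dummy, target))  # 0.0
```

A larger statistic means the observed counts are further from what independence would predict, which is why the features with the highest scores are kept.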

In [8]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
# chi2 requires non-negative inputs, so rescale every feature to [0, 1]
X_norm = MinMaxScaler().fit_transform(X)
# Keep the num_feat features with the highest chi-squared scores
chi_selector = SelectKBest(chi2, k=num_feat)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')
print(chi_feature)

30 selected features
['Finishing', 'ShortPassing', 'LongPassing', 'BallControl', 'Volleys', 'FKAccuracy', 'Reactions', 'LongShots', 'Position_CM', 'Position_LAM', 'Position_LF', 'Position_LW', 'Position_RB', 'Position_RF', 'Body Type_C. Ronaldo', 'Body Type_Courtois', 'Body Type_Messi', 'Body Type_Neymar', 'Body Type_PLAYER_BODY_TYPE_25', 'Nationality_Belgium', 'Nationality_Costa Rica', 'Nationality_Croatia', 'Nationality_Egypt', 'Nationality_England', 'Nationality_France', 'Nationality_Gabon', 'Nationality_Slovakia', 'Nationality_Slovenia', 'Nationality_Spain', 'Nationality_Uruguay']


### 1.3. Mutual Information

- It is a measure of the mutual dependence between two variables.
- It quantifies the amount of information obtained about one variable by observing the other; that is, how much knowing a feature's value reduces the uncertainty about the target `y`.
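
For two discrete variables the definition can be computed directly from counts. The sketch below is an illustration under assumptions (function name, toy data, and the choice of bits as unit are mine); `mutual_info_classif` estimates the quantity differently for continuous features, so this only shows the underlying definition.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))

    mi = 0.0
    for (x_value, y_value), count in count_xy.items():
        p_xy = count / n
        p_x = count_x[x_value] / n
        p_y = count_y[y_value] / n
        # Contribution of this (x, y) combination: p(x,y) * log(p(x,y) / (p(x) p(y)))
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

target = [1, 1, 0, 0, 1, 1, 0, 0]
copy_of_target = [1, 1, 0, 0, 1, 1, 0, 0]  # fully informative feature
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]       # independent of the target

print(mutual_information(copy_of_target, target))  # 1.0
print(mutual_information(unrelated, target))       # 0.0
```

A feature that determines the target carries the full one bit of a balanced binary target, while an independent feature carries none.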

In [9]:
from sklearn.feature_selection import mutual_info_classif, SelectKBest

numerical = ['Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
numerical_X = X[numerical]

mi_selector = SelectKBest(mutual_info_classif, k=10)
mi_selector.fit(numerical_X, y)
mi_support = mi_selector.get_support()
mi_feature = np.array(numerical)[mi_support]
print(str(len(mi_feature)), 'selected features')
print(mi_feature)

10 selected features
['Crossing' 'Finishing' 'ShortPassing' 'Dribbling' 'LongPassing'
 'BallControl' 'Volleys' 'Reactions' 'ShotPower' 'LongShots']


### 1.4. ANOVA

- Also measures the dependence between two variables, here by testing (with an F-test) whether a feature's mean differs across the target classes.
- Assumes a **linear relationship between the variables and the target**, and also that the variables are **normally distributed**.
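
The F-statistic behind this test is the ratio of between-class variance to within-class variance. Here is a minimal sketch of the one-way ANOVA F computed by hand; the function name and the toy values are assumptions for illustration.

```python
import numpy as np

def anova_f(feature, target):
    """One-way ANOVA F-statistic of a numeric feature across the classes in target."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target)
    classes = np.unique(target)
    grand_mean = feature.mean()

    groups = [feature[target == c] for c in classes]
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

is_great = [0, 0, 0, 1, 1, 1]
reactions = [60, 62, 64, 90, 92, 94]  # class means far apart
balance = [70, 80, 90, 70, 80, 90]    # identical class means

print(anova_f(reactions, is_great))  # 337.5
print(anova_f(balance, is_great))    # 0.0
```

Features whose class means are far apart relative to their spread get a large F and are ranked first.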

In [10]:
from sklearn.feature_selection import f_classif, SelectKBest

anova_selector = SelectKBest(f_classif, k=num_feat).fit(X,y)
anova_support = anova_selector.get_support()
anova_feature = X.columns[anova_support]
print(str(len(anova_feature)), 'selected features')
print(anova_feature)

30 selected features
Index(['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing',
       'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina',
       'Volleys', 'FKAccuracy', 'Reactions', 'ShotPower', 'Strength',
       'LongShots', 'Weak Foot', 'Position_LAM', 'Position_LF', 'Position_RF',
       'Body Type_C. Ronaldo', 'Body Type_Courtois', 'Body Type_Messi',
       'Body Type_Neymar', 'Body Type_PLAYER_BODY_TYPE_25',
       'Nationality_Belgium', 'Nationality_Costa Rica', 'Nationality_Gabon',
       'Nationality_Slovenia', 'Nationality_Uruguay'],
      dtype='object')


### 1.5. Univariate ROC-AUC / RMSE
1. Build a decision tree with a single feature and target

2. Rank the features according to the model metrics

3. Select features ranked the highest

Metrics:
- Regression: RMSE

- Classification: ROC-AUC
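
To see what the classification metric rewards, ROC-AUC can be computed by hand as the fraction of (positive, negative) pairs that are ranked in the right order. The sketch below is a simplification under assumptions: it uses the raw feature value as the score instead of a fitted decision tree, and the function name and toy data are made up for illustration.

```python
def roc_auc_from_scores(scores, labels):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]

    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical single feature and binary target.
reactions = [60, 70, 75, 80, 85, 90]
is_great = [0, 0, 0, 1, 0, 1]

print(roc_auc_from_scores(reactions, is_great))  # 0.875
```

A value of 0.5 means the feature ranks players no better than chance, and 1.0 means it separates the two classes perfectly. When the score comes from a model fitted on the same rows it is evaluated on, the metric can be optimistic, so a held-out split is the safer choice.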

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

roc_values = []

for feature in X.columns:
    clf = DecisionTreeClassifier()
    clf.fit(X[feature].to_frame(), y)
    y_scored = clf.predict_proba(X[feature].to_frame())
    roc_values.append(roc_auc_score(y, y_scored[:, 1]))

roc_values = pd.Series(roc_values)
roc_values.index = X.columns

roc_feature = list(roc_values.sort_values(ascending=False)[:num_feat].keys())
roc_support = [True if i in roc_feature else False for i in X.columns]
print(str(len(roc_feature)), 'selected features')
print(roc_feature)

30 selected features
['Reactions', 'BallControl', 'ShortPassing', 'LongShots', 'Dribbling', 'Finishing', 'Volleys', 'ShotPower', 'LongPassing', 'FKAccuracy', 'Crossing', 'Interceptions', 'Stamina', 'Agility', 'Balance', 'Aggression', 'Acceleration', 'SprintSpeed', 'Strength', 'Weak Foot', 'Body Type_Lean', 'Nationality_Belgium', 'Position_CM', 'Nationality_England', 'Position_RB', 'Body Type_Normal', 'Nationality_Spain', 'Position_CB', 'Nationality_France', 'Position_GK']


## 2. Wrapper-based

Src: https://heartbeat.fritz.ai/hands-on-with-feature-selection-techniques-wrapper-methods-5bb6d99b1274

1. Search for a subset of features

2. Build a model with the chosen features

3. Evaluate the model

4. Repeat with a new subset of features

5. Stop according to one of the following criteria:
    - Model performance decreases (e.g., when removing features)
    - Model performance stops increasing (e.g., when adding features)
    - A predefined number of features is reached
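
The five steps above can be sketched as a greedy forward selection loop. This is one possible instance under assumptions, not the method of the cited article: the model is plain least squares, the score is validation MSE, and the function names, toy data, and seed are made up for illustration.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the chosen columns and return the validation MSE."""
    coef, *_ = np.linalg.lstsq(X_train[:, features], y_train, rcond=None)
    return np.mean((X_val[:, features] @ coef - y_val) ** 2)

def forward_selection(X_train, y_train, X_val, y_val, max_features):
    selected = []
    remaining = list(range(X_train.shape[1]))
    best_score = np.inf
    # Stop criterion: a predefined number of features is reached.
    while remaining and len(selected) < max_features:
        # Steps 1-3: build a candidate subset per remaining feature, fit, evaluate.
        scores = {
            f: validation_mse(X_train, y_train, X_val, y_val, selected + [f])
            for f in remaining
        }
        best_feature = min(scores, key=scores.get)
        # Stop criterion: performance no longer improves.
        if scores[best_feature] >= best_score:
            break
        best_score = scores[best_feature]
        # Step 4: repeat with the enlarged subset.
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

# Toy regression data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 3] + rng.normal(scale=0.1, size=300)

selected = forward_selection(
    X_toy[:200], y_toy[:200], X_toy[200:], y_toy[200:], max_features=2
)
print(selected)
```

Each round refits the model once per remaining feature, which is why wrapper methods are considerably more expensive than filter methods.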


### 2.1. Recursive Feature Elimination
"The goal of recursive feature elimination (RFE) is to select features by **recursively considering smaller and smaller sets of features**. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached." (`sklearn` documentation)

In [12]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_feat, step=10, verbose=5)
rfe_selector.fit(X_norm, y)
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')
print(rfe_feature)

Fitting estimator with 223 features.
Fitting estimator with 213 features.
Fitting estimator with 203 features.
Fitting estimator with 193 features.
Fitting estimator with 183 features.
Fitting estimator with 173 features.
Fitting estimator with 163 features.
Fitting estimator with 153 features.
Fitting estimator with 143 features.
Fitting estimator with 133 features.
Fitting estimator with 123 features.
Fitting estimator with 113 features.
Fitting estimator with 103 features.
Fitting estimator with 93 features.
Fitting estimator with 83 features.
Fitting estimator with 73 features.
Fitting estimator with 63 features.
Fitting estimator with 53 features.
Fitting estimator with 43 features.
Fitting estimator with 33 features.
30 selected features
['Finishing', 'ShortPassing', 'LongPassing', 'BallControl', 'SprintSpeed', 'Agility', 'Volleys', 'FKAccuracy', 'Reactions', 'Strength', 'Weak Foot', 'Position_CAM', 'Position_CM', 'Position_GK', 'Position_LCB', 'Position_LM', 'Position_RB', 'Posi
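
The elimination loop described in the scikit-learn quote above can be sketched in a few lines. This is an illustration under assumptions, not scikit-learn's implementation: it uses least-squares coefficients as the importance measure, and the function name and toy data are made up.

```python
import numpy as np

def rfe_least_squares(X, y, n_features_to_select, step=1):
    """Repeatedly drop the features with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Train the estimator on the current feature set.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Never prune below the requested number of features.
        n_drop = min(step, len(remaining) - n_features_to_select)
        # Positions (within `remaining`) of the least important features.
        weakest = set(np.argsort(np.abs(coef))[:n_drop].tolist())
        remaining = [f for pos, f in enumerate(remaining) if pos not in weakest]
    return remaining

# Toy regression data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 6))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

kept = rfe_least_squares(X_toy, y_toy, n_features_to_select=2, step=2)
print(kept)
```

Coefficient magnitudes are only comparable when the features share a scale, which is one reason to fit the selector on the normalized matrix `X_norm` rather than on `X`.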

## 3. Embedded
### 3.1. Lasso: SelectFromModel
- L1 regularization shrinks some of the coefficients to exactly zero, which means the corresponding features are multiplied by zero when estimating the target.
- These features are removed because they don't contribute to the final prediction.
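
Why L1 produces exact zeros is easiest to see in the special case of a linear model with orthonormal features, where the lasso solution is the soft-thresholded least-squares solution. The sketch below shows only that mechanism; the L1-penalized logistic regression used in this notebook has no such closed form, and the coefficient values and penalty are assumptions.

```python
import numpy as np

def soft_threshold(coefficients, penalty):
    """Shrink every coefficient toward zero by `penalty`; small ones become exactly zero."""
    coefficients = np.asarray(coefficients, dtype=float)
    return np.sign(coefficients) * np.maximum(np.abs(coefficients) - penalty, 0.0)

# Hypothetical unpenalized coefficients for four features.
unpenalized = np.array([3.0, -0.5, 0.2, -2.0])
shrunk = soft_threshold(unpenalized, penalty=1.0)

# Features whose coefficient survived the shrinkage.
kept = np.flatnonzero(shrunk)
print(shrunk)
print(kept)
```

Coefficients smaller in magnitude than the penalty are set to zero and their features drop out, while the rest are kept with reduced weight.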

In [13]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver='liblinear'), max_features=num_feat)
embeded_lr_selector.fit(X_norm, y)

embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')
print(embeded_lr_feature)

27 selected features
['LongPassing', 'Reactions', 'Balance', 'Aggression', 'Preferred Foot_Right', 'Position_CAM', 'Position_CM', 'Position_GK', 'Position_LCB', 'Position_LM', 'Position_LW', 'Position_RB', 'Position_RCB', 'Position_RW', 'Body Type_Lean', 'Body Type_Stocky', 'Nationality_Belgium', 'Nationality_Brazil', 'Nationality_Croatia', 'Nationality_England', 'Nationality_France', 'Nationality_Germany', 'Nationality_Italy', 'Nationality_Netherlands', 'Nationality_Portugal', 'Nationality_Slovenia', 'Nationality_Uruguay']


### 3.2. Tree-Based: SelectFromModel
- Tree-based models provide information about feature importance.
- Feature importance tells us which variables matter most for making accurate predictions of the target variable/class.
- When training a tree, a feature's importance is calculated as the total decrease in node impurity from the splits on that feature, weighted by the fraction of samples reaching each node. The higher the value, the more important the feature.
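
The impurity decrease of a single split can be worked out by hand. This is a minimal sketch using Gini impurity on binary labels; the function names and the toy split are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(parent) - weighted_children

# Hypothetical node with 5 great and 5 ordinary players, split on one feature.
parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 1, 0]   # samples above the split threshold
right = [1, 0, 0, 0, 0]  # samples below the split threshold

print(round(impurity_decrease(parent, left, right), 2))  # 0.18
```

A feature's importance adds up such decreases over every node that splits on it; in a random forest the result is then averaged over the trees.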

In [14]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=num_feat)
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
print(embeded_rf_feature)

22 selected features
['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Weak Foot', 'Body Type_Courtois', 'Nationality_Slovenia']


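## 4. Other

### 4.1. Boruta
- Feature ranking and selection algorithm based on the random forest algorithm.
- It compares each feature's importance with that of shuffled "shadow" copies of the features and, over repeated iterations, marks every variable as confirmed, tentative, or rejected, which helps select variables that are statistically significant.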
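
The core idea, comparing real features against shuffled "shadow" copies, can be sketched in a single round. This is a simplified illustration under assumptions, not the BorutaPy implementation: absolute correlation stands in for random-forest importance, only one iteration is run, and the toy data and names are made up.

```python
import numpy as np

def importance(X, y):
    """Stand-in importance: absolute Pearson correlation with the target.

    Boruta itself uses random-forest importances; this is a swapped-in proxy.
    """
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    numerator = X_centered.T @ y_centered
    denominator = np.sqrt((X_centered ** 2).sum(axis=0) * (y_centered ** 2).sum())
    return np.abs(numerator / denominator)

rng = np.random.default_rng(0)
n_samples, n_features = 500, 4

# Toy data: only column 0 carries information about the target.
X_toy = rng.normal(size=(n_samples, n_features))
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=n_samples) > 0).astype(float)

# Shadow features: each column shuffled on its own, which destroys any
# relationship with the target while keeping the column's distribution.
shadows = np.column_stack(
    [rng.permutation(X_toy[:, j]) for j in range(n_features)]
)

scores = importance(np.hstack([X_toy, shadows]), y_toy)
real_scores, shadow_scores = scores[:n_features], scores[n_features:]

# A real feature scores a "hit" when it beats the best shadow feature.
hits = real_scores > shadow_scores.max()
print(hits)
```

The full algorithm repeats this comparison many times with fresh shuffles and tests the accumulated hit counts statistically, which is what produces the Confirmed, Tentative, and Rejected counts in the log.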

In [14]:
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

boruta_feature_selector = BorutaPy(rf, n_estimators='auto', verbose=2, max_iter = 50, perc = 90)
boruta_feature_selector.fit(X.values, y.values.ravel())
boruta_support = boruta_feature_selector.support_
boruta_feature = X.loc[:,boruta_support].columns.tolist()
print(str(len(boruta_feature)), 'selected features')
print(boruta_feature)

Iteration: 	1 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	2 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	3 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	4 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	5 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	6 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	7 / 50
Confirmed: 	0
Tentative: 	223
Rejected: 	0
Iteration: 	8 / 50
Confirmed: 	25
Tentative: 	15
Rejected: 	183
Iteration: 	9 / 50
Confirmed: 	25
Tentative: 	15
Rejected: 	183
Iteration: 	10 / 50
Confirmed: 	25
Tentative: 	15
Rejected: 	183
Iteration: 	11 / 50
Confirmed: 	25
Tentative: 	15
Rejected: 	183
Iteration: 	12 / 50
Confirmed: 	25
Tentative: 	9
Rejected: 	189
Iteration: 	13 / 50
Confirmed: 	25
Tentative: 	9
Rejected: 	189
Iteration: 	14 / 50
Confirmed: 	25
Tentative: 	9
Rejected: 	189
Iteration: 	15 / 50
Confirmed: 	25
Tentative: 	9
Rejected: 	189
Iteration: 	16 / 50
Confirmed: 	25
Tentative: 	6
Rej

# All together

In [15]:
feature_selection_df = pd.DataFrame({'Feature':X.columns, 'Pearson':cor_support, 'Chi-2':chi_support,'ANOVA': anova_support, 'ROC': roc_support,'RFE':rfe_support, 'Logistics':embeded_lr_support,
                                    'Random Forest':embeded_rf_support})
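# Count how many methods selected each feature (exclude the non-numeric 'Feature' column from the sum)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)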
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feat)

| | Feature | Pearson | Chi-2 | ANOVA | ROC | RFE | Logistics | Random Forest | Total |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Reactions | True | True | True | True | True | True | True | 7 |
| 2 | LongPassing | True | True | True | True | True | True | True | 7 |
| 3 | Volleys | True | True | True | True | True | False | True | 6 |
| 4 | ShortPassing | True | True | True | True | True | False | True | 6 |
| 5 | Nationality_Slovenia | True | True | True | False | True | True | True | 6 |
| 6 | Nationality_Belgium | True | True | True | True | True | True | False | 6 |
| 7 | Finishing | True | True | True | True | True | False | True | 6 |
| 8 | FKAccuracy | True | True | True | True | True | False | True | 6 |
| 9 | BallControl | True | True | True | True | True | False | True | 6 |
| 10 | Weak Foot | True | False | True | True | True | False | True | 5 |
