***

**<center><font size = "6">How to Choose a Feature Selection Method For Machine Learning</font></center>**
***
<center><font size = "2">Prepared by: Sitsawek Sukorn</font></center>

#### Feature Selection

- Feature selection is the process of reducing the number of input variables when developing a predictive model.

- Statistical-based feature selection methods evaluate the relationship between each input variable and the target variable.
- The input variables with the strongest relationship to the target are selected.
- The choice of statistical measure depends on the data types of both the input and output variables.

#### Feature Selection Methods

**Unsupervised**
- Do not use the target variable (e.g. remove redundant variables). --> correlation
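As a minimal sketch of the unsupervised idea, the snippet below drops one column from each pair of highly correlated inputs without ever looking at a target. The synthetic columns `a`, `b`, `c` and the `0.95` threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic inputs: "b" is almost a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Absolute pairwise Pearson correlations between the inputs (no target involved)
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column
threshold = 0.95  # assumed cut-off, tune for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
print(df_reduced.columns.tolist())
```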

**Supervised**
- Use the target variable (e.g. remove irrelevant variables).

*Wrapper* : Search for well-performing subsets of features. --> RFE

*Filter* : Select subsets of features based on their relationship with target. --> Statistical Methods, Feature Importance Methods

*Intrinsic* : Algorithms that perform automatic feature selection during training. --> Decision Trees
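The wrapper and intrinsic styles can be sketched side by side with scikit-learn. The dataset sizes, the choice of `DecisionTreeClassifier` as the estimator inside `RFE`, and `n_features_to_select=3` are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data: 10 inputs, 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Wrapper: RFE repeatedly fits the model and discards the weakest feature
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the features RFE kept

# Intrinsic: a decision tree scores features as a by-product of training
tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(tree.feature_importances_)
```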

**Dimensionality Reduction**
- Project input data into a lower-dimensional feature space.
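A small sketch of projecting the inputs into a lower-dimensional space, here using PCA as one possible technique. The notes do not name a specific method, so PCA and `n_components=5` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data with 20 input variables
X, y = make_classification(n_samples=100, n_features=20, random_state=1)

# Project the 20 inputs onto 5 new components (linear combinations of the inputs)
pca = PCA(n_components=5)
X_projected = pca.fit_transform(X)

print(X_projected.shape)
```

Unlike feature selection, the projected columns are new variables rather than a subset of the original ones.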

#### Statistics for Filter-Based Feature Selection Methods

**Common input variable data type**

*Numerical Variables*
- Integer Variables.
- Floating Point Variables.

--> A numerical output variable indicates a regression predictive modeling problem.

*Categorical Variables*
- Boolean Variables (dichotomous).
- Ordinal Variables.
- Nominal Variables.

--> A categorical output variable indicates a classification predictive modeling problem.

**How to Choose a Feature Selection Method**

*Numerical input*
- Numerical output --> Pearson's (linear), Spearman's (non-linear)
- Categorical output --> ANOVA (linear), Kendall's (non-linear)

*Categorical input*
- Numerical output --> ANOVA (linear), Kendall's (non-linear)
- Categorical output --> Chi-Squared, Mutual Information
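The decision table above can be written down as a simple lookup. The dictionary `SUGGESTED_MEASURES` and the helper `suggest_measures` are hypothetical names introduced here for illustration; the mapping itself just restates the list above.

```python
# (input type, output type) -> suggested statistical measures, as listed above
SUGGESTED_MEASURES = {
    ("numerical", "numerical"): ["Pearson's", "Spearman's"],
    ("numerical", "categorical"): ["ANOVA", "Kendall's"],
    ("categorical", "numerical"): ["ANOVA", "Kendall's"],
    ("categorical", "categorical"): ["Chi-Squared", "Mutual Information"],
}


def suggest_measures(input_type, output_type):
    """Return the measures suggested for a pair of variable data types."""
    key = (input_type.lower(), output_type.lower())
    if key not in SUGGESTED_MEASURES:
        raise ValueError(f"Unknown data type combination: {key}")
    return SUGGESTED_MEASURES[key]


print(suggest_measures("numerical", "categorical"))
```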

#### Tips and Tricks for Feature Selection

- The following tips apply when using filter-based feature selection.

**Correlation Statistics**

*scikit-learn* --> import sklearn.feature_selection
- Pearson's Correlation Coefficient : f_regression()
- ANOVA : f_classif()
- Chi-Squared : chi2()
- Mutual Information : mutual_info_classif() and mutual_info_regression()
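The scikit-learn score functions can also be called directly to inspect the per-feature scores. This is a minimal sketch on synthetic data; the dataset sizes and `random_state` values are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic regression data: 5 inputs, 2 of them informative
X, y = make_regression(n_samples=100, n_features=5, n_informative=2, random_state=1)

# f_regression returns one F-statistic and one p-value per input feature
f_scores, p_values = f_regression(X, y)

# mutual_info_regression returns one non-negative score per input feature
mi_scores = mutual_info_regression(X, y, random_state=1)

print(f_scores)
print(p_values)
print(mi_scores)
```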

*SciPy* --> scipy.stats
- Kendall's tau : kendalltau()
- Spearman's rank correlation : spearmanr()
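A short sketch of why the rank-based SciPy measures help with non-linear relationships: on a monotonic but non-linear pair, Spearman's and Kendall's coefficients are perfect while Pearson's is not. The exponential test data is an assumption for illustration, and `pearsonr` is included only for comparison.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# A monotonic but strongly non-linear relationship
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Each function returns (statistic, p-value)
r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```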


**Selection Method**

Two of the more popular methods are:
- Select the top k variables: SelectKBest()
- Select the top percentile variables: SelectPercentile()

*machinelearningmastery often uses SelectKBest()*
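The two selection methods can be compared on the same scores. In this sketch, the dataset and the values `k=5` and `percentile=25` are assumptions; with 20 input features, both settings keep 5 columns.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic regression data with 20 input variables
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=1)

# Keep a fixed number of top-scoring features
X_k = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)

# Keep a fixed share of top-scoring features (25% of 20 = 5)
X_p = SelectPercentile(score_func=f_regression, percentile=25).fit_transform(X, y)

print(X_k.shape, X_p.shape)
```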

**Transform Variables**
- You can treat a categorical variable as ordinal, even if it is not, and see whether any interesting results come out.
- You can also make a numerical variable discrete (e.g. by binning it) and try categorical-based measures.
- Some statistical measures assume properties of the variables; Pearson's, for example, assumes a Gaussian probability distribution of the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of its expectations, and compare the results.
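A minimal sketch of the binning tip: numerical inputs are discretized and then scored with a categorical-based measure. The use of `KBinsDiscretizer` with 5 quantile bins is an assumption, and applying `chi2` to ordinal bin codes is a rough shortcut rather than a rigorous test, so treat the scores as exploratory.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data: numerical inputs, categorical (binary) output
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=1)

# Discretize every numerical input into 5 ordinal bins coded 0..4
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# chi2 requires non-negative inputs, which the bin codes satisfy
fs = SelectKBest(score_func=chi2, k=3)
X_selected = fs.fit_transform(X_binned, y)

print(X_selected.shape)
```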

**What is the Best Method?**
- You must discover what works best for your specific problem through careful, systematic experimentation.
- Try a range of different models fit on different subsets of features chosen via different statistical measures, and compare their performance.
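One way to run such an experiment systematically is a grid search over the score function and the number of selected features. This is a sketch under assumptions: the logistic regression model, the candidate values of `k`, and 5-fold accuracy are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic classification data with 20 input variables
X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=1)

# Feature selection sits inside the pipeline so it is re-fit on each training fold
pipeline = Pipeline([
    ("select", SelectKBest()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate statistical measures and subset sizes to compare
param_grid = {
    "select__score_func": [f_classif, mutual_info_classif],
    "select__k": [2, 4, 8, 16],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Keeping the selector inside the pipeline avoids leaking information from the held-out folds into the feature scores.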

#### Worked Examples of Feature Selection

**Regression Feature Selection: (Numerical Input, Numerical Output)**


In [28]:
# Pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
import pandas as pd
# Generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# Define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# Apply feature selection
X_selected = fs.fit_transform(X, y)
# print(X_selected.shape)
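# Combine the features and the target into one DataFrame; the target is the last column, named 'label'
a = pd.DataFrame(X)
b = pd.DataFrame(y).rename(columns={0:'label'})
c = pd.concat([a, b], axis=1)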

In [30]:
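# Re-create the selector with the same configuration
fs = SelectKBest(score_func=f_regression, k=10)

In [47]:
# Select the 10 best features: all columns but the last are inputs, 'label' is the target
x_choose = fs.fit_transform(c.iloc[:, :-1], c.iloc[:, -1])

In [52]:
# Show the last five rows of the selected features
pd.DataFrame(x_choose).tail()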

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | -0.247694 | -0.387708 | 0.632681 | -0.276245 | -0.172128 | -0.985949 | 0.889518 | 0.366668 | -0.323041 | 0.040358 |
| 96 | -0.531734 | -0.955553 | -0.840907 | -1.294934 | 0.189587 | -0.80192 | -0.694504 | -0.82556 | -0.068377 | 1.150303 |
| 97 | -0.464535 | 0.503714 | -1.747163 | -0.258527 | 0.448751 | -0.154598 | -0.513932 | -0.632027 | 0.027348 | -1.701479 |
| 98 | 0.572872 | 1.621995 | -0.821452 | -1.655734 | 2.247372 | -1.198173 | -0.152114 | -0.986263 | -1.369093 | 0.600471 |
| 99 | 1.089259 | -1.611546 | 0.557194 | 1.323147 | 0.197278 | -0.19625 | 0.725134 | 1.688488 | -2.504461 | -0.987967 |


**Classification Feature Selection: (Numerical Input, Categorical Output)**

In [56]:
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
# Generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# Define feature selection
fs = SelectKBest(score_func=f_classif, k=3)
# Apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

(100, 3)


**Classification Feature Selection: (Categorical Input, Categorical Output)**