## Feature Selection Techniques

<!-- <hr>

### Agenda
1. Introduction to Feature Selection
2. VarianceThreshold
3. Chi-squared stats
4. ANOVA using f_classif
5. Univariate Linear Regression Tests using f_regression
6. F-score vs Mutual Information
7. Mutual Information for discrete values
8. Mutual Information for continuous values
9. SelectKBest
10. SelectPercentile
11. SelectFromModel
12. Recursive Feature Elimination

<hr> -->

### Feature Selection
* Feature selection keeps the most informative features of a dataset and discards the rest
* It can improve an estimator's accuracy
* It can boost performance on high-dimensional datasets
* Below we discuss univariate selection methods
* We also cover model-based selection and recursive feature elimination

In [29]:
from sklearn import feature_selection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

### Mutual Information for classification using mutual_info_classif
* Returns a non-negative score for the dependency between each feature and the target; 0 means the two are independent, and higher values mean stronger dependency
* Captures any kind of dependency even if non-linear
* Target is discrete in nature
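To make the "non-linear dependency" point concrete, here is a minimal sketch on synthetic data (the names `x_nonlinear`, `x_noise`, `X_demo` and `y_demo` are made up for this illustration and are not part of the dataset used below). The class depends only on the absolute value of one feature, so the class means of that feature are roughly equal and the F-test has little to pick up, while mutual information should still score it well above the noise feature.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 1000

# x_nonlinear determines the class, but only through its absolute value.
x_nonlinear = rng.uniform(-1, 1, size=n_samples)
# x_noise is unrelated to the class.
x_noise = rng.uniform(-1, 1, size=n_samples)

X_demo = np.column_stack([x_nonlinear, x_noise])
y_demo = (np.abs(x_nonlinear) > 0.5).astype(int)

# ANOVA F-test compares class means, which are both close to 0 here.
f_scores, f_pvalues = f_classif(X_demo, y_demo)
# Mutual information does not assume any particular form of dependency.
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("F-scores:          ", f_scores)
print("Mutual information:", mi_scores)
```

The exact numbers depend on the random seed and on the nearest-neighbour estimator that scikit-learn uses, so treat them as indicative only.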

In [56]:
df = pd.read_csv('data.csv')

In [57]:
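# Drop the identifier columns "住院号" (hospital admission number) and "CT号" (CT number),
# and also drop the row whose index label is 1.
df.drop(columns=["住院号", "CT号"], index=1, inplace=True)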


In [58]:
## LabelEncoder.fit_transform() learns the distinct values of a column and replaces
## each value with an integer code from 0 to n_classes - 1.
## Note: for numeric columns this keeps only the rank order of the distinct values.

for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
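As a quick illustration of what the loop above does to each column, here is a small sketch on made-up values (the labels and numbers are invented for this example). `classes_` holds the distinct values in sorted order, and each value is replaced by its position in that list.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["mild", "severe", "mild", "none"])

print(encoder.classes_)                  # distinct values, sorted
print(codes)                             # integer code for each input value
print(encoder.inverse_transform(codes))  # back to the original labels

# On numeric input the codes are the ranks of the distinct values,
# so the original magnitudes are no longer visible after encoding.
numeric_codes = LabelEncoder().fit_transform([3.5, 1.2, 9.0, 1.2])
print(numeric_codes)
```

This is worth keeping in mind when a column holds continuous measurements: after encoding, 1.2 and 3.5 are as far apart as 3.5 and 9.0.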

In [60]:
feature_selection.mutual_info_classif(df.drop('M_rate', axis=1), df.M_rate) # mutual_info_classif

array([0.00581123, 0.10523715, 0.07344954, 0.        , 0.        ,
       0.        , 0.        , 0.04024628, 0.04073715, 0.03801411,
       0.        , 0.        , 0.06247625, 0.0474178 , 0.110514  ,
       0.08397866, 0.24394015, 0.24567307, 0.18962624, 0.21935476,
       0.04799043, 0.06297876])

In [65]:
F, pval = feature_selection.f_classif(df.drop('M_rate', axis=1), df.M_rate) # f_classif returns F-statistics, not chi-squared values

In [66]:
F, pval = feature_selection.f_regression(df.drop('M_rate', axis=1), df.M_rate) # f_regression

In [68]:
feature_selection.mutual_info_regression(df.drop('M_rate', axis=1), df.M_rate) # mutual_info_regression

array([0.02137916, 0.1389955 , 0.16038144, 0.        , 0.02403568,
       0.10512181, 0.05809497, 0.04339978, 0.        , 0.        ,
       0.        , 0.        , 0.06520219, 0.04898776, 0.12155731,
       0.184342  , 0.22503207, 0.11800883, 0.21279133, 0.17206704,
       0.09783133, 0.1042841 ])

In [61]:
df.columns

Index(['窦-连合LCC', '窦-连合RCC', '窦-连合NCC', '周长LCC', '周长RCC', '周长NCC', 'AV面积',
       'Valsalva窦', 'AV-annulus', 'STJ', 'AO根部直径', 'AO根部面积', 'mPA直径', 'mPA面积',
       'LPA近端直径', 'LPA近端面积', 'RPA近端直径', 'RPA近端面积', 'LPA远端直径', 'LPA远端面积',
       'RPA远端直径', 'RPA远端面积', 'M_rate'],
      dtype='object')
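The score arrays above are in the same order as the feature columns, so they are easier to read once paired with the column names. A minimal sketch, assuming a toy DataFrame (`toy`, `feature_a`, `feature_b`, `feature_c` and `target` are hypothetical names, not columns of `data.csv`):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=300),
    "feature_b": rng.normal(size=300),
    "feature_c": rng.normal(size=300),
})
# Only feature_b carries information about the target.
toy["target"] = (toy["feature_b"] > 0).astype(int)

X_toy = toy.drop(columns="target")
scores = mutual_info_classif(X_toy, toy["target"], random_state=0)

# Label each score with its column name and sort from most to least informative.
ranked = pd.Series(scores, index=X_toy.columns).sort_values(ascending=False)
print(ranked)
```

The same pattern should work for the real data by using `df.drop('M_rate', axis=1)` in place of `X_toy`.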

### 11. SelectFromModel
* Selecting important features from model weights
* The estimator must expose a coef_ or feature_importances_ attribute after fitting
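A minimal sketch of how the threshold works, using synthetic data from `make_regression` rather than a real dataset (the variable names and the threshold value of 1.0 are arbitrary choices for this illustration). For a linear model, the importance of a feature is the absolute value of its coefficient, and a feature is kept when that importance is at least the threshold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# 10 features, of which only 3 actually influence the target.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

threshold = 1.0
sfm_demo = SelectFromModel(LinearRegression(), threshold=threshold)
X_selected = sfm_demo.fit_transform(X_syn, y_syn)

# estimator_ is the fitted copy of the model; its |coef_| are the importances.
importances = np.abs(sfm_demo.estimator_.coef_)
support = sfm_demo.get_support()

print(np.round(importances, 2))
print(support)
print(X_selected.shape)
```

Coefficient magnitudes depend on the scale of each feature, so a fixed numeric threshold is easiest to interpret when the features are on comparable scales.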

In [50]:
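# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
from sklearn.datasets import load_boston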

In [51]:
boston = load_boston()

In [52]:
from sklearn.linear_model import LinearRegression

In [53]:
clf = LinearRegression()
sfm = feature_selection.SelectFromModel(clf, threshold=0.25)

In [54]:
sfm.fit_transform(boston.data, boston.target).shape

(506, 7)

In [55]:
boston.data.shape

(506, 13)

### 12. Recursive Feature Elimination
* Uses an external estimator to calculate weights of features
* First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. 
* Then, the least important features are pruned from the current set of features.
* That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
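The steps above can be sketched as a plain loop. This is only an illustration of the idea, assuming a `LinearRegression` estimator and one feature removed per round; the names `remaining`, `weakest` and `n_to_keep` are made up here. The result is then compared with what `RFE` selects for the same estimator.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_rfe, y_rfe = make_regression(n_samples=50, n_features=10, random_state=0)
n_to_keep = 5

# Start from all features and drop the weakest one per round.
remaining = list(range(X_rfe.shape[1]))
while len(remaining) > n_to_keep:
    model = LinearRegression().fit(X_rfe[:, remaining], y_rfe)
    # Position (within the current subset) of the smallest |coefficient|.
    weakest = int(np.argmin(np.abs(model.coef_)))
    del remaining[weakest]

print("Kept by the manual loop:", remaining)

# Same procedure via scikit-learn, for comparison.
rfe = RFE(LinearRegression(), n_features_to_select=n_to_keep, step=1).fit(X_rfe, y_rfe)
rfe_kept = np.flatnonzero(rfe.support_).tolist()
print("Kept by RFE:            ", rfe_kept)
```

Because the model is refit after every removal, the importance of the surviving features can change from round to round, which is what distinguishes this from simply keeping the top coefficients of a single fit.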

In [63]:
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_regression(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)  # keyword argument is required in recent scikit-learn versions
data = selector.fit_transform(X, y)

In [65]:
X.shape

(50, 10)

In [66]:
data.shape

(50, 5)

In [67]:
selector.ranking_

array([1, 1, 4, 3, 1, 6, 1, 2, 5, 1])
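To read `ranking_`: selected features have rank 1, and with `step=1` the eliminated features get ranks 2 and up, where the highest rank was eliminated first. In the output above, feature 5 (rank 6) was dropped first and feature 7 (rank 2) last. A short sketch that rebuilds the same selector (the names `sel`, `selected` and `dropped_first_to_last` are chosen for this example) and extracts both pieces of information:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X_r, y_r = make_regression(n_samples=50, n_features=10, random_state=0)
sel = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1).fit(X_r, y_r)

# support_ is a boolean mask of the selected features (those with rank 1).
selected = np.flatnonzero(sel.support_)

# Eliminated features, ordered from first dropped (highest rank) to last dropped.
order = np.argsort(-sel.ranking_)
dropped_first_to_last = [int(i) for i in order if sel.ranking_[i] > 1]

print("Selected features:      ", selected)
print("Dropped, first to last: ", dropped_first_to_last)
```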