# Feature Selection

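This section introduces common feature selection methods. The first filters by variance: a feature whose values are nearly identical across samples (that is, whose variance falls below a threshold) carries little information and can be removed. For Boolean features the variance is p(1 - p), so the threshold `.8 * (1 - .8)` used below removes features that take the same value in more than 80% of the samples.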





In [1]:
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)


array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
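To see why the first column is dropped, the per-feature variances can be checked by hand. The following is a minimal sketch, assuming NumPy is available; the names `variances`, `threshold`, and `kept` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # variance of a Boolean feature with p = 0.8

# Population variance of each column (ddof=0), the quantity the selector compares to the threshold
variances = X.var(axis=0)

sel = VarianceThreshold(threshold=threshold).fit(X)
kept = sel.get_support()  # Boolean mask: True for columns that are retained

print("variances:", variances)
print("threshold:", threshold)
print("kept:", kept)
```

Only columns whose variance exceeds the threshold are kept, which is why the transformed array has two columns instead of three.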

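Features can also be selected with univariate statistical tests, such as the ANOVA F-test or the chi-squared test (`chi2`) used below.
`SelectKBest` keeps the k highest-scoring features,
while `SelectPercentile` keeps the top-scoring features by percentage.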



In [11]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)

print('Shape: ', X.shape)

# Keep the k best features
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Best 2: ', X_new.shape)

# Keep the top-scoring percentage of features
X_new = SelectPercentile(chi2, percentile=30).fit_transform(X, y)
X_new.shape


Shape:  (150, 4)
Best 2:  (150, 2)


(150, 1)
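The shapes alone do not show which features were kept or how they scored. As a rough sketch, the fitted selector's `scores_` attribute and `get_support()` method can be inspected; the variable names `selector` and `X_anova` are illustrative, and the `f_classif` variant is included only as an assumption about how an ANOVA-based selection would look.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Fit the selector first so that the per-feature scores can be inspected
selector = SelectKBest(chi2, k=2).fit(X, y)
for name, score, keep in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name}: chi2={score:.2f}, selected={keep}")

# Same selection using the ANOVA F-test as the scoring function instead of chi2
X_anova = SelectKBest(f_classif, k=2).fit_transform(X, y)
print("ANOVA best 2:", X_anova.shape)
```

Note that `chi2` requires non-negative feature values, whereas `f_classif` has no such restriction.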

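Next, let's see how to use a `Pipeline`, which chains feature selection, scaling, and the classifier into a single estimator.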

In [25]:
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)


model_pipeline = Pipeline(
    [
        ("anova", SelectPercentile(chi2)),
        ("scaler", StandardScaler()),
        ("svc", SVC(C=1, random_state=87, tol=1e-5)),
    ]
)

percentiles = (1, 10, 20, 50, 100)
kernels = ('linear', 'poly', 'rbf', 'sigmoid')

for percentile in percentiles:
    for ker in kernels:
        model_pipeline.set_params(anova__percentile=percentile, svc__kernel=ker)
        model_pipeline.fit(X_train, y_train)
        print('Percent %s, kernel %s Testing accuracy: %s' % (percentile, ker, round(model_pipeline.score(X_test, y_test),3)))



Percent 1, kernel linear Testing accuracy: 0.933
Percent 1, kernel poly Testing accuracy: 0.933
Percent 1, kernel rbf Testing accuracy: 0.933
Percent 1, kernel sigmoid Testing accuracy: 0.9
Percent 10, kernel linear Testing accuracy: 0.933
Percent 10, kernel poly Testing accuracy: 0.933
Percent 10, kernel rbf Testing accuracy: 0.933
Percent 10, kernel sigmoid Testing accuracy: 0.9
Percent 20, kernel linear Testing accuracy: 0.933
Percent 20, kernel poly Testing accuracy: 0.933
Percent 20, kernel rbf Testing accuracy: 0.933
Percent 20, kernel sigmoid Testing accuracy: 0.9
Percent 50, kernel linear Testing accuracy: 0.967
Percent 50, kernel poly Testing accuracy: 0.967
Percent 50, kernel rbf Testing accuracy: 0.967
Percent 50, kernel sigmoid Testing accuracy: 0.967
Percent 100, kernel linear Testing accuracy: 0.967
Percent 100, kernel poly Testing accuracy: 0.967
Percent 100, kernel rbf Testing accuracy: 0.967
Percent 100, kernel sigmoid Testing accuracy: 0.9


In [26]:
# List the tunable parameters of the pipeline
model_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'anova', 'scaler', 'svc', 'anova__percentile', 'anova__score_func', 'scaler__copy', 'scaler__with_mean', 'scaler__with_std', 'svc__C', 'svc__break_ties', 'svc__cache_size', 'svc__class_weight', 'svc__coef0', 'svc__decision_function_shape', 'svc__degree', 'svc__gamma', 'svc__kernel', 'svc__max_iter', 'svc__probability', 'svc__random_state', 'svc__shrinking', 'svc__tol', 'svc__verbose'])
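The `step__parameter` names listed above are the same keys that scikit-learn's `GridSearchCV` accepts, so the manual double loop could instead be written as a cross-validated grid search. This is a sketch under that assumption, not the notebook's own approach; `pipe`, `param_grid`, and `search` are illustrative names, and `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

# Same three steps as the pipeline above (the step name "anova" is kept for consistency)
pipe = Pipeline(
    [
        ("anova", SelectPercentile(chi2)),
        ("scaler", StandardScaler()),
        ("svc", SVC(C=1, tol=1e-5)),
    ]
)

# Keys follow the <step name>__<parameter name> convention
param_grid = {
    "anova__percentile": [1, 10, 20, 50, 100],
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
}

# Each combination is scored by 5-fold cross-validation on the training set only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

Unlike the loop above, which compares settings by test-set accuracy, this selects parameters by cross-validation on the training data and leaves the test set for a final check.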

There are other feature selectors, but this introduction stops here. For more, see [feature_selection](https://scikit-learn.org/stable/modules/feature_selection.html).


