## Backward Elimination
[Source: Udemy, 机器学习 A-Z (Machine Learning A-Z in Chinese)](https://www.udemy.com/machinelearningchinese/learn/v4/overview)
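1. Select a significance level (SL) to stay in the model ($\alpha = 0.05$).
    * This is the threshold: a feature (independent variable) whose P-value is greater than SL is eliminated.
    * A feature (independent variable) whose P-value is smaller than SL is kept.
2. Fit the full model with all possible predictors.
    * Train (fit) the model using every feature.
3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise go to FIN (the model is finished).
    * Pick the feature with the highest P-value; if it exceeds SL, carry out step 4.
4. Remove the predictor.
    * Drop that feature.
5. Fit the model without this variable.
    * Retrain (refit) the model on the remaining features.
    * Then go back to step 3.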
    
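Once the whole procedure is complete, the features with little influence on the model have been removed and only the more relevant features remain.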
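
The five steps above amount to a short loop. The sketch below is one way to write that loop, under stated assumptions: `backward_eliminate` and its `fit_pvalues` callback are hypothetical names, not part of any library. The callback stands in for steps 2 and 5 (fit a model on the given columns and return one P-value per column), and the P-values in the demo are made up.

```python
from typing import Callable, List, Sequence


def backward_eliminate(
    columns: Sequence[int],
    fit_pvalues: Callable[[List[int]], Sequence[float]],
    sl: float = 0.05,
) -> List[int]:
    """Return the columns that survive backward elimination.

    `fit_pvalues(cols)` is a hypothetical callback: it must fit a model on
    `cols` and return one P-value per entry of `cols`, in the same order.
    """
    kept = list(columns)
    while kept:
        pvalues = list(fit_pvalues(kept))  # steps 2 and 5: (re)fit the model
        # Step 3: position of the predictor with the highest P-value.
        worst = max(range(len(kept)), key=lambda i: pvalues[i])
        if pvalues[worst] <= sl:  # nothing exceeds SL: FIN
            break
        del kept[worst]  # step 4: remove that predictor
    return kept


# Made-up P-values for illustration only. A real callback would refit the
# model each round, so the P-values would change between rounds.
fake = {1: 0.001, 2: 0.30, 3: 0.735, 4: 0.002}
print(backward_eliminate([1, 2, 3, 4], lambda cols: [fake[c] for c in cols]))
```

Keeping the fitting step behind a callback separates the elimination logic from the choice of model, so the same loop works with any estimator that reports P-values.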

The example uses another standard package, statsmodels:  
`statsmodels.api.OLS(endog=y_train, exog=X_train)` (older releases also exposed it as `statsmodels.formula.api.OLS`)  

In [9]:
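#  OLS is exposed by statsmodels.api; recent statsmodels releases no longer provide it in statsmodels.formula.api
import statsmodels.api as sm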
import numpy as np

In [43]:
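#  Load the example data
#  Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release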
from sklearn.datasets import load_boston
datasets = load_boston()
X_data = datasets.data
y_data = datasets.target
print(X_data.shape, y_data.shape)

(506, 13) (506,)


In [44]:
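#  Manually add a bias (intercept) column whose values are all 1
#  sklearn normally controls this through a parameter (fit_intercept); OLS does not add it, so we add it by hand
X_data = np.append(np.ones((X_data.shape[0], 1)), X_data, axis=1)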
X_data.shape

(506, 14)

Use `summary()` to display the model information. We focus on whether each P-value is greater than $\alpha = 0.05$.

In [45]:
#  Full model: all 14 columns (the ones column plus the 13 features)
X_opt = X_data[:, [0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
#  Train (fit) the model
linear = sm.OLS(y_data, X_opt).fit()
linear.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.734 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Tue, 03 Jul 2018 | Prob (F-statistic): | 6.95e-135 |
| Time: | 09:56:38 | Log-Likelihood: | -1498.8 |
| No. Observations: | 506 | AIC: | 3026.0 |
| Df Residuals: | 492 | BIC: | 3085.0 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 36.4911 | 5.104 | 7.149 | 0.000 | 26.462 | 46.520 |
| x1 | -0.1072 | 0.033 | -3.276 | 0.001 | -0.171 | -0.043 |
| x2 | 0.0464 | 0.014 | 3.380 | 0.001 | 0.019 | 0.073 |
| x3 | 0.0209 | 0.061 | 0.339 | 0.735 | -0.100 | 0.142 |
| x4 | 2.6886 | 0.862 | 3.120 | 0.002 | 0.996 | 4.381 |
| x5 | -17.7958 | 3.821 | -4.658 | 0.000 | -25.302 | -10.289 |
| x6 | 3.8048 | 0.418 | 9.102 | 0.000 | 2.983 | 4.626 |
| x7 | 0.0008 | 0.013 | 0.057 | 0.955 | -0.025 | 0.027 |
| x8 | -1.4758 | 0.199 | -7.398 | 0.000 | -1.868 | -1.084 |

| | | | |
|---|---|---|---|
| Omnibus: | 178.029 | Durbin-Watson: | 1.078 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 782.015 |
| Skew: | 1.521 | Prob(JB): | 1.54e-170 |
| Kurtosis: | 8.276 | Cond. No. | 15100.0 |


In the model above, feature x7 has the largest P-value, 0.955, which is greater than 0.05, so we remove that feature and run the procedure again.

In [46]:
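#  Remove x7 and retrain
#  Note: this index list omits column 5 as well as column 7, so 11 features remain (Df Model: 11 below)
X_opt = X_data[:, [0,1,2,3,4,6,8,9,10,11,12,13]]
#  Train (fit) the model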
linear = sm.OLS(y_data, X_opt).fit()
linear.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.728 |
| Model: | OLS | Adj. R-squared: | 0.722 |
| Method: | Least Squares | F-statistic: | 120.4 |
| Date: | Tue, 03 Jul 2018 | Prob (F-statistic): | 4.38e-132 |
| Time: | 09:57:38 | Log-Likelihood: | -1510.5 |
| No. Observations: | 506 | AIC: | 3045.0 |
| Df Residuals: | 494 | BIC: | 3096.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 23.1168 | 4.355 | 5.308 | 0.000 | 14.561 | 31.673 |
| x1 | -0.0972 | 0.033 | -2.916 | 0.004 | -0.163 | -0.032 |
| x2 | 0.0509 | 0.014 | 3.669 | 0.000 | 0.024 | 0.078 |
| x3 | -0.0603 | 0.060 | -0.998 | 0.319 | -0.179 | 0.058 |
| x4 | 2.4891 | 0.878 | 2.836 | 0.005 | 0.765 | 4.214 |
| x5 | 3.8909 | 0.417 | 9.324 | 0.000 | 3.071 | 4.711 |
| x6 | -1.1210 | 0.179 | -6.246 | 0.000 | -1.474 | -0.768 |
| x7 | 0.2648 | 0.067 | 3.952 | 0.000 | 0.133 | 0.396 |
| x8 | -0.0139 | 0.004 | -3.627 | 0.000 | -0.021 | -0.006 |

| | | | |
|---|---|---|---|
| Omnibus: | 181.181 | Durbin-Watson: | 1.038 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 833.172 |
| Skew: | 1.534 | Prob(JB): | 1.2e-181 |
| Kurtosis: | 8.487 | Cond. No. | 11300.0 |


This time the highest P-value belongs to x3 (0.319), which is still greater than 0.05, so we remove it as well.

In [47]:
#  Remove x3 and retrain
X_opt = X_data[:, [0,1,2,4,6,8,9,10,11,12,13]]
#  Train (fit) the model
linear = sm.OLS(y_data, X_opt).fit()
linear.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.728 |
| Model: | OLS | Adj. R-squared: | 0.722 |
| Method: | Least Squares | F-statistic: | 132.4 |
| Date: | Tue, 03 Jul 2018 | Prob (F-statistic): | 6.08e-133 |
| Time: | 09:57:55 | Log-Likelihood: | -1511.0 |
| No. Observations: | 506 | AIC: | 3044.0 |
| Df Residuals: | 495 | BIC: | 3090.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 22.6152 | 4.326 | 5.228 | 0.000 | 14.116 | 31.114 |
| x1 | -0.0954 | 0.033 | -2.865 | 0.004 | -0.161 | -0.030 |
| x2 | 0.0528 | 0.014 | 3.835 | 0.000 | 0.026 | 0.080 |
| x3 | 2.3827 | 0.871 | 2.735 | 0.006 | 0.671 | 4.094 |
| x4 | 3.9364 | 0.415 | 9.490 | 0.000 | 3.121 | 4.751 |
| x5 | -1.0547 | 0.167 | -6.325 | 0.000 | -1.382 | -0.727 |
| x6 | 0.2819 | 0.065 | 4.354 | 0.000 | 0.155 | 0.409 |
| x7 | -0.0157 | 0.003 | -4.688 | 0.000 | -0.022 | -0.009 |
| x8 | -0.7568 | 0.126 | -6.007 | 0.000 | -1.004 | -0.509 |

| | | | |
|---|---|---|---|
| Omnibus: | 181.247 | Durbin-Watson: | 1.036 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 833.826 |
| Skew: | 1.534 | Prob(JB): | 8.65e-182 |
| Kurtosis: | 8.49 | Cond. No. | 11200.0 |


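After two rounds of elimination, the P-values of the remaining features are all smaller than $\alpha$, so we can use these features (independent variables) to train the model.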