## Recursive Feature Elimination (RFE) example
- https://www.kaggle.com/code/carlmcbrideellis/recursive-feature-elimination-rfe-example/notebook

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.7.17</div>
<div style="text-align: right"> Last update: 2023.7.17</div>

In [2]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
# 'seaborn-whitegrid' was renamed in matplotlib 3.6 and later removed
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

RFE is an example of backward feature elimination.  
The method first fits a model on all features, then repeatedly removes the least important feature and refits, until only the number of features set by the parameter n_features_to_select remains.  
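
The loop described above can be sketched by hand. This is a minimal illustration, not scikit-learn's actual implementation: `manual_rfe` is a hypothetical helper, and the data is a synthetic stand-in from `make_regression` rather than the house-price set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the least important feature until n remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    eliminated = []                      # column indices in the order they were dropped
    while len(remaining) > n_features_to_select:
        # refit on the surviving columns only
        estimator.fit(X[:, remaining], y)
        # position (within `remaining`) of the least important feature
        weakest = int(np.argmin(estimator.feature_importances_))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


# Synthetic stand-in data (an assumption, not the house-price data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
model = RandomForestRegressor(n_estimators=20, random_state=0)
kept, dropped = manual_rfe(model, X_demo, y_demo, n_features_to_select=3)
print("kept:", kept)
print("dropped (first to last):", dropped)
```

Each pass costs one full refit, so the total number of fits grows with the number of features removed.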

scikit-learn provides this as sklearn.feature_selection.RFE.   
A related class, sklearn.feature_selection.RFECV, integrates cross-validation.
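
As a rough sketch of how RFECV differs in use: it picks the number of features by cross-validated score instead of taking `n_features_to_select`. The synthetic data, the small forest, and the `cv=5` / `scoring="r2"` settings below are assumptions chosen for a quick demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (an assumption, not the house-price data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

rfecv = RFECV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    step=1,                    # remove one feature per iteration
    cv=5,                      # 5-fold cross-validation
    scoring="r2",
    min_features_to_select=1,
)
rfecv.fit(X_demo, y_demo)

print("number of features kept:", rfecv.n_features_)
print("selection mask:", rfecv.support_)
```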

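There is also an approach that works in the opposite direction from RFE, building up the feature set instead of pruning it. 
For this you can use sklearn.feature_selection.f_regression.   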
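
One possible way to use f_regression is sketched below. Note that f_regression itself only scores each feature individually against the target. Pairing it with `SelectKBest` to keep the top `k` features is an assumption made here for illustration, as is the synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption, not the house-price data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Keep the 3 features with the highest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=3)
X_reduced = selector.fit_transform(X_demo, y_demo)

print(X_reduced.shape)
print("selected column indices:", selector.get_support(indices=True))
```

Unlike RFE, this scoring never refits a model on feature subsets, so it is cheap but ignores interactions between features.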

If you have domain knowledge and already know which features matter, you may get better results than with either of these two methods. 

In [3]:
train = pd.read_csv("../data/house-price/train.csv", index_col=0)
test = pd.read_csv("../data/house-price/test.csv", index_col=0)

target = 'SalePrice'

X_train = train.select_dtypes(include=['number']).copy()
X_train = X_train.drop([target], axis=1)
y_train = train[target]
X_test  = test.select_dtypes(include=['number']).copy()

X_train = X_train.fillna(X_train.mean())
X_test  = X_test.fillna(X_test.mean())

X_train.shape

(1460, 36)

In [4]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=100, max_depth=10)


In [5]:
from sklearn.feature_selection import RFE
# here we want only one final feature, we do this to produce a ranking
n_features_to_select = 1
rfe = RFE(regressor, n_features_to_select=n_features_to_select)
rfe.fit(X_train, y_train)

In [11]:
#===========================================================================
# now print out the features in order of ranking
#===========================================================================
from operator import itemgetter
features = X_train.columns.to_list()
for x, y in (sorted(zip(rfe.ranking_ , features), key=itemgetter(0))):
    print(x, y)

1 OverallQual
2 GrLivArea
3 TotalBsmtSF
4 BsmtFinSF1
5 GarageArea
6 2ndFlrSF
7 1stFlrSF
8 YearBuilt
9 GarageCars
10 LotArea
11 YearRemodAdd
12 LotFrontage
13 TotRmsAbvGrd
14 BsmtUnfSF
15 MasVnrArea
16 GarageYrBlt
17 OpenPorchSF
18 WoodDeckSF
19 OverallCond
20 FullBath
21 Fireplaces
22 MoSold
23 MSSubClass
24 BedroomAbvGr
25 YrSold
26 ScreenPorch
27 HalfBath
28 BsmtFullBath
29 KitchenAbvGr
30 EnclosedPorch
31 PoolArea
32 BsmtFinSF2
33 3SsnPorch
34 BsmtHalfBath
35 LowQualFinSF
36 MiscVal


Selecting 10 features

In [12]:
n_features_to_select = 10
rfe = RFE(regressor, n_features_to_select=n_features_to_select)
rfe.fit(X_train, y_train)

In [13]:
predictions = rfe.predict(X_test)
predictions

array([129571.14475109, 156150.52662227, 183498.88482218, ...,
       148442.43726133, 108804.30573127, 235895.10942214])
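
To see which columns a fitted RFE object kept, the boolean mask in `support_` can be applied to the column index. The sketch below uses synthetic data with hypothetical column names `f0` to `f5` (assumptions, not the house-price columns).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfe_demo = RFE(
    RandomForestRegressor(n_estimators=20, random_state=0),
    n_features_to_select=2,
)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns
selected = X_demo.columns[rfe_demo.support_].to_list()
print(selected)

# predict() reduces the input to the selected columns before predicting
preds = rfe_demo.predict(X_demo)
```

Because `predict` applies the mask itself, the full-width frame is passed in, which is why the notebook can call `rfe.predict(X_test)` on all numeric columns.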