# Preprocessing-3: Feature Selection

As the final step of data preprocessing, we walk through the whole flow from one-hot encoding to feature selection. The data is a loan-approval dataset.

In [1]:
import pandas as pd

df = pd.read_csv('../data/av_loan_u6lujuX_CVtuZ9i.csv', header=0)
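X = df.iloc[:, :-1]            # the last column is the approval result, so every column before it becomes the feature matrix X
X = X.drop('Loan_ID', axis=1)  # the first column, Loan_ID, is only an application identifier, so drop it from the features
y = df.iloc[:, [-1]]           # the last column is the target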

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print('Check the null count of the target variable: %i' % y['Loan_Status'].isnull().sum())
print('----------------------------------------------------------------------------------------')

# convert the string labels to a number (binary flag)
# samples whose loan was rejected ('N') become 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y_new = y.copy()
y_new.loc[:,'Loan_Status'] = y_new['Loan_Status'].map(class_mapping)
print(y_new.groupby(['Loan_Status']).size())
print('----------------------------------------------------------------------------------------')
X.join(y_new).head()

----------------------------------------------------------------------------------------
X shape: (614,11)
y shape: (614,1)
----------------------------------------------------------------------------------------
Check the null count of the target variable: 0
----------------------------------------------------------------------------------------
Loan_Status
0    422
1    192
dtype: int64
----------------------------------------------------------------------------------------


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0


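Next we run the steps already covered earlier: dummy-encoding the categorical variables (including their missing values) and imputing the missing values of the continuous variables.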

In [2]:
import numpy as np
from sklearn.impute import SimpleImputer

# one-hot encoding (dummy_na=True adds a *_nan indicator column for each categorical variable)
ohe_columns = ['Dependents','Gender','Married','Education','Self_Employed','Property_Area']
X_new = pd.get_dummies(X, dummy_na=True, columns=ohe_columns)

# missing-value imputation (replace NaN with the mean of each column)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_new)
X_new_columns = X_new.columns.values
X_new = pd.DataFrame(imp.transform(X_new), columns=X_new_columns)

# show the result
print('X_new_shape:(%i,%i)' % X_new.shape)
X_new.head()

X_new_shape:(614,26)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
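To see what these two steps do on a small scale, here is a minimal sketch on a made-up three-row frame. The toy values and the names `toy`, `toy_ohe`, and `toy_imp` are assumptions for illustration and are not part of the loan dataset. `dtype=float` is passed only so the dummy columns are numeric regardless of the pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical, not taken from the loan dataset)
toy = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Female'],
    'LoanAmount': [100.0, np.nan, 200.0],
})

# dummy_na=True adds a Gender_nan indicator column for the missing category
toy_ohe = pd.get_dummies(toy, dummy_na=True, columns=['Gender'], dtype=float)

# Mean imputation fills the NaN left in the numeric column: (100 + 200) / 2 = 150
toy_imp = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(toy_ohe),
    columns=toy_ohe.columns,
)
print(toy_imp)
```

The missing category ends up as its own indicator column, while the missing numeric value is replaced by the column mean. This is the same split of responsibilities seen in the loan data above.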


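At this point the original 11 features have grown to 26. You could of course hand the data to an algorithm as it is. In practice, however, it is not unusual for the feature count at this stage to exceed 1,000. This is where feature selection comes in: narrowing the data down in advance to the features that look useful. Here we learn how to use RFE (recursive feature elimination), one such method.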

In [3]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier

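# instantiate RFE, a model-based feature selection class
# use GradientBoostingClassifier as the estimator that scores feature importance
# keep 10 features in the end
# remove 5% of the features per step (with 26 features this rounds to 1 feature per step)
selector = RFE(estimator=GradientBoostingClassifier(random_state=42), n_features_to_select=10, step=0.05)
selector.fit(X_new, y.values.ravel())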

RFE(estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False),
  n_features_to_select=10, step=0.05, verbose=0)
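The idea behind recursive feature elimination can be sketched by hand: fit the model, drop the feature it considers least important, and repeat until the target count is reached. The loop below is a simplified illustration on synthetic data, not scikit-learn's actual implementation. The names `X_demo`, `y_demo`, `remaining`, and `n_keep` are hypothetical, and `n_estimators=20` is chosen only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (an assumption for illustration; not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
n_keep = 4                                # hypothetical number of features to keep

while len(remaining) > n_keep:
    model = GradientBoostingClassifier(n_estimators=20, random_state=42)
    model.fit(X_demo[:, remaining], y_demo)
    # Position (within `remaining`) of the feature the refitted model ranks lowest
    weakest = int(np.argmin(model.feature_importances_))
    del remaining[weakest]

print(remaining)  # original column indices of the surviving features
```

Because the model is refitted after every removal, the importance of the surviving features is re-evaluated each round. This is what distinguishes the approach from ranking the features once and cutting the list.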

Configuring RFE and calling fit determines which of the 26 variables to keep. You can check the result through the "support_" attribute: True marks (the position of) each selected variable.

In [4]:
print(selector.support_)

[ True  True  True  True  True False  True False False False False False
 False  True False  True False False False False False  True False  True
 False False]
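A boolean mask is hard to read on its own, so it helps to map it back to column names and to look at the "ranking_" attribute as well. The sketch below uses synthetic data with made-up column names (`f0` to `f5`), and `rfe_demo` and `kept` are hypothetical names. With a selector fitted on the loan data, the same indexing would apply to its column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the feature matrix (column names are made up)
X_arr, y_arr = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(6)])

rfe_demo = RFE(GradientBoostingClassifier(n_estimators=20, random_state=42),
               n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_arr)

# Boolean mask -> names of the retained columns
kept = X_demo.columns[rfe_demo.support_]
print(list(kept))

# ranking_ is 1 for retained features; larger values were eliminated earlier
print(pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values())
```

With `step=1`, each elimination round removes one feature, so the ranks of the discarded features show the order in which they were dropped.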


Now that fit has decided which variables to select, let's actually narrow down the data. As with the imputer, the conversion is done with transform.

In [5]:
X_new_selected = selector.transform(X_new)
X_new_selected = pd.DataFrame(X_new_selected, columns=X_new_columns[selector.support_])
print(X_new_selected.shape)
print(X_new_selected.dtypes)
X_new_selected.head()

(614, 10)
ApplicantIncome            float64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Dependents_1               float64
Married_No                 float64
Married_nan                float64
Self_Employed_nan          float64
Property_Area_Semiurban    float64
dtype: object


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_1,Married_No,Married_nan,Self_Employed_nan,Property_Area_Semiurban
0,5849.0,0.0,146.412162,360.0,1.0,0.0,1.0,0.0,0.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,0.0,0.0,0.0,0.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0


That completes feature dimensionality reduction with RFE. Make sure you understand where it sits in the overall flow that starts with one-hot encoding.
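One way to make the ordering of the steps explicit is to chain them in a scikit-learn `Pipeline`. The following is only a sketch on synthetic numeric data, so the one-hot step is omitted. The step names `'impute'` and `'select'`, the injected missing values, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic numeric data with a few missing values injected (illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0)
X_demo[::10, 0] = np.nan

# Imputation runs first, then RFE sees a matrix without NaN
prep = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('select', RFE(GradientBoostingClassifier(n_estimators=20, random_state=42),
                   n_features_to_select=4, step=1)),
])
X_sel = prep.fit_transform(X_demo, y_demo)
print(X_sel.shape)  # (200, 4)
```

Chaining the steps this way keeps the fitted imputer and selector together, so the same transformations can later be applied to new data with `prep.transform`.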