# Preprocessing: one-hot encoding 

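One-hot encoding converts categorical variables into dummy (indicator) variables. To see what this processing does, we first load loan screening result data as a sample. Records where the loan was rejected (Loan_Status of N) are converted to "1" (the positive class).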

In [2]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('./data/av_loan_u6lujuX_CVtuZ9i.csv', header=0)
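X = df.iloc[:, :-1]            # the last column is the screening result, so load every column before it as the features X
X = X.drop('Loan_ID', axis=1)  # the first column, Loan_ID, is only an identifier, so drop it from the feature vector
y = df.iloc[:, [-1]]           # load the last column as the target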

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print('Check the null count of the target variable: %i' % y['Loan_Status'].isnull().sum())
print('----------------------------------------------------------------------------------------')

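# convert the string labels to numbers (binary flag)
# samples rejected in the loan screening (N) become 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y_new = y.copy()
# assign the column directly so that the mapped values keep their integer dtype
y_new['Loan_Status'] = y_new['Loan_Status'].map(class_mapping)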
print(y_new.groupby(['Loan_Status']).size())
print('----------------------------------------------------------------------------------------')
print(X.join(y_new).dtypes)
X.join(y_new).head()

----------------------------------------------------------------------------------------
X shape: (614,11)
y shape: (614,1)
----------------------------------------------------------------------------------------
Check the null count of the target variable: 0
----------------------------------------------------------------------------------------
Loan_Status
0    422
1    192
dtype: int64
----------------------------------------------------------------------------------------
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object


| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | 0 |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | 1 |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | 0 |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | 0 |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | 0 |
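
The dtype listing above is what identifies the encoding candidates: columns that are not stored as numbers. As a minimal sketch, the same check can be done programmatically with `select_dtypes`. The `toy` frame below is a hypothetical miniature of the loan data with made-up values, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the loan data; the values are made up for illustration.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Dependents': ['0', '3+', '1'],        # stored as strings because of the '3+' level
    'ApplicantIncome': [5849, 4583, 3000],
    'LoanAmount': [None, 128.0, 66.0],     # the missing value becomes NaN in a float column
})

# Keep only the columns that are not numeric; these are the encoding candidates.
candidate_columns = toy.select_dtypes(exclude='number').columns.tolist()
print(candidate_columns)
```

A categorical variable that happens to be stored as numbers is not picked up this way, so the resulting list should still be reviewed by hand.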


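From the output above, Dependents, Gender, Married, Education, Self_Employed, and Property_Area are categorical variables (dtype object), so they are the targets of one-hot encoding. This dummy-variable conversion can be done with the pandas get_dummies function.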

In [3]:
import pandas as pd
ohe_columns = ['Dependents','Gender','Married','Education','Self_Employed','Property_Area']
X_new = pd.get_dummies(X, dummy_na=True, columns=ohe_columns)
print(X_new.dtypes)
X_new.head()

ApplicantIncome              int64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Dependents_0                 uint8
Dependents_1                 uint8
Dependents_2                 uint8
Dependents_3+                uint8
Dependents_nan               uint8
Gender_Female                uint8
Gender_Male                  uint8
Gender_nan                   uint8
Married_No                   uint8
Married_Yes                  uint8
Married_nan                  uint8
Education_Graduate           uint8
Education_Not Graduate       uint8
Education_nan                uint8
Self_Employed_No             uint8
Self_Employed_Yes            uint8
Self_Employed_nan            uint8
Property_Area_Rural          uint8
Property_Area_Semiurban      uint8
Property_Area_Urban          uint8
Property_Area_nan            uint8
dtype: object


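| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Dependents_0 | Dependents_1 | Dependents_2 | Dependents_3+ | Dependents_nan | ... | Education_Graduate | Education_Not Graduate | Education_nan | Self_Employed_No | Self_Employed_Yes | Self_Employed_nan | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban | Property_Area_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5849 | 0.0 | NaN | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |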
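
A quick sanity check on output like the table above: when the missing-value column is included, every row should have exactly one 1 within each group of dummy columns. The sketch below verifies this on a hypothetical one-column frame (`toy` and its values are made up). Passing `dtype=int` is an assumption made here so that the result is 0/1 integers regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value; not taken from the real loan file.
toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', np.nan, 'Semiurban']})

encoded = pd.get_dummies(toy, columns=['Property_Area'], dummy_na=True, dtype=int)

# Select the dummy columns generated from Property_Area and sum them per row.
group = encoded.filter(like='Property_Area_')
row_sums = group.sum(axis=1)
print(row_sums.tolist())
```

If any row sum differs from 1, the group was not produced by a single one-hot encoding of that column.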


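Setting dummy_na=True in get_dummies creates an additional variable that is set to 1 when the value is missing. Once you understand why the values are missing, set it to True if using missingness itself as a predictor is acceptable. With that, one-hot encoding is finished in a single line.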
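
To make the effect of `dummy_na` concrete, here is a small sketch that compares both settings on a hypothetical frame (`toy` and its values are made up, not the loan data). Recent pandas versions return boolean dummy columns by default rather than the `uint8` shown above, so `dtype=int` is passed here as an assumption to keep the 0/1 representation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third applicant has no Gender recorded.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', np.nan, 'Male'],
    'ApplicantIncome': [5849, 4583, 3000, 2583],
})

# dummy_na=True adds a Gender_nan column that flags the missing row.
with_na = pd.get_dummies(toy, columns=['Gender'], dummy_na=True, dtype=int)

# dummy_na=False drops that information: the missing row is all zeros.
without_na = pd.get_dummies(toy, columns=['Gender'], dummy_na=False, dtype=int)

print(with_na)
print(without_na)
```

With `dummy_na=False` the row with the missing value is encoded as zeros in both `Gender_Female` and `Gender_Male`, so the model cannot tell it apart from a row that simply matches neither level.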