# Preprocessing-2: Missing-Value Imputation

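The simplest and most common way to handle missing values in features is to replace them with the column mean for continuous variables and, for categorical variables, to turn missingness into its own flag column during one-hot encoding (or to fill with the most frequent value). To illustrate the implementation, we first load a loan screening dataset.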

In [3]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('../data/av_loan_u6lujuX_CVtuZ9i.csv', header=0)
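X = df.iloc[:, :-1]            # the last column is the screening result, so all columns before it become the feature matrix X
X = X.drop('Loan_ID', axis=1)  # the first column, Loan_ID, is only an identifier, so drop it from the features
y = df.iloc[:, [-1]]           # the last column is the target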

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print('Check the null count of the target variable: %i' % y.isnull().sum().sum())
print('----------------------------------------------------------------------------------------')

# convert string labels to numbers (binary flag)
# samples whose loan was rejected ('N') are mapped to 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y_new = y.copy()
y_new['Loan_Status'] = y_new['Loan_Status'].map(class_mapping)  # plain column assignment so the mapped integer dtype is kept
print(y_new.groupby(['Loan_Status']).size())
print('----------------------------------------------------------------------------------------')
print(X.join(y_new).dtypes)
X.join(y_new).head()

----------------------------------------------------------------------------------------
X shape: (614,11)
y shape: (614,1)
----------------------------------------------------------------------------------------
Check the null count of the target variable: 0
----------------------------------------------------------------------------------------
Loan_Status
0    422
1    192
dtype: int64
----------------------------------------------------------------------------------------
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object


,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0


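The preview of the loan data shows, for example, that LoanAmount contains a missing value (NaN) in its first row. We will replace it with the mean of the LoanAmount column. To keep missing-value handling uniform, this course first one-hot encodes the categorical variables so that their missing values become flag columns, and then replaces the remaining missing values in the continuous variables with the column mean. Let's start with the one-hot encoding; do not forget the dummy_na=True option.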

In [2]:
ohe_columns = ['Dependents','Gender','Married','Education','Self_Employed','Property_Area']
X_new = pd.get_dummies(X, dummy_na=True, columns=ohe_columns, dtype=int)  # dtype=int keeps 0/1 columns on pandas 2.x, where the default is bool
X_new.head()

,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
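
To see what dummy_na=True contributes, here is a minimal sketch on a hypothetical toy DataFrame (not the loan data; the values are assumptions chosen for illustration). Without the option, a row with a missing category is encoded as all zeros; with it, an extra `_nan` column flags that row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with a missing entry in row 2
toy = pd.DataFrame({'Married': ['Yes', 'No', np.nan, 'Yes']})

# Without dummy_na, the missing row is encoded as all zeros
without_flag = pd.get_dummies(toy, columns=['Married'], dtype=int)

# With dummy_na=True, an extra Married_nan column flags the missing row
with_flag = pd.get_dummies(toy, columns=['Married'], dummy_na=True, dtype=int)

print(without_flag)
print(with_flag)
```

Because the missing row gets its own flag column, a downstream model can treat "value not recorded" as information rather than silently losing it.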


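This completes the handling of missing values in the categorical variables. Next, we replace missing values in the continuous variables with the column mean, which scikit-learn's SimpleImputer class can do (older releases offered the equivalent Imputer class in sklearn.preprocessing, which was removed in version 0.22). So that we can check the result afterwards, first look at the summary statistics for LoanAmount: the mean is 146.412162.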

In [3]:
X_new.describe()

,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
count,614.0,614.0,592.0,600.0,564.0,614.0,614.0,614.0,614.0,614.0,...,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199,0.561889,0.166124,0.164495,0.083062,0.02443,...,0.781759,0.218241,0.0,0.814332,0.13355,0.052117,0.291531,0.379479,0.32899,0.0
std,6109.041673,2926.248369,85.587325,65.12041,0.364878,0.496559,0.372495,0.371027,0.276201,0.154506,...,0.413389,0.413389,0.0,0.389155,0.340446,0.222445,0.454838,0.485653,0.470229,0.0
min,150.0,0.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3812.5,1188.5,128.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5795.0,2297.25,168.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
max,81000.0,41667.0,700.0,480.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
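
The count row of describe() reports only non-missing entries, which is why LoanAmount shows 592 against 614 rows (22 values are missing). The sketch below uses a hypothetical toy column (not the loan data) to show how the missing count and the mean relate, and that filling NaN with the column mean leaves that mean unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
amount = pd.Series([100.0, np.nan, 200.0, 150.0], name='LoanAmount')

n_missing = int(amount.isnull().sum())  # number of NaN entries
n_present = int(amount.count())         # count() excludes NaN, as in describe()
col_mean = amount.mean()                # mean() skips NaN by default

# Filling NaN with the column mean leaves the column mean unchanged
filled = amount.fillna(col_mean)

print(n_missing, n_present, col_mean, filled.mean())
```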


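Now for the mean imputation of the continuous variables. Import SimpleImputer from the sklearn.impute module (the older Imputer class in sklearn.preprocessing was removed in scikit-learn 0.22). Applying its transform method replaces the NaN in the first row of LoanAmount with 146.412162.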

In [5]:
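import numpy as np
from sklearn.impute import SimpleImputer

# Instantiate the imputer.
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22.
# NaN is replaced with the column mean; imputation is always column-wise, so there is no axis argument.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the mean of each feature
imp.fit(X_new)

# Apply the fitted imputer to replace the missing values in X_new
X_new_columns = X_new.columns.values
X_new = pd.DataFrame(imp.transform(X_new), columns=X_new_columns)

# Show the result
X_new.head()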

,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
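
Fitting and transforming are separate steps because the learned means can be reused on other data. As a minimal sketch, assuming a hypothetical train/test split on toy values (not the loan data), the imputer below is fitted on the training rows only and then applied to the test rows, so the test NaN is filled with the training mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for a train/test split
X_train = pd.DataFrame({'LoanAmount': [100.0, np.nan, 200.0]})
X_test = pd.DataFrame({'LoanAmount': [np.nan, 120.0]})

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)  # learns the mean from the training rows only

# The test-set NaN is filled with the training mean, not the test mean
X_test_filled = pd.DataFrame(imp.transform(X_test), columns=X_test.columns)

print(imp.statistics_)  # per-column means learned by fit
print(X_test_filled)
```

Reusing the training statistics in this way keeps information from the test rows out of the preprocessing step.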


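This completes the flag encoding and missing-value handling of the categorical variables and the missing-value handling of the continuous variables. The dataset can now be passed to scikit-learn algorithms.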
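
As an aside, the alternative mentioned at the start of this section, filling a categorical column with its most frequent value, could look like the sketch below. It uses a hypothetical toy column (not the loan data) and plain pandas; it is one possible approach, not the procedure used in this course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy categorical column with one missing entry
gender = pd.Series(['Male', 'Female', np.nan, 'Male'], name='Gender', dtype=object)

# mode() ignores NaN by default; [0] picks the first value if several are tied
mode_value = gender.mode()[0]

# Replace the missing entry with the most frequent category
gender_filled = gender.fillna(mode_value)

print(mode_value)
print(gender_filled.tolist())
```

Unlike the dummy_na flag, this approach discards the fact that the value was missing, so the choice between the two depends on whether missingness itself is informative.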