### Data Processing in the Scoring Phase (Identifying the Problem)

In [1]:
!git clone https://github.com/saiku122/AIJobcolle.git

Cloning into 'AIJobcolle'...
remote: Enumerating objects: 135, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 135 (delta 50), reused 60 (delta 14), pack-reused 0
Receiving objects: 100% (135/135), 2.62 MiB | 19.90 MiB/s, done.
Resolving deltas: 100% (50/50), done.


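Using the loan screening data, let's first <b>review the data processing flow of the modeling phase</b>,<br>and then learn the techniques needed for data processing in the scoring phase.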

In [2]:
cd /content/AIJobcolle/MachineLearning/python

/content/AIJobcolle/MachineLearning/python


##### Preprocessing the modeling data: loading the data

In [3]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('./data/av_loan_u6lujuX_CVtuZ9i.csv',header=0)
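X  = df.iloc[:,:-1]           # all columns except the last are the features X
ID = X.iloc[:,[0]]            # first column is the primary key (Loan_ID); keep it as the ID
X  = X.drop('Loan_ID',axis=1) # drop the first column (Loan_ID) from the feature vector
y  = df.iloc[:,-1]            # last column is the target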

# check the shape
print('--------------------------------------')
print('Raw shape: (%i,%i)' %df.shape)
print('X shape: (%i,%i)' %X.shape)

# Map samples whose loan was rejected ('N') to 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y = y.map(class_mapping)
print('---------------------------------------')
print(y.value_counts())
print('---------------------------------------')
print(ID.join(X).join(y).dtypes)
ID.join(X).join(y).head()

# Change the maximum number of columns displayed
pd.options.display.max_columns = 50

--------------------------------------
Raw shape: (614,13)
X shape: (614,11)
---------------------------------------
0    422
1    192
Name: Loan_Status, dtype: int64
---------------------------------------
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object


##### Preprocessing the modeling data: encoding categorical variables and handling missing values
First, we one-hot encode the categorical variables.

In [4]:
ohe_columns = ['Dependents',
               'Gender',
               'Married',
               'Education',
               'Self_Employed',
               'Property_Area']

X_ohe = pd.get_dummies(X,
                       dummy_na=True,
                       columns=ohe_columns)

print('X_ohe shape:(%i,%i)' % X_ohe.shape)
display(X_ohe.head())

X_ohe shape:(614,26)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,Married_nan,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0
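
As a small illustration of what `dummy_na=True` does, the sketch below encodes a hypothetical toy frame (the `toy` data is made up for this example and is not part of the loan dataset). A missing value gets its own `_nan` indicator column, and that column is created even when a variable has no missing values, which is why `Education_nan` and `Property_Area_nan` appear in the table above.

```python
import pandas as pd

# Hypothetical toy frame: 'Gender' has one missing value.
toy = pd.DataFrame({'Gender': ['Male', 'Female', None],
                    'Income': [100, 200, 300]})

# Same call pattern as the cell above: encode only the listed columns
# and add a dedicated indicator column for missing values.
toy_ohe = pd.get_dummies(toy, dummy_na=True, columns=['Gender'])

# Untouched columns come first, followed by the generated dummies.
print(toy_ohe.columns.tolist())

# The third row is the one flagged as missing.
print(toy_ohe['Gender_nan'].tolist())
```

Because the column names are derived from the values found in the data, a different set of values produces a different set of columns.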


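##### Preprocessing the modeling data: handling missing values in numeric variables
Next, we replace missing values in the numeric variables with the mean of each column.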

In [8]:
from sklearn.impute import SimpleImputer

# Replace missing values (NaN) with the column mean (the default strategy)
imp = SimpleImputer()
imp.fit(X_ohe)

# Apply the fitted imputer to fill the missing values in X_ohe
X_ohe_columns = X_ohe.columns.values
X_ohe = pd.DataFrame(imp.transform(X_ohe), columns=X_ohe_columns)

# Show the result
display(X_ohe.head())

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,Married_nan,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


##### Preprocessing the modeling data: dimensionality reduction (feature selection)
Next, we reduce the number of feature dimensions.

In [9]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

selector = RFE(RandomForestClassifier(n_estimators=100,random_state=1),
               n_features_to_select=10,
               step=.05)

selector.fit(X_ohe,y)

X_fin = pd.DataFrame(selector.transform(X_ohe),
                     columns=X_ohe_columns[selector.support_])

print('X_fin shape:(%i,%i)' % X_fin.shape)
X_fin.head()

X_fin shape:(614,10)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Married_No,Education_Graduate,Property_Area_Rural,Property_Area_Semiurban
0,5849.0,0.0,146.412162,360.0,1.0,1.0,1.0,1.0,0.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,0.0,1.0,1.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,1.0,0.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,1.0,1.0,0.0,0.0
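
To make the role of the fitted selector more concrete, here is a minimal sketch on synthetic data (the `make_classification` data and the names `X_toy`, `y_toy`, and `sel` are assumptions for illustration only). It shows that a fitted `RFE` stores its decision as a boolean mask over column positions in `support_`, which is what the cell above uses to pick the column names. As far as the scikit-learn documentation describes it, a fractional `step` is the fraction of features removed per iteration, rounded down with a minimum of one, so `step=.05` on 26 columns should remove one feature at a time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data for illustration only: 8 features, 3 of them informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)

# Recursively drop the least important feature until 3 remain.
sel = RFE(RandomForestClassifier(n_estimators=20, random_state=1),
          n_features_to_select=3,
          step=1)
sel.fit(X_toy, y_toy)

# Boolean mask over column positions: True means the column is kept.
print(sel.support_)

# Selected features have rank 1; larger ranks were eliminated earlier.
print(sel.ranking_)

# transform() keeps only the masked positions.
print(sel.transform(X_toy).shape)
```

The mask refers to positions, not names, so it only makes sense for data laid out with the same columns in the same order as the data used in `fit`.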


That completes the data processing for the modeling phase.

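##### Preprocessing the scoring data
Now for preprocessing the scoring data. This processing must satisfy the following requirements.
- The data must be transformed into the 10 features above, in exactly this order
- Missing values must be imputed with the fitted SimpleImputer instance
- Features must be selected with the fitted RFE instance

The reasons are as follows.
- The trained model cannot tell that the column order of the scoring data differs from that of the modeling data
- If the column order differs, the positional indices held by the fitted SimpleImputer and the fitted RFE no longer point at the right columns
- Missing values in numeric variables can only be replaced with the means from the training phase (information from the future cannot be used)
- Which variables are important can only be learned from the relationship between X and y in the training phase (for the same reason)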
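
The two points about the imputer can be seen in a minimal sketch with made-up numbers (the arrays and the "income" and "loan amount" labels are hypothetical). NumPy arrays are used on purpose so that column position is the only information the imputer has.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data: column 0 = income, column 1 = loan amount.
X_train = np.array([[1000.0, 10.0],
                    [3000.0, np.nan],
                    [np.nan, 30.0]])

imp_toy = SimpleImputer()      # default strategy is 'mean'
imp_toy.fit(X_train)

# Per-column means learned from the training data:
# 2000 for income and 20 for loan amount.
print(imp_toy.statistics_)

# Scoring record in the same column order: the missing income
# is replaced by the training mean, not by anything from the scoring data.
X_score = np.array([[np.nan, 50.0]])
print(imp_toy.transform(X_score))

# The same record with its columns swapped (loan amount, income):
# the imputer fills by position, so the loan-amount mean (20)
# ends up in the slot that actually holds income.
X_swapped = np.array([[50.0, np.nan]])
print(imp_toy.transform(X_swapped))
```

No error is raised in the swapped case; the values are simply wrong, which is why the column layout has to be fixed before the fitted transformers are applied.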

##### Preprocessing the scoring data: loading the data

In [10]:
# import sample data for classification
df_s = pd.read_csv('./data/av_loan_test_Y3wMUE5_7gLdaTN.csv', header=0)
ID_s = df_s.iloc[:,[0]]            # column 0 is the primary key (Loan_ID); keep it as the ID
X_s  = df_s.drop('Loan_ID',axis=1) # Loan_ID is an identifier, so drop it from the feature vector

# check the shape
print('Raw shape: (%i,%i)' %df_s.shape)
print('X shape: (%i,%i)' %X_s.shape)
print('-------------------------------')
print(X_s.dtypes)

Raw shape: (333,12)
X shape: (333,11)
-------------------------------
Gender                object
Married               object
Dependents           float64
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome      int64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
dtype: object


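##### Preprocessing the scoring data: encoding categorical variables and handling missing values
This step is applied to the scoring data independently of the modeling data.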

In [11]:
X_ohe_s = pd.get_dummies(X_s,
                         dummy_na=True,
                         columns=ohe_columns)
print('X_ohe_s shape:(%i,%i)' % X_ohe_s.shape)
X_ohe_s.head()

X_ohe_s shape:(333,26)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0.0,Dependents_1.0,Dependents_2.0,Dependents_nan,Gender_Female,Gender_Male,Gender_Unknown,Gender_nan,Married_No,Married_Yes,Married_nan,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5720,0,110.0,360.0,1.0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0
1,3076,1500,126.0,360.0,1.0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0
2,5000,1800,208.0,360.0,1.0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0
3,2340,2546,100.0,360.0,,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0
4,3276,0,78.0,360.0,1.0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0


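##### Preprocessing the scoring data: consistency check after one-hot encoding
Now let's check whether the modeling data and the scoring data are consistent after one-hot encoding.
Both frames have 26 columns, so the shapes alone do not reveal a mismatch; the check compares column names using Python's set type.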

In [12]:
# Use Python's set type
cols_model = set(X_ohe.columns.values)
cols_score = set(X_ohe_s.columns.values)

# Columns present in the modeling data but missing from the scoring data
diff1 = cols_model - cols_score
print('Model only: %s' % diff1)

# Columns present in the scoring data but missing from the modeling data
diff2 = cols_score - cols_model
print('Score only: %s' % diff2)

Model only: {'Dependents_3+', 'Dependents_0', 'Dependents_1', 'Dependents_2'}
Score only: {'Gender_Unknown', 'Dependents_1.0', 'Dependents_0.0', 'Dependents_2.0'}


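In fact, two changes were deliberately made to this scoring data.
1. A new category value, "Unknown", was added to the Gender variable
2. The category value "3+" was removed from the Dependents variable (leaving the three values 0, 1, and 2)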

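As a result, the following inconsistencies can arise between the modeling data and the scoring data.
1. A column that does not exist in the modeling data may be generated (Gender_Unknown)
2. A column that existed in the modeling data may disappear (Dependents_3+)
3. A difference in data types may cause 1 or 2 (without "3+", Dependents is read as float64, so its dummy columns are named Dependents_0.0 and so on instead of Dependents_0)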
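
The data type issue in item 3 can be reproduced with two hypothetical miniature CSVs that differ only in whether "3+" is present (the CSV contents and variable names below are made up for illustration). This is a sketch of the cause only, not of the remedy.

```python
import io
import pandas as pd

# Hypothetical miniature CSVs; the third row has a missing Dependents value.
csv_model = "Dependents,Income\n0,100\n1,200\n,300\n3+,400\n"
csv_score = "Dependents,Income\n0,100\n1,200\n,300\n2,400\n"

toy_m = pd.read_csv(io.StringIO(csv_model))
toy_s = pd.read_csv(io.StringIO(csv_score))

# With "3+" present the column cannot be parsed as numbers and stays text;
# without it, pandas infers a float column (float because of the NaN).
print(toy_m['Dependents'].dtype)
print(toy_s['Dependents'].dtype)

# The dummy column names are built from the values, so text '0' and
# float 0.0 lead to different names for what is logically the same category.
toy_m_ohe = pd.get_dummies(toy_m, dummy_na=True, columns=['Dependents'])
toy_s_ohe = pd.get_dummies(toy_s, dummy_na=True, columns=['Dependents'])
print(toy_m_ohe.columns.tolist())
print(toy_s_ohe.columns.tolist())
```

This matches the pattern in the set comparison above, where every Dependents column differs between the two sides even though only "3+" was removed.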

In the next Jupyter notebook, "Data Processing in the Scoring Phase (Solution)", we will learn how to deal with these inconsistencies.