## Feature Engineering Deep Dive Tutorial

In [1]:
import h2o
import matplotlib.pyplot as plt
%matplotlib inline
from h2o.automl import H2OAutoML

In [3]:
#> Start the H2O cluster
h2o.init(url='http://localhost:54321')

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_261"; Java(TM) SE Runtime Environment (build 1.8.0_261-b12); Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)
  Starting server from /home/ec2-user/anaconda3/envs/h2o_3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpiovt3p_5
  JVM stdout: /tmp/tmpiovt3p_5/h2o_ec2_user_started_from_python.out
  JVM stderr: /tmp/tmpiovt3p_5/h2o_ec2_user_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Asia/Tokyo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.1.2
H2O_cluster_version_age:,1 month and 2 days
H2O_cluster_name:,H2O_from_python_ec2_user_n5ntwl
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.399 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


### 1. Loading and Inspecting the Data

In [5]:
#> Load the data
loans = h2o.import_file("https://sample-data-open.s3-ap-northeast-1.amazonaws.com/h2o_sample_loan/loan.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
loans.describe()

Rows:163987
Cols:15




,loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
type,int,enum,real,int,enum,real,enum,enum,real,int,real,int,int,int,enum
mins,500.0,,5.42,0.0,,1896.0,,,0.0,0.0,0.0,1.0,0.0,0.0,
mean,13074.169141456336,,13.715904065566168,5.684352932995333,,71915.67051974915,,,15.881530121290089,0.2273570060625282,54.07917280242262,24.57973383427463,0.1830388994249544,14.854273655448353,
maxs,35000.0,,26.06,10.0,,7141778.0,,,39.99,29.0,150.70000000000002,118.0,1.0,65.0,
sigma,7993.556188734652,,4.3919398705457935,3.610663731100237,,59070.91565491827,,,7.587668224192549,0.694167922928418,25.285366766770515,11.685190365910659,0.3866995896078875,6.947732922546697,
zeros,0,,0,14248,,0,,,270,139459,1562,0,133971,11,
missing,0,0,0,5804,0,4,0,0,0,29,193,29,0,29,0
0,5000.0,36 months,10.65,10.0,RENT,24000.0,credit_card,AZ,27.65,0.0,83.7,9.0,0.0,26.0,verified
1,2500.0,60 months,15.27,0.0,RENT,30000.0,car,GA,1.0,0.0,9.4,4.0,1.0,12.0,verified
2,2400.0,36 months,15.96,10.0,RENT,12252.0,small_business,IL,8.72,0.0,98.5,10.0,0.0,10.0,not verified


#### About the data

A cleaned and simplified version of the [LendingClub](https://www.lendingclub.com/info/statistics.action) data.

|Id  | Column Name | Description |
|:---|:----------------------|:-------------------|
|1   | loan_amnt             | Requested loan amount (USD) |
|2   | term                  | Requested loan term length (months) |
|3   | int_rate              | Recommended interest rate (lending rate) |
|4   | emp_length            | Employment length (years) |
|5   | home_ownership        | Housing status |
|6   | annual_inc            | Annual income (USD) |
|7   | purpose               | Purpose of the loan |
|8   | addr_state            | State of residence |
|9   | dti                   | Debt-to-income ratio (%): existing loan repayments divided by the borrower's monthly income |
|10  | delinq_2yrs           | Number of delinquencies in the past 2 years |
|11  | revol_util            | Percentage of revolving credit line utilized |
|12  | total_acc             | Number of active accounts |
|13  | bad_loan              | Bad loan indicator |
|14  | longest_credit_length | Age of the oldest active account (years) |
|15  | verification_status   | Income verification status |

Reference: Lending Club Loan Analysis (Kaggle Notebook)

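We build a model with `bad_loan` as the target. Note that `bad_loan` is loaded as an `int` column (see the `type` row of the `describe()` output above), so H2O treats the task as regression and reports regression metrics such as RMSE. Converting the column with `asfactor()` before training would make it a binary classification task instead.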

In [9]:
loans['bad_loan'].table()

bad_loan,Count
0,133971
1,30016
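
The counts above show that the two classes are imbalanced. As a quick sanity check, the sketch below uses only the two counts from the table (the variable names are illustrative) to compute the share of bad loans and the mean squared error of a naive model that always predicts that share:

```python
# Counts copied from the bad_loan table above
good_loans = 133971
bad_loans = 30016
total = good_loans + bad_loans

# Share of bad loans; equals the mean of the 0/1 bad_loan column
bad_rate = bad_loans / total

# For a 0/1 target, always predicting p gives an MSE of p * (1 - p)
naive_mse = bad_rate * (1 - bad_rate)

print(f"rows: {total}")
print(f"bad loan rate: {bad_rate:.4f}")
print(f"naive constant-prediction MSE: {naive_mse:.4f}")
```

The rate agrees with the `mean` of `bad_loan` in the `describe()` output (about 0.183). The naive MSE of roughly 0.15 is a useful yardstick for the regression metrics reported later in this notebook.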




In [7]:
#> Split into training and test sets (train : test = 80% : 20%)
train, test = loans.split_frame([0.8], seed=12345)
print("<Row counts for train/test>")
print("train:%d test:%d" % (train.nrows, test.nrows))

<Row counts for train/test>
train:131248 test:32739
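
`split_frame` assigns rows to the splits at random, so the resulting sizes are close to the requested ratios but not exactly equal to them. The following is a minimal pure-Python sketch of that idea, not H2O's actual implementation. `approximate_split` is a hypothetical helper, and the row count is taken from the `describe()` output:

```python
import random

def approximate_split(n_rows, train_ratio, seed):
    """Assign each row to train independently with probability train_ratio."""
    rng = random.Random(seed)
    n_train = sum(1 for _ in range(n_rows) if rng.random() < train_ratio)
    return n_train, n_rows - n_train

n_train, n_test = approximate_split(163987, 0.8, seed=12345)
print("train:%d test:%d (train share %.4f)" % (n_train, n_test, n_train / 163987))

# Train share actually observed in the H2O output above
h2o_train_share = 131248 / 163987
print("H2O train share: %.4f" % h2o_train_share)
```

The simulated counts will not match H2O's, since the random number generators differ, but both land within a fraction of a percent of 80%.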


### 2. Building a Baseline Model

In [12]:
response = "bad_loan"   # target variable

predictors = train.col_names
predictors.remove(response)
predictors.remove("int_rate")
predictors   # features

['loan_amnt',
 'term',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'longest_credit_length',
 'verification_status']
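
The same selection can be written without modifying a list in place. This is a minimal sketch in plain Python, with the column names typed out from the `describe()` output rather than read from `train.col_names`. The notebook does not say why `int_rate` is dropped. One plausible reason (an assumption, not stated in the source) is that the interest rate already reflects an assessment of the borrower's risk and would leak information about the target.

```python
# Column names as listed in the describe() output above
all_columns = [
    "loan_amnt", "term", "int_rate", "emp_length", "home_ownership",
    "annual_inc", "purpose", "addr_state", "dti", "delinq_2yrs",
    "revol_util", "total_acc", "bad_loan", "longest_credit_length",
    "verification_status",
]

response = "bad_loan"
# Columns that must not be used as features
excluded = {response, "int_rate"}

# Keep the original column order, skipping the excluded names
predictors = [col for col in all_columns if col not in excluded]
print(predictors)
```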

In [13]:
# Train up to 6 models (at most 60 s each), skipping DRF, deep learning, and stacked ensembles
aml = H2OAutoML(max_models = 6,
                max_runtime_secs_per_model = 60,
                exclude_algos = ['DRF', 'DeepLearning', 'StackedEnsemble'],
                seed = 12345)

aml.train(x = predictors,
          y = response,
          training_frame = train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [15]:
aml.leaderboard

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_1_AutoML_20201006_132904,0.139504,0.373502,0.139504,0.279279,0.261936
GBM_2_AutoML_20201006_132904,0.139654,0.373703,0.139654,0.278536,0.262141
XGBoost_3_AutoML_20201006_132904,0.139812,0.373914,0.139812,0.278655,0.262403
GLM_1_AutoML_20201006_132904,0.141137,0.375682,0.141137,0.282978,
XGBoost_1_AutoML_20201006_132904,0.146218,0.382385,0.146218,0.278316,0.270003
XGBoost_2_AutoML_20201006_132904,0.162962,0.403686,0.162962,0.293458,0.290704
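
Because `bad_loan` is numeric, the leaderboard reports regression metrics. It appears to be ordered by `mean_residual_deviance` in ascending order, which for these models has the same value as `mse`. The sketch below re-derives that ordering and checks `rmse = sqrt(mse)` using the values printed above, with model IDs shortened for readability:

```python
import math

# (model, mse, rmse) copied from the leaderboard above; IDs shortened
leaderboard = [
    ("XGBoost_2", 0.162962, 0.403686),
    ("GBM_1", 0.139504, 0.373502),
    ("GLM_1", 0.141137, 0.375682),
    ("XGBoost_3", 0.139812, 0.373914),
    ("GBM_2", 0.139654, 0.373703),
    ("XGBoost_1", 0.146218, 0.382385),
]

# Lower error is better, so sort ascending by MSE
ranked = sorted(leaderboard, key=lambda row: row[1])

for name, mse, rmse in ranked:
    # RMSE should match the square root of MSE up to rounding
    print(f"{name:10s} mse={mse:.6f} rmse={rmse:.6f} sqrt(mse)={math.sqrt(mse):.6f}")

best_model = ranked[0][0]
print("best:", best_model)
```

The top three models differ only in the fourth decimal place of `mse`, so their ranking should not be over-interpreted.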




In [16]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_1_AutoML_20201006_132904


Model Summary: 


,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,49.0,49.0,45411.0,6.0,6.0,6.0,33.0,64.0,59.02041




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.13373745179615754
RMSE: 0.36570131500468733
MAE: 0.27346287726909746
RMSLE: 0.25583903320918744
Mean Residual Deviance: 0.13373745179615754

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.13950370194645625
RMSE: 0.37350194369836454
MAE: 0.2792793761249456
RMSLE: 0.26193574601277086
Mean Residual Deviance: 0.13950370194645625

Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.2792794,0.0011512,0.2774286,0.2793406,0.2794319,0.2806024,0.2795933
mean_residual_deviance,0.1395037,0.0010270,0.1380415,0.1393030,0.1392385,0.1407036,0.1402320
mse,0.1395037,0.0010270,0.1380415,0.1393030,0.1392385,0.1407036,0.1402320
r2,0.0645323,0.0020242,0.0621232,0.0630051,0.0672377,0.0648510,0.0654443
residual_deviance,0.1395037,0.0010270,0.1380415,0.1393030,0.1392385,0.1407036,0.1402320
rmse,0.3734999,0.0013755,0.3715393,0.3732331,0.3731467,0.3751048,0.3744757
rmsle,0.2619349,0.0007339,0.2609292,0.2618365,0.2617059,0.2629116,0.2622914
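
The `r2` row can be related to `mse` through `R^2 = 1 - MSE / Var(y)`. As a rough check, the sketch below uses the training deviance at zero trees from the scoring history (0.1491320) as a stand-in for the variance of `bad_loan`. That substitution is an assumption, since the variance within each fold differs slightly.

```python
# Cross-validated MSE (mean column of the summary above)
cv_mse = 0.1395037

# Assumed variance of bad_loan on the training data: the training
# deviance reported at 0 trees in the scoring history
target_variance = 0.1491320

# R^2 = 1 - MSE / Var(y): the fraction of variance explained
r2 = 1 - cv_mse / target_variance
print(f"approximate R^2: {r2:.4f}")
```

The result is close to the reported mean `r2` of 0.0645, meaning the model explains only about 6% of the variance in the 0/1 target.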



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2020-10-06 13:31:22,20.224 sec,0.0,0.3861761,0.2982640,0.1491320
,2020-10-06 13:31:22,20.771 sec,5.0,0.3786662,0.2915320,0.1433881
,2020-10-06 13:31:23,21.117 sec,10.0,0.3744440,0.2866263,0.1402083
,2020-10-06 13:31:23,21.406 sec,15.0,0.3720746,0.2831897,0.1384395
,2020-10-06 13:31:23,21.698 sec,20.0,0.3704525,0.2804646,0.1372350
,2020-10-06 13:31:24,21.995 sec,25.0,0.3692710,0.2787076,0.1363611
,2020-10-06 13:31:24,22.298 sec,30.0,0.3683839,0.2772997,0.1357067
,2020-10-06 13:31:24,22.597 sec,35.0,0.3674631,0.2758390,0.1350291
,2020-10-06 13:31:24,22.890 sec,40.0,0.3667503,0.2748920,0.1345058
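
The scoring history shows diminishing returns as trees are added. The sketch below computes the drop in training RMSE for each block of five trees, using only the rows listed above:

```python
# (number_of_trees, training_rmse) copied from the scoring history above
history = [
    (0, 0.3861761), (5, 0.3786662), (10, 0.3744440),
    (15, 0.3720746), (20, 0.3704525), (25, 0.3692710),
    (30, 0.3683839), (35, 0.3674631), (40, 0.3667503),
]

# Reduction in training RMSE for each additional block of 5 trees
improvements = [
    (trees, prev_rmse - rmse)
    for (_, prev_rmse), (trees, rmse) in zip(history, history[1:])
]

for trees, gain in improvements:
    print(f"trees {trees - 5:2d} -> {trees:2d}: RMSE reduced by {gain:.7f}")
```

The first five trees reduce training RMSE by about 0.0075, while the last block shown reduces it by about 0.0007. These are training-set figures, so they say nothing about generalization on their own.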



Variable Importances: 


variable,relative_importance,scaled_importance,percentage
term,1995.3569336,1.0,0.2032696
addr_state,1931.6018066,0.9680483,0.1967748
annual_inc,1269.1815186,0.6360674,0.1292932
revol_util,1115.3443604,0.5589698,0.1136216
dti,904.6354370,0.4533702,0.0921564
purpose,861.2447510,0.4316244,0.0877361
loan_amnt,621.8223267,0.3116346,0.0633459
longest_credit_length,237.9829712,0.1192684,0.0242436
total_acc,236.8121033,0.1186816,0.0241244
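
The three importance columns are related: `scaled_importance` is the relative importance divided by the largest value, and `percentage` is the relative importance divided by the total over all variables. The sketch below reproduces both from the `relative_importance` values above. Because the listed rows may not cover every predictor, the total is recovered from the reported percentage for `term`, which is an assumption made for this illustration.

```python
# (variable, relative_importance) copied from the table above
relative = {
    "term": 1995.3569336,
    "addr_state": 1931.6018066,
    "annual_inc": 1269.1815186,
    "revol_util": 1115.3443604,
    "dti": 904.6354370,
    "purpose": 861.2447510,
    "loan_amnt": 621.8223267,
    "longest_credit_length": 237.9829712,
    "total_acc": 236.8121033,
}

# scaled_importance: relative importance divided by the largest value
top = max(relative.values())
scaled = {name: value / top for name, value in relative.items()}

# percentage: relative importance divided by the total over all variables.
# The total is recovered from the reported percentage for "term" (0.2032696),
# because the listed rows may not include every variable.
total = relative["term"] / 0.2032696
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name:22s} scaled={scaled[name]:.7f} pct={percentage[name]:.7f}")

listed_share = sum(percentage.values())
print(f"share covered by the listed variables: {listed_share:.4f}")
```

The listed rows account for roughly 93% of the total importance, which suggests the table shown here does not include every predictor.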


