# Deep Learning on Structured Data with H2O: Classification

This tutorial is based on a seminar held in Chicago in February 2016: http://open.h2o.ai/chicago.html

## Installing H2O

To follow this tutorial in Python, first install the h2o Python module.
Supported Python versions: 2.7 or 3.5+

    pip install requests
    pip install tabulate
    pip install scikit-learn
    pip install colorama
    pip install future
    pip install https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/6/Python/h2o-3.14.0.6-py2.py3-none-any.whl

If a newer version has been released, first remove the existing h2o with

    pip uninstall h2o

and then install the new version.
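
Before uninstalling, it can help to confirm which h2o version is currently installed. The following is a minimal sketch, assuming Python 3.8+ (where `importlib.metadata` is in the standard library); the helper name `installed_version` is hypothetical and not part of h2o.

```python
from importlib import metadata  # standard library from Python 3.8 (assumption)


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as the one in the wheel name, or None if h2o is missing
print(installed_version("h2o"))
```

If this prints `None`, there is nothing to uninstall and the `pip install` commands above can be run directly.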

## 1. Setting up the H2O environment

Start an H2O cluster locally (on a laptop or desktop).

In [1]:
import h2o

# nthreads sets the number of CPU cores to use; max_mem_size sets the memory limit (in GB)
# nthreads = -1 means use all available cores
h2o.init(nthreads = -1, max_mem_size = 4)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from C:\ProgramData\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\DAVIDO~1\AppData\Local\Temp\tmp7qyqjaz1
  JVM stdout: C:\Users\DAVIDO~1\AppData\Local\Temp\tmp7qyqjaz1\h2o_David_Oh_started_from_python.out
  JVM stderr: C:\Users\DAVIDO~1\AppData\Local\Temp\tmp7qyqjaz1\h2o_David_Oh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| | |
|---|---|
| H2O cluster uptime: | 03 secs |
| H2O cluster version: | 3.14.0.6 |
| H2O cluster version age: | 7 days, 14 hours and 42 minutes |
| H2O cluster name: | H2O_from_python_David_Oh_2r3fj9 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.556 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |
| H2O cluster status: | accepting new members, healthy |
| H2O connection url: | http://127.0.0.1:54321 |
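
The first line of the log shows `h2o.init()` checking for an existing instance at http://localhost:54321 before it starts a new server. A rough way to run a similar check by hand is sketched below. It only tests whether something accepts TCP connections on that port, which is an assumption-laden stand-in and not how h2o itself probes the server; the helper name `is_listening` is hypothetical.

```python
import socket


def is_listening(host="127.0.0.1", port=54321, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


# True if some process (possibly an H2O server) is listening on the default port
print(is_listening())
```

A `True` result does not prove that the listener is H2O; it only indicates that the port is in use.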


## 2. 데이터 준비

### Loading the data
The input is the preprocessed "Bad Loans" dataset from Lending Club. The goal of this tutorial is to build a model that predicts whether a loan will be repaid. The dataset has 15 columns and 163,987 rows, and the target variable is 'bad_loan'. The raw data looks like this:

"loan_amnt","term","int_rate","emp_length","home_ownership","annual_inc","purpose","addr_state","dti","delinq_2yrs","revol_util","total_acc","bad_loan","longest_credit_length","verification_status"
5000,"36 months",10.65,10,"RENT",24000.0,"credit_card","AZ",27.650000000000002,0,83.7,9,0,26,"verified"
2500,"60 months",15.27,0,"RENT",30000.0,"car","GA",1.0,0,9.4,4,1,12,"verified"
2400,"36 months",15.96,10,"RENT",12252.0,"small_business","IL",8.72,0,98.5,10,0,10,"not verified"

The 'bad_loan' column is 1 if the loan was not repaid (a bad loan) and 0 if it was repaid.

In [9]:
loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data = h2o.import_file(loan_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [10]:
data.shape

(163987, 15)

In [11]:
data.head(5)

| loan_amnt | term | int_rate | emp_length | home_ownership | annual_inc | purpose | addr_state | dti | delinq_2yrs | revol_util | total_acc | bad_loan | longest_credit_length | verification_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5000 | 36 months | 10.65 | 10 | RENT | 24000 | credit_card | AZ | 27.65 | 0 | 83.7 | 9 | 0 | 26 | verified |
| 2500 | 60 months | 15.27 | 0 | RENT | 30000 | car | GA | 1.0 | 0 | 9.4 | 4 | 1 | 12 | verified |
| 2400 | 36 months | 15.96 | 10 | RENT | 12252 | small_business | IL | 8.72 | 0 | 98.5 | 10 | 0 | 10 | not verified |
| 10000 | 36 months | 13.49 | 10 | RENT | 49200 | other | CA | 20.0 | 0 | 21.0 | 37 | 0 | 15 | verified |
| 5000 | 36 months | 7.9 | 3 | RENT | 36000 | wedding | AZ | 11.2 | 0 | 28.3 | 12 | 0 | 7 | verified |




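### Converting the target variable
The target variable 'bad_loan' is currently stored as a numeric value, so it must be converted to a factor (categorical) for the problem to be treated as classification.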

In [12]:
data['bad_loan'] = data['bad_loan'].asfactor()
data['bad_loan'].levels()

[['0', '1']]

### Splitting the data
The data is split into three frames, used to train, validate, and test the model:

    training: 70%
    validation: 15%
    test: 15%

In [13]:
splits = data.split_frame(ratios = [0.7, 0.15], seed=1)  # setting a seed makes the split reproducible

train = splits[0]
valid = splits[1]
test = splits[2]
# Note that the split is not exactly 70% / 15% / 15%

In [14]:
print (train.nrow)
print (valid.nrow)
print (test.nrow)

114908
24498
24581
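
The row counts above are close to, but not exactly, 70% / 15% / 15%. H2O's `split_frame` documentation describes the split as probabilistic rather than exact. The sketch below illustrates the idea in plain Python by assigning each row independently at random. It is not H2O's implementation, and the function name `probabilistic_split` is hypothetical.

```python
import random


def probabilistic_split(n_rows, ratios, seed=1):
    """Count rows per split when each row is assigned independently at random.

    Illustrative only; this is not H2O's actual algorithm.
    """
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.7, 0.85] for ratios [0.7, 0.15]
    bounds = []
    running = 0.0
    for ratio in ratios:
        running += ratio
        bounds.append(running)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                counts[i] += 1
                break
        else:
            # Row falls into the remainder split
            counts[-1] += 1
    return counts


n_rows = 163987
counts = probabilistic_split(n_rows, [0.7, 0.15])
print(counts)
print([round(c / n_rows, 4) for c in counts])

# Fractions implied by the row counts recorded in the notebook output
recorded = [114908, 24498, 24581]
print([round(c / n_rows, 4) for c in recorded])
```

Both the simulated and the recorded fractions land near the requested ratios without matching them exactly, which is the behaviour expected from random assignment.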


### Setting the target and input variables


In [15]:
y = 'bad_loan'
x = list(data.columns)

In [17]:
# Select the variables
x.remove(y)
x.remove('int_rate')  # int_rate is removed because a correlation analysis showed it is correlated with the outcome

In [21]:
print (x)
print (y)

['loan_amnt', 'term', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc', 'longest_credit_length', 'verification_status']
bad_loan
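
The same predictor list can be built in one step with a list comprehension, which avoids mutating `x` across cells (re-running a cell that calls `x.remove(...)` raises a `ValueError` once the item is gone). The sketch below uses the column names from the CSV header shown earlier as a plain Python list, so it runs without an H2O cluster; in the notebook, `columns` would be `data.columns`.

```python
# Column names copied from the CSV header shown earlier in this tutorial
columns = [
    "loan_amnt", "term", "int_rate", "emp_length", "home_ownership",
    "annual_inc", "purpose", "addr_state", "dti", "delinq_2yrs",
    "revol_util", "total_acc", "bad_loan", "longest_credit_length",
    "verification_status",
]

y = "bad_loan"
excluded = {y, "int_rate"}  # the target and the column dropped above

# Keep the original column order, skipping the excluded names
x = [c for c in columns if c not in excluded]

print(len(x))
print(x)
```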


## 3. Modeling with Deep Learning
H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network. It can also be used to train an autoencoder. In this tutorial, we train a supervised prediction model on the loan data.

In [22]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

First, train a basic deep learning model with the default parameter values. The following is a summary from the H2O deep learning reference.
First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine.

In H2O's DL, early stopping is enabled by default, so below, it will use the training set and default stopping parameters to perform early stopping.


In [23]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


### Train with new architecture and more epochs

This time we increase the number of epochs. More epochs may improve performance, but they also raise the risk of overfitting. H2O provides early stopping functionality that looks for a suitable number of epochs, and it is enabled by default. Setting the 'stopping_rounds' parameter to 0 turns it off.

In [24]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  # disable early stopping
                                   seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
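
The `hidden=[10,10]` argument above describes two hidden layers of 10 units each. To make "multilayer feed-forward network" concrete, here is a minimal plain-Python sketch of a forward pass through such a network. The weights are random, the input size of 5 is a toy value (the real input layer depends on how the loan columns are encoded), and the ReLU and softmax choices are assumptions for illustration. It is not H2O's implementation.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus bias per unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Apply ReLU after each hidden layer and softmax after the output layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(1)
sizes = [5, 10, 10, 2]  # toy input size, two hidden layers of 10, two classes
layers = [random_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

probs = forward([0.2, -1.0, 0.5, 0.0, 1.3], layers)
print(probs)  # two class probabilities
```

Training adjusts the weights so that the second probability tracks the chance of a bad loan; each epoch is one pass over the training rows.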


### Train DL with early stopping

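This model uses the same parameter values as the previous one, but with early stopping enabled and an explicit stopping criterion. Here the criterion is AUC, and the validation frame is passed to train() so that the metric can be evaluated on held-out data.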

In [25]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


### Comparing model performance

The models are compared by their AUC on the test data.

In [26]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [28]:
print (dl_perf1.auc()) # highest AUC of the three models
print (dl_perf2.auc())
print (dl_perf3.auc())

0.6815131460573379
0.6789750104322544
0.6811862996300405
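
For readers unfamiliar with the metric: AUC can be read as the probability that a randomly chosen positive example (here, a bad loan) receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs on a toy example. It is meant to illustrate the definition, not to reproduce how H2O computes AUC, and the function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. O(n_pos * n_neg), for illustration only.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the three test AUCs of roughly 0.68 are close to one another and moderately better than chance.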


In [29]:
dl_fit3.scoring_history()

| | timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_logloss | training_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_auc | validation_lift | validation_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-10-17 16:14:10 | 0.000 sec | | 0.0 | 0 | 0.0 | | | | | | | | | | |
| 1 | 2017-10-17 16:14:11 | 0.691 sec | 258741 obs/sec | 0.871419 | 1 | 100133.0 | 0.383815 | 0.474358 | 0.65655 | 2.558203 | 0.378641 | 0.383771 | 0.47436 | 0.658069 | 2.597762 | 0.411911 |
| 2 | 2017-10-17 16:14:12 | 2.275 sec | 382042 obs/sec | 6.087653 | 7 | 699520.0 | 0.376736 | 0.450241 | 0.675731 | 2.830353 | 0.4196 | 0.376646 | 0.450089 | 0.677478 | 2.487688 | 0.332884 |
| 3 | 2017-10-17 16:14:14 | 3.771 sec | 411272 obs/sec | 11.310092 | 13 | 1299620.0 | 0.377134 | 0.451596 | 0.676154 | 2.721493 | 0.383091 | 0.377061 | 0.451768 | 0.677449 | 2.663807 | 0.367785 |
| 4 | 2017-10-17 16:14:15 | 5.499 sec | 422627 obs/sec | 17.404132 | 20 | 1999874.0 | 0.379622 | 0.455766 | 0.680129 | 3.048072 | 0.326962 | 0.379891 | 0.456319 | 0.679761 | 2.619777 | 0.333497 |
| 5 | 2017-10-17 16:14:16 | 6.277 sec | 429057 obs/sec | 20.017571 | 23 | 2300179.0 | 0.377347 | 0.450836 | 0.67554 | 2.394914 | 0.370854 | 0.377771 | 0.452132 | 0.672465 | 2.663807 | 0.347743 |
| 6 | 2017-10-17 16:14:16 | 6.414 sec | 427145 obs/sec | 20.017571 | 23 | 2300179.0 | 0.379622 | 0.455766 | 0.680129 | 3.048072 | 0.326962 | 0.379891 | 0.456319 | 0.679761 | 2.619777 | 0.333497 |
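
The scoring history above shows training running through all 20 requested epochs, which suggests the early stopping condition for dl_fit3 was never met. The sketch below applies a simplified stopping rule to the validation_auc values from scoring events 1 to 5. H2O's documented rule compares moving averages of the metric; this version uses a plainer best-so-far comparison, so treat it as an illustration of the idea rather than H2O's exact logic. The function name `should_stop` is hypothetical.

```python
def should_stop(scores, stopping_rounds, tolerance):
    """Simplified early stopping check for a metric where larger is better.

    Stop when the best of the last `stopping_rounds` scores fails to beat
    the best earlier score by at least `tolerance` (relative improvement).
    """
    if stopping_rounds <= 0 or len(scores) <= stopping_rounds:
        return False  # disabled, or not enough history yet
    best_before = max(scores[:-stopping_rounds])
    best_recent = max(scores[-stopping_rounds:])
    return (best_recent - best_before) / best_before < tolerance


# validation_auc from scoring events 1 to 5 in the history above
validation_auc = [0.658069, 0.677478, 0.677449, 0.679761, 0.672465]

# Check the rule after each scoring event, as training would
for n_events in range(1, len(validation_auc) + 1):
    stop = should_stop(validation_auc[:n_events], stopping_rounds=3, tolerance=0.0005)
    print(n_events, stop)
```

With stopping_rounds=3 and stopping_tolerance=0.0005, this simplified rule never fires on these values, which is consistent with the run using its full epoch budget. Passing stopping_rounds=0 always returns False, mirroring how dl_fit2 disabled early stopping.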


In [33]:
print (dl_perf1.precision())
print (dl_perf2.precision())
print (dl_perf3.precision())

[[0.8778770103188119, 1.0]]
[[0.6879366528374184, 1.0]]
[[0.6912144478543738, 1.0]]
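
Each pair printed above appears to be `[threshold, precision]`. As far as I understand, when no threshold is passed, H2O reports the metric at the threshold that maximizes it, which would explain why all three models show a precision of 1.0 and why these numbers are not a useful basis for comparing the models. The toy sketch below shows how precision depends on the threshold; the data and the function name `precision_at` are made up for illustration.

```python
def precision_at(labels, scores, threshold):
    """Precision = true positives / predicted positives at a given threshold."""
    predicted_positive = [label for label, s in zip(labels, scores) if s >= threshold]
    if not predicted_positive:
        return None  # precision is undefined when nothing is predicted positive
    return sum(predicted_positive) / len(predicted_positive)


# Hypothetical labels (1 = bad loan) and predicted probabilities
toy_labels = [0, 0, 1, 1, 1, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]

print(precision_at(toy_labels, toy_scores, 0.75))  # 1.0: only two confident, correct predictions
print(precision_at(toy_labels, toy_scores, 0.30))  # 0.6: more predictions, some wrong
```

A sufficiently high threshold often yields perfect precision on a handful of rows, so comparing models at a fixed threshold, or by AUC as above, is usually more informative.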


In [34]:
h2o.shutdown()

    >>> h2o.shutdown()
        ^^^^ Deprecated, use ``h2o.cluster().shutdown()``.
H2O session _sid_b46b closed.
