# Deep Learning on Tabular Data with H2O - Regression

## Installing H2O

To follow along in a Python environment, first install the h2o Python module.
The Python version must be 2.7 or 3.5+.

    >> pip install requests
    >> pip install tabulate
    >> pip install scikit-learn
    >> pip install colorama
    >> pip install future
    >> pip install https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/6/Python/h2o-3.14.0.6-py2.py3-none-any.whl

If a newer version has been released, remove the existing h2o with

    >> pip uninstall h2o
    
and then install the new version of h2o.

## Setting Up the Environment

In [1]:
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from C:\ProgramData\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\DAVIDO~1\AppData\Local\Temp\tmptzaa682o
  JVM stdout: C:\Users\DAVIDO~1\AppData\Local\Temp\tmptzaa682o\h2o_David_Oh_started_from_python.out
  JVM stderr: C:\Users\DAVIDO~1\AppData\Local\Temp\tmptzaa682o\h2o_David_Oh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 03 secs
H2O cluster version: 3.14.0.6
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_David_Oh_xmlok6
H2O cluster total nodes: 1
H2O cluster free memory: 3.531 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321


## Regression in Deep Learning with the wine Data

This section runs a deep-learning-based regression on the well-known wine data, which is widely used for regression.

   Input variables (based on physicochemical tests):
   1 - fixed acidity
   2 - volatile acidity
   3 - citric acid
   4 - residual sugar
   5 - chlorides
   6 - free sulfur dioxide
   7 - total sulfur dioxide
   8 - density
   9 - pH
   10 - sulphates
   11 - alcohol
   Output variable (based on sensory data): 
   12 - quality (score between 0 and 10)

The goal is to predict the quality value by regression.

In [3]:
# Load the wine data
data = h2o.import_file('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
data.shape

Parse progress: |█████████████████████████████████████████████████████████| 100%


(4898, 12)

In [4]:
data.head(5)

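| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |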




## Data Preprocessing
x holds the names of the input variables, and y is assigned the name of the target variable.

The wine data is split into three sets.
    
    1) train: training data
    2) valid: validation data
    3) test: test data

In [5]:
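# A single slice on an H2OFrame selects columns, so this keeps every column except the last one (quality)
x = data[:-1].columns
y = 'quality'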
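The slice above relies on quality being the last column. A sketch of an equivalent selection by name follows. The `columns` list is typed out by hand from the data.head(5) output purely for illustration; with a live frame it would come from `data.columns`.

```python
# Column names as printed by data.head(5); hard-coded for illustration only.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

y = "quality"
# Keep every column except the target, regardless of where the target sits.
x = [name for name in columns if name != y]

print(len(x))
print(x == columns[:-1])
```

Selecting by name keeps working if the column order of the file ever changes, which the positional slice does not guarantee.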

In [6]:
# Split the data at a ratio of 70 : 15 : 15
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)

train = splits[0]
valid = splits[1]
test = splits[2]

In [7]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

3459
710
729
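The counts are close to 70 : 15 : 15 but not exact, because split_frame assigns rows probabilistically rather than cutting at exact positions. A quick check of the realized fractions, using only the row counts printed above:

```python
# Row counts printed by the cell above.
counts = {"train": 3459, "valid": 710, "test": 729}
requested = {"train": 0.70, "valid": 0.15, "test": 0.15}

total = sum(counts.values())
print("total rows:", total)

# Fraction of all rows that actually landed in each split.
realized = {name: n / total for name, n in counts.items()}
for name in counts:
    print(f"{name}: requested {requested[name]:.2f}, realized {realized[name]:.4f}")
```

The total matches the 4898 rows reported by data.shape, so no rows are lost in the split.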


## Data Modeling

Build a deep learning model and train it on the train data created above.

The valid data is used for validation during training.

In [8]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [9]:
# Create and train the deep learning model
dlmod1 = H2ODeepLearningEstimator(model_id='dlmod1', hidden=[10,10], epochs=10, seed=1)
dlmod1.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
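hidden=[10,10] requests two hidden layers of 10 units each. As a rough sketch of the resulting network size, the helper below counts weights and biases. It assumes the 11 numeric inputs feed a fully connected network with one bias per unit and a single output unit for regression. These assumptions and the `count_parameters` helper are made up for this illustration and are not taken from H2O output.

```python
def count_parameters(n_inputs, hidden, n_outputs=1):
    """Count weights and biases of a fully connected feed-forward network."""
    sizes = [n_inputs] + list(hidden) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out  # weights between consecutive layers
        total += fan_out           # one bias per unit in the receiving layer
    return total

# 11 input variables, two hidden layers of 10 units, 1 regression output.
n_params = count_parameters(n_inputs=11, hidden=[10, 10])
print(n_params)
```

Under these assumptions the model is small relative to the 3459 training rows.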


In [10]:
# Check the model performance
train_mse_dl = dlmod1.model_performance(train).mse()
valid_mse_dl = dlmod1.model_performance(valid).mse()
test_mse_dl = dlmod1.model_performance(test).mse()
print(" DL mse TRAIN=", train_mse_dl, ", mse VALID=", valid_mse_dl, ", mse TEST=", test_mse_dl)

 DL mse TRAIN= 0.5218589809369525 , mse VALID= 0.5113199670405458 , mse TEST= 0.5347020113085751
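MSE is expressed in squared quality points, so taking the square root gives an error in the same units as the quality score. A small sketch using the three values printed above (copied by hand):

```python
import math

# MSE values copied from the output above.
mse = {
    "TRAIN": 0.5218589809369525,
    "VALID": 0.5113199670405458,
    "TEST": 0.5347020113085751,
}

# RMSE is in the same units as the target (quality points).
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.4f}")

# Difference between held-out error and training error.
gap_valid = mse["VALID"] - mse["TRAIN"]
gap_test = mse["TEST"] - mse["TRAIN"]
print(f"VALID - TRAIN = {gap_valid:+.4f}, TEST - TRAIN = {gap_test:+.4f}")
```

The three errors are close to each other, which is one informal sign that the model is not strongly overfitting the training data.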


In [11]:
# Training and validation results
dlmod1.show()

Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dlmod1


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.5218589809369518
RMSE: 0.7223980765041889
MAE: 0.5597629013949419
RMSLE: 0.1074711459591223
Mean Residual Deviance: 0.5218589809369518

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 0.5113199670405458
RMSE: 0.715066407433985
MAE: 0.5641108537246852
RMSLE: 0.10546051537847546
Mean Residual Deviance: 0.5113199670405458
Scoring History: 


| timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_deviance | training_mae | validation_rmse | validation_deviance | validation_mae |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-10-20 14:12:57 | 0.000 sec | | 0.0 | 0 | 0.0 | | | | | | |
| 2017-10-20 14:12:58 | 1.287 sec | 31162 obs/sec | 1.0 | 1 | 3459.0 | 0.7660694 | 0.5868623 | 0.5941512 | 0.7441256 | 0.5537229 | 0.5898432 |
| 2017-10-20 14:12:59 | 1.499 sec | 118458 obs/sec | 10.0 | 10 | 34590.0 | 0.7223981 | 0.5218590 | 0.5597629 | 0.7150664 | 0.5113200 | 0.5641109 |
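Two relationships can be read off the scoring history: the samples column is epochs times the number of training rows, and for this regression model the deviance column is the squared RMSE. A quick check with values copied by hand from the table, so agreement holds only up to the printed rounding:

```python
train_rows = 3459  # train.nrow from the split above

# (epochs, samples, training_rmse, training_deviance) from the scoring history.
history = [
    (1.0, 3459.0, 0.7660694, 0.5868623),
    (10.0, 34590.0, 0.7223981, 0.5218590),
]

for epochs, samples, rmse, deviance in history:
    samples_ok = epochs * train_rows == samples          # samples = epochs * rows
    deviance_ok = abs(rmse ** 2 - deviance) < 1e-6       # deviance = rmse squared
    print(epochs, samples_ok, deviance_ok)
```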


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| residual sugar | 1.0 | 1.0 | 0.1209199 |
| volatile acidity | 0.9081931 | 0.9081931 | 0.1098186 |
| total sulfur dioxide | 0.8418853 | 0.8418853 | 0.1018007 |
| fixed acidity | 0.8258755 | 0.8258755 | 0.0998648 |
| alcohol | 0.7953501 | 0.7953501 | 0.0961736 |
| citric acid | 0.7945948 | 0.7945948 | 0.0960823 |
| free sulfur dioxide | 0.7723160 | 0.7723160 | 0.0933884 |
| pH | 0.7467077 | 0.7467077 | 0.0902918 |
| density | 0.5554624 | 0.5554624 | 0.0671664 |
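The percentage column appears to be each variable's relative importance divided by the sum over all inputs. The listing above shows nine of the eleven inputs, so the sketch below does not sum the rows. Instead it infers the total from the first row and checks the remaining rows against it, with values copied by hand from the table.

```python
# (variable, relative_importance, percentage) as listed above.
rows = [
    ("residual sugar", 1.0, 0.1209199),
    ("volatile acidity", 0.9081931, 0.1098186),
    ("total sulfur dioxide", 0.8418853, 0.1018007),
    ("fixed acidity", 0.8258755, 0.0998648),
    ("alcohol", 0.7953501, 0.0961736),
    ("citric acid", 0.7945948, 0.0960823),
    ("free sulfur dioxide", 0.7723160, 0.0933884),
    ("pH", 0.7467077, 0.0902918),
    ("density", 0.5554624, 0.0671664),
]

# Infer the sum of relative importances over all inputs from the top row.
first_rel, first_pct = rows[0][1], rows[0][2]
implied_total = first_rel / first_pct

for name, rel, pct in rows:
    print(f"{name}: listed {pct:.7f}, recomputed {rel / implied_total:.7f}")

# The listed rows alone add up to less than the implied total.
listed_sum = sum(rel for _, rel, _ in rows)
print("sum over listed rows:", round(listed_sum, 4))
print("implied total:", round(implied_total, 4))
```

The listed rows summing to less than the implied total is consistent with the two inputs that do not appear in the listing (chlorides and sulphates) accounting for the remainder.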


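## Testing the Model

Normally, several models should be built and the best-performing one selected, for example by comparing the MSE gap between train and valid, but that step is omitted here.

The test data is fed to the model built above to predict values.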

In [12]:
yhat = dlmod1.predict(test)

deeplearning prediction progress: |███████████████████████████████████████| 100%


In [13]:
# Predicted values
yhat

predict
5.69824
6.4496
5.56955
6.28495
5.75692
5.52067
5.7281
5.53897
5.11098
5.16609
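The predictions are continuous, while quality is described above as a score between 0 and 10 and appears as an integer in the data preview. If integer scores are needed, one simple option is to round each prediction. This is an illustration only and not part of the original analysis; the values below are the ten predictions shown above.

```python
# The ten predictions displayed above, copied by hand.
preds = [5.69824, 6.4496, 5.56955, 6.28495, 5.75692,
         5.52067, 5.7281, 5.53897, 5.11098, 5.16609]

# Round to the nearest integer and clip to the documented 0-10 score range.
rounded = [min(10, max(0, round(p))) for p in preds]
print(rounded)
```

Rounding discards information, so error metrics such as MSE are usually better computed on the raw predictions.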


