
# Iris Dataset with H2O.ai

**By - Yashasvi Jariwala**

Dataset : "https://www.kaggle.com/uciml/iris#Iris.csv"


# **Install H2O**

The first step in this tutorial is to download and install the h2o Python module.

The latest version is always here: http://www.h2o.ai/download/h2o/py

# Start up the H2O Cluster


Once the Python module is installed, we begin by starting up a local (on your laptop) H2O cluster.


In [55]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o

# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)


Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,85 days 2 hours 13 mins
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.20.0.3
H2O cluster version age:,3 months and 8 days
H2O cluster name:,H2O_from_python_yashasvijariwala_3u6sip
H2O cluster total nodes:,1
H2O cluster free memory:,6.855 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


# **Data Preparation**

**Import data**

Next we will import the Iris dataset. The purpose here is to predict the species of an iris flower from four measurements: sepal length, sepal width, petal length, and petal width. The response column, Species, takes one of three values: Iris-setosa, Iris-versicolor, or Iris-virginica.

In [56]:
cd Desktop

[Errno 2] No such file or directory: 'Desktop'
/Users/yashasvijariwala/Desktop


In [57]:
iris = "https://raw.githubusercontent.com/Avkash/mldl/master/data/iris.csv"  # Remote URL of the CSV file (no header row)

data = h2o.import_file(iris)  # 150 rows x 5 columns

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [58]:
data.shape

(150, 5)

**Encode response variable**

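Since we want to train a multiclass classification model (three species), we must ensure that the response is coded as a factor. If the response were coded as integers (for example 0/1/2), H2O would assume it's numeric, which means that H2O would train a regression model instead. Calling asfactor() on the Species column makes its categorical type explicit.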

In [59]:
data.describe

C1,C2,C3,C4,C5
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


<bound method H2OFrame.describe of >

In [62]:
# data = data.drop('Id', axis=1)  # Only needed if the file has an 'Id' column: a unique serial number
# is of no use for classification. The file loaded above has no such column.

In [68]:
data.names
[u'C1', u'C2', u'C3', u'C4', u'C5']

data.set_names(['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm','Species'])


SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa




In [69]:
data['Species'] = data['Species'].asfactor()

In [70]:
data.describe()

Rows:150
Cols:5




Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
type,real,real,real,real,enum
mins,4.3,2.0,1.0,0.1,
mean,5.843333333333333,3.053999999999999,3.758666666666667,1.1986666666666665,
maxs,7.9,4.4,6.9,2.5,
sigma,0.8280661279778637,0.43359431136217375,1.764420419952262,0.7631607417008414,
zeros,0,0,0,0,
missing,0,0,0,0,0
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


In [71]:
data.columns_by_type("categorical")

[4.0]

In [72]:
data[data.columns_by_type("categorical")].col_names

['Species']

In [73]:
data['Species'].levels()

[['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']]

In [74]:
response = "Species"

In [75]:
features = data.col_names
features.remove(response)
print(features)

['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']


# 1. Gradient Boosting Machine

In [76]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [77]:
iris_gbm = H2OGradientBoostingEstimator()

In [78]:
iris_gbm.train(x = features, y = response, training_frame=data)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [79]:
iris_gbm

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1532633295350_7386


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.0028363898527273166
RMSE: 0.05325776800361912
LogLoss: 0.018812456609360567
Mean Per-Class Error: 0.0
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,50.0,0.0,0.0,0.0,0 / 50
Iris-versicolor,0.0,50.0,0.0,0.0,0 / 50
Iris-virginica,0.0,0.0,50.0,0.0,0 / 50
Total,50.0,50.0,50.0,0.0,0 / 150


Top-3 Hit Ratios: 


0,1
k,hit_ratio
1,1.0
2,1.0
3,1.0


Scoring History: 


0,1,2,3,4,5,6
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error
,2018-10-19 17:45:14,0.011 sec,0.0,0.6666667,1.0986123,0.64
,2018-10-19 17:45:14,0.072 sec,1.0,0.6039400,0.9266417,0.04
,2018-10-19 17:45:14,0.092 sec,2.0,0.5463898,0.7914320,0.0466667
,2018-10-19 17:45:14,0.106 sec,3.0,0.4947885,0.6835336,0.0466667
,2018-10-19 17:45:14,0.128 sec,4.0,0.4481401,0.5945290,0.0466667
---,---,---,---,---,---,---
,2018-10-19 17:45:15,1.133 sec,46.0,0.0603513,0.0225288,0.0
,2018-10-19 17:45:15,1.143 sec,47.0,0.0581134,0.0214278,0.0
,2018-10-19 17:45:15,1.154 sec,48.0,0.0563745,0.0204908,0.0



See the whole table with table.as_data_frame()
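
The first row of the scoring history (zero trees) can be sanity-checked by hand. The sketch below assumes that the initial model predicts each of the three balanced classes with probability 1/3, and that the multinomial MSE is the mean of (1 - p_true)^2. Both are assumptions, but they reproduce the reported log loss and RMSE.

```python
import math

n_classes = 3
p_true = 1.0 / n_classes  # assumed: uniform prediction over three balanced classes

# Log loss of a uniform prediction is -ln(1/3)
baseline_logloss = -math.log(p_true)

# Assumed definition: squared error per row is (1 - p_true)^2, identical for every row here
baseline_rmse = math.sqrt((1.0 - p_true) ** 2)

print(round(baseline_logloss, 7))
print(round(baseline_rmse, 7))
```

These values match the training_logloss and training_rmse columns in the row with 0 trees.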
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
PetalLengthCm,227.9377289,1.0,0.4959819
PetalWidthCm,226.4931946,0.9936626,0.4928387
SepalWidthCm,2.8897071,0.0126776,0.0062879
SepalLengthCm,2.2479935,0.0098623,0.0048915
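
The scaled_importance and percentage columns can be derived from relative_importance alone. A minimal sketch, using the four values from the table above: scaled importance divides by the largest value, and percentage divides by the sum.

```python
# relative_importance values copied from the GBM output above
relative_importance = {
    "PetalLengthCm": 227.9377289,
    "PetalWidthCm": 226.4931946,
    "SepalWidthCm": 2.8897071,
    "SepalLengthCm": 2.2479935,
}

top = max(relative_importance.values())
total = sum(relative_importance.values())

# Scale by the most important variable, and normalize so the shares sum to 1
scaled = {name: value / top for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(name, round(scaled[name], 7), round(percentage[name], 7))
```

The two petal measurements together account for almost 99% of the total importance.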




In [80]:
iris_prediction = iris_gbm.predict(test_data=data)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [81]:
iris_prediction

predict,Iris-setosa,Iris-versicolor,Iris-virginica
Iris-setosa,0.99893,0.000565689,0.000504307
Iris-setosa,0.998931,0.000566147,0.000502475
Iris-setosa,0.99893,0.00056636,0.000503784
Iris-setosa,0.998496,0.000651493,0.000852955
Iris-setosa,0.99893,0.000565701,0.000504318
Iris-setosa,0.998345,0.00115235,0.000502861
Iris-setosa,0.99893,0.000566356,0.000503835
Iris-setosa,0.998879,0.000617457,0.000503239
Iris-setosa,0.998947,0.000550974,0.000502437
Iris-setosa,0.998496,0.00065149,0.000852944




In [82]:
iris_gbm.model_performance(test_data=data)


ModelMetricsMultinomial: gbm
** Reported on test data. **

MSE: 0.0028363898944457956
RMSE: 0.05325776839528479
LogLoss: 0.018812456568859062
Mean Per-Class Error: 0.0
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,50.0,0.0,0.0,0.0,0 / 50
Iris-versicolor,0.0,50.0,0.0,0.0,0 / 50
Iris-virginica,0.0,0.0,50.0,0.0,0 / 50
Total,50.0,50.0,50.0,0.0,0 / 150


Top-3 Hit Ratios: 


0,1
k,hit_ratio
1,1.0
2,1.0
3,1.0




In [83]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


In [84]:
test_data = [{'SepalLengthCm': 3.4, 'SepalWidthCm': 3.5, 'PetalLengthCm': 5.1, 'PetalWidthCm': 4.1},
         {'SepalLengthCm': 3.1, 'SepalWidthCm': 3.9, 'PetalLengthCm': 4.1, 'PetalWidthCm': 5.3},
         {'SepalLengthCm': 2.9, 'SepalWidthCm': 3.7, 'PetalLengthCm': 4.6, 'PetalWidthCm': 3.5 }]
df_test = pd.DataFrame(test_data)

In [85]:
df_test

Unnamed: 0,PetalLengthCm,PetalWidthCm,SepalLengthCm,SepalWidthCm
0,5.1,4.1,3.4,3.5
1,4.1,5.3,3.1,3.9
2,4.6,3.5,2.9,3.7


In [86]:
df_pred = h2o.H2OFrame(df_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [87]:
pred_result = iris_gbm.predict(test_data=df_pred)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [88]:
pred_result

predict,Iris-setosa,Iris-versicolor,Iris-virginica
Iris-virginica,0.00184306,0.0162578,0.981899
Iris-virginica,0.0105074,0.287258,0.702234
Iris-virginica,0.0111248,0.243437,0.745439
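
In the output above, the predict column is consistent with choosing the class that has the highest predicted probability. The sketch below checks this on the three rows shown, with the probabilities copied from the table. Note also that the petal widths in the test rows (3.5 to 5.3) lie outside the range seen in training (maximum 2.5), so these predictions are extrapolations.

```python
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Class probabilities copied from the prediction output above
probabilities = [
    [0.00184306, 0.0162578, 0.981899],
    [0.0105074, 0.287258, 0.702234],
    [0.0111248, 0.243437, 0.745439],
]

# Pick the class with the largest probability in each row
predicted = [classes[row.index(max(row))] for row in probabilities]

# Each row of probabilities should sum to approximately 1
row_sums = [sum(row) for row in probabilities]

print(predicted)
print(row_sums)
```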




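The GBM fits the training data perfectly and has the lowest training error of the models in this notebook. Note that all of the GBM metrics above are computed on the same data the model was trained on, so they do not measure how well it generalizes.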

# 2. Generalized Linear Model


In [94]:
y = 'Species'
x = list(data.columns)
x.remove(y)
x

['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

In [91]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

In [92]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

112
19
19
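
The split sizes (112, 19, 19) are not the exact 105 / 22.5 / 22.5 that the ratios imply. This is consistent with rows being assigned to splits at random according to the ratios, rather than being cut at exact counts. The sketch below is a pure-Python stand-in for that idea, not H2O's implementation; the helper name probabilistic_split is hypothetical, and its random numbers will not match H2O's for the same seed.

```python
import random


def probabilistic_split(n_rows, ratios, seed):
    """Assign each row index to a split by drawing one uniform random number.

    `ratios` gives the share of every split except the last, which
    receives whatever is left over (here 1 - 0.7 - 0.15 = 0.15).
    """
    rng = random.Random(seed)

    # Turn the ratios into cumulative thresholds, e.g. [0.7, 0.85]
    thresholds = []
    running = 0.0
    for ratio in ratios:
        running += ratio
        thresholds.append(running)

    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for i, threshold in enumerate(thresholds):
            if u < threshold:
                splits[i].append(row)
                break
        else:
            # No threshold exceeded u, so the row goes to the last split
            splits[-1].append(row)
    return splits


train_idx, valid_idx, test_idx = probabilistic_split(150, [0.7, 0.15], seed=1)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one split, but the sizes only approximate the requested ratios, which matters more on a small dataset like this one.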


In [107]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

**Train a default GLM**

We first create an object of class "H2OGeneralizedLinearEstimator". This does not do any training by itself; it only sets the model up for training by specifying model parameters. Training happens when we call train().

In [134]:
multinomial_fit1 = H2OGeneralizedLinearEstimator(family = "multinomial", model_id='multinomial_fit1')
multinomial_fit1.train(y = response, x = features, training_frame = data)

glm Model Build progress: |███████████████████████████████████████████████| 100%


**Train a GLM with lambda search**

Next we will do some automatic tuning by setting lambda_search = True. Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting). The model parameter lambda controls the amount of regularization in a GLM, and lambda search looks for a good value automatically. A validation frame, when passed in, is used to evaluate model performance for each candidate value of lambda. Note that the cell below passes only training_frame = data, so no held-out validation frame is used here; passing training_frame = train and validation_frame = valid would supply one.
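
As a sketch of the variant described above, the function below trains a multinomial GLM with lambda search on a training frame and passes a separate validation frame. The function name fit_glm_with_lambda_search is hypothetical; the estimator, its family and lambda_search parameters, and the train() arguments are the ones already used in this notebook, plus validation_frame.

```python
def fit_glm_with_lambda_search(train, valid, features, response):
    """Train a multinomial GLM with lambda search, using a held-out validation frame.

    `train` and `valid` are expected to be H2OFrames, e.g. the first two
    elements returned by split_frame().
    """
    # Imported inside the function so that defining it does not require h2o to be loaded
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    model = H2OGeneralizedLinearEstimator(family="multinomial", lambda_search=True)
    model.train(
        x=features,
        y=response,
        training_frame=train,
        validation_frame=valid,
    )
    return model
```

With the frames from the split above, this could be called as fit_glm_with_lambda_search(train, valid, features, response), and the returned model evaluated with model_performance(test) on rows it has not seen.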

In [135]:
multinomial_fit2 = H2OGeneralizedLinearEstimator(family='multinomial', model_id='multinomial_fit2', lambda_search=True)
multinomial_fit2.train(y = response, x = features, training_frame = data)

glm Model Build progress: |███████████████████████████████████████████████| 100%


**Evaluate model performance**

In [136]:
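# Store the metrics under new names so the trained models are not overwritten
multinomial_perf1 = multinomial_fit1.model_performance(test)
multinomial_perf2 = multinomial_fit2.model_performance(test)

In [137]:
# Print model performance
# Note: both models were trained on the full data frame, which includes the rows in test
print(multinomial_perf1)
print(multinomial_perf2)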


ModelMetricsMultinomialGLM: glm
** Reported on test data. **

MSE: 0.011560581342367475
RMSE: 0.10752014389112151


ModelMetricsMultinomialGLM: glm
** Reported on test data. **

MSE: 0.005446915335652197
RMSE: 0.07380322036098558
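
The two reports can be compared with a little arithmetic. The sketch below, using the MSE values copied from the output above, confirms that each RMSE is the square root of the corresponding MSE and computes how much lambda search reduced the MSE.

```python
import math

# MSE values copied from the two model performance reports above
mse = {
    "multinomial_fit1": 0.011560581342367475,
    "multinomial_fit2": 0.005446915335652197,
}

# RMSE is the square root of MSE
rmse = {name: math.sqrt(value) for name, value in mse.items()}

# Relative reduction in MSE from the default GLM to the lambda-search GLM
mse_reduction = 1.0 - mse["multinomial_fit2"] / mse["multinomial_fit1"]

print(rmse)
print(round(mse_reduction, 3))
```

Because both models were trained on a frame that contains the test rows, this comparison says more about fit than about generalization.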



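Both multinomial GLMs show very low error on the test split, and lambda search roughly halves the MSE (from about 0.0116 to 0.0054). Partial dependence plots and variable importance did not work for multinomial classification in the H2O version used here.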

# 3. Deep Learning

In [150]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Train for 1 epoch first
dl_1 = H2ODeepLearningEstimator(epochs=1)
dl_1.train(x=features, y=response, training_frame=data)

# Continue training from each checkpoint up to a larger total number of epochs
dl_250 = H2ODeepLearningEstimator(checkpoint=dl_1, epochs=250)
dl_250.train(x=features, y=response, training_frame=data)

dl_500 = H2ODeepLearningEstimator(checkpoint=dl_250, epochs=500)
dl_500.train(x=features, y=response, training_frame=data)

dl_750 = H2ODeepLearningEstimator(checkpoint=dl_500, epochs=750)
dl_750.train(x=features, y=response, training_frame=data)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1532633295350_7391


ModelMetricsMultinomial: deeplearning
** Reported on train data. **

MSE: 0.33587273764211434
RMSE: 0.5795452852384483
LogLoss: 3.9990466700781653
Mean Per-Class Error: 0.34
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,49.0,0.0,1.0,0.02,1 / 50
Iris-versicolor,0.0,0.0,50.0,1.0,50 / 50
Iris-virginica,0.0,0.0,50.0,0.0,0 / 50
Total,49.0,0.0,101.0,0.34,51 / 150


Top-3 Hit Ratios: 


0,1
k,hit_ratio
1,0.66
2,0.6666667
3,1.0
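
The error figures in this report follow directly from the confusion matrix. A minimal sketch, with the counts copied from the matrix above (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix counts for the 1-epoch model
# Row and column order: Iris-setosa, Iris-versicolor, Iris-virginica
confusion = [
    [49, 0, 1],
    [0, 0, 50],
    [0, 0, 50],
]

# Per-class error: share of each actual class that was misclassified
per_class_error = [
    (sum(row) - row[i]) / sum(row) for i, row in enumerate(confusion)
]

# Mean per-class error: unweighted average over the classes
mean_per_class_error = sum(per_class_error) / len(per_class_error)

# Overall error: misclassified rows divided by all rows
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
overall_error = (total - correct) / total

print(per_class_error)
print(mean_per_class_error, overall_error)
```

The two summary errors agree (0.34) because the classes are balanced, and the top-1 hit ratio of 0.66 is one minus the overall error. After one epoch, the model predicts Iris-virginica for every Iris-versicolor row.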


Scoring History: 


0,1,2,3,4,5,6,7,8,9,10
,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_classification_error
,2018-10-19 18:38:13,0.000 sec,,0.0,0,0.0,,,,
,2018-10-19 18:38:13,0.414 sec,666 obs/sec,0.12,1,18.0,0.5795453,3.9990467,0.4961909,0.34
,2018-10-19 18:38:13,0.605 sec,798 obs/sec,1.08,10,162.0,0.5780630,5.2317508,0.4987647,0.3333333
,2018-10-19 18:38:13,0.644 sec,720 obs/sec,1.08,10,162.0,0.5795453,3.9990467,0.4961909,0.34


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
PetalLengthCm,1.0,1.0,0.2623853
SepalWidthCm,0.9722013,0.9722013,0.2550913
SepalLengthCm,0.9436944,0.9436944,0.2476115
PetalWidthCm,0.8952938,0.8952938,0.2349119


deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1532633295350_7392


ModelMetricsMultinomial: deeplearning
** Reported on train data. **

MSE: 0.04174842476187793
RMSE: 0.20432431270379434
LogLoss: 0.21976041810695185
Mean Per-Class Error: 0.05333333333333334
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,50.0,0.0,0.0,0.0,0 / 50
Iris-versicolor,0.0,50.0,0.0,0.0,0 / 50
Iris-virginica,0.0,8.0,42.0,0.16,8 / 50
Total,50.0,58.0,42.0,0.0533333,8 / 150


Top-3 Hit Ratios: 


0,1
k,hit_ratio
1,0.9466667
2,1.0
3,1.0


Scoring History: 


0,1,2,3,4,5,6,7,8,9,10
,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_classification_error
,2018-10-19 18:38:13,0.000 sec,,0.0,0,0.0,,,,
,2018-10-19 18:38:14,0.422 sec,3978 obs/sec,10.0,1,1500.0,0.3824285,0.6644393,0.7806227,0.1733333
,2018-10-19 18:38:19,5.566 sec,3815 obs/sec,140.0,14,21000.0,0.2229345,0.2375138,0.9254503,0.0533333
,2018-10-19 18:38:24,10.192 sec,3706 obs/sec,250.0,25,37500.0,0.2043243,0.2197604,0.9373774,0.0533333


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
PetalLengthCm,1.0,1.0,0.2847838
PetalWidthCm,0.9983639,0.9983639,0.2843179
SepalLengthCm,0.7622693,0.7622693,0.2170820
SepalWidthCm,0.7508023,0.7508023,0.2138163


deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1532633295350_7393


ModelMetricsMultinomial: deeplearning
** Reported on train data. **

MSE: 0.02874677756589855
RMSE: 0.1695487468720974
LogLoss: 0.17551572517736536
Mean Per-Class Error: 0.02666666666666667
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,50.0,0.0,0.0,0.0,0 / 50
Iris-versicolor,0.0,50.0,0.0,0.0,0 / 50
Iris-virginica,0.0,4.0,46.0,0.08,4 / 50
Total,50.0,54.0,46.0,0.0266667,4 / 150


Top-3 Hit Ratios: 


0,1
k,hit_ratio
1,0.9733334
2,1.0
3,1.0


Scoring History: 


0,1,2,3,4,5,6,7,8,9,10
,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_classification_error
,2018-10-19 18:38:24,0.000 sec,,0.0,0,0.0,,,,
,2018-10-19 18:38:24,0.531 sec,3537 obs/sec,10.0,1,1500.0,0.6682032,1.8684642,0.3302566,0.5
,2018-10-19 18:38:30,5.827 sec,3418 obs/sec,130.0,13,19500.0,0.2362937,0.2966076,0.9162480,0.06
,2018-10-19 18:38:35,10.845 sec,3920 obs/sec,280.0,28,42000.0,0.2320765,0.2927421,0.9192108,0.06
,2018-10-19 18:38:40,15.999 sec,4161 obs/sec,440.0,44,66000.0,0.2084025,0.2562013,0.9348526,0.0466667
,2018-10-19 18:38:42,17.980 sec,4206 obs/sec,500.0,50,75000.0,0.1695487,0.1755157,0.9568798,0.0266667


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
PetalLengthCm,1.0,1.0,0.3163113
PetalWidthCm,0.8831847,0.8831847,0.2793613
SepalLengthCm,0.6403560,0.6403560,0.2025518
SepalWidthCm,0.6379023,0.6379023,0.2017757


deeplearning Model Build progress: |██████████████████████████████████████| 100%


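Model performance improves gradually as the number of epochs increases: the training classification error falls from 0.34 after about one epoch to about 0.053 at 250 epochs and 0.027 at 500 epochs.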
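
The scoring histories above can be summarized in a few lines. The sketch below uses the final scoring-history row of each of the three models whose output is shown (epochs and training classification error are copied from the tables) and checks that the samples column equals epochs times the 150 training rows.

```python
n_rows = 150  # number of rows in the training frame

# (total epochs, training classification error) from the last scoring-history row
final_rows = [
    (1.08, 0.34),
    (250.0, 0.0533333),
    (500.0, 0.0266667),
]

# samples processed = epochs * rows in the training frame
samples = [round(epochs * n_rows) for epochs, _ in final_rows]
errors = [error for _, error in final_rows]

for (epochs, error), n_samples in zip(final_rows, samples):
    print(epochs, n_samples, error)
```

The sample counts agree with the samples column in the tables (162, 37500, and 75000), and the error decreases at each stage.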

# Conclusion

The best performance is found in the Gradient Boosting Machine, which has a zero error rate on the training data. The other models also do well, with close to 100% accuracy. Petal length is the most important variable in every model that reports variable importances.