<a href="https://colab.research.google.com/github/subash2617/H2O-Tutorial/blob/main/H2O_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

H2O is an open source Machine Learning framework with fully tested implementations of several widely accepted ML algorithms. You pick an algorithm from its large repository and apply it to your dataset.

H2O provides an easy-to-use open source platform for applying different ML algorithms on a given dataset. It provides **several statistical and ML algorithms including deep learning.**

In this tutorial, we will consider examples and understand how to go about working with H2O.

**Audience**
This tutorial is designed to help all those learners who are aiming to develop a Machine Learning model on a huge database.

**Prerequisites**


It is assumed that the learner has a basic understanding of Machine Learning and is familiar with Python.

**H2O Setup Guide**


Have you ever been asked to develop a Machine Learning model on a **huge database**? Typically, the customer will provide you with the database and ask you to make certain predictions, such as who the potential buyers will be or whether fraudulent cases can be detected early. To answer these questions, your task would be to develop a Machine Learning algorithm that answers the customer’s query. Developing a Machine Learning algorithm from scratch is not an easy task, and why should you do this when there are **several ready-to-use Machine Learning libraries** available?

These days, you would rather use these libraries, apply a well-tested algorithm from these libraries and look at its performance. If the performance were not within acceptable limits, you would try to either fine-tune the current algorithm or try an altogether different one.

Likewise, you may try multiple algorithms on the same dataset and then pick the one that best meets the customer’s requirements. This is where H2O comes to your rescue: you pick a fully tested algorithm from its repository and apply it to your dataset.

To mention a few, it includes **gradient boosted machines (GBM), generalized linear models (GLM), deep learning, and many more**. It also supports ***AutoML functionality***, which ranks the performance of different algorithms on your dataset and so reduces the effort of finding the best performing model. H2O is an in-memory platform, which is what makes it fast.

To install H2O on your machine, see the [H2O Installation Tutorial](https://www.tutorialspoint.com/h2o/h2o_installation.htm). We will use it from the command line, one statement at a time, so that you understand how each line works. If you are a Python lover, you may use Jupyter or any other IDE of your choice for developing H2O applications.

H2O also provides a web-based tool, called Flow, for testing different algorithms on your dataset.

The tutorial will introduce you to the use of **Flow**. Alongside, we will discuss the use of **AutoML**, which identifies the best performing algorithm on your dataset. Are you not excited to learn H2O? Keep reading!


**H2O provides many built-in ML and deep learning algorithms, but this tutorial focuses on AutoML.**

**To use AutoML, start a new Jupyter notebook and follow the steps shown below.**

**Importing AutoML**

First install H2O, then import the H2O and AutoML packages into the project using the following statements:

In [None]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o


In [None]:
import h2o
from h2o.automl import H2OAutoML

**Initialize H2O**

Initialize h2o using the following statement −

In [None]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpotwijlgi
  JVM stdout: /tmp/tmpotwijlgi/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpotwijlgi/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| H2O cluster property | Value |
|---|---|
| H2O_cluster_uptime: | 02 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.30.0.7 |
| H2O_cluster_version_age: | 8 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_os1mza |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.180 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |


**Loading Data**

We use the iris.csv dataset. Load the data using the following statement:

In [None]:
data = h2o.import_file('https://gist.githubusercontent.com/btkhimsar/ed560337d8b944832d1c1f55fac093fc/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv')



Parse progress: |█████████████████████████████████████████████████████████| 100%


In [None]:
data.columns

['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety']

**Preparing Dataset**

We need to decide on the features and the prediction column. We use the four measurement columns as features and the variety column as the prediction target. Set the features and the output column using the following two statements:

In [None]:
features = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
output = 'variety'
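Because the type of the output column decides whether H2O treats the task as classification or regression, it can help to confirm that variety was parsed as a categorical column before training. The following is a minimal sketch, assuming a frame object that exposes the `nrows`, `ncols`, and `types` attributes of an H2OFrame; the helper name `check_target` is hypothetical and not part of H2O.

```python
def check_target(frame, target):
    """Report frame shape and whether the target column is categorical."""
    # H2OFrame.types maps each column name to its parsed type, e.g. "real" or "enum"
    target_type = frame.types[target]
    return {
        "rows": frame.nrows,
        "cols": frame.ncols,
        "target_type": target_type,
        "is_classification": target_type == "enum",
    }

# Usage with the objects from this tutorial:
# info = check_target(data, output)
# if not info["is_classification"]:
#     data[output] = data[output].asfactor()  # convert the column to categorical
```

If the target had been parsed as numeric, converting it with `asfactor()` before training would be the usual way to request classification.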

Split the data in an approximate 80:20 ratio for training and testing:

In [None]:
train, test = data.split_frame(ratios=[0.8])
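H2O's `split_frame` assigns rows to the splits at random rather than cutting at an exact row count, so the resulting sizes are close to 80:20 but usually not exact. The sketch below illustrates that idea in plain Python on a 150-row stand-in (the classic iris dataset has 150 rows). It is not H2O's actual implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random

def probabilistic_split(rows, train_ratio=0.8, seed=42):
    """Send each row to the training set with probability train_ratio."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_ratio:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

rows = list(range(150))  # stand-in for the 150 iris rows
train_rows, test_rows = probabilistic_split(rows)
print(len(train_rows), len(test_rows))  # near 120 and 30, but rarely exactly that
```

In H2O itself, passing a `seed` argument to `split_frame` makes the split repeatable between runs.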

**Applying AutoML**

Now we are all set to apply AutoML to our dataset. AutoML runs within the model and time limits we set and returns the best model it finds. We set up AutoML using the following statement:

In [None]:
automl = H2OAutoML(max_models = 30, max_runtime_secs=300, seed = 1)

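The first parameter, max_models, caps the number of models that AutoML builds and compares (stacked ensembles are not counted towards this limit).

The second parameter, max_runtime_secs, caps the total time the run may take, in seconds.

The third parameter, seed, makes the run repeatable where the underlying algorithms allow it.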
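The two limits work together: according to the H2O documentation, when both are set the run ends as soon as either budget is exhausted. The following plain-Python sketch only illustrates that rule; `should_stop` is a hypothetical helper, not an H2O function.

```python
def should_stop(models_built, elapsed_secs, max_models=30, max_runtime_secs=300):
    """Return True once either the model budget or the time budget is used up."""
    return models_built >= max_models or elapsed_secs >= max_runtime_secs

print(should_stop(models_built=30, elapsed_secs=45))   # True: model budget reached first
print(should_stop(models_built=12, elapsed_secs=300))  # True: time budget reached first
print(should_stop(models_built=12, elapsed_secs=45))   # False: both budgets remain
```

On a small dataset such as iris, the model budget is often what ends the run, so the full 300 seconds may not be needed.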

We now call the train method on the AutoML object as shown here −

In [None]:
automl.train(x=features, y=output, training_frame=train)

AutoML progress: |███████████
18:52:11.228: Skipping training of model GBM_5_AutoML_20200729_185148 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20200729_185148.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 123.0.


█████████████████████████████████████████████| 100%
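The "Skipping training" message in the log above is not fatal: AutoML dropped one GBM candidate and carried on, as the progress bar reaching 100% shows. The numbers in the message are consistent with a rule that a tree split needs at least min_rows rows on each side, so twice min_rows in total. That reading is an inference from the log, not a statement of H2O's internals. A small arithmetic check:

```python
min_rows = 100.0        # value AutoML tried for GBM_5, taken from the log
available_rows = 123.0  # weighted training rows, taken from the log

# Assumed rule: a split needs at least min_rows rows on each of its two sides
required_rows = 2 * min_rows
can_train = available_rows >= required_rows

print(required_rows, can_train)  # 200.0 False
```

With only 123 training rows, any candidate that asks for min_rows above 61 would fail the same check.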


We pass the features list created earlier as x, the name of the output column as y, and the training split as training_frame.

Running this cell can take up to 5 minutes (we set max_runtime_secs to 300). The progress output shown above appears while it runs.

**Printing the Leaderboard**

When the AutoML processing completes, it creates a leaderboard ranking all the models that it has evaluated (up to 30 base models here, plus any stacked ensembles). To see the first 10 records of the leaderboard, use the following code:

In [None]:
lb = automl.leaderboard
lb.head()

| model_id | mean_per_class_error | logloss | rmse | mse |
|---|---|---|---|---|
| GLM_1_AutoML_20200729_185148 | 0.0238372 | 0.0780831 | 0.153723 | 0.0236306 |
| XGBoost_1_AutoML_20200729_185148 | 0.0405039 | 0.211211 | 0.227813 | 0.0518988 |
| GBM_grid__1_AutoML_20200729_185148_model_4 | 0.0405039 | 0.141385 | 0.191948 | 0.036844 |
| StackedEnsemble_AllModels_AutoML_20200729_185148 | 0.0405039 | 0.209262 | 0.217512 | 0.0473116 |
| DeepLearning_grid__2_AutoML_20200729_185148_model_1 | 0.0405039 | 0.178255 | 0.18503 | 0.0342359 |
| XGBoost_3_AutoML_20200729_185148 | 0.0405039 | 0.143473 | 0.195478 | 0.0382116 |
| XGBoost_grid__1_AutoML_20200729_185148_model_4 | 0.0405039 | 0.128332 | 0.180665 | 0.0326399 |
| StackedEnsemble_BestOfFamily_AutoML_20200729_185148 | 0.0405039 | 0.243483 | 0.235017 | 0.0552331 |
| XGBoost_grid__1_AutoML_20200729_185148_model_5 | 0.0405039 | 0.199859 | 0.216986 | 0.047083 |
| GBM_grid__1_AutoML_20200729_185148_model_5 | 0.0488372 | 0.169129 | 0.210865 | 0.0444642 |
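To make the leaderboard columns concrete, the sketch below computes the same kinds of metrics on a hypothetical four-row toy example in plain Python. Lower is better for all of them, and the leaderboard above is ordered by mean_per_class_error. The sample data is invented for illustration, and the mse definition used here (squared shortfall of the probability given to the true class) is an assumption about how a multiclass mse can be defined, not a confirmed description of H2O's internals.

```python
import math

# Hypothetical toy predictions: (true class, predicted class probabilities)
samples = [
    ("Setosa",     {"Setosa": 0.9, "Versicolor": 0.1, "Virginica": 0.0}),
    ("Versicolor", {"Setosa": 0.1, "Versicolor": 0.6, "Virginica": 0.3}),
    ("Versicolor", {"Setosa": 0.0, "Versicolor": 0.4, "Virginica": 0.6}),
    ("Virginica",  {"Setosa": 0.0, "Versicolor": 0.2, "Virginica": 0.8}),
]

def mean_per_class_error(samples):
    """Average the misclassification rate of each true class."""
    wrong, total = {}, {}
    for actual, probs in samples:
        predicted = max(probs, key=probs.get)
        total[actual] = total.get(actual, 0) + 1
        wrong[actual] = wrong.get(actual, 0) + (predicted != actual)
    return sum(wrong[c] / total[c] for c in total) / len(total)

def logloss(samples):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(probs[actual]) for actual, probs in samples) / len(samples)

def mse(samples):
    """Mean squared shortfall of the true-class probability (assumed definition)."""
    return sum((1 - probs[actual]) ** 2 for actual, probs in samples) / len(samples)

rmse = math.sqrt(mse(samples))
print(mean_per_class_error(samples), logloss(samples), rmse, mse(samples))
```

Whatever the exact mse definition, rmse is its square root, which you can check against the first leaderboard row: 0.153723 squared is approximately 0.0236306.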




**Predicting on Test Data**

Now that the models are ranked, you can apply the top-rated model to your test data. Calling predict on the AutoML object uses the leader model. To do so, run the following statement:

In [None]:
preds = automl.predict(test)

glm prediction progress: |████████████████████████████████████████████████| 100%
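Predictions alone do not tell you how well the leader does on the held-out data. A minimal sketch of scoring it, assuming the `leader` attribute of the AutoML object and the `model_performance` method of H2O models; the wrapper name `evaluate_leader` is hypothetical.

```python
def evaluate_leader(automl, test_frame):
    """Return the leader model and its metrics on a held-out frame."""
    leader = automl.leader  # top-ranked model on the leaderboard
    performance = leader.model_performance(test_frame)
    return leader, performance

# Usage with the objects from this tutorial:
# leader, perf = evaluate_leader(automl, test)
# print(leader.model_id)
# print(perf)
```

Comparing these test metrics with the leaderboard values gives a rough sense of whether the ranking holds up on unseen rows.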


**Printing Result**

Print the predicted result using the following statement −

In [None]:
print(preds)

| predict | Setosa | Versicolor | Virginica |
|---|---|---|---|
| Setosa | 0.998816 | 0.00118353 | 1.60418e-16 |
| Setosa | 0.999697 | 0.000302757 | 2.45932e-17 |
| Setosa | 0.998615 | 0.00138542 | 2.82535e-16 |
| Setosa | 0.999405 | 0.000595221 | 3.82059e-17 |
| Setosa | 0.997888 | 0.0021118 | 3.46002e-17 |
| Setosa | 0.989374 | 0.0106258 | 1.6211e-15 |
| Setosa | 0.999287 | 0.000713256 | 6.43839e-17 |
| Setosa | 0.999276 | 0.000723839 | 1.72141e-17 |
| Setosa | 0.99681 | 0.00318959 | 6.00182e-15 |
| Setosa | 0.994585 | 0.00541516 | 1.38801e-15 |
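In the output above, the predict column is consistent with being the class that has the highest probability in each row, and the three probabilities in a row sum to about 1. The sketch below checks this in plain Python on two rows copied from that output.

```python
# Two rows copied from the prediction output: per-class probabilities
rows = [
    {"Setosa": 0.998816, "Versicolor": 0.00118353, "Virginica": 1.60418e-16},
    {"Setosa": 0.989374, "Versicolor": 0.0106258, "Virginica": 1.6211e-15},
]

# The predicted label is the class with the largest probability
labels = [max(probs, key=probs.get) for probs in rows]
totals = [sum(probs.values()) for probs in rows]

for label, total in zip(labels, totals):
    print(label, round(total, 4))  # Setosa 1.0 for both rows
```

The size of the winning probability is a useful signal too: the second row is still clearly Setosa, but with slightly more weight on Versicolor than the first.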





**Printing the Ranking for All**

If you want to see the ranks of all the tested models, run the following statement:

In [None]:
lb.head(rows = lb.nrows)

| model_id | mean_per_class_error | logloss | rmse | mse |
|---|---|---|---|---|
| GLM_1_AutoML_20200729_185148 | 0.0238372 | 0.0780831 | 0.153723 | 0.0236306 |
| XGBoost_1_AutoML_20200729_185148 | 0.0405039 | 0.211211 | 0.227813 | 0.0518988 |
| GBM_grid__1_AutoML_20200729_185148_model_4 | 0.0405039 | 0.141385 | 0.191948 | 0.036844 |
| StackedEnsemble_AllModels_AutoML_20200729_185148 | 0.0405039 | 0.209262 | 0.217512 | 0.0473116 |
| DeepLearning_grid__2_AutoML_20200729_185148_model_1 | 0.0405039 | 0.178255 | 0.18503 | 0.0342359 |
| XGBoost_3_AutoML_20200729_185148 | 0.0405039 | 0.143473 | 0.195478 | 0.0382116 |
| XGBoost_grid__1_AutoML_20200729_185148_model_4 | 0.0405039 | 0.128332 | 0.180665 | 0.0326399 |
| StackedEnsemble_BestOfFamily_AutoML_20200729_185148 | 0.0405039 | 0.243483 | 0.235017 | 0.0552331 |
| XGBoost_grid__1_AutoML_20200729_185148_model_5 | 0.0405039 | 0.199859 | 0.216986 | 0.047083 |
| GBM_grid__1_AutoML_20200729_185148_model_5 | 0.0488372 | 0.169129 | 0.210865 | 0.0444642 |




**Conclusion**

H2O provides an easy-to-use open source platform for applying different ML algorithms to a given dataset. It provides several statistical and ML algorithms, including deep learning. During testing, you can fine-tune the parameters of these algorithms, either from the command line or through the web-based interface called Flow. H2O also supports AutoML, which ranks several algorithms by their performance on your dataset. H2O also performs well on Big Data. This is a boon for data scientists, who can apply different Machine Learning models to their dataset and pick the best one to meet their needs.