# H2O
This is an introductory notebook for people who want to get started with H2O, the open source machine learning package by H2O.ai.
H2O is a popular, widely used machine learning platform.

It is open source software, and the H2O-3 GitHub repository is available for anyone who wants to start hacking. This hands-on guide explains the basic principles behind H2O and aims to get you, as a data scientist, started as quickly and simply as possible. The rest is just machine learning.

After reading this guide, you’ll be able to:

- understand which basic problems H2O solves and why,
- play with H2O: explore data, then create and tune models,
- see beyond the horizon and understand where H2O can take you.

In [None]:
import os
import time
import itertools

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator

%matplotlib inline
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Once the module is imported, the first step is to initialize it.

The h2o.init() command is pretty smart and does a lot of things. First, it looks for an H2O instance that is already running before starting a new one. If none is found automatically, or at an address specified manually through its arguments, a new H2O instance is started.

During startup, H2O prints some useful information: the version of Python it is running on, the H2O version, how to connect to H2O's Flow interface, and where the error logs reside, to name a few.

In [None]:
h2o.init()

Now that initialization is done, let us import the dataset. The command is very similar to pandas.read_csv, and the data is stored in memory as an H2OFrame.

H2O supports various file formats and data sources.

In [None]:
heart_df = h2o.import_file("../input/heart.csv", destination_frame="heart_df")

# Data Exploration
Now, let's look at the dataset as an H2OFrame.

In [None]:
heart_df.head()

In [None]:
heart_df.describe()

# Histograms for all the features 

In [None]:
for col in heart_df.columns:
    heart_df[col].hist()

# Correlation Heatmap

In [None]:
plt.figure(figsize=(10,10))
corr = heart_df.cor().as_data_frame()
corr.index = heart_df.columns
sns.heatmap(corr, annot = True, cmap='RdYlGn', vmin=-1, vmax=1)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()

# Train, validation and test split

Let us now split the data into three parts (train, valid and test) at a ratio of 60%, 10% and 30% respectively. We can use the split_frame() function for this: it takes the ratios of the first two parts, and whatever remains goes to the last part.
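
The ratio arithmetic can be sketched in plain Python. This is only an illustration, not H2O's implementation: the helper `split_rows` and the row count of 1000 are hypothetical. It assumes, as the H2O documentation describes, that the split is probabilistic, so the row counts are approximate rather than exact.

```python
import random

def split_rows(n_rows, ratios, seed=1234):
    """Assign each row index to a split by drawing a uniform random number.

    The last split receives whatever share the given ratios leave over,
    so ratios=[0.6, 0.1] yields roughly 60% / 10% / 30%.
    """
    rng = random.Random(seed)

    # Cumulative cut points, e.g. about [0.6, 0.7] for ratios [0.6, 0.1].
    cuts = []
    total = 0.0
    for r in ratios:
        total += r
        cuts.append(total)

    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, cut in enumerate(cuts):
            if u < cut:
                splits[k].append(i)
                break
        else:
            # No cut point matched: the row falls into the remainder split.
            splits[-1].append(i)
    return splits

# 1000 rows is an illustrative number, not the size of heart.csv.
train_idx, valid_idx, test_idx = split_rows(1000, [0.6, 0.1])
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because each row is assigned independently, the three counts are close to 600, 100 and 300 but rarely hit them exactly.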

In [None]:
train, valid, test = heart_df.split_frame(ratios=[0.6,0.1], seed=1234)
response = "target"
train[response] = train[response].asfactor()
valid[response] = valid[response].asfactor()
test[response] = test[response].asfactor()
print("Number of rows in train, valid and test set : ", train.shape[0], valid.shape[0], test.shape[0])

# Model Building

Now, let us build a baseline model using these splits. There are multiple algorithms available in the H2O module.
Let's start with one of my favourite algorithms, the Gradient Boosting Machine (GBM).

In [None]:
predictors = heart_df.columns[:-1]
gbm = H2OGradientBoostingEstimator()
gbm.train(x=predictors, y=response, training_frame=train)

In [None]:
print(gbm)

Now that is quite a bit of information. We can look at them individually.

1. First, we get the name of the model and a key to access the model (the key is not of much use to us here)
2. Error metrics on the train data like log-loss, mean per class error, AUC, Gini, MSE, RMSE
3. Confusion matrix for max F1 threshold
4. Threshold values for different metrics
5. Gains / Lift table
6. Scoring history - information on how the metrics changed over the training iterations (trees)
7. Feature importance

Okay. I heard you. How can we rely on the metrics of the train set, when we actually trained on this dataset? We need to evaluate the model on the valid set instead. We can use the model_performance() function for this and then print the metrics.
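
To make the "max F1 threshold" item in the list above more concrete, here is a minimal sketch of how such a threshold could be picked. It is written from scratch for illustration and is not H2O's code; the helper names and the toy labels and probabilities are hypothetical.

```python
def f1_at_threshold(y_true, p_pred, threshold):
    """F1 score when every probability >= threshold is predicted as class 1."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def max_f1_threshold(y_true, p_pred):
    """Try every distinct predicted probability as a threshold, keep the best F1."""
    candidates = sorted(set(p_pred))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, p_pred, t))

# Toy data (hypothetical): true labels and predicted probabilities of class 1.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

best_t = max_f1_threshold(y, p)
print(best_t, f1_at_threshold(y, p, best_t))
```

The confusion matrix in the model summary is then simply the table of true and false positives and negatives at that threshold.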

In [None]:
perf = gbm.model_performance(valid)
print(perf)

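So with our baseline model, we get an AUC of about 0.87 on the valid set versus about 1.0 on the train set. Similarly, log loss is 0.466 on the valid set and 0.093 on the train set. The gap between the two sets suggests that the baseline model overfits the training data.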
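
For readers new to these two metrics, the following sketch computes AUC and log loss by hand on a few toy predictions. It is a plain Python illustration under stated assumptions (binary 0/1 labels, hypothetical data and function names), not the code H2O runs internally.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels; lower is better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, p_pred):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is O(n^2), fine for a demo.
    """
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): true labels and predicted probabilities of class 1.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

print("AUC:", auc(y, p))
print("log loss:", log_loss(y, p))
```

AUC only depends on how the predictions are ranked, while log loss also punishes confident wrong probabilities, which is why the two metrics can move differently between the train and valid sets.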

Now we can use the validation set to tune our parameters. As with other GBM implementations, we can use early stopping to find the number of trees to train. We can set some reasonable starting values for the other parameters.

Please note that we pass a new validation_frame argument to train() this time, compared to the previous model.
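
The idea behind early stopping can be sketched with a simple patience rule: stop once the validation metric has not improved for a given number of scoring rounds. This is a simplified illustration with hypothetical names and scores. H2O's own stopping_rounds rule is documented as being based on a moving average of the metric, so it will not behave identically.

```python
def early_stopping_round(scores, patience, maximize=True):
    """Return (stop_round, best_round) for a sequence of validation scores.

    Training stops at the first round where the best score seen so far is
    `patience` rounds old. If that never happens, it runs to the last round.
    """
    best = None
    best_round = -1
    for i, score in enumerate(scores):
        if best is None:
            improved = True
        elif maximize:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best, best_round = score, i
        elif i - best_round >= patience:
            return i, best_round
    return len(scores) - 1, best_round

# Hypothetical validation AUC after each scoring round.
valid_auc = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.87]
stop_round, best_round = early_stopping_round(valid_auc, patience=3)
print(stop_round, best_round)
```

With a rule like this, ntrees acts as an upper bound, and the validation metric decides how many trees are actually built.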

# Model Tuning

In [None]:
gbm_tune = H2OGradientBoostingEstimator(
    ntrees = 1000,
    learn_rate = 0.01,
    stopping_rounds = 20,
    stopping_metric = "AUC",
    col_sample_rate = 0.7,
    sample_rate = 0.7,
    seed = 1234
)      
gbm_tune.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

Now, let's check the performance of the tuned model.

In [None]:
gbm_tune.model_performance(valid).auc()

Great! We have achieved an AUC of 0.91 on the valid set, which is satisfactory.

# Grid Search

In [None]:
from h2o.grid.grid_search import H2OGridSearch

gbm_grid = H2OGradientBoostingEstimator(
    ntrees = 1000,
    learn_rate = 0.01,
    stopping_rounds = 20,
    stopping_metric = "AUC",
    col_sample_rate = 0.7,
    sample_rate = 0.7,
    seed = 1234
) 

hyper_params = {'max_depth':[4,6,8,10,12]}
grid = H2OGridSearch(gbm_grid, hyper_params,
                         grid_id='depth_grid',
                         search_criteria={'strategy': "Cartesian"})
#Train grid search
grid.train(x=predictors, 
           y=response,
           training_frame=train,
           validation_frame=valid)

In [None]:
print(grid)

As we can see, this prints the log loss at the various depths. If we want to look at the validation AUC instead, we can use the following.

In [None]:
sorted_grid = grid.get_grid(sort_by='auc',decreasing=True)
print(sorted_grid)

The maximum AUC, 0.917968, is achieved at a max_depth of 4.

Interestingly, there is not much change in the AUC between the top two results. This is probably because we are training on a very small sample.

Also please note that we searched over the max_depth parameter only. Please do a more comprehensive search for better results, and refer to this [notebook](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb) for more comprehensive details on fine-tuning.
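
To get a feel for what a more comprehensive Cartesian search costs, the sketch below enumerates every combination in a grid. It is plain Python for illustration only: the helper `cartesian_grid` is hypothetical, and the `sample_rate` values are made-up additions to the `max_depth` list used above.

```python
import itertools

def cartesian_grid(hyper_params):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    for values in itertools.product(*value_lists):
        yield dict(zip(names, values))

hyper_params = {
    "max_depth": [4, 6, 8, 10, 12],
    "sample_rate": [0.7, 0.9],  # hypothetical extra dimension
}

combos = list(cartesian_grid(hyper_params))
print(len(combos))   # number of models a Cartesian search would train
print(combos[0])
```

The number of models is the product of the list lengths, so each extra parameter multiplies the training time. That is the usual reason to switch from a Cartesian strategy to a random one for larger searches.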

## K-Fold cross validation

Most of the time, we will just do K-fold cross validation, so let us do the same using H2O. Setting the nfolds parameter on the model is all it takes to run k-fold cross validation.
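
What k-fold cross validation does with the rows can be sketched as follows. This is a generic illustration with a hypothetical helper; it assigns rows to folds by index modulo, which is an assumption made for simplicity and not necessarily the fold assignment H2O uses.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (training_rows, holdout_rows) once per fold.

    Every row lands in exactly one holdout fold, so each row is used
    for validation once and for training nfolds - 1 times.
    """
    folds = [[] for _ in range(nfolds)]
    for i in range(n_rows):
        folds[i % nfolds].append(i)

    for k in range(nfolds):
        holdout = folds[k]
        training = [i for j in range(nfolds) if j != k for i in folds[j]]
        yield training, holdout

# 10 rows and 4 folds, small enough to print and inspect.
pairs = list(kfold_indices(10, 4))
for training, holdout in pairs:
    print("train:", training, "holdout:", holdout)
```

One model is trained per fold and scored on its holdout rows, and the per-fold metrics are then summarised, which is what the cross validation summary table reports.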

In [None]:
cv_gbm = H2OGradientBoostingEstimator(
    ntrees = 3000,
    learn_rate = 0.05,
    stopping_rounds = 20,
    stopping_metric = "AUC",
    nfolds=4, 
    seed=2018)
cv_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
cv_summary = cv_gbm.cross_validation_metrics_summary().as_data_frame()
cv_summary

In [None]:
cv_gbm.model_performance(valid).auc()

# H2O with XGBoost

In [None]:
from h2o.estimators import H2OXGBoostEstimator

cv_xgb = H2OXGBoostEstimator(
    ntrees = 1000,
    learn_rate = 0.05,
    stopping_rounds = 20,
    stopping_metric = "AUC",
    nfolds=4, 
    seed=2018)
cv_xgb.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
cv_xgb.model_performance(valid).auc()

That is an improvement of about 2 percent over the GBM, which is great.

# Feature importance
Feature importance is built into the XGBoost model, so we can see which features contribute most to the heart disease prediction.

In [None]:
cv_xgb.varimp_plot()

# AutoML: Automatic Machine Learning

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

So let us use H2OAutoML to do automatic machine learning. We can specify the max_models parameter, which sets the number of individual (or "base") models and does not include the two ensemble models that are trained at the end.

In [None]:
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models = 10, max_runtime_secs=100, seed = 1)
aml.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

Now let's look at the AutoML leaderboard.

In [None]:
lb = aml.leaderboard
lb

As we can see, an XGBoost model trained by AutoML tops the leaderboard.

Please hit **Upvote** if you like this introductory kernel on H2O, and please share your valuable feedback.