![](https://raw.githubusercontent.com/Carl-McBride-Ellis/images_for_kaggle/main/H2O_ai_logo.png)
# H2O.ai Gradient boosting classifier
In this notebook we shall be using the gradient boosting classifier ([`H2OGradientBoostingEstimator`](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html)) from [H2O.ai](https://www.h2o.ai/).

To learn more about using H2O.ai see:
* [H2O.ai Overview](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html)
* [H2O.ai Tutorials](https://docs.h2o.ai/h2o-tutorials/latest-stable/index.html)

First, import `h2o` and start a local H2O server:

In [None]:
import h2o
h2o.init()

# Read in the data
Read in the data as an [H2OFrame](https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html), the primary data store for H2O. For examples of munging with H2O see the [Data Manipulation](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging.html) page.


In [None]:
train_data = h2o.import_file('../input/tabular-playground-series-nov-2021/train.csv')
test_data  = h2o.import_file('../input/tabular-playground-series-nov-2021/test.csv')

Take a quick look at the shape of each frame:

In [None]:
print(train_data.shape)
print(test_data.shape)

So we have 600k rows and 102 columns in the training data, and 540k rows in the test data.

In [None]:
train_data

Convert the `target` column in the `train_data` frame to be categorical, indicating to the estimator that this is a classification problem. If the data contains any other categorical features, H2O takes care of them automatically.

In [None]:
train_data["target"] = train_data["target"].asfactor()

Let us also create a list of the predictor column names for later use:

In [None]:
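# All columns except the target and the row identifier are predictors
X = train_data.columns
y = "target"
X.remove(y)
X.remove("id")  # assumes the training data has an `id` column, like the test data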

[Split the data into the train and validation sets](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/splitting-datasets.html). In this notebook, so as not to use up too much CPU, we shall only use 8% of the `train_data` for actual training, and a further 2% of the data for validation.
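
As a rough sanity check on those percentages, the expected frame sizes can be worked out by hand. This is only a sketch: the 600,000 row count is taken from the shape reported above, and `split_frame` splits probabilistically, so the actual row counts will be close to these values rather than exactly equal. The leftover 90% comes back as a third frame, which this notebook does not use. The names `n_rows`, `remainder` and `expected` are introduced here purely for illustration.

```python
# Expected sizes of the frames returned by split_frame(ratios=[.08, .02]).
# Assumption: 600,000 training rows, as reported by train_data.shape above.
n_rows = 600_000
ratios = [0.08, 0.02]

# Whatever is not covered by the ratios ends up in a final, third frame.
remainder = 1.0 - sum(ratios)

expected = [round(n_rows * r) for r in ratios + [remainder]]
print(expected)  # approximate sizes of the train, valid and unused frames
```

The real sizes can be compared against these with `train.shape` and `valid.shape` once the split has been made.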

In [None]:
split_data = train_data.split_frame( ratios=[.08, .02], seed = 1)
train = split_data[0]
valid = split_data[1]

We shall be using the [H2O Gradient Boosting Machine](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html) in conjunction with a [hyperparameter grid search](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html):

In [None]:
from h2o.estimators.gbm   import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

In [None]:
# GBM hyperparameters to try:
gbm_hyperparameters = {'learn_rate': [0.05, 0.07, 0.09],
                       'max_depth': [5, 6, 7],
                       'sample_rate': [0.8, 1.0],
                       'col_sample_rate': [0.2, 0.5, 1.0]}

# Train and validate a cartesian grid of GBMs
gbm_grid = H2OGridSearch(model        = H2OGradientBoostingEstimator(),
                         grid_id      = 'gbm_grid',
                         hyper_params = gbm_hyperparameters)

gbm_grid.train(x=X, y=y,
                training_frame=train,
                validation_frame=valid,
                ntrees=250,
                seed=1)

# Sort the grid results by validation AUC
gbm_gridperf = gbm_grid.get_grid(sort_by='auc', decreasing=True)

# Take a look at the grid, sorted by validation AUC
gbm_gridperf
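
To get a feel for how much work a Cartesian grid involves, the number of hyperparameter combinations can be counted with plain Python. The following is a minimal sketch that reuses the hyperparameter values from the cell above and does not need H2O at all; `combinations`, `n_models` and `max_trees` are names introduced here for illustration only.

```python
from itertools import product

# Same hyperparameter values as in the grid search above
gbm_hyperparameters = {'learn_rate': [0.05, 0.07, 0.09],
                       'max_depth': [5, 6, 7],
                       'sample_rate': [0.8, 1.0],
                       'col_sample_rate': [0.2, 0.5, 1.0]}

# A Cartesian grid trains one model per combination of values
combinations = list(product(*gbm_hyperparameters.values()))
n_models = len(combinations)

# With ntrees=250, each model builds at most 250 trees
max_trees = n_models * 250

print(n_models, max_trees)
```

Every extra value added to any one of the lists multiplies the number of models, which is one reason for training on only a small fraction of the data here.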

In [None]:
# Select the top GBM model, as chosen by the validation AUC
best_model = gbm_gridperf.models[0]
best_model.summary()
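
The grid is ranked by AUC, which can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that textbook definition on a tiny made-up example. It is not how H2O calculates the metric internally, and the helper `pairwise_auc` and the toy labels and scores are hypothetical, introduced only for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so sorting the grid with `decreasing=True` puts the best-ranking model first.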

# Produce our predictions using the best model

In [None]:
predictions = best_model.predict(test_data)
# take a quick look
predictions

We shall now [combine](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/combining-columns.html) our predictions with the `test_data`. The `p1` column holds the predicted probability of class 1, which we rename to `target`.

In [None]:
target = predictions["p1"].set_names(['target'])
test_with_predictions = test_data.cbind(target)

# Create a `submission.csv` for scoring by Kaggle

In [None]:
submission = test_with_predictions[:,["id","target"]]
h2o.export_file(submission, path = "submission.csv", force = True)
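
Before submitting, it can be worth checking that the file has the layout we intended. The sketch below is one possible check using only the standard library. It assumes the expected layout is exactly an `id` column followed by a `target` column of probabilities, as selected in the cell above. The helper `check_submission` and the example rows are hypothetical.

```python
import csv
import io

def check_submission(csv_text):
    """Check the header and value ranges; return the number of data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        int(row["id"])                  # ids should parse as integers
        p = float(row["target"])        # predictions should be probabilities
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"target out of range: {p}")
        n_rows += 1
    return n_rows

# Made-up rows standing in for the contents of submission.csv
example = "id,target\n600000,0.73\n600001,0.12\n"
n_checked = check_submission(example)
print(n_checked)
```

For the real file, the text could be read with `open("submission.csv").read()` and the returned row count compared with the number of rows in `test_data`.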

We have finished, and shall now shut down our H2O instance.

In [None]:
h2o.cluster().shutdown()