**Automated machine learning** (**AutoML**) automates the end-to-end process of applying machine learning to real-world problems. It makes machine learning accessible even to people without deep expertise in the field.

# Advantages

The advantages of AutoML can be summed up in three major points:

-   **Increases productivity** by automating repetitive tasks, which lets a data scientist focus on the problem rather than the models.
-   **Avoids errors** that can creep in when pipeline steps are performed manually.
-   **Democratizes machine learning** by making the power of ML accessible to everybody.

# [H2O AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)
H2O’s AutoML automates the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. Two [Stacked Ensembles](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html), one based on all previously trained models and another on the best model of each family, are automatically trained on collections of individual models. These ensembles are highly predictive and, in most cases, are the top-performing models on the AutoML Leaderboard.

Properties of H2O AutoML

* Basic data pre-processing (as in all H2O algos).

* Trains a random grid of GBMs, DNNs, GLMs, etc. using a carefully chosen hyper-parameter space.

* Individual models are tuned using cross-validation.

* Two Stacked Ensembles are trained (“All Models” ensemble & a lightweight “Best of Family” ensemble).

* Returns a sorted “Leaderboard” of all models. All models can be easily exported to production.
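
The stacking idea behind the two ensembles can be illustrated outside H2O. The following is a minimal NumPy sketch, not H2O's implementation: it assumes two hypothetical base models whose held-out predictions are already available and fits a linear metalearner on them by least squares. All names (`base_a`, `base_b`, `fit_metalearner`, `predict_ensemble`) and the simulated data are illustrative assumptions.

```python
import numpy as np


def fit_metalearner(base_preds, y):
    """Fit a linear combiner (with intercept) on base-model predictions."""
    # A leading column of ones lets the combiner learn an intercept.
    design = np.column_stack([np.ones(len(y)), base_preds])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights


def predict_ensemble(weights, base_preds):
    """Combine base-model predictions with the fitted weights."""
    design = np.column_stack([np.ones(base_preds.shape[0]), base_preds])
    return design @ weights


def mse(pred, y):
    return float(np.mean((pred - y) ** 2))


rng = np.random.default_rng(0)
y_true = rng.normal(100.0, 10.0, size=200)

# Hypothetical held-out predictions from two base models with independent noise.
base_a = y_true + rng.normal(0.0, 5.0, size=200)
base_b = y_true + rng.normal(0.0, 5.0, size=200)
base_preds = np.column_stack([base_a, base_b])

weights = fit_metalearner(base_preds, y_true)
ensemble = predict_ensemble(weights, base_preds)

mse_a = mse(base_a, y_true)
mse_b = mse(base_b, y_true)
mse_ens = mse(ensemble, y_true)
print(f"base A: {mse_a:.2f}, base B: {mse_b:.2f}, ensemble: {mse_ens:.2f}")
```

On the data it was fitted to, the least-squares combiner can never do worse than either base model alone, because "use only model A" and "use only model B" are both among the weightings it can choose.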


# Objective

Our job is to predict how long a car on a production line will take to pass the testing phase. This is a classic regression problem, and submissions are evaluated with the R² (coefficient of determination) metric.
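
For reference, the evaluation metric can be computed by hand. The sketch below uses the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares); the function name `r_squared` and the sample values are illustrative and not part of the competition code.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Illustrative testing times in seconds (made-up values).
y_true = [100.0, 90.0, 110.0, 95.0]

perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, [np.mean(y_true)] * len(y_true))
print(perfect, baseline)
```

A perfect prediction scores 1, and always predicting the mean of the target scores 0, so a useful model should land between the two.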


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pal = sns.color_palette()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Start H2O
Import the h2o Python module and the H2OAutoML class, then initialize a local H2O cluster.

In [None]:
import h2o
print(h2o.__version__)
from h2o.automl import H2OAutoML

h2o.init(max_mem_size='16G')

# Load data into H2O

In [None]:
%%time
train = h2o.import_file("../input/mercedes-benz-greener-manufacturing/train.csv")
test = h2o.import_file("../input/mercedes-benz-greener-manufacturing/test.csv")


Let's take a look at the data.

In [None]:
train.head(5)

In [None]:
print(f'Size of training set: {train.shape[0]} rows and {train.shape[1]} columns')

Next, let's identify the response column and save the column name as y. In this dataset, we will use all columns except the response as predictors.

In [None]:
x = train.columns
y = 'y'
x.remove(y)


# Run AutoML

Run AutoML, stopping after around 1 hour (3,500 seconds here). The max_runtime_secs argument limits the AutoML run by time. With a time-limited stopping criterion, the number of models trained will vary between runs: if different hardware is used, or if the same machine has different compute resources available between runs, AutoML may be able to train more models on one run than on another.


In [None]:
aml = H2OAutoML(max_runtime_secs = 3500, seed = 1, project_name = "lb_frame")
aml.train(x = x, y = y, training_frame = train)
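
To see why a time limit makes the number of trained models depend on the machine, consider the following simplified sketch. It is not H2O's scheduler: it assumes a fixed queue of hypothetical candidate models and uses a simulated clock (`FakeClock`, an illustrative helper) so the result is deterministic.

```python
import time


def train_until_deadline(candidates, budget_secs, clock=time.monotonic):
    """Train candidates in order until the time budget is spent."""
    start = clock()
    trained = []
    for name, train_fn in candidates:
        if clock() - start >= budget_secs:
            break  # budget exhausted: remaining candidates are skipped
        train_fn()
        trained.append(name)
    return trained


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, secs):
        self.now += secs


def make_candidates(clock, cost_secs):
    # Each hypothetical model "costs" cost_secs of simulated training time.
    return [(f"model_{i}", lambda: clock.advance(cost_secs)) for i in range(10)]


# Same 10-second budget, but the second machine is twice as slow per model.
fast_clock = FakeClock()
fast = train_until_deadline(make_candidates(fast_clock, 2.0), 10.0, fast_clock)

slow_clock = FakeClock()
slow = train_until_deadline(make_candidates(slow_clock, 4.0), 10.0, slow_clock)

print(len(fast), len(slow))
```

With the same budget, the faster simulated machine finishes more candidates than the slower one. If run-to-run reproducibility matters, H2O's documentation suggests limiting the run with max_models rather than by time.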

# Leaderboard
Next, we will view the AutoML Leaderboard. Since we did not specify a leaderboard_frame in the H2OAutoML.train() method, the AutoML leaderboard uses cross-validation metrics to rank the models.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. Recent H2O releases also accept a sort_metric argument in H2OAutoML, so that a different metric can be used to rank the leaderboard.

In [None]:

lb = aml.leaderboard
lb.head()  
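
The ranking logic itself is simple to picture. The sketch below is an illustration, not H2O's code: it assumes a Gaussian response, for which mean residual deviance reduces to mean squared error, and ranks three hypothetical models (the names and predictions are made up) from lowest to highest error.

```python
import numpy as np

y_true = np.array([100.0, 90.0, 110.0, 95.0])

# Hypothetical predictions from three models (illustrative values only).
predictions = {
    "gbm_1": np.array([101.0, 91.0, 108.0, 96.0]),
    "glm_1": np.array([105.0, 85.0, 115.0, 100.0]),
    "drf_1": np.array([100.0, 92.0, 109.0, 93.0]),
}


def mean_squared_error(y, pred):
    return float(np.mean((y - pred) ** 2))


# Lower error is better, so sort ascending by the metric.
leaderboard = sorted(
    ((name, mean_squared_error(y_true, pred)) for name, pred in predictions.items()),
    key=lambda row: row[1],
)

for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
```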

In [None]:
# The leader model is stored here
aml.leader

## Ensemble Exploration
To understand how the ensemble works, let's take a peek inside the Stacked Ensemble "All Models" model. The "All Models" ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

In [None]:

# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

Examine the standardized coefficients of the metalearner (combiner) algorithm in the ensemble. Their magnitudes show how much each base learner contributes to the ensemble.

In [None]:
metalearner.coef_norm()
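
Standardized coefficients are used here because raw coefficients depend on the scale of each input. The following NumPy sketch is a generic least-squares illustration, not H2O's GLM: it assumes two hypothetical inputs on very different scales and shows that fitting on standardized inputs is equivalent to multiplying each raw coefficient by its input's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two hypothetical inputs on very different scales (std of about 1 and 50).
X = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 50.0, n)])
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.5, n)


def ols(features, target):
    """Least-squares fit with intercept; returns the slopes only."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[1:]


raw_coefs = ols(X, y)

# Standardize each column to zero mean and unit standard deviation.
scale = X.std(axis=0)
X_std = (X - X.mean(axis=0)) / scale
std_coefs = ols(X_std, y)

print("raw:", raw_coefs)
print("standardized:", std_coefs)
```

In this simulated data the second input has the smaller raw coefficient but the larger standardized one, which is why standardized coefficients are the fairer basis for comparing contributions.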

Plot the base learner contributions to the ensemble.

In [None]:
metalearner.std_coef_plot()

# Predicting Using Leader Model

In [None]:
pred = aml.predict(test)
pred.head()

## Save Leader Model

You can also save and download your model to deploy it to production.

In [None]:
h2o.save_model(aml.leader, path = "./product_backorders_model_bin")

## Submissions

In [None]:
sample_submission = pd.read_csv('../input/mercedes-benz-greener-manufacturing/sample_submission.csv')
sample_submission.shape

In [None]:
sample_submission['y'] = pred.as_data_frame().values
sample_submission.to_csv('h2o_automl_submission_4.csv', index=False)

In [None]:
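# Note: `path` is treated as a directory, so this writes the model binary
# into a folder named "submission1.csv"; it does not create a CSV file.
h2o.save_model(aml.leader, path = "submission1.csv")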

In [None]:
sample_submission.head()