# Tutorial #2: Train a regression model with automated machine learning

This tutorial is **part two of a two-part tutorial series**. In the previous tutorial, you [prepared the NYC taxi data for regression modeling](regression-part1-data-prep.ipynb).

Now, you're ready to start building your model with Azure Machine Learning service. In this part of the tutorial, you will use the prepared data and automatically generate a regression model to predict taxi fare prices. Using the automated ML capabilities of the service, you define your machine learning goals and constraints, launch the automated machine learning process and then allow the algorithm selection and hyperparameter-tuning to happen for you. The automated ML technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

In this tutorial, you learn how to:

> * Set up a Python environment and import the SDK packages
> * Configure an Azure Machine Learning service workspace
> * Auto-train a regression model 
> * Run the model locally with custom parameters
> * Explore the results
> * Register the best model

If you don’t have an Azure subscription, create a [free account](https://aka.ms/AMLfree) before you begin. 

> Code in this article was tested with Azure Machine Learning SDK version 1.0.0


## Prerequisites

> * [Run the data preparation tutorial](regression-part1-data-prep.ipynb)

> * An environment configured for automated machine learning, for example Azure Notebooks, a local Python environment, or a Data Science Virtual Machine. [Set up](https://docs.microsoft.com/azure/machine-learning/service/samples-notebooks) automated machine learning.

### Import packages
Import Python packages you need in this tutorial.

In [6]:
import azureml.core
import pandas as pd
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun
import time
import logging

### Configure workspace

Create a workspace object from the existing workspace. A `Workspace` is a class that accepts your Azure subscription and resource information, and creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`.  `ws` is used throughout the rest of the code in this tutorial.

Once you have a workspace object, specify a name for the experiment and a local project directory. The history of all runs is recorded under the specified experiment.

In [7]:
ws = Workspace.from_config()
# choose a name for the run history container in the workspace
experiment_name = 'automated-ml-regression'
# project folder
project_folder = './automated-ml-regression'

import os

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

Found the config file in: /home/nbuser/library/aml_config/config.json


Property,Value
Location,westeurope
Project Directory,./automated-ml-regression
Resource Group,resgrpAMLS
SDK version,1.0.2
Subscription ID,70b8f39e-8863-49f7-b6ba-34a80799550c
Workspace,AMLSworkspace
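
`Workspace.from_config()` only works if the configuration file is present and complete. The sketch below uses only the standard library and does not call the SDK. It checks a `config.json` for the three keys the file is assumed to carry (`subscription_id`, `resource_group`, `workspace_name`). The helper name `check_workspace_config` and the placeholder values are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Keys assumed to be present in a workspace config.json.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def check_workspace_config(path):
    """Return the parsed config, or raise if a required key is missing."""
    config = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"config.json is missing: {sorted(missing)}")
    return config


# Demonstration with placeholder values written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({
        "subscription_id": "<your-subscription-id>",
        "resource_group": "<your-resource-group>",
        "workspace_name": "<your-workspace-name>",
    }))
    demo_config = check_workspace_config(config_path)
    print(sorted(demo_config))
```

Running a check like this before `Workspace.from_config()` can turn a confusing authentication failure into a clear message about the missing key.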


## Explore data

Utilize the data flow object created in the previous tutorial. Open and execute the data flow and review the results.

In [16]:
import azureml.dataprep as dprep
package_saved = dprep.Package.open("dflows")
dflow_prepared = package_saved.dataflows[0]
dflow_prepared.get_profile()

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
vendor,FieldType.STRING,1,VTS,7059.0,0.0,7059.0,0.0,0.0,0.0,,,,,,,,,,,,,,
pickup_weekday,FieldType.STRING,Friday,Wednesday,7059.0,0.0,7059.0,0.0,0.0,0.0,,,,,,,,,,,,,,
pickup_hour,FieldType.DECIMAL,0,23,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,3.57523,3.0,9.91106,15.9327,19.0,22.0225,23.0,23.0,14.2326,6.34926,40.3131,-0.693335,-0.459336
pickup_minute,FieldType.DECIMAL,0,59,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,5.32313,4.92308,14.2214,29.5244,44.6436,56.3767,58.9798,59.0,29.4635,17.4396,304.14,0.00440324,-1.20458
pickup_second,FieldType.DECIMAL,0,59,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,4.99286,4.91954,14.6121,29.9239,44.5221,56.6792,59.0,59.0,29.6225,17.3868,302.302,-0.0227466,-1.19409
dropoff_weekday,FieldType.STRING,Friday,Wednesday,7059.0,0.0,7059.0,0.0,0.0,0.0,,,,,,,,,,,,,,
dropoff_hour,FieldType.DECIMAL,0,23,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,3.23217,2.93333,9.92334,15.9135,19.0,22.2739,23.0,23.0,14.1815,6.45578,41.677,-0.691001,-0.500215
dropoff_minute,FieldType.DECIMAL,0,59,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,5.1064,5.0,14.2051,29.079,44.2937,56.6338,58.9984,59.0,29.353,17.4241,303.598,0.0142562,-1.21531
dropoff_second,FieldType.DECIMAL,0,59,7059.0,0.0,7059.0,0.0,0.0,0.0,0.0,5.03373,5.0,14.7471,29.598,45.3216,56.1044,58.9728,59.0,29.7923,17.481,305.585,-0.0281313,-1.21965
store_forward,FieldType.STRING,N,Y,7059.0,0.0,7059.0,0.0,0.0,0.0,,,,,,,,,,,,,,


Prepare the data for the experiment by keeping in `dflow_X` the columns that serve as features for the model. Define `dflow_y` as the value to predict: cost.


In [17]:
dflow_X = dflow_prepared.keep_columns(['pickup_weekday', 'dropoff_latitude', 'dropoff_longitude','pickup_hour','pickup_longitude','pickup_latitude','passengers'])
dflow_y = dflow_prepared.keep_columns('cost')

### Split data into train and test sets

Now you split the data into training and test sets using the `train_test_split` function in the `sklearn` library. This function splits both the x (features) data and the y (values to predict) data into a training subset, used to fit the model, and a test subset, held back for evaluation. The `test_size` parameter sets the fraction of the data to allocate to testing (0.2 here, or 20%). The `random_state` parameter seeds the random number generator so that the train-test split is the same on every run.

In [18]:
from sklearn.model_selection import train_test_split


x_df = dflow_X.to_pandas_dataframe()
y_df = dflow_y.to_pandas_dataframe()

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=123)
# show y_train flattened to a 1-D array (displayed only; y_train itself is unchanged)
y_train.values.flatten()

array([19. ,  8.5, 15.5, ...,  6. ,  7. ,  2.5])
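
To make the roles of `test_size` and `random_state` concrete, here is a minimal pure-Python sketch of a seeded split. It is an illustration of the idea only, not scikit-learn's implementation, and the helper name `seeded_split` is hypothetical.

```python
import random


def seeded_split(rows, test_size=0.2, seed=123):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))   # sizes follow test_size
print(test_a == test_b)            # same seed, same split
```

Because the seed is fixed, repeated calls return identical subsets, which is what makes results from one run comparable with the next.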

You now have the necessary packages and data ready for automatically training your model.

## Automatically train a model

To automatically train a model:
1. Define settings for the experiment run
1. Submit the experiment for model tuning


### Define settings for autogeneration and tuning

Define the experiment parameters and model settings for autogeneration and tuning. View the full list of [settings](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train).


|Property| Value in this tutorial |Description|
|----|----|---|
|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration|
|**iterations**|30|Number of iterations. In each iteration, the model trains with the data with a specific pipeline|
|**primary_metric**|spearman_correlation | Metric that you want to optimize.|
|**preprocess**| True | True enables experiment to perform preprocessing on the input.|
|**verbosity**| logging.INFO | Controls the level of logging.|
|**n_cross_validations**|5|Number of cross-validation splits|


In [19]:
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 30,
    "primary_metric" : 'spearman_correlation',
    "preprocess" : True,
    "verbosity" : logging.INFO,
    "n_cross_validations": 5
}
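
The `primary_metric` chosen here, `spearman_correlation`, rewards predictions that rank trips in the same order as their actual fares, regardless of the exact amounts. The following pure-Python sketch shows one common way to compute it (Pearson correlation on average ranks). It is not the SDK's implementation, and the sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


actual = [2.5, 6.0, 7.0, 19.0]       # hypothetical actual fares
predicted = [3.0, 5.0, 9.0, 12.0]    # hypothetical predictions, same ordering
print(spearman(actual, predicted))
```

The predictions above are numerically off but preserve the ordering, so the correlation is at its maximum. This is why a model can score well on this metric and still show a noticeable absolute error.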

In [20]:
from azureml.train.automl import AutoMLConfig

# local compute 
automated_ml_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automated_ml_errors.log',
                             path = project_folder,
                             X = x_train.values,
                             y = y_train.values.flatten(),
                             **automl_settings)

### Train the automatic regression model

Start the experiment to run locally. Pass the defined `automated_ml_config` object to the experiment, and set `show_output` to `True` to view progress during the experiment.

In [21]:
from azureml.core.experiment import Experiment
experiment=Experiment(ws, experiment_name)
local_run = experiment.submit(automated_ml_config, show_output=True)

Parent Run ID: AutoML_f715298d-5733-4fa5-bbe8-c2c3145f22f1
*******************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
*******************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler ExtremeRandomTrees                0:01:02       0.5362    0.5362
         1   MaxAbsScaler GradientBoosting                  0:02:07       0.4250    0.5362
         2   MaxAbsScaler ExtremeRandomTrees                0:00:49       0.6356    0.6356
         3   MaxAbsScaler GradientBoosting                  0:00:55       0.7245    0.7245
         4   StandardScalerWrapper GradientB

MSI: Failed to retrieve a token from 'http://localhost:25198/nb/api/nbsvc/oauth2/token' with an error of 'HTTPConnectionPool(host='localhost', port=25198): Max retries exceeded with url: /nb/api/nbsvc/oauth2/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f175f9ab1d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))'. This could be caused by the MSI extension not yet fullly provisioned.


MaxAbsScaler LightGBM                          0:00:47       0.7500    0.7882
        11   StandardScalerWrapper LightGBM                 0:00:37       0.7165    0.7882
        12   SparseNormalizer ExtremeRandomTrees            0:01:34       0.3050    0.7882
        13   MaxAbsScaler DecisionTree                      0:00:53       0.6566    0.7882
        14   MaxAbsScaler RandomForest                      0:00:47       0.6604    0.7882
        15   MaxAbsScaler ExtremeRandomTrees                0:00:47       0.6192    0.7882
        16   MaxAbsScaler RandomForest                      0:05:14       0.8129    0.8129
        17   MaxAbsScaler RandomForest                      0:01:06       0.6599    0.8129
        18   StandardScalerWrapper ExtremeRandomTrees       0:00:47       0.3586    0.8129
        19   MaxAbsScaler RandomForest                      0:00:48       0.4758    0.8129
        20   StandardScalerWrapper LightGBM                 0:00:56       0.6883    0.8129
        21  

MSI: Failed to retrieve a token from 'http://localhost:25198/nb/api/nbsvc/oauth2/token' with an error of 'HTTPConnectionPool(host='localhost', port=25198): Max retries exceeded with url: /nb/api/nbsvc/oauth2/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f175dd2d128>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))'. This could be caused by the MSI extension not yet fullly provisioned.


MaxAbsScaler SGD                               0:00:48       0.0452    0.8129
        26   

MSI: Failed to retrieve a token from 'http://localhost:25198/nb/api/nbsvc/oauth2/token' with an error of 'HTTPConnectionPool(host='localhost', port=25198): Max retries exceeded with url: /nb/api/nbsvc/oauth2/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f175dcfa828>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))'. This could be caused by the MSI extension not yet fullly provisioned.


StandardScalerWrapper DecisionTree             0:00:51       0.6750    0.8129
        27   StandardScalerWrapper ExtremeRandomTrees       0:01:05       0.6634    0.8129
        28   MaxAbsScaler DecisionTree                      0:00:51       0.6479    0.8129
        29    Ensemble                                      0:04:14       0.8427    0.8427
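
The BEST column in the log is a running maximum of the METRIC column, because a higher Spearman correlation is better. A short sketch using the METRIC values logged for iterations 0 to 3:

```python
from itertools import accumulate

# METRIC values for iterations 0-3, copied from the training log.
metrics = [0.5362, 0.4250, 0.6356, 0.7245]

# Higher is better for this metric, so BEST is the running maximum.
best_so_far = list(accumulate(metrics, max))
print(best_so_far)
```

The result matches the BEST values logged for those iterations: the score stays at 0.5362 after the weaker iteration 1 and rises only when a pipeline beats the current best.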


## Explore the results

Explore the results of automatic training with a Jupyter widget or by examining the experiment history.

### Option 1: Add a Jupyter widget to see results

Use the Jupyter notebook widget to see a graph and a table of all results.

In [22]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 'sd…

### Option 2: Get and examine all run iterations in Python

Alternatively, you can retrieve the history of each experiment and explore the individual metrics for each iteration run.

In [23]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

import pandas as pd
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Metric,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
explained_variance,0.127213,0.122456,0.224217,0.382238,0.353439,0.443718,0.457082,0.498826,0.381608,0.284133,...,0.357091,0.331669,0.339121,0.382699,0.458525,-0.001764,0.375429,0.249744,0.334401,0.494338
mean_absolute_error,6.394824,6.40989,6.32757,4.857654,4.785398,4.339912,4.278462,3.627265,5.033176,5.985938,...,4.617597,5.300992,5.127,4.626284,3.772906,7.728627,4.708356,6.128216,5.265543,3.454024
median_absolute_error,3.873,3.889245,5.101538,3.347935,3.032882,2.869871,2.964672,2.164105,3.609615,4.862259,...,2.748236,3.570471,3.404129,2.805219,2.06065,6.329966,2.887346,4.886083,3.53219,1.981099
normalized_mean_absolute_error,0.014403,0.014437,0.014251,0.010941,0.010778,0.009775,0.009636,0.00817,0.011336,0.013482,...,0.0104,0.011939,0.011547,0.01042,0.008498,0.017407,0.010604,0.013802,0.011859,0.007779
normalized_median_absolute_error,0.008723,0.00876,0.01149,0.00754,0.006831,0.006464,0.006677,0.004874,0.00813,0.010951,...,0.00619,0.008042,0.007667,0.006318,0.004641,0.014257,0.006503,0.011005,0.007955,0.004462
normalized_root_mean_squared_error,0.025779,0.025799,0.023534,0.021142,0.021757,0.020104,0.019877,0.019107,0.021145,0.022668,...,0.021496,0.021949,0.02181,0.021108,0.019829,0.026569,0.02125,0.023177,0.021898,0.019261
normalized_root_mean_squared_log_error,0.099605,0.102303,0.103526,0.085103,0.081573,0.080072,0.081296,0.074454,0.087789,0.101485,...,,0.090011,0.090018,0.084086,,0.119977,0.085795,0.100877,0.089165,0.069029
r2_score,0.060549,0.059197,0.224001,0.381609,0.34374,0.442939,0.456488,0.49831,0.381235,0.283822,...,0.355213,0.331216,0.338542,0.38211,0.457332,-0.002994,0.374659,0.249453,0.334126,0.490206
root_mean_squared_error,11.445961,11.454774,10.449223,9.386954,9.660134,8.92602,8.825459,8.483502,9.388254,10.06467,...,9.544199,9.745504,9.683633,9.37184,8.804037,11.796664,9.434914,10.290641,9.722506,8.551961
root_mean_squared_log_error,0.607398,0.623854,0.631307,0.518962,0.497438,0.488283,0.495747,0.454024,0.535346,0.618863,...,,0.548893,0.548937,0.512762,,0.731631,0.523185,0.615155,0.543736,0.420947
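
Which iteration counts as "best" depends on the metric you look at. The sketch below works on a small dictionary shaped like `metricslist` (iteration number mapped to a dictionary of metrics), with values copied from the table for iterations 0, 7, and 29. The helper name `best_iteration` is hypothetical.

```python
# Excerpt shaped like `metricslist`: {iteration: {metric_name: value}}.
metrics_by_iteration = {
    0: {"mean_absolute_error": 6.394824, "r2_score": 0.060549},
    7: {"mean_absolute_error": 3.627265, "r2_score": 0.49831},
    29: {"mean_absolute_error": 3.454024, "r2_score": 0.490206},
}


def best_iteration(metrics, name, higher_is_better):
    """Return the iteration whose value for metric `name` is best."""
    pick = max if higher_is_better else min
    return pick(metrics, key=lambda iteration: metrics[iteration][name])


lowest_mae = best_iteration(
    metrics_by_iteration, "mean_absolute_error", higher_is_better=False)
highest_r2 = best_iteration(
    metrics_by_iteration, "r2_score", higher_is_better=True)
print(lowest_mae, highest_r2)
```

In this excerpt the two metrics disagree: iteration 29 has the lowest mean absolute error, while iteration 7 has a slightly higher `r2_score`.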


## Retrieve the best model

Select the best pipeline from the iterations. The `get_output` method on `local_run` returns the best run and the fitted model for the last fit invocation. Overloads of `get_output` let you retrieve the best run and fitted model for any logged metric or for a particular iteration.

In [24]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automated-ml-regression,
Id: AutoML_f715298d-5733-4fa5-bbe8-c2c3145f22f1_29,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingregressor', PreFittedSoftVotingRegressor(estimators=[('RandomForest', Pipeline(memory=None,
     steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=False...nsform=None,
               weights=[0.5333333333333333, 0.13333333333333333, 0.3333333333333333]))])
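
The printed pipeline ends in a `PreFittedSoftVotingRegressor` that carries three weights. As a rough illustration of weighted voting in general, and assuming the ensemble combines its member predictions as a weighted average (the SDK's actual implementation is not shown in this tutorial), the final prediction could be computed like this. The per-model predictions are hypothetical.

```python
# Weights copied from the printed pipeline.
weights = [0.5333333333333333, 0.13333333333333333, 0.3333333333333333]

# Hypothetical fare predictions from the three member models for one trip.
member_predictions = [8.0, 11.0, 9.5]


def weighted_average(predictions, weights):
    """Combine predictions, giving each model a share proportional to its weight."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


ensemble_prediction = weighted_average(member_predictions, weights)
print(ensemble_prediction)
```

The member with the largest weight pulls the combined value toward its own prediction, and the result always lies between the smallest and largest member prediction.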


## Register the model

Register the model in your Azure Machine Learning Workspace.

In [25]:
description = 'Automated Machine Learning Model'
tags = None
local_run.register_model(description=description, tags=tags)
local_run.model_id # Use this id to deploy the model as a web service in Azure

Registering model AutoMLf715298d5best


'AutoMLf715298d5best'

## Test the best model accuracy

Use the best model to run predictions on the test data set. The function `predict` uses the best model, and predicts the values of y (trip cost) from the `x_test` data set. Print the first 10 predicted cost values from `y_predict`.

In [26]:
y_predict = fitted_model.predict(x_test.values) 
print(y_predict[:10])

[ 8.38708077  8.8706504   8.70706933  8.64981801  6.80996859  7.43748343
  7.72175158 28.72376002  7.57333283 23.1779971 ]


Compare the predicted cost values with the actual cost values. Convert the `y_test` dataframe to a list so that it can be compared with the predicted values. The function `mean_absolute_error` takes two arrays of values and calculates the average absolute difference between them. For example, a mean absolute error of 3.5 means that, on average, the model's predictions are off by 3.5 from the actual cost.

In [30]:
from sklearn.metrics import mean_absolute_error

y_actual = y_test.values.flatten().tolist()
print("Mean Absolute Error :")
mean_absolute_error(y_actual, y_predict)

Mean Absolute Error :


3.1235324986518913
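
For reference, mean absolute error is simple enough to compute by hand. This pure-Python sketch mirrors the definition with hypothetical fares. The helper name `mean_abs_error` is illustrative and is not part of `sklearn`.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_fares = [8.5, 15.5, 6.0]        # hypothetical actual fares
predicted_fares = [8.0, 17.0, 7.0]     # hypothetical predictions
print(mean_abs_error(actual_fares, predicted_fares))
```

The three errors are 0.5, 1.5, and 1.0, so the mean absolute error is 1.0. Overestimates and underestimates count equally because only the size of each error is used.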

Run the following code to calculate MAPE (mean absolute percent error) using the full `y_actual` and `y_predict` data sets. This metric calculates an absolute difference between each predicted and actual value, sums all the differences, and then expresses that sum as a percent of the total of the actual values.

In [31]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = abs(actual_val - predict_val)
    
    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val
    
mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE :")
print(mean_abs_percent_error)
print()
print("Model Accuracy :")
print(1 - mean_abs_percent_error)

Model MAPE :
0.24793424410420511

Model Accuracy :
0.7520657558957948
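
The calculation above divides the total absolute error by the total of the actual values, a form often called weighted MAPE. The more common definition of MAPE averages each row's percentage error instead. The sketch below computes both on hypothetical values so you can see how they differ. The function names are illustrative.

```python
def aggregate_percent_error(actual, predicted):
    """Sum of absolute errors divided by the sum of actual values."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / sum(actual)


def per_row_mape(actual, predicted):
    """Mean of each row's absolute error relative to its own actual value.

    Undefined when any actual value is zero.
    """
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


sample_actual = [10.0, 20.0]       # hypothetical actual fares
sample_predicted = [12.0, 18.0]    # hypothetical predictions

print(aggregate_percent_error(sample_actual, sample_predicted))
print(per_row_mape(sample_actual, sample_predicted))
```

Both rows miss by 2, but the per-row version weights the miss on the cheaper trip more heavily, so it reports a larger error than the aggregate version. Keep this in mind when comparing the "accuracy" figure with numbers from other tools.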
