# Workshop Azure Databricks
## 09. AutoML
## AutoML from Azure ML integration
<br>
<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>
<br>
<br>
Azure ML documentation: https://docs.microsoft.com/en-us/azure/machine-learning/

AutoML: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train<br>
Supported models: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train?view=azure-ml-py#supported-models

## Azure ML AutoML Databricks Installation 
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-databricks-automl-environment

In [0]:
#%pip install --upgrade --force-reinstall -r https://aka.ms/automl_linux_requirements.txt

### Check the Azure ML Core SDK Version to Validate Your Installation

In [0]:
import azureml.core
print("You are using Azure ML version", azureml.core.VERSION)

## Initialize an Azure ML Workspace
### What is an Azure ML Workspace and Why Do I Need One?

An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows.  In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models.


### What do I Need?

To create or access an Azure ML workspace, you will need to import the Azure ML library and specify the following information:
* A name for your workspace. You can choose one.
* Your subscription id. Use the `id` value from the output of the `az account show` command.
* The resource group name. The resource group organizes Azure resources and provides a default region for the resources in the group. The resource group will be created if it doesn't exist. Resource groups can be created and viewed in the [Azure portal](https://portal.azure.com).
* The workspace region. Supported regions include `eastus2`, `eastus`, `westcentralus`, `southeastasia`, `westeurope`, `australiaeast`, `westus2`, and `southcentralus`.

In [0]:
subscription_id = "TOBEREPLACED"
resource_group = "AMLworkshop-rg"
workspace_name = "AMLworkshop"
workspace_region = "westeurope"
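
Before calling the SDK it can help to sanity-check these values. The sketch below is one possible check and is not part of the Azure ML SDK: `check_workspace_settings` is a hypothetical helper, and the region set is simply the one listed in this notebook, which may not match what Azure currently supports.

```python
import uuid

# Regions listed in this notebook; the set actually supported by Azure ML may differ.
SUPPORTED_REGIONS = {
    "eastus2", "eastus", "westcentralus", "southeastasia",
    "westeurope", "australiaeast", "westus2", "southcentralus",
}

def check_workspace_settings(subscription_id, resource_group, workspace_name, region):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    try:
        uuid.UUID(subscription_id)  # subscription ids are GUIDs
    except ValueError:
        problems.append("subscription_id is not a valid GUID")
    if not resource_group:
        problems.append("resource_group is empty")
    if not workspace_name:
        problems.append("workspace_name is empty")
    if region not in SUPPORTED_REGIONS:
        problems.append(f"region '{region}' is not in the notebook's region list")
    return problems

# The placeholder subscription id is flagged until it is replaced.
print(check_workspace_settings("TOBEREPLACED", "AMLworkshop-rg", "AMLworkshop", "westeurope"))
```

An empty list means none of these basic checks failed; it does not guarantee that you have permissions on the subscription.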

## Creating a Workspace
If you already have access to an Azure ML workspace you want to use, you can skip this cell.  Otherwise, this cell will create an Azure ML workspace for you in the specified subscription, provided you have the correct permissions for the given `subscription_id`.

This will fail when:
1. You do not have permission to create a workspace in the resource group.
2. You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription.

Because the cell below passes `exist_ok=True`, a workspace that already exists is returned instead of causing an error.

If workspace creation fails, please work with your IT administrator to provide you with the appropriate permissions or to provision the required resources.

**Note:** Creation of a new workspace can take several minutes.

In [0]:
from azureml.core import Workspace

ws = Workspace.create(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      location = workspace_region,                      
                      exist_ok=True)

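If the cell prompts for interactive authentication, open https://microsoft.com/devicelogin and enter the code shown in the cell output.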

In [0]:
ws.get_details()

## Configuring Your Local Environment
You can validate that you have access to the specified workspace and write a configuration file to the default configuration location. In recent SDK versions this is `./.azureml/config.json`; older releases used `./aml_config/config.json`.

In [0]:
from azureml.core import Workspace

ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)

ws.write_config()
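
For reference, the configuration file is a small JSON document. The sketch below mimics its layout with the standard library only, assuming the three keys `subscription_id`, `resource_group`, and `workspace_name`. The values and the temporary directory are illustrative; in a real notebook `Workspace.from_config()` reads the file for you.

```python
import json
import os
import tempfile

# Illustrative values; the key names are an assumption about the file layout.
config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "AMLworkshop-rg",
    "workspace_name": "AMLworkshop",
}

# Write to a throwaway directory so the sketch does not touch a real config.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Reading the file back yields the same three settings.
with open(config_path) as f:
    loaded = json.load(f)
print(loaded["workspace_name"])
```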

## Create an Experiment

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [0]:
import logging
import os
import random
import time
import json

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [0]:
experiment_name = 'Databricks-AutoML-Diabetes'
experiment = Experiment(ws, experiment_name)

## Load Training Data Using Dataset

Automated ML takes a `TabularDataset` as input.

You are free to use the data preparation libraries and tools of your choice to do the required preparation. Once you are done, you can write the result to a datastore and create a `TabularDataset` from it.

In [0]:
#!pip install azureml-opendatasets

In [0]:
from azureml.core import Dataset
from azureml.opendatasets import Diabetes 
diabetes = Diabetes.get_tabular_dataset() # Azure ML dataset

train_dataset = diabetes

train_df = train_dataset.to_pandas_dataframe() # Convert Azure ML dataset to Pandas dataframe
train_df.info()

In [0]:
train_df.head(20)

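| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 |
| 5 | 23 | 1 | 22.6 | 89.0 | 139 | 64.8 | 61.0 | 2.0 | 4.1897 | 68 | 97 |
| 6 | 36 | 2 | 22.0 | 90.0 | 160 | 99.6 | 50.0 | 3.0 | 3.9512 | 82 | 138 |
| 7 | 66 | 2 | 26.2 | 114.0 | 255 | 185.0 | 56.0 | 4.55 | 4.2485 | 92 | 63 |
| 8 | 60 | 2 | 32.1 | 83.0 | 179 | 119.4 | 42.0 | 4.0 | 4.4773 | 94 | 110 |
| 9 | 29 | 1 | 30.0 | 85.0 | 180 | 93.4 | 43.0 | 4.0 | 5.3845 | 88 | 310 |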


In [0]:
train_df.describe()

| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 |
| mean | 48.5181 | 1.468326 | 26.375792 | 94.647014 | 189.140271 | 115.43914 | 49.788462 | 4.070249 | 4.641411 | 91.260181 | 152.133484 |
| std | 13.109028 | 0.499561 | 4.418122 | 13.831283 | 34.608052 | 30.413081 | 12.934202 | 1.29045 | 0.522391 | 11.496335 | 77.093005 |
| min | 19.0 | 1.0 | 18.0 | 62.0 | 97.0 | 41.6 | 22.0 | 2.0 | 3.2581 | 58.0 | 25.0 |
| 25% | 38.25 | 1.0 | 23.2 | 84.0 | 164.25 | 96.05 | 40.25 | 3.0 | 4.2767 | 83.25 | 87.0 |
| 50% | 50.0 | 1.0 | 25.7 | 93.0 | 186.0 | 113.0 | 48.0 | 4.0 | 4.62005 | 91.0 | 140.5 |
| 75% | 59.0 | 2.0 | 29.275 | 105.0 | 209.75 | 134.5 | 57.75 | 5.0 | 4.9972 | 98.0 | 211.5 |
| max | 79.0 | 2.0 | 42.2 | 133.0 | 301.0 | 242.4 | 99.0 | 9.09 | 6.107 | 124.0 | 346.0 |


In [0]:
train_df.corr()

| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AGE | 1.0 | 0.173737 | 0.185085 | 0.335428 | 0.260061 | 0.219243 | -0.075181 | 0.203841 | 0.270774 | 0.301731 | 0.187889 |
| SEX | 0.173737 | 1.0 | 0.088161 | 0.24101 | 0.035277 | 0.142637 | -0.37909 | 0.332115 | 0.149916 | 0.208133 | 0.043062 |
| BMI | 0.185085 | 0.088161 | 1.0 | 0.395411 | 0.249777 | 0.26117 | -0.366811 | 0.413807 | 0.446157 | 0.38868 | 0.58645 |
| BP | 0.335428 | 0.24101 | 0.395411 | 1.0 | 0.242464 | 0.185548 | -0.178762 | 0.25765 | 0.39348 | 0.39043 | 0.441482 |
| S1 | 0.260061 | 0.035277 | 0.249777 | 0.242464 | 1.0 | 0.896663 | 0.051519 | 0.542207 | 0.515503 | 0.325717 | 0.212022 |
| S2 | 0.219243 | 0.142637 | 0.26117 | 0.185548 | 0.896663 | 1.0 | -0.196455 | 0.659817 | 0.318357 | 0.2906 | 0.174054 |
| S3 | -0.075181 | -0.37909 | -0.366811 | -0.178762 | 0.051519 | -0.196455 | 1.0 | -0.738493 | -0.398577 | -0.273697 | -0.394789 |
| S4 | 0.203841 | 0.332115 | 0.413807 | 0.25765 | 0.542207 | 0.659817 | -0.738493 | 1.0 | 0.617859 | 0.417212 | 0.430453 |
| S5 | 0.270774 | 0.149916 | 0.446157 | 0.39348 | 0.515503 | 0.318357 | -0.398577 | 0.617859 | 1.0 | 0.464669 | 0.565883 |
| S6 | 0.301731 | 0.208133 | 0.38868 | 0.39043 | 0.325717 | 0.2906 | -0.273697 | 0.417212 | 0.464669 | 1.0 | 0.382483 |


In [0]:
label = 'Y'
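
Since `Y` is the label, the `Y` column of the correlation output above gives a first hint about which features may matter. The sketch below ranks the features by the absolute value of that correlation. It uses the values copied from the output rather than the live dataframe, so it runs without the dataset.

```python
# Pearson correlation of each feature with the label Y, copied from the
# correlation output above.
corr_with_label = {
    "AGE": 0.187889, "SEX": 0.043062, "BMI": 0.58645, "BP": 0.441482,
    "S1": 0.212022, "S2": 0.174054, "S3": -0.394789, "S4": 0.430453,
    "S5": 0.565883, "S6": 0.382483,
}

# Rank by absolute value: a strong negative correlation is as informative
# as a strong positive one.
ranked = sorted(corr_with_label.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name:>3}  {value:+.3f}")
```

With the live dataframe, an equivalent ranking could be obtained with `train_df.corr()[label].drop(label).abs().sort_values(ascending=False)`. Linear correlation is only a rough guide; AutoML does its own featurization and model selection.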

## Configure AutoML

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

In [0]:
automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'r2_score',
                             iteration_timeout_minutes = 5,
                             experiment_timeout_minutes = 20,
                             iterations = 10,
                             n_cross_validations = 3,
                             max_concurrent_iterations = 3, #change it based on number of worker nodes
                             verbosity = logging.INFO,
                             spark_context=sc, #databricks/spark related
                             training_data=train_dataset,
                             label_column_name=label)
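
The time limits in this configuration interact. As a rough sanity check, the sketch below estimates the worst-case training time. It assumes that iterations are scheduled in waves of `max_concurrent_iterations` and that each wave takes at most `iteration_timeout_minutes`. This is a simplification, and the actual scheduler may behave differently.

```python
import math

# Values taken from the AutoMLConfig above.
iterations = 10
max_concurrent_iterations = 3
iteration_timeout_minutes = 5
experiment_timeout_minutes = 20

# Number of sequential waves needed to run all iterations (assumed scheduling model).
waves = math.ceil(iterations / max_concurrent_iterations)
# Upper bound on training time if every iteration hits its timeout.
worst_case_minutes = waves * iteration_timeout_minutes

print(waves, worst_case_minutes)
print(worst_case_minutes <= experiment_timeout_minutes)
```

Under this assumption the settings only just fit within the experiment timeout, and setup or featurization time is not counted, so raising `experiment_timeout_minutes` or lowering `iterations` may be worth considering if runs are cut short.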

## Train the Models

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.

In [0]:
automl_run = experiment.submit(automl_config, show_output = True)

## Explore the Results

#### Portal URL for Monitoring Runs

The following will provide a link to the web interface to explore individual run details and status. In the future we might support output displayed in the notebook.

In [0]:
# Quote the href value so that URLs containing '=' or '&' are parsed correctly.
displayHTML("<a href='{}' target='_blank'>Azure Portal: {}</a>".format(automl_run.get_portal_url(), automl_run.id))

#### Retrieve All Child Runs after the experiment is completed (in portal)
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [0]:
children = list(automl_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    #print(properties)
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

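# One column per iteration, sorted by iteration number (axis must be passed by keyword in recent pandas).
mylist = pd.DataFrame(metricslist).sort_index(axis=1)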
mylist

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| root_mean_squared_error | 58.85 | 56.22 | 54.61 | 56.35 | 56.91 | 54.66 | 57.27 | 57.32 | 54.2 | 54.36 |
| normalized_root_mean_squared_error | 0.18 | 0.18 | 0.17 | 0.18 | 0.18 | 0.17 | 0.18 | 0.18 | 0.17 | 0.17 |
| normalized_root_mean_squared_log_error | 0.17 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.16 | 0.17 | 0.16 | 0.16 |
| r2_score | 0.41 | 0.47 | 0.5 | 0.46 | 0.45 | 0.5 | 0.45 | 0.44 | 0.5 | 0.5 |
| root_mean_squared_log_error | 0.43 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.43 | 0.41 | 0.41 |
| explained_variance | 0.41 | 0.47 | 0.5 | 0.46 | 0.45 | 0.5 | 0.45 | 0.45 | 0.5 | 0.5 |
| median_absolute_error | 38.67 | 40.47 | 38.59 | 40.64 | 37.4 | 39.24 | 41.03 | 41.17 | 39.02 | 39.79 |
| mean_absolute_error | 46.35 | 45.84 | 44.13 | 45.52 | 45.51 | 44.17 | 46.35 | 46.59 | 43.85 | 43.99 |
| mean_absolute_percentage_error | 39.91 | 40.5 | 39.36 | 39.73 | 39.69 | 39.39 | 40.65 | 41.44 | 38.71 | 38.97 |
| normalized_mean_absolute_error | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.15 | 0.14 | 0.14 |
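
The metrics above can also be compared programmatically. The sketch below uses the rounded values shown in the output rather than a live run. It picks the iteration with the highest `r2_score` and breaks ties on the lowest `root_mean_squared_error`; the tie-breaking rule is a choice of this sketch, not something AutoML prescribes.

```python
# Values copied from the metrics output above (rounded to two decimals).
r2_score = [0.41, 0.47, 0.5, 0.46, 0.45, 0.5, 0.45, 0.44, 0.5, 0.5]
rmse = [58.85, 56.22, 54.61, 56.35, 56.91, 54.66, 57.27, 57.32, 54.2, 54.36]

# Highest r2 first; among ties, the lowest RMSE wins.
best_iteration = min(range(len(r2_score)), key=lambda i: (-r2_score[i], rmse[i]))
print(best_iteration, r2_score[best_iteration], rmse[best_iteration])
```

Because the values are rounded, the result may not match the run that AutoML itself reports as best. With the live dataframe, `mylist.loc['r2_score'].idxmax()` would return the iteration with the highest unrounded score.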


### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method on `automl_run` returns the best run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [0]:
best_run, fitted_model = automl_run.get_output()
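
The overloads mentioned above can be wrapped in a small helper. This is a minimal sketch: `select_run` is a hypothetical function, not part of the SDK. It assumes `get_output` accepts the keyword arguments `metric` and `iteration`, and rejecting both at once is a choice of this sketch rather than documented SDK behavior.

```python
def select_run(automl_run, metric=None, iteration=None):
    """Return (run, fitted_model) for a metric, an iteration, or the default best run."""
    if metric is not None and iteration is not None:
        raise ValueError("Pass either metric or iteration, not both")
    if metric is not None:
        # Best run according to a specific logged metric.
        return automl_run.get_output(metric=metric)
    if iteration is not None:
        # Run and model produced by one particular iteration.
        return automl_run.get_output(iteration=iteration)
    # Default: best run according to the primary metric.
    return automl_run.get_output()
```

For example, `select_run(automl_run, metric='root_mean_squared_error')` would return the run with the best RMSE, and `select_run(automl_run, iteration=3)` the run from iteration 3.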

### Download the conda environment file
From the *best_run* download the conda environment file that was used to train the AutoML model.

In [0]:
from azureml.automl.core.shared import constants
conda_env_file_name = 'conda_env.yml'
best_run.download_file(name="outputs/conda_env_v_1_0_0.yml", output_file_path=conda_env_file_name)
with open(conda_env_file_name, "r") as conda_file:
    conda_file_contents = conda_file.read()
    print(conda_file_contents)

### Download the model scoring file
From the *best_run* download the scoring file to get the predictions from the AutoML model.

In [0]:
from azureml.automl.core.shared import constants
script_file_name = 'scoring_file.py'
best_run.download_file(name="outputs/scoring_file_v_1_0_0.py", output_file_path=script_file_name)
with open(script_file_name, "r") as scoring_file:
    scoring_file_contents = scoring_file.read()
    print(scoring_file_contents)

## Register the Fitted Model
If neither a metric nor an iteration is specified in the `register_model` call, the iteration with the best primary metric is registered.

In [0]:
tags= {"Type": "test" , 
       "Framework" : "AutoML AzureML", 
       "Team" : "DataScience" , 
       "Country" : "France"}

In [0]:
mydescription = 'AutoML Regression Model using AutoML from AzureML and Azure Databricks'

model = automl_run.register_model(description = mydescription, tags=tags)
automl_run.model_id


> You can see all the results from the Azure ML Studio