# Automated ML

Import all the dependencies that we will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.widgets import RunDetails
from azureml.core.run import Run
import pandas as pd
from azureml.data.datapath import DataPath
from azureml.train.automl import AutoMLConfig
import joblib 
import os

## Dataset

### Overview

The dataset that we will be using for this project is the [Heart Failure Prediction](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) dataset from Kaggle. 

Heart failure is a common event caused by cardiovascular diseases (CVDs), and this dataset contains 12 features that can be used to predict mortality from heart failure.

People who have cardiovascular disease, or who are at high cardiovascular risk, need early detection and management, where a machine learning model can be of great help.

**12 clinical features:**

* age - Age

* anaemia - Decrease of red blood cells or hemoglobin (boolean)

* creatinine_phosphokinase - Level of the CPK enzyme in the blood (mcg/L)

* diabetes - If the patient has diabetes (boolean)

* ejection_fraction - Percentage of blood leaving the heart at each contraction (percentage)

* high_blood_pressure - If the patient has hypertension (boolean)
  
* platelets - Platelets in the blood (kiloplatelets/mL)

* serum_creatinine - Level of serum creatinine in the blood (mg/dL)

* serum_sodium - Level of serum sodium in the blood (mEq/L)
  
* sex - Woman or man (binary)
  
* smoking - If the patient smokes or not (boolean)

* time - Follow-up period (days)

In this project we use Azure Automated ML to predict the death event (`DEATH_EVENT`) from the clinical features listed above.
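As a minimal sketch, the schema described above can be written down as a plain Python structure and used to check a record before it is used for training or scoring. The column names come from the feature list; the helper `missing_columns` is a hypothetical name introduced only for this illustration, and the example record is the first row of the dataset.

```python
# Column names taken from the feature list above; DEATH_EVENT is the label.
FEATURES = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
]
LABEL = "DEATH_EVENT"


def missing_columns(record, require_label=True):
    """Return expected column names that are absent from `record` (hypothetical helper)."""
    expected = FEATURES + ([LABEL] if require_label else [])
    return [name for name in expected if name not in record]


# First row of the dataset.
first_row = {
    "age": 75.0, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
    "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000.0,
    "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0,
    "time": 4, "DEATH_EVENT": 1,
}

# An empty list means every expected column is present.
print(missing_columns(first_row))
```

Passing `require_label=False` checks only the 12 features, which is the shape a scoring request would have.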


In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'new-experiment'

experiment = Experiment(ws, experiment_name)

In [3]:
print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location,
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep="\n")

run = experiment.start_logging()

Workspace name: quick-starts-ws-138897
Azure region: southcentralus
Subscription id: 48a74bb7-9950-4cc1-9caa-5d50f995cc55
Resource group: aml-quickstarts-138897


In [4]:
ds = Dataset.get_by_name(ws, 'heart-failure-dataset')

In [5]:
df = ds.to_pandas_dataframe()
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [6]:
train_data, test_data = ds.random_split(0.9)
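
`random_split(0.9)` divides the records randomly and only approximately by the given percentage, and without a `seed` argument the split may differ between runs. The following is a rough, pure-Python illustration of that idea under stated assumptions, not Azure ML's implementation: `approximate_split` is a hypothetical helper, and the row count of 299 is assumed for the example.

```python
import random


def approximate_split(rows, percentage, seed=None):
    """Send each row to the first part with probability `percentage` (hypothetical helper)."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(299))  # assumed row count, used only for illustration
train_rows, test_rows = approximate_split(rows, 0.9, seed=42)

# The two parts are disjoint and together cover every row,
# but their sizes are only approximately 90% and 10%.
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which helps when metrics from different runs need to be compared on the same test rows.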

## Create Compute Cluster

In [7]:
cpu_cluster_name = "compute-cluster"
# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cluster. Use it")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
cpu_cluster.wait_for_completion(show_output=True)
    

Found existing cluster. Use it
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

Instantiate an `AutoMLConfig` object to configure the AutoML experiment.

The parameters used here are:

* `n_cross_validations = 3` - Because our dataset is small, we apply cross-validation with 3 folds instead of a single train/validation split.


* `primary_metric = 'accuracy'` - The primary metric is the metric that is optimized during model training. Accuracy is chosen here because this is a binary classification task.


* `experiment_timeout_minutes = 30` - This defines how long, in minutes, our experiment should continue to run. Here this timeout is set to 30 minutes.


* `max_concurrent_iterations = 4` - The maximum number of child runs (iterations) that execute in parallel. We match it to the maximum number of nodes in the cluster (4), so that each concurrent child run can use its own node.


* `task = 'classification'` - This specifies the experiment type as classification.


* `compute_target = cpu_cluster` - Azure Machine Learning Managed Compute is a managed service for training machine learning models on clusters of Azure virtual machines. Here the compute target is `cpu_cluster`, defined above with VM size `STANDARD_D2_V2` and a maximum of 4 nodes.


* `training_data = train_data` - The training data for the experiment. Here it is `train_data`, the 90% split of the dataset uploaded to the datastore.


* `label_column_name = 'DEATH_EVENT'` - The target column is `DEATH_EVENT`, which is 1 if the patient died and 0 if the patient survived.


* `featurization = 'auto'` - Data guardrails and featurization steps are performed automatically as part of preprocessing.
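
To make the `n_cross_validations = 3` setting concrete, here is a minimal pure-Python sketch of how k-fold cross-validation partitions row indices. It is an illustration only: `k_fold_indices` is a hypothetical helper, it does not shuffle or stratify, and it makes no claim about how AutoML builds its folds internally.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; each row is used for validation exactly once."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


# Ten rows and three folds, mirroring n_cross_validations = 3 on a tiny example.
for train_idx, valid_idx in k_fold_indices(n_rows=10, n_folds=3):
    print(len(train_idx), len(valid_idx))
```

Each model is trained three times, once per fold, and the reported metric is aggregated over the held-out folds, which gives a steadier estimate on a small dataset than one fixed validation split.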


In [8]:
# Automl settings
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4
}

# automl config here
automl_config = AutoMLConfig(task='classification',
                             compute_target=cpu_cluster,
                             training_data=train_data,
                             label_column_name='DEATH_EVENT',
                             featurization='auto',
                             **automl_settings)

In [9]:
# Submit the experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

The `RunDetails` widget shows the progress of the AutoML run and its child runs.

In [10]:
RunDetails(remote_run).show()
remote_run.wait_for_completion(show_output=True)

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS

{'runId': 'AutoML_dba6c487-8850-4801-ab47-d58b82f70720',
 'target': 'compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-14T18:27:25.344262Z',
 'endTimeUtc': '2021-02-14T19:07:30.802651Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'compute-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"a5596cf6-441e-4518-aa24-14789a1f149d\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/02-14-2021_052120_UTC/heart_failure_clinical_records_dataset.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-138897\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"

## Best Model

Retrieve the best model from the AutoML run and inspect its properties.



In [11]:
best_automl_run, best_automl_model = remote_run.get_output()

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


In [12]:
print(best_automl_run)

Run(Experiment: new-experiment,
Id: AutoML_dba6c487-8850-4801-ab47-d58b82f70720_92,
Type: azureml.scriptrun,
Status: Completed)


In [13]:
print(best_automl_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                  max_features=None,
                                                                                                  max_leaf_nodes=None,
                                                                                                  max_samples=None,
                                        

In [14]:
best_automl_run

Experiment,Id,Type,Status,Details Page,Docs Page
new-experiment,AutoML_dba6c487-8850-4801-ab47-d58b82f70720_92,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [15]:
get_best_automl_metrics = best_automl_run.get_metrics()

for metric_name in get_best_automl_metrics:
    metric = get_best_automl_metrics[metric_name]
    print(metric_name, metric)

average_precision_score_weighted 0.9194036305691871
AUC_macro 0.9121034116553831
average_precision_score_macro 0.9002312747846276
precision_score_weighted 0.866628227073543
precision_score_macro 0.8583241538244272
recall_score_macro 0.8306664959890767
average_precision_score_micro 0.9254478016655366
f1_score_micro 0.8659003831417623
AUC_micro 0.9245607081516712
AUC_weighted 0.9121034116553829
log_loss 0.3750649364380503
accuracy 0.8659003831417623
recall_score_micro 0.8659003831417623
f1_score_weighted 0.8629526354343282
norm_macro_recall 0.6613329919781532
weighted_accuracy 0.893325198426492
matthews_correlation 0.6879617955289419
f1_score_macro 0.8402446760248354
balanced_accuracy 0.8306664959890767
precision_score_micro 0.8659003831417623
recall_score_weighted 0.8659003831417623
accuracy_table aml://artifactId/ExperimentRun/dcid.AutoML_dba6c487-8850-4801-ab47-d58b82f70720_92/accuracy_table
confusion_matrix aml://artifactId/ExperimentRun/dcid.AutoML_dba6c487-8850-4801-ab47-d58b82f707
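
The metrics above include both `accuracy` and `balanced_accuracy`, and the latter has the same value as `recall_score_macro`. As a rough sketch of why the two can differ, the following computes both from the four cells of a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from this run.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy and balanced accuracy from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall_positive = tp / (tp + fn)   # share of actual positives that were found
    recall_negative = tn / (tn + fp)   # share of actual negatives that were found
    # Balanced accuracy is the unweighted mean of the per-class recalls.
    balanced_accuracy = (recall_positive + recall_negative) / 2
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy}


# Hypothetical counts: 100 survivors and 50 deaths in the evaluated rows.
metrics = classification_metrics(tn=90, fp=10, fn=15, tp=35)
for name, value in metrics.items():
    print(name, round(value, 4))
```

When one class is larger than the other, accuracy is pulled towards the recall of the larger class, while balanced accuracy weights both classes equally.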

In [16]:
# Save the best model
model = best_automl_run.register_model(model_name = 'best_automl_model', model_path = 'outputs/model.pkl', 
                                       tags = {'Training context':'Auto ML'},
                                       properties={'Accuracy': get_best_automl_metrics['accuracy']})
print(model)

Model(workspace=Workspace.create(name='quick-starts-ws-138897', subscription_id='48a74bb7-9950-4cc1-9caa-5d50f995cc55', resource_group='aml-quickstarts-138897'), name=best_automl_model, id=best_automl_model:1, version=1, tags={'Training context': 'Auto ML'}, properties={'Accuracy': '0.8659003831417623'})


In [17]:
# List best models of HyperDrive Run and AutoML Run to compare the accuracy of the model
from azureml.core.model import Model

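# Use a separate loop variable so that `model` (the registered AutoML model) is not overwritten
for registered_model in Model.list(ws):
    print(registered_model.name)
    for tag_name in registered_model.tags:
        tag = registered_model.tags[tag_name]
        print('\t',tag_name,':',tag)
    for prop_name in registered_model.properties:
        prop = registered_model.properties[prop_name]
        print('\t',prop_name,':',prop)
    print("\n")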

best_automl_model
	 Training context : Auto ML
	 Accuracy : 0.8659003831417623


best_hyperdrive_model
	 Training context : Hyper Drive
	 Accuracy : 0.8
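
The comparison above can also be done programmatically. The sketch below works on plain dictionaries that mirror the printed names and `Accuracy` properties rather than on live `Model` objects; `best_by_accuracy` is a hypothetical helper. Property values are kept as strings here because that is how the registered model's `properties` appear in the `Model(...)` output earlier.

```python
# Plain-dict stand-ins for the two registered models listed above.
registered = [
    {"name": "best_automl_model", "properties": {"Accuracy": "0.8659003831417623"}},
    {"name": "best_hyperdrive_model", "properties": {"Accuracy": "0.8"}},
]


def best_by_accuracy(models):
    """Return the entry with the highest Accuracy, comparing as numbers, not strings."""
    return max(models, key=lambda m: float(m["properties"]["Accuracy"]))


best = best_by_accuracy(registered)
print(best["name"], best["properties"]["Accuracy"])
```

Converting with `float` matters because string comparison is lexicographic and would, for example, rank `"0.9"` above `"0.95"` incorrectly in other cases.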




## Retrieve the Best Model's explanation

Retrieve the explanation for the best model from `best_automl_run`. It includes explanations for both engineered features and raw features.

In [18]:
model_explainability_run_id = remote_run.id + "_" + "ModelExplain"
print(model_explainability_run_id)
model_explainability_run = Run(experiment=experiment, run_id=model_explainability_run_id)
model_explainability_run.wait_for_completion()

AutoML_dba6c487-8850-4801-ab47-d58b82f70720_ModelExplain


{'runId': 'AutoML_dba6c487-8850-4801-ab47-d58b82f70720_ModelExplain',
 'target': 'compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-14T19:07:43.993046Z',
 'endTimeUtc': '2021-02-14T19:16:11.974215Z',
 'properties': {'azureml.runsource': 'automl',
  'parentRunId': 'AutoML_dba6c487-8850-4801-ab47-d58b82f70720_92',
  '_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '509df7e5-32dd-41fd-814b-8dc152ba9798',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'dependencies_versions': '{"azureml-train-automl-runtime": "1.21.0", "azureml-train-automl-client": "1.21.0", "azureml-telemetry": "1.21.0", "azureml-pipeline-core": "1.21.0", "azureml-model-management-sdk": "1.0.1b6.post1", "azureml-interpret": "1.21.0", "azureml-defaults": "1.21.0", "azureml-dataset-runtime": "1.21.0", "azureml-dataprep": "2.8.3", "azureml-dataprep-rslex": "1.6.0", "azureml-dataprep-native": "28.0.0", "azureml-core": "1.

**Download engineered feature importance from artifact store**

Here we use `ExplanationClient` to download the engineered feature explanations from the artifact store of the `best_automl_run`.

In [21]:
from azureml.interpret import ExplanationClient
client = ExplanationClient.from_run(best_automl_run)
engineered_explanations = client.download_model_explanation(raw=False)
exp_data = engineered_explanations.get_feature_importance_dict()
exp_data

{'time_MeanImputer': 0.9321322837200569,
 'ejection_fraction_MeanImputer': 0.3539359134153123,
 'serum_creatinine_MeanImputer': 0.25826874933613403,
 'age_MeanImputer': 0.23282709418923286,
 'serum_sodium_MeanImputer': 0.13243743589422685,
 'platelets_MeanImputer': 0.08711744920331814,
 'creatinine_phosphokinase_MeanImputer': 0.07991989711789646,
 'sex_ModeCatImputer_LabelEncoder': 0.043291783654328994,
 'diabetes_ModeCatImputer_LabelEncoder': 0.017874578104940517,
 'smoking_ModeCatImputer_LabelEncoder': 0.017557216719288674,
 'anaemia_ModeCatImputer_LabelEncoder': 0.008833710649745795,
 'high_blood_pressure_ModeCatImputer_LabelEncoder': 0.0045620525353339755}

**Download raw feature importance from artifact store**

`ExplanationClient` is used to download the raw feature explanations from the artifact store of the `best_automl_run`.

In [23]:
client = ExplanationClient.from_run(best_automl_run)
# raw=True returns importances for the original (raw) input features
raw_explanations = client.download_model_explanation(raw=True)
exp_data = raw_explanations.get_feature_importance_dict()
exp_data

{'time': 0.9321322837200569,
 'ejection_fraction': 0.3539359134153123,
 'serum_creatinine': 0.25826874933613403,
 'age': 0.23282709418923286,
 'serum_sodium': 0.13243743589422685,
 'platelets': 0.08711744920331814,
 'creatinine_phosphokinase': 0.07991989711789646,
 'sex': 0.043291783654328994,
 'diabetes': 0.017874578104940517,
 'smoking': 0.017557216719288674,
 'anaemia': 0.008833710649745795,
 'high_blood_pressure': 0.0045620525353339755}
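
A short sketch of how the raw importances above can be ranked and expressed as a share of the total. The values are copied from the output above; normalizing to shares is an illustration choice made here, not something `ExplanationClient` does.

```python
# Raw feature importances copied from the output above.
raw_importance = {
    "time": 0.9321322837200569,
    "ejection_fraction": 0.3539359134153123,
    "serum_creatinine": 0.25826874933613403,
    "age": 0.23282709418923286,
    "serum_sodium": 0.13243743589422685,
    "platelets": 0.08711744920331814,
    "creatinine_phosphokinase": 0.07991989711789646,
    "sex": 0.043291783654328994,
    "diabetes": 0.017874578104940517,
    "smoking": 0.017557216719288674,
    "anaemia": 0.008833710649745795,
    "high_blood_pressure": 0.0045620525353339755,
}

total = sum(raw_importance.values())
# Sort from most to least important.
ranked = sorted(raw_importance.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked[:5]:
    print(f"{name:<26} {value:.4f} {value / total:6.1%}")
```

The follow-up period `time` dominates the ranking, followed by `ejection_fraction` and `serum_creatinine`.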

## Retrieve the Best ONNX Model

The `get_output` method returns the best run and the fitted model, which includes the pipeline and any pre-processing. Overloads of `get_output` allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

Set the parameter `return_onnx_model = True` to retrieve the best ONNX model instead of the Python model. This only works if the experiment was submitted with `enable_onnx_compatible_models=True` in `AutoMLConfig`. The run above was not, which is why the call below raises `OnnxConvertException`.

In [28]:
from azureml.automl.runtime.onnx_convert import OnnxConverter
best_run, onnx_mdl = remote_run.get_output(return_onnx_model=True)

OnnxConvertException: OnnxConvertException:
	Message: Requested an ONNX compatible model but the run has ONNX compatibility disabled.
	InnerException: None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Requested an ONNX compatible model but the run has ONNX compatibility disabled.",
        "target": "onnx_compatible",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "ArgumentInvalid"
            }
        }
    }
}
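
The error message above says that ONNX compatibility was disabled for this run. A minimal sketch of the change, assuming the `enable_onnx_compatible_models` argument of `AutoMLConfig`, is to add the flag to the settings dict before submitting a new experiment. Only the dict is executed here; the `AutoMLConfig` call is left as a comment because it needs a live workspace and compute target.

```python
# Settings from the AutoML Configuration section, plus the ONNX switch.
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": "accuracy",
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    # Assumption: this flag must be set before the run is submitted,
    # otherwise get_output(return_onnx_model=True) fails as shown above.
    "enable_onnx_compatible_models": True,
}

# The dict would then be passed exactly as before:
# automl_config = AutoMLConfig(task='classification',
#                              compute_target=cpu_cluster,
#                              training_data=train_data,
#                              label_column_name='DEATH_EVENT',
#                              featurization='auto',
#                              **automl_settings)
print(automl_settings["enable_onnx_compatible_models"])
```

The experiment has to be re-run with this setting; an already completed run cannot be converted after the fact by changing the configuration.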

## Save the best ONNX model

In [None]:
from azureml.automl.runtime.onnx_convert import OnnxConverter
onnx_fl_path = "./best_model.onnx"
OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)

## Predict with the ONNX model

In [None]:
import sys
import json
from azureml.automl.core.onnx_convert import OnnxConvertConstants
from azureml.train.automl import constants

# True when the running Python version is supported by the ONNX inference helper
python_version_compatible = sys.version_info < OnnxConvertConstants.OnnxIncompatiblePythonVersion

import onnxruntime
from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper

def get_onnx_res(run):
    res_path = 'onnx_resource.json'
    run.download_file(name=constants.MODEL_RESOURCE_PATH_ONNX, output_file_path=res_path)
    with open(res_path) as f:
        onnx_res = json.load(f)
    return onnx_res

if python_version_compatible:
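    # Use the test split created earlier (`test_data`) and drop the label so only features are passed
    test_df = test_data.to_pandas_dataframe().drop(columns=['DEATH_EVENT'])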
    mdl_bytes = onnx_mdl.SerializeToString()
    onnx_res = get_onnx_res(best_run)

    onnxrt_helper = OnnxInferenceHelper(mdl_bytes, onnx_res)
    pred_onnx, pred_prob_onnx = onnxrt_helper.predict(test_df)

    print(pred_onnx)
    print(pred_prob_onnx)
else:
    print('Please use Python version 3.6 or 3.7 to run the inference helper.')

## Model Deployment

With the model registered, create an inference config and deploy the model as a web service.

In [18]:
# Download scoring file
best_automl_run.download_file('outputs/scoring_file_v_1_0_0.py','score.py')

# Download environment file
best_automl_run.download_file('outputs/conda_env_v_1_0_0.yml', 'envFile.yml')

In [19]:
# Create an inference config

from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig


# env = Environment.get(ws, "AzureML-Minimal").clone(env_name)

# for pip_package in ["scikit-learn"]:
#     env.python.conda_dependencies.add_pip_package(pip_package)

inference_config = InferenceConfig(entry_script='score.py',
                                    environment=best_automl_run.get_environment())

In [20]:
# Deploy the model as a web service
from azureml.core.webservice import AciWebservice, Webservice

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)
service = Model.deploy(ws, "aciservice", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running....................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [21]:
service.update(enable_app_insights = True)

In [24]:
print("State : "+service.state)
print("Swagger URI : "+service.swagger_uri)
print("Scoring URI : "+service.scoring_uri)

State : Healthy
Swagger URI : http://f4b51c89-2d5e-469f-9132-b8a5132ff160.southcentralus.azurecontainer.io/swagger.json
Scoring URI : http://f4b51c89-2d5e-469f-9132-b8a5132ff160.southcentralus.azurecontainer.io/score


Send a request to the deployed web service to test it.

In [47]:
import requests
import json

# Two sets of data to score, so we get two results back

td = test_data.to_pandas_dataframe()
sample_data = td.sample(2)
y_test = sample_data["DEATH_EVENT"]
sample_data.drop(['DEATH_EVENT'], inplace=True, axis=1)
x_test = sample_data
data = {"data":x_test.to_dict(orient='records')}

# Convert to JSON string
input_data = json.dumps(data)
print(input_data)

{"data": [{"age": 51.0, "anaemia": 1, "creatinine_phosphokinase": 582, "diabetes": 1, "ejection_fraction": 35, "high_blood_pressure": 0, "platelets": 263358.03, "serum_creatinine": 1.5, "serum_sodium": 136, "sex": 1, "smoking": 1, "time": 145}, {"age": 63.0, "anaemia": 0, "creatinine_phosphokinase": 936, "diabetes": 0, "ejection_fraction": 38, "high_blood_pressure": 0, "platelets": 304000.0, "serum_creatinine": 1.1, "serum_sodium": 133, "sex": 1, "smoking": 1, "time": 88}]}
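
The request body printed above is a JSON object with a single `data` key holding one dictionary per row, with the label column removed. A minimal standard-library sketch of building such a payload from plain records follows; `build_payload` is a hypothetical helper, and the sample record is the first row of the dataset rather than one of the rows sampled above.

```python
import json

LABEL = "DEATH_EVENT"


def build_payload(records):
    """Drop the label from each record and wrap the rest as {"data": [...]} (hypothetical helper)."""
    features_only = [
        {key: value for key, value in record.items() if key != LABEL}
        for record in records
    ]
    return json.dumps({"data": features_only})


# First row of the dataset, including its label.
sample = [{
    "age": 75.0, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
    "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000.0,
    "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0,
    "time": 4, "DEATH_EVENT": 1,
}]

payload = build_payload(sample)
print(payload)
```

Building new dictionaries instead of deleting keys in place keeps the original records, and their labels, available for comparing against the predictions.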


In [48]:
# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(service.scoring_uri, input_data, headers=headers)
print('Prediction :', resp.text)

# Print original labels
print('True Values :', y_test.values)

ConnectionError: HTTPConnectionPool(host='f4b51c89-2d5e-469f-9132-b8a5132ff160.southcentralus.azurecontainer.io', port=80): Max retries exceeded with url: /score (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97b4ddb9b0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Print the logs of the web service, then delete the service and the compute cluster.

In [42]:
print(service.get_logs())

2021-02-12T09:12:40,103765300+00:00 - gunicorn/run 
2021-02-12T09:12:40,102932200+00:00 - iot-server/run 
2021-02-12T09:12:40,138524100+00:00 - rsyslog/run 
2021-02-12T09:12:40,164934300+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [43]:
service.delete()
cpu_cluster.delete()