# Automated ML

## Dependencies 

All the dependencies needed to complete the project appear here.

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

from azureml.pipeline.steps import AutoMLStep

from azureml.widgets import RunDetails

import joblib

from azureml.core.environment import Environment 
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.38.0


## Workspace

The `config.json` file is downloaded from the Azure environment and must be in the project folder for this cell to run.
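
As a rough local check before calling `Workspace.from_config()`, the sketch below verifies that a `config.json` exists and contains the three keys the SDK reads (`subscription_id`, `resource_group`, `workspace_name`). The helper name `check_workspace_config` and the placeholder values are assumptions made for illustration; the code never contacts Azure.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def check_workspace_config(path="config.json"):
    """Return the parsed config, or raise if the file or a required key is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found; download it from Azure first")
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return config

# Demonstration with a temporary file and placeholder values (not real IDs).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>",
                   "workspace_name": "<workspace-name>"}, handle)
    demo_config = check_workspace_config(demo_path)
    print(sorted(demo_config))
```

In the real project the call would simply be `check_workspace_config()` from the project folder, before the workspace cell is run.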

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-188693
aml-quickstarts-188693
southcentralus
48a74bb7-9950-4cc1-9caa-5d50f995cc55


## Create an Azure ML experiment
I create an experiment named `heart-failure-prediction` and a folder to hold the training scripts. Script runs are recorded under this experiment in Azure.

The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folders separate also lets a step be reused when nothing in its `source_directory` has changed.
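
The folder layout described above can be sketched with the standard library alone. The helper `prepare_source_directory`, the step names `train` and `score`, and the file contents are hypothetical; the point is only that each step gets its own folder containing nothing but its own files.

```python
import os
import tempfile

def prepare_source_directory(base_dir, step_name, files):
    """Create base_dir/step_name and write only that step's files into it."""
    step_dir = os.path.join(base_dir, step_name)
    os.makedirs(step_dir, exist_ok=True)
    for file_name, content in files.items():
        with open(os.path.join(step_dir, file_name), "w", encoding="utf-8") as handle:
            handle.write(content)
    return step_dir

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = prepare_source_directory(tmp, "train", {"train.py": "print('training')\n"})
    score_dir = prepare_source_directory(tmp, "score", {"score.py": "print('scoring')\n"})
    train_files = sorted(os.listdir(train_dir))
    score_files = sorted(os.listdir(score_dir))
    print(train_files, score_files)
```

Each returned path is what would be passed as a step's `source_directory`, so editing `score.py` leaves the snapshot of the `train` folder untouched.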


In [3]:
# Choose a name for the run history container in the workspace.

experiment_name = 'heart-failure-prediction'
project_folder = './capstone-project'

experiment = Experiment(ws, experiment_name)
experiment

run = experiment.start_logging()

## Create or Attach a cluster

We need a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#compute-target) for the AutoML run. If the compute target (named `compute-cluster2` in this script) is not found, a new one is created using the default AmlCompute as the training compute resource.

In [4]:
# max_nodes should be no greater than 4.

# Choose a name for the cluster
cpu_cluster_name = "compute-cluster2"

# Reuse the cluster if it already exists; otherwise create a new one
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    print('Creating a new compute cluster...')
    # Poll for a minimum number of nodes (min_nodes = 1). 
    # If no min node count is provided it uses the scale settings for the cluster.
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_DS3_v2', min_nodes=1, max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Creating a new compute cluster...
InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded..................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 1, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2022-03-12T11:36:23.590000+00:00', 'errors': None, 'creationTime': '2022-03-12T11:35:13.663928+00:00', 'modifiedTime': '2022-03-12T11:35:17.265738+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1800S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


## Dataset

### Overview

The dataset is taken from [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). It contains records of 299 patients with heart failure, collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) during April–December 2015. The patients comprise 105 women and 194 men, and the main task of the project is to classify the patients based on their odds of survival.

Dataset features:

| Feature | Explanation |
| :---: | :---: |
| *age* | Age of patient |
| *anaemia* | Decrease of red blood cells or hemoglobin |
| *creatinine_phosphokinase* | Level of the CPK enzyme in the blood |
| *diabetes* | Whether the patient has diabetes or not |
| *ejection_fraction* | Percentage of blood leaving the heart at each contraction |
| *high_blood_pressure* | Whether the patient has hypertension or not |
| *platelets* | Platelets in the blood |
| *serum_creatinine* | Level of creatinine in the blood |
| *serum_sodium* | Level of sodium in the blood |
| *sex* | Sex of the patient (female or male), stored as a binary value |
| *smoking* | Whether the patient smokes or not |
| *time* | Follow-up period |
| *DEATH_EVENT* | Whether the patient died during the follow-up period |
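
To make the roles of the columns concrete, here is a minimal sketch that separates the twelve feature columns from the `DEATH_EVENT` label using only the standard library. The two sample rows are the first two records of the dataset; the helper name `split_features_and_label` is an assumption, and the notebook itself uses pandas for this step.

```python
import csv
import io

SAMPLE_CSV = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
"""

LABEL = "DEATH_EVENT"

def split_features_and_label(csv_text, label=LABEL):
    """Parse CSV text and return (list of feature dicts, list of integer labels)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    features = [{name: float(value) for name, value in row.items() if name != label}
                for row in rows]
    labels = [int(row[label]) for row in rows]
    return features, labels

features, labels = split_features_and_label(SAMPLE_CSV)
print(len(features[0]), labels)
```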


In [6]:
data = pd.read_csv('./heart_failure_clinical_records_dataset.csv')

key = "heart-failure-prediction"
description_text = "Prediction of survival of patients with heart failure"

if key in ws.datasets:
    # Reuse the dataset that is already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Create an AML Dataset from the raw CSV file.
    # The raw.githubusercontent.com URL is used because the GitHub "blob" page serves HTML, not CSV.
    my_dataset = 'https://raw.githubusercontent.com/Zahak-Anjum/nd00333-capstone/master/heart_failure_clinical_records_dataset.csv'
    dataset = Dataset.Tabular.from_delimited_files(my_dataset)
    # Register the Dataset in the Workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)

# Preview of the first five rows
print(data.head())

# Explore data
print(data.describe())

df = dataset.to_pandas_dataframe()
df.describe()

# Data columns
df.columns = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'DEATH_EVENT']
x = df[['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']]
y = df[['DEATH_EVENT']]


    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  
0        0     4            1  
1        0     6            1  
2       

## AutoML Configuration

Here is an overview of the `automl` settings and configuration I used for the AutoML run:

`"n_cross_validations": 2`

`"primary_metric": 'accuracy'`

`"enable_early_stopping": True`

`"max_concurrent_iterations": 4`

`"experiment_timeout_minutes": 20`

`"verbosity": logging.INFO`

`compute_target = compute_target`

`task = 'classification'`

`training_data = dataset`

`label_column_name = 'DEATH_EVENT'` 

`path = project_folder`

`featurization = 'auto'`

`debug_log = 'automl_errors.log'`

`enable_onnx_compatible_models = False`
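
The settings above are split between a dictionary and explicit keyword arguments, and the two are combined with `**` unpacking. The sketch below shows that mechanism in isolation; `describe_config` is a hypothetical stand-in for `AutoMLConfig` that simply returns the keyword arguments it receives, so it runs without the Azure SDK.

```python
import logging

def describe_config(**kwargs):
    """Hypothetical stand-in for AutoMLConfig: return the keyword arguments received."""
    return dict(kwargs)

automl_settings = {
    "n_cross_validations": 2,
    "primary_metric": "accuracy",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 4,
    "experiment_timeout_minutes": 20,
    "verbosity": logging.INFO,
}

# Explicit keyword arguments and the unpacked dictionary end up in one flat mapping.
merged = describe_config(task="classification",
                         label_column_name="DEATH_EVENT",
                         featurization="auto",
                         **automl_settings)
print(len(merged), merged["primary_metric"])
```

Keeping the tunable values in one dictionary makes it easy to change them without touching the constructor call; a key that appears both in the dictionary and as an explicit argument raises a `TypeError`.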


In [7]:
# Automl settings

automl_settings = {"n_cross_validations": 2,
                    "primary_metric": 'accuracy',
                    "enable_early_stopping": True,
                    "max_concurrent_iterations": 4,
                    "experiment_timeout_minutes": 20,
                    "verbosity": logging.INFO
                    }

# Parameters for AutoMLConfig

automl_config = AutoMLConfig(compute_target = compute_target,
                            task='classification',
                            training_data=dataset,
                            label_column_name='DEATH_EVENT',
                            path = project_folder,
                            featurization= 'auto',
                            debug_log = "automl_errors.log",
                            enable_onnx_compatible_models=False,
                            **automl_settings
                            )

In [8]:
# Submit the experiment

remote_run = experiment.submit(automl_config, show_output = True)
remote_run.wait_for_completion()

Submitting remote run.
No run_configuration provided, running on compute-cluster2 with default configuration
Running on remote compute: compute-cluster2


Experiment,Id,Type,Status,Details Page,Docs Page
heart-failure-prediction,AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

**********************************************************************************

{'runId': 'AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46',
 'target': 'compute-cluster2',
 'status': 'Completed',
 'startTimeUtc': '2022-03-12T11:41:27.014577Z',
 'endTimeUtc': '2022-03-12T11:56:24.922886Z',
 'services': {},
   'message': 'No scores improved over last 20 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '2',
  'target': 'compute-cluster2',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"339efca3-83b4-49d0-9f36-84eed1e4bc12\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml

In [9]:
# get_status()
# Fetch the latest status of the run. It should show 'Completed'

print("Run Status: ",remote_run.get_status())

Run Status:  Completed


## Run Details

In the cell below, I use the `RunDetails` widget and show the child runs of the experiment.

In [14]:

RunDetails(remote_run).show()

# Get details from each run
for child_run in remote_run.get_children():
    print('\n' * 3)
    print(child_run)

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…





Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_39,
Type: azureml.scriptrun,
Status: Completed)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_38,
Type: azureml.scriptrun,
Status: Completed)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_35,
Type: azureml.scriptrun,
Status: Canceled)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_37,
Type: azureml.scriptrun,
Status: Canceled)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_36,
Type: azureml.scriptrun,
Status: Canceled)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_34,
Type: azureml.scriptrun,
Status: Completed)




Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_33,
Type: azureml.scriptrun,
Status: Completed)




Run(E

## Best Model
In the cell below, I get the best run and its fitted model from the AutoML experiment and display the run's metrics, details, and properties.

In [13]:

best_run, fitted_model = remote_run.get_output()

# get_metrics()
# Returns the metrics
print("Best run metrics :",best_run.get_metrics())
print('\n' * 3)

# get_details()
# Returns a dictionary with the details for the run
print("Best run details :",best_run.get_details())
print('\n' * 3)

# get_properties()
# Fetch the latest properties of the run from the service
print("Best run properties :",best_run.get_properties())
print('\n' * 3)


Best run metrics : {'average_precision_score_macro': 0.8765713126587158, 'balanced_accuracy': 0.8146488319363978, 'recall_score_macro': 0.8146488319363978, 'recall_score_micro': 0.8629082774049217, 'f1_score_micro': 0.8629082774049217, 'precision_score_micro': 0.8629082774049217, 'average_precision_score_micro': 0.901780631702115, 'norm_macro_recall': 0.6292976638727957, 'weighted_accuracy': 0.9003775021559426, 'average_precision_score_weighted': 0.8991863885468029, 'matthews_correlation': 0.6788931483963743, 'f1_score_macro': 0.8315994122260009, 'accuracy': 0.8629082774049217, 'f1_score_weighted': 0.857416631526812, 'AUC_micro': 0.9044829852509146, 'recall_score_weighted': 0.8629082774049217, 'log_loss': 0.4266589462241306, 'precision_score_weighted': 0.8658060442047144, 'AUC_macro': 0.8928934539899802, 'precision_score_macro': 0.8669703868301999, 'AUC_weighted': 0.8928934539899802, 'confusion_matrix': 'aml://artifactId/ExperimentRun/dcid.AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_38

In [16]:
best_run.get_file_names()

# Download the yaml file that includes the environment dependencies
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'env.yml')

In [17]:
# Download the model file

best_run.download_file('outputs/model.pkl', 'Automl_model.pkl')

In [18]:
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
), random_state=0, reg_alpha=2.1875, reg_lambda=1.0416666666666667, subsample=1, tree_method='auto'))], verbose=False)), ('15', Pipeline(memory=None, steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('kneighborsclassifier', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan', metric_params=None, n_jobs=1, n_neighbors=6, p=2, weights='distance'))], verbose=False))], flatten_transform=None, weights=[0.16666666666666666, 0.08333333333333333, 0.08333333333333333, 0.08333333333333333, 0.08333333333333333, 0.08333333333333333, 0.16666666666666666, 0.083333333333333

In [19]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
heart-failure-prediction,AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_38,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [20]:
# Save the best model

best_run.register_model(model_name = "best_run_automl.pkl", model_path = './outputs/')

print(best_run)

Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_38,
Type: azureml.scriptrun,
Status: Completed)


## Best Model Based on Another Metric

Show the run and model with the highest **AUC_weighted**, and the run and model with the highest **average_precision_score_weighted**:
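
Conceptually, looking up the best run for a given metric means comparing that metric across the child runs while respecting its direction (higher is better for scores such as `AUC_weighted`, lower is better for `log_loss`). The sketch below illustrates the idea only; it is not the SDK's implementation, and the run ids and metric values are hypothetical.

```python
# Metrics where a lower value is better; everything else is treated as "higher is better".
MINIMIZE = {"log_loss"}

def best_run_for(metric, runs):
    """Return the id of the run with the best value for `metric`."""
    scored = {run_id: metrics[metric]
              for run_id, metrics in runs.items() if metric in metrics}
    if not scored:
        raise ValueError(f"no run reports {metric!r}")
    pick = min if metric in MINIMIZE else max
    return pick(scored, key=scored.get)

# Hypothetical run ids and metric values, for illustration only.
runs = {
    "run_a": {"accuracy": 0.86, "AUC_weighted": 0.89, "log_loss": 0.43},
    "run_b": {"accuracy": 0.84, "AUC_weighted": 0.91, "log_loss": 0.40},
}
print(best_run_for("accuracy", runs),
      best_run_for("AUC_weighted", runs),
      best_run_for("log_loss", runs))
```

The example shows why the "best" run can differ from one metric to another: `run_a` wins on accuracy while `run_b` wins on `AUC_weighted`.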

In [21]:
lookup_metric = "AUC_weighted"
best_run, fitted_model = remote_run.get_output(metric = lookup_metric)
print('\n' * 12)
print("Based on AUC_weighted: ",best_run)
print(fitted_model)

lookup_metric = "average_precision_score_weighted"
best_run, fitted_model = remote_run.get_output(metric = lookup_metric)
print('\n' * 12)
print("Based on average_precision_score_weighted: ",best_run)
print(fitted_model)
















Based on AUC_weighted:  Run(Experiment: heart-failure-prediction,
Id: AutoML_b3b7183d-b3ce-45c6-b7cb-75498698ab46_29,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features=0.5,
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,


## Model Deployment

Because the best model from the AutoML run has better accuracy than the one from the HyperDrive run, I deploy the AutoML model. In the cell below, I register the model, create an inference config, and deploy the model as a web service.

In [23]:
model = remote_run.register_model(model_name = 'best_run_automl.pkl')
print(remote_run.model_id)

# https://knowledge.udacity.com/questions/463620

environment = best_run.get_environment()
entry_script='inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)


inference_config = InferenceConfig(entry_script = entry_script, environment = environment)

# Deploying the model via ACI WebService
# https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/how-to-deploy-azure-container-instance.md

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                                    memory_gb = 1, 
                                                    auth_enabled= True, 
                                                    enable_app_insights= True)

service = Model.deploy(ws, "aci-heartdata-deployed", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)

best_run_automl.pkl
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-03-12 12:51:41+00:00 Creating Container Registry if not exists.
2022-03-12 12:51:41+00:00 Registering the environment.
2022-03-12 12:51:42+00:00 Use the existing image.
2022-03-12 12:51:42+00:00 Generating deployment configuration.
2022-03-12 12:51:43+00:00 Submitting deployment to compute.
2022-03-12 12:51:46+00:00 Checking the status of deployment aci-heartdata-deployed..
2022-03-12 12:54:35+00:00 Checking the status of inference endpoint aci-heartdata-deployed.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [24]:
# Getting the service state
# The scoring URI and the primary authentication key are copied to the endpoint.py file in order to test the deployed service.
# The Swagger URI can be used in Swagger UI: https://petstore.swagger.io/ For more info, please see the relevant part in the README file.

# Authentication is enabled, so I use the get_keys method to retrieve the primary and secondary authentication keys:
primary, secondary = service.get_keys()

print('Service state: ' + service.state)
print('Service scoring URI: ' + service.scoring_uri)
print('Service Swagger URI: ' + service.swagger_uri)
print('Service primary authentication key: ' + primary)

Service state: Healthy
Service scoring URI: http://03106752-0251-4e73-bff0-623e93921b87.southcentralus.azurecontainer.io/score
Service Swagger URI: http://03106752-0251-4e73-bff0-623e93921b87.southcentralus.azurecontainer.io/swagger.json
Service primary authentication key: EIMtUUpQ2lMihr2GbM28A3P4pRnaMkkF


In [25]:
# Sending a request to the deployed web service to test it: consuming model endpoint

%run endpoint.py

{"result": [1, 1]}



Expected result: [true, true], where 'true' corresponds to '1' in the 'DEATH_EVENT' column. This matches the `{"result": [1, 1]}` response returned by the service.
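
The contents of `endpoint.py` are not shown here, so the following is only a sketch of how such a request could be assembled with the standard library. It assumes that the generated scoring script accepts a JSON body of the form `{"data": [...]}` and that key-based authentication uses a `Bearer` header. The function name `build_scoring_request`, the URL, and the key are placeholders; the record is the first row of the dataset.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, key, records):
    """Build (but do not send) a POST request for the scoring endpoint."""
    body = json.dumps({"data": records}).encode("utf-8")  # assumed payload shape
    headers = {"Content-Type": "application/json",
               "Authorization": f"Bearer {key}"}
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")

# First record of the dataset, used as an example input.
record = {"age": 75.0, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
          "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000.0,
          "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0, "time": 4}

# Placeholder URI and key; replace them with the values printed by the service cell.
request = build_scoring_request("http://example.invalid/score", "<primary-key>", [record])
print(request.get_method(), request.get_header("Authorization"))

# To actually call the service:
# response = urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes the payload and headers easy to inspect before any network call is made.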


In [26]:
# Printing the logs
print(service.get_logs())

2022-03-12T12:54:14,890909100+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2022-03-12T12:54:14,895164200+00:00 - iot-server/run 
2022-03-12T12:54:14,908722200+00:00 - nginx/run 
2022-03-12T12:54:14,904890800+00:00 - rsyslog/run 
rsyslogd: /azureml-envs/azureml_6797cf9b513e59b405ce80f3e9222a7d/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-03-12T12:54:15,325939400+00:00 - iot-server/finish 1 0
2022-03-12T12:54:15,327523700+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (73)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 101
SPARK_HOME not set. Skipping PySpark Initialization.
Generating new fontManager, this may take some time...
Initializing logger
2022-03-12 12:54:18,712 | root | INFO | Starting up app insights client
logging socket was

## Deleting the service
The service deletion is kept in a separate cell so that it is not run accidentally before the other tasks are finished.

In [27]:

service.delete()
