# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

Building on the Python notebooks from the first two projects of this nanodegree program ('Optimizing a pipeline in Azure' and 'Operationalizing machine learning'), I import the basic dependencies needed for this project below. Any other dependency is imported at the point where it is first needed.

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as pyplot
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Checking core SDK version number
print("SDK Version:", azureml.core.VERSION)

SDK Version: 1.19.0


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

I will be using the 'Heart Failure Clinical Data' dataset, which consists of 12 features (age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time) that can be used to predict mortality from heart failure. The dataset has 299 rows and no null entries.

The 12 features are as follows:

(1) age

(2) anaemia i.e. decrease of red blood cells or hemoglobin (boolean)

(3) creatinine_phosphokinase i.e. level of the CPK enzyme in the blood (mcg/L)

(4) diabetes i.e. if the patient has diabetes or not (boolean)

(5) ejection_fraction i.e. percentage of blood leaving the heart at each contraction (percentage)

(6) high_blood_pressure i.e. if the patient has hypertension (boolean)

(7) platelets i.e. platelets in the blood (kiloplatelets/mL)

(8) serum_creatinine i.e. level of serum creatinine in the blood (mg/dL)

(9) serum_sodium i.e. level of serum sodium in the blood (mEq/L)

(10) sex i.e. woman or man (binary)

(11) smoking i.e. if the patient smokes or not (boolean)

(12) time i.e. follow-up period (days)


We will be predicting the following output:

DEATH_EVENT i.e. whether the patient died during the follow-up period (boolean)

A machine learning classification model on this dataset will be helpful for early detection of people with cardiovascular disease or those who are at high risk of cardiovascular disease. 

SOURCE : https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

I have already registered the dataset in the workspace after downloading it from Kaggle. I therefore load it into the experiment using the name (and description) I registered it with.

In [2]:
# creating an automl experiment in our workspace

# initializing a workspace
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

# choosing a name for experiment
experiment_name = 'capstone-automl-experiment'
project_folder = './pipeline-project'

# creating the experiment
experiment = Experiment(ws, experiment_name)
experiment.start_logging()
experiment

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code RW8Z7NMVS to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


In [None]:
# creating an AMLCompute cluster for running the experiment

# importing required dependencies
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Choosing a name for our CPU cluster
amlcompute_cluster_name = "aml-auto"

# Verifying that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
compute_target.get_status()

In [3]:
# entering the dataset's name and description in 'key' and 'description_text' respectively

key = 'heart-failure-clinical-data'
description_text = 'heart failure predictions'

# importing the dataset for use
dataset = ws.datasets[key]

# converting the imported dataset to pandas dataframe for analyzing purpose
df = dataset.to_pandas_dataframe()

# analyzing the dataframe
df.describe()    

statistic,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0
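
Because `DEATH_EVENT` is a 0/1 column, its `mean` in the summary above is the share of positive labels, so the class counts can be recovered from the summary alone. The following is a minimal sketch in plain Python that uses only the numbers printed above; the variable names are my own and are not part of the notebook.

```python
# Values copied from the df.describe() output above.
n_rows = 299
death_event_mean = 0.32107  # mean of a 0/1 column = share of positives

# Recover the class counts from the mean.
n_deaths = round(n_rows * death_event_mean)
n_survivors = n_rows - n_deaths

print(f"deaths: {n_deaths}, survivors: {n_survivors}")
print(f"positive share: {n_deaths / n_rows:.1%}")
```

This gives 96 deaths against 203 survivors, roughly a 1:2 ratio, which is the mild imbalance that the choice of metric has to account for.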


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

For the AutoML settings, I use the following parameters:

(1) experiment_timeout_minutes : The maximum time the experiment is allowed to run. I set it to 30 minutes, so the experiment exits after 30 minutes at the latest (unless it finishes earlier on its own) and returns the best result found in that time.

(2) max_concurrent_iterations : The maximum number of iterations executed in parallel. I set it to 5, which speeds up the experiment without overloading the compute target.

(3) primary_metric : The metric that Automated Machine Learning optimizes for model selection. I use 'AUC_weighted'. AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate. Since the dataset is not highly imbalanced (about 32% of the rows have DEATH_EVENT = 1), the ROC curve is a reasonable way to judge model performance. AUC_weighted is the arithmetic mean of the per-class scores, weighted by the number of true instances in each class, which mitigates the effect of the small imbalance that is present.
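
To make the definition of a support-weighted AUC concrete, here is a small sketch in plain Python. It uses the textbook pairwise definition of AUC (the probability that a random positive is scored above a random negative) and is not AutoML's own implementation; the function names and the toy data are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_auc(labels, prob_positive):
    """One-vs-rest AUC per class, averaged with weights equal to each class's support."""
    total = 0.0
    for c in sorted(set(labels)):
        one_vs_rest = [1 if y == c else 0 for y in labels]
        # The score for class 1 is p; the score for class 0 is 1 - p.
        class_scores = [p if c == 1 else 1 - p for p in prob_positive]
        total += binary_auc(one_vs_rest, class_scores) * one_vs_rest.count(1)
    return total / len(labels)


# Hypothetical toy data, not taken from the heart failure dataset.
y_true = [0, 0, 1, 1]
p_death = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(y_true, p_death))
print(weighted_auc(y_true, p_death))
```

In a binary problem with complementary class scores, both per-class AUCs coincide, so the weighted average equals the plain AUC. The weighting only changes the result when there are more than two classes.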

For the AutoML configuration, I use the following parameters:

(1) compute_target : The compute target on which the Azure Machine Learning experiment runs. I pass the 'compute_target' object created above for this purpose.

(2) task : I want to build a classification model that predicts whether the patient is at high risk of cardiovascular disease. Hence, I pass 'classification' as the 'task' parameter.

(3) training_data : The training dataset used for the experiment. I pass 'dataset' (the registered dataset loaded above). Because the full training dataset is passed, it includes the output column, whose name is given in 'label_column_name'.

(4) label_column_name : The name of the output column in the training dataset. I pass 'DEATH_EVENT'.

(5) path : The full path to the Azure Machine Learning project folder. I pass './pipeline-project'.

(6) enable_early_stopping : Whether to terminate the experiment if the score stops improving in the short term. I set it to 'True'.

(7) featurization : Whether Azure featurizes the dataset automatically, skips featurization, or applies a customized featurization step. I pass 'auto' because I want Azure to featurize the dataset automatically.

(8) debug_log : The log file to which debug information is written. I pass 'automl_errors.log'.

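(9) n_cross_validations : The number of cross-validation folds. I planned to set it to 5, since the dataset has far fewer than 1,000 rows and 5 folds are not computationally expensive. Note, however, that the configuration cell below does not pass this parameter, so AutoML chose the number of folds itself; the data guardrails output in the Run Details section reports 10 folds.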
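
To illustrate what 5-fold cross-validation means for a dataset of 299 rows, the sketch below splits the row indices into contiguous folds. It is a plain-Python illustration with a hypothetical helper name; AutoML's actual splitter may shuffle or stratify the rows.

```python
def kfold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds contiguous validation folds."""
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = kfold_indices(299, 5)
for i, val_idx in enumerate(folds):
    print(f"fold {i}: train on {299 - len(val_idx)} rows, validate on {len(val_idx)}")
```

Each model is trained five times on about 240 rows and validated on the remaining 59 or 60, so the cost grows linearly with the number of folds.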

In [4]:
# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_minutes" : 30,
    "max_concurrent_iterations" : 5,
    "primary_metric" : 'AUC_weighted'
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=compute_target,
                            task="classification",
                            training_data=dataset,
                            label_column_name="DEATH_EVENT",
                            path='./pipeline-project',
                            enable_early_stopping=True,
                            featurization='auto',
                            enable_onnx_compatible_models=True,
                            debug_log="automl_errors.log",
                            **automl_settings
)

In [5]:
# TODO: Submit your experiment
remote_run = experiment.submit(config = automl_config, show_output=True)

Running on remote.


In [None]:
# waiting for completion of remote_run while showing its output
remote_run.wait_for_completion(show_output=True)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [8]:
# importing required dependencies
from azureml.widgets import RunDetails

RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…



****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|10                               |
+---------------------------------+

****************************************************************************************************

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were d

{'runId': 'AutoML_0f3c2865-a426-4af8-9657-8b3d54d6aa46',
 'target': 'notebook130193',
 'status': 'Completed',
 'startTimeUtc': '2020-12-12T09:14:09.076555Z',
 'endTimeUtc': '2020-12-12T09:42:48.279492Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'notebook130193',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"5888ec20-9252-4f8b-bb6f-367022c655cc\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/12-12-2020_083429_UTC/heart_failure_clinical_records_dataset.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-130193\\\\\\", \\\\\\"subscription\\\\\\": \\\\
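
The `startTimeUtc` and `endTimeUtc` fields in the run details above can be used to check how long the experiment ran relative to the 30-minute `experiment_timeout_minutes` setting. A minimal sketch using only the standard library and the two timestamps printed above:

```python
from datetime import datetime

# Timestamps copied from the run details output above.
start = "2020-12-12T09:14:09.076555Z"
end = "2020-12-12T09:42:48.279492Z"

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

minutes, seconds = divmod(elapsed.total_seconds(), 60)
print(f"run took {int(minutes)} min {seconds:.0f} s")
```

The run took a little under 29 minutes, which is within the configured 30-minute limit.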

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [9]:
# Retrieve best model from AutoML Run

# importing required dependencies
import joblib

best_automl_run, best_automl_fitted_model = remote_run.get_output()
print(best_automl_run)
print(best_automl_fitted_model)

Run(Experiment: capstone-automl-experiment,
Id: AutoML_0f3c2865-a426-4af8-9657-8b3d54d6aa46_65,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    n_estimators=25,
                                                                                                    n_jobs=1,
                         

In [10]:
#TODO: Save the best model
joblib.dump(best_automl_fitted_model,'best_automl_model.pkl')

['best_automl_model.pkl']

In [None]:
# getting the details of the best model produced by automl
best_automl_run.get_tags()

In [None]:
from pprint import pprint

def print_model(model,prefix=""):
    for step in model.steps:
        print(prefix+step[0])
        if hasattr(step[1],'estimators') and hasattr(step[1],'weights'):
            pprint({'estimators':list(e[0] for e in step[1].estimators), 'weights':step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1],estimator[0]+'-')
        else:
            pprint(step[1].get_params())
            print()

print_model(best_automl_fitted_model)

In [None]:
# retrieving the best model as ONNX model
best_auto_run, best_onnx_model = remote_run.get_output(return_onnx_model=True)

In [None]:
# importing required dependencies
from azureml.automl.runtime.onnx_convert import OnnxConverter

# saving the best model as onnx_model
onnx_fl_path = "./best_model.onnx"
OnnxConverter.save_onnx_model(best_onnx_model, onnx_fl_path)

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [None]:
# registering the best automl model

description = 'heart failure predictions'
tags = None

model = remote_run.register_model(description = description, tags = tags)

print(remote_run.model_id)

# importing required dependencies
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

# loading a curated environment from workspace

env = Environment.get(ws, "AzureML-AutoML")

# specifying scikit-learn as dependency
for pip_package in ["scikit-learn"]:
    env.python.conda_dependencies.add_pip_package(pip_package)

# creating an inference config
inference_config = InferenceConfig(entry_script='entry_script.py', environment=env)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

# naming the service to be deployed
aci_service_name = 'automl-heart-failure-predictions'
print(aci_service_name)

# deploying the model
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)
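
The cell above refers to `entry_script.py`, which is not shown in this notebook. The sketch below shows the general shape such a script could take, assuming the usual `init()` / `run(raw_data)` pair and the `{"data": [...]}` request and `'result'` response keys used in the request cell. The `StubModel` class is a hypothetical stand-in so that the sketch runs on its own; a real entry script would load the registered model instead.

```python
import json


class StubModel:
    """Hypothetical stand-in for the registered AutoML model."""

    def predict(self, rows):
        return [0 for _ in rows]  # always predicts "no death event"


model = None


def init():
    # Called once when the service starts. A real entry script would load
    # the registered model here (for example with joblib) instead of a stub.
    global model
    model = StubModel()


def run(raw_data):
    # Called per request with the raw JSON body, e.g. '{"data": [{...}, ...]}'.
    try:
        rows = json.loads(raw_data)["data"]
        predictions = model.predict(rows)
        return json.dumps({"result": list(predictions)})
    except Exception as exc:
        return json.dumps({"error": str(exc)})


init()
print(run('{"data": [{"age": 60, "time": 115}]}'))
```

Returning a JSON string from `run` would be consistent with the two nested `json.loads` calls in the request cell, assuming the service JSON-encodes the returned string once more before sending it.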

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
# importing required dependencies
from numpy import array

# getting the test features and labels
X_test = dataset.drop_columns(columns=['DEATH_EVENT'])
y_test = dataset.keep_columns(columns=['DEATH_EVENT'], validate=True)

# converting to pandas dataframes, since to_json() below is a DataFrame method
X_test = X_test.to_pandas_dataframe()
y_test = y_test.to_pandas_dataframe()

# importing required dependencies
import json
import requests

# building the JSON request body expected by the scoring endpoint
X_test_json = X_test.to_json(orient='records')
data = "{\"data\": " + X_test_json +"}"
headers = {'Content-Type': 'application/json'}

In [None]:
resp = requests.post(aci_service.scoring_uri, data, headers=headers)

y_pred = json.loads(json.loads(resp.text))['result']
actual = array(y_test)
actual = actual[:,0]
print(len(y_pred), " ", len(actual))
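
The cell above only compares the lengths of `y_pred` and `actual`. A natural next step is to compare them element by element. The helper below is a minimal sketch with a hypothetical name and hypothetical sample values; in the notebook it would be called with the real `y_pred` and `actual`.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Hypothetical values standing in for y_pred and actual from the cell above.
sample_pred = [1, 0, 1, 1]
sample_actual = [1, 0, 0, 1]
print(accuracy(sample_pred, sample_actual))
```

Since the request here sends the same rows that the model was trained on, any score computed this way is likely optimistic and mainly confirms that the endpoint responds correctly.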

TODO: In the cell below, print the logs of the web service and delete the service

In [None]:
# printing the logs of deployed web service
aci_service.get_logs()

In [None]:
# deleting a web service
aci_service.delete()

In [None]:
# delete compute cluster
compute_target.delete()