# Automated ML

Import dependencies. The cell below imports all the dependencies needed to complete the project.

In [None]:
from azureml.core import Workspace, Experiment
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.data.dataset_factory import TabularDatasetFactory
from train import split_data
from sklearn.model_selection import train_test_split
from azureml.core import ScriptRunConfig 
import os

## Dataset

### Overview
For this project, the dataset chosen is the [***Heart Disease UCI***](https://github.com/yashasvisingh14/MachineLearningEngineerWithMicrosoftAzure03/blob/main/heart.csv) dataset from Kaggle. It contains 14 columns. The "target" column indicates whether heart disease is present in the patient (1) or not (0).

Attribute Information -


1.   age
2.   sex
3.   chest pain type (4 values)
4.   resting blood pressure
5.   serum cholesterol in mg/dl
6.   fasting blood sugar > 120 mg/dl
7.   resting electrocardiographic results (values 0,1,2)
8.   maximum heart rate achieved
9.   exercise induced angina
10.  oldpeak = ST depression induced by exercise relative to rest
11.  the slope of the peak exercise ST segment
12.  number of major vessels (0-3) colored by fluoroscopy
13.  thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14.  target

The task is to classify the presence of heart disease in a person, so a binary classification algorithm is required. All features are used to train the model, and the target column is the target variable.
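The notebook imports a `split_data` helper from `train.py`, whose source is not shown here. As a rough illustration of the idea, the sketch below separates a label column from the feature columns of a pandas DataFrame. The function name `split_features_target` and the sample values are hypothetical; only the column names follow the dataset description above.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, label: str = "target"):
    """Return (features, labels), with the label column removed from the features."""
    if label not in df.columns:
        raise KeyError(f"label column {label!r} not found")
    features = df.drop(columns=[label])  # every column except the label
    labels = df[label]                   # the label column as a Series
    return features, labels


# Made-up rows for illustration only; they are not taken from heart.csv.
sample = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0], "target": [1, 1, 0]})
x, y = split_features_target(sample)
```

Here `x` holds the feature columns and `y` the binary label that the classifier is trained to predict.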





In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'Heart_AutoML'

experiment=Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

web_path = "https://raw.githubusercontent.com/yashasvisingh14/MachineLearningEngineerWithMicrosoftAzure03/main/heart.csv"
dataset = TabularDatasetFactory.from_delimited_files(path=web_path)

Workspace name: quick-starts-ws-139614
Azure region: southcentralus
Subscription id: 81cefad3-d2c9-4f77-a466-99a7f541c7bb
Resource group: aml-quickstarts-139614


In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cpu_cluster_name = "cpucluster"
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [None]:
import pandas as pd
x, y = split_data(dataset)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

try:
    os.makedirs('./data', exist_ok=True)
except OSError as error:
    print('New directory cannot be created')

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
train_df = X_train.assign(target=y_train)

train_path = 'data/train-data.csv'
train_df.to_csv(train_path)

test_df = X_test.assign(target=y_test)

test_path = 'data/test-data.csv'
test_df.to_csv(test_path)

# Upload both CSV files to the workspace's default datastore
datastore = ws.get_default_datastore()
datastore.upload(src_dir='data', target_path='data')


Uploading an estimated of 2 files
Uploading data/test-data.csv
Uploaded data/test-data.csv, 1 files out of an estimated total of 2
Uploading data/train-data.csv
Uploaded data/train-data.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_ad16420fc5c046c882a6f5bcff27c985
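Two pandas details are worth keeping in mind when writing split files like these. Attaching a label column in place to a frame returned by `train_test_split` may trigger a `SettingWithCopyWarning`, and `to_csv` writes the row index as an extra column unless `index=False` is passed. The sketch below shows one way to handle both. The helper name `write_split` and the sample values are hypothetical, and this is not necessarily how the original run was produced.

```python
import os
import tempfile

import pandas as pd


def write_split(features: pd.DataFrame, labels: pd.Series, path: str,
                label: str = "target") -> None:
    """Write features plus a label column to a CSV file without the row index."""
    # assign() builds a new DataFrame, so the caller's features frame is untouched
    combined = features.assign(**{label: labels})
    # index=False keeps the row index out of the file
    combined.to_csv(path, index=False)


# Made-up rows for illustration only
features = pd.DataFrame({"age": [63, 37], "chol": [233, 250]})
labels = pd.Series([1, 0], index=features.index)

out_path = os.path.join(tempfile.mkdtemp(), "train-data.csv")
write_split(features, labels, out_path)
written = pd.read_csv(out_path)
```

Reading the file back gives only the feature columns and the label, so no index column can end up being treated as a feature during training.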

In [None]:
train_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, ('data/train-data.csv'))])
test_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, ('data/test-data.csv'))])

In [None]:
train_data

{
  "source": [
    "('workspaceblobstore', 'data/train-data.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

## AutoML Configuration

AutoML trains a number of pipelines in parallel, each trying a different combination of algorithm and parameters, and tunes them against the specified target metric. It returns the model that best fits the data. This makes it possible to build an ML solution without extensive programming knowledge, and it saves time and resources.

In this project, AutoML was configured using an instance of the AutoMLConfig object. The following parameters are relevant:


1.   task specifies the kind of machine learning problem to solve: classification, regression, or forecasting. Here it is classification.

2.   primary_metric is the metric that model training optimizes. Since this is a classification scenario, accuracy is used as the primary metric.

3.   training_data is the training data used in the experiment. Here train_data is a TabularDataset loaded from a CSV file.

4.   experiment_timeout_minutes defines how long, in minutes, the experiment may run. In this case it is 60 minutes.

5.   n_cross_validations sets the number of cross-validation folds. It is not set explicitly in the configuration below; the data guardrails output of the run reports 10 folds.

6.   label_column_name is the name of the label column. Here it is 'target', which specifies whether a person has heart disease (1) or not (0).

7.   compute_target is the compute on which the experiment runs, here the cpu_cluster created earlier.

After the run completes, the best AutoML model is retrieved and saved.
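The settings described in the list can be collected in a plain dictionary and unpacked into `AutoMLConfig` as keyword arguments. The sketch below is one hedged way to build such a dictionary with basic checks. The helper `build_automl_settings` is hypothetical, and the fold count of 5 is an illustrative value rather than the one used in this run.

```python
def build_automl_settings(timeout_minutes, primary_metric, n_cross_validations=None):
    """Return a settings dict suitable for unpacking into AutoMLConfig(**settings)."""
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    settings = {
        "experiment_timeout_minutes": timeout_minutes,
        "primary_metric": primary_metric,
    }
    # Only include the fold count when the caller sets it explicitly
    if n_cross_validations is not None:
        if n_cross_validations < 2:
            raise ValueError("n_cross_validations must be at least 2")
        settings["n_cross_validations"] = n_cross_validations
    return settings


# Illustrative values: 60 minutes and accuracy match the notebook, 5 folds is an assumption
settings = build_automl_settings(60, "accuracy", n_cross_validations=5)
```

Setting the fold count explicitly makes the validation scheme visible in the configuration instead of leaving the choice to AutoML.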

  






In [None]:
from azureml.train.automl import AutoMLConfig

# Automl settings
automl_settings = {
    "experiment_timeout_minutes": 60,
    "primary_metric": 'accuracy'
}

# Define Automl config 
automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='target',
    compute_target=cpu_cluster,
    **automl_settings
)

In [None]:
# Submit your experiment
remote_run = experiment.submit(automl_config, show_output=True)
RunDetails(remote_run).show()
remote_run.wait_for_completion(show_output=True)

Running on remote.
No run_configuration provided, running on cpucluster with default configuration
Running on remote compute: cpucluster
Parent Run ID: AutoML_a42712d0-9994-4bec-81a9-4135c416679d

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|10                               |
+---------------------------------+

****************************************************************************************************

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were d


 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          0:00:52       1.0000    1.0000
         1   MaxAbsScaler XGBoostClassifier                 0:00:53       1.0000    1.0000
         2   MaxAbsScaler RandomForest                      0:00:48       0.9958    1.0000
         3   MaxAbsScaler RandomForest                      0:00:54       0.9543    1.0000
         4   MaxAbsScaler RandomForest                      0:00:51       0.9750    1.0000
         5   MaxAbsScaler RandomForest                      0:00:51       0.9460    1.0000
         6   SparseNormalizer XGBoostClassifier             0:01:21       0.9630    1.0000
         7   MaxAbsScaler LightGBM                          0:00:58       1.0000    1.0000
         8   MaxAbsScaler GradientBoosting                  0:01:01       1.0000    1.0000
         9   StandardScalerWrapper LightGBM                 0:00:58       1.0000    1.000

{'runId': 'AutoML_a42712d0-9994-4bec-81a9-4135c416679d',
 'target': 'cpucluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-28T15:36:20.936555Z',
 'endTimeUtc': '2021-02-28T16:50:23.271785Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'cpucluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"031927cc-ea57-4b36-8962-bdfdc6a313a9\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/train-data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-139614\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"81cefad3-d2c9-4f77-a466-99a7f541c7bb\\\\\\", \\\\\\"works

## Best Model

The cells below get the best model from the AutoML experiment and display its properties.



In [None]:
best_run, fitted_model = remote_run.get_output()

print(best_run)

Run(Experiment: Heart_AutoML,
Id: AutoML_a42712d0-9994-4bec-81a9-4135c416679d_0,
Type: azureml.scriptrun,
Status: Completed)


In [None]:
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('MaxAbsScaler', MaxAbsScaler(copy...
                 LightGBMClassifier(boosting_type='gbdt', class_weight=None,
                                    colsample_bytree=1.0,
                                    importance_type='split', learning_rate=0.1,
                                    max_depth=-1, min_child_samples=20,
                                    min_child_weight=0.001, min_split_gain=0.0,
                        

In [None]:
best_run.get_tags()

{'_aml_system_azureml.automlComponent': 'AutoML',
 '_aml_system_ComputeTargetStatus': '{"AllocationState":"steady","PreparingNodeCount":0,"RunningNodeCount":1,"CurrentNodeCount":1}',
 '_aml_system_automl_is_child_run_end_telemetry_event_logged': 'True'}

In [None]:
metrics = best_run.get_metrics()
metrics['accuracy']

1.0
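The accuracy above is the cross-validated metric reported by the run. The notebook also uploads a held-out test split but never scores it, so a check on that split may be a useful complement. The sketch below computes accuracy for any object with a `predict` method. The helper `holdout_accuracy`, the stand-in `ThresholdModel`, and the sample rows are all hypothetical and are not part of the original experiment.

```python
import pandas as pd


def holdout_accuracy(model, test_df: pd.DataFrame, label: str = "target") -> float:
    """Fraction of rows in test_df whose label the model predicts correctly."""
    features = test_df.drop(columns=[label])
    predictions = model.predict(features)
    correct = sum(int(p == t) for p, t in zip(predictions, test_df[label]))
    return correct / len(test_df)


class ThresholdModel:
    """Stand-in model for illustration: predicts 1 when age is above 50."""

    def predict(self, features):
        return [1 if age > 50 else 0 for age in features["age"]]


# Made-up rows for illustration only
holdout = pd.DataFrame({"age": [63, 37, 41, 56], "target": [1, 1, 0, 1]})
score = holdout_accuracy(ThresholdModel(), holdout)
```

In the notebook, the same helper could be called with `fitted_model` and `test_data.to_pandas_dataframe()`, assuming the test frame has the same columns as the training data.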

In [None]:
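# Save the best model locally and register it in the workspace

import joblib
from azureml.core.model import Model

description = "Heart Dataset"

os.makedirs('outputs', exist_ok=True)
# Serialize the fitted pipeline to a local pickle file
joblib.dump(fitted_model, filename="outputs/automl-model.pkl")
# Register the model produced by the AutoML run under the name 'Heart_AutoML'
automl_model = remote_run.register_model(model_name='Heart_AutoML', description=description)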

## Model Deployment

The cells below define the deployment configuration, create an inference config, and deploy the registered model as a web service.

In [None]:

from azureml.core.webservice import AciWebservice

aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    description='Heart AutoML Model',
    auth_enabled=True
)

In [None]:
from azureml.core.model import InferenceConfig, Model
from azureml.automl.core.shared import constants

# Look up the model registered above
model = Model(ws, 'Heart_AutoML')

# Reuse the environment of the best run and download its generated scoring script
myenv = best_run.get_environment()
entry_script = 'score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
best_run.download_file(constants.CONDA_ENV_FILE_PATH, 'myenv.yml')

inference_config = InferenceConfig(entry_script=entry_script, environment=myenv)

service = Model.deploy(workspace=ws, 
                       name='automl', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aci_config)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running...........................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


The cells below enable Application Insights, print the service details, and send a test request to the deployed web service.

In [None]:
service.update(enable_app_insights=True)

In [None]:
print("State "+ service.state)
print("Key " + service.get_keys()[0])
print("Swagger URI " + service.swagger_uri)
print("Scoring URI " + service.scoring_uri)

State Healthy
Key SPVQXTA00qrUBcfQN0eKRsn3AyiKSv24
Swagger URI http://cbce5c7c-b362-49f5-891c-e62d284d6831.southcentralus.azurecontainer.io/swagger.json
Scoring URI http://cbce5c7c-b362-49f5-891c-e62d284d6831.southcentralus.azurecontainer.io/score


In [None]:
%run endpoint.py

{"result": [1, 0]}
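The contents of `endpoint.py` are not shown in this notebook. As a hedged sketch of what such a script typically does, the code below builds the headers and JSON body for a key-authenticated scoring request. The helper `build_scoring_request`, the `{"data": [...]}` payload shape, the placeholder key, and the sample records are assumptions for illustration, not the author's script.

```python
import json


def build_scoring_request(records, key=None):
    """Return (headers, body) for a POST to a scoring URI."""
    # Assumed payload shape: a list of feature records under a "data" key
    body = json.dumps({"data": records})
    headers = {"Content-Type": "application/json"}
    if key:
        # Key-based authentication is passed as a bearer token
        headers["Authorization"] = f"Bearer {key}"
    return headers, body


# Made-up, partial records for illustration; a real request needs every feature column
records = [{"age": 63, "sex": 1}, {"age": 41, "sex": 0}]
headers, body = build_scoring_request(records, key="<service-key>")

# Sending the request would then look like this, assuming the requests package is installed:
# response = requests.post(scoring_uri, data=body, headers=headers)
```

Keeping the key in a variable or environment setting rather than printing it avoids exposing it in saved notebook output.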


The cells below print the logs of the web service, then delete the service and the compute cluster.

In [None]:
print(service.get_logs())

2021-02-28T17:19:55,357322800+00:00 - iot-server/run 
2021-02-28T17:19:55,364538500+00:00 - rsyslog/run 
2021-02-28T17:19:55,360277400+00:00 - gunicorn/run 
2021-02-28T17:19:55,428677100+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_09ff55f546b313bb1ab136a466214499/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_09ff55f546b313bb1ab136a466214499/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_09ff55f546b313bb1ab136a466214499/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_09ff55f546b313bb1ab136a466214499/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_09ff55f546b313bb1ab136a466214499/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [None]:
service.delete()
cpu_cluster.delete()

Current provisioning state of AmlCompute is "Deleting"

