# Use Case

The goal is to classify the presence of heart disease in a patient. The dataset contains 13 attributes: age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0, 1, 2), maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, and thal (0 = normal; 1 = fixed defect; 2 = reversible defect). The response is the "target" field, which refers to the presence of heart disease in the patient. It is integer-valued: 0 = no/less chance of heart attack and 1 = more chance of heart attack. The imported data is shown in the table below. The CSV file is available here: [Data](https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility)

![dataset](./data.png)



## Preparation of the resource

1. Log into the Azure portal and create an Azure Machine Learning resource.
2. Download the necessary data and the notebook file.
3. Go to Azure Machine Learning Studio.
4. Create a new Dataset from the downloaded CSV file, as shown in the first picture.
5. Create a compute cluster:
![dataset](./cluster.png)
6. Import the notebook file, replace the compute cluster name and the file name if needed, and run all cells.

In [40]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

## Load previously imported data from Azure ML Datasets 

In [26]:
# Load Data
aml_dataset = ws.datasets['hearth-data']

# Use Pandas DataFrame just to check schema
full_df = aml_dataset.to_pandas_dataframe()
full_df.head(5)

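| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |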


In [27]:
# Use Pandas DataFrame just to investigate the dataset's schema and info
full_df.describe()

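| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.37 | 0.68 | 0.97 | 131.62 | 246.26 | 0.15 | 0.53 | 149.65 | 0.33 | 1.04 | 1.4 | 0.73 | 2.31 | 0.54 |
| std | 9.08 | 0.47 | 1.03 | 17.54 | 51.83 | 0.36 | 0.53 | 22.91 | 0.47 | 1.16 | 0.62 | 1.02 | 0.61 | 0.5 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |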


## Split Dataset in test and train Tabular Datasets

In [28]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=234)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

         age    sex     cp  trestbps   chol    fbs  restecg  thalach  exang  \
count 254.00 254.00 254.00    254.00 254.00 254.00   254.00   254.00 254.00   
mean   54.18   0.68   1.02    131.46 246.12   0.15     0.52   149.91   0.32   
std     9.20   0.47   1.03     17.98  52.21   0.35     0.52    23.04   0.47   
min    29.00   0.00   0.00     94.00 126.00   0.00     0.00    71.00   0.00   
25%    47.00   0.00   0.00    120.00 212.00   0.00     0.00   134.50   0.00   
50%    55.00   1.00   1.00    130.00 240.00   0.00     1.00   154.00   0.00   
75%    60.00   1.00   2.00    140.00 273.75   0.00     1.00   167.75   1.00   
max    77.00   1.00   3.00    200.00 564.00   1.00     2.00   202.00   1.00   

       oldpeak  slope     ca   thal  target  
count   254.00 254.00 254.00 254.00  254.00  
mean      1.00   1.41   0.71   2.29    0.57  
std       1.11   0.61   1.04   0.61    0.50  
min       0.00   0.00   0.00   0.00    0.00  
25%       0.00   1.00   0.00   2.00    0.00  
50%       0.
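
The training split above reports 254 of the 303 rows (about 84%), not exactly 80%. This is consistent with `random_split` treating the percentage as approximate. The sketch below is not the Azure ML implementation; it only assumes, for illustration, that each row is assigned to the first part independently with the given probability. The helper name `approximate_split` and the stand-in row list are hypothetical.

```python
import random


def approximate_split(rows, percentage, seed):
    """Assign each row to the first part with probability `percentage`.

    Assumption for illustration only: rows are assigned independently,
    so the resulting sizes are close to, but not exactly, the requested share.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(303))  # stand-in for the 303 records of the dataset
train_rows, test_rows = approximate_split(rows, 0.8, seed=234)

# The two parts always cover every row exactly once;
# the exact sizes depend on the seed.
print(len(train_rows), len(test_rows))
```

Re-running with the same seed returns the same two parts, which is why the notebook fixes `seed=234`.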

## Connect to Compute Cluster
Provide the name of the compute cluster created in step 5 of the preparation.

In [29]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "automl-cluster"

# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    print('Found existing training cluster.')
    # Get the existing cluster.
    # Method 1:
    aml_remote_compute = cts[amlcompute_cluster_name]
    # Method 2:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
else:
    print('Creating a new training cluster...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D13_V2",
                                                                max_nodes=20)
    # Create the cluster.
    aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

print('Checking cluster status...')

aml_remote_compute.wait_for_completion(show_output=True, min_node_count=0, timeout_in_minutes=20)
    


Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [30]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status().serialize()

{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-12-29T17:05:49.942000+00:00',
 'errors': None,
 'creationTime': '2020-12-29T14:25:42.793760+00:00',
 'modifiedTime': '2020-12-29T14:25:58.595569+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 0,
  'maxNodeCount': 1,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_DS2_V2'}

## List primary metric to drive the AutoML classification problem
The chosen primary metric is 'accuracy', where values closer to 1.00 are better.

In [31]:
from azureml.train import automl

# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['norm_macro_recall',
 'average_precision_score_weighted',
 'precision_score_weighted',
 'AUC_weighted',
 'accuracy']

## Define AutoML Experiment settings
Chosen settings are:
- Label column name - target
- Task - classification
- Metric - accuracy

In [32]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './automlclassification'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='accuracy',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="target",
                             n_cross_validations=5,                                                
                             iteration_timeout_minutes=5,                                                    
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                            )
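
The `n_cross_validations=5` setting asks for 5-fold cross-validation on the training data. The sketch below illustrates the general idea of k-fold partitioning on the 254 training rows; it is a plain-Python illustration, not AutoML's internal procedure (which may shuffle or stratify the rows). The helper `kfold_indices` is hypothetical.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row each.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


# 254 rows is the size of the training split reported earlier.
folds = list(kfold_indices(254, 5))
for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} valid={len(valid_idx)}")
```

Each candidate model is then fitted five times, once per fold, and its metric is averaged over the five validation parts.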

## Run Experiment

In [33]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "classif-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

classif-automl-remote-12-29-2020-17
Running on remote.
No run_configuration provided, running on automl-cluster with default configuration
Running on remote compute: automl-cluster
Parent Run ID: AutoML_2dff5011-8cea-4d38-b365-daa674b3dbcc

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              L

## Measure Parent Run Time needed for the whole AutoML process 

In [34]:
from datetime import datetime

run_details = run.get_details()

# Drop the fractional seconds before parsing the UTC timestamps
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
end_dt = datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S")
start_dt = datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S")

# Subtract the datetimes directly, so no local-time conversion is involved
parent_run_time = (end_dt - start_dt).total_seconds()
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))

Run Timing: --- 1383.0 seconds needed for running the whole Remote AutoML Experiment ---


## Retrieve the Best Model for later testing

In [35]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: classif-automl-remote-12-29-2020-17,
Id: AutoML_2dff5011-8cea-4d38-b365-daa674b3dbcc_15,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    max_leaf_nodes=None,
                                                                                                    max_samples=None,
    

#### Best model: Soft Voting Classifier
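
A soft voting classifier combines several models by averaging their predicted class probabilities and picking the class with the highest average. The sketch below shows that idea in plain Python. The probabilities are made up for illustration, the helper `soft_vote` is hypothetical, and the actual members and weights of the AutoML ensemble are not shown in the output above.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities and return (label, averaged_probs)."""
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0] * n_models  # equal weights unless stated otherwise
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Hypothetical [P(class 0), P(class 1)] outputs of three models for one patient.
model_probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
label, averaged = soft_vote(model_probs)
print(label, averaged)
```

In this made-up example two of the three models lean towards class 1, so a majority (hard) vote would pick class 1, while the averaged probabilities favour class 0 because the first model is much more confident.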

## Make Predictions

### Extract feature columns from the test dataset for predicting
The "target" column (presence of heart disease) is the label we are about to predict with the best model, so it has to be removed from the test data.

In [36]:
import pandas as pd

if 'target' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('target')

x_test_df = test_dataset_df

### Predictions

In [44]:
# Use of the best model
y_predictions = fitted_model.predict(x_test_df)

print('20 predictions: ')
print(y_predictions[:20])

20 predictions: 
[1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1]


In [45]:
y_predictions.shape

(49,)

### Calculate the accuracy on the test dataset against the previously removed labels

In [46]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
accuracy_score(y_test_df, y_predictions)

Accuracy:


0.8367346938775511
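
Accuracy is the share of predictions that match the true labels; with 49 test rows, the value above corresponds to 41 correct predictions (41/49 ≈ 0.8367). The sketch below recomputes accuracy by hand and also counts the four outcome types, which shows whether the errors are false positives or false negatives. The labels are made up for illustration and are not taken from the experiment; both helper functions are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = pairs.count((0, 0))  # correctly predicted "no/less chance"
    fp = pairs.count((0, 1))  # predicted 1, true label 0
    fn = pairs.count((1, 0))  # predicted 0, true label 1
    tp = pairs.count((1, 1))  # correctly predicted "more chance"
    return tn, fp, fn, tp


# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

For a medical use case such as this one, the split between false negatives and false positives is often more informative than accuracy alone.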