# Using Automated Machine Learning

There are many kinds of machine learning algorithms that you can use to train a model, and it's not always easy to determine the most effective algorithm for your particular data and prediction requirements. Additionally, you can significantly affect the predictive performance of a model by preprocessing the training data, using techniques such as normalization and missing feature imputation. In your quest to find the *best* model for your requirements, you may need to try many combinations of algorithms and preprocessing transformations, which takes a lot of time and compute resources.
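To make the two preprocessing techniques named above concrete, here is a minimal sketch in plain Python. The helper names (`impute_mean`, `min_max_normalize`) and the sample column are hypothetical and only illustrate the idea; they are not what automated machine learning runs internally.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing reading
plasma_glucose = [171, 92, None, 103, 85]

imputed = impute_mean(plasma_glucose)
scaled = min_max_normalize(imputed)

print(imputed)
print(scaled)
```

Each choice here (mean versus median imputation, min-max versus standard scaling) is one more dimension to combine with the choice of algorithm, which is why the search space grows so quickly.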

Azure Machine Learning enables you to automate the comparison of models trained using different algorithms and preprocessing options. You can use the visual interface in [Azure Machine Learning studio](https://ml.azure.com) or the SDK to leverage this capability. The SDK gives you greater control over the settings for the automated machine learning experiment, but the visual interface is easier to use. In this lab, you'll explore automated machine learning using the SDK.

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open this notebook in Jupyter on your Compute Instance.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If you do not have a current authenticated session with your Azure subscription, you'll be prompted to authenticate. Follow the instructions to authenticate using the code provided.

In [2]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.9.0 to work with ms-learn-ml
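
`Workspace.from_config()` reads the workspace details from a saved JSON configuration file. As a hedged sketch, assuming the file uses the `subscription_id`, `resource_group`, and `workspace_name` keys, the following plain-Python check reports any key that is missing. The helper name and the placeholder values are hypothetical.

```python
import json

# Keys assumed to be present in the saved workspace config file
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def missing_config_keys(config_text):
    """Return the required keys that are absent from a JSON config string."""
    config = json.loads(config_text)
    return sorted(REQUIRED_KEYS - set(config))


# Placeholder values only; substitute the details of your own workspace
example_config = json.dumps({
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
})

print(missing_config_keys(example_config))
print(missing_config_keys('{"workspace_name": "<your-workspace-name>"}'))
```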


## Prepare Data for Automated Machine Learning

You don't need to create a training script for automated machine learning, but you do need to load the training data. In this case, you'll create a dataset containing details of diabetes patients (just as you did in previous labs), and then split this into two datasets: one for training, and another for model validation.

In [3]:
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

if 'diabetes dataset' not in ws.datasets:
    default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                        target_path='diabetes-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)

    #Create a tabular dataset from the path on the datastore (this may take a short while)
    tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))

    # Register the tabular dataset
    try:
        tab_data_set = tab_data_set.register(workspace=ws, 
                                name='diabetes dataset',
                                description='diabetes data',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')


# Split the dataset into training and validation subsets
diabetes_ds = ws.datasets.get("diabetes dataset")
train_ds, test_ds = diabetes_ds.random_split(percentage=0.7, seed=123)
print("Data ready!")

Dataset already registered.
Data ready!
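
The `percentage` passed to `random_split` is approximate rather than exact. One way to picture this is a split in which each row is assigned independently with the given probability. The sketch below is a plain-Python illustration under that assumption, with a hypothetical helper name; it is not the SDK's actual implementation.

```python
import random


def approximate_split(rows, percentage, seed):
    """Send each row to the first subset with probability `percentage`."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(15000))  # stand-in for the rows of the dataset
train_rows, test_rows = approximate_split(rows, percentage=0.7, seed=123)

# The sizes are close to 70% / 30% of the rows, but rarely exactly that
print(len(train_rows), len(test_rows))
```

Under this kind of scheme the subset sizes vary slightly around the requested proportions, while the same seed always reproduces the same split.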


In [6]:
# My own exploration: what happens when two CSV files are loaded into one tabular dataset in Azure ML?
# The first file has 10,000 rows and the second has 5,000; when the dataset is retrieved, the rows are combined.
# I wonder what would happen if I loaded two files with different schemas.
import pandas as pd

train_df = train_ds.to_pandas_dataframe()
test_df = test_ds.to_pandas_dataframe()



In [8]:
# Only the last expression in a cell is displayed, so the output below is test_df.shape
train_df.shape
test_df.shape

(4445, 10)

In [9]:
diabetes_df = diabetes_ds.to_pandas_dataframe()

In [11]:
diabetes_df.shape

(15000, 10)

In [13]:
# Register a second dataset from files with two different schemas to see what happens.
# In Storage Explorer I created a folder, combined-data, containing the Titanic and diabetes data together.
if 'combined dataset' not in ws.datasets:
    
    #Create a tabular dataset from the path on the datastore (this may take a short while)
    tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'combined-data/*.csv'))

    # Register the tabular dataset
    try:
        tab_data_set = tab_data_set.register(workspace=ws, 
                                name='combined dataset',
                                description='one dataset two different schema',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

In [14]:
combined_ds = ws.datasets.get('combined dataset')
combined_df = combined_ds.to_pandas_dataframe()    

In [15]:
combined_df.shape

(10891, 12)

In [18]:
combined_df.head()
#that's some funny looking data

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Column11,Column12
0,1354778,0,171,80.0,34.0,23.0,43.509726,1.213191,21.0,0.0,,
1,1147438,8,92,93.0,47.0,36.0,21.240576,0.158365,23.0,0.0,,
2,1640031,7,115,47.0,52.0,35.0,41.511523,0.079019,23.0,0.0,,
3,1883350,9,103,78.0,25.0,304.0,29.582192,1.28287,43.0,1.0,,
4,1424119,1,85,59.0,27.0,35.0,42.604536,0.549542,22.0,0.0,,
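
The preview above suggests that files with different schemas were read under a single set of column names. A cheap guard is to compare the header rows before building one dataset from several files. The sketch below uses only the standard library; the file names and abbreviated contents are hypothetical stand-ins for the real CSV files.

```python
import csv
import io


def read_header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def group_by_header(files):
    """Group file names by their header row so mismatches are easy to spot."""
    groups = {}
    for name, text in files.items():
        groups.setdefault(tuple(read_header(text)), []).append(name)
    return groups


# Hypothetical, heavily abbreviated file contents
files = {
    "diabetes.csv": "PatientID,Pregnancies,Diabetic\n1354778,0,0\n",
    "diabetes2.csv": "PatientID,Pregnancies,Diabetic\n1147438,8,0\n",
    "titanic.csv": "PassengerId,Survived,Pclass,Name\n1,0,3,Example Passenger\n",
}

groups = group_by_header(files)
for header, names in groups.items():
    print(len(header), "columns:", names)

if len(groups) > 1:
    print("Schemas differ; consider registering these files as separate datasets.")
```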


## Configure Automated Machine Learning

Now you're ready to configure the automated machine learning experiment. To do this, you'll need a run configuration that includes the required packages for the experiment environment, and a set of configuration settings that specifies how many combinations to try, which metric to use when evaluating models, and so on.

> **Note**: In this example, you'll run the automated machine learning experiment on local compute to avoid waiting for a cluster to start. This will cause each iteration (child-run) to run serially rather than in parallel. For this reason, we're restricting the experiment to 6 iterations to reduce the amount of time taken. In reality, you'd likely try many more iterations on a compute cluster.

In [3]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(name='Automated ML Experiment',
                             task='classification',
                             compute_target='ms-learn-compute-ryan',
                             enable_local_managed=True,
                             training_data = train_ds,
                             validation_data = test_ds,
                             label_column_name='Diabetic',
                             iterations=6,
                             primary_metric = 'AUC_weighted',
                             max_concurrent_iterations=4,
                             featurization='auto'
                             )

print("Ready for Auto ML run.")

Ready for Auto ML run.


## Run an Automated Machine Learning Experiment

OK, you're ready to go. Let's run the automated machine learning experiment.

In [4]:
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, 'diabetes_automl')
automl_run = automl_experiment.submit(automl_config)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Submitting Auto ML experiment...
Running on remote or ADB.


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


Current status: FeaturesGeneration. Completed fit featurizers and featurizing the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality featur

{'runId': 'AutoML_33729639-5bd6-48eb-9705-995f0a315cee',
 'target': 'ms-learn-compute-ryan',
 'status': 'Completed',
 'startTimeUtc': '2020-08-17T04:52:20.318983Z',
 'endTimeUtc': '2020-08-17T04:54:43.023412Z',
 'properties': {'num_iterations': '6',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'ms-learn-compute-ryan',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"6baef95d-c40c-4071-b9bb-c20eee7e2115\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"diabetes-data/*.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"ms-learn-ml\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"375be063-544d-4b99-aed9-5a0e16b1f428\\\\\\", 
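
The run details include `startTimeUtc` and `endTimeUtc`, so the elapsed time of the experiment can be worked out directly. The following is a small sketch using the two timestamps shown in the output above; the helper name is hypothetical, and the format string assumes timestamps with exactly this shape.

```python
from datetime import datetime

# Format of the timestamps shown in the run details above
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(details):
    """Return the elapsed time between the start and end timestamps."""
    start = datetime.strptime(details["startTimeUtc"], TIME_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIME_FORMAT)
    return (end - start).total_seconds()


# Values copied from the run details printed above
details = {
    "startTimeUtc": "2020-08-17T04:52:20.318983Z",
    "endTimeUtc": "2020-08-17T04:54:43.023412Z",
}

duration = run_duration_seconds(details)
print(round(duration, 1), "seconds for 6 iterations")
```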

## Determine the Best Performing Model

When the experiment has completed, view the output in the widget, and click the run that produced the best result to see its details.
Then click the link to view the experiment details in the Azure portal and view the overall experiment details before viewing the details for the individual run that produced the best result. There's lots of information here about the performance of the model generated.

Let's get the best run and the model that it produced.

In [8]:
best_run, fitted_model = automl_run.get_output()
print(best_run)
print(fitted_model)
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)

Package:azureml-core, training version:1.11.0, current version:1.9.0
Package:azureml-dataprep, training version:2.0.2, current version:1.9.3
Package:azureml-pipeline-core, training version:1.11.0, current version:1.9.0
Package:azureml-telemetry, training version:1.11.0, current version:1.9.0
Package:azureml-train-automl-client, training version:1.11.0, current version:1.9.0.post1
Package:azureml-dataset-runtime, training version:1.11.0.post1
Package:azureml-defaults, training version:1.11.0
Package:azureml-explain-model, training version:1.11.0
Package:azureml-interpret, training version:1.11.0
Package:azureml-model-management-sdk, training version:1.0.1b6.post1
Package:azureml-train-automl-runtime, training version:1.11.0.post1


Run(Experiment: diabetes_automl,
Id: AutoML_33729639-5bd6-48eb-9705-995f0a315cee_5,
Type: azureml.scriptrun,
Status: Completed)
None
AUC_weighted 0.9901764737796902
AUC_micro 0.9910976426034485
log_loss 0.13358770478780743
accuracy 0.9511811023622048
AUC_macro 0.9901764737796902
f1_score_weighted 0.9511351712881654
average_precision_score_micro 0.9913166477760578
precision_score_weighted 0.9511022342093305
precision_score_macro 0.945771464058234
recall_score_micro 0.9511811023622048
average_precision_score_macro 0.9886069234192706
precision_score_micro 0.9511811023622048
weighted_accuracy 0.9568391994717691
f1_score_macro 0.944926125188499
f1_score_micro 0.9511811023622048
average_precision_score_weighted 0.9907667853390161
recall_score_weighted 0.9511811023622048
norm_macro_recall 0.8881939875724747
balanced_accuracy 0.9440969937862373
matthews_correlation 0.8898668824132099
confusion_matrix aml://artifactId/ExperimentRun/dcid.AutoML_33729639-5bd6-48eb-9705-995f0a315cee_5/confusion_ma
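
Notice that `AUC_weighted` and `AUC_macro` are identical in this output. Assuming both are averages of per-class one-vs-rest AUC values (unweighted for macro, weighted by class frequency for weighted), this is expected for a two-class problem, because the AUC for each class against the other is the same number. The sketch below demonstrates this with a hypothetical set of labels and predicted probabilities; it is an illustration, not the code automated machine learning uses to compute its metrics.

```python
def auc(scores, labels, positive):
    """Chance that a random positive example outscores a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    # Count each positive/negative pair: 1 for a win, 0.5 for a tie
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of class 1
labels = [0, 0, 0, 1, 1, 0, 1, 0]
p_diabetic = [0.1, 0.4, 0.35, 0.8, 0.3, 0.2, 0.9, 0.6]

# One-vs-rest AUC for each class; class 0 is scored with 1 - p
auc_1 = auc(p_diabetic, labels, positive=1)
auc_0 = auc([1 - p for p in p_diabetic], labels, positive=0)

counts = {c: labels.count(c) for c in (0, 1)}
macro = (auc_0 + auc_1) / 2
weighted = (counts[0] * auc_0 + counts[1] * auc_1) / len(labels)

print(auc_0, auc_1)
print(macro, weighted)
```

With more than two classes the per-class values generally differ, and the two averages can diverge when the classes are imbalanced.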

Automated machine learning includes the option to try preprocessing the data, which is accomplished through the use of [Scikit-Learn transformation pipelines](https://scikit-learn.org/stable/modules/compose.html#combining-estimators) (not to be confused with Azure Machine Learning pipelines!). These produce models that include steps to transform the data before inferencing. You can view the steps in a model like this:

In [9]:
for step in fitted_model.named_steps:
    print(step)

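AttributeError: 'NoneType' object has no attribute 'named_steps'

> **Note**: In the run captured here, `get_output()` returned `None` for the fitted model (see the `None` printed in the previous output), so this cell raised the error shown. The package messages in that output report that the model was trained with version 1.11.0 of the SDK packages while this notebook environment has 1.9.0. That mismatch is a likely cause, and aligning the SDK versions before calling `get_output()` again may resolve it.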
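
A more defensive version of the loop checks for a missing model before reading `named_steps`. This is a minimal sketch: `step_names` is a hypothetical helper, and `FakePipeline` is a stand-in used only so the example runs without scikit-learn installed.

```python
def step_names(model):
    """Return the pipeline step names, or an empty list if there is no model."""
    if model is None:
        return []
    # A scikit-learn Pipeline exposes its steps as a dict-like `named_steps`
    return list(getattr(model, "named_steps", {}))


class FakePipeline:
    """Hypothetical stand-in for a fitted scikit-learn Pipeline."""
    named_steps = {"scaler": object(), "classifier": object()}


print(step_names(None))
print(step_names(FakePipeline()))
```

In the notebook you would call `step_names(fitted_model)` and treat an empty result as a prompt to investigate why no model object was returned.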

Finally, having found the best performing model, you can register it.

In [7]:
from azureml.core import Model

# Register model
best_run.register_model(model_path='outputs/model.pkl', model_name='diabetes_model_automl',
                        tags={'Training context':'Auto ML'},
                        properties={'AUC': best_run_metrics['AUC_weighted'], 'Accuracy': best_run_metrics['accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model_automl version: 1
	 Training context : Auto ML
	 AUC : 0.9901764737796902
	 Accuracy : 0.9511811023622048


diabetes_model version: 5
	 Training context : Hyperdrive
	 AUC : 0.856969468262725
	 Accuracy : 0.7891111111111111


diabetes_model version: 4
	 Training context : Inline Training
	 AUC : 0.8733718095427634
	 Accuracy : 0.8866666666666667


diabetes_model version: 3
	 Training context : Inline Training
	 AUC : 0.8761281655803771
	 Accuracy : 0.89


diabetes_model version: 2
	 Training context : Pipeline


diabetes_model version: 1
	 Training context : Pipeline




### More Information

For more information about automated machine learning, see the [Azure ML documentation](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train).

## Clean Up

If you've finished exploring, you can close this notebook and shut down your Compute Instance.