# Automated ML

In [1]:
import numpy as np
import os

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.model import Model
from azureml.core import Webservice

from azureml.core.compute import AmlCompute, ComputeTarget

from azureml.core.webservice import AciWebservice
from azureml.widgets import RunDetails
import joblib

## Dataset

### Overview
The dataset is the [Mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) dataset from UCI. It contains descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible or definitely poisonous, so the task is a binary classification: predict whether a mushroom is edible or poisonous.

In [2]:
ws = Workspace.from_config()
experiment_name = 'automl-exp'
experiment = Experiment(ws, experiment_name)

To make the data easy to access, we downloaded it from [Kaggle](https://www.kaggle.com/uciml/mushroom-classification) and saved it in our GitHub repository. The notebook reads the dataset directly from that link.

In [3]:
key = "Mushrooms"

# Reuse the dataset if it is already registered in the workspace.
if key in ws.datasets.keys():
    dataset = ws.datasets[key]
else:
    # Otherwise read the CSV from GitHub (without column type inference) and register it.
    dataset = Dataset.Tabular.from_delimited_files('https://raw.githubusercontent.com/sannif/udacity_capstone_project/main/dataset/mushrooms.csv', infer_column_types=False)
    dataset = dataset.register(workspace=ws, name=key)

dataset.take(5).to_pandas_dataframe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


## Compute
Create a compute cluster to run the AutoML experiment.

In [4]:
compute_name = "cluster1"
vm_size = "Standard_DS12_v2"
min_nodes, max_nodes = 1, 6
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size, min_nodes = min_nodes, max_nodes = max_nodes)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
compute_target.wait_for_completion(show_output=True)

creating new compute target...
Creating......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded.....................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration
For AutoML, we chose accuracy as the primary metric because the classes are balanced. The experiment timeout is set to 20 minutes, so the run stops after at most 20 minutes.

In [5]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric": "accuracy",
}
automl_config = AutoMLConfig(
    compute_target=compute_target,
    task="classification",
    training_data=dataset,
    label_column_name="class",
    path=".",
    enable_early_stopping=True,
    featurization="auto",
    debug_log="automl_errors.log",
    **automl_settings,
)
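
Accuracy is only a sound primary metric when the classes are roughly balanced, so it is worth checking that before trusting it. The sketch below shows one way to do so. The helper name `class_proportions` is hypothetical, and the five labels are just the `class` values from the preview rows displayed earlier, not the full dataset.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each label among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# `class` values of the five preview rows shown earlier in the notebook.
# On the full data, the labels could instead be taken from
# dataset.to_pandas_dataframe()["class"] (assumes the dataset cell has run).
preview_labels = ["p", "e", "e", "p", "e"]
proportions = class_proportions(preview_labels)
print(proportions)
```

Shares close to 0.5 for both classes on the full dataset would support the choice of accuracy. Five rows are far too few to conclude anything.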

In [6]:
automl_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
automl-exp,AutoML_aabaaec6-778c-4ba5-a0a4-5a4b8dbd7790,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

More than half of the models reached 100% accuracy, which suggests that the classification task is fairly simple.  
The algorithms tested were *XGBoostClassifier*, *LightGBM*, *Logistic Regression*, *Random Forest*, and *ExtremeRandomTrees*. Of these, *Random Forest* and *ExtremeRandomTrees* did not reach 100% accuracy.
Moreover, *XGBoostClassifier* with a *StandardScalerWrapper* for preprocessing produced models with 100% accuracy as well as models with 52% accuracy, which shows how much the hyperparameters matter.

In [7]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…
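
The per-algorithm comparison can also be done in code rather than by reading the widget. The following is a minimal sketch, assuming each child run exposes a `run_algorithm` entry in its `properties` and logs an `accuracy` metric. The helper name `accuracy_by_algorithm` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_algorithm(child_runs):
    """Return {algorithm: (lowest accuracy, highest accuracy)} over child runs."""
    grouped = defaultdict(list)
    for run in child_runs:
        algorithm = run.properties.get("run_algorithm")
        accuracy = run.get_metrics().get("accuracy")
        if algorithm is None or accuracy is None:
            continue  # skip runs without an algorithm name or a logged accuracy
        grouped[algorithm].append(accuracy)
    # A wide (min, max) gap points to sensitivity to hyperparameters.
    return {name: (min(values), max(values)) for name, values in grouped.items()}


# In the notebook, once the experiment has finished (assumption):
# accuracy_by_algorithm(automl_run.get_children())
```

A large gap between the lowest and highest accuracy for one algorithm is the pattern described above for *XGBoostClassifier*.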

## Best Model

In [None]:
best_run, fitted_model = automl_run.get_output()

In [9]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-exp,AutoML_aabaaec6-778c-4ba5-a0a4-5a4b8dbd7790_0,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


Below are the properties of the best model. It is a *LightGBM* classifier with *MaxAbsScaler* as the preprocessing step. *min_data_in_leaf* is the only hyperparameter set explicitly, with a value of 20, which coincides with LightGBM's own default for that parameter.

In [11]:
best_run.get_details()['properties']

{'runTemplate': 'automl_child',
 'pipeline_id': '5dfac790c5c209f98a1da2dc1c7fb76f0397324f',
 'pipeline_spec': '{"objects":[{"spec_class":"preproc","class_name":"MaxAbsScaler","module":"sklearn.preprocessing","param_args":[],"param_kwargs":{},"prepared_kwargs":{}},{"spec_class":"sklearn","class_name":"LightGBMClassifier","module":"automl.client.core.common.model_wrappers","param_args":[],"param_kwargs":{"min_data_in_leaf":20},"prepared_kwargs":{}}],"pipeline_id":"5dfac790c5c209f98a1da2dc1c7fb76f0397324f","module":"sklearn.pipeline","class_name":"Pipeline"}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '0',
 '_aml_system_scenario_identification': 'Remote.Child',
 '_azureml.ComputeTargetType': 'amlcompute',
 'ContentSnapshotId': 'c6f03ea3-002a-4784-97cc-e10963dbbf1d',
 'ProcessInfoFile': 'azureml-logs/process_info.json',
 'ProcessStatusFile': 'azureml-logs/process_status.json',
 'run_preprocessor': 'MaxAbsScaler',
 'run_algorithm': 'LightGBM',
 'model_output_path': 
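
The `pipeline_spec` property is a JSON string, so the preprocessing step, the algorithm, and the explicitly set hyperparameters can be read programmatically instead of by eye. The sketch below parses the value copied from the output above. The helper name `pipeline_steps` is hypothetical.

```python
import json

# Copied from the 'pipeline_spec' property in the output above. In the notebook
# it would come from best_run.get_details()['properties']['pipeline_spec'].
pipeline_spec = (
    '{"objects":['
    '{"spec_class":"preproc","class_name":"MaxAbsScaler",'
    '"module":"sklearn.preprocessing","param_args":[],'
    '"param_kwargs":{},"prepared_kwargs":{}},'
    '{"spec_class":"sklearn","class_name":"LightGBMClassifier",'
    '"module":"automl.client.core.common.model_wrappers","param_args":[],'
    '"param_kwargs":{"min_data_in_leaf":20},"prepared_kwargs":{}}],'
    '"pipeline_id":"5dfac790c5c209f98a1da2dc1c7fb76f0397324f",'
    '"module":"sklearn.pipeline","class_name":"Pipeline"}'
)


def pipeline_steps(spec_json):
    """Return a (class_name, param_kwargs) pair for each pipeline step."""
    spec = json.loads(spec_json)
    return [(step["class_name"], step["param_kwargs"]) for step in spec["objects"]]


steps = pipeline_steps(pipeline_spec)
for class_name, params in steps:
    print(class_name, params)
# MaxAbsScaler {}
# LightGBMClassifier {'min_data_in_leaf': 20}
```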

### Save the model

In [13]:
os.makedirs('models', exist_ok=True)
joblib.dump(fitted_model, 'models/auto_ml.pkl')

['models/auto_ml.pkl']

It is worth noting that, according to the *Data guardrails* of the AutoML experiment in Azure ML Studio, AutoML failed to detect and remove the *veil-type* column, which has a single unique value.
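
A column with a single unique value carries no information for the classifier, and such columns can be detected before training. The sketch below illustrates the check on four columns of the five preview rows displayed earlier. The helper name `constant_columns` is hypothetical, and the rows are only a sample, not the full dataset.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    if not rows:
        return []
    return [
        column
        for column in rows[0]
        if len({row[column] for row in rows}) == 1
    ]


# Four columns of the five preview rows displayed earlier in this notebook.
preview = [
    {"class": "p", "cap-shape": "x", "veil-type": "p", "habitat": "u"},
    {"class": "e", "cap-shape": "x", "veil-type": "p", "habitat": "g"},
    {"class": "e", "cap-shape": "b", "veil-type": "p", "habitat": "m"},
    {"class": "p", "cap-shape": "x", "veil-type": "p", "habitat": "u"},
    {"class": "e", "cap-shape": "x", "veil-type": "p", "habitat": "g"},
]
constant = constant_columns(preview)
print(constant)  # ['veil-type']
```

On the full data loaded as a pandas DataFrame `df`, the equivalent check would be `df.columns[df.nunique() == 1]`.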

## Model Deployment
Since AutoML and HyperDrive give the same performance, we chose to deploy the HyperDrive model.

## Cleaning resources

In [12]:
compute_target.delete()