# Use Case

This use case classifies glass by type. The dataset has 10 columns: nine features and the label. The features are the refractive index (RI) and the content of eight elements, measured as weight percent of the corresponding oxide: Na (sodium), Mg (magnesium), Al (aluminum), Si (silicon), K (potassium), Ca (calcium), Ba (barium) and Fe (iron). The response is the glass type:

1. building windows, float processed
2. building windows, non-float processed
3. vehicle windows, float processed
4. vehicle windows, non-float processed
5. containers
6. tableware
7. headlamps

The imported data is shown in the table below. The CSV file is available here: [Data](https://www.kaggle.com/uciml/glass)

![dataset](./data.png)
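
For reading results later, it can help to translate the numeric `Type` codes back into the names listed above. The lookup below is a minimal sketch; `GLASS_TYPES` and `glass_type_name` are hypothetical names chosen for illustration and are not part of the notebook.

```python
# Glass type codes as listed in the dataset description above.
GLASS_TYPES = {
    1: "building windows, float processed",
    2: "building windows, non-float processed",
    3: "vehicle windows, float processed",
    4: "vehicle windows, non-float processed",
    5: "containers",
    6: "tableware",
    7: "headlamps",
}


def glass_type_name(code):
    """Return the readable name for a numeric Type value (hypothetical helper)."""
    return GLASS_TYPES.get(int(code), "unknown type {}".format(code))


print(glass_type_name(1))
print(glass_type_name(7))
```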



##  Preparation of the resource

1. Log in to the Azure portal and create an Azure Machine Learning resource.
2. Download the necessary data and the notebook file.
3. Go to Azure Machine Learning Studio.
4. Create a new Dataset from the downloaded CSV file, as shown in the first picture.
5. Create a compute cluster:
![compute cluster](./cluster.png)
6. Import the notebook file, replace the compute cluster name and the file name if needed, and run all cells.

In [1]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

## Load previously imported data from Azure ML Datasets 

In [2]:
# Load Data
aml_dataset = ws.datasets['glass-data']

# Use Pandas DataFrame just to check schema
full_df = aml_dataset.to_pandas_dataframe()
full_df.head(5)

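|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |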


In [3]:
# Use Pandas DataFrame just to investigate the dataset's schema and info
full_df.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 | 2.780374 |
| std | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 | 2.103739 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 | 1.0 |
| 25% | 1.516523 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 | 1.0 |
| 50% | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 | 2.0 |
| 75% | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 | 3.0 |
| max | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 | 7.0 |


## Split Dataset into train and test Tabular Datasets

In [4]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=234)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

               RI          Na          Mg          Al          Si           K  \
count  181.000000  181.000000  181.000000  181.000000  181.000000  181.000000   
mean     1.518499   13.405414    2.743370    1.413757   72.631657    0.498453   
std      0.003105    0.819509    1.419048    0.493881    0.806918    0.687429   
min      1.511150   10.730000    0.000000    0.290000   69.810000    0.000000   
25%      1.516550   12.930000    2.200000    1.180000   72.280000    0.140000   
50%      1.517780   13.270000    3.480000    1.350000   72.780000    0.550000   
75%      1.519260   13.800000    3.610000    1.580000   73.080000    0.600000   
max      1.533930   17.380000    4.490000    3.500000   75.410000    6.210000   

               Ca          Ba          Fe        Type  
count  181.000000  181.000000  181.000000  181.000000  
mean     8.972983    0.157403    0.053481    2.640884  
std      1.423647    0.493052    0.090606    2.010607  
min      5.870000    0.000000    0.000000    1
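
The training part has 181 of the 214 rows, which is roughly 85% rather than exactly 80%. This fits `random_split` splitting only approximately by the requested percentage. The sketch below shows one way such a split can behave: each row is assigned independently with probability 0.8. It is an illustration under that assumption, not Azure ML's implementation, and `approximate_split` is a hypothetical helper, so its row counts will not necessarily match the ones above.

```python
import random


def approximate_split(n_rows, percentage, seed):
    """Assign each row index to the first part with probability `percentage`."""
    rng = random.Random(seed)
    first, second = [], []
    for i in range(n_rows):
        # Each row is decided on its own, so the first part is only
        # approximately `percentage` of the rows.
        (first if rng.random() < percentage else second).append(i)
    return first, second


train_idx, test_idx = approximate_split(214, 0.8, seed=234)
print(len(train_idx), len(test_idx))  # the two sizes always sum to 214
```

Using the same seed reproduces the same split, which is why the notebook passes `seed=234`.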

## Connect to Compute Cluster
Provide the name of the compute cluster created in step 5 of the preparation.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "automl-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing training cluster.')
    # Get existing cluster
    # Method 1:
    aml_remote_compute = cts[amlcompute_cluster_name]
    # Method 2:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)

if not found:
    print('Creating a new training cluster...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D13_V2",
                                                                max_nodes=20)
    # Create the cluster.
    aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

print('Checking cluster status...')

aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)
    


Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [6]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status().serialize()

{'currentNodeCount': 0,
 'targetNodeCount': 0,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 0,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-12-29T15:56:20.165000+00:00',
 'errors': None,
 'creationTime': '2020-12-29T14:25:42.793760+00:00',
 'modifiedTime': '2020-12-29T14:25:58.595569+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 0,
  'maxNodeCount': 1,
  'nodeIdleTimeBeforeScaleDown': 'PT120S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_DS2_V2'}
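
The `nodeIdleTimeBeforeScaleDown` value `'PT120S'` is an ISO 8601 duration, here 120 seconds. Combined with `minNodeCount` 0, it suggests that idle nodes are released after about two minutes, which would explain `currentNodeCount` 0 when nothing is running. The helper below is a small sketch for reading such values; `iso_duration_seconds` is a hypothetical name and it only handles the simple hour, minute and second form.

```python
import re


def iso_duration_seconds(value):
    """Convert a simple ISO 8601 time duration such as 'PT120S' or 'PT1H30M' to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: {!r}".format(value))
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(iso_duration_seconds("PT120S"))  # 120
```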

## List primary metric to drive the AutoML classification problem
The chosen primary metric is 'accuracy', where values closer to 1.00 are better.

In [8]:
from azureml.train import automl

# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['norm_macro_recall',
 'average_precision_score_weighted',
 'precision_score_weighted',
 'AUC_weighted',
 'accuracy']
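
Accuracy, the metric used in this notebook, is the fraction of rows whose predicted class equals the true class. The function below is a plain-Python sketch of that definition. The labels are made up for illustration and `accuracy` is a hypothetical helper, not the AutoML implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels, for illustration only (not taken from the glass data).
y_true = [1, 1, 2, 2, 7]
y_pred = [1, 2, 2, 2, 7]
print(accuracy(y_true, y_pred))  # 0.8
```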

## Define AutoML Experiment settings
Chosen settings are:
- Label column name - Type
- Task - classification
- Metric - accuracy

In [15]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './automlclassification'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='accuracy',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="Type",
                             n_cross_validations=5,                                                
                             iteration_timeout_minutes=5,                                                    
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                            )
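
With `n_cross_validations=5`, every candidate model is scored on five train/validation partitions of the training data. The sketch below shows how 181 training rows divide into five near-equal folds. It uses contiguous folds for simplicity, which is an assumption: AutoML may shuffle or stratify rows before assigning folds. `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, near-equal folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


# 181 is the training row count reported by describe() earlier.
for train_idx, valid_idx in k_fold_indices(181, 5):
    print(len(train_idx), len(valid_idx))
```

Each validation fold holds only 36 or 37 rows, so scores for the rarer glass types rest on very few examples.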

## Run Experiment

In [16]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "classif-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

classif-automl-remote-12-29-2020-16
Running on remote.
No run_configuration provided, running on automl-cluster with default configuration
Running on remote compute: automl-cluster
Parent Run ID: AutoML_df6c5ca5-b72f-4a45-9ce2-c869166ace33

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+---------------------------------+--------------------------------------+
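
The class balancing alert can be checked against the class sizes. The sketch below uses the per-class row counts published for the UCI Glass Identification dataset, which is an assumption brought in from outside this notebook. As a consistency check, those counts reproduce the row count (214) and the `Type` mean (2.780374) from the `describe()` output earlier.

```python
# Per-class row counts as published for the UCI Glass Identification dataset
# (an assumption from outside this notebook; type 4 has no rows).
CLASS_COUNTS = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

total = sum(CLASS_COUNTS.values())
mean_type = sum(label * n for label, n in CLASS_COUNTS.items()) / total
print(total, round(mean_type, 6))  # 214 2.780374

largest = max(CLASS_COUNTS.values())
smallest = min(CLASS_COUNTS.values())
print("largest/smallest class ratio: {:.1f}".format(largest / smallest))

for label, n in sorted(CLASS_COUNTS.items()):
    print("type {}: {:3d} rows ({:.1%})".format(label, n, n / total))
```

In the notebook itself, `full_df['Type'].value_counts()` gives the same counts directly from the loaded data.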


## Measure Parent Run Time needed for the whole AutoML process 

In [17]:
from datetime import datetime

run_details = run.get_details()

# Drop fractional seconds so both timestamps match the same format
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
time_format = "%Y-%m-%dT%H:%M:%S"
run_end = datetime.strptime(end_time_utc_str, time_format)
run_start = datetime.strptime(start_time_utc_str, time_format)

# Subtract the datetimes directly; no local-time conversion is involved
parent_run_time = (run_end - run_start).total_seconds()
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))

Run Timing: --- 1580.0 seconds needed for running the whole Remote AutoML Experiment ---
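
If sub-second precision or explicit UTC offsets matter, the timestamps can also be parsed without truncation. The helper below is a sketch: `elapsed_seconds` is a hypothetical name, and the exact format of `startTimeUtc` and `endTimeUtc` is an assumption, so the example uses two timestamps copied from the compute status output earlier.

```python
from datetime import datetime


def elapsed_seconds(start_iso, end_iso):
    """Seconds between two ISO 8601 timestamps, keeping fractions and UTC offsets."""
    def parse(value):
        # Older Python versions do not accept a trailing 'Z' in fromisoformat.
        if value.endswith("Z"):
            value = value[:-1] + "+00:00"
        return datetime.fromisoformat(value)

    return (parse(end_iso) - parse(start_iso)).total_seconds()


# creationTime and modifiedTime from the compute status output.
print(elapsed_seconds("2020-12-29T14:25:42.793760+00:00",
                      "2020-12-29T14:25:58.595569+00:00"))
```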


## Retrieve the 'Best Model' for later testing

In [18]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: classif-automl-remote-12-29-2020-16,
Id: AutoML_df6c5ca5-b72f-4a45-9ce2-c869166ace33_15,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                               max_delta_step=0,
                                                                                               max_depth=3,
                      

#### Best model: Soft Voting Classifier

## Make Predictions

### Extract feature columns from the test dataset for predicting
The glass type (`Type`) is the label we are about to predict with the best model, so it has to be removed from the test data. The remaining feature columns stay in a pandas DataFrame.

In [22]:
import pandas as pd

if 'Type' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('Type')

x_test_df = test_dataset_df

### Predictions

In [23]:
# Use of the best model
y_predictions = fitted_model.predict(x_test_df)

print('20 predictions: ')
print(y_predictions[:20])

20 predictions: 
[1 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 3]


In [24]:
y_predictions.shape

(33,)
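
A quick tally shows which classes the model predicts. The sketch below counts the 20 predictions printed above. It covers only that sample, not all 33 test rows, and `first_predictions` is a hypothetical name.

```python
from collections import Counter

# The 20 predictions printed above.
first_predictions = [1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3]

counts = Counter(first_predictions)
for glass_type, n in sorted(counts.items()):
    print("type {}: {} of {}".format(glass_type, n, len(first_predictions)))
```

In this sample the predictions fall almost entirely on types 1 and 2, the two building-window classes.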

### Calculate accuracy on the test dataset against the previously removed labels

In [25]:
from sklearn.metrics import accuracy_score

print('Accuracy:')
accuracy_score(y_test_df, y_predictions)

Accuracy:


0.7272727272727273
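
With only 33 test rows, the accuracy value can be converted back into a count of correct predictions. The sketch below recovers it from the printed value and the shape of `y_predictions`; the variable names are hypothetical.

```python
from fractions import Fraction

reported_accuracy = 0.7272727272727273  # value printed above
n_test_rows = 33                        # from y_predictions.shape

# Closest fraction whose denominator does not exceed the number of test rows.
ratio = Fraction(reported_accuracy).limit_denominator(n_test_rows)
n_correct = ratio * n_test_rows

print(ratio)      # 8/11
print(n_correct)  # 24
```

So 24 of the 33 test rows were classified correctly and 9 were not. On a test set this small, each additional error moves the accuracy by about three percentage points.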