# USE CASE

Classify whether a breast tumor is malignant or benign based on its size and placement. The data is shown below.

![data-excel](.\data-excel.png)

The data is explained in the **breast-cancer-wisconsin.names** file located in the repository.

# Learn AutoML with UI

I clicked through the UI guide for creating an AutoML experiment. It is really simple. I selected Classification as the task type and the Class column as the target column. All other properties and options were auto-selected for optimization. The selected machine cost $0.09/h. The experiment ran multiple child model pairings to figure out which algorithm has the best accuracy for the job.

![ui-run](.\ui-run.png)



A little over an hour later, the results came back with [VotingEnsemble](https://ml.azure.com/experiments/id/2406bef4-357d-4e25-91f5-f56a73749add/runs/AutoML_ba4a282a-5b8d-46f9-ba48-7749b199b962_43?wsid=/subscriptions/da29bcc9-497c-44b3-95aa-169e164600f6/resourceGroups/AutoML-RG/providers/Microsoft.MachineLearningServices/workspaces/AutoML-Cancer&tid=3b50229c-cd78-4588-9bcf-97b7629e2f0f#model) as the best algorithm, with an accuracy of 97.422%.

# First Steps

1. Log into your Azure subscription and create an Azure Machine Learning resource.

   ![automl-resource](.\automl-resource.png)

2. Open Machine Learning Studio

3. Import data from **breast-cancer-wisconsin.csv**

4. Create a new Notebook and attach a compute instance to it. When creating the compute instance, watch out for pricing. More expensive machines will work faster, though, which may have a bigger impact.

5. Authenticate to Azure by running the code below and following the instructions.

In [None]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

6. Display your imported data.

In [2]:
# Load Data
dataset_name = 'Breast-cancer-wisconsin'
aml_dataset = ws.datasets[dataset_name]

# Use a Pandas DataFrame just to sneak a peek at some data and the schema
full_df = aml_dataset.to_pandas_dataframe()
full_df.head()

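| | ID | Clump Thickness | Cell Size | Cell Shape | Marginal Adhesion | Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |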
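
A note on step 5: `Workspace.from_config()` reads a `config.json` file that identifies the workspace. On a compute instance created from the Studio this file is usually already in place, but when running elsewhere you may have to create it yourself. Below is a minimal sketch of its shape. The three values are placeholders (assumptions, not real identifiers), and the file is written to a temporary directory so nothing real is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): replace them with your own identifiers.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary directory for this sketch; in practice the file
# would sit next to the notebook so that from_config() can find it.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=4))

# Read it back to confirm the structure.
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```

The same file can also be downloaded from the workspace overview page in the Azure portal, which avoids typing the identifiers by hand.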


# Data preparation
As the dataset is not very large and every column is directly connected to the classification of the tumor class, I will not be deleting any columns.

# Splitting data into Train and Test groups
A model has to be trained on a subset of the data.
After training, the model has to be tested on a smaller part of the data that was not used for training.

The ratio I have chosen is 80% of data for training and 20% for testing.

In [3]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=1)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

                 ID  Clump Thickness   Cell Size  Cell Shape  \
count  5.690000e+02       569.000000  569.000000  569.000000   
mean   1.073360e+06         4.405975    3.175747    3.237258   
std    5.929028e+05         2.756944    3.101688    3.002045   
min    6.163400e+04         1.000000    1.000000    1.000000   
25%    8.783580e+05         2.000000    1.000000    1.000000   
50%    1.171578e+06         4.000000    1.000000    2.000000   
75%    1.237674e+06         6.000000    5.000000    5.000000   
max    1.345435e+07        10.000000   10.000000   10.000000   

       Marginal Adhesion  Epithelial Cell Size  Bland Chromatin  \
count         569.000000            569.000000       569.000000   
mean            2.850615              3.270650         3.455185   
std             2.876049              2.282172         2.437640   
min             1.000000              1.000000         1.000000   
25%             1.000000              2.000000         2.000000   
50%             1.000
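
To make the idea of a seeded split concrete, here is a small sketch in plain Python. It is only an illustration of the concept, not how `random_split` is implemented in Azure ML: the helper `split_rows` and the 100 row ids are hypothetical, and this version cuts at exactly 80%, whereas the service-side split appears to be approximate (the training set above has 569 rows).

```python
import random


def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of rows with a fixed seed and cut it at train_fraction."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # hypothetical row ids
train_rows, test_rows = split_rows(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Using the same seed again gives the same two groups, which is what makes results comparable between runs.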

# Connect to Compute Unit
We will select the "Notebook-Breast" compute instance created at the beginning.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

amlcompute_cluster_name = "Notebook-Breast"

found = False
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'ComputeInstance':
     found = True
     print('Found existing training cluster.')
     # Get existing cluster
     # Method 1:
     aml_remote_compute = cts[amlcompute_cluster_name]
     # Method 2:
     # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    
if not found:
     print('Creating a new training cluster...')
     provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D13_V2", # for GPU, use "STANDARD_NC12"
                                                                 #vm_priority = 'lowpriority', # optional
                                                                 max_nodes = 20)
     # Create the cluster.
     aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')

aml_remote_compute.wait_for_completion(show_output = True)

Found existing training cluster.
Checking cluster status...

Running


# Primary metric for Classification

I will use just Accuracy as the primary metric for the Classification method. It is the simplest to compute and to understand.

Its values range from 0 to 1. Closer to 1 is better.
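
As a quick illustration of what the metric measures, here is a minimal sketch in plain Python. The label lists are made up for the example; they use the dataset's class codes (2 = benign, 4 = malignant). The helper `accuracy` is hypothetical and stands in for what a library function such as `accuracy_score` computes.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Made-up labels: 2 = benign, 4 = malignant.
y_true = [2, 4, 4, 2, 2]
y_pred = [2, 4, 2, 2, 2]  # the third tumor is misclassified

acc = accuracy(y_true, y_pred)
print(acc)  # 4 of 5 correct -> 0.8
```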

# Define AutoML Experiment settings (With AML Remote Compute)
Let's define the run configuration.
We set:

- *classification* as the task we are doing
- *accuracy* as the primary metric to be calculated (**Accuracy** is the ratio of predictions that exactly match the true class labels)
- the **Class** column as the target column

In [7]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './strachob'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='accuracy',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="Class",
                             n_cross_validations=5,
                             # blacklist_models='XGBoostClassifier', 
                             # iteration_timeout_minutes=5,                                                    
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )
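
The `n_cross_validations=5` setting asks for 5-fold cross-validation on the training data. The sketch below shows the general idea of k-fold partitioning in plain Python, assuming contiguous folds without shuffling; the helper `k_fold_indices` is hypothetical and this is not a description of how AutoML builds its folds internally. The 569 rows match the training set size shown earlier.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base_size, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base_size + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=569, n_folds=5))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)  # [114, 114, 114, 114, 113]
```

Every row is used for validation exactly once and for training in the other four folds, so the reported metric is an average over five models rather than a single lucky or unlucky split.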

# Run the experiment

!!! Beware it can take some time (up to an hour) !!!

In [8]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "classify-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

classify-automl-remote-12-29-2020-08
Running on remote.
No run_configuration provided, running on Notebook-Breast with default configuration
Running on remote compute: Notebook-Breast
Parent Run ID: AutoML_2294fdb5-c6f2-4570-84f5-08b500378b62

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASS
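
One detail about the experiment name used above: the `"%m-%d-%Y-%H"` format only goes down to the hour. The sketch below replays that formatting with fixed, made-up timestamps to show the consequence; the helper name `build_experiment_name` is hypothetical. As far as I understand, submitting under a name that already exists adds the run to the existing experiment, so runs started within the same hour end up grouped together.

```python
from datetime import datetime


def build_experiment_name(moment):
    """Format the experiment name the same way as the cell above."""
    return "classify-automl-remote-{0}".format(moment.strftime("%m-%d-%Y-%H"))


# Made-up submission times: two within the same hour, one in the next hour.
first = build_experiment_name(datetime(2020, 12, 29, 8, 5))
second = build_experiment_name(datetime(2020, 12, 29, 8, 55))
later = build_experiment_name(datetime(2020, 12, 29, 9, 0))

print(first)            # classify-automl-remote-12-29-2020-08
print(first == second)  # True: same hour, same name
print(first == later)   # False: the next hour gets a new name
```

Adding `%M` to the format string would give every minute its own experiment name, if separate experiments are preferred.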

# Find the Best Model and Show run details


In [11]:
import time
from datetime import datetime

run_details = run.get_details()

# Like: 2020-01-12T23:11:56.292703Z
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
timestamp_end = time.mktime(datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())
timestamp_start = time.mktime(datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())

parent_run_time = timestamp_end - timestamp_start
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))


## Best model


best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run Timing: --- 1395.0 seconds needed for running the whole Remote AutoML Experiment ---
Run(Experiment: classify-automl-remote-12-29-2020-08,
Id: AutoML_2294fdb5-c6f2-4570-84f5-08b500378b62_16,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    min_samples_leaf=0.01,
                              

So it turns out the best model for this particular use case was the "Prefitted Soft Voting Classifier".
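
A side note on the run-timing calculation above: `time.mktime` interprets the parsed values as local time, which still gives the right difference as long as no daylight-saving change falls between start and end. A slightly more direct variant, sketched below, subtracts two UTC datetimes. The `run_details` dictionary here is a made-up stand-in with hypothetical timestamps, chosen so that the result matches the 1395 seconds reported above.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse e.g. '2020-01-12T23:11:56.292703Z', dropping fractional seconds."""
    whole_seconds = timestamp.split(".")[0].rstrip("Z")
    parsed = datetime.strptime(whole_seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical values standing in for run.get_details().
run_details = {
    "startTimeUtc": "2020-12-29T08:05:00.123456Z",
    "endTimeUtc": "2020-12-29T08:28:15.654321Z",
}

elapsed = parse_utc(run_details["endTimeUtc"]) - parse_utc(run_details["startTimeUtc"])
elapsed_seconds = elapsed.total_seconds()
print(elapsed_seconds)  # 1395.0
```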

# Prepare data for testing and classify testing data
Pop the Class column from the test data. It is the value the model has to predict.

In [16]:
import pandas as pd

#Remove Label/y column
if 'Class' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('Class')

x_test_df = test_dataset_df


# Try the best model
y_predictions = fitted_model.predict(x_test_df)

print('10 predictions: ')
print(y_predictions[:10])

y_predictions.shape


## Show the accuracy

from sklearn.metrics import accuracy_score

print('Accuracy:')
accuracy_score(y_test_df, y_predictions)

10 predictions: 
[4 2 4 2 2 4 4 4 4 2]
Accuracy:


0.9692307692307692

This time the accuracy was just under 97%, but the run took only about a quarter of the time (1395 seconds, roughly 23 minutes).