# AppEase Machine Learning

In this notebook, you use automated machine learning in Azure Machine Learning service to create a classification model to predict labels. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

To run this notebook, you need an Azure subscription and the data and labels JSON files in the local directory. The two files must be mergeable on their index and must contain at least 63 records of data for training (the minimum, assuming a 0.8/0.2 train/test split).
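The two requirements above (a shared index and a minimum record count) can be checked before any Azure resources are created. The following is a minimal sketch, not part of the original notebook: `check_inputs`, `MIN_RECORDS`, and the stand-in data are hypothetical names and values introduced for illustration.

```python
import pandas as pd

MIN_RECORDS = 63  # minimum stated in this notebook for a 0.8/0.2 split


def check_inputs(data: pd.DataFrame, labels: pd.Series, min_records: int = MIN_RECORDS) -> int:
    """Return the number of records present in both inputs, or raise ValueError."""
    shared = data.index.intersection(labels.index)
    if len(shared) < min_records:
        raise ValueError(
            f"only {len(shared)} records share an index; need at least {min_records}"
        )
    return len(shared)


# Hypothetical stand-in data; the notebook itself reads both objects from JSON files.
demo_data = pd.DataFrame({"heartRate": range(70)})
demo_labels = pd.Series([i % 2 for i in range(65)], name="Label")

n_shared = check_inputs(demo_data, demo_labels)
print(n_shared)
```

Running a check like this first avoids provisioning a workspace only to find that the merge produces too few rows.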

We found it easier to run this notebook in an Azure Data Science Virtual Machine using a Python 3 kernel to ensure that all the necessary packages were available.


In [1]:
# import packages
import azureml.core
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.model_selection import train_test_split
import logging
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails
from sklearn.metrics import mean_squared_error
from math import sqrt
import pickle

In [2]:
# required info
subscription = '574c17c0-996b-490d-a26a-3faa4e105b6d'

# you choose these
workspace_resource_group = 'AppEase' # replace this if you'd like to use a pre-built resource group
workspace_loc = 'eastus' # feel free to change this
workspace_name = 'appeasewML'
compute_cluster_name = 'appeasecompute'

# name of files in local directory with data and labels (to be merged on indexes)
# NOTE: data_file must contain at least 63 records of data for training (with 0.8/0.2 train/test split)
data_file_name = 'simulated_health_data.json'
labels_file_name = 'random_labels.json'
label_column_name = 'Label'

In [3]:
# create an Azure workspace
# only ask Azure to create the resource group when none was named above
create_RG = workspace_resource_group is None

try:
    # reuse the workspace if it already exists
    ws = Workspace.get(
        name=workspace_name,
        subscription_id=subscription,
        resource_group=workspace_resource_group,
    )
except Exception:
    # otherwise create it
    ws = Workspace.create(
        name=workspace_name,
        subscription_id=subscription,
        resource_group=workspace_resource_group,
        create_resource_group=create_RG,
        location=workspace_loc,
    )

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EFA3D85L3 to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...


get_workspace error using subscription_id=574c17c0-996b-490d-a26a-3faa4e105b6d, resource_group_name=AppEase, workspace_name=appeasewML


Interactive authentication successfully completed.
Deploying StorageAccount with name appeasewstorageb908b43bf.
Deploying AppInsights with name appeasewinsights228e3399.
Deployed AppInsights with name appeasewinsights228e3399. Took 7.61 seconds.
Deploying KeyVault with name appeasewkeyvault4a41c505.
Deployed KeyVault with name appeasewkeyvault4a41c505. Took 21.87 seconds.
Deploying Workspace with name appeasewML.
Deployed StorageAccount with name appeasewstorageb908b43bf. Took 27.63 seconds.
Deployed Workspace with name appeasewML. Took 62.68 seconds.


In [4]:
# create an Azure compute cluster
try: # Verify that cluster does not exist already
    cpu_cluster = ComputeTarget(workspace=ws, name=compute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, compute_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress.........
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [5]:
# load data
data = pd.read_json(data_file_name)

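# encode categorical and string columns as integers
# (assign the result back instead of calling replace(inplace=True) on a column selection,
# which newer pandas versions may not propagate to the DataFrame)
data['bloodType'] = data['bloodType'].replace(['A-','A+','B-','B+','AB-','AB+','O-','O+'], [0,1,2,3,4,5,6,7])
data['sex'] = data['sex'].replace(['Male','Female'], [0,1])
# drop the first four characters of each name and keep the rest as an integer
data['name'] = data['name'].map(lambda x: int(x[4:]))
data['TimeStamp'] = data['TimeStamp'].astype(int)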

labels = pd.read_json(labels_file_name, typ='series')
final_df = data.merge(labels.rename('Label'), left_index=True, right_index=True)
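The list-based `replace` call above leaves any unexpected category untouched, so a typo in the data would pass through as a string. As a hedged alternative, the sketch below uses a dictionary with `Series.map` and fails loudly on values that have no code. `BLOOD_TYPE_CODES` and `encode_column` are hypothetical names introduced here, and the sample values are made up.

```python
import pandas as pd

# Same codes as the replace() call in the cell above
BLOOD_TYPE_CODES = {'A-': 0, 'A+': 1, 'B-': 2, 'B+': 3, 'AB-': 4, 'AB+': 5, 'O-': 6, 'O+': 7}


def encode_column(series: pd.Series, codes: dict) -> pd.Series:
    """Map categories to integer codes, raising ValueError on unmapped values."""
    encoded = series.map(codes)  # values missing from `codes` become NaN
    unknown = series[encoded.isna()].unique().tolist()
    if unknown:
        raise ValueError(f"unmapped values: {unknown}")
    return encoded.astype(int)


demo = pd.Series(['A+', 'O-', 'AB+'])  # hypothetical sample values
encoded = encode_column(demo, BLOOD_TYPE_CODES)
print(encoded.tolist())
```

The trade-off is strictness: `map` turns anything outside the dictionary into `NaN`, which the helper converts into an error rather than silently passing mixed types to training.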

In [6]:
# Split the data into train and test sets
x_train, x_test = train_test_split(final_df, test_size=0.2, random_state=223)
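With a dataset this small, a purely random split can leave the test set with a skewed class mix. One option, shown here only as a sketch on hypothetical data, is to pass the label column to the `stratify` parameter of `train_test_split` so that both splits keep roughly the same class proportions. The frame `demo_df` and its `feature` column are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 63-record frame with a binary label column
demo_df = pd.DataFrame({"feature": range(63), "Label": [i % 2 for i in range(63)]})

train_df, test_df = train_test_split(
    demo_df,
    test_size=0.2,
    random_state=223,
    stratify=demo_df["Label"],  # keep class proportions similar in both splits
)

print(len(train_df), len(test_df))
print(test_df["Label"].value_counts().to_dict())
```

Whether stratification helps depends on the real labels; if they are already balanced, the plain split used in the cell above behaves similarly.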

In [7]:
# define settings for the experiment run (see parameters at https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train)
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": "accuracy",
    # Short time limit for testing only. Remove it for real use cases,
    # because it drastically limits the search for the best model.
    "experiment_timeout_hours": 0.25,
    "verbosity": logging.INFO,
    "enable_stack_ensemble": False,
}

automl_config = AutoMLConfig(
    task="classification",
    debug_log="automl_errors.log",
    training_data=x_train,
    label_column_name=label_column_name,
    **automl_settings,
)

In [8]:
# create and run the Experiment
experiment = Experiment(ws, "AppEaseML")
local_run = experiment.submit(automl_config, show_output=True) 
# this can take about 20 minutes with the default settings

No run_configuration provided, running on local with default configuration
Running in the active local environment.


| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| AppEaseML | AutoML_2a41b9e7-10e7-4834-a59e-0f003ae86b0d | automl | Preparing | Link to Azure Machine Learning studio | Link to Documentation |


Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values

In [9]:
# explore the results and retrieve the best model
best_run, best_model = local_run.get_output()
best_model

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/home/appeaseuser/notebooks/AppEase_Cloud')),
                ('SparseNormalizer', Normalizer(copy=True, norm='max')),
                ('KNeighborsClassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='manhattan', metric_params=None,
                                      n_jobs=1, n_neighbors=10, p=2,
                                      weights='distance'))],
         verbose=False)

In [10]:
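# evaluate the best model on the test set with RMSE and an error ratio
# (total absolute error / total of actual values, printed as MAPE)
# NOTE: the "accuracy" printed here is 1 - that ratio, not the fraction of correct predictions, so it can be negative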
y_test = x_test.pop("Label")
y_predict = best_model.predict(x_test)

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
print("Model RMSE:")
print(rmse)
print()

sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    sum_errors += abs(actual_val - predict_val)
    sum_actuals += actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)

Model RMSE:
0.7844645405527362

Model MAPE:
1.1428571428571428

Model Accuracy:
-0.1428571428571428
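For a classification model, accuracy is usually defined as the fraction of predictions that match the true label, which always lies between 0 and 1. The sketch below contrasts that definition with the ratio-based figure computed in the cell above. The label lists are hypothetical 0/1 values, not the notebook's data; they were chosen so that the RMSE and ratio match the printed figures, which is consistent with, but does not confirm, a 13-record binary test set.

```python
from math import sqrt

# Hypothetical 0/1 labels for 13 test records (NOT the notebook's real data):
# 7 positives, and 8 of the 13 predictions are wrong.
y_actual = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_predict = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

n = len(y_actual)
abs_errors = [abs(a - p) for a, p in zip(y_actual, y_predict)]

rmse = sqrt(sum(e * e for e in abs_errors) / n)
error_ratio = sum(abs_errors) / sum(y_actual)  # the quantity printed as "MAPE" above
accuracy = sum(a == p for a, p in zip(y_actual, y_predict)) / n  # fraction correct

print(f"RMSE: {rmse:.4f}")
print(f"sum|error| / sum(actual): {error_ratio:.4f}")
print(f"1 - ratio: {1 - error_ratio:.4f}")
print(f"classification accuracy: {accuracy:.4f}")
```

On real predictions, `sklearn.metrics.accuracy_score(y_test, y_predict)` computes the same fraction-correct value without the manual loop.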


In [11]:
# save the best model
pkl_filename = "best_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(best_model, file)

# save the preprocessed data and labels
data.to_csv('train_data.csv', index=False, header=True)
# header expects a bool or a list of column names
labels.to_csv('train_labels.csv', index=False, header=['Label'])
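A saved pickle is only useful if it can be read back. The following sketch shows the round trip with a plain dictionary standing in for the fitted pipeline; `stand_in_model` is a hypothetical placeholder, and the temporary directory is used only so the example leaves no files behind.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `best_model` pipeline.
stand_in_model = {"algorithm": "KNeighborsClassifier", "n_neighbors": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")

    # write the object, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # read it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Loading the real model this way generally requires the same package versions that were installed when it was pickled, and pickle files should only be loaded from trusted sources because unpickling can execute arbitrary code.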