## Setting Up the Environment

🎯 Create and set up the workspace, experiment, and environment associated with this project.


In [2]:
from azureml.core import Workspace, Experiment, Environment

ws = Workspace.get(name="quick-starts-ws-147693", resource_group = "aml-quickstarts-147693", subscription_id = "f9d5a085-54dc-4215-9ba6-dad5d86e60a0")

exp = Experiment(workspace=ws, name="bank-offer-success-prediction")

env = Environment.get(workspace=ws, name="AzureML-Tutorial")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-147693
Azure region: southcentralus
Subscription id: f9d5a085-54dc-4215-9ba6-dad5d86e60a0
Resource group: aml-quickstarts-147693


## Setting Up the AzureML Compute Target Cluster
🎯 Retrieve or create an ML compute target cluster to be used for training. If creating, use the Standard_D2_V2 template, which consists of 4 cores CPU and 7 GB of RAM. 

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException

cluster_name = "rad-gpu-cluster"
try:
    cluster = ComputeTarget(workspace = ws, name = cluster_name)
    print('Found existing cluster with specified name. Using it!')
except ComputeTargetException:
    print('Did not found existing cluster with specified name. Creating it!')
    config = AmlCompute.provisioning_configuration(vm_size="Standard_D64_v3", max_nodes=4)
    cluster = ComputeTarget.create(ws, cluster_name, config)
    
cluster.wait_for_completion(show_output=True)

Found existing cluster with specified name. Using it!
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Preparing HyperDrive Run

🎯Attempt to find the best hyperparameters (*C, max_iter*) for a logistic regression run, using the following specifications.
* Sampling strategy - Random sampling, for both hyperparameters.
* Early stopping policy - Bandit policy, that early stops runs when their current accuracy is more than 0.2 worse than the best accuracy of the run. 
* Run configuration - The start script of the training process, the compute target cluster and the environment

In [4]:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, randint
import os

sampler = RandomParameterSampling(parameter_space = {"C": uniform(1, 10), 
                                                "max_iter": randint(3000)})

policy = BanditPolicy(evaluation_interval = 1, slack_factor = 0.2, delay_evaluation = 5)


run_configuration = ScriptRunConfig(source_directory = '.',
                                    script = "train.py",
                                    compute_target = cluster,
                                    environment = env)

hyperdrive_config = HyperDriveConfig(run_config = run_configuration,
                                     hyperparameter_sampling = sampler,
                                     policy = policy,
                                     primary_metric_name = "auc_weighted",
                                     primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=32,
                                     max_concurrent_runs=4)

## Submit and Visualize the HyperDrive Run

🎯Start the automatic hyperparameter tuning process and display the live results of the runs.

In [5]:
from azureml.widgets import RunDetails

hyperdrive_run = exp.submit(hyperdrive_config)
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Preparing and Preprocessing Data

🎯 Prepare and clean the data for further processing. 

In [6]:
from azureml.data.dataset_factory import TabularDatasetFactory
from train import clean_data, split_variables

train_csv_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
test_csv_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv"

train_ds = TabularDatasetFactory.from_delimited_files(train_csv_path)
test_ds = TabularDatasetFactory.from_delimited_files(test_csv_path)

train_ds = clean_data(train_ds)
test_ds = clean_data(test_ds)

x_train, y_train = split_variables(train_ds)
x_test, y_test = split_variables(test_ds)

x_train.head(5)

Unnamed: 0,default,housing,loan,age_group_30-60,age_group_<30,age_group_>60,previous_group_0-1,previous_group_2-3-4,previous_group_5-6,previous_group_7,...,month_oct,month_sep,poutcome_failure,poutcome_nonexistent,poutcome_success,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
0,0,0,1,1,0,0,1,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,0,0,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0
3,0,0,0,1,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
4,0,1,0,0,1,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
5,0,1,1,1,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0


## Getting the Best HyperDrive Model

🎯 Find the model with the highest accuracy from the hyperparameter tuning process. Train the model on the train data, then save it.

In [21]:
import joblib
from sklearn.linear_model import LogisticRegression

best_hyperdrive_run = hyperdrive_run.get_best_run_by_primary_metric()
best_hyperdrive_run

Experiment,Id,Type,Status,Details Page,Docs Page
bank-offer-success-prediction,HD_7d04040a-0398-46c4-9276-5cd67c303387_3,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [24]:
best_log_reg_model = LogisticRegression(C=5.82, max_iter=728).fit(x_train, y_train)
joblib.dump(best_log_reg_model, './outputs/hyperdrive-model.pkl')

## Configuring the AutoML Run 
🎯 Configure the AutoML parameters.

* Aim to maximize the accuracy
* Use 5-fold cross-validation

In [9]:
from azureml.train.automl.utilities import get_primary_metrics

get_primary_metrics('classification')

['accuracy',
 'average_precision_score_weighted',
 'AUC_weighted',
 'norm_macro_recall',
 'precision_score_weighted']

In [10]:
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes" : 30,
    "enable_early_stopping": True,
    "iteration_timeout_minutes": 5,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "primary_metric": "AUC_weighted",
    "featurization": "auto"
}

automl_config = AutoMLConfig(task='classification',
                             training_data=train_ds,
                             label_column_name='y',
                             n_cross_validations=5,
                             **automl_settings)

In [11]:
automl_run = exp.submit(automl_config, show_output=True)

No run_configuration provided, running on local with default configuration
Running in the active local environment.


Experiment,Id,Type,Status,Details Page,Docs Page
bank-offer-success-prediction,AutoML_47d8f7d6-00dd-4b4e-8db3-a8e66067cb64,automl,Preparing,Link to Azure Machine Learning studio,Link to Documentation


Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias toward

## Getting the Best AutoML Model

🎯 Get the model with the highest accuracy.

In [14]:
from azureml.core.run import get_run

automl_run = get_run(experiment=exp, run_id='AutoML_47d8f7d6-00dd-4b4e-8db3-a8e66067cb64')

In [25]:
import joblib

best_automl_run, best_automl_model = automl_run.get_output()

print(best_automl_model)

joblib.dump(best_automl_model, './outputs/automl-model.pkl')

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
    iteration_timeout_mode=0,
    iteration_timeout_param=None,
    feature_column_names=None,
    label_column_name=None,
    weight_column_name=None,
    cv_split_column_names=None,
    enable_streaming=None,
    timeseries_param_dict=None,
    gpu_training_param_dict={'processing_unit_type': 'cpu'}
), random_state=0, reg_alpha=0, reg_lambda=2.1875, subsample=1, tree_method='auto'))],
         verbose=False)


## Evaluating on Test Data

🎯 Evaluate both the hyperdrive model and the automl model on the test data.

* Random Sampler Tuned Logistic Regression: **0.9135** Test Accuracy
* Gradient-boosted-tree-based Voting Ensemble: **0.9168** Test Accuracy

In [17]:
missing_columns = set(x_train.columns) - set(x_test.columns)
for column in missing_columns:
    x_test[column] = 0
x_test = x_test[x_train.columns]

In [29]:
from sklearn.metrics import roc_auc_score

y_pred_log_reg = best_log_reg_model.predict(x_test)
y_pred_automl = best_automl_model.predict(x_test)

auc_log_reg = roc_auc_score(y_pred_log_reg, y_test, average='weighted')
auc_automl = roc_auc_score(y_pred_automl, y_test, average='weighted')

print(f"""
     Weighted ROC AUC score for test data:
     - {auc_log_reg:.3f} for the best logistic regression trained with automatic hyperparameter selection (DyperDrive)
     - {auc_automl:.3f} for the best automatically selected learning algorithm (XGBoost)
     """)


     Weighted ROC AUC score for test data:
     - 0.770 for the best logistic regression trained with automatic hyperparameter selection (DyperDrive)
     - 0.787 for the best automatically selected learning algorithm (XGBoost)
     


## Final scoring report

* Logistic Regression: 0.817 (Train) vs 0.770 (Test) 

* XGBoost (AutoMl): 0.774 (Train) vs. 0.787 (Test)



## Cleaning the Compute
🎯 Delete the AML Compute instance

In [14]:
try:
    cluster.delete()
    print('Successfully deleted the allocated compute cluster.')
except ComputeTargetException:
    print('The compute AML cluster was not created via Azure Machine Learning, therefore it cannot be deleted programtically. Nothing has happened.')

Successfully deleted the allocated compute cluster.
