## Setting Up the Environment

🎯 Create and set up the workspace, experiment, and environment associated with this project.


In [1]:
from azureml.core import Workspace, Experiment, Environment

ws = Workspace.get(name="quick-starts-ws-147135", resource_group = "aml-quickstarts-147135", subscription_id = "d4ad7261-832d-46b2-b093-22156001df5b")

exp = Experiment(workspace=ws, name="bank-offer-success-prediction")

env = Environment.get(workspace=ws, name="AzureML-Tutorial")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-147135
Azure region: southcentralus
Subscription id: d4ad7261-832d-46b2-b093-22156001df5b
Resource group: aml-quickstarts-147135


## Setting Up the AzureML Compute Target Cluster
🎯 Retrieve or create an ML compute target cluster to use for training. If creating one, use the Standard_D2_V2 VM size, which provides 2 vCPUs and 7 GB of RAM.

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException

cluster_name = "cpu-cluster-1"
try:
    cluster = ComputeTarget(workspace = ws, name = cluster_name)
    print('Found existing cluster with specified name. Using it!')
except ComputeTargetException:
    print('Did not find existing cluster with specified name. Creating it!')
    config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_V2", max_nodes=4)
    cluster = ComputeTarget.create(ws, cluster_name, config)
    
cluster.wait_for_completion(show_output=True)

Did not find existing cluster with specified name. Creating it!
Creating.........
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Preparing HyperDrive Run

🎯 Search for the best hyperparameters (*C, max_iter*) for a logistic regression model, using the following specifications.
* Sampling strategy - Random sampling for both hyperparameters.
* Early stopping policy - Bandit policy with a slack factor of 0.2, which stops a run early when its accuracy falls below the best accuracy reported so far divided by 1.2 (that is, 1 + slack factor). The policy is checked at every interval, starting after the first 5.
* Run configuration - The training script, the compute target cluster, and the environment.
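
The early-stopping rule above can be sketched in plain Python. This is only an illustration of the slack-factor arithmetic for a metric that is being maximized, not the SDK's implementation; `bandit_cutoff`, `should_terminate`, and the accuracy values are hypothetical names and numbers chosen for the example.

```python
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    """Lowest metric a run may report without being stopped (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric: float, best_metric: float,
                     slack_factor: float = 0.2) -> bool:
    """True when the run falls below the slack-adjusted best metric."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


# Hypothetical values: the best run so far reports an accuracy of 0.90.
best_accuracy = 0.90
print(round(bandit_cutoff(best_accuracy, 0.2), 4))  # 0.75
print(should_terminate(0.70, best_accuracy))        # True: below the cutoff
print(should_terminate(0.80, best_accuracy))        # False: within the slack
```

Because the slack factor is a ratio, the allowed gap grows with the best metric: a best accuracy of 0.90 tolerates runs down to 0.75, not down to 0.70.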

In [3]:
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, randint
import os

sampler = RandomParameterSampling(parameter_space = {"C": uniform(0, 10), 
                                                "max_iter": randint(100)})

policy = BanditPolicy(evaluation_interval = 1, slack_factor = 0.2, delay_evaluation = 5)

if "training" not in os.listdir():
    os.mkdir("./training")

run_configuration = ScriptRunConfig(source_directory = '.',
                                    script = "train.py",
                                    compute_target = cluster,
                                    environment = env)

hyperdrive_config = HyperDriveConfig(run_config = run_configuration,
                                     hyperparameter_sampling = sampler,
                                     policy = policy,
                                     primary_metric_name = "accuracy",
                                     primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=4)
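
To make the search space concrete, here is a minimal sketch of what random sampling over it looks like, using only the standard library. It assumes that `uniform(0, 10)` draws a float between 0 and 10 and that `randint(100)` draws an integer from 0 to 99; `sample_configuration` is a hypothetical helper, not part of the SDK.

```python
import random


def sample_configuration(rng: random.Random) -> dict:
    """Draw one hyperparameter combination from the assumed search space."""
    return {
        "C": rng.uniform(0.0, 10.0),      # stand-in for uniform(0, 10)
        "max_iter": rng.randrange(100),   # stand-in for randint(100): 0..99
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(rng) for _ in range(20)]  # max_total_runs=20

for sample in samples[:3]:
    print(sample)
```

Under these assumptions the space includes `max_iter = 0` and values of `C` arbitrarily close to 0, so it may be worth checking how `train.py` handles such draws.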

## Submit and Visualize the HyperDrive Run

🎯 Start the automatic hyperparameter tuning process and display the live results of the runs.

In [5]:
from azureml.widgets import RunDetails

hyperdrive_run = exp.submit(hyperdrive_config)
RunDetails(hyperdrive_run).show()


## Getting the Best HyperDrive Model

🎯 Find the model with the highest accuracy from the hyperparameter tuning process, and save it.

In [32]:
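import os
import joblib
from sklearn.linear_model import LogisticRegression

highest_accuracy_run = hyperdrive_run.get_best_run_by_primary_metric()

# Rebuild an unfitted model with hard-coded hyperparameter values;
# it is fitted later, before evaluation on the test data.
highest_accuracy_log_reg_model = LogisticRegression(C=7.78, max_iter=65)

# joblib.dump fails if the target directory does not exist.
os.makedirs('./outputs', exist_ok=True)
joblib.dump(highest_accuracy_log_reg_model, './outputs/log-reg-highest-acc-model.pkl')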
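
The hyperparameter values above are typed in by hand. A possible alternative is to read them from the best run's script arguments. The sketch below assumes that `highest_accuracy_run.get_details()["runDefinition"]["arguments"]` returns a flat list such as `["--C", "7.78", "--max_iter", "65"]`, and that `train.py` names its arguments `--C` and `--max_iter`; both are assumptions, since the script is not shown here. `parse_hyperparameters` is a hypothetical helper.

```python
def parse_hyperparameters(arguments: list) -> dict:
    """Turn a flat ['--name', 'value', ...] list into typed hyperparameters."""
    pairs = dict(zip(arguments[0::2], arguments[1::2]))
    return {
        "C": float(pairs["--C"]),
        "max_iter": int(pairs["--max_iter"]),
    }


# With a live run, the list would come from (assumed structure):
#   highest_accuracy_run.get_details()["runDefinition"]["arguments"]
example_arguments = ["--C", "7.78", "--max_iter", "65"]

params = parse_hyperparameters(example_arguments)
print(params)  # {'C': 7.78, 'max_iter': 65}
```

The resulting dictionary could then be passed on as `LogisticRegression(**params)`, which keeps the saved model in sync with whichever run HyperDrive reports as best.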

## Preparing Data for AutoML

🎯 Prepare the data for the automatic model selection process.

In [11]:
from azureml.data.dataset_factory import TabularDatasetFactory

csv_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

ds = TabularDatasetFactory.from_delimited_files(path=csv_path)

## Preprocessing Data for AutoML

🎯 Clean the data, hold out 10% of it as test data, and view the first 5 rows of the training data.

In [13]:
from train import clean_data, split_variables
from sklearn.model_selection import train_test_split

data = clean_data(ds)
train_data, test_data = train_test_split(data, test_size = 0.1)

train_data.head(5)

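|  | age | marital | default | housing | loan | month | day_of_week | duration | campaign | pdays | ... | contact_cellular | contact_telephone | education_basic.4y | education_basic.6y | education_basic.9y | education_high.school | education_illiterate | education_professional.course | education_university.degree | education_unknown |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13346 | 61 | 1 | 0 | 1 | 0 | 10 | 5 | 336 | 2 | 999 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9000 | 42 | 1 | 0 | 0 | 0 | 5 | 5 | 119 | 2 | 999 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 14907 | 25 | 0 | 0 | 0 | 0 | 11 | 4 | 244 | 1 | 999 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 13492 | 48 | 0 | 0 | 1 | 0 | 4 | 4 | 400 | 1 | 999 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 28006 | 29 | 0 | 0 | 1 | 0 | 8 | 4 | 435 | 1 | 999 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |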


## Configuring the AutoML Run 
🎯 Configure the AutoML parameters.

* Aim to maximize the accuracy
* Use 5-fold cross-validation
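
As a reminder of what 5-fold cross-validation means for the training data, here is a minimal sketch that partitions row indices into folds. It is a simplified illustration only: AutoML's own fold assignment may shuffle or stratify the rows, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples: int, n_folds: int = 5):
    """Yield (train_indices, validation_indices) once per fold."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row each.
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Toy example with 10 rows: every row is validated exactly once.
for fold, (train, validation) in enumerate(k_fold_indices(10), start=1):
    print(f"fold {fold}: validate on {validation}, train on {len(train)} rows")
```

Each candidate model is trained 5 times, once per fold, and its reported accuracy is an aggregate over the 5 validation folds rather than a score on a single split.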

In [20]:
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes" : 15,
    "enable_early_stopping": True,
    "iteration_timeout_minutes": 5,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "primary_metric": "accuracy",
    "featurization": "auto"
}

automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='y',
    n_cross_validations=5,
    **automl_settings)

In [21]:
automl_run = exp.submit(automl_config)



| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| bank-offer-success-prediction | AutoML_ad941832-e474-4aeb-8d6d-e223e32f1c8d | automl | Preparing | Link to Azure Machine Learning studio | Link to Documentation |


## Getting the Best AutoML Model

🎯 Get the model with the highest accuracy.

In [26]:
import joblib

highest_accuracy_automl_run, highest_accuracy_automl_model = automl_run.get_output()

print(highest_accuracy_automl_model)

joblib.dump(highest_accuracy_automl_model, './outputs/automl-highest-acc-model.pkl')

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
    gpu_training_param_dict={'processing_unit_type': 'cpu'}
), random_state=0, reg_alpha=1.0416666666666667, reg_lambda=1.5625, subsample=0.8, tree_method='hist'))], verbose=False))], flatten_transform=None, weights=[0.23076923076923078, 0.07692307692307693, 0.15384615384615385, 0.07692307692307693, 0.15384615384615385, 0.07692307692307693, 0.07692307692307693, 0.15384615384615385]))],
         verbose=False)


## Evaluating on Test Data

🎯 Refit both the HyperDrive model and the AutoML model on the training data, then evaluate them on the held-out test data.

* Random Sampler Tuned Logistic Regression: **0.9135** Test Accuracy
* Gradient-boosted-tree-based Voting Ensemble: **0.9168** Test Accuracy

In [31]:
x_train, y_train = split_variables(train_data)
x_test, y_test = split_variables(test_data)

In [33]:
highest_accuracy_log_reg_model.fit(x_train, y_train)
log_reg_accuracy = highest_accuracy_log_reg_model.score(x_test, y_test)

log_reg_accuracy

0.91350531107739

In [34]:
highest_accuracy_automl_model.fit(x_train, y_train)
automl_accuracy = highest_accuracy_automl_model.score(x_test, y_test)

automl_accuracy

0.9168437025796662
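
For reference, the `score` method of a scikit-learn classifier returns mean accuracy: the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation follows; the labels are made-up toy values and `accuracy` is a hypothetical helper, not the code that produced the numbers above.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels: 8 of the 10 predictions are correct.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 0.8
```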