# Hyperdrive with scikit-learn

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

**Efficiently tune hyperparameters** for your model using Azure Machine Learning.<br>
**Hyperparameter tuning** includes the following steps:
<br>
- Define the parameter search space<br>
- Specify a primary metric to optimize<br>
- Specify early termination criteria for poorly performing runs<br>
- Allocate resources for hyperparameter tuning<br>
- Launch an experiment with the above configuration<br>
- Visualize the training runs<br>
- Select the best performing configuration for your model<br>
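
The steps above can be sketched locally without any Azure resources. This is a minimal illustration only: `toy_score` is a hypothetical stand-in for a real training run, and its quadratic shape is an assumption chosen so that the best value is easy to see.

```python
# Toy stand-in for a training run: returns a score for a given regularization rate.
# The quadratic shape is an assumption made purely for illustration.
def toy_score(reg_rate: float) -> float:
    return 1.0 - (reg_rate - 0.05) ** 2

# Define the parameter search space
search_space = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2]

# Run one trial per candidate and record the primary metric
results = {reg: toy_score(reg) for reg in search_space}

# Select the best performing configuration (goal: maximize)
best_reg = max(results, key=results.get)
print("Best regularization rate:", best_reg, "score:", results[best_reg])
```

Hyperdrive follows the same pattern, but each trial is a remote run and the metric is whatever the training script logs.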

Hyperdrive documentation for Azure ML:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

## 1. Introduction

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import datetime
now = datetime.datetime.now()
print(now)

2020-03-26 15:03:48.222483


In [3]:
import azureml.core
print("Azure ML SDK version:", azureml.core.VERSION)

Azure ML SDK version: 1.0.85


In [4]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Azure ML workspace:', ws.name)

Azure ML workspace: workshopAML2020


## 2. Using Hyperdrive to find the best hyperparameter combination

The remote compute you created is a four-node cluster, and you can take advantage of this to execute multiple experiment runs in parallel. One key reason to do this is to try training a model with a range of different hyperparameter values.

Azure ML includes a feature called *hyperdrive* that lets you try different values for one or more hyperparameters, using grid, random, or Bayesian sampling, and find the best-performing trained model based on a metric that you specify, such as *Accuracy* or *Area Under the Curve (AUC)*. This notebook uses grid sampling over a single hyperparameter, the regularization rate.

> **More Information**: For more information about Hyperdrive, see the [Azure ML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters).
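
To make the difference between grid and random sampling concrete, here is a small standard-library sketch. The search space is hypothetical: only `--regularization` appears in this notebook, and `--solver` is added purely to show how combinations multiply.

```python
import itertools
import random

# Hypothetical two-parameter search space (only '--regularization' is used in this notebook).
space = {
    "--regularization": [0.001, 0.01, 0.1],
    "--solver": ["liblinear", "lbfgs"],
}

# Grid sampling: every combination of values is tried exactly once.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]

# Random sampling: each trial draws one value per parameter independently.
rng = random.Random(0)  # fixed seed so the draw is repeatable
random_trials = [
    {name: rng.choice(values) for name, values in space.items()}
    for _ in range(4)
]

print(len(grid), "grid trials")
print(len(random_trials), "random trials")
```

Grid sampling grows multiplicatively with each added parameter, which is why random sampling is often preferred for larger search spaces.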

Let's run a Hyperdrive experiment on the remote compute you have provisioned. First, we'll create the experiment and its associated folder.

In [5]:
import os
from azureml.core import Experiment

In [6]:
# Experiment name
hyperdrive_experiment_name = 'Exemple14-Scikit-Learn-HyperDrive'

In [7]:
hyperdrive_experiment = Experiment(workspace = ws, name = hyperdrive_experiment_name)

hyperdrive_experiment_folder = './' + hyperdrive_experiment_name
os.makedirs(hyperdrive_experiment_folder, exist_ok=True)

print("Experiment:", hyperdrive_experiment.name)

Experiment: Exemple14-Scikit-Learn-HyperDrive


In [8]:
%%writefile $hyperdrive_experiment_folder/diabetes_training.py

import argparse
import os  # needed for os.makedirs below
import joblib
from azureml.core import Run
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

run = Run.get_context()

print("Loading Data...")
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()

X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train the model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
# scikit-learn's C is the inverse of the regularization strength
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# Accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:, 1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

os.makedirs('outputs', exist_ok=True)

joblib.dump(value=model, filename='outputs/diabetes.pkl')

run.complete()

Writing ./Exemple14-Scikit-Learn-HyperDrive/diabetes_training.py
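
The training script passes `C=1/reg` to `LogisticRegression` because scikit-learn's `C` is the inverse of the regularization strength: a smaller regularization rate means a larger `C` and a weaker penalty. The short sketch below prints that mapping for the rates swept later in this notebook.

```python
# Regularization rates taken from the sweep defined later in this notebook.
reg_rates = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2]

# C is the inverse of the regularization rate, as in the training script.
c_values = [1 / reg for reg in reg_rates]

for reg, c in zip(reg_rates, c_values):
    print(f"reg={reg:<7} -> C={c:g}")
```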


In [9]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster-aml"

try:
    compute1 = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute1 = ComputeTarget.create(ws, cluster_name, compute_config)

compute1.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [10]:
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn

# Grid Search
params = GridParameterSampling(
    {
        # Candidate values to test for the hyperparameter
        '--regularization': choice(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2)
    }
)

# Early termination policy
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Input data
diabetes_ds = ws.datasets.get("diabetes dataset")

# Estimator definition
hyper_estimator = SKLearn(source_directory=hyperdrive_experiment_folder,
                           inputs=[diabetes_ds.as_named_input('diabetes')], # Input dataset
                           compute_target = compute1,
                           conda_packages=['pandas','ipykernel','matplotlib'],
                           pip_packages=['azureml-sdk','argparse','pyarrow'],
                           entry_script='diabetes_training.py')

# Hyperdrive configuration
# Note: the grid above defines 9 values, so max_total_runs=6 caps the sweep before every value is tried
hyperdrive = HyperDriveConfig(estimator=hyper_estimator,
                          hyperparameter_sampling=params,
                          policy=policy,
                          primary_metric_name='AUC',
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=6,
                          max_concurrent_runs=4)


# Run
hyperdrive_run = hyperdrive_experiment.submit(config=hyperdrive)

RunDetails(hyperdrive_run).show()


> Processing time: about **10 minutes**
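
The `BanditPolicy` used above stops runs that fall too far behind the best run. The sketch below follows the rule described in the Azure ML documentation for a metric being maximized: a run is terminated when its metric is below `best / (1 + slack_factor)`. The helper name and the AUC values are hypothetical and chosen for illustration; this is not the SDK's implementation.

```python
def bandit_should_terminate(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """Return True if a run falls outside the allowed slack around the best run (maximize goal)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold

# Illustrative values only: suppose the best run so far reports an AUC of 0.85.
best_auc = 0.85
decisions = {
    auc: bandit_should_terminate(auc, best_auc, slack_factor=0.1)
    for auc in (0.84, 0.78, 0.70)
}
print(decisions)
```

With `slack_factor=0.1` the cut-off is about 91% of the best metric, so only clearly lagging runs are stopped. The policy is applied every `evaluation_interval` times the primary metric is reported.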

In [23]:
# Run status and details
hyperdrive_run.get_details()

{'runId': 'HD_a486a28d-5da9-4f9d-b2bf-c03d437b04e4',
 'target': 'cpu-cluster-aml',
 'status': 'Completed',
 'startTimeUtc': '2020-03-26T15:04:27.891052Z',
 'endTimeUtc': '2020-03-26T15:14:47.983305Z',
 'properties': {'primary_metric_config': '{"name": "AUC", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '61b4c649-cbe4-4de7-95e8-e64cf49abe4f',
  'score': '0.8569106291712714',
  'best_child_run_id': 'HD_a486a28d-5da9-4f9d-b2bf-c03d437b04e4_1',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://workshopaml2027584246021.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_a486a28d-5da9-4f9d-b2bf-c03d437b04e4/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=1FGsJ00DyAhmRudl6v8wO2SQoZPWXD8FeszwnxRRolI%3D&st=2020-03-26T15%3A05%3A26Z&se=2020-03-26T23%3A15%3A26Z&sp=r'}}

### Retrieve the best run

In [24]:
best_hyperdrive_run = hyperdrive_run.get_best_run_by_primary_metric()
best_hyperdrive_run_metrics = best_hyperdrive_run.get_metrics()
hyperdrive_parameter_values = best_hyperdrive_run.get_details()['runDefinition']['arguments']

print('Best Run ID =', best_hyperdrive_run.id)
print("Metrics:")
print(' - AUC =', best_hyperdrive_run_metrics['AUC'])
print(' - Accuracy =', best_hyperdrive_run_metrics['Accuracy'])
print(' - Regularization Rate =', hyperdrive_parameter_values)

Best Run ID = HD_a486a28d-5da9-4f9d-b2bf-c03d437b04e4_1
Metrics:
 - AUC = 0.8569106291712714
 - Accuracy = 0.7902222222222223
 - Regularization Rate = ['--regularization', '0.0005']
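
The run definition returns the arguments as a flat list of flags and string values, as shown in the output above. A small helper can turn that list into a dictionary of numeric hyperparameters. `arguments_to_dict` is a hypothetical name, and the sketch assumes that every flag is followed by exactly one numeric value.

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--flag', 'value', ...] list into {'flag': float(value)}."""
    pairs = zip(arguments[0::2], arguments[1::2])
    return {flag.lstrip('-'): float(value) for flag, value in pairs}

# Argument list copied from the best run's output above.
best_params = arguments_to_dict(['--regularization', '0.0005'])
print(best_params)
```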


### Register the best model

In [25]:
from azureml.core import Model

best_hyperdrive_run.register_model(model_path='outputs/diabetes.pkl', 
                                   model_name='Diabetes', 
                                   tags={'Training context':'Hyperdrive'}, # Add a tag
                                   properties={'AUC': best_hyperdrive_run_metrics['AUC'], 
                                               'Accuracy': best_hyperdrive_run_metrics['Accuracy']})


Model(workspace=Workspace.create(name='workshopAML2020', subscription_id='70b8f39e-8863-49f7-b6ba-34a80799550c', resource_group='workshopAML2020-rg'), name=Diabetes, id=Diabetes:4, version=4, tags={'Training context': 'Hyperdrive'}, properties={'AUC': '0.8569106291712714', 'Accuracy': '0.7902222222222223'})

### List the models registered in the Azure ML workspace

In [26]:
# List the registered models
for model in Model.list(ws):
    print(model.name, 'Version =', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

Diabetes Version = 4
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Diabetes Version = 3
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Diabetes Version = 2
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Diabetes Version = 1
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Exemple10-Modele-TensorFlow Version = 3
	 Training context : TensorFlow GPU


Exemple10-Modele-TensorFlow Version = 2
	 Training context : TensorFlow GPU


sklearn_regression_model.pkl Version = 2
	 area : diabetes
	 type : regression


IBM_attrition_explainer Version = 4


local_deploy_model Version = 4


Exemple10-Modele-TensorFlow Version = 1
	 Training context : TensorFlow GPU


sklearn_regression_model.pkl Version = 1
	 area : diabetes
	 type : regression


diabetes_model Version = 2
	 Training context : Pipeline


Exemple4-AutoML-Fo
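
The listing shows several versions of the same model name: each call to `register_model` with an existing name creates a new version, which is why `Diabetes` appears four times with identical metrics. As a rough sketch, the listing can be reduced to the latest version per name. The `(name, version)` pairs below are copied from part of the output above.

```python
# (name, version) pairs copied from part of the listing above.
listing = [
    ("Diabetes", 4), ("Diabetes", 3), ("Diabetes", 2), ("Diabetes", 1),
    ("Exemple10-Modele-TensorFlow", 3), ("Exemple10-Modele-TensorFlow", 2),
    ("sklearn_regression_model.pkl", 2),
]

# Keep only the highest version seen for each model name.
latest = {}
for name, version in listing:
    latest[name] = max(version, latest.get(name, 0))

for name, version in latest.items():
    print(name, "latest version =", version)
```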

## 3. AutoML with Azure ML

In [27]:
diabetes_ds = ws.datasets.get("diabetes dataset")
train_ds, test_ds = diabetes_ds.random_split(percentage=0.7, seed=123)
print("OK")

OK


In [28]:
from azureml.core.runconfig import RunConfiguration
from azureml.train.automl import AutoMLConfig
import time
import logging

automl_run_config = RunConfiguration(framework="python")
automl_run_config.environment.docker.enabled = True

automl_settings = {
    "name": "Diabetes_AutoML_{0}".format(time.time()),
    "iteration_timeout_minutes": 2,
    "iterations": 10,
    "primary_metric": 'AUC_weighted',
    "preprocess": False,
    "max_concurrent_iterations": 2
}

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl14.log',
                             compute_target=compute1,
                             run_configuration=automl_run_config,
                             training_data = train_ds,
                             validation_data = test_ds,
                             label_column_name='Diabetic',
                             model_explainability=True,
                             **automl_settings,
                             )

print("OK.")

OK.
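
The settings above imply a rough budget for the model-training part of the experiment. The estimate below assumes that iterations run in full waves of `max_concurrent_iterations` and that each one uses up to its timeout. It ignores cluster start-up, data preparation, and any post-processing, so the real wall time can be longer.

```python
import math

# Values copied from automl_settings above.
iterations = 10
iteration_timeout_minutes = 2
max_concurrent_iterations = 2

# Number of sequential waves needed to run all iterations.
waves = math.ceil(iterations / max_concurrent_iterations)

# Rough upper bound on time spent in iterations alone (an estimate, not a guarantee).
training_budget_minutes = waves * iteration_timeout_minutes
print("Approximate iteration budget:", training_budget_minutes, "minutes")
```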


In [29]:
%%time
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

automl_experiment = Experiment(ws, 'Exemple14-AutoML-Diabetes')
automl_run = automl_experiment.submit(automl_config, show_output = True)
RunDetails(automl_run).show()

Running on remote or ADB.
Running on remote compute: cpu-cluster-aml
Parent Run ID: AutoML_f45ef1a3-570c-4abf-9a5c-64aba1f82ff5

Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         1   StandardScalerWrapper SGD                      0:01:59       0.8539    0.8539
         0   MaxAbsScaler LightGBM                          0:01:58       0.9900    0.9900
         3   MinMaxScaler LightGBM                          0:02:07       0.98


CPU times: user 6.68 s, sys: 2.18 s, total: 8.86 s
Wall time: 14min 20s


### The best AutoML run

In [31]:
best_automl_run, fitted_model = automl_run.get_output()
print(best_automl_run)
print()
print(fitted_model)
print()
best_automl_run_metrics = best_automl_run.get_metrics()
for metric_name in best_automl_run_metrics:
    metric = best_automl_run_metrics[metric_name]
    print(metric_name, metric)

Run(Experiment: Exemple14-AutoML-Diabetes,
Id: AutoML_f45ef1a3-570c-4abf-9a5c-64aba1f82ff5_0,
Type: azureml.scriptrun,
Status: Completed)

Pipeline(memory=None,
     steps=[('MaxAbsScaler', MaxAbsScaler(copy=True)), ('LightGBMClassifier', LightGBMClassifier(boosting_type='gbdt', class_weight=None,
          colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
          max_depth=-1, min_child_samples=20, min_child_weight=0.001,
          min_split_g...    silent=True, subsample=1.0, subsample_for_bin=200000,
          subsample_freq=0, verbose=-10))])

recall_score_macro 0.9486595205431674
f1_score_weighted 0.9545479652486709
weighted_accuracy 0.9592649598342113
AUC_macro 0.9900041352137385
log_loss 0.12080804025763404
f1_score_macro 0.9488116461956579
accuracy_table aml://artifactId/ExperimentRun/dcid.AutoML_f45ef1a3-570c-4abf-9a5c-64aba1f82ff5_0/accuracy_table
AUC_micro 0.9911678419275206
norm_macro_recall 0.8973190410863348
recall_score_micro 0.9545556805399326
accuracy
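
As the output shows, `get_metrics()` mixes scalar scores with references to stored artifacts such as `accuracy_table`. A minimal sketch for separating the two, using a few of the values printed above, could look like this.

```python
# A subset of the metrics printed above.
metrics = {
    "recall_score_macro": 0.9486595205431674,
    "AUC_macro": 0.9900041352137385,
    "log_loss": 0.12080804025763404,
    "accuracy_table": "aml://artifactId/ExperimentRun/dcid.AutoML_f45ef1a3-570c-4abf-9a5c-64aba1f82ff5_0/accuracy_table",
}

# Scalars are numeric; artifact references are stored as strings.
scalars = {name: value for name, value in metrics.items() if isinstance(value, (int, float))}
artifacts = {name: value for name, value in metrics.items() if isinstance(value, str)}

for name in sorted(scalars):
    print(f"{name:<20} {scalars[name]:.4f}")
print("Artifacts:", list(artifacts))
```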

### Feature importances for the best run

In [32]:
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

client = ExplanationClient.from_run(best_automl_run)
# raw=True requests importances for the original (raw) features
raw_explanations = client.download_model_explanation(raw=True)
feature_importances = raw_explanations.get_feature_importance_dict()

print("Feature\tImportance")
for key, value in feature_importances.items():
    print(key, "\t", value)

Feature	Importance
Pregnancies 	 1.8426337771987051
Age 	 0.7778802389966267
BMI 	 0.7108683623434161
SerumInsulin 	 0.5928044490098323
PlasmaGlucose 	 0.46539266086245157
TricepsThickness 	 0.3524027260758251
DiastolicBloodPressure 	 0.23950323837307577
DiabetesPedigree 	 0.1569421948989365
PatientID 	 0.019492243327769034


<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">