# Azure ML Hyperdrive with Scikit-Learn

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

**Efficiently tune hyperparameters** for your model using Azure Machine Learning.<br>
**Hyperparameter tuning** includes the following steps:
<br>
- Define the parameter search space<br>
- Specify a primary metric to optimize<br>
- Specify early termination criteria for poorly performing runs<br>
- Allocate resources for hyperparameter tuning<br>
- Launch an experiment with the above configuration<br>
- Visualize the training runs<br>
- Select the best performing configuration for your model<br>
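
As a rough local illustration of the steps above (search space, primary metric, launching runs, selecting the best configuration), here is a minimal sketch that needs no Azure resources. The `evaluate` function is a hypothetical stand-in for a real training run and returns a made-up score. Early termination and resource allocation are not modelled here.

```python
import math

def evaluate(reg):
    """Hypothetical stand-in for a training run: a made-up score peaking at reg = 0.01."""
    return 1.0 - abs(math.log10(reg) + 2) / 10

# Define the parameter search space
search_space = {"regularization": [0.0005, 0.005, 0.01, 0.1]}

# Specify the primary metric and whether it should be maximized
primary_metric = "score"
maximize = True

# Launch one "run" per candidate value and record its metric
runs = []
for reg in search_space["regularization"]:
    runs.append({"regularization": reg, primary_metric: evaluate(reg)})

# Select the best performing configuration
pick = max if maximize else min
best = pick(runs, key=lambda r: r[primary_metric])
print("Best configuration:", best)
```

In the rest of this notebook, Hyperdrive plays the role of the loop and each run is a real training job on a remote cluster.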

Hyperdrive documentation for Azure ML:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

## 1. Introduction

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import datetime
now = datetime.datetime.now()
print(now)

2020-05-06 09:52:48.209455


In [3]:
import azureml.core
print("Azure ML SDK version:", azureml.core.VERSION)

Azure ML SDK version: 1.4.0


In [4]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Workspace Azure ML :', ws.name)

Workspace Azure ML : workshopAML2020


In [5]:
from azureml.core import ComputeTarget, Datastore, Dataset

print("Compute Targets:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ':', compute.type)
    
print("Datastores:")
for datastore_name in ws.datastores:
    datastore = Datastore.get(ws, datastore_name)
    print("\t", datastore.name, ':', datastore.datastore_type)
    
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name)

Compute Targets:
	 drift-aks : AKS
	 instance-aks : AKS
	 instance : ComputeInstance
	 AutoML : AmlCompute
	 cpu-cluster : AmlCompute
	 cpupipelines : AmlCompute
	 clustergpuNC6 : AmlCompute
	 gpuclusterNC6 : AmlCompute
Datastores:
	 azureml : AzureBlob
	 modelservingdata : AzureBlob
	 aiexportdata : AzureBlob
	 modeldata : AzureBlob
	 teststorageserge : AzureBlob
	 azureml_globaldatasets : AzureBlob
	 workspaceblobstore : AzureBlob
	 workspacefilestore : AzureFile
Datasets:
	 drift-demo-dataset
	 target
	 dataset
	 Iris
	 mnist dataset
	 diabetes dataset


## 2. Using Hyperdrive to find the best hyperparameter combination

The remote compute cluster created below scales from 1 to 10 nodes, so you can execute multiple experiment runs in parallel. One key reason to do this is to train a model with a range of different hyperparameter values.

Azure ML includes a feature called *hyperdrive* that tries different values for one or more hyperparameters, using grid, random, or Bayesian sampling, and finds the best performing trained model based on a metric that you specify, such as *Accuracy* or *Area Under the Curve (AUC)*.

> **More Information**: For more information about Hyperdrive, see the [Azure ML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters).
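
To make the idea of "trying different values" concrete, the following sketch contrasts exhaustive grid enumeration with random sampling over a small search space. It uses plain Python rather than the Azure ML SDK, and the `solver` entry is a hypothetical second hyperparameter added only for illustration; the notebook itself tunes the regularization rate alone.

```python
import itertools
import random

# Search space: the regularization values used later in this notebook,
# plus a hypothetical second hyperparameter for illustration.
space = {
    "regularization": [0.0005, 0.005, 0.01, 0.1],
    "solver": ["liblinear", "lbfgs"],
}

def grid_samples(space):
    """Yield every combination of values (exhaustive grid search)."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def random_samples(space, n, seed=0):
    """Draw n configurations, picking each value independently at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n)]

grid = list(grid_samples(space))
sampled = random_samples(space, n=3)
print(len(grid), "grid configurations,", len(sampled), "random configurations")
```

Grid sampling grows multiplicatively with each added hyperparameter, which is why random sampling is often preferred for larger spaces.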

Let's run a Hyperdrive experiment on remote compute. First, we'll create the experiment and its associated folder.

In [6]:
import os
from azureml.core import Experiment

In [7]:
# Experiment name
hyperdrive_experiment_name = 'Exemple14-Scikit-Learn-HyperDrive'

In [8]:
hyperdrive_experiment = Experiment(workspace = ws, name = hyperdrive_experiment_name)

hyperdrive_experiment_folder = './' + hyperdrive_experiment_name
os.makedirs(hyperdrive_experiment_folder, exist_ok=True)

print("Experiment:", hyperdrive_experiment.name)

Experiment: Exemple14-Scikit-Learn-HyperDrive


In [9]:
%%writefile $hyperdrive_experiment_folder/diabetes_training.py

import argparse
import os  # needed for os.makedirs below
import joblib
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

run = Run.get_context()

print("Loading data...")
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()

X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Training
print('Logistic regression with regularization rate', reg)
run.log('Regularization rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# Accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy =', acc)
run.log('Accuracy', float(acc))

# AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC =' + str(auc))
run.log('AUC', float(auc))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

os.makedirs('outputs', exist_ok=True)

joblib.dump(value=model, filename='outputs/diabetes.pkl')

run.complete()

Overwriting ./Exemple14-Scikit-Learn-HyperDrive/diabetes_training.py
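
The training script receives its hyperparameter on the command line and passes `C=1/reg` to `LogisticRegression`. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller rate means a larger `C` and weaker regularization. The sketch below reproduces only the argument handling of the script; `parse_reg` is a helper name introduced here for illustration.

```python
import argparse

def parse_reg(argv):
    """Parse the regularization rate the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01, help='regularization rate')
    return parser.parse_args(argv).reg_rate

# One explicit value, then an empty command line that falls back to the default
for argv in (['--regularization', '0.0005'], []):
    reg = parse_reg(argv)
    print(argv, '-> reg =', reg, ', C =', 1 / reg)
```

Each Hyperdrive child run calls the script with one value from the search space in place of `argv`.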


In [10]:
!ls Exemple14-Scikit-Learn-HyperDrive/diabetes_training.py -l

-rwxrwxrwx 1 root root 1862 May  6 09:52 Exemple14-Scikit-Learn-HyperDrive/diabetes_training.py


In [11]:
# Display the training script
with open('./Exemple14-Scikit-Learn-HyperDrive/diabetes_training.py', 'r') as f:
    print(f.read())



In [12]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-standardd4"

try:
    compute1 = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D4', min_nodes=1, max_nodes=10)
    compute1 = ComputeTarget.create(ws, cluster_name, compute_config)

compute1.wait_for_completion(show_output=True)

Creating
Succeeded..................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [13]:
# List the compute targets
for name in ws.compute_targets:
    print("Workspace compute target:", name)

Workspace compute target: drift-aks
Workspace compute target: instance-aks
Workspace compute target: instance
Workspace compute target: AutoML
Workspace compute target: cpu-cluster
Workspace compute target: cpupipelines
Workspace compute target: clustergpuNC6
Workspace compute target: gpuclusterNC6
Workspace compute target: cpu-standardd4


In [14]:
# Define tags for the run
tagsdurun = {"Type": "test" , "Language" : "Python" , "Framework" : "Scikit-Learn" , "Hyperdrive" : "Gridsearch"}

In [15]:
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn

# Grid search
params = GridParameterSampling(
    {
        # Values of the regularization parameter to try
        '--regularization': choice(0.0005, 0.005, 0.01, 0.1)
    }
)

# Policy Bandit is an early termination policy based on slack factor/slack amount and evaluation interval. 
# The policy early terminates any runs where the primary metric is not within the specified slack factor/slack amount 
# with respect to the best performing training run.

policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Data
diabetes_ds = ws.datasets.get("diabetes dataset")

# Estimator definition
hyper_estimator = SKLearn(source_directory=hyperdrive_experiment_folder,
                           inputs=[diabetes_ds.as_named_input('diabetes')], # Input data
                           compute_target = compute1, # Compute cluster
                           conda_packages=['pandas','ipykernel','matplotlib'], # Dependencies
                           pip_packages=['azureml-sdk','argparse','pyarrow'], 
                           entry_script='diabetes_training.py')  # Python training script

# Hyperdrive configuration
hyperdrive = HyperDriveConfig(estimator=hyper_estimator, 
                          hyperparameter_sampling=params, # Search space
                          policy=policy, # Early termination policy
                          primary_metric_name='Accuracy', # Metric logged by the training script
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, # Maximize the metric
                          max_total_runs=10, # Upper bound; the grid above only defines 4 combinations
                          max_concurrent_runs=8)


Documentation:
    https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?view=azure-ml-py
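
To illustrate the slack-factor rule described in the Bandit policy comment above, here is a small sketch for a metric that is being maximized. It follows the rule as documented for `slack_factor`: a run is cancelled when its metric falls below `best / (1 + slack_factor)`. The function names are hypothetical, and the real policy is evaluated by the service, not by code like this.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being cancelled (maximize goal)."""
    return best_metric / (1 + slack_factor)

def should_terminate(run_metric, best_metric, slack_factor):
    """True if the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)

# With the slack_factor=0.1 used in this notebook and a best accuracy of 0.79
cutoff = bandit_cutoff(0.79, 0.1)
print("Cutoff:", round(cutoff, 4))
print("Run at 0.70 cancelled?", should_terminate(0.70, 0.79, 0.1))
print("Run at 0.75 cancelled?", should_terminate(0.75, 0.79, 0.1))
```

A larger slack factor tolerates runs that lag further behind the current best, at the cost of spending more compute on weak configurations.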
        

In [16]:
# Run
hyperdrive_run = hyperdrive_experiment.submit(config=hyperdrive, tags=tagsdurun)
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

> Processing time: about **10 minutes**

In [31]:
# Run details
hyperdrive_run.get_details()

{'runId': 'HD_7f6a2a34-56bf-4f9b-bfd5-9fb94312188f',
 'target': 'cpu-standardd4',
 'status': 'Completed',
 'startTimeUtc': '2020-05-06T09:54:54.690063Z',
 'endTimeUtc': '2020-05-06T10:04:08.566556Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '41ccefec-2596-4208-ad39-46da04e0fb3e',
  'score': '0.7902222222222223',
  'best_child_run_id': 'HD_7f6a2a34-56bf-4f9b-bfd5-9fb94312188f_0',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://workshopaml2027584246021.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_7f6a2a34-56bf-4f9b-bfd5-9fb94312188f/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=etVhQqvT0Ne%2FzdbwybipgG%2Bj33o3oDpeKsnHMCmbVM0%3D&st=2020-05-06T09%3A55%3A17Z&se=2020-05-06T18%3A05%3A17Z&sp=r'}}

### Retrieve the best run

In [32]:
best_hyperdrive_run = hyperdrive_run.get_best_run_by_primary_metric()
best_hyperdrive_run_metrics = best_hyperdrive_run.get_metrics()
hyperdrive_parameter_values = best_hyperdrive_run.get_details()['runDefinition']['arguments']

In [33]:
print("Results of the best hyperparameter tuning run:")
print()
print('Best Run ID =', best_hyperdrive_run.id)
print()
print('Best regularization rate =', hyperdrive_parameter_values)
print()
print('Metrics:')
print(' - AUC =', best_hyperdrive_run_metrics['AUC'])
print(' - Accuracy =', best_hyperdrive_run_metrics['Accuracy'])

Results of the best hyperparameter tuning run:

Best Run ID = HD_7f6a2a34-56bf-4f9b-bfd5-9fb94312188f_0

Best regularization rate = ['--regularization', '0.0005']

Metrics:
 - AUC = 0.8569106291712714
 - Accuracy = 0.7902222222222223
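
The run's `arguments` value printed above is a flat list of flags and string values. A small helper can turn it into a dictionary that is easier to read or store with the model. This is a sketch that assumes flags and values strictly alternate, as they do in the output above; `arguments_to_dict` is a name introduced here for illustration.

```python
def arguments_to_dict(arguments):
    """Pair up ['--name', 'value', ...] into {'name': value}, using floats where possible."""
    params = {}
    for flag, value in zip(arguments[::2], arguments[1::2]):
        name = flag.lstrip('-')
        try:
            params[name] = float(value)
        except ValueError:
            # Keep non-numeric values as strings
            params[name] = value
    return params

best_params = arguments_to_dict(['--regularization', '0.0005'])
print(best_params)
```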


### Register the best model

In [34]:
from azureml.core import Model

best_hyperdrive_run.register_model(model_path='outputs/diabetes.pkl', 
                                   model_name='Diabetes',
                                   tags={'Training context':'Hyperdrive'},
                                   properties={'AUC': best_hyperdrive_run_metrics['AUC'],
                                               'Accuracy': best_hyperdrive_run_metrics['Accuracy']})

Model(workspace=Workspace.create(name='workshopAML2020', subscription_id='70b8f39e-8863-49f7-b6ba-34a80799550c', resource_group='workshopAML2020-rg'), name=Diabetes, id=Diabetes:15, version=15, tags={'Training context': 'Hyperdrive'}, properties={'AUC': '0.8569106291712714', 'Accuracy': '0.7902222222222223'})

> The model is available in the **Models** section of Azure ML Studio

### List of models registered in the Azure ML workspace

In [35]:
# List the registered models
for model in Model.list(ws):
    print(model.name, '- version =', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

Diabetes - version = 15
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Exemple10-Modele-TensorFlow - version = 16
	 Framework : TensorFlow
	 Hyperdrive : Oui
	 GPU : Oui


Diabetes - version = 14
	 Training context : Hyperdrive
	 AUC : 0.8569106291712714
	 Accuracy : 0.7902222222222223


Exemple10-Modele-TensorFlow - version = 15
	 Framework : TensorFlow
	 Hyperdrive : Oui
	 GPU : Oui


diabetes_model - version = 15
	 Training context : Pipeline


Modele-SKLEARN-Regression - version = 2
	 area : diabetes
	 type : regression
	 format : Scikit-Learn pkl


Exemple4-AutoML-Forecast - version = 5
	 Training context : Azure Auto ML
	 R2 : 0.2102297702988311
	 RMSE : 0.025719636958220327


boston_model.pkl - version = 13
	 algo : Regression
	 Training context : Azure ML
	 Framework : scikit-learn


RegressionRidge - version = 6
	 area : Diabetes
	 type : Regression Ridge
	 k : 0.4
	 MSE : 3295.741064355809
	 R2 : 0.3572956390661659
	 Framework : A

### Deleting the compute cluster

In [36]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, "(" , ct.type, ") :", ct.provisioning_state)

drift-aks ( AKS ) : Succeeded
instance-aks ( AKS ) : Succeeded
instance ( ComputeInstance ) : Succeeded
AutoML ( AmlCompute ) : Succeeded
cpu-cluster ( AmlCompute ) : Succeeded
cpupipelines ( AmlCompute ) : Succeeded
clustergpuNC6 ( AmlCompute ) : Succeeded
gpuclusterNC6 ( AmlCompute ) : Succeeded
cpu-standardd4 ( AmlCompute ) : Succeeded


In [37]:
# Delete the cluster (uncomment to actually delete it)
#compute1.delete()

In [38]:
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, "(" , ct.type, ") :", ct.provisioning_state)

drift-aks ( AKS ) : Succeeded
instance-aks ( AKS ) : Succeeded
instance ( ComputeInstance ) : Succeeded
AutoML ( AmlCompute ) : Succeeded
cpu-cluster ( AmlCompute ) : Succeeded
cpupipelines ( AmlCompute ) : Succeeded
clustergpuNC6 ( AmlCompute ) : Succeeded
gpuclusterNC6 ( AmlCompute ) : Succeeded
cpu-standardd4 ( AmlCompute ) : Succeeded


<img src="https://github.com/retkowsky/images/blob/master/Powered-by-MS-Azure-logo-v2.png?raw=true" height="300" width="300">