# 2. ML Analysis with Azure ML

<img src='https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true'>

### Framingham data

https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset

Attributes/columns:

- male: 0 = Female; 1 = Male
- age: Age at exam time
- education: 1 = Some High School; 2 = High School or GED; 3 = Some College or Vocational School; 4 = college
- currentSmoker: 0 = nonsmoker; 1 = smoker
- cigsPerDay: number of cigarettes smoked per day (estimated average)
- BPMeds: 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medications
- prevalentStroke: 0 = No previous stroke; 1 = Previous stroke
- prevalentHyp: 0 = Not hypertensive; 1 = Hypertensive
- diabetes: 0 = No; 1 = Yes
- totChol in mg/dL
- sysBP in mmHg
- diaBP in mmHg
- BMI: Body Mass Index, calculated as weight (kg) / height (m) squared
- heartRate: Beats/Min (Ventricular)
- glucose in mg/dL

- TenYearCHD (label): did the person develop heart disease during the 10-year study period? 0 = No; 1 = Yes
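
The BMI definition in the attribute list can be checked with a few lines of Python. This is only a sketch of the formula as stated above; the function name `body_mass_index` and the example person (70 kg, 1.75 m) are illustrative assumptions, not values from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight (kg) / height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person, not a row of the Framingham data
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))  # 22.86
```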


## 1. Information

In [1]:
import sys
sys.version

'3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]'

In [2]:
import datetime
now = datetime.datetime.now()
print(now)

2020-10-13 07:47:46.653574


In [3]:
# Azure ML SDK version
import azureml.core
print("Azure ML version:", azureml.core.VERSION)

Azure ML version: 1.15.0


## 2. Azure ML workspace

In [4]:
# Replace each "AREMPLACER" ("to be replaced") placeholder with your own values
subscription_id = "AREMPLACER"
resource_group = "AREMPLACER"
workspace_name = "AREMPLACER"
workspace_region = "AREMPLACER"
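
One possible way to avoid hardcoding these values in the notebook is to read them from environment variables and fall back to the placeholder. The sketch below assumes hypothetical variable names (`AZURE_SUBSCRIPTION_ID` and so on); they are not defined by the Azure ML SDK, and any naming scheme would work.

```python
import os

PLACEHOLDER = "AREMPLACER"  # same placeholder as in the cell above


def read_setting(env_name: str) -> str:
    """Return the environment variable's value, or the placeholder if unset."""
    return os.environ.get(env_name, PLACEHOLDER)


# The environment variable names below are assumptions made for this example
settings = {
    "subscription_id": read_setting("AZURE_SUBSCRIPTION_ID"),
    "resource_group": read_setting("AZURE_RESOURCE_GROUP"),
    "workspace_name": read_setting("AZURE_WORKSPACE_NAME"),
    "workspace_region": read_setting("AZURE_WORKSPACE_REGION"),
}

# Report which settings still need to be filled in
missing = [name for name, value in settings.items() if value == PLACEHOLDER]
if missing:
    print("Still to be replaced:", ", ".join(missing))
```

Checking `missing` before calling `Workspace.create` gives an early, readable error instead of a failed Azure request.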

In [5]:
# Azure ML experiment
experiment_name = 'monexperimentation'

In [6]:
project_dir = './monprojet'
#deployment_dir = './deploy'
model_name = 'monmodele'
model_description = 'madescription'
vm_name = "monclustercpu"

In [7]:
import os
import logging

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.compute import ComputeTarget
from azureml.core.model import Model
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.widgets import RunDetails

In [8]:
ws = Workspace.create(
    name = workspace_name,
    subscription_id = subscription_id,
    resource_group = resource_group, 
    location = workspace_region,
    exist_ok = True)  # reuse the workspace if it already exists

ws.write_config()
print('Workspace configuration succeeded')

Workspace configuration succeeded


In [9]:
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: workshoplcl
Azure region: westeurope
Resource group: workshoplcl-rg


In [10]:
from azureml.core import ComputeTarget, Datastore, Dataset

print("Compute Targets:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ':', compute.type)
    
print("Datastores:")
for datastore_name in ws.datastores:
    datastore = Datastore.get(ws, datastore_name)
    print("\t", datastore.name, ':', datastore.datastore_type)
    
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name)

Compute Targets:
	 nbookinstance : ComputeInstance
Datastores:
	 azureml_globaldatasets : AzureBlob
	 workspaceblobstore : AzureBlob
	 workspacefilestore : AzureFile
Datasets:


## 3. Azure ML experiment

In [11]:
# Create the project directory if it does not exist yet
os.makedirs(project_dir, exist_ok=True)

In [12]:
experiment = Experiment(ws, experiment_name)

In [13]:
# Workspace details
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
ws.get_details()

{'id': '/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourceGroups/workshoplcl-rg/providers/Microsoft.MachineLearningServices/workspaces/workshoplcl',
 'name': 'workshoplcl',
 'location': 'westeurope',
 'type': 'Microsoft.MachineLearningServices/workspaces',
 'tags': {},
 'sku': 'Basic',
 'workspaceid': '8e72db2b-253f-43ad-b178-21834bb76976',
 'description': '',
 'friendlyName': 'workshoplcl',
 'creationTime': '2020-10-13T07:30:49.1872058+00:00',
 'keyVault': '/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshoplcl-rg/providers/microsoft.keyvault/vaults/workshoplcl0643011812',
 'applicationInsights': '/subscriptions/70b8f39e-8863-49f7-b6ba-34a80799550c/resourcegroups/workshoplcl-rg/providers/microsoft.insights/components/workshoplcl0507464134',
 'identityPrincipalId': 'cc04ec13-5ba2-469f-9d48-93a456fbf38d',
 'identityTenantId': '72f988bf-86f1-41af-91ab-2d7cd011db47',
 'identityType': 'SystemAssigned',
 'storageAccount': '/subscriptions/70b8f39e-8863-49f7-

## 4. Azure ML Compute

In [14]:
# List the VM sizes available for compute targets
from azureml.core.compute import ComputeTarget, AmlCompute
AmlCompute.supported_vmsizes(workspace = ws)

[{'name': 'Standard_D1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 51200},
 {'name': 'Standard_D2_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 7.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D3_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D4_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D11_v2',
  'vCPUs': 2,
  'gpus': 0,
  'memoryGB': 14.0,
  'maxResourceVolumeMB': 102400},
 {'name': 'Standard_D12_v2',
  'vCPUs': 4,
  'gpus': 0,
  'memoryGB': 28.0,
  'maxResourceVolumeMB': 204800},
 {'name': 'Standard_D13_v2',
  'vCPUs': 8,
  'gpus': 0,
  'memoryGB': 56.0,
  'maxResourceVolumeMB': 409600},
 {'name': 'Standard_D14_v2',
  'vCPUs': 16,
  'gpus': 0,
  'memoryGB': 112.0,
  'maxResourceVolumeMB': 819200},
 {'name': 'Standard_DS1_v2',
  'vCPUs': 1,
  'gpus': 0,
  'memoryGB': 3.5,
  'maxResourceVolumeMB': 7168},
 {'name': 'Standar
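
The list returned above is a plain list of dictionaries, so a VM size can be picked programmatically. The following sketch copies a few entries from the output and selects the smallest size that meets a minimum vCPU and memory requirement; the helper `smallest_vm` and the thresholds (4 vCPUs, 28 GB) are assumptions chosen for illustration.

```python
# A few entries copied from the output of AmlCompute.supported_vmsizes above
vm_sizes = [
    {"name": "Standard_D3_v2", "vCPUs": 4, "gpus": 0, "memoryGB": 14.0},
    {"name": "Standard_D4_v2", "vCPUs": 8, "gpus": 0, "memoryGB": 28.0},
    {"name": "Standard_D12_v2", "vCPUs": 4, "gpus": 0, "memoryGB": 28.0},
    {"name": "Standard_D13_v2", "vCPUs": 8, "gpus": 0, "memoryGB": 56.0},
]


def smallest_vm(sizes, min_vcpus, min_memory_gb):
    """Return the entry with the fewest vCPUs (then least memory) meeting both minimums."""
    candidates = [
        vm for vm in sizes
        if vm["vCPUs"] >= min_vcpus and vm["memoryGB"] >= min_memory_gb
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda vm: (vm["vCPUs"], vm["memoryGB"]))


choice = smallest_vm(vm_sizes, min_vcpus=4, min_memory_gb=28.0)
print(choice["name"])  # Standard_D12_v2
```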

In [15]:
%%time
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

try:
    compute_target = ComputeTarget(workspace=ws, name=vm_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D12_V2',
                                                           min_nodes=1, 
                                                           max_nodes=2, 
                                                           idle_seconds_before_scaledown=1200)

    # create the cluster
    compute_target = ComputeTarget.create(ws, vm_name, compute_config)
    # Show output
    compute_target.wait_for_completion(show_output=True)

Creating a new compute target...
Creating
Succeeded..................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
CPU times: user 329 ms, sys: 24.5 ms, total: 354 ms
Wall time: 1min 52s


In [16]:
from azureml.core import ComputeTarget, Datastore, Dataset

print("Compute Targets:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ':', compute.type)

Compute Targets:
	 nbookinstance : ComputeInstance
	 monclustercpu : AmlCompute


In [17]:
experiment

Name,Workspace,Report Page,Docs Page
monexperimentation,workshoplcl,Link to Azure Machine Learning studio,Link to Documentation


## 5. Environment

In [18]:
# Docker-based environment with scikit-learn
training_venv = Environment("training_env")

training_venv.docker.enabled = True
training_venv.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])

In [19]:
%%writefile $project_dir/train.py

import pandas as pd
import numpy as np
import pickle
import os

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

dataset_url = ('https://raw.githubusercontent.com/retkowsky/WorkshopMLOps/master/framingham.csv')
df = pd.read_csv(dataset_url)

smoke = (df['currentSmoker']==1)
df.loc[smoke,'cigsPerDay'] = df.loc[smoke,'cigsPerDay'].fillna(df.loc[smoke,'cigsPerDay'].mean())

# Fill the remaining missing values (constant for BPMeds and education, column mean otherwise)
df['BPMeds'] = df['BPMeds'].fillna(0)
df['glucose'] = df['glucose'].fillna(df['glucose'].mean())
df['totChol'] = df['totChol'].fillna(df['totChol'].mean())
df['education'] = df['education'].fillna(1)
df['BMI'] = df['BMI'].fillna(df['BMI'].mean())
df['heartRate'] = df['heartRate'].fillna(df['heartRate'].mean())

features = df.iloc[:,:-1]
result = df.iloc[:,-1] # the last column (TenYearCHD) is the label to predict

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(features, result, test_size = 0.2, random_state = 14)

# RandomForest classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Create a selector object that will use the random forest classifier to identify
# features whose importance is at least 0.12 (SelectFromModel keeps importance >= threshold)
sfm = SelectFromModel(clf, threshold=0.12)

# Train the selector
sfm.fit(X_train, y_train)

# Features selected
feat_labels = list(features.columns.values) # creating a list with features' names
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])

# Feature importance
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking =")
for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Keep only the important features. X_important_train.shape[1] gives their number
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

clf_important = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf_important.fit(X_important_train, y_train)

# Compute the accuracy of the final model (the one trained on the selected features)
y_pred = clf_important.predict(X_important_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

# Export the model
os.makedirs('./outputs/model', exist_ok=True)

filename = './outputs/model/chd-rf-model'
with open(filename, 'wb') as f:
    pickle.dump(clf_important, f)
print("Model saved to", filename)
print("Saving model completed")

Writing ./monprojet/train.py
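
The feature-selection step in `train.py` relies on a simple rule: `SelectFromModel` keeps the features whose importance is greater than or equal to the threshold. The sketch below reproduces that rule in plain Python. The importance values are made up for illustration and are not results from the actual run.

```python
# Hypothetical importances (they sum to 1.0); NOT taken from the trained model
importances = {
    "age": 0.30,
    "sysBP": 0.25,
    "prevalentHyp": 0.15,
    "diaBP": 0.12,
    "glucose": 0.08,
    "BMI": 0.06,
    "totChol": 0.04,
}
threshold = 0.12  # same threshold as in train.py

# Keep features whose importance is greater than or equal to the threshold
selected = [name for name, value in importances.items() if value >= threshold]

# Rank all features from most to least important
ranking = sorted(importances.items(), key=lambda item: item[1], reverse=True)

print("Selected:", selected)
for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank}. {name} ({value:.2f})")
```

With these made-up numbers, `diaBP` sits exactly on the threshold and is kept. Note that the model saved by `train.py` was trained on the selected columns only, so any later scoring code has to feed it the same columns.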


In [20]:
import pandas as pd
url = ('https://raw.githubusercontent.com/retkowsky/WorkshopMLOps/master/framingham.csv')
df = pd.read_csv(url)
df.head(5)

,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


## 6. Run

In [21]:
# Define tags for the run
tagsdurun = {"Type": "test" , "Langage" : "Python" , "Framework" : "Scikit-Learn", "Team" : "DataScience" , "Pays" : "France"}

In [22]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

src = ScriptRunConfig(source_directory=project_dir, script='train.py')

# Compute Target
src.run_config.target = compute_target.name

# Set environment
src.run_config.environment = training_venv
 
run = experiment.submit(config=src, tags=tagsdurun)
run

Experiment,Id,Type,Status,Details Page,Docs Page
monexperimentation,monexperimentation_1602575538_84ebca82,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


> Allow about 2 minutes of processing time

### Widget to track the run's progress

In [23]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Checking the run's progress

In [37]:
run.get_status()

'Completed'
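
Instead of re-running `run.get_status()` by hand, the status can be polled until it reaches a terminal value. The sketch below is a generic helper, not part of the Azure ML SDK: `wait_for_terminal_status` is a hypothetical name, and the demonstration uses a fake status sequence so that it runs without a workspace. The SDK also offers `run.wait_for_completion()`, which blocks until the run ends.

```python
import time

# Terminal statuses reported by Azure ML runs
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=10.0,
                             timeout_seconds=600.0, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"still '{status}' after {waited:.0f} s")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status sequence (no Azure connection needed)
fake_statuses = iter(["Starting", "Running", "Finalizing", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(fake_statuses), poll_seconds=0.0, sleep=lambda seconds: None
)
print(final_status)  # Completed
```

In the notebook, the call would look like `wait_for_terminal_status(run.get_status)`.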

In [38]:
run.get_details()

{'runId': 'monexperimentation_1602575538_84ebca82',
 'target': 'monclustercpu',
 'status': 'Completed',
 'startTimeUtc': '2020-10-13T07:59:17.814717Z',
 'endTimeUtc': '2020-10-13T08:02:02.98967Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '0abc68e1-ae04-4ca0-aa19-e91cff0861be',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': [],
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'monclustercpu',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'environment': {'name': 'training_env',
   'version': 'Autosave_2020-10-13T07:52:21Z_13b660a6',
   'python': {'interpreterPath': 'python',
    'us
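
The `startTimeUtc` and `endTimeUtc` fields above are enough to compute how long the run took. A minimal sketch, assuming both timestamps use the format shown in this output (UTC, trailing `Z`, at most six fractional digits); the helper name `run_duration_seconds` is hypothetical.

```python
from datetime import datetime

# Format of the timestamps in the output above (assumed to be stable)
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(details: dict) -> float:
    """Return the elapsed time between startTimeUtc and endTimeUtc, in seconds."""
    start = datetime.strptime(details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the run.get_details() output above
details = {
    "startTimeUtc": "2020-10-13T07:59:17.814717Z",
    "endTimeUtc": "2020-10-13T08:02:02.98967Z",
}
duration = run_duration_seconds(details)
print(f"{duration:.1f} s")  # 165.2 s
```

In the notebook, `run_duration_seconds(run.get_details())` would give the same figure, a little under 3 minutes for this run.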

In [39]:
# Status of the compute cluster nodes
compute_target.list_nodes()

[{'nodeId': 'tvmps_acf50f4fa1807c7df2240e8b769cf8ea6fe893276aaecb7682a3a735ba74c436_d',
  'port': 50000,
  'publicIpAddress': '20.54.177.218',
  'privateIpAddress': '10.0.0.4',
  'nodeState': 'idle'}]

In [40]:
# get_status() returns the latest status of the AmlCompute target
compute_target.get_status().serialize()

{'currentNodeCount': 1,
 'targetNodeCount': 1,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 1,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2020-10-13T07:50:13.010000+00:00',
 'errors': None,
 'creationTime': '2020-10-13T07:48:26.754906+00:00',
 'modifiedTime': '2020-10-13T07:48:42.685390+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 1,
  'maxNodeCount': 2,
  'nodeIdleTimeBeforeScaleDown': 'PT1200S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_D12_V2'}
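
The `nodeIdleTimeBeforeScaleDown` value `'PT1200S'` is an ISO 8601 duration; it corresponds to the `idle_seconds_before_scaledown=1200` passed when the cluster was created. The sketch below converts such a string to a `timedelta`. It handles only the time-only form (`PT` followed by optional hours, minutes and seconds), which is an assumption that covers the value shown here but not every ISO 8601 duration.

```python
import re
from datetime import timedelta

# Time-only ISO 8601 durations such as PT1200S, PT20M or PT1H30M
_TIME_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def parse_time_duration(text: str) -> timedelta:
    """Parse a time-only ISO 8601 duration into a timedelta."""
    match = _TIME_DURATION.match(text)
    if match is None or text == "PT":
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


idle = parse_time_duration("PT1200S")
print(idle.total_seconds() / 60)  # 20.0
```

So an idle node is kept for 20 minutes before the cluster scales down, although with `minNodeCount` set to 1 one node always stays allocated.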

## 7. Registering the model with Azure ML

In [42]:
if run.get_status() == 'Completed':
    print("OK!")
    model_run = run.register_model(model_name=model_name,
                                   model_path="outputs/model/chd-rf-model",
                                   tags={"type": "classification", "framework": "scikit", "description": model_description, "run_id": run.id})
    print("Model version:", model_run.version)
else:
    # Fail loudly if the run did not complete
    raise Exception("Run did not complete, status: " + run.get_status())


OK!
Model version: 1


### List of registered models

In [43]:
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'Version =', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

monmodele Version = 1
	 type : classification
	 framework : scikit
	 description : madescription
	 run_id : monexperimentation_1602575538_84ebca82
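
The file registered above was written with `pickle.dump` in `train.py`, so code that consumes it would read it back with `pickle.load` once the file is available locally. The sketch below shows that round trip with a stand-in dictionary instead of the real classifier and a temporary directory instead of the real path; both are assumptions made so that the example runs without scikit-learn or a workspace.

```python
import os
import pickle
import tempfile

# Stand-in object; the real file contains a RandomForestClassifier
stand_in_model = {"name": "monmodele", "kind": "stand-in, not a real classifier"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chd-rf-model")

    # Same pattern as the export step in train.py
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Reading the model back
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)  # True
```

Loading the real model requires a scikit-learn version compatible with the one used for training, and pickle files should only be loaded from trusted sources.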




In [44]:
experiment

Name,Workspace,Report Page,Docs Page
monexperimentation,workshoplcl,Link to Azure Machine Learning studio,Link to Documentation


> End