# **Rain prediction Australia**

*Context*  

Predict whether it will rain the next day by training models with Azure AutoML. The target variable is RainTomorrow.

In [16]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input, MLClient
from pprint import pprint
from azure.ai.ml.entities import ResourceConfiguration

In [17]:
credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential)

Found the config file in: /config.json


##### **1. Read and split data**

The dataset was split into train (80%) and test (20%) sets. The test split is saved as the validation data for the AutoML job.

In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split

weather_data = pd.read_csv('https://raw.githubusercontent.com/sharonmaygua/rain_prediction/main/weatherAUS_ML.csv')
train_data, test_data = train_test_split(weather_data, test_size=0.2, random_state=11)

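import os

# Make sure the target folders exist before writing the CSV files
os.makedirs("./data-ml-table/training-ml-table", exist_ok=True)
os.makedirs("./data-ml-table/validation-ml-table", exist_ok=True)

train_data.to_csv("./data-ml-table/training-ml-table/train_data.csv", index=False)
test_data.to_csv("./data-ml-table/validation-ml-table/valid_data.csv", index=False)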

In [33]:
# Read the data in MLTable format
training_mltable_path   = "./data-ml-table/training-ml-table/"
validation_mltable_path = "./data-ml-table/validation-ml-table/"

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)
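Each folder passed as an `AssetTypes.MLTABLE` input is expected to contain an `MLTable` definition file next to the CSV, and the notebook does not show how those files were created. The sketch below writes a minimal one. The helper name `write_mltable` and the `read_delimited` settings (comma delimiter, `ascii` encoding) are assumptions for illustration, not taken from the original project.

```python
from pathlib import Path

# Minimal MLTable definition: one CSV file, parsed as comma-delimited text.
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{csv_name}
transformations:
  - read_delimited:
        delimiter: ','
        encoding: 'ascii'
"""


def write_mltable(folder: str, csv_name: str) -> Path:
    """Create `folder` if needed and write an MLTable file that points at `csv_name`."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    mltable_file = target / "MLTable"
    mltable_file.write_text(MLTABLE_TEMPLATE.format(csv_name=csv_name), encoding="utf-8")
    return mltable_file
```

In this notebook it would be called as `write_mltable(training_mltable_path, "train_data.csv")` and `write_mltable(validation_mltable_path, "valid_data.csv")` before the `Input` objects are created.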

##### **2. Configure the classification job**

A classification job was used to train a model that predicts "RainTomorrow". Several models are trained on the training data, and the one with the best performance on the validation data, according to the primary metric, is selected as the final model.

- The compute cluster is a general-purpose **Standard_D2ds_v4** (2 of 4 cores available, 8 GB RAM, 75 GB storage). This size suits the classification task and a dataset that does not need much storage, while still allowing two nodes to be used to parallelize the process.

- To leave room for further experimentation with the models, the whole experiment was limited to 30 minutes.
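One caveat about the two-node setup: the job summary printed further down reports `max_concurrent_trials: 1`, which suggests trials ran one at a time even though `max_nodes` is 2. A possible adjustment for a later run is sketched below. The `limits` dictionary and the `check_limits` helper are hypothetical, and the rule that concurrency should not exceed the node count is a sanity check assumed here, not a documented requirement.

```python
# Limits for a hypothetical follow-up run. Only max_concurrent_trials is new;
# the other values are the ones used for the job in this notebook.
limits = {
    "timeout_minutes": 30,
    "max_trials": 5,
    "max_nodes": 2,
    "max_concurrent_trials": 2,  # assumption: one trial per node
}


def check_limits(limits: dict) -> None:
    """Raise ValueError if the requested concurrency cannot be used (assumed rules)."""
    concurrent = limits["max_concurrent_trials"]
    if concurrent > limits["max_nodes"]:
        raise ValueError("more concurrent trials than nodes")
    if concurrent > limits["max_trials"]:
        raise ValueError("more concurrent trials than total trials")


check_limits(limits)
# Inside the notebook, the values would then be applied with:
# classification_job.set_limits(**limits)
```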

In [34]:
compute_name = 'general-purpose-Standard-D2ds-v4'
exp_name     = 'rain-prediction'
exp_timeout  = 30

classification_job = automl.classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="RainTomorrow",
    primary_metric="precision_score_weighted",
    enable_model_explainability=True,
    tags={"resource": exp_name}
)

classification_job.set_limits(timeout_minutes=exp_timeout,
                               max_trials=5,
                               max_nodes=2)
classification_job.resources = ResourceConfiguration(
    instance_type="Standard_D2ds_v4")

print(exp_name)

rain-prediction


In [35]:
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

Uploading training-ml-table (7.42 MBs): 100%|██████████| 7419698/7419698 [00:00<00:00, 17440200.96it/s]

Uploading validation-ml-table (1.86 MBs): 100%|██████████| 1855601/1855601 [00:00<00:00, 13598101.00it/s]



Created job: compute: azureml:general-purpose-Standard-D2ds-v4
creation_context:
  created_at: '2023-10-23T13:23:05.967968+00:00'
  created_by: Carolina Aldunate
  created_by_type: User
display_name: honest_airport_vcbqjnktj2
experiment_name: rain-prediction
id: azureml:/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourceGroups/ws-ml-proyectos/providers/Microsoft.MachineLearningServices/workspaces/rain_prediction_ml/jobs/honest_airport_vcbqjnktj2
limits:
  enable_early_termination: true
  max_concurrent_trials: 1
  max_cores_per_trial: -1
  max_nodes: 2
  max_trials: 5
  timeout_minutes: 30
  trial_timeout_minutes: 30
log_verbosity: info
name: honest_airport_vcbqjnktj2
outputs: {}
primary_metric: precision_score_weighted
properties: {}
resources:
  instance_count: 1
  instance_type: Standard_D2ds_v4
  properties: {}
  shm_size: 2g
services:
  Studio:
    endpoint: https://ml.azure.com/runs/honest_airport_vcbqjnktj2?wsid=/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resour

**Training...**

In [36]:
ml_client.jobs.stream(returned_job.name)

RunId: honest_airport_vcbqjnktj2
Web View: https://ml.azure.com/runs/honest_airport_vcbqjnktj2?wsid=/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourcegroups/ws-ml-proyectos/workspaces/rain_prediction_ml

Execution Summary
RunId: honest_airport_vcbqjnktj2
Web View: https://ml.azure.com/runs/honest_airport_vcbqjnktj2?wsid=/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourcegroups/ws-ml-proyectos/workspaces/rain_prediction_ml



Based on the results, the models trained by AutoML are: StackEnsemble, VotingEnsemble, MaxAbsScaler-XGBoostClassifier, MaxAbsScaler-LightGBM and MaxAbsScaler-ExtremeRandomTrees. The best model, *StackEnsemble*, reaches a weighted precision of about 0.8554 on the validation data.
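For reference, `precision_score_weighted` (the primary metric of this job) averages the per-class precision, weighting each class by its share of the true labels. A minimal pure-Python sketch follows. The function name and the six toy labels are made up for illustration and are not data from this experiment.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's support."""
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # A class that is never predicted contributes a precision of 0.
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        score += (n_true / total) * precision
    return score


# Toy labels (illustrative only).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
# Class 0: precision 3/4 with weight 4/6; class 1: precision 1/2 with weight 2/6.
print(weighted_precision(y_true, y_pred))  # roughly 0.667
```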

##### **4. Deployment**
Initialize the MLflow client

In [37]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

azureml://eastus.api.azureml.ms/mlflow/v1.0/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourceGroups/ws-ml-proyectos/providers/Microsoft.MachineLearningServices/workspaces/rain_prediction_ml


In [38]:
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

In [39]:
from mlflow.tracking.client import MlflowClient
from mlflow.artifacts import download_artifacts

# Initialize MLFlow client
mlflow_client = MlflowClient()

In [40]:
# Get the job
job = ml_client.jobs.get(name=returned_job.name)
mlflow_parent_run = mlflow_client.get_run(job.name)
print(mlflow_parent_run)

<Run: data=<RunData: metrics={'AUC_macro': 0.893833034462931,
 'AUC_micro': 0.9366814331106983,
 'AUC_weighted': 0.893833034462931,
 'accuracy': 0.8626728475771623,
 'average_precision_score_macro': 0.8555227825580028,
 'average_precision_score_micro': 0.9368934858850606,
 'average_precision_score_weighted': 0.9162125999568085,
 'balanced_accuracy': 0.7491855803562811,
 'f1_score_macro': 0.7763349889147229,
 'f1_score_micro': 0.8626728475771623,
 'f1_score_weighted': 0.8538303884857857,
 'log_loss': 0.32208101016425095,
 'matthews_correlation': 0.5677680900579292,
 'norm_macro_recall': 0.49837116071256227,
 'precision_score_macro': 0.8234141835445727,
 'precision_score_micro': 0.8626728475771623,
 'precision_score_weighted': 0.8553784165669999,
 'recall_score_macro': 0.7491855803562811,
 'recall_score_micro': 0.8626728475771623,
 'recall_score_weighted': 0.8626728475771623,
 'weighted_accuracy': 0.922317166980469}, params={}, tags={'automl_best_child_run_id': 'honest_airport_vcbqjnktj2

In [41]:
# Select the best model

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
best_run = mlflow_client.get_run(best_child_run_id)
best_run

<Run: data=<RunData: metrics={'AUC_macro': 0.893833034462931,
 'AUC_micro': 0.9366814331106983,
 'AUC_weighted': 0.893833034462931,
 'accuracy': 0.8626728475771623,
 'average_precision_score_macro': 0.8555227825580028,
 'average_precision_score_micro': 0.9368934858850606,
 'average_precision_score_weighted': 0.9162125999568085,
 'balanced_accuracy': 0.7491855803562811,
 'f1_score_macro': 0.7763349889147229,
 'f1_score_micro': 0.8626728475771623,
 'f1_score_weighted': 0.8538303884857857,
 'log_loss': 0.32208101016425095,
 'matthews_correlation': 0.5677680900579292,
 'norm_macro_recall': 0.49837116071256227,
 'precision_score_macro': 0.8234141835445727,
 'precision_score_micro': 0.8626728475771623,
 'precision_score_weighted': 0.8553784165669999,
 'recall_score_macro': 0.7491855803562811,
 'recall_score_micro': 0.8626728475771623,
 'recall_score_weighted': 0.8626728475771623,
 'weighted_accuracy': 0.922317166980469}, params={}, tags={'mlflow.parentRunId': 'honest_airport_vcbqjnktj2',
 'm
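Several of the logged metrics are tied together, which allows a quick consistency check. With C classes, `norm_macro_recall` is the macro recall rescaled so that random guessing scores 0: (recall_macro - 1/C) / (1 - 1/C). The sketch below plugs in the values printed above. The helper name is hypothetical, and the formula is stated as an assumption about how the metric is defined.

```python
import math

# Values copied from the run output above.
recall_score_macro = 0.7491855803562811
balanced_accuracy = 0.7491855803562811
logged_norm_macro_recall = 0.49837116071256227
n_classes = 2  # RainTomorrow is binary


def norm_macro_recall(recall_macro: float, n_classes: int) -> float:
    """Rescale macro recall so that chance level maps to 0 and a perfect score to 1."""
    chance = 1.0 / n_classes
    return (recall_macro - chance) / (1.0 - chance)


computed = norm_macro_recall(recall_score_macro, n_classes)
# The recomputed value agrees with the logged one up to floating-point error.
print(math.isclose(computed, logged_norm_macro_recall, rel_tol=1e-9))
# Balanced accuracy is logged with the same value as macro recall.
print(balanced_accuracy == recall_score_macro)
```

The same output also shows `accuracy`, `precision_score_micro`, `recall_score_micro` and `f1_score_micro` sharing one value (0.8627), as expected for single-label classification.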

**Download run's artifacts**

In [52]:
# Download run's artifacts
import os

# Create local folder (no error if it already exists)
local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)
# Download run's artifacts/outputs
local_path = download_artifacts(
    run_id=best_run.info.run_id, artifact_path="outputs", dst_path=local_dir
)

# Show the contents of the MLFlow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

['conda.yaml', 'MLmodel', 'model.pkl', 'python_env.yaml', 'requirements.txt']

**Configure the endpoint**

In [83]:
# import required libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ProbeSettings,
)

# Creating a unique endpoint name with current datetime to avoid conflicts
import datetime

online_endpoint_name = "rain-prediction" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is a sample online endpoint for deploying model",
    auth_mode="key",
    tags={"foo": "bar"},
)
print(online_endpoint_name)

rain-prediction10231400701537
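The suffix comes from `strftime("%m%d%H%M%f")`: month, day, hour, minute and microseconds, so the name printed above decodes to 23 October, 14:00, plus 701537 microseconds. The sketch below reproduces that construction with a fixed timestamp and adds a format check. The helper names are hypothetical, the seconds value is arbitrary because the format does not include it, and the naming rule encoded in the pattern (starts with a letter, then only letters, digits and hyphens, 3 to 32 characters in total) is an assumption that should be confirmed against the Azure documentation.

```python
import datetime
import re

# Assumed naming rule, not taken from the notebook.
ENDPOINT_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")


def make_endpoint_name(prefix: str, now: datetime.datetime) -> str:
    """Append month, day, hour, minute and microseconds to the prefix."""
    return prefix + now.strftime("%m%d%H%M%f")


def is_valid_endpoint_name(name: str) -> bool:
    """Check the name against the assumed pattern."""
    return ENDPOINT_NAME_PATTERN.fullmatch(name) is not None


# Timestamp chosen to match the name printed by the notebook.
stamp = datetime.datetime(2023, 10, 23, 14, 0, 0, 701537)
name = make_endpoint_name("rain-prediction", stamp)
print(name)  # rain-prediction10231400701537
print(is_valid_endpoint_name(name))
```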


In [84]:
# Create an endpoint
ml_client.begin_create_or_update(endpoint).result()

.........................

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://rain-prediction10231400701537.eastus.inference.ml.azure.com/score', 'openapi_uri': 'https://rain-prediction10231400701537.eastus.inference.ml.azure.com/swagger.json', 'name': 'rain-prediction10231400701537', 'description': 'this is a sample online endpoint for deploying model', 'tags': {'foo': 'bar'}, 'properties': {'azureml.onlineendpointid': '/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourcegroups/ws-ml-proyectos/providers/microsoft.machinelearningservices/workspaces/rain_prediction_ml/onlineendpoints/rain-prediction10231400701537', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/oe:1b640eff-727a-497b-801c-e11b4ab09ec7:a8a980f9-a0be-44de-96d5-9e921f4fbc82?api-version=2022-02-01-preview'}, 'print_as_yaml': True, 'id': '/

...

In [92]:
# Register model
model_name = "rain-prediction-model"
model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/",
    name=model_name,
    description="my sample rain prediction task",
    type=AssetTypes.MLFLOW_MODEL,
)
registered_model = ml_client.models.create_or_update(model)

registered_model.id

'/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourceGroups/ws-ml-proyectos/providers/Microsoft.MachineLearningServices/workspaces/rain_prediction_ml/models/rain-prediction-model/versions/8'

The online deployment runs on a general-purpose CPU instance of size x-small.

In [93]:
from azure.ai.ml.entities import OnlineRequestSettings

# Setting the request timeout 90s
req_timeout = OnlineRequestSettings(request_timeout_ms=90000)


deployment = ManagedOnlineDeployment(
    name='rain-prediction-deploy',
    endpoint_name=online_endpoint_name,
    model=registered_model,
    instance_type="Standard_D2as_v4",  
    instance_count=1,
    request_settings=req_timeout
)

In [94]:
ml_client.online_deployments.begin_create_or_update(deployment).result()

Check: endpoint rain-prediction10231400701537 exists

..............................................................................................

ManagedOnlineDeployment({'private_network_connection': None, 'provisioning_state': 'Succeeded', 'endpoint_name': 'rain-prediction10231400701537', 'type': 'Managed', 'name': 'rain-prediction-deploy', 'description': None, 'tags': {}, 'properties': {'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/od:1b640eff-727a-497b-801c-e11b4ab09ec7:0a4f8c70-cce6-44b7-9e12-6c81f0460a2b?api-version=2023-04-01-preview'}, 'print_as_yaml': True, 'id': '/subscriptions/3deaa453-5a6c-4bcd-85f1-1645c3ccd539/resourceGroups/ws-ml-proyectos/providers/Microsoft.MachineLearningServices/workspaces/rain_prediction_ml/onlineEndpoints/rain-prediction10231400701537/deployments/rain-prediction-deploy', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/aldunatelipac1/code/Users/aldunatelipac/rain_prediction', 'creation_context': None, 'serial

##### **6. Run inference**

The model was checked with two samples from the validation data (rows 0 and 2 of `test_data`). As shown below, the model predicted both correctly: no rain tomorrow (RainTomorrow = 0) in the first case and rain tomorrow (RainTomorrow = 1) in the second.

In [105]:
import json

print(test_data.iloc[0,:])

request_json = {
    "input_data":{
        "columns":["Location","MinTemp","MaxTemp","Rainfall",'WindGustDir','WindGustSpeed',
                    "WindDir9am","WindDir3pm","WindSpeed9am","WindSpeed3pm","Humidity9am",
                    "Humidity3pm","Pressure9am","Pressure3pm","RainToday"],
        "data":   [{
                    "Location":21.0,
                    "MinTemp":6.1,
                    "MaxTemp":20.7,
                    "Rainfall":0.0,
                    "WindGustDir":0.0,
                    "WindGustSpeed":30.0,
                    "WindDir9am":0.0,
                    "WindDir3pm":8.0,
                    "WindSpeed9am":17.0,
                    "WindSpeed3pm":9.0,
                    "Humidity9am":68.0,
                    "Humidity3pm":39.0,
                    "Pressure9am":1026.0,
                    "Pressure3pm":1021.9,
                    "RainToday":0.0
                  }],
    }
}


request_file_name = "sample_request_data.json"
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
print('Prediction with the best model')
resp

Location           21.0
MinTemp             6.1
MaxTemp            20.7
Rainfall            0.0
WindGustDir         0.0
WindGustSpeed      30.0
WindDir9am          0.0
WindDir3pm          8.0
WindSpeed9am       17.0
WindSpeed3pm        9.0
Humidity9am        68.0
Humidity3pm        39.0
Pressure9am      1026.0
Pressure3pm      1021.9
RainToday           0.0
RainTomorrow        0.0
Name: 12670, dtype: float64
Prediction with the best model


'[0]'

In [108]:
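import json
print(test_data.iloc[2,:])
# Note: the values below were typed by hand. WindSpeed3pm, Humidity9am, Pressure9am
# and Pressure3pm differ from the row printed by this cell, and WindDir9am is rounded.
request_json = {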
    "input_data":{
        "columns":["Location","MinTemp","MaxTemp","Rainfall",'WindGustDir','WindGustSpeed',
                    "WindDir9am","WindDir3pm","WindSpeed9am","WindSpeed3pm","Humidity9am",
                    "Humidity3pm","Pressure9am","Pressure3pm","RainToday"],
        "data":   [{
                    "Location":15.0,
                    "MinTemp":15.1,
                    "MaxTemp":22.7,
                    "Rainfall":5.8,
                    "WindGustDir":6.0,
                    "WindGustSpeed":65.0,
                    "WindDir9am":6.74,
                    "WindDir3pm":5.0,
                    "WindSpeed9am":2.0,
                    "WindSpeed3pm":9.0,
                    "Humidity9am":7.0,
                    "Humidity3pm":82.0,
                    "Pressure9am":1006.0,
                    "Pressure3pm":1002.9,
                    "RainToday":1.0
                  }],
    }
}


request_file_name = "sample_request_data.json"
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
resp

Location           15.000000
MinTemp            15.100000
MaxTemp            22.700000
Rainfall            5.800000
WindGustDir         6.000000
WindGustSpeed      65.000000
WindDir9am          6.742962
WindDir3pm          5.000000
WindSpeed9am        2.000000
WindSpeed3pm        7.000000
Humidity9am        94.000000
Humidity3pm        82.000000
Pressure9am      1006.200000
Pressure3pm      1002.100000
RainToday           1.000000
RainTomorrow        1.000000
Name: 111579, dtype: float64


'[1]'
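Copying feature values into the request by hand is error-prone. In the second request above, for example, `Humidity9am` is sent as 7.0 while the printed row shows 94.0. A sketch of building the payload directly from a row is given below. The helper names `build_request` and `parse_prediction` are hypothetical; the payload mirrors the `input_data` layout used in this notebook, and the response is assumed to be a JSON list such as `'[0]'`.

```python
import json
from typing import Any, Mapping


def build_request(row: Mapping[str, Any], target: str = "RainTomorrow") -> dict:
    """Build the scoring payload from one row, leaving out the target column."""
    features = {name: float(value) for name, value in row.items() if name != target}
    return {"input_data": {"columns": list(features), "data": [features]}}


def parse_prediction(response: str) -> int:
    """Return the first predicted label from a response such as '[0]'."""
    return int(json.loads(response)[0])


# Shortened example row; in the notebook this would be test_data.iloc[0].to_dict()
row = {"Rainfall": 0.0, "Humidity3pm": 39.0, "RainToday": 0.0, "RainTomorrow": 0.0}
request_json = build_request(row)
print(request_json["input_data"]["columns"])
print(parse_prediction("[0]"))
```

The resulting dictionary can be written with `json.dump` and passed to `ml_client.online_endpoints.invoke` exactly as in the cells above.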