# Azure Machine Learning example notebook

This example notebook shows how to train a model that classifies flowers using the Iris dataset.
Requirements for using this notebook:
- required: an Azure Portal account
- required: an Azure ML workspace
- required: a compute target
- optional: an existing datastore (the default one can be used instead)
- optional: a YAML export of a virtual environment (`conda env export --name azureml > environment.yml`) or a `requirements.txt` file


### Steps:
- Importing packages
- Defining variables
- Accessing the workspace
------------------------------
- Creating an environment
- Creating a datastore
- Uploading the dataset to the datastore
- Registering the dataset
- Registering the environment
- Accessing the cluster
------------------------------
- Defining the pipeline steps
- Creating the pipeline
- Creating the experiment with the pipeline
- Publishing the pipeline

## Importing packages

In [None]:
import os
from azureml.core import Workspace, Datastore, Experiment, Dataset, ScriptRunConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.environment import Environment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core.graph import PipelineParameter
from azureml.train.estimator import Estimator
from azureml.core.runconfig import RunConfiguration
from msrest.exceptions import HttpOperationError

## Defining variables

In [None]:
workspace_config_path = './azureml_files/configuration/workspace_config.json'
environment_requirements_path = './azureml_files/configuration/requirements.txt'
environment_name = 'iris_env'
datastore_name = 'iris'
dataset_path = './dataset/'
dataset_file_name = 'iris.csv'
dataset_register_name = 'iris_dataset'
dataset_description = 'iris dataset'
train_file_name = 'train.py'
experiment_name = 'iris_training_pipeline'
experiment_description = 'Iris training pipeline with decision tree'
project_folder = 'azureml_files'
model_name = 'iris_decision_tree_model'
compute_target_name = 'training-compute'

## Accessing the workspace

In [None]:
azureml_workspace = Workspace.from_config(workspace_config_path)
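
`Workspace.from_config` reads a JSON file that identifies the workspace. As a minimal sketch, the check below assumes the file follows the standard layout downloaded from the Azure portal (keys `subscription_id`, `resource_group`, and `workspace_name`). The helper name `missing_workspace_keys` is hypothetical and not part of the SDK, and the values in the demonstration are placeholders.

```python
import json
import os
import tempfile

# Keys expected in the workspace config file downloaded from the Azure portal.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(config_path):
    """Return the required keys that are absent or empty in the config file."""
    with open(config_path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file that deliberately omits workspace_name.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "workspace_config.json")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        json.dump(
            {"subscription_id": "<subscription-id>", "resource_group": "<resource-group>"},
            demo_file,
        )
    demo_missing = missing_workspace_keys(demo_path)

print(demo_missing)
```

Running a check like this before calling `Workspace.from_config` can make a malformed config file easier to diagnose than the SDK error alone.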

## Creating an environment

In [None]:
environment = Environment.from_pip_requirements(name = environment_name, file_path = environment_requirements_path)
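
The environment is built from the packages listed in `requirements.txt`, so it can help to inspect that file before creating the environment. The sketch below is only an illustration: the helper name `list_requirements` is hypothetical, and the package names in the demonstration file are placeholders rather than the actual contents of this project's requirements file.

```python
import os
import tempfile


def list_requirements(requirements_path):
    """Return requirement lines, skipping blank lines and comments."""
    requirements = []
    with open(requirements_path, encoding="utf-8") as requirements_file:
        for line in requirements_file:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                requirements.append(stripped)
    return requirements


# Demonstration with a throwaway file; the package names are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "requirements.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("# training dependencies\nscikit-learn\n\npandas\n")
    demo_requirements = list_requirements(demo_path)

print(demo_requirements)
```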

## Creating a datastore

In [None]:
# Try to fetch the named datastore from the workspace.
try:
    datastore = Datastore.get(azureml_workspace, datastore_name)
except HttpOperationError:
    # Fall back to the workspace default datastore if the named one is not found.
    error_message = 'Datastore "{}" not found in the "{}" workspace. Using default datastore.'
    print(error_message.format(datastore_name, azureml_workspace.name))
    datastore = azureml_workspace.get_default_datastore()

## Uploading the dataset to the datastore

In [None]:
datastore.upload_files(
    files = [os.path.join(dataset_path, dataset_file_name)],
    target_path = dataset_register_name,
    overwrite = True,
    show_progress=True
)

## Registering the dataset

In [None]:
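# Use a separate name so the local dataset_path defined earlier is not overwritten,
# and join with "/" because datastore paths use forward slashes on every OS.
dataset_datastore_path = (datastore, '{}/{}'.format(dataset_register_name, dataset_file_name))
tabular_dataset = Dataset.Tabular.from_delimited_files(path = dataset_datastore_path)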

tabular_dataset.register(
    workspace = azureml_workspace, 
    name = dataset_register_name,
    description = dataset_description,
    tags = {'format' : 'CSV'},
    create_new_version = True
)
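
Paths inside a datastore use forward slashes regardless of the local operating system, whereas `os.path.join` uses the local separator. As a small sketch of the difference, the block below builds the same relative path with `posixpath.join` (always `/`) and with `ntpath.join` (the Windows flavor of `os.path.join`). The two variable values mirror the ones defined in this notebook.

```python
import ntpath
import posixpath

dataset_register_name = "iris_dataset"
dataset_file_name = "iris.csv"

# posixpath.join always uses "/", which is what a datastore path expects.
datastore_relative_path = posixpath.join(dataset_register_name, dataset_file_name)

# ntpath.join shows what os.path.join would produce on Windows.
windows_style_path = ntpath.join(dataset_register_name, dataset_file_name)

print(datastore_relative_path)
print(windows_style_path)
```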

## Registering the environment

In [None]:
environment.register(workspace = azureml_workspace)

## Accessing the cluster

In [None]:
pipeline_compute_target = ComputeTarget(workspace = azureml_workspace, name = compute_target_name)


## Defining the pipeline steps

In [None]:
# Pipeline parameters can be overridden each time the pipeline is run.
test_size_param = PipelineParameter(name = 'test_size', default_value = 0.1)
max_leaf_nodes_param = PipelineParameter(name = 'max_leaf_nodes_param', default_value = 4)
dataset_name_param = PipelineParameter(name = 'dataset_name_param', default_value = dataset_register_name)
# Output location on the datastore where the training step writes the model.
model_folder = PipelineData('model_folder', datastore = datastore, output_name = 'model_folder')
# Expose the registered dataset to the step under its registered name.
dataset_input = tabular_dataset.as_named_input(dataset_register_name)

arguments_lst = [
    '--model-name', model_name,
    '--dataset-name', dataset_name_param,
    '--output-folder', model_folder,
    '--random-state-test', 0,
    '--random-state-model', 0,
    '--max-leaf-nodes', max_leaf_nodes_param,
    '--test-size', test_size_param
]

estimator = Estimator(
    source_directory = project_folder,
    environment_definition = environment,
    compute_target = pipeline_compute_target,
    entry_script = train_file_name
)

train_step = EstimatorStep(
    name = 'Train model',
    estimator = estimator, 
    estimator_entry_script_arguments = arguments_lst,
    compute_target = pipeline_compute_target,
    inputs = [dataset_input],
    outputs = [model_folder],
    allow_reuse = True
)
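
The entries of `arguments_lst` reach `train.py` as command-line flags. Since `train.py` is not shown in this notebook, the following is only a sketch of how such a script might parse them. The function name `parse_training_args`, the argument types, the defaults, and the `./outputs` folder are assumptions made for illustration.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the flags that the pipeline step passes to the training script."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py argument parsing")
    parser.add_argument("--model-name", type=str, required=True)
    parser.add_argument("--dataset-name", type=str, required=True)
    parser.add_argument("--output-folder", type=str, required=True)
    parser.add_argument("--random-state-test", type=int, default=0)
    parser.add_argument("--random-state-model", type=int, default=0)
    parser.add_argument("--max-leaf-nodes", type=int, default=4)
    parser.add_argument("--test-size", type=float, default=0.1)
    return parser.parse_args(argv)


# Example invocation using the default values defined in this notebook.
demo_args = parse_training_args([
    "--model-name", "iris_decision_tree_model",
    "--dataset-name", "iris_dataset",
    "--output-folder", "./outputs",  # placeholder path, assumed for the demo
    "--max-leaf-nodes", "4",
    "--test-size", "0.1",
])
print(demo_args.max_leaf_nodes, demo_args.test_size)
```

Note that argparse turns each dashed flag into an attribute with underscores, so `--max-leaf-nodes` is read as `demo_args.max_leaf_nodes`.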

In [None]:
dataset_input

## Creating the pipeline

In [None]:
pipeline = Pipeline(workspace = azureml_workspace, steps = [train_step])

## Creating the experiment with the pipeline

In [None]:
pipeline_run = Experiment(azureml_workspace, experiment_name).submit(pipeline)
pipeline_run.wait_for_completion()

## Publishing the pipeline

In [None]:
published_pipeline = pipeline_run.publish_pipeline(
    name = experiment_name,
    description = experiment_description,
    version = "1.0"
)

In [None]:
published_pipeline
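
A published pipeline can be triggered over REST, with pipeline parameters overridden per run. The sketch below only builds the JSON request body and sends nothing. It assumes the request body layout with `ExperimentName` and `ParameterAssignments`, and the helper name `build_pipeline_request` is hypothetical. The keys under `ParameterAssignments` have to match the `name` given to each `PipelineParameter`, which in this notebook are `test_size`, `max_leaf_nodes_param`, and `dataset_name_param`.

```python
import json


def build_pipeline_request(experiment_name, **parameter_overrides):
    """Build the JSON body for triggering a published pipeline run."""
    return {
        "ExperimentName": experiment_name,
        "ParameterAssignments": dict(parameter_overrides),
    }


# Override two of the three pipeline parameters; the example values are arbitrary.
request_body = build_pipeline_request(
    "iris_training_pipeline",
    test_size=0.2,
    max_leaf_nodes_param=8,
)
print(json.dumps(request_body, indent=2))
```

The body would be sent in an authenticated POST to the published pipeline's REST endpoint (`published_pipeline.endpoint`). Parameters left out of `ParameterAssignments` keep their default values.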