# Create a training pipeline

Assuming you have a compute cluster named `cpu-cluster`, this notebook creates a pipeline that trains the model and accepts the target country as a parameter.
Intermediate results are stored in the default datastore.

The pipeline will have the following parameters:
- country: The target country to train the model for
- target-day: The reference day to load the data from

The pipeline will have 3 steps:
- `step_00_prepare_data.py`: Loads the country-specific data and transforms it into a lightgbm Dataset.
- `step_01_train_model.py`: Trains the lightgbm model and stores the model and the feature importance in the output folder.
- `step_02_register_model.py`: Registers all artifacts stored by the previous step in the Model registry. Note that the model artifacts contain both the joblib model and the feature importance.

Note: Normally the scripts of each step would live in a separate folder, so that only the scripts needed by that step are uploaded. In this example, all steps share one folder and we upload all of them. This means that every time we change a single script, the [whole folder is snapshotted](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb) again.

In [None]:
# Variables used in script
compute_cluster_name='cpu-cluster'
pipeline_name='train-churn-model-pipeline'

In [None]:
# [Optionally] upgrade to latest AzureML SDK. If the next cell throws errors,
# you may need to restart the compute
# !pip install --upgrade azureml-sdk

In [None]:
import azureml.core

from azureml.core import Workspace, Experiment, Datastore, Environment
from azureml.core.runconfig import RunConfiguration
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.data.data_reference import DataReference
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.widgets import RunDetails
from azureml.train.estimator import Estimator
import os

print("Azure ML SDK Version: ", azureml.core.VERSION)

In [None]:
# Connect to workspace and get resource references
ws = Workspace.from_config()
compute_cluster = ComputeTarget(workspace=ws, name=compute_cluster_name)
datastore = ws.get_default_datastore()

In [None]:
# Define the pipeline parameters 
country_pipeline_param = PipelineParameter(name="country", default_value="GR")
target_day_param = PipelineParameter(name="target-day", default_value="2020-11-26")

## Python environment
We are going to create a single Python environment in which all steps will run.
Ideally, we would have a separate requirements.txt file per step (e.g. instead of training_environment we would have step_00_environment, step_01_environment, etc.).

In [None]:
# Create an environment with the pip requirements
training_environment = Environment.from_pip_requirements('training_environment', 'training_pipeline_requirements.txt')
# Create a run config that we will use in our steps
training_run_config = RunConfiguration()
training_run_config.environment = training_environment
# For more samples about environments have a look at
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training/using-environments/using-environments.ipynb

## Step 00 Data preparation
The script accepts three arguments:
- country: This is a pipeline parameter
- target-day: The date partition from which the training dataset is read
- output-path: Where to store the processed dataset

Within the script, the output-path is just a folder.
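A minimal sketch of how `step_00_prepare_data.py` might read these arguments, assuming it uses `argparse`. The script itself is not shown in this notebook, so the helper names `parse_args` and `ensure_output_folder` are illustrative. Note that `argparse` exposes `--target-day` as `args.target_day` and `--output-path` as `args.output_path`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the three arguments that the pipeline step passes to the script."""
    parser = argparse.ArgumentParser(description="Prepare the training data")
    parser.add_argument("--country", type=str, required=True)
    parser.add_argument("--target-day", type=str, required=True)
    parser.add_argument("--output-path", type=str, required=True)
    return parser.parse_args(argv)


def ensure_output_folder(path):
    """Create the output folder if it does not exist yet and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    args = parse_args()
    folder = ensure_output_folder(args.output_path)
    print("Preparing data for", args.country, "on", args.target_day, "into", folder)
```

Creating the folder explicitly is a defensive choice: the script then works both when the folder already exists and when it does not.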

In [None]:
# We are going to store the prepared data as an intermediate step
training_data_path = PipelineData(
    "training_data",
    datastore=datastore,
    is_directory=True
)

# This is the first step
step_00 = PythonScriptStep(
    'step_00_prepare_data.py',
    source_directory='.',
    name='Prepare data',
    compute_target=compute_cluster,
    arguments=[
        "--country", country_pipeline_param, 
        "--target-day", target_day_param,
        "--output-path", training_data_path
        ],
    runconfig=training_run_config,
    outputs=[training_data_path],
    allow_reuse=True # Allow reuse of the data prep step
)

## Step 01 Training
The script has the following inputs:
- input-path: A folder that contains the processed dataset
- output-path: A folder to store the model and the feature importance dataset
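
Because both paths are plain folders inside the scripts, the hand-off between the two steps can be tried locally without Azure ML. The sketch below stands in for the two scripts with hypothetical functions (`prepare_data`, `train_model`) and hypothetical file names (`train.csv`, `model.txt`); it does not train anything and only shows one step writing into a folder that the next step reads from.

```python
import os
import tempfile


def prepare_data(output_path):
    """Stand-in for step 00: write a placeholder dataset into the output folder."""
    os.makedirs(output_path, exist_ok=True)
    with open(os.path.join(output_path, "train.csv"), "w") as f:
        f.write("feature,label\n1.0,0\n2.0,1\n")


def train_model(input_path, output_path):
    """Stand-in for step 01: read from the input folder, write to the output folder."""
    with open(os.path.join(input_path, "train.csv")) as f:
        n_rows = len(f.read().splitlines()) - 1  # exclude the header line
    os.makedirs(output_path, exist_ok=True)
    with open(os.path.join(output_path, "model.txt"), "w") as f:
        f.write(f"placeholder model trained on {n_rows} rows\n")
    return n_rows


# Two temporary folders play the role of the intermediate pipeline outputs
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as model_dir:
    prepare_data(data_dir)
    rows = train_model(data_dir, model_dir)
    produced = os.listdir(model_dir)

print(rows, produced)
```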

In [None]:
# We are going to store the trained model and the feature importance as an intermediate output
model_store_path = PipelineData(
    "model_store",
    datastore=datastore,
    is_directory=True
)

# This is the second step
step_01 = PythonScriptStep(
    'step_01_train_model.py',
    source_directory='.',
    name='Train churn model',
    compute_target=compute_cluster,
    arguments=[
        "--input-path", training_data_path, 
        "--output-path", model_store_path
        ],
    runconfig=training_run_config,
    inputs=[training_data_path],
    outputs=[model_store_path],
    allow_reuse=True # Allow reuse of the training step
)

## Step 02 Register model
This script has the following arguments:
- input-path: The folder that contains the model and the feature importance dataset
- country: The same pipeline parameter that is passed to Step 00. It is appended to the registered model name.


In [None]:
# This is the third step
step_02 = PythonScriptStep(
    'step_02_register_model.py',
    source_directory='.',
    name='Register churn model',
    compute_target=compute_cluster,
    arguments=[
        "--input-path", model_store_path, 
        "--country", country_pipeline_param, 
    ],
    runconfig=training_run_config,
    inputs=[model_store_path],
    allow_reuse=True # Allow reuse of the registration step
)

## Create the pipeline
Now that all steps are ready, we create and publish a pipeline that can be invoked through Azure Data Factory or a simple REST API call, or run on a schedule.
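
For the REST option, the request body carries an experiment name and the values of the pipeline parameters. The sketch below only builds that body, following the `ExperimentName` / `ParameterAssignments` shape used in the Azure ML SDK v1 documentation. The helper `build_pipeline_request` and the experiment name are illustrative assumptions. The body would be POSTed to the published pipeline's endpoint together with an authentication header for the workspace, which is not shown here.

```python
import json


def build_pipeline_request(experiment_name, country, target_day):
    """Build the JSON body for submitting a run of the published pipeline."""
    return {
        "ExperimentName": experiment_name,
        "ParameterAssignments": {
            # Keys must match the PipelineParameter names defined earlier
            "country": country,
            "target-day": target_day,
        },
    }


# Hypothetical experiment name, used only for illustration
body = build_pipeline_request("train-churn-model", "GR", "2020-11-26")
print(json.dumps(body, indent=2))
```

Any parameter left out of `ParameterAssignments` would fall back to the default value declared on its `PipelineParameter`.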

In [None]:
from azureml.pipeline.core import PublishedPipeline

# Disable previously published pipelines that have the same name
for existing_pipeline in PublishedPipeline.list(workspace=ws):
    if existing_pipeline.name == pipeline_name:
        existing_pipeline.disable()
        print("Disabled pipeline with id ", existing_pipeline.id, " and name ", existing_pipeline.name)

In [None]:
pipeline = Pipeline(workspace=ws, steps=[step_00, step_01, step_02])

published_pipeline = pipeline.publish(
    pipeline_name, 
    description="Train a churn model for a specific country")