# AML Pipeline with DataTransferStep
This notebook demonstrates the use of DataTransferStep in an AML Pipeline.

In some cases you need to transfer data from one storage location to another. For example, your data may be in Files storage and you want to move it to Blob storage, or it may be in an ADLS account and you want to make it available in Blob storage. The built-in **DataTransferStep** handles this kind of data movement.

The example below shows how to move data from an ADLS account to Blob storage.

## AML and Pipeline SDK-specific imports

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Run, Experiment
from azureml.core.compute import ComputeTarget, DataFactoryCompute
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
from azureml.exceptions import ComputeTargetException
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import DataTransferStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from the persisted configuration. Make sure the config file is present at `.\config.json`.

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
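
`Workspace.from_config()` reads the workspace coordinates from a JSON file. As a minimal sketch, assuming the usual three keys (`subscription_id`, `resource_group`, `workspace_name`), such a file could be generated as shown here. The `write_workspace_config` helper and the placeholder values are illustrative only and are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace fields and return its path."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Demo in a temporary directory so that no real config file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<my-subscription-id>", "<my-rg>", "<my-workspace>")
    loaded_config = json.loads(config_path.read_text())

print(loaded_config)
```

In a real project you would pass the notebook's working directory instead of a temporary one, so that the file ends up where this notebook expects it.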

## Register Datastores

In [None]:
subscription_id = "<my-subscription-id>"
resource_group = "<my-rg>"
store_name = "<my-storename>"
tenant_id = "<my-tenant>"
client_id = "<my-client-id>"
client_secret = "<my-client-secret>"

adls_datastore = Datastore.register_azure_data_lake(
    workspace=ws,
    datastore_name='MyAdlsDatastore',
    subscription_id=subscription_id, # subscription id of ADLS account
    resource_group=resource_group, # resource group of ADLS account
    store_name=store_name, # ADLS account name
    tenant_id=tenant_id, # tenant id of service principal
    client_id=client_id, # client id of service principal
    client_secret=client_secret) # the secret of service principal
      
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='MyBlobDatastore',
    account_name="<my-account>", # Storage account name
    container_name="<my-container>", # Name of Azure blob container
    account_key="<my-account-key>") # Storage account key

# CLI:
# az ml datastore register-blob -n <datastore-name> -a <account-name> -c <container-name> -k <account-key> [-t <sas-token>]
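
The registration cell takes the service principal credentials as literal strings. One way to keep secrets out of the notebook is to read them from environment variables. The following is a sketch under that assumption: the variable names (`ADLS_TENANT_ID`, `ADLS_CLIENT_ID`, `ADLS_CLIENT_SECRET`) and the `load_service_principal` helper are hypothetical, so adapt them to your own conventions.

```python
import os

# Hypothetical environment variable names; they are not defined by the SDK.
REQUIRED_VARS = ("ADLS_TENANT_ID", "ADLS_CLIENT_ID", "ADLS_CLIENT_SECRET")


def load_service_principal(env=os.environ):
    """Return the service principal settings, failing fast if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "tenant_id": env["ADLS_TENANT_ID"],
        "client_id": env["ADLS_CLIENT_ID"],
        "client_secret": env["ADLS_CLIENT_SECRET"],
    }


# Demo with a plain dict standing in for os.environ.
demo_env = {
    "ADLS_TENANT_ID": "<my-tenant>",
    "ADLS_CLIENT_ID": "<my-client-id>",
    "ADLS_CLIENT_SECRET": "<my-client-secret>",
}
credentials = load_service_principal(demo_env)
print(sorted(credentials))
```

Because the dictionary keys match the keyword arguments used in the registration cell, the result could be unpacked into the call as `**credentials`.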

## Create DataReferences

In [None]:
adls_datastore = Datastore(workspace=ws, name="MyAdlsDatastore")
adls_data_ref = DataReference(
    datastore=adls_datastore,
    data_reference_name="adls_test_data",
    path_on_datastore="testdata")

blob_datastore = Datastore(workspace=ws, name="MyBlobDatastore")
blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="blob_test_data",
    path_on_datastore="testdata")

## Setup Data Factory Account

In [None]:
data_factory_name = 'adftest'

def get_or_create_data_factory(workspace, factory_name):
    try:
        # Reuse the data factory if it is already attached to the workspace
        return DataFactoryCompute(workspace, factory_name)
    except ComputeTargetException as e:
        if 'ComputeTargetNotFound' in e.message:
            print('Data factory not found, creating...')
            provisioning_config = DataFactoryCompute.provisioning_configuration()
            data_factory = ComputeTarget.create(workspace, factory_name, provisioning_config)
            data_factory.wait_for_completion()
            return data_factory
        else:
            # Any other failure is unexpected; re-raise with the original traceback
            raise

data_factory_compute = get_or_create_data_factory(ws, data_factory_name)

# CLI:
# Create: az ml computetarget setup datafactory -n <name>
# BYOC: az ml computetarget attach datafactory -n <name> -i <resource-id>
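
The get-or-create pattern used for the data factory can be tried out without any Azure resources. In the sketch below, an in-memory dictionary stands in for the workspace and `NotFoundError` stands in for a `ComputeTargetException` carrying a not-found message. All names here are illustrative and none of them come from the SDK.

```python
class NotFoundError(Exception):
    """Stand-in for the SDK exception raised when a compute target is missing."""


def get_or_create(name, lookup, create):
    """Return lookup(name); if the resource is missing, create it instead."""
    try:
        return lookup(name)
    except NotFoundError:
        print(name, "not found, creating...")
        return create(name)


# In-memory registry that plays the role of the workspace.
registry = {}


def lookup(name):
    if name not in registry:
        raise NotFoundError(name)
    return registry[name]


def create(name):
    registry[name] = {"name": name, "state": "Succeeded"}
    return registry[name]


first = get_or_create("adftest", lookup, create)   # creates the entry
second = get_or_create("adftest", lookup, create)  # reuses the entry
print(first is second)
```

The second call returns the object created by the first one, which is why the notebook cell is safe to re-run.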

## Create a DataTransferStep

DataTransferStep: Transfers data between Azure Blob and Data Lake accounts.

- **name:** Name of module
- **source_data_reference:** Input connection that serves as source of data transfer operation.
- **destination_data_reference:** Input connection that serves as destination of data transfer operation.
- **compute_target:** Azure Data Factory to use for transferring data.

In [None]:
transfer_adls_to_blob = DataTransferStep(
    name="transfer_adls_to_blob",
    source_data_reference=adls_data_ref,
    destination_data_reference=blob_data_ref,
    compute_target=data_factory_compute)

## Build and Submit the Experiment

In [None]:
pipeline = Pipeline(
    description="data_transfer_101",
    workspace=ws,
    steps=[transfer_adls_to_blob])

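# Name of the experiment that groups the pipeline runs (example value)
run_history_name = 'data_transfer_101'
pipeline_run = Experiment(ws, run_history_name).submit(pipeline)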
pipeline_run.wait_for_completion()

### View Run Details

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

### Examine the run
You can iterate over the step runs and examine the job log, stdout, and stderr of each step.

In [None]:
step_runs = pipeline_run.get_children()
for step_run in step_runs:
    status = step_run.get_status()
    print('node', step_run.name, 'status:', status)
    if status == "Failed":
        joblog = step_run.get_job_log()
        print('job log:', joblog)
        stdout_log = step_run.get_stdout_log()
        print('stdout log:', stdout_log)
        stderr_log = step_run.get_stderr_log()
        print('stderr log:', stderr_log)
        log_path = "logs-" + step_run.name + ".txt"
        with open(log_path, "w") as f:
            f.write(joblog)
        print("Job log written to", log_path)
    if status == "Finished":
        stdout_log = step_run.get_stdout_log()
        print('stdout log:', stdout_log)
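
If you want to keep all three logs rather than only the job log, a small helper can write each one to its own file and skip logs that are empty. This is a sketch, not part of the SDK: `save_step_logs` and the file naming scheme are assumptions, and the log strings in the demo are made up.

```python
import tempfile
from pathlib import Path


def save_step_logs(step_name, logs, out_dir):
    """Write each non-empty log to <out_dir>/logs-<step_name>-<kind>.txt."""
    written = []
    for kind, text in logs.items():
        if not text:  # skip logs that are None or empty
            continue
        path = Path(out_dir) / f"logs-{step_name}-{kind}.txt"
        path.write_text(text)
        written.append(path)
    return written


# Demo with made-up log contents in a temporary directory.
demo_logs = {"job": "transfer started", "stdout": None, "stderr": "timeout"}
with tempfile.TemporaryDirectory() as tmp:
    written = save_step_logs("transfer_adls_to_blob", demo_logs, tmp)
    written_names = [path.name for path in written]
    job_text = written[0].read_text()

print(written_names)
```

Inside the loop above, the dictionary could be filled from `step_run.get_job_log()`, `step_run.get_stdout_log()`, and `step_run.get_stderr_log()`.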