Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# AML Pipeline with DatabricksStep
To use Databricks as a compute target from [AML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in AML Pipeline.

## Databricks as a Compute Target
### Already available
1.	Running an arbitrary Databricks notebook from the customer's Databricks workspace. The notebook can take inputs and arguments and can produce outputs, and you can chain inputs and outputs using Azure Blob and ADLS Datastores.
2.	Running an arbitrary Python script stored in DBFS for the Azure Databricks (ADB) workspace. The script can take inputs and arguments and can produce outputs.
3.	Databricks method of specifying library dependencies

### Coming soon
1.	Running a JAR job stored in DBFS for the ADB workspace. The job can take inputs and arguments and can produce outputs.
2.	Adding dependent data files needed to run a specific job through DatabricksStep. This will support uploading the files and scripts a job requires to DBFS for the ADB workspace.

## AML and Pipeline SDK-specific imports

In [None]:
import os
import azureml.core  # needed so that azureml.core.VERSION resolves below
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.exceptions import ComputeTargetException
from azureml.core import Workspace, Run, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import DatabricksStep
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json
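
As a point of reference, the sketch below prints the three fields that a workspace `config.json` typically holds. The values are placeholders to replace with your own. In practice the file is usually downloaded from the Azure portal or written with `Workspace.write_config()` rather than typed by hand.

```python
import json

# Placeholder values; replace them with the details of your own AML workspace.
config = {
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group>",
    "workspace_name": "<workspace_name>",
}

# Render the dictionary the way it would appear inside config.json.
config_text = json.dumps(config, indent=4)
print(config_text)
```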

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Create and attach Compute targets
1. You need to create an Azure Databricks workspace in the same subscription as your AML workspace or use an existing one. [Click here](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces) for more information.
2. Next, you need to add your Databricks workspace to AML as a compute target and give it a name. You will use this name to refer to your Databricks workspace compute target inside AML.

  - **databricks_compute_name**: The name you choose for your Databricks compute target in AML.
  - **databricks_resource_id**: The Azure resource ID for your Databricks workspace. This is in the form: `/subscriptions/b8c23406-f9b5-4ccb-8a65-a8cb5dcd6a5a/resourceGroups/alondatabricksrg/providers/Microsoft.Databricks/workspaces/alondatabricks`
  - **databricks_access_token**: You need to manually create a Databricks access token by connecting to your Databricks workspace in a web browser and copying the token value here. See [this](https://docs.databricks.com/api/latest/authentication.html#generate-a-token) for more information.
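
The resource ID follows a fixed pattern, so it can be assembled from its three parts instead of being copied by hand. The helper below is a minimal sketch; `build_databricks_resource_id` is a hypothetical name, not part of the AML SDK.

```python
def build_databricks_resource_id(subscription_id, resource_group, workspace_name):
    """Assemble the Azure resource ID of a Databricks workspace.

    Hypothetical helper for illustration; it only formats a string and
    does not check that the workspace exists.
    """
    return (
        "/subscriptions/{}/resourceGroups/{}"
        "/providers/Microsoft.Databricks/workspaces/{}"
    ).format(subscription_id, resource_group, workspace_name)


# Placeholder values; substitute your own.
example_id = build_databricks_resource_id(
    "<subscription_id>", "<resource_group>", "<databricks_workspace_name>"
)
print(example_id)
```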

In [None]:
databricks_compute_name = "<databricks_compute_name>"
databricks_resource_id = "<databricks_resource_id>"
databricks_access_token = "<databricks_access_token>"

try:
    databricks_compute = ComputeTarget(workspace=ws, name=databricks_compute_name)
    print('Compute target already exists')
except ComputeTargetException:
    print('compute not found')
    print('databricks_compute_name {}'.format(databricks_compute_name))
    print('databricks_resource_id {}'.format(databricks_resource_id))
    # The access token is a secret, so it is deliberately not printed to the notebook output.
    databricks_compute = DatabricksCompute.attach(
             workspace=ws,
             name=databricks_compute_name,
             resource_id=databricks_resource_id,
             access_token=databricks_access_token
         )

    databricks_compute.wait_for_completion(True)
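
Because the access token is a secret, one option is to read it from an environment variable rather than pasting it into the notebook. The sketch below assumes a hypothetical variable name, `AML_DATABRICKS_ACCESS_TOKEN`; the helpers `read_setting` and `mask_secret` are also illustrative and not part of any SDK.

```python
import os


def read_setting(env_name, placeholder):
    """Return the environment variable if it is set and non-empty, else the placeholder."""
    value = os.environ.get(env_name, "").strip()
    return value or placeholder


def mask_secret(secret, visible=4):
    """Keep the first `visible` characters and replace the rest with asterisks."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


# Hypothetical environment variable name; falls back to the notebook placeholder.
databricks_access_token = read_setting(
    "AML_DATABRICKS_ACCESS_TOKEN", "<databricks_access_token>"
)
print("databricks_access_token {}".format(mask_secret(databricks_access_token)))
```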

## What you will need
### Sample Databricks notebook to test functionality

```python
dbutils.widgets.get("myparam")
p = getArgument("myparam")
print ("Param -\'myparam':")
print (p)

dbutils.widgets.get("input")
i = getArgument("input")
print ("Param -\'input':")
print (i)

dbutils.widgets.get("output")
o = getArgument("output")
print ("Param -\'output':")
print (o)

n = i + "/test"
df = spark.read.csv(n)

display (df)

data = [('value1', 'value2')]
df2 = spark.createDataFrame(data)

z = o + "/test"
df2.write.csv(z)
```

### Data Connections with Inputs and Outputs

#### Interacting with inputs and outputs from a Databricks notebook
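The Databricks step supports Azure Blob and ADLS Datastore inputs and outputs. You can interact with them from your Databricks notebook either by accessing the URIs directly or by mounting the storage location; both approaches are described below. For background on the underlying Databricks data sources, see: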
1. [Azure Blob Storage](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html)
2. [Azure Data Lake Storage](https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html)


#### Direct Data Access
Databricks allows you to interact with Azure Blob or ADLS URIs directly. Each input or output URI is mapped to a Databricks widget parameter in the notebook. For example, a data reference named "myinput" becomes a widget parameter that holds the input URI, which you can read directly in a Databricks Python notebook like so:

```python
dbutils.widgets.get("myinput")
y = getArgument("myinput")
df = spark.read.csv(y)
```
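
The sample notebook builds paths by concatenating strings (`i + "/test"`), which yields a doubled slash if the URI already ends in one. A small helper can normalize the join. This is a minimal sketch: `join_uri` is a hypothetical name, and the `wasbs://` URI is only an illustrative value, not necessarily what the step passes in.

```python
def join_uri(base_uri, *parts):
    """Join a storage URI with relative path segments using single slashes."""
    # Drop empty segments and strip slashes so the join never doubles them.
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join([base_uri.rstrip("/")] + cleaned)


# Illustrative URI; in the notebook this value would come from getArgument("myinput").
example_uri = "wasbs://<container>@<account>.blob.core.windows.net/data/"
print(join_uri(example_uri, "/test"))
```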

#### Mounting
You will be supplied with additional parameters and secrets that will enable you to mount your ADLS or Azure Blob input or output location in your Databricks notebook.

##### Azure Blob Mounting
Given an Azure Blob data reference named "myinput" the following widget params will be made available in the Databricks notebook:

```python
dbutils.widgets.get("myinput") # which will contain the input URI
dbutils.widgets.get("myinput_blob_secretname") # which will contain the name of a Databricks secret (in the predefined "amlscope" secret scope) that contains an access key or SAS for the Azure Blob input
dbutils.widgets.get("myinput_blob_config") # which will contain the required configuration for mounting
```

You can bring it all together to mount the Azure Blob like so:

```python
dbutils.widgets.get("myinput")
myinput_uri = getArgument("myinput")

dbutils.widgets.get("myinput_blob_secretname")
myinput_blob_secretname = getArgument("myinput_blob_secretname")

dbutils.widgets.get("myinput_blob_config")
myinput_blob_config = getArgument("myinput_blob_config")

dbutils.fs.mount(
  source = myinput_uri,
  mount_point = "/mnt/input",
  extra_configs = {myinput_blob_config:dbutils.secrets.get(scope = "amlscope", key = myinput_blob_secretname)})
```  
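
Mounting a location that is already mounted fails, so it can help to check the existing mounts first. The sketch below assumes that each entry returned by `dbutils.fs.mounts()` exposes a `mountPoint` attribute. `is_mounted` is a hypothetical helper, and the `MountInfo` tuple is only a stand-in so that the example runs outside Databricks.

```python
from collections import namedtuple


def is_mounted(mounts, mount_point):
    """Return True if any entry in `mounts` is mounted at `mount_point`."""
    return any(m.mountPoint == mount_point for m in mounts)


# Stand-in for the objects returned by dbutils.fs.mounts(), used only for this demo.
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
existing = [MountInfo("/mnt/input", "wasbs://<container>@<account>.blob.core.windows.net")]

print(is_mounted(existing, "/mnt/input"))
print(is_mounted(existing, "/mnt/other"))
```

Inside a Databricks notebook, the same check would guard the mount call: `if not is_mounted(dbutils.fs.mounts(), "/mnt/input"):` followed by the `dbutils.fs.mount(...)` call shown above.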
  
##### ADLS Mounting
Given an ADLS data reference named "myinput" the following widget params will be made available in the Databricks notebook:

```python
dbutils.widgets.get("myinput") # which will contain the input URI
dbutils.widgets.get("myinput_adls_clientid") # which will contain the client id for the service principal that has access to the ADLS input
dbutils.widgets.get("myinput_adls_secretname") # which will contain the name of a Databricks secret (in the predefined "amlscope" secret scope) that contains the secret for the above mentioned service principal
dbutils.widgets.get("myinput_adls_refresh_url") # which will contain the refresh url for the mounting configs
```
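
The widget parameters for Blob and ADLS data references follow the same pattern: the data reference name, then an underscore and a fixed suffix. The sketch below derives the widget names from a data reference name. `widget_names` is a hypothetical helper, and the suffix lists are taken from the parameters listed in this notebook.

```python
def widget_names(data_reference_name, store_type):
    """Map a data reference name to the widget parameter names described in this notebook."""
    suffixes = {
        "blob": ["blob_secretname", "blob_config"],
        "adls": ["adls_clientid", "adls_secretname", "adls_refresh_url"],
    }
    if store_type not in suffixes:
        raise ValueError("store_type must be 'blob' or 'adls', got {!r}".format(store_type))

    # The URI widget has the same name as the data reference itself.
    names = {"uri": data_reference_name}
    for suffix in suffixes[store_type]:
        names[suffix] = "{}_{}".format(data_reference_name, suffix)
    return names


print(widget_names("myinput", "adls"))
```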

You can bring it all together to mount ADLS like this:

```python
dbutils.widgets.get("myinput")
myinput_uri = getArgument("myinput")

dbutils.widgets.get("myinput_adls_clientid")
myinput_adls_clientid = getArgument("myinput_adls_clientid")

dbutils.widgets.get("myinput_adls_secretname")
myinput_adls_secretname = getArgument("myinput_adls_secretname")

dbutils.widgets.get("myinput_adls_refresh_url")
myinput_adls_refresh_url = getArgument("myinput_adls_refresh_url")

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": myinput_adls_clientid,
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "amlscope", key =myinput_adls_secretname),
           "dfs.adls.oauth2.refresh.url": myinput_adls_refresh_url}

dbutils.fs.mount(
  source = myinput_uri,
  mount_point = "/mnt/output",
  extra_configs = configs)
```

## Use Databricks from AML Pipeline
### Create/Register a Datastore 

In [None]:
# Use this cell to programmatically create a Datastore from an Azure blob. Fill the relevant values below.
datastore_name = "<datastore_name>"
container_name = "<container_name>"
storage_account = "<storage_account>"
storage_account_key = "<storage_account_key>"

try:
    ds = Datastore.get(workspace=ws, datastore_name=datastore_name)
    print('Datastore already exists')
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('Creating datastore')
    ds = Datastore.register_azure_blob_container(ws, datastore_name, container_name,
                                                 account_name=storage_account, account_key=storage_account_key,
                                                 create_if_not_exists=True)

### Add a DatabricksStep
Adds a Databricks notebook as a step in a Pipeline.
- ***name:** Name of the Module
- **inputs:** List of input connections for data consumed by this step. Fetch this inside the notebook using dbutils.widgets.get("input")
- **outputs:** List of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get("output")
- **spark_version:** Spark version for the Databricks run cluster. Default value: 4.0.x-scala2.11
- **node_type:** Azure VM node type for the Databricks run cluster. Default value: Standard_D3_v2
- **num_workers:** Number of workers for the Databricks run cluster
- **autoscale:** The autoscale configuration for the Databricks run cluster
- **spark_env_variables:** Spark environment variables for the Databricks run cluster (dictionary of {str:str}). Default value: {'PYSPARK_PYTHON': '/databricks/python3/bin/python3'}
- ***notebook_path:** Path to the notebook in the Databricks instance.
- **notebook_params:** Parameters for the Databricks notebook (dictionary of {str:str}). Fetch this inside the notebook using dbutils.widgets.get("myparam")
- **run_name:** Name in Databricks for this run
- **timeout_seconds:** Timeout for the Databricks run
- **maven_libraries:** Maven libraries for the Databricks run
- **pypi_libraries:** PyPI libraries for the Databricks run
- **egg_libraries:** Egg libraries for the Databricks run
- **jar_libraries:** JAR libraries for the Databricks run
- **rcran_libraries:** CRAN (R) libraries for the Databricks run
- **databricks_compute:** Azure Databricks compute
- **databricks_compute_name:** Name of Azure Databricks compute

\* *denotes required fields*  
*You must provide exactly one of the num_workers or autoscale parameters*  
*You must provide exactly one of databricks_compute or databricks_compute_name parameters*
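
The two "exactly one of" rules above can be checked before the step is constructed. The helper below is a minimal sketch of such a check. `require_exactly_one` is a hypothetical name and is not how the SDK itself validates its arguments.

```python
def require_exactly_one(**options):
    """Return the name of the single option that is not None, or raise ValueError."""
    provided = [name for name, value in options.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            "Provide exactly one of {}; got {}".format(
                sorted(options), sorted(provided) or "none"
            )
        )
    return provided[0]


# A fixed-size cluster: num_workers is set and autoscale is left out.
print(require_exactly_one(num_workers=1, autoscale=None))
```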

In [None]:
step_1_input = DataReference(datastore=ds, path_on_datastore="test",
                                     data_reference_name="input")

step_1_output = PipelineData("output", datastore_name=datastore_name)

notebook_path = os.environ.get("AML_DATABRICKS_NOTEBOOK_PATH", "<databricks_notebook_path>")

dbStep = DatabricksStep(
    name="databricksmodule",
    inputs=[step_1_input],
    outputs=[step_1_output],
    num_workers=1,
    notebook_path=notebook_path,
    notebook_params={'myparam': 'testparam'},
    run_name='demo run name',
    databricks_compute=databricks_compute,
    allow_reuse=False
)
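
Since `notebook_params` is described as a dictionary of {str:str}, numeric or boolean values need to be converted to strings before they are passed to the step. The sketch below shows one way to do that. `to_notebook_params` is a hypothetical helper, and the parameter names other than `myparam` are made up for illustration.

```python
def to_notebook_params(params):
    """Convert every key and value to str, matching the {str:str} shape of notebook_params."""
    return {str(key): str(value) for key, value in params.items()}


# 'epochs' and 'ratio' are illustrative parameters, not part of the sample notebook.
example_params = to_notebook_params({"myparam": "testparam", "epochs": 10, "ratio": 0.5})
print(example_params)
```

Inside the Databricks notebook the values arrive as strings, so they would need to be converted back, for example with `int(getArgument("epochs"))`.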

### Build and Submit the Experiment

In [None]:
# list of steps to run
steps = [dbStep]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'Databricks demo experiment').submit(pipeline)
#pipeline_run.wait_for_completion()

#### View Run Details

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(pipeline_run).show()

# Next: ADLA as a Compute Target
To use ADLA as a compute target from AML Pipeline, an AdlaStep is used. This [notebook](./06.use-adla-as-compute-target.ipynb) demonstrates the use of AdlaStep in AML Pipeline.