Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Using Databricks as a Compute Target from Azure Machine Learning Pipeline
To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.

The notebook will show:
1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace
2. Running an arbitrary Python script that the customer has in DBFS
3. Running an arbitrary Python script that is available on the local computer (it is uploaded to DBFS and then run in Databricks)
4. Running a JAR job that the customer has in DBFS.

## Before you begin:

1. **Create an Azure Databricks workspace** in the same subscription where you have your Azure Machine Learning workspace. You will need details of this workspace later on to define DatabricksStep. [Click here](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces) for more information.
2. **Create PAT (access token)**: Manually create a Databricks access token at the Azure Databricks portal. See [this](https://docs.databricks.com/api/latest/authentication.html#generate-a-token) for more information.
3. **Add demo notebook to ADB**: This notebook has a sample you can use as is. Launch Azure Databricks attached to your Azure Machine Learning workspace and add a new notebook. 
4. **Create/attach a Blob storage** for use from ADB

## Configuration

In [1]:

import os

subscription_id = os.getenv("SUBSCRIPTION_ID", default="03909a66-bef8-4d52-8e9a-a346604e0902")
resource_group = os.getenv("RESOURCE_GROUP", default="AMLtestye")
workspace_name = os.getenv("WORKSPACE_NAME", default="testamlye711")
workspace_region = os.getenv("WORKSPACE_REGION", default="southcentralus")
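
The cell above silently falls back to hard-coded defaults when an environment variable is unset. If you would rather fail early, a minimal sketch of a stricter lookup could look like the following. The helper name `require_env` is hypothetical and not part of the Azure ML SDK; it only uses the Python standard library.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Example with an explicit dict (hypothetical values):
example_env = {"SUBSCRIPTION_ID": "my-subscription", "RESOURCE_GROUP": "my-group"}
settings = require_env(["SUBSCRIPTION_ID", "RESOURCE_GROUP"], example_env)
print(settings["RESOURCE_GROUP"])
```

To use it with the real environment, call `require_env(["SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE_NAME"])` and read the values from the returned dictionary.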

## Access your workspace
The following cell uses the Azure ML SDK to attempt to load the workspace specified by your parameters. If this cell succeeds, your notebook library will be configured to access the workspace from all notebooks using the Workspace.from_config() method. The cell can fail if the specified workspace doesn't exist or you don't have permissions to access it.

In [2]:
from azureml.core import Workspace

try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    # write the details of the workspace to a configuration file to the notebook library
    ws.write_config()
    print("Workspace configuration succeeded. Skip the workspace creation steps below")
except Exception:
    print("Workspace not accessible. Change your parameters or create a new workspace below")

Wrote the config file config.json to: /data/home/adminye/notebooks/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml_config/config.json
Workspace configuration succeeded. Skip the workspace creation steps below


## Azure Machine Learning and Pipeline SDK-specific imports

In [3]:
import os
import azureml.core
from azureml.core.runconfig import JarLibrary
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.exceptions import ComputeTargetException
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import DatabricksStep
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.2


## Initialize Workspace

Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

In [4]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

Found the config file in: /data/home/adminye/notebooks/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml_config/config.json
testamlye711
AMLtestye
southcentralus
03909a66-bef8-4d52-8e9a-a346604e0902
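
If `Workspace.from_config()` cannot find the configuration file, it can help to check where the file actually is. The sketch below looks in two places taken from this notebook: `config.json` in the starting directory (as mentioned in the text) and `aml_config/config.json` (as shown in the cell output). The function name `find_workspace_config` is hypothetical, and the SDK itself may search other locations as well.

```python
import os


def find_workspace_config(start_dir="."):
    """Return the first existing config file path under start_dir, or None."""
    candidates = [
        os.path.join(start_dir, "config.json"),
        os.path.join(start_dir, "aml_config", "config.json"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


config_path = find_workspace_config(".")
if config_path is None:
    print("No workspace config file found; run ws.write_config() first.")
else:
    print("Workspace config file:", config_path)
```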


## Attach Databricks compute target
Next, you need to add your Databricks workspace to Azure Machine Learning as a compute target and give it a name. You will use this name to refer to your Databricks workspace compute target inside Azure Machine Learning.

- **Resource Group** - The resource group name of your Azure Machine Learning workspace
- **Databricks Workspace Name** - The workspace name of your Azure Databricks workspace
- **Databricks Access Token** - The access token you created in ADB

**The Databricks workspace needs to be present in the same subscription as your AML workspace.**

In [6]:
# Replace with your account info before running.
 
db_compute_name=os.getenv("DATABRICKS_COMPUTE_NAME", "testye") # Name of the Databricks compute target in Azure ML
db_resource_group=os.getenv("DATABRICKS_RESOURCE_GROUP", "AMLtestye2") # Databricks resource group
db_workspace_name=os.getenv("DATABRICKS_WORKSPACE_NAME", "testAMLye") # Databricks workspace name
db_access_token=os.getenv("DATABRICKS","dapic7ffd9afe482076bbd884cc745027123") # Databricks access token
 
try:
    databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)
    print('Compute target {} already exists'.format(db_compute_name))
except ComputeTargetException:
    print('Compute not found, will use below parameters to attach new one')
    print('db_compute_name {}'.format(db_compute_name))
    print('db_resource_group {}'.format(db_resource_group))
    print('db_workspace_name {}'.format(db_workspace_name))
    # Do not print the token itself; only report whether it is set
    print('db_access_token {}'.format('<hidden>' if db_access_token else '<not set>'))
 
    config = DatabricksCompute.attach_configuration(
        resource_group = db_resource_group,
        workspace_name = db_workspace_name,
        access_token= db_access_token)
    databricks_compute=ComputeTarget.attach(ws, db_compute_name, config)
    databricks_compute.wait_for_completion(True)


Compute target testye already exists
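
Access tokens are best kept out of notebook output, because saved notebooks are often shared. A minimal sketch of a masking helper for log messages follows. The name `mask_secret` and the choice to keep four leading characters are assumptions made for illustration.

```python
def mask_secret(value, visible=4):
    """Return value with everything after the first `visible` characters starred out."""
    if not value:
        return ""
    if len(value) <= visible:
        # Too short to show any part safely
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Example with a made-up token
print("db_access_token {}".format(mask_secret("dapi1234567890")))
```

You could then log `mask_secret(db_access_token)` wherever the raw token would otherwise appear.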


## Use Databricks from Azure Machine Learning Pipeline
To use Databricks as a compute target from Azure Machine Learning Pipeline, a DatabricksStep is used. Let's define a datasource (via DataReference) and intermediate data (via PipelineData) to be used in DatabricksStep.

In [7]:
from msrest.exceptions import HttpOperationError

blob_datastore_name='MyBlobDatastore'

account_name=os.getenv("BLOB_ACCOUNTNAME_62", "amltestyediag") # Storage account name
container_name=os.getenv("BLOB_CONTAINER_62", "movielens") # Name of Azure blob container
account_key=os.getenv("BLOB_ACCOUNT_KEY_62", "6YSqy31inkACB1o8lV+mXS+ph+Na7LBMW3HidOZ3wUkHCFJBtMeVW6hkvzgxhKv9waezK4qPfsw4TFPILx1oVw==") # Storage account key

# Connect the data in the blob container to a Datastore
try:
    blob_datastore = Datastore.get(ws, blob_datastore_name)
    print("found blob datastore with name: %s" % blob_datastore_name)
except HttpOperationError:
    blob_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=blob_datastore_name,
        account_name=account_name, # Storage account name
        container_name=container_name, # Name of Azure blob container
        account_key=account_key) # Storage account key
    print("registered blob datastore with name: %s" % blob_datastore_name)
print('Datastore {} will be used'.format(blob_datastore.name))


found blob datastore with name: MyBlobDatastore
Datastore myblobdatastore will be used


In [8]:
# Use the default blob storage
#def_blob_store = Datastore(ws, "workspaceblobstore")
#print('Datastore {} will be used'.format(def_blob_store.name))

# We are uploading a sample file in the local directory to be used as a datasource
#def_blob_store.upload_files(files=["./testdata.txt"], target_path="dbtest", overwrite=False)

step_1_input = DataReference(datastore=blob_datastore, path_on_datastore="movielens",
                                     data_reference_name="input")

step_1_output = PipelineData("output", datastore=blob_datastore)

In [None]:
blob_datastore = Datastore.get(ws, blob_datastore_name)
print("found blob datastore with name: %s" % blob_datastore_name)

### 1. Running the demo notebook already added to the Databricks workspace
Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable "DATABRICKS_NOTEBOOK_PATH". This will then set the variable `notebook_path` when you run the code cell below:

In [9]:
from azureml.core.runconfig import RunConfiguration

runconfig = RunConfiguration()
runconfig.load(path='.', name='library')

<azureml.core.runconfig.RunConfiguration at 0x7f62d9f64c50>

In [10]:
notebook_path=os.getenv("DATABRICKS_NOTEBOOK_PATH", "/Users/yexing@microsoft.com/als_deep_dive") # Databricks notebook path

dbNbStep = DatabricksStep(
    name="DBNotebookInWS", #name of the step
    inputs=[step_1_input],
    outputs=[step_1_output],
    num_workers=1,
    notebook_path=notebook_path,
    notebook_params={'myparam': 'testparam'},  # parameters passed to the notebook; readable there via dbutils.widgets.get("myparam")
    run_name='DB_Notebook_demo',
    compute_target=databricks_compute,
    runconfig=runconfig,
    allow_reuse=False
)
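
Inside the Databricks notebook, values passed through `notebook_params` are typically read as widgets. The sketch below wraps that lookup so a missing parameter falls back to a default. `read_notebook_param` and `FakeWidgets` are hypothetical names; `FakeWidgets` exists only so the code can be tried outside Databricks, where `dbutils` is not available.

```python
def read_notebook_param(widgets, name, default=None):
    """Return widgets.get(name), or default if the widget is not defined.

    In a Databricks notebook, pass dbutils.widgets as `widgets`.
    """
    try:
        return widgets.get(name)
    except Exception:
        # The lookup raises when no widget with this name exists
        return default


class FakeWidgets:
    """Local stand-in for dbutils.widgets, for illustration only."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name):
        return self._values[name]  # raises KeyError when missing


widgets = FakeWidgets({"myparam": "testparam"})
print(read_notebook_param(widgets, "myparam"))
print(read_notebook_param(widgets, "not_defined", default="fallback"))
```

In the Databricks notebook itself the call would be `read_notebook_param(dbutils.widgets, "myparam")`.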

#### Build and submit the Experiment

In [12]:
steps = [dbNbStep]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'DB_Notebook_demo').submit(pipeline)
pipeline_run.wait_for_completion()

Created step DBNotebookInWS [5aa645cf][38e22330-0146-4d24-9cb6-9a95011c762f], (This step will run and generate new outputs)
Using data reference input for StepId [c91f1265][15efa834-d51e-49d6-8fe5-c53b5fc81110], (Consumers of this data are eligible to reuse prior runs.)
Submitted pipeline run: 5b2515b8-878f-4890-b73e-aac9647997e6
status:Running
..................................................................................................
status:Failed


'Failed'
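
As the output above shows, `wait_for_completion()` returns the final status as a string and does not necessarily raise when the run fails. If you want later cells to stop on failure, a small check such as the following sketch can be used. The helper name `raise_if_failed` is hypothetical, and only the `'Failed'` status is taken from the output above; add other states if your SDK version reports them.

```python
def raise_if_failed(status, failed_states=("Failed",)):
    """Raise RuntimeError if status is a failure state, otherwise return it."""
    if status in failed_states:
        raise RuntimeError("Pipeline run ended with status: {}".format(status))
    return status


# Example with a literal status string
try:
    raise_if_failed("Failed")
except RuntimeError as err:
    print(err)
```

With a real run this would be `raise_if_failed(pipeline_run.wait_for_completion())`.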

#### View Run Details

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

### 2. Running a Python script that is already added in DBFS
To run a Python script that is already uploaded to DBFS, follow the instructions below. You will first upload the Python script to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).

The code in the cell below assumes that you have uploaded `als_deep_dive.py` to the folder "scripts" in DBFS. You can create that folder and upload the script with the following commands, so that you can use `python_script_path = "dbfs:/scripts/als_deep_dive.py"`:

```
dbfs mkdirs dbfs:/scripts
dbfs cp ./als_deep_dive.py dbfs:/scripts/
```
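
The DBFS path used in the next cell follows directly from the folder and file name in the commands above. A minimal sketch of building such a path without hand-editing strings is shown here; the helper name `dbfs_uri` is hypothetical.

```python
import posixpath


def dbfs_uri(*parts):
    """Join path parts into a dbfs:/ URI, ignoring stray leading or trailing slashes."""
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    return "dbfs:" + posixpath.join("/", *cleaned)


print(dbfs_uri("scripts", "als_deep_dive.py"))
```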

In [None]:
python_script_path = "dbfs:/scripts/als_deep_dive.py"

dbPythonInDbfsStep = DatabricksStep(
    name="DBPythonInDBFS",
    inputs=[step_1_input],
    num_workers=1,
    python_script_path=python_script_path,
    python_script_params=['--input_data'],  # list of command-line arguments for the script
    run_name='DB_Python_demo',
    compute_target=databricks_compute,
    allow_reuse=False
)
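
On the script side, the parameters arrive as ordinary command-line arguments. The sketch below shows one way `als_deep_dive.py` could read them with `argparse`. It assumes `--input_data` is a boolean flag, since it is passed without a value here, and it uses `parse_known_args` so that any additional arguments appended by the step are tolerated rather than causing an error. The function name `parse_script_args` is hypothetical.

```python
import argparse


def parse_script_args(argv=None):
    """Parse known arguments and return (args, unknown_arguments)."""
    parser = argparse.ArgumentParser(description="Example argument parsing for a Databricks script")
    # Assumption: --input_data is a switch without a value
    parser.add_argument("--input_data", action="store_true",
                        help="set when input data is provided to the script")
    return parser.parse_known_args(argv)


# Example with an explicit argument list instead of sys.argv
args, unknown = parse_script_args(["--input_data", "--extra", "value"])
print(args.input_data, unknown)
```

Calling `parse_script_args()` with no argument reads from `sys.argv`, which is what the script would do when run by Databricks.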

#### Build and submit the Experiment

In [None]:
steps = [dbPythonInDbfsStep]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'DB_Python_demo').submit(pipeline)
pipeline_run.wait_for_completion()

#### View Run Details

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

### 3. Running a Python script in Databricks that is currently on the local computer
To run a Python script that is currently on your local computer, follow the instructions below.

The code below assumes that `als_deep_dive.py` is in the directory given by `source_directory`, which here is the current working directory.

In this case, the Python script will be uploaded first to DBFS, and then the script will be run in Databricks.

In [None]:
python_script_name = "als_deep_dive.py"
source_directory = "."  # folder containing the script; it does not need to be named "scripts"

dbPythonInLocalMachineStep = DatabricksStep(
    name="DBPythonInLocalMachine",
    inputs=[step_1_input],
    num_workers=1,
    python_script_name=python_script_name,
    source_directory=source_directory,
    run_name='DB_Python_Local_demo',
    compute_target=databricks_compute,
    allow_reuse=False
)
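
Because the script is uploaded from the local machine, a wrong `source_directory` or file name only surfaces once the pipeline is built or submitted. A minimal sketch of checking the path up front follows; the helper name `resolve_script` is hypothetical and uses only the standard library.

```python
import os


def resolve_script(source_directory, script_name):
    """Return the absolute path of the script, or raise if it does not exist."""
    path = os.path.join(source_directory, script_name)
    if not os.path.isfile(path):
        raise FileNotFoundError("Script not found: {}".format(path))
    return os.path.abspath(path)


# Example: report the problem instead of failing later in the pipeline
try:
    print(resolve_script(".", "als_deep_dive.py"))
except FileNotFoundError as err:
    print(err)
```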

#### Build and submit the Experiment

In [None]:
steps = [dbPythonInLocalMachineStep]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'DB_Python_Local_demo').submit(pipeline)
pipeline_run.wait_for_completion()

#### View Run Details

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()