# Run a Notebook on Databricks

This notebook runs a notebook on Azure Databricks by attaching a `DatabricksCompute` target and submitting a `DatabricksStep` within an Azure Machine Learning pipeline.

It is based on the Azure Machine Learning [example notebook for using Databricks as a compute target](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-use-databricks-as-compute-target.ipynb).

In [None]:
import os
import azureml.core
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.exceptions import ComputeTargetException
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import DatabricksStep
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
from azureml.core.runconfig import PyPiLibrary, JarLibrary, EggLibrary
# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

In [None]:
# Path of the notebook in the Databricks workspace; override it with DATABRICKS_NOTEBOOK_PATH.
notebook_path = os.getenv("DATABRICKS_NOTEBOOK_PATH", "/Users/jeremr@microsoft.com/parallel_top10/rescore_top10")

In [None]:
# http://eastus.azuredatabricks.net/files/top10/aml_config/config.json?o=4604276322347170
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

In [None]:
# Set these environment variables, or replace the empty defaults with your account info, before running.

db_compute_name=os.getenv("DATABRICKS_COMPUTE_NAME", "") # Databricks compute name
db_resource_group=os.getenv("DATABRICKS_RESOURCE_GROUP", "") # Databricks resource group
db_workspace_name=os.getenv("DATABRICKS_WORKSPACE_NAME", "") # Databricks workspace name
db_access_token=os.getenv("DATABRICKS_ACCESS_TOKEN", "") # Databricks access token
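
Because each setting above falls back to an empty string, a missing environment variable is likely to surface only later, when the compute target is looked up or attached. A minimal sketch of an early check follows. The helper `missing_settings` is hypothetical and not part of the Azure ML SDK; the variable names are the ones read in the cell above.

```python
import os

# Environment variable names read by the settings cell above.
REQUIRED_SETTINGS = [
    "DATABRICKS_COMPUTE_NAME",
    "DATABRICKS_RESOURCE_GROUP",
    "DATABRICKS_WORKSPACE_NAME",
    "DATABRICKS_ACCESS_TOKEN",
]


def missing_settings(env=None):
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    # Only the names are reported, never the values, so the token is not echoed.
    print("Set these environment variables before attaching:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.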


In [None]:
try:
    databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)
    print('Compute target {} already exists'.format(db_compute_name))
except ComputeTargetException:
    print('Compute not found, will use below parameters to attach new one')
    print('db_compute_name {}'.format(db_compute_name))
    print('db_resource_group {}'.format(db_resource_group))
    print('db_workspace_name {}'.format(db_workspace_name))
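    # Report only whether a token was supplied; never print the secret itself.
    print('db_access_token {}'.format('<set>' if db_access_token else '<missing>'))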
    config = DatabricksCompute.attach_configuration(
        resource_group=db_resource_group,
        workspace_name=db_workspace_name,
        access_token=db_access_token)
    databricks_compute = ComputeTarget.attach(ws, db_compute_name, config)
    databricks_compute.wait_for_completion(True)

## Run a Scoring Script

Assumptions for running this script:

- `notebook_path` exists in the Databricks workspace.
- The dependencies of the notebook at `notebook_path` exist.
- The specified libraries exist on DBFS. These should honor the defaults in the Databricks setup scripts in [../../scripts](../../scripts).
- The default notebook performs the operationalization steps of the [als_movie_o16n.ipynb](als_movie_o16n.ipynb) notebook: it de-serializes the estimated ALS model, creates recommendations, then writes them to CosmosDB. The default notebook therefore requires that:
  - A model has already been estimated and serialized. The default location is `dbfs:/FileStore/top10/models/mvl-als-reco.mml`.
  - A `secrets.json` file is available on the cluster to control access to the CosmosDB instance that the notebook writes to.

The easiest way to fulfill these requirements is to run the first half of the [als_movie_o16n.ipynb](als_movie_o16n.ipynb) notebook (through Section 2), and then make sure that the locations of the serialized `.mml` file and of `secrets.json` match what the default notebook expects.

The example below creates a new cluster for each run, which can add a substantial amount of startup time. To run against an existing cluster instead, pass its ID to `DatabricksStep` through the `existing_cluster_id` parameter, in place of the new-cluster settings such as `spark_version` and `num_workers`.
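
One way to switch between the two modes is to build the cluster-related keyword arguments separately and unpack them into the `DatabricksStep` call. The sketch below assumes a hypothetical `DATABRICKS_EXISTING_CLUSTER_ID` environment variable and a hypothetical helper `cluster_kwargs`; neither is defined by this notebook or by the Azure ML SDK. The defaults mirror the `spark_version` and `num_workers` values used in the step below.

```python
import os


def cluster_kwargs(existing_cluster_id=None, spark_version="4.3.x-scala2.11", num_workers=8):
    """Return the cluster-related keyword arguments for DatabricksStep (hypothetical helper)."""
    if existing_cluster_id:
        # Target an existing cluster: the new-cluster settings are left out.
        return {"existing_cluster_id": existing_cluster_id}
    # Otherwise describe the new cluster that is created for the run.
    return {"spark_version": spark_version, "num_workers": num_workers}


# DATABRICKS_EXISTING_CLUSTER_ID is an assumed variable name, not one this notebook defines.
kwargs = cluster_kwargs(os.getenv("DATABRICKS_EXISTING_CLUSTER_ID"))
print(kwargs)
```

The resulting dictionary could then be unpacked with `**kwargs` alongside the remaining arguments (`name`, `notebook_path`, `compute_target`, and the library lists).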

**TODO**: Update defaults of `als_movie_o16n.ipynb` to write to a persistent location that can serve as the default of the rescore notebook. Update the rescore notebook to honor those defaults.

In [None]:
# This works, but still requires interactive authentication.

dbNbStep = DatabricksStep(
    name="DBNotebookInWS",
    spark_version="4.3.x-scala2.11",
    num_workers=8,
    notebook_path=notebook_path,
    run_name='rescore_top10',
    pypi_libraries=[PyPiLibrary(package="azureml-sdk[databricks]", repo=None)],
    egg_libraries=[EggLibrary(library="dbfs:/FileStore/jars/Recommenders.egg")],
    jar_libraries=[JarLibrary(library='dbfs:/FileStore/jars/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar')],
    compute_target=databricks_compute,
    allow_reuse=False
)

In [None]:
steps = [dbNbStep]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'rescore_top10').submit(pipeline)
pipeline_run.wait_for_completion()