# Configure Azure Databricks for the Reference Architecture for Recommendation Systems

The goal of this notebook is to simplify deployment of the real-time recommendation API on Azure, following the reference architecture described [here](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/real-time-recommendation).

Specifically, this notebook allows you to programmatically do the following:

1. Create a Databricks cluster of the appropriate version.
2. Install the necessary libraries.
3. Upload the primary notebook to the Databricks workspace if it doesn't already exist.

The primary impetus for this is to simplify the creation of the databricks cluster for the end-to-end walkthrough in the notebook [here](./als_movie_o16n.ipynb). As of 2019-01-17, the recommended version of spark can no longer be deployed through the databricks portal because it is deprecated.

## Running this notebook

You can run this notebook in two different ways. In both cases, please see the notes regarding dependencies. 

1. You can step through the notebook manually in Jupyter. When doing so, make sure to adjust your `TOKEN`, `DOMAIN`, and other parameters in the first code cell below.
2. You can install [`papermill`](https://github.com/nteract/papermill) and run it using papermill. When doing so, the easiest approach is to pass the relevant parameters (from the first code cell below) as arguments, e.g.:

```
papermill create_and_configure_cluster.ipynb OUTPUT.ipynb -p TOKEN XXXXXXXXXXXXXXXXXXXXX -p DOMAIN westus.azuredatabricks.net -p cluster_name my_papermill_cluster
```

## Known Issues

You may occasionally see SSL errors, or libraries may fail to install. The best way to fix this is to restart the cluster.
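
As a rough sketch of what a programmatic restart could look like, the helper below only builds the URL and JSON payload for the `clusters/restart` endpoint of the same REST API used later in this notebook. The function name `build_restart_call` and the cluster id are hypothetical placeholders, and nothing is sent over the network.

```python
def build_restart_call(base_url, cluster_id):
    """Return the URL and JSON payload for a clusters/restart request."""
    url = base_url + "clusters/restart"
    payload = {"cluster_id": cluster_id}
    return url, payload


# Placeholder values for illustration only
url, payload = build_restart_call(
    "https://eastus.azuredatabricks.net/api/2.0/", "0000-000000-example0"
)
print(url)
print(payload)
```

Sending it would then be a `requests.post(url, headers=my_header, json=payload)` call, using the header built later in this notebook, after which the library installation cell can be re-run.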

## Dependencies


### Python dependencies

In order to execute this notebook, you must have two python libraries installed:

- `python-dotenv` to load and manage an access token if it is not being run via papermill (see authentication below)
- `requests` to send the REST calls

If you want to run the associated tests, the environment also requires:

- `papermill`

All of these are installed into a conda environment by running the following command in the same directory as this notebook:

```
conda env create -f dbapi_conda.yml
```

### Other dependencies

Additionally, you must also have:

- An azure databricks workspace
- Appropriate permissions to create and modify clusters 
- A personal access token for authentication to the workspace (see below).

We also assume that you have cloned or downloaded the [Microsoft Recommenders repository](https://github.com/Microsoft/Recommenders), which this notebook is part of.
You must adjust the variable `path_to_recommenders_repo_root` so that the repository's `reco_utils` package can be uploaded as a library.

### Authentication Instructions

This section works through an example using a personal access token.

To use this approach, you need to generate a Databricks personal access token as follows:

- Click on the `user` icon on the top right
- Click on the `User Settings` menu item
- Select the `Access Tokens` tab
- Click on `Generate Access Token`
- Fill in the requested fields

Copy that token into a file called `.env` with the format:

```
DB_ACCESS_TOKEN=<THISISMYLONGSTRINGFROMDATABRICKS>
```

If `TOKEN` is left empty, we will later use this file with the `python-dotenv` package to load the access token so that it never appears in the notebook.
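
The lookup order described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `resolve_token` is a hypothetical helper, and it assumes that the `.env` values have already been loaded into the environment (which is what `load_dotenv()` does). The token strings are placeholders.

```python
import os


def resolve_token(explicit_token, env=None):
    """Prefer an explicitly passed token, else fall back to DB_ACCESS_TOKEN."""
    env = os.environ if env is None else env
    token = explicit_token or env.get("DB_ACCESS_TOKEN")
    if not token:
        raise ValueError("No token was passed and DB_ACCESS_TOKEN is not set.")
    return token


# An explicit token wins; an empty one falls back to the environment mapping
print(resolve_token("explicit-placeholder", {"DB_ACCESS_TOKEN": "env-placeholder"}))
print(resolve_token("", {"DB_ACCESS_TOKEN": "env-placeholder"}))
```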


## Parameters

First, set up parameters so that the notebook can be executed programmatically using [`papermill`](https://github.com/nteract/papermill). Note that the following cell has the `parameters` tag.

In [None]:
## variables that need to be updated


## Personal access token for the existing databricks workspace
##     NOTE: If this is left blank, it will then look for a .env file to load 
##           using dotenv. If it is left blank, and no .env is available, 
##           then the notebook will fail.
TOKEN = "" 

## location of the recommenders repository root directory.
## If you have downloaded the Recommenders repository and are running this notebook 
## manually with Jupyter, then the relative path default should be accurate.
## If you have downloaded this file separately
## then you need to adjust the value.
## If you are running this with papermill, the value is relative to where you invoke papermill
path_to_recommenders_repo_root = "../../" 

## record data and outcomes for testing?
record_for_tests = False

## additional things that can be configured:
DOMAIN = 'eastus.azuredatabricks.net'
cluster_name = 'reco-db4.1-api'
node_type_id = "Standard_D3_v2"
min_workers = 2
max_workers = 6
autotermination_minutes = 60
upload_location_for_endtoend_notebook = "/Shared"


## Load modules

In [None]:
## should be available on a standard python install
import os
import json
import base64
import sys
import shutil

## installed for this project:
import requests
## only needed to log results with papermill
if record_for_tests:
    import papermill as pm


In [None]:
print('\n**** Current working directory ****\n%s\n' %(os.getcwd()))
print('\n**** Root of Recommenders repository ****\n%s\n' %(os.path.abspath(path_to_recommenders_repo_root)))
print(sys.executable)

## Set up the base URL and header for the requests

In [None]:
if TOKEN == "":
    ## If the token isn't passed or updated above, try loading it from env variable:
    from dotenv import load_dotenv
    load_dotenv(verbose=True)
    TOKEN = os.getenv('DB_ACCESS_TOKEN') 
    
assert TOKEN, "Token not updated in the notebook, not passed as an argument, and not found as an environment variable."

# setup the base url for the api
BASE_URL = 'https://%s/api/2.0/' % (DOMAIN)
my_header = {"Authorization": b"Basic " + base64.standard_b64encode(b"token:" + str.encode(TOKEN))}
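
To make the header format easier to inspect, here is a small sketch that builds the same kind of Basic authorization value as a plain string and decodes it again. The helper name `build_api_context` and the token value are illustrative assumptions; the encoding (`token:<TOKEN>` in base64) mirrors the cell above.

```python
import base64


def build_api_context(domain, token):
    """Return the API base URL and a Basic auth header for a personal access token."""
    base_url = "https://%s/api/2.0/" % domain
    credentials = base64.standard_b64encode(("token:" + token).encode("utf-8"))
    header = {"Authorization": "Basic " + credentials.decode("ascii")}
    return base_url, header


# Placeholder token for illustration only
base_url, header = build_api_context("eastus.azuredatabricks.net", "EXAMPLE_TOKEN")
print(base_url)

# Decoding the value shows the "token:<TOKEN>" pair that was encoded
encoded = header["Authorization"].split(" ", 1)[1]
print(base64.standard_b64decode(encoded))
```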

## Confirm connectivity works

Call `workspace/list` to confirm that the workspace is reachable and that the token is accepted.

In [None]:
response = requests.get(
    BASE_URL + "workspace/list",
    headers = my_header,
    json={
        "path": "/Users/"
    }
)

In [None]:
response.json()
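
When reading the response, it can help to fail early on an error payload. The sketch below assumes that the API reports failures as JSON containing `error_code` and `message` fields; `DatabricksApiError` and `check_api_payload` are hypothetical names, and both sample payloads are made up for illustration.

```python
class DatabricksApiError(Exception):
    """Hypothetical exception type for error payloads returned by the API."""


def check_api_payload(payload):
    """Return the payload unchanged, or raise if it describes an API error."""
    if isinstance(payload, dict) and "error_code" in payload:
        raise DatabricksApiError(
            "%s: %s" % (payload["error_code"], payload.get("message", ""))
        )
    return payload


# Illustrative payloads, not captured from a real workspace
ok_payload = {"objects": [{"path": "/Users/someone@example.com", "object_type": "DIRECTORY"}]}
error_payload = {"error_code": "RESOURCE_DOES_NOT_EXIST", "message": "Path does not exist."}

print(check_api_payload(ok_payload)["objects"][0]["path"])
try:
    check_api_payload(error_payload)
except DatabricksApiError as err:
    print("API error:", err)
```

In the notebook, this would be applied as `check_api_payload(response.json())`.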

## Set up cluster configuration and create it if necessary

The recommended and tested version of spark is `4.1.x-scala2.11`, and Python 3 clusters are selected by setting `PYSPARK_PYTHON` in `spark_env_vars`. Other fields are configurable, but those two should not be changed unless you are testing.

In [None]:
## setup the config
my_cluster_config = {
  "cluster_name": cluster_name,
  "node_type_id": node_type_id,
  "autoscale" : {
    "min_workers": min_workers,
    "max_workers": max_workers
  },
  "autotermination_minutes": autotermination_minutes,
  "spark_version": "4.1.x-scala2.11",
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  }
}
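
If the configuration is built in several places (for example in tests), wrapping it in a function keeps the two fixed fields in one spot. The following is only a sketch: `make_cluster_config` is a hypothetical helper, and the worker-count check is an added assumption rather than something the notebook itself enforces.

```python
import json


def make_cluster_config(cluster_name, node_type_id, min_workers, max_workers,
                        autotermination_minutes):
    """Build a cluster config with the tested Spark version and Python 3."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": cluster_name,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": autotermination_minutes,
        # The two fields below should stay fixed unless you are testing
        "spark_version": "4.1.x-scala2.11",
        "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    }


config = make_cluster_config("reco-db4.1-api", "Standard_D3_v2", 2, 6, 60)
print(json.dumps(config, indent=2))
```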

## List clusters

This starts by listing the available clusters. If a cluster of the same name already exists, the notebook grabs its `cluster_id` and attempts to continue. If a cluster of that name does NOT already exist, it sends an API call to create it.


In [None]:
response = requests.get(
    BASE_URL + "clusters/list",
    headers = my_header
)
## the 'clusters' key may be missing when the workspace has no clusters yet
cluster_list = response.json().get('clusters', [])

## Search the list for cluster_name

Only create the cluster if a cluster of the same name doesn't already exist.

Databricks allows multiple clusters with the same name, so this checks whether one or more clusters named `cluster_name` already exist. If so, this notebook installs the libraries on the first match.

In [None]:
cluster_ids = [c['cluster_id'] for c in cluster_list if c['cluster_name'] == cluster_name]

if len(cluster_ids) == 0:
    print("""
    no clusters with cluster_name ("""+cluster_name+""") found. 
    Trying to create it...
    """)
    ## Post the request...
    response = requests.post(
        BASE_URL + "clusters/create",
        headers = my_header,
        json=my_cluster_config
    )
    cluster_id = response.json()['cluster_id']
else:
    print("""
    Cluster named """+cluster_name+""" found! 
    Using that one. 
    Note: It may not have the same configuration as defined in this notebook, 
          so you may want to use a different name.
    """)
    if len(cluster_ids) > 1:
        print("""Warning: Multiple clusters with the same name found. Using the first identified.""")
    cluster_id = cluster_ids[0]
    print(cluster_id)

if record_for_tests:
    pm.record('cluster_id', cluster_id)
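
The lookup logic can be tried without a workspace by running it against a hand-written payload. In this sketch, `find_cluster_id` is a hypothetical helper and the cluster ids are made up; it assumes that the `clusters` key may be absent from the `clusters/list` response, and it returns the first match, as the cell above does.

```python
def find_cluster_id(list_payload, cluster_name):
    """Return the id of the first cluster with the given name, or None."""
    clusters = list_payload.get("clusters", [])
    matches = [c["cluster_id"] for c in clusters
               if c.get("cluster_name") == cluster_name]
    return matches[0] if matches else None


# Illustrative payload with two clusters sharing one name
sample = {
    "clusters": [
        {"cluster_id": "0000-000000-aaaa0000", "cluster_name": "other"},
        {"cluster_id": "0000-000000-bbbb0000", "cluster_name": "reco-db4.1-api"},
        {"cluster_id": "0000-000000-cccc0000", "cluster_name": "reco-db4.1-api"},
    ]
}

print(find_cluster_id(sample, "reco-db4.1-api"))
print(find_cluster_id({}, "reco-db4.1-api"))
```

A `None` result corresponds to the branch above that posts to `clusters/create`.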

## Now install libraries

Several libraries need to be installed: a few from PyPI, a specific jar that needs to be uploaded, and an egg that is built from the Recommenders repository. They can all be installed in a single REST call, so first download the relevant data, construct the egg, and upload both files to DBFS.

In [None]:
## paths for downloading and uploading of relevant jars and eggs
cosmosdb_jar_url = 'https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar'
local_jar_filename = cosmosdb_jar_url.split("/")[-1]
upload_cosmosdb_jar_path = "/tmp/"+local_jar_filename
upload_reco_utils_egg_path = "/tmp/reco_utils.egg"

In [None]:
## download the jar
print("*** Downloading cosmosdb jar file to %s" %(local_jar_filename))

jar_response = requests.get(cosmosdb_jar_url, stream=True)
jar_response.raise_for_status()  # stop here rather than saving an error page as the jar
with open(local_jar_filename, "wb") as handle:
    for chunk in jar_response.iter_content(chunk_size=1024): 
        if chunk: # filter out keep-alive new chunks
            handle.write(chunk)
print("*** Done.")

## Upload the jar to the filestore.

This follows the example [here](https://docs.databricks.com/api/latest/examples.html#upload-a-big-file-into-dbfs). The databricks-cli GitHub repository was also used to work out exactly how to encode the [body](https://github.com/databricks/databricks-cli/blob/master/databricks_cli/dbfs/api.py).

In [None]:
def dbfs_rpc(action, my_header, body):
    """ A helper function to make the DBFS API request, request/response is encoded/decoded as JSON """
    response = requests.post(
        BASE_URL + "dbfs/" + action,
        headers=my_header,
        json=body
    )
    return response.json()

def upload_large_file(local_name, upload_path, my_header):
    print("\tUploading the data to %s...\n\tThis can take a few moments..." %(upload_path))
    # Create a handle that will be used to add blocks
    handle = dbfs_rpc("create", 
                      my_header, 
                      {"path": upload_path, "overwrite": "true"})['handle']

    ## go through the blocks...
    with open(local_name, 'rb') as f:
        while True:
            # A block can be at most 1MB
            block = f.read(1 << 20)
            if not block:
                break
            b64data = base64.b64encode(block)
            dbfs_rpc("add-block", my_header, {"handle": handle, "data": b64data.decode()})
            sys.stdout.write('.')
    # close the handle to finish uploading
    dbfs_rpc("close", my_header, {"handle": handle})
    print("Done!")

In [None]:
upload_large_file(local_jar_filename, upload_cosmosdb_jar_path, my_header)
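
The block-wise encoding inside `upload_large_file()` can be checked in isolation, without calling the DBFS API. The sketch below uses a hypothetical generator, `iter_base64_blocks`, and an in-memory file with a deliberately tiny block size; decoding and joining the blocks should reproduce the original bytes.

```python
import base64
import io


def iter_base64_blocks(fileobj, block_size=1 << 20):
    """Yield the file's contents as base64 strings, one per block."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        yield base64.b64encode(block).decode("ascii")


# Ten bytes read four at a time, purely for illustration
original = b"0123456789"
blocks = list(iter_base64_blocks(io.BytesIO(original), block_size=4))

# Each block is encoded independently, so they are decoded one by one
restored = b"".join(base64.b64decode(b) for b in blocks)
print(len(blocks), restored == original)
```

Each yielded string corresponds to the `data` field of one `add-block` call.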

## Zip up the reco_utils directory

This assumes you have adjusted the variable `path_to_recommenders_repo_root` so that it points to the directory where you have cloned or downloaded the recommenders [repository](https://github.com/Microsoft/Recommenders/).


In [None]:
myzipfile = shutil.make_archive('reco_utils',
                    'zip', 
                    root_dir = path_to_recommenders_repo_root, 
                    base_dir = 'reco_utils'
                   )

In [None]:
## swap only the trailing .zip extension for .egg
local_eggname = os.path.splitext(myzipfile)[0] + ".egg"

## overwrite egg if it previously existed
if os.path.exists(local_eggname):
    os.unlink(local_eggname)
os.rename(myzipfile,local_eggname)
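
The zip-then-rename approach can be tried on a throwaway directory before pointing it at the real repository. In this sketch, `build_egg` and the `demo_pkg` package are hypothetical stand-ins for the cells above and for `reco_utils`; everything is created inside a temporary directory.

```python
import os
import shutil
import tempfile
import zipfile


def build_egg(repo_root, package_dir, out_dir):
    """Zip repo_root/package_dir and rename the archive to <package_dir>.egg."""
    archive_base = os.path.join(out_dir, package_dir)
    zip_path = shutil.make_archive(archive_base, "zip",
                                   root_dir=repo_root, base_dir=package_dir)
    egg_path = os.path.splitext(zip_path)[0] + ".egg"
    os.replace(zip_path, egg_path)  # overwrites an existing egg, if any
    return egg_path


with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny stand-in package
    repo = os.path.join(tmp, "repo")
    os.makedirs(os.path.join(repo, "demo_pkg"))
    with open(os.path.join(repo, "demo_pkg", "__init__.py"), "w") as f:
        f.write("VALUE = 1\n")
    out = os.path.join(tmp, "out")
    os.makedirs(out)

    egg = build_egg(repo, "demo_pkg", out)
    egg_basename = os.path.basename(egg)
    with zipfile.ZipFile(egg) as zf:
        egg_members = zf.namelist()

print(egg_basename, egg_members)
```

Keeping the package directory as the top-level entry in the archive is what lets the cluster import it by name.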

## Upload the egg.

This is a small file, but it is uploaded the same way in case it ever gets bigger, reusing `upload_large_file()` from above.

In [None]:
upload_large_file(local_eggname, upload_reco_utils_egg_path, my_header)

## Install Libraries

Set up the libraries configuration, then post the request.

In [None]:
my_lib_config = {
  "cluster_id": cluster_id,
  "libraries": [
    {
      "jar": "dbfs:"+upload_cosmosdb_jar_path
    },
    {
      "egg": "dbfs:"+upload_reco_utils_egg_path
    },
    {
      "pypi": {
        "package": "azure-cli"
      }
    },
    {
      "pypi": {
        "package": "azureml-sdk[databricks]"
      }
    },
    {
      "pypi": {
        "package": "pydocumentdb"
      }
    }
  ]
}

In [None]:
## This requires the cluster to be started or at least pending.
## it will return an "unknown cluster" error if the cluster is off.
response = requests.post(
        BASE_URL + "libraries/install",
        headers = my_header,
        json=my_lib_config
)
response.json()
if record_for_tests:
    pm.record('lib_install_code', response.status_code)
    pm.record('lib_install_json', response.json())

## Check Install Status


In [None]:
response = requests.get(
        BASE_URL + 'libraries/cluster-status?cluster_id='+cluster_id,
        headers = my_header
)
response.json()

if record_for_tests:
    pm.record('lib_status_code', response.status_code)
    pm.record('lib_status_json', response.json())
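
The status response is easier to read once it is summarized. The sketch below assumes that the response carries a `library_statuses` list whose entries have `library` and `status` fields, with `INSTALLED` and `FAILED` among the status values; `summarize_library_statuses` is a hypothetical helper and the sample payload is made up.

```python
from collections import Counter


def summarize_library_statuses(status_payload):
    """Return per-status counts, the failed libraries, and an all-installed flag."""
    statuses = status_payload.get("library_statuses", [])
    counts = Counter(s.get("status", "UNKNOWN") for s in statuses)
    failed = [s.get("library") for s in statuses if s.get("status") == "FAILED"]
    all_installed = bool(statuses) and all(
        s.get("status") == "INSTALLED" for s in statuses
    )
    return counts, failed, all_installed


# Illustrative payload, not captured from a real cluster
sample_status = {
    "cluster_id": "0000-000000-example0",
    "library_statuses": [
        {"library": {"pypi": {"package": "azure-cli"}}, "status": "INSTALLED"},
        {"library": {"pypi": {"package": "pydocumentdb"}}, "status": "PENDING"},
        {"library": {"egg": "dbfs:/tmp/reco_utils.egg"}, "status": "FAILED"},
    ],
}

counts, failed, all_installed = summarize_library_statuses(sample_status)
print(dict(counts))
print(failed)
print(all_installed)
```

Re-running the status cell until everything reports as installed, or restarting the cluster when something fails, matches the advice under Known Issues.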

## Upload the end-to-end Notebook

This step uploads the end-to-end notebook that creates all resources for the architecture and trains and deploys a model.

The workspace `import` API is needed here because `/Shared` on DBFS is not the same as `/Shared` in the workspace.

In [None]:
local_path_to_ref_arch_notebook = path_to_recommenders_repo_root+"notebooks/05_operationalize/als_movie_o16n.ipynb"

with open(local_path_to_ref_arch_notebook, 'rb') as f:
    notebook_data = f.read()

import_config = {
  "path": upload_location_for_endtoend_notebook+'/als_movie_o16n.ipynb',
  "format": "JUPYTER",
  "language": "PYTHON",
  "overwrite": "false"
}

In [None]:
response = requests.post(
        BASE_URL + "workspace/import",
        headers = my_header,
        data=import_config,
        files = {"content": notebook_data}
)
response.json()

if record_for_tests:
    pm.record('nb_upload_code', response.status_code)
    pm.record('nb_upload_json', response.json())
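
As an alternative to the multipart upload above, the `workspace/import` endpoint also accepts the notebook as a base64-encoded `content` field in a JSON body. The sketch below only builds that payload; `make_import_payload` is a hypothetical helper, and the notebook bytes are a stand-in rather than the real file.

```python
import base64


def make_import_payload(notebook_bytes, workspace_path, overwrite=False):
    """Build a JSON-serializable body for workspace/import."""
    return {
        "path": workspace_path,
        "format": "JUPYTER",
        "language": "PYTHON",
        "content": base64.b64encode(notebook_bytes).decode("ascii"),
        "overwrite": overwrite,
    }


# Stand-in bytes in place of the real notebook file
fake_notebook = b'{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}'
import_payload = make_import_payload(fake_notebook, "/Shared/als_movie_o16n.ipynb")

print(import_payload["path"])
print(base64.b64decode(import_payload["content"]) == fake_notebook)
```

With this form, the request would be sent with `json=import_payload` instead of `data=` and `files=`.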

## List the directory contents

List the upload location to confirm that the notebook is there.

In [None]:
response = requests.get(
    BASE_URL + "workspace/list",
    headers = my_header,
    json={
        "path": upload_location_for_endtoend_notebook
    }
)
response.json()

## Done!

Now open your Databricks workspace, navigate to where you uploaded the notebook (`/Shared` by default), attach it to the cluster, and walk through it to create the end-to-end architecture.