# Databricks Stack CLI and You

## Agenda

1. Who is this demo for?
2. What is the problem?
3. What is the Databricks Stack CLI?
4. How does this affect me?
5. Where can I learn more?

## Who is this demo for?

People who work in Databricks and are interested in how we can automate the deployment of code and config to Databricks.

## What is the problem?

The current deployment process triggers on a merge into the `develop` (staging) or `master` (production) branch. A CircleCI job then runs and updates the notebooks in that environment. The core of that script is a series of `databricks workspace` CLI commands, which wrap the Databricks Workspace API.

```yaml
copy_notebooks_to_databricks: &copy_notebooks_to_databricks
  run: |
    databricks workspace import_dir notebooks/adhoc /adhoc --overwrite
    databricks workspace import_dir notebooks/DE_REPORT /DE_Report --overwrite
    databricks workspace import_dir notebooks/DS_Shared /DS_Shared --overwrite
    databricks workspace import_dir notebooks/ETL /ETL --overwrite
    databricks workspace import_dir notebooks/public_reports /public_reports --overwrite
    databricks workspace import_dir notebooks/SAMPLES /SAMPLES --overwrite
    databricks workspace import_dir notebooks/tools /tools --overwrite
    databricks workspace import_dir notebooks/utils /utils --overwrite
```

The current deployment script has some downsides.

* A notebook created outside the folders listed above won't be deployed to Databricks. Adding a new folder means manually editing the deployment script, which may not be intuitive.
* The current script doesn't deploy jobs. We could use the Databricks Jobs API to do so, but it would take a fair amount of work to check which jobs already exist before deploying to an environment.
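
To make the second point concrete, here is a minimal sketch of the bookkeeping a hand-rolled job deployment would need. It assumes the existing jobs have already been fetched as a list of dictionaries containing a `job_id` and a `settings.name`, and that jobs are matched by name. `plan_job_deployment` and the sample data are hypothetical and not part of our repository.

```python
from typing import Dict, List, Tuple


def plan_job_deployment(
    existing_jobs: List[dict], desired_jobs: List[dict]
) -> Tuple[List[dict], List[Tuple[int, dict]]]:
    """Split desired job configs into ones to create and ones to update.

    Assumption: jobs are matched by name, and each existing job looks like
    {"job_id": 7, "settings": {"name": "..."}}.
    """
    # Map each existing job name to its ID for quick lookups.
    ids_by_name: Dict[str, int] = {
        job["settings"]["name"]: job["job_id"] for job in existing_jobs
    }

    to_create: List[dict] = []
    to_update: List[Tuple[int, dict]] = []
    for config in desired_jobs:
        job_id = ids_by_name.get(config["name"])
        if job_id is None:
            to_create.append(config)  # no job with this name exists yet
        else:
            to_update.append((job_id, config))  # reuse the existing job ID
    return to_create, to_update


# Hypothetical sample data: one job already exists, one is new.
existing = [{"job_id": 7, "settings": {"name": "nightly-etl"}}]
desired = [{"name": "nightly-etl"}, {"name": "weekly-report"}]
to_create, to_update = plan_job_deployment(existing, desired)
print(to_create)
print(to_update)
```

Matching by name is itself fragile, since Databricks does not force job names to be unique, which is part of why doing this by hand is more work than it first appears.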

## What is the Databricks Stack CLI?

To quote the Databricks documentation:

> The stack CLI provides a way to manage a stack of Databricks resources, such as jobs, notebooks, and DBFS files.

In essence, the stack CLI provides a single interface for deploying code and configuration to Databricks. This makes it easier to push changes from a local machine to our staging and production Databricks workspaces.

I'm not good with words, so I'll demo a simple use of the stack CLI here. In this demo, we are going to push a notebook and a job to our staging Databricks workspace.

First things first, let's confirm that we have the `databricks` CLI installed.

In [29]:
!which databricks

/home/linuxbrew/.linuxbrew/bin/databricks
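
If you would rather do the same check from Python (for example at the top of a deployment script), a minimal sketch using the standard library looks like this. `find_cli` is a hypothetical helper name, and the install hint assumes the CLI was installed from the `databricks-cli` pip package.

```python
import shutil
from typing import Optional


def find_cli(name: str = "databricks") -> Optional[str]:
    """Return the path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_cli()
if path is None:
    # Assumption: the CLI is distributed as the `databricks-cli` pip package.
    print("databricks CLI not found; try `pip install databricks-cli`")
else:
    print(f"databricks CLI found at {path}")
```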


I rely heavily on the Python package called `pygments` in this notebook, so I would also suggest installing it if you're running the cells as you follow along.

In [30]:
!pip install pygments



We're going to run a super simple notebook snippet that I stole from Databricks. The notebook reads a text file and returns the first 10 lines in the file.

In [31]:
!pygmentize simple_notebook.py

# Databricks notebook source

filePath = "dbfs:/databricks-datasets/SPARK_README.md" # path in Databricks File System
lines = sc.textFile(filePath) # read the file into the cluster
lines.take(10) # display first 10 lines in the file


The job is also pretty simple: when it runs, it spins up a cluster and runs the notebook.

In [32]:
!pygmentize job_config.json

{
  "name": "Super simple job that should be deleted ASAP",
  "new_cluster": {
    "autoscale": {
      "min_workers": 1,
      "max_workers": 10
    },
    "spark_version": "6.3.x-scala2.11",
    "aws_attributes": {
      "first_on_demand": 1,
      "availability": "SPOT_WITH_FALLBACK",
      "zone_id": "us-east-1a",
      "spot_bid_price_percent": 100,
      "ebs_volume_type": "GENERAL_PURPOSE_SSD",
      "ebs_volume_count": 1,
      "ebs_volume_size": 100
    },
    "node_type_id": "m4.large",
    "enable_elastic_disk": true
  

To use the stack CLI, we need to create a configuration file in the format it expects. The important thing to note in this JSON is that there are two items under the `resources` key. The first item represents the notebook, linking a local file to a path in the Databricks workspace. The second item represents a job which, when run, executes the simple notebook we created earlier.

In [33]:
!pygmentize staging_stack.json

{
  "name": "databricks-stack-demo",
  "resources": [
    {
      "id": "example-workspace-notebook",
      "service": "workspace",
      "properties": {
        "source_path": "simple_notebook.py",
        "path": "/Users/lennox.stevenson@prodigygame.com/databricks_stack_demo/simple_notebook",
        "object_type": "NOTEBOOK"
      }
    },
    {
      "id": "simple-job",
      "service": "jobs",
      "properties": {
        "name": "Super simple job that should be deleted ASAP",
        "new_cluster": {
          "autoscale": {
            "min_workers": 1,
            "max_workers"

With the basic config set up, let's see what happens when we hit the deploy button and push it to staging.

In [34]:
!databricks --profile staging stack deploy staging_stack.json

################################################################################
Deploying stack at: staging_stack.json with options: {'overwrite': False}
################################################################################
Validating fields in stack configuration...
Validating fields in resource with ID "example-workspace-notebook"
Validating fields in "properties" of workspace resource.
Validating fields in resource with ID "simple-job"
Validating fields in "properties" of jobs resource.
################################################################################
Validating fields in stack status...
Validating fields in resource status of resource with ID "example-workspace-notebook"
Validating fields in "databricks_id" of workspace resource status
Validating fields in resource status of resource with ID "simple-job"
Validating fields in "databricks_id" of jobs resource status
################################################################################
Deploying s

It outputs a bunch of information on the steps it took to make sure the resources listed in the stack config file get pushed to Databricks. Now we can go look at our staging workspace and see that the notebook and job exist. (Seriously, go look. I really hope they're there.)
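
The validation lines in that output can be approximated with a few checks of our own, which is handy for catching mistakes before CI runs. This is a minimal sketch, not the CLI's actual implementation: `validate_stack_config` is a hypothetical helper, and the required keys are only the ones visible in the demo config (`name`, `resources`, and per-resource `id`, `service`, `properties`).

```python
from typing import List

REQUIRED_RESOURCE_KEYS = ("id", "service", "properties")


def validate_stack_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems: List[str] = []
    for key in ("name", "resources"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")

    seen_ids = set()
    for index, resource in enumerate(config.get("resources", [])):
        for key in REQUIRED_RESOURCE_KEYS:
            if key not in resource:
                problems.append(f"resource #{index} is missing key: {key}")
        resource_id = resource.get("id")
        # Assumption: every resource ID must be unique within a stack.
        if resource_id in seen_ids:
            problems.append(f"duplicate resource id: {resource_id}")
        if resource_id is not None:
            seen_ids.add(resource_id)
    return problems


# A deliberately broken config: the second resource repeats an ID
# and has no "properties" key.
broken = {
    "name": "databricks-stack-demo",
    "resources": [
        {"id": "simple-job", "service": "jobs", "properties": {}},
        {"id": "simple-job", "service": "jobs"},
    ],
}
for problem in validate_stack_config(broken):
    print(problem)
```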

Let's say I screwed up and I want the notebook to print 100 lines, not just the first 10. Making that change in the notebook is simple, and I've made that change in a separate file.

In [35]:
!pygmentize simple_notebook_v2.py

# Databricks notebook source

filePath = "dbfs:/databricks-datasets/SPARK_README.md" # path in Databricks File System
lines = sc.textFile(filePath) # read the file into the cluster
lines.take(100) # display first 100 lines in the file


Uploading this change to Databricks is super simple, as we just need to rerun the deploy command with the `--overwrite` flag. Note that I use a v2 of the config file, which points to the new notebook with the changes we want to see deployed.

In [36]:
!pygmentize staging_stack_v2.json

{
  "name": "databricks-stack-demo",
  "resources": [
    {
      "id": "example-workspace-notebook",
      "service": "workspace",
      "properties": {
        "source_path": "simple_notebook_v2.py",
        "path": "/Users/lennox.stevenson@prodigygame.com/databricks_stack_demo/simple_notebook",
        "object_type": "NOTEBOOK"
      }
    },
    {
      "id": "simple-job",
      "service": "jobs",
      "properties": {
        "name": "Super simple job that should be deleted ASAP",
        "new_cluster": {
          "autoscale": {
            "min_workers": 1,
            "max_workers

In [37]:
!databricks --profile staging stack deploy staging_stack_v2.json --overwrite

################################################################################
Deploying stack at: staging_stack_v2.json with options: {'overwrite': True}
################################################################################
Validating fields in stack configuration...
Validating fields in resource with ID "example-workspace-notebook"
Validating fields in "properties" of workspace resource.
Validating fields in resource with ID "simple-job"
Validating fields in "properties" of jobs resource.
################################################################################
Validating fields in stack status...
Validating fields in resource status of resource with ID "example-workspace-notebook"
Validating fields in "databricks_id" of workspace resource status
Validating fields in resource status of resource with ID "simple-job"
Validating fields in "databricks_id" of jobs resource status
################################################################################
Deploying

Once again, I send you on an adventure to the staging workspace to confirm that the changes were pushed to Databricks.
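
The two deploy invocations in this demo differ only in the config file and the `--overwrite` flag, so they are easy to wrap if you want to call them from a script. The following is a minimal sketch under that assumption; `build_deploy_command` and `deploy` are hypothetical helper names, and the command layout simply mirrors the notebook cells.

```python
import subprocess
from typing import List


def build_deploy_command(
    config_path: str, profile: str, overwrite: bool = False
) -> List[str]:
    """Build the argument list for `databricks stack deploy`."""
    command = ["databricks", "--profile", profile, "stack", "deploy", config_path]
    if overwrite:
        # Needed when redeploying over resources that already exist.
        command.append("--overwrite")
    return command


def deploy(config_path: str, profile: str, overwrite: bool = False) -> None:
    """Run the deploy; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_deploy_command(config_path, profile, overwrite), check=True)


# Only print the command here; calling deploy() would contact the workspace.
print(" ".join(build_deploy_command("staging_stack_v2.json", "staging", overwrite=True)))
```

Building the command as a list rather than a single string avoids shell quoting issues if a config path ever contains spaces.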

## How does this affect me?

I created a [Pull Request](https://github.com/SMARTeacher/prodigy-databricks/pull/74) in the [prodigy-databricks](https://github.com/SMARTeacher/prodigy-databricks) repository to adopt the stack CLI for the project. As long as you follow the conventions of storing notebooks in the `notebooks` folder and jobs in the `job_config` folder, those notebooks and jobs will be deployed automatically to our Databricks environments when the changes are merged into the repository.

If you aren't checking your code into the repository, the stack CLI has no effect on you or how you work. As mentioned at the beginning of this demo, this workflow matters most to those building out our data ingestion pipelines or moving models from exploratory analysis into production. It is in these cases that code review and configuration kept in code provide value.
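
To give a feel for how folder conventions can drive a deployment, here is a minimal sketch that generates a stack config from a `notebooks` folder and a `job_config` folder. It is not necessarily how the pull request does it. `build_stack_config`, the resource ID scheme, and the `/Shared/demo` workspace root are hypothetical, and the sketch assumes notebooks are stored as `.py` source files, each job is one JSON file, and the generated config would live at the repository root.

```python
import json
import tempfile
from pathlib import Path


def build_stack_config(repo_root: Path, stack_name: str, workspace_root: str) -> dict:
    """Build a stack config dict from notebooks/ and job_config/ under repo_root."""
    resources = []

    # One workspace resource per notebook, keeping the relative folder layout.
    notebooks_dir = repo_root / "notebooks"
    for source in sorted(notebooks_dir.rglob("*.py")):
        relative = source.relative_to(notebooks_dir).with_suffix("")
        resources.append({
            "id": f"notebook-{relative.as_posix()}",
            "service": "workspace",
            "properties": {
                "source_path": source.relative_to(repo_root).as_posix(),
                "path": f"{workspace_root}/{relative.as_posix()}",
                "object_type": "NOTEBOOK",
            },
        })

    # One jobs resource per JSON file; its content becomes the job properties.
    for job_file in sorted((repo_root / "job_config").glob("*.json")):
        resources.append({
            "id": f"job-{job_file.stem}",
            "service": "jobs",
            "properties": json.loads(job_file.read_text()),
        })

    return {"name": stack_name, "resources": resources}


# Try it on a throwaway directory laid out like the repository conventions.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notebooks" / "ETL").mkdir(parents=True)
    (root / "notebooks" / "ETL" / "load.py").write_text("# Databricks notebook source\n")
    (root / "job_config").mkdir()
    (root / "job_config" / "nightly.json").write_text('{"name": "nightly"}')
    config = build_stack_config(root, "demo-stack", "/Shared/demo")

print(json.dumps(config, indent=2))
```

Because the resource list is derived from whatever is on disk, a notebook in a brand-new folder is picked up without anyone editing the deployment script.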

## Where can I learn more?

* https://github.com/SMARTeacher/prodigy-databricks/pull/74 - PR for integrating the stack CLI into our repository.
* https://docs.databricks.com/dev-tools/cli/stack-cli.html - documentation on the stack CLI.