# Automation with runnable notebooks

Notebooks are one way, though not the only way, of interacting with the Spark and Databricks environment.  Notebooks can be executed independently and as recurring jobs.  They can also be exported and versioned using git.  Python files and Scala/Java jars can be executed against a Databricks cluster as well, allowing full integration with a developer's normal workflow.  Since notebooks can be executed like code files and compiled binaries, they offer a way of building production pipelines.

Functional programming design principles aid in thinking about pipelines.  In functional programming, code has known inputs and outputs and no side effects.  Writing notebooks in this way reduces unintended side effects and lets each stage in a pipeline operate independently from the rest.

More complex workflows using notebooks require managing dependencies between tasks and passing parameters into notebooks.  Dependency management can be done by chaining notebooks together, for instance to run reporting logic after a database write has completed.  Sometimes, when these pipelines become especially complex, chaining notebooks together isn't sufficient.  In those cases, scheduling with Apache Airflow has become the preferred solution.  Notebook widgets can be used to pass parameters to a notebook when the parameters are determined at runtime.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/notebook-workflows.png" style="height: 400px; margin: 20px"/></div>

### Widgets

Widgets allow for the customization of notebooks without editing the code itself.  They also allow for passing parameters into notebooks.  There are 4 types of widgets:

| Type          | Description                                                                                        |
|:--------------|:---------------------------------------------------------------------------------------------------|
| `text`        | Input a value in a text box.                                                                       |
| `dropdown`    | Select a value from a list of provided values.                                                     |
| `combobox`    | Combination of text and dropdown. Select a value from a provided list or input one in the text box.|
| `multiselect` | Select one or more values from a list of provided values.                                          |

Widgets are created with Databricks utility functions in the `dbutils.widgets` package.  Each function takes a name, a default value, and, for every type except `text`, a list of values to choose from.
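
As a rough sketch of the four widget types in the table above, the function below creates one widget of each type.  The widget names, defaults, and choices are made up for illustration, and `create_example_widgets` is a hypothetical helper rather than part of `dbutils`.  It takes the widgets object as an argument so that the call to `dbutils.widgets` happens only inside a Databricks notebook.

```python
def create_example_widgets(widgets):
    """Create one widget of each type on a dbutils.widgets-like object.

    All names, defaults, and choices here are illustrative assumptions.
    """
    choices = ["1", "2", "3"]
    widgets.text("my_text", "hello")                      # free-form text box
    widgets.dropdown("my_dropdown", "1", choices)         # pick one value from the list
    widgets.combobox("my_combobox", "1", choices)         # pick from the list or type a value
    widgets.multiselect("my_multiselect", "1", choices)   # pick one or more values


# In a Databricks notebook, where dbutils is predefined:
# create_example_widgets(dbutils.widgets)
```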

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://docs.azuredatabricks.net/user-guide/notebooks/widgets.html#id1" target="_blank">the Databricks documentation on widgets</a> for additional information.

In [3]:
dbutils.widgets.dropdown("MyWidget", "1", [str(x) for x in range(1, 5)])

Notice the widget created at the top of the screen.  Choose a number from the dropdown menu.  Now, bring that value into your code using the `get` method.

In [5]:
dbutils.widgets.get("MyWidget")

Clear the widgets using either `remove()` or `removeAll()`.

In [7]:
dbutils.widgets.removeAll()

Beyond adding parameters to notebooks and dashboards, widgets also allow us to pass parameters into a notebook when we run it like a Python or JAR file.

### Running Notebooks

There are two options for running notebooks:
- Using `dbutils.notebook.run("<path>", <timeout_in_seconds>)`.
- Using the `%run` magic.

The difference between the two is scope.  With `%run`, the parent notebook inherits the variables defined in the notebook that was run.  With `dbutils.notebook.run()`, the notebook runs as a separate job, so its variables are not passed into the current environment.

Notebook widgets allow us to pass parameters into notebooks.  With `dbutils.notebook.run()`, this is done with a dictionary that maps each widget name to a value, given as a `string`.
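
Because the values in that dictionary are strings, it can help to convert parameters before making the call.  The snippet below is a minimal sketch of that step.  `to_widget_args` is a hypothetical helper, not part of `dbutils`, and the `./child` path and `run_date` parameter in the example are placeholders.

```python
def to_widget_args(params):
    """Convert parameters to the {widget name: string value} form.

    Hypothetical helper for illustration; not part of dbutils.
    """
    return {str(name): str(value) for name, value in params.items()}


args = to_widget_args({"MyWidget": 3, "run_date": "2020-01-01"})
print(args)

# In a Databricks notebook, the dictionary is the third argument
# ("./child" is a placeholder path):
# dbutils.notebook.run("./child", 60, args)
```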

The execution record of a notebook run this way can be reviewed in the *Jobs* section of the workspace.

Running notebooks from other notebooks allows for more advanced workflows in the following ways:

* **Dependencies** can be managed by running a notebook that triggers other notebooks in the desired order
* **Timeouts** ensure that jobs have a set limit by which they must either complete or fail
* **Retry logic** ensures that transient failures do not prevent the proper execution of a notebook
* **Data can be passed** between notebooks by saving the data to a blob store or table and passing the path as an exit parameter
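
The dependency, timeout, and retry points above can be sketched as a pair of small helpers.  This is one possible approach rather than a built-in Databricks feature: `run_with_retry` and `run_in_order` are hypothetical names, and the notebook paths in the final comment are placeholders.  The sketch assumes that the `run` callable raises an exception when a notebook fails or exceeds its timeout.  `run` is passed in explicitly so the helpers can be tried outside Databricks with a stand-in function.

```python
def run_with_retry(run, path, timeout_seconds, arguments=None, max_retries=3):
    """Call run(path, timeout_seconds, arguments), retrying on any exception.

    `run` is expected to behave like dbutils.notebook.run. After the first
    failure, up to `max_retries` further attempts are made before the last
    exception is re-raised.
    """
    arguments = arguments or {}
    attempt = 0
    while True:
        try:
            return run(path, timeout_seconds, arguments)
        except Exception as error:
            if attempt >= max_retries:
                raise
            attempt += 1
            print(f"Attempt {attempt} failed ({error}); retrying")


def run_in_order(run, steps, max_retries=3):
    """Run notebooks one after another; each step starts only after the previous one returns."""
    results = []
    for path, timeout_seconds, arguments in steps:
        results.append(run_with_retry(run, path, timeout_seconds, arguments, max_retries))
    return results


# In a Databricks notebook (paths are placeholders):
# run_in_order(dbutils.notebook.run,
#              [("./write_db", 600, {}), ("./report", 300, {})])
```

Running the steps sequentially means the reporting notebook never starts unless the database write has returned successfully.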

## An example of a simple ETL job

1. Takes four parameters:
  - Azure Storage account
  - Azure Blob Storage container with CSV files
  - Output pathname
  - Path to Spark ML classification model
1. Reads the CSV files into a DataFrame
1. Applies the Spark ML classification model, adding a prediction column to the input DataFrame
1. Writes the result to DBFS as a Parquet file
1. Exits with the input path and output path as a result
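
The first and last steps above can be sketched without a cluster.  The snippet below builds a `wasbs://` input URL from the storage account and container, then packs the two paths into a JSON string to use as the exit value.  It is an illustration only: the scoring notebook itself is not shown here, so the helper names and the JSON keys `input_path` and `output_path` are assumptions.

```python
import json


def blob_url(storage_account, container, path=""):
    """Build a wasbs:// URL in the container@account form used in this notebook."""
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{path}"


def exit_payload(input_path, output_path):
    """Serialize both paths as one JSON string (key names are assumptions)."""
    return json.dumps({"input_path": input_path, "output_path": output_path})


input_path = blob_url("azureailabs", "churn")
payload = exit_payload(input_path, "/datasets/churn.parquet")
print(payload)

# Inside the scoring notebook, the final step would be something like:
# dbutils.notebook.exit(payload)
```

Returning a single JSON string lets the caller recover both paths with `json.loads`.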

In [11]:
%fs ls "wasbs://models@azureailabs.blob.core.windows.net/churn_classifier"

In [12]:
import json

STORAGE_ACCOUNT = "azureailabs"
CONTAINER = "churn"
OUTPUT_PATH = "/datasets/churn.parquet"
ML_PATH = "wasbs://models@azureailabs.blob.core.windows.net/churn_classifier"


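# Run the scoring notebook with a 120-second timeout.  Each dictionary key
# sets the widget of the same name in the child notebook; values are strings.
result = dbutils.notebook.run("./Runnable/score", 120,
                              {"STORAGE_ACCOUNT": STORAGE_ACCOUNT,
                               "CONTAINER": CONTAINER,
                               "ML_PATH": ML_PATH,
                               "OUTPUT_PATH": OUTPUT_PATH})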


In [13]:
print(json.loads(result))

In [14]:
scoredDF = spark.read.parquet(OUTPUT_PATH)
display(scoredDF)