
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>
# Runnable Notebooks

Apache Spark&trade; and Databricks&reg; notebooks can be run, opening the door for automated workflows.

## In this lesson you:
* Parameterize notebooks using widgets
* Execute single and multiple notebooks with dependencies
* Pass variables into notebooks using widgets

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome
* A cluster configured with **8 cores** and **DBR 6.2**
* Course: ETL Part 1 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>
* Course: ETL Part 2 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/6a1rc4aaif?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/6a1rc4aaif?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### A Step Toward Full Automation

Notebooks are one way, but not the only way, of interacting with the Spark and Databricks environment.  Notebooks can be executed independently and as recurring jobs.  They can also be exported and versioned using Git.  Python files and Scala/Java jars can be executed against a Databricks cluster as well, allowing full integration with a developer's normal workflow.  Since notebooks can be executed like code files and compiled binaries, they offer a way of building production pipelines.

Functional programming design principles aid in thinking about pipelines.  In functional programming, code has known inputs and outputs and no side effects.  Writing notebooks this way reduces unintended side effects and lets each stage of a pipeline operate independently of the rest.

More complex workflows using notebooks require managing dependencies between tasks and passing parameters into notebooks.  Dependency management can be done by chaining notebooks together, for instance to run reporting logic after a database write has completed.  Sometimes, when these pipelines become especially complex, chaining notebooks together isn't sufficient.  In those cases, scheduling with Apache Airflow has become the preferred solution.  Notebook widgets can be used to pass parameters to a notebook when the parameters are determined at runtime.
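
Chaining can be sketched as a small helper that runs notebooks in a fixed order.  This is a minimal illustration, not a prescribed pattern: the helper name `run_in_order` and the notebook paths are hypothetical, and `run_notebook` is assumed to behave like `dbutils.notebook.run` (return the child's exit string, raise on failure or timeout).

```python
def run_in_order(run_notebook, stages, timeout_seconds=60):
    """Run notebooks one after another, stopping at the first failure.

    `run_notebook` is assumed to have the same shape as
    dbutils.notebook.run(path, timeout_seconds, arguments).
    """
    results = {}
    for path, args in stages:
        # An exception raised here propagates, so later stages never start.
        results[path] = run_notebook(path, timeout_seconds, args)
    return results


# Hypothetical notebook paths, for illustration only.
stages = [
    ("./Pipeline/Write-To-Database", {"date": "11-27-2013"}),
    ("./Pipeline/Build-Report", {"date": "11-27-2013"}),
]

# Inside a Databricks notebook, with real paths, the call would be:
# results = run_in_order(dbutils.notebook.run, stages)
```

Because the reporting stage only starts after the write stage returns, the ordering itself expresses the dependency.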

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/notebook-workflows.png" style="height: 400px; margin: 20px"/></div>

### Widgets

Widgets allow for the customization of notebooks without editing the code itself.  They also allow for passing parameters into notebooks.  There are 4 types of widgets:

| Type          | Description                                                                                        |
|:--------------|:---------------------------------------------------------------------------------------------------|
| `text`        | Input a value in a text box.                                                                       |
| `dropdown`    | Select a value from a list of provided values.                                                     |
| `combobox`    | Combination of text and dropdown. Select a value from a provided list or input one in the text box.|
| `multiselect` | Select one or more values from a list of provided values.                                          |

Widgets are Databricks utility functions that can be accessed using the `dbutils.widgets` package and take a name, default value, and values (if not a `text` widget).
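
As a rough sketch of the four widget types side by side, the helper below creates one of each.  The helper name and the widget names are hypothetical, and the optional label argument is omitted.  The final guard only calls `dbutils` when it exists, that is, inside a Databricks notebook.

```python
def create_example_widgets(widgets):
    """Create one widget of each type on a dbutils.widgets-like object."""
    choices = [str(x) for x in range(1, 5)]
    widgets.text("example_text", "1")                         # name, default value
    widgets.dropdown("example_dropdown", "1", choices)        # name, default value, choices
    widgets.combobox("example_combobox", "1", choices)        # name, default value, choices
    widgets.multiselect("example_multiselect", "1", choices)  # name, default value, choices
    return ["example_text", "example_dropdown", "example_combobox", "example_multiselect"]


# dbutils is only defined inside a Databricks notebook.
if "dbutils" in globals():
    create_example_widgets(dbutils.widgets)
```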

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://docs.azuredatabricks.net/user-guide/notebooks/widgets.html#id1" target="_blank">the Databricks documentation on widgets for additional information </a>

In [8]:
dbutils.widgets.dropdown("MyWidget", "1", [str(x) for x in range(1, 5)])

Notice the widget created at the top of the screen.  Choose a number from the dropdown menu.  Now, bring that value into your code using the `get` method.

In [10]:
dbutils.widgets.get("MyWidget")

Clear the widgets using either `remove()` or `removeAll()`.

In [12]:
dbutils.widgets.removeAll()

While great for adding parameters to notebooks and dashboards, widgets also allow us to pass parameters into notebooks when we run them like a Python or JAR file.

### Running Notebooks

There are two options for running notebooks.  The first is `dbutils.notebook.run("<path>", <timeout_in_seconds>)`, which runs the notebook at the given path and fails if it does not finish within the timeout.  The second, `%run`, is covered below.  [Take a look at this notebook first to see what it accomplishes.]($./Runnable/Runnable-1 )

Now run the notebook with the following command.

In [15]:
return_value = dbutils.notebook.run("./Runnable/Runnable-1", 30)

print("Notebook successfully ran with return value: {}".format(return_value))

Notice how the `Runnable-1` notebook ends with the command `dbutils.notebook.exit("returnValue")`.  This `string` is passed back to the calling notebook as the return value of `dbutils.notebook.run()`.

Run the following cell and note how the variable doesn't exist.
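
One way to perform this check, assuming the variable in question is `my_variable` (the name printed later in this lesson), is to catch the `NameError` that Python raises for an undefined name.  The flag `variable_defined` is introduced here only for illustration.

```python
try:
    # Succeeds only if my_variable exists in the current notebook's namespace.
    print(my_variable)
    variable_defined = True
except NameError:
    print("Variable not defined")
    variable_defined = False
```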

This variable is not passed into our current environment.  The difference between `dbutils.notebook.run()` and `%run` is that, with `%run`, the parent notebook inherits the variables defined in the notebook it runs.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is how the classroom setup script at the beginning of each lesson works.  Among other things, it defines the variable `userhome` for you so that you have a unique write destination from colleagues on your Databricks workspace

In [19]:
#  @  ./Runnable/Runnable-1

In [20]:
%run ./Runnable/Runnable-1

Now this variable is available for use in this notebook.

In [22]:
print(my_variable)

### Parameter Passing and Debugging

Notebook widgets allow us to pass parameters into notebooks.  This can be done in the form of a dictionary that maps the widget name to a value as a `string`.
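
Since the values are expected to be strings, it can help to convert them before the call.  The helper `to_widget_args` and the `max_rows` parameter below are hypothetical and shown only as a sketch.

```python
def to_widget_args(params):
    """Convert parameter names and values to strings, keyed by widget name."""
    return {str(name): str(value) for name, value in params.items()}


args = to_widget_args({"date": "11-27-2013", "max_rows": 100})
print(args)  # {'date': '11-27-2013', 'max_rows': '100'}
```

The child notebook would then need to convert values back itself, for example with `int(dbutils.widgets.get("max_rows"))`.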

[Take a look at the second notebook to see what it accomplishes.]($./Runnable/Runnable-2 )

Pass your parameters into `dbutils.notebook.run` and save the resulting return value.

In [26]:
# Build a per-user working directory (no trailing slash, to avoid "//" in the paths below)
basePath = "{}/etl3p".format(getUserhome())
dest_path = "{}/academy/raw_logs.parquet".format(basePath)

result = dbutils.notebook.run("./Runnable/Runnable-2", 60, {"date": "11-27-2013", "dest_path": dest_path})

Click on `Notebook job #XXX` above to view the output of the notebook.  **This is helpful for debugging any problems.**

Parse the JSON string results.

In [29]:
import json
print(json.loads(result))
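
For the parent to parse JSON, the child notebook has to exit with a JSON string.  The sketch below shows one way that round trip might look.  The keys `status` and `dest_path`, the helper name, and the example path are assumptions for illustration and are not taken from `Runnable-2`.

```python
import json


def build_exit_value(status, dest_path):
    """Serialize a small result dictionary to the string a notebook can exit with."""
    return json.dumps({"status": status, "dest_path": dest_path})


# Hypothetical path, for illustration only.
exit_value = build_exit_value("OK", "/tmp/example/raw_logs.parquet")
# In the child notebook, the last command would be:
# dbutils.notebook.exit(exit_value)

# In the parent notebook, parse the string returned by dbutils.notebook.run.
parsed = json.loads(exit_value)
print(parsed["status"], parsed["dest_path"])
```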

Now look at what this accomplished: cell phone logs were parsed corresponding to the date of the parameter passed into the notebook.  The results were saved to the given destination path.

In [31]:
display(spark.read.parquet(dest_path))

### Dependency Management, Timeouts, Retries, and Passing Data

Running notebooks can allow for more advanced workflows in the following ways:<br><br>

* **Dependencies** can be managed by running a notebook that triggers other notebooks in the desired order
* Setting **timeouts** ensures that jobs have a set limit on when they must either complete or fail
* **Retry logic** ensures that fleeting failures do not prevent the proper execution of a notebook
* **Data can be passed** between notebooks by saving the data to a blob store or table and passing the path as an exit parameter

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://docs.azuredatabricks.net/user-guide/notebooks/notebook-workflows.html" target="_blank">the Databricks documentation on Notebook Workflows for additional information </a>
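
Timeouts and retry logic can be combined in a small wrapper.  This is a minimal sketch under stated assumptions: the name `run_with_retry` and its `max_retries` and `wait_seconds` parameters are hypothetical, and `run_notebook` is assumed to behave like `dbutils.notebook.run`, raising an exception when the child notebook fails or times out.

```python
import time


def run_with_retry(run_notebook, path, timeout_seconds, args=None,
                   max_retries=3, wait_seconds=0):
    """Run a notebook, retrying up to `max_retries` times after a failure."""
    attempt = 0
    while True:
        try:
            return run_notebook(path, timeout_seconds, args or {})
        except Exception as error:
            if attempt >= max_retries:
                # Out of retries: surface the last failure to the caller.
                raise
            attempt += 1
            print("Retrying {} (retry {} of {}): {}".format(
                path, attempt, max_retries, error))
            time.sleep(wait_seconds)


# Inside a Databricks notebook, the call might look like:
# result = run_with_retry(dbutils.notebook.run, "./Runnable/Runnable-2", 60,
#                         {"date": "11-27-2013", "dest_path": dest_path})
```

Retrying blindly suits fleeting failures only; a notebook that fails deterministically will simply fail `max_retries + 1` times.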

## Exercise 1: Building a Generalized Notebook

Build a notebook that allows for customization using input parameters.

### Step 1: Filter an Hour of Log Data

[Fill out the `Runnable-3` notebook]($./Runnable/Runnable-3 ) so that it accomplishes the following (it's helpful to open this in another tab):<br><br>

1. Takes two parameters: `hour` and `output_path`
1. Reads the following log file: `/mnt/training/EDGAR-Log-20170329/enhanced/EDGAR-Log-20170329-sample.parquet`
1. Filters the data for the hour provided
1. Writes the result to the `output_path`
1. Exits with the `output_path` as the exit parameter

In [35]:
path = "{}/hour_03.parquet".format(basePath)

dbutils.notebook.run("./Runnable/Runnable-3", 60, {"hour": "03", "output_path": path})

In [36]:
# TEST - Run this cell to test your solution
import random

r = str(random.randint(0, 10**10))
_path = "{}/hour_08_{}.parquet".format(basePath, r)

_returnValue = dbutils.notebook.run("./Runnable/Runnable-3", 60, {"hour": "08", "output_path": _path})
_df = spark.read.parquet(_returnValue)

dbTest("ET3-P-03-01-01", True, _path == _returnValue)
dbTest("ET3-P-03-01-02", 54206, _df.count())

dbutils.fs.rm(_path, True)

print("Tests passed!")

## Review
**Question:** How can I start to transition from notebooks to production environments?  
**Answer:** Runnable notebooks are the first step towards transitioning into production environments since they allow us to generalize and parameterize our code.  There are other options, including running Python files and Scala/Java jars against a cluster as well.

**Question:** How does passing parameters into and out of notebooks work?  
**Answer:** Widgets allow for the customization of notebooks and passing parameters into them.  This takes place in the form of a dictionary (Python) or map (Scala) of key/value pairs that match the names of widgets.  Only strings can be passed out of a notebook as an exit parameter.

**Question:** Since I can only pass strings of a limited length out of a notebook, how can I pass data out of a notebook?  
**Answer:** The preferred way is to save your data to a blob store or Spark table.  On the notebook's exit, pass the location of that data as a string.  It can then be easily imported and manipulated in another notebook.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [39]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Scheduling Jobs Programatically]($./ETL3 04 - Scheduling Jobs Programatically ).

## Additional Topics & Resources

**Q:** How can I integrate Databricks with more complex workflow schedulers like Apache Airflow?  
**A:** Check out the Databricks blog <a href="https://databricks.com/blog/2017/07/19/integrating-apache-airflow-with-databricks.html" target="_blank">Integrating Apache Airflow with Databricks</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>