
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>
# Scheduling Jobs Programmatically

Apache Spark&trade; and Databricks&reg; can be automated through the Jobs UI, REST API, or the command line.

## In this lesson you:
* Submit jobs using the Jobs UI and REST API
* Monitor jobs using the REST API

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome
* A cluster configured with **8 cores** and **DBR 6.2**
* Course: ETL Part 1 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>
* Course: ETL Part 2 from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/1ie2iv3vou?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/1ie2iv3vou?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Automating ETL Workloads

Since recurring production jobs are the goal of ETL workloads, Spark needs a way to integrate with other automation and scheduling tools.  We also need to be able to run Python files and Scala/Java jars.

Recall from <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a> how we can schedule jobs using the Databricks user interface.  In this lesson, we'll explore more robust solutions to schedule jobs.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/jobs.png" style="height: 400px; margin: 20px"/></div>

There are a number of different automation and scheduling tools including the following:<br><br>

* Command line tools integrated with the UNIX scheduler Cron
* The workflow scheduler Apache Airflow
* Microsoft's Scheduler or Data Factory

The gateway into job scheduling is programmatic access to Databricks, which can be achieved either through the REST API or the Databricks Command Line Interface (CLI).

### Access Tokens

Access tokens provide programmatic access to the Databricks CLI and REST API.  This lesson uses the REST API but could also be completed <a href="https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html" target="_blank">using the command line alternative.</a>

To get started, first generate an access token.

In order to generate a token:<br><br>

1. Click on the person icon in the upper-right corner of the screen.
2. Click **User Settings**
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/token-1.png" style="height: 400px; margin: 20px"/></div>
3. Click on **Access Tokens**
4. Click on **Generate New Token**
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/token-2-azure.png" style="height: 400px; margin: 20px"/></div>

5. Name your token
6. Designate a lifespan (a shorter lifespan is generally better to minimize risk exposure)
7. Click **Generate**
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/token-3.png" style="height: 400px; margin: 20px"/></div>
8. Copy your token.  You'll only be able to see it once.
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/token-4.png" style="height: 400px; margin: 20px"/></div>

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Be sure to keep this token secure.  It grants the holder full programmatic access to Databricks, including both the resources and the data available to your Databricks environment.
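One way to keep the token out of the notebook itself is to read it from an environment variable.  The following is a minimal sketch: the helper `build_auth_header` is hypothetical, and the variable name `DATABRICKS_TOKEN` is an assumption made for this example rather than something this lesson sets up.

```python
import os


def build_auth_header(env=os.environ, var_name="DATABRICKS_TOKEN"):
    """Build the Authorization header from a token stored in the environment.

    `var_name` is an assumed variable name; use whatever name holds your token.
    """
    token = env.get(var_name)
    if not token:
        raise ValueError("Set {} before calling the REST API".format(var_name))
    return {"Authorization": "Bearer " + token}


# Example with an explicit mapping, so no real token is needed to try it out
print(build_auth_header({"DATABRICKS_TOKEN": "example-token"}))
```

The resulting dictionary has the same shape as the `header` variable used in the cells that follow.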

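Paste your token into the following cell along with the domain of your Databricks deployment (you can see this in the notebook's URL).  The deployment should look something like `https://westus2.azuredatabricks.net`.  The REST API endpoints used in this lesson live under `/api/2.0/`, so the `domain` variable should end with that path.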

In [10]:
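# ANSWER

token = "FILL_IN"  # paste the access token generated above

# Base URL of the REST API: your deployment's domain followed by /api/2.0/
domain = "https://westus2.azuredatabricks.net/api/2.0/"

# domain = "https://example.cloud.databricks.com/api/2.0/"

header = {"Authorization": "Bearer " + token}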


Test that the connection works by listing all files in the root directory of DBFS.

In [12]:
try:
  import json
  import requests

  endPoint = domain + "dbfs/list?path=/"
  r = requests.get(endPoint, headers=header)

  # Print explicitly: a bare expression inside a try block is not displayed
  print([i.get("path") for i in json.loads(r.text).get("files")])

except Exception as e:
  print(e)
  print("\n** Double check your previous settings **\n")
  
  

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The REST API can be used at the command line using a command like `curl -s -H "Authorization: Bearer token" https://domain.cloud.databricks.com/api/2.0/dbfs/list\?path\=/`
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The CLI can be used with the command `databricks fs ls dbfs:/` once it has been installed and configured with your access token

In [14]:
# ! curl -s -H "Authorization: Bearer token" https://domain.cloud.databricks.com/api/2.0/dbfs/list\?path\=/
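The connection test above assumes the response always contains a `files` key.  A slightly more defensive way to read a `dbfs/list` response might look like the sketch below.  The helper name `extract_paths` and the sample payloads are made up for illustration; the `error_code` and `message` fields reflect how the REST API typically reports failures, so treat that part as an assumption to verify against your own responses.

```python
def extract_paths(payload):
    """Return the paths in a parsed dbfs/list response.

    Raises RuntimeError when the payload looks like an API error
    (for example, an invalid or expired token).
    """
    if "error_code" in payload:
        raise RuntimeError(
            "{}: {}".format(payload["error_code"], payload.get("message", ""))
        )
    # Fall back to an empty list when the response has no "files" key
    return [entry["path"] for entry in payload.get("files", [])]


# Hypothetical payload shaped like a dbfs/list response
sample = {
    "files": [
        {"path": "/FileStore", "is_dir": True, "file_size": 0},
        {"path": "/tmp", "is_dir": True, "file_size": 0},
    ]
}
print(extract_paths(sample))
```

In the notebook, the parsed response from the cell above (`json.loads(r.text)`) could be passed to this helper in place of `sample`.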

### Scheduling with the REST API and CLI

Jobs can either be scheduled for running on a consistent basis or they can be run every time the API call is made.  Since there are many parameters in scheduling jobs, it's often best to schedule a job through the user interface, parse the configuration settings, and then run later jobs using the API.

Run the following cell to get a sense of what a basic job accomplishes.

In [16]:
path = dbutils.notebook.run("./Runnable/Runnable-4", 120, {"username": getUsername(), "ranBy": "NOTEBOOK"})
display(spark.read.parquet(path))

The notebook `Runnable-4` logs a timestamp and how the notebook was run.  We will use this log to track our jobs.

Schedule this notebook as a job using parameters by first navigating to the jobs panel on the left-hand side of the screen and creating a new job.  Customize the job as follows:<br><br>

1. Give the job a name
2. Choose the notebook `Runnable-4` in the `Runnable` directory of this course
3. Add parameters for `username`, which is your Databricks login email (this gives you a unique path to save your data), and set `ranBy` as `JOB`
4. Choose a cluster of 2 workers and 1 driver (the default is too large for our needs).  **You can also choose to run a job against an already active cluster, reducing the time to spin up new resources.**
5. Click **Run now** to execute the job.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-3/runnable-4-execution.png" style="height: 400px; margin: 20px"/></div>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Set recurring jobs in the same way by adding a schedule.
<img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Set email alerts in case of job failure.
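
The same two settings can also be expressed in a job's JSON configuration when working through the REST API.  The sketch below assumes the Jobs API 2.0 field names `schedule` (with `quartz_cron_expression` and `timezone_id`) and `email_notifications` (with `on_failure`).  The helper name, job name, cron expression, and email address are illustrative only.

```python
import json


def with_schedule_and_alerts(settings, cron, timezone, emails):
    """Return a copy of job settings with a schedule and failure alerts added."""
    updated = dict(settings)
    updated["schedule"] = {
        "quartz_cron_expression": cron,  # Quartz syntax, not UNIX cron
        "timezone_id": timezone,
    }
    updated["email_notifications"] = {"on_failure": list(emails)}
    return updated


# Hypothetical job that runs every day at 02:00 UTC
base = {"name": "Nightly-Runnable-4"}
nightly = with_schedule_and_alerts(base, "0 0 2 * * ?", "UTC", ["ops@example.com"])
print(json.dumps(nightly, indent=2))
```

Note that Quartz cron expressions start with a seconds field, so they are one field longer than the expressions used by the UNIX scheduler.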

When the job completes, paste the `Run ID` that appears under completed runs below.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> On the jobs page, you can access the logs to determine the cause of any failures.

In [19]:
try:
  runId = "FILL_IN"
  endPoint = domain + "jobs/runs/get?run_id={}".format(runId)

  # Print explicitly: a bare expression inside a try block is not displayed
  print(json.loads(requests.get(endPoint, headers=header).text))

except Exception as e:
  print(e)
  print("\n** Double check your runId and domain **\n")
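
To monitor a run rather than check it once, the same endpoint can be polled until the run reaches a terminal state.  The following is a minimal sketch, assuming the run's `state` object reports a `life_cycle_state` of `TERMINATED`, `SKIPPED`, or `INTERNAL_ERROR` when it is finished.  The function `wait_for_run` and its parameters are hypothetical, and the fetch function is passed in so the loop can be tried without a live workspace.

```python
import time

# Assumed set of life cycle states after which a run will not change again
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(fetch_run, run_id, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll fetch_run(run_id) until the run finishes; return its state dict."""
    for _ in range(max_polls):
        state = fetch_run(run_id).get("state", {})
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(
        "Run {} did not finish after {} polls".format(run_id, max_polls)
    )


# In the notebook, fetch_run could wrap the request from the cell above, e.g.:
# fetch_run = lambda run_id: requests.get(
#     domain + "jobs/runs/get", headers=header, params={"run_id": run_id}).json()
```

Bounding the number of polls keeps a stuck run from blocking the notebook indefinitely.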

Now take a look at the table to see the update.

In [21]:
display(spark.read.parquet(path))

With this design pattern, you can have full, programmatic access to Databricks.  <a href="https://docs.databricks.com/api/latest/examples.html#jobs-api-examples" target="_blank">See the documentation</a> for examples on submitting jobs from Python files and JARs and other API examples.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Always run production jobs on a new cluster to minimize the chance of unexpected behavior.  Autoscaling clusters allow a job to be allocated more resources elastically as needed.
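
As a rough illustration of the autoscaling note, a new cluster specification can carry an `autoscale` range in place of a fixed `num_workers`.  This sketch assumes the Jobs API 2.0 field names `autoscale`, `min_workers`, and `max_workers`.  The helper name and the worker counts are illustrative, and the runtime version string is an assumption chosen to match the DBR 6.2 prerequisite.

```python
def autoscaling_cluster(spark_version, node_type_id, min_workers, max_workers):
    """Build a new_cluster spec that scales between min_workers and max_workers."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # "autoscale" replaces the fixed "num_workers" setting
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


# Placeholder values; substitute a runtime and node type available in your workspace
print(autoscaling_cluster("6.2.x-scala2.11", "Standard_DS3_v2", 1, 4))
```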

## Exercise 1: Create and Submit a Job using the REST API

Now that a job has been submitted through the UI, we can easily capture and re-run that job.  Re-run the job using the REST API and different parameters.
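
One way to capture an existing job is to request its definition from the `jobs/get` endpoint and reuse the returned `settings` as the payload for a new job.  The sketch below works on an already parsed response and assumes that the configuration sits under a `settings` key, as in Jobs API 2.0.  The helper `clone_settings` and the sample response are hypothetical.

```python
import copy


def clone_settings(job_response, new_name, **param_overrides):
    """Copy a job's settings, rename it, and override notebook parameters."""
    settings = copy.deepcopy(job_response["settings"])
    settings["name"] = new_name
    task = settings.get("notebook_task")
    if task is not None:
        # Merge overrides into any parameters the original job defined
        task.setdefault("base_parameters", {}).update(param_overrides)
    return settings


# Hypothetical response shaped like the output of jobs/get?job_id=...
ui_job = {
    "job_id": 1,
    "settings": {
        "name": "Runnable-4 (UI)",
        "notebook_task": {
            "notebook_path": "/Shared/ETL-Part-3/Python/Runnable/Runnable-4",
            "base_parameters": {"username": "user@example.com", "ranBy": "JOB"},
        },
    },
}
print(clone_settings(ui_job, "Runnable-4 (API)", ranBy="REST-API"))
```

Copying the settings before changing them leaves the original response untouched, which makes it easier to compare the two configurations.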

### Step 1: Create the `POST` Request Payload

To create a new job, communicate the specifications about the job using a `POST` request.  First, define the following variables:<br><br>

* `name`: The name of your job
* `notebook_path`: The path to the notebook `Runnable-4`.  This is the `notebook_path` value listed in the response to the API call above.

In [25]:


import json

name = "Lesson-04-Lab"

notebook_path = "/Shared/ETL-Part-3/Python/Runnable/Runnable-4"

# getUsername() is the helper used earlier in this lesson (from Classroom-Setup)
username = getUsername()

data = {
  "name": name,
  "new_cluster": {
    # Matches the DBR 6.2 prerequisite; adjust to a runtime available in your workspace
    "spark_version": "6.2.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {"spark.databricks.delta.preview.enabled": "true"}
  },
  "notebook_task": {
    "notebook_path": notebook_path,
    "base_parameters": {
      "username": username, "ranBy": "REST-API"
    }
  }
}

data_str = json.dumps(data)
print(data_str)


### Step 2: Create the Job

Use the base `domain` defined above to create a URL for the REST endpoint `jobs/create`.  Then, submit a `POST` request using `data_str` as the payload.

In [27]:
# ANSWER
# createEndPoint = domain + "jobs/create"
# r = requests.post(createEndPoint, headers=header, data=data_str)

# job_id = json.loads(r.text).get("job_id")
# print(job_id)

### Step 3: Run the Job

Run the job using the `job_id` from above.  You'll need to submit a `POST` request to the `jobs/run-now` endpoint, stored here in the variable `RunEndPoint`.

In [29]:
# ANSWER
# RunEndPoint = domain + "jobs/run-now"

# data2 = {"job_id": job_id}
# data2_str = json.dumps(data2)

# r = requests.post(RunEndPoint, headers=header, data=data2_str)

# r.text
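
A run can also be started with parameters that differ from the ones stored in the job.  The following sketch builds such a payload, assuming that `jobs/run-now` accepts a `notebook_params` object that overrides the job's `base_parameters` for that run.  The helper name, the job ID, and the `ranBy` value are illustrative.

```python
import json


def run_now_payload(job_id, **notebook_params):
    """Build the JSON body for jobs/run-now, optionally overriding parameters."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return json.dumps(payload)


# Hypothetical job ID and parameter override
print(run_now_payload(42, ranBy="REST-API-RERUN"))
```

The resulting string could be sent as the `data` argument of the `POST` request in the same way as `data2_str` above.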

### Step 4: Confirm that the Job Ran

Confirm that the job ran by checking the parquet file.  It can take a few minutes for the job to run and update this file.

In [31]:
display(spark.read.parquet(path))

In [32]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

APICounts = (spark.read.parquet(path)
  .filter(col("ranBy") == "REST-API")
  .count()
)

if APICounts > 0:
  print("Tests passed!")
else:
  print("Test failed, no records found")

## Review
**Question:** In what ways can you schedule jobs on Databricks?  
**Answer:** Jobs can be scheduled using the UI, REST API, or Databricks CLI.

**Question:** How can you gain programmatic access to Databricks?  
**Answer:** Generating a token will give programmatic access to most Databricks services.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [35]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Job Failure]($./ETL3 05 - Job Failure).

## Additional Topics & Resources

**Q:** Where can I get more information on the REST API?  
**A:** Check out the <a href="https://docs.azuredatabricks.net/api/index.html" target="_blank">Databricks documentation.</a>

**Q:** How can I set up the Databricks CLI?  
**A:** Check out the <a href="https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#set-up-the-cli" target="_blank">Databricks documentation for step-by-step instructions.</a>

**Q:** How can I do a `spark-submit` job using the API?  
**A:** Check out the <a href="https://docs.azuredatabricks.net/api/latest/examples.html#spark-submit-api-example" target="_blank">Databricks documentation for API examples.</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>