
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# Conditional Tasks and Repairing Runs

Databricks Workflow Jobs can run tasks based on the result of previously run tasks. For example, you can set up a task to run only if a previous task fails.

Also, when a task fails, you can repair the run and restart tasks, without restarting the whole job. This can save a significant amount of time.

In this lesson, we will configure a pipeline with conditional logic, and we will learn how to repair runs.

Start by running the Classroom-Setup script below.

In [0]:
%run ./Includes/Classroom-Setup-05.1.1

Run the cell below to use the DA object to create a job. Note that the DA object exists only in Databricks Academy courses.

After the cell completes, click the link in the output of the cell to open the job in a new tab.

In [0]:
DA.create_job_v1()

## Conditional Tasks
We are going to make some changes to this job, and configure our main task to run only if all previous tasks run successfully. Complete the following:

1. On the job configuration page, click the **`Tasks`** tab in the upper-left corner. The **`Reset`** task is selected.
1. Change the name of the task to **`Ingest_Source_1`**.
1. Click the path field, change the notebook to **`Lesson 4 Notebooks/Ingest Source 1`**, and click **`Confirm`**.

So that we can simulate a real-world experience, this notebook is configured to either succeed or fail, based on a task parameter. At this point, we want to see what happens when an upstream notebook fails, so we will configure our task parameter so that the notebook will fail. Configure a task parameter as follows:

4. In the **`Parameters`** field, click **`Add`**.
1. For **`Key`**, type **`test_value`**.
1. For **`Value`**, type **`Failure`**.
1. Click **`Save task`**.
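
The lesson notebook itself is not reproduced here, but a notebook that succeeds or fails based on a task parameter could look roughly like the sketch below. The function name `ingest_or_fail` and the error message are hypothetical. Inside Databricks, the parameter would typically be read with `dbutils.widgets.get("test_value")`, which is shown only as a comment because `dbutils` is not available outside a Databricks notebook.

```python
def ingest_or_fail(test_value: str) -> str:
    """Raise when the parameter asks for a failure; otherwise report success."""
    if test_value == "Failure":
        # An uncaught exception in a notebook task causes the task to fail.
        raise RuntimeError("Simulated ingest failure (test_value=Failure)")
    return "ingest complete"


# Inside a Databricks notebook, the value would come from the task parameter:
# test_value = dbutils.widgets.get("test_value")
# ingest_or_fail(test_value)
```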

We now have our first task created. We are going to create two more tasks:

8. Click **`Add task`**, and select **`Notebook`**. 
1. Name the task **`Ingest_Source_2`**.
1. Click the path field, select the notebook **`Lesson 4 Notebooks/Ingest Source 2`**, and click **`Confirm`**.
1. In the **`Depends on`** field, click the **`x`** to remove the task. The **`Depends on`** field should be empty.
1. Click **`Create task`**.

Repeat the instructions to configure **`Ingest_Source_3`** using the notebook **`Lesson 4 Notebooks/Ingest Source 3`**.

You should now have three tasks that are all independent of one another. Note that the DAG shows the three tasks with no connections between them. Let's configure our final task and set up some conditional logic:

13. Click **`Add task`**, and select **`Notebook`**. 
1. Name the task **`Clean_Data`**.
1. Click the path field, select the notebook **`Lesson 4 Notebooks/Clean Data`**, and click **`Confirm`**.
1. Click the **`Depends on`** field, and select all three of the tasks we configured above to add them to the list. The DAG should show connections from the first three tasks to the **`Clean_Data`** task.
1. Click **`All succeeded`** in the **`Run if dependencies`** field to open the drop-down list.

Note the variety of conditions available. 

18. Select **`All succeeded`**.
1. Click **`Create task`**.
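
For readers who prefer to see the configuration as data, the four tasks built in the UI can be sketched as a list of task definitions. This assumes the Jobs API 2.1 field names (`task_key`, `depends_on`, `run_if`, `notebook_task`, `base_parameters`); the helper `notebook_task` is hypothetical, compute settings are omitted, and the notebook paths are written as they appear in the UI rather than as full workspace paths.

```python
import json

NOTEBOOK_DIR = "Lesson 4 Notebooks"  # as shown in the UI; the real path depends on your workspace


def notebook_task(task_key, notebook, depends_on=(), run_if=None, params=None):
    """Build one task definition as a plain dict (hypothetical helper)."""
    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": f"{NOTEBOOK_DIR}/{notebook}"},
    }
    if params:
        task["notebook_task"]["base_parameters"] = dict(params)
    if depends_on:
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    if run_if:
        task["run_if"] = run_if
    return task


ingest_keys = ["Ingest_Source_1", "Ingest_Source_2", "Ingest_Source_3"]

tasks = [
    # The first ingest task carries the parameter that forces a failure.
    notebook_task("Ingest_Source_1", "Ingest Source 1", params={"test_value": "Failure"}),
    notebook_task("Ingest_Source_2", "Ingest Source 2"),
    notebook_task("Ingest_Source_3", "Ingest Source 3"),
    # Clean_Data waits for all three ingest tasks and runs only if all succeed.
    notebook_task("Clean_Data", "Clean Data", depends_on=ingest_keys, run_if="ALL_SUCCESS"),
]

print(json.dumps(tasks, indent=2))
```

The three ingest tasks have no `depends_on` entry, which corresponds to the unconnected nodes in the DAG.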

## Run the Job
1. Run the job by clicking **`Run now`** in the upper-right corner. 

A pop-up window appears with a link to the job run. 

2. Click **`View run`**.

Watch the tasks in the DAG. The colors change to show the progress of the task:

* **Gray** -- the task has not started
* **Green stripes** -- the task is currently running
* **Solid green** -- the task completed successfully
* **Dark red** -- the task failed
* **Light red** -- an upstream task failed, so the current task never ran

When the run is finished, note that **`Ingest_Source_1`** failed. This was expected. Also note that our **`Clean_Data`** task never ran, because we required all three parent tasks to succeed before **`Clean_Data`** runs.
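
The outcome of this run can be reproduced with a deliberately simplified model of the **`Run if dependencies`** setting. The function `should_run` and the result strings are illustrative assumptions, not Databricks code, and only the two conditions used in this lesson are modeled.

```python
def should_run(run_if: str, upstream_results: dict) -> bool:
    """Decide whether a task runs, given the results of its upstream tasks."""
    succeeded = [result == "SUCCESS" for result in upstream_results.values()]
    if run_if == "ALL_SUCCESS":
        return all(succeeded)
    if run_if == "AT_LEAST_ONE_SUCCESS":
        return any(succeeded)
    raise ValueError(f"Condition not modeled in this sketch: {run_if}")


upstream = {
    "Ingest_Source_1": "FAILED",
    "Ingest_Source_2": "SUCCESS",
    "Ingest_Source_3": "SUCCESS",
}

print(should_run("ALL_SUCCESS", upstream))           # False
print(should_run("AT_LEAST_ONE_SUCCESS", upstream))  # True
```

With **`All succeeded`**, a single failed parent is enough to keep **`Clean_Data`** from running.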

## Repairing Job Runs
We have the ability to view the notebook used in a task, including its output, as part of the job run. This can help us diagnose errors. We also have the ability to re-run specific tasks in a failed job run. Consider the following example:

You are developing a job that includes a handful of notebooks. During a run of the job, one of the tasks fails. You can change the code in that notebook and re-run that task and any tasks that depended on that task. Additionally, you can change task parameters and re-run a task. Let's do this now:

1. In the upper-right corner, click **`Repair run`**.
1. Change the value for our **`test_value`** from **`Failure`** to **`Succeed`**.
1. Click **`Repair run (2)`**.

Note the "2" in the **`Repair run`** button. This is because Databricks selected the failed task and the task that depended on the failed task. You can select and deselect whichever tasks you wish for a re-run.

Lastly, when we use **`Repair run`** to change a parameter, the original parameter in the task definition is not changed, only the parameter in the current run. Since we don't want this notebook to fail in our future runs of this job:
* Update the task definition for **`Ingest_Source_1`** and set **`test_value`** to **`Succeed`**.
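
The same repair could also be requested outside the UI. The sketch below only builds a request body, assuming the Jobs API 2.1 `runs/repair` endpoint and its `run_id`, `rerun_tasks`, and `notebook_params` fields. The helper name is hypothetical, the run ID `123` is a placeholder, and nothing is sent to a workspace.

```python
import json


def build_repair_request(run_id: int, task_keys, notebook_params) -> dict:
    """Assemble a body for POST /api/2.1/jobs/runs/repair (hypothetical helper)."""
    return {
        "run_id": run_id,
        # Tasks to re-run: the failed task and the task that depended on it.
        "rerun_tasks": list(task_keys),
        # Applies to the repair run only; the task definition is unchanged.
        "notebook_params": dict(notebook_params),
    }


payload = build_repair_request(
    run_id=123,  # placeholder; use the ID of the failed run
    task_keys=["Ingest_Source_1", "Clean_Data"],
    notebook_params={"test_value": "Succeed"},
)

print(json.dumps(payload, indent=2))
```

As in the UI, the overridden parameter applies only to the repaired run, so the task definition still needs the update described above.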

## If/Else Condition Task
We can also perform branching logic based on a boolean condition. Let's see an example of this in action. Suppose our **`Clean_Data`** task pushes bad records to a quarantine table. We will want to fix these bad records, if possible, before continuing to the next task. Let's set up this logic:

1. Go back to the job definition page for our job and make sure you are on the **`Tasks`** tab.
1. Click **`Add task`**, and select **`If/else condition`**.
1. Name the task **`Bad_Record_Check`**.
1. For the condition field, type **`{{tasks.Clean_Data.values.bad_records}}`** (make sure to include both sets of curly braces; we will talk about this next) in the left-hand side field, choose **`==`** in the dropdown, and type **`0`** in the right-hand side field.
1. Ensure **`Depends on`** is set to **`Clean_Data`** and **`Run if dependencies`** is set to **`All succeeded`**.
1. Click **`Create task`**.
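
Expressed as data, the task configured above might look like the following sketch. It assumes the Jobs API 2.1 `condition_task` field with `op`, `left`, and `right`, where `EQUAL_TO` corresponds to the **`==`** choice in the dropdown; check the field names against the API reference before relying on them.

```python
bad_record_check = {
    "task_key": "Bad_Record_Check",
    "depends_on": [{"task_key": "Clean_Data"}],
    "run_if": "ALL_SUCCESS",
    "condition_task": {
        "op": "EQUAL_TO",
        # Left operand: a dynamic value reference, resolved when the job runs.
        "left": "{{tasks.Clean_Data.values.bad_records}}",
        # Right operand: a literal, written as a string.
        "right": "0",
    },
}

print(bad_record_check["condition_task"])
```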

## Dynamic Value References
Databricks Workflow Jobs provide [options](https://docs.databricks.com/en/workflows/jobs/parameter-value-references.html) for passing information about jobs/tasks to tasks and from one task to another. 

In step 4 above, we are passing a value with the key **`bad_records`** from our **`Clean_Data`** task into our **`Bad_Record_Check`** task and comparing it to 0. To see how we are setting our **`bad_records`** value, open the **`Clean_Data`** notebook by clicking the **`Clean_Data`** task in the DAG and then clicking the square-arrow icon to the right of the path field. As you can see, we use **`dbutils.jobs.taskValues.set(key = 'bad_records', value = 5)`**. Note that this value can be set dynamically in whatever way you wish.
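
To make the substitution concrete, here is a small stand-in that mimics how a reference of the form `{{tasks.<task>.values.<key>}}` could be resolved before the comparison. The `resolve` function and the `task_values` dictionary are illustrative assumptions, not how Databricks implements this, and the sketch assumes the operands of `==` are compared as strings.

```python
import re

# Matches references such as {{tasks.Clean_Data.values.bad_records}}.
_REF = re.compile(r"\{\{tasks\.(\w+)\.values\.(\w+)\}\}")


def resolve(expression: str, task_values: dict) -> str:
    """Replace each task value reference with the stored value, as text."""
    def lookup(match):
        task_key, key = match.group(1), match.group(2)
        return str(task_values[task_key][key])
    return _REF.sub(lookup, expression)


# Stand-in for what Clean_Data stored via dbutils.jobs.taskValues.set(...).
task_values = {"Clean_Data": {"bad_records": 5}}

left = resolve("{{tasks.Clean_Data.values.bad_records}}", task_values)
condition_is_true = (left == "0")

print(left, condition_is_true)  # 5 False
```

A downstream notebook can also read the value directly, for example with `dbutils.jobs.taskValues.get(taskKey="Clean_Data", key="bad_records", default=0)`, which works only inside Databricks.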

## Setting Our "False" Task
Let's set up a task that will fix bad records before moving on in the job:

1. Click **`Add task`** on the job definition page, and select **`Notebook`**.
1. Name the task **`Fix_Bad_Records`**.
1. For the path, navigate to **`Lesson 4 Notebooks/Fix Bad Records`**.
1. For **`Depends on`**, select only **`Bad_Record_Check (false)`**.
1. Ensure that **`Run if dependencies`** is set to **`All succeeded`**.
1. Click **`Create task`**.

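We are setting this task to run only if the **`Bad_Record_Check`** condition evaluates to false, meaning the value for **`bad_records`** did not equal 0. Note that a false result is a normal outcome of the condition task, not a task failure.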

## Aggregate Records Task
Let's set up a task for aggregating data. We want this task to run either if the **`Bad_Record_Check`** is true (meaning **`bad_records == 0`**) or if the **`Bad_Record_Check`** is false (meaning **`bad_records != 0`**) and the **`Fix_Bad_Records`** task completed:

1. Click **`Add task`** on the job definition page, and select **`Notebook`**.
1. Name the task **`Aggregate_Records`**.
1. For the path, navigate to **`Lesson 4 Notebooks/Aggregate Records`**.
1. For **`Depends on`**, select **`Bad_Record_Check (true)`** *and* **`Fix_Bad_Records`**.
1. Set **`Run if dependencies`** to **`At least one succeeded`**.
1. Click **`Create task`**.

This task will run if either:
* **`Bad_Record_Check`** is true, or
* **`Fix_Bad_Records`** completed successfully
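
The branch can be traced with a short simulation. The function `tasks_that_run` and its `fix_succeeds` flag are hypothetical and only model the three tasks downstream of **`Clean_Data`**.

```python
def tasks_that_run(bad_records: int, fix_succeeds: bool = True) -> list:
    """Return the downstream tasks that would run, in order."""
    ran = ["Bad_Record_Check"]
    check_is_true = (bad_records == 0)

    fix_succeeded = False
    if not check_is_true:
        # The false branch of the condition triggers Fix_Bad_Records.
        ran.append("Fix_Bad_Records")
        fix_succeeded = fix_succeeds

    # "At least one succeeded": the true branch OR a successful fix.
    if check_is_true or fix_succeeded:
        ran.append("Aggregate_Records")
    return ran


print(tasks_that_run(5))  # ['Bad_Record_Check', 'Fix_Bad_Records', 'Aggregate_Records']
print(tasks_that_run(0))  # ['Bad_Record_Check', 'Aggregate_Records']
```

If the fix itself fails while the check is false, neither dependency has succeeded, so `Aggregate_Records` does not run in this model.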

## Run the Job

**IMPORTANT NOTE**: If you have not already done so, before running the job, go to the **`Tasks`** tab on the job definition page, select **`Ingest_Source_1`**, and in the **`Parameters`** section, change **`test_value`** from **`Failure`** to **`Succeed`**.

Click **`Run now`** to run the job.
  
At the moment, our hard-coded **`bad_records`** value is set to 5, so we should see the **`Fix_Bad_Records`** task run before the **`Aggregate_Records`** task when we run the job.

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>