# **PHASE 2: DATA ENGINEERING (Days 5-8)**

## **DAY 7 (15/01/26) - Workflows & Job Orchestration**



### **Section 1 - Learn**:

### **_1. Databricks Jobs vs notebooks_**

In Databricks, **Notebooks** are for development and exploration, while **Jobs** are for production and automation. Think of the Notebook as the "draft" and the Job as the "scheduled execution."

##### **1. Databricks Notebooks (Development)**

* **Interactive Coding:** Designed for a "human-in-the-loop" experience where you run cells one by one and see results (tables/charts) immediately.
* **Collaboration:** Multiple users can edit the same notebook simultaneously (like Google Docs) and leave comments.
* **Manual Execution:** Typically run on **All-Purpose Clusters**, which stay up between commands until they are terminated manually or by auto-termination. They cost more but respond instantly during testing.
* **Flexibility:** Easily switch between Python, SQL, Scala, and R within the same file using magic commands (`%sql`).
* **Visualizations:** Includes built-in `display()` functions to render interactive plots and data profiles for quick analysis.

##### **2. Databricks Jobs (Production)**

* **Automation:** Used to schedule notebooks, JARs, or Python scripts to run at specific times (CRON) or in a sequence (Workflows).
* **Job Clusters (Transient):** Jobs usually run on **Job Clusters**, which spin up when the run starts and shut down as soon as it finishes. Their DBU rate is **significantly lower** than All-Purpose compute (approx. half the cost, depending on cloud and pricing tier).
* **Orchestration:** Jobs allow you to create **DAGs (Directed Acyclic Graphs)** of tasks, for example: "Run Ingest Notebook -> if success, run Transform Notebook -> if failure, send Email."
* **Reliability:** Includes built-in retry logic, timeout settings, and sophisticated alerting (Email/Slack/PagerDuty) if a task fails.
* **Parameters:** You can pass different values (like `date` or `region`) into a Job at runtime. A notebook task reads them through **Widgets**, allowing one notebook to serve many different scheduled tasks.

##### **Key Comparison Table**

| Feature | Notebooks | Jobs (Workflows) |
| --- | --- | --- |
| **Primary Goal** | Exploration & Prototyping | Production & Automation |
| **Compute Type** | All-Purpose Cluster (Interactive) | Job Cluster (Transient/Short-lived) |
| **Cost** | High ($$ per DBU) | Low ($ per DBU) |
| **Execution** | Manual / Cell-by-cell | Scheduled / Automated Trigger |
| **Error Handling** | Manual debugging | Auto-retries & Notifications |
| **Version Control** | Integrated Repos (Git) | Runs specific Git commits/tags |

##### **The "Best Practice" Workflow**

1. **Develop** your logic interactively in a **Notebook** using a small sample of data.
2. **Modularize** the code so it can accept parameters (e.g., using `dbutils.widgets`).
3. **Deploy** that notebook by creating a **Job** that runs it on a schedule against the full production dataset.


---

### **_2. Multi-task workflows_**

In Databricks, a **Multi-task Workflow** allows you to orchestrate complex data pipelines by stringing together multiple individual tasks into a **Directed Acyclic Graph (DAG)**. This replaces the need for external tools like Airflow for many Lakehouse use cases.

##### **1. Core Concepts of Multi-task Jobs**

* **Task Dependencies:** You can define the order of execution. For example, the "Silver" task only starts if the "Bronze" task completes successfully.
* **Task Types:** A single workflow can mix different types of tasks, including **Notebooks**, **Python scripts**, **SQL queries**, **Delta Live Tables**, and even **dbt** projects.
* **Shared Compute:** Multiple tasks can run on the same **Job Cluster** to save on startup time, or you can assign different clusters to different tasks based on their resource needs (e.g., a memory-heavy cluster for ML training and a small one for SQL).
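
The dependency and shared-compute ideas above can be sketched as a job definition in the shape of a Jobs API 2.1 create request, written here as a plain Python dict. The job name, notebook paths, cluster key, runtime version, and node type are hypothetical placeholders, and `order_tasks` is an illustrative helper (not a Databricks function) that shows the execution order the DAG implies.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition; all names and paths are placeholders.
job_spec = {
    "name": "ecommerce_medallion",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # assumed runtime label
                "node_type_id": "i3.xlarge",          # assumed node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "bronze",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Workspace/pipelines/bronze"},
        },
        {
            "task_key": "silver",
            "depends_on": [{"task_key": "bronze"}],  # starts only after bronze succeeds
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Workspace/pipelines/silver"},
        },
        {
            "task_key": "gold",
            "depends_on": [{"task_key": "silver"}],
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Workspace/pipelines/gold"},
        },
    ],
}


def order_tasks(spec):
    """Return task keys in an order that respects every depends_on edge."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in spec["tasks"]
    }
    # static_order() raises CycleError if the dependencies are not acyclic.
    return list(TopologicalSorter(graph).static_order())


print(order_tasks(job_spec))
```

All three tasks reference the same `job_cluster_key`, so they reuse one cluster instead of paying the startup time three times.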

##### **2. Key Features for Production**

* **Conditional Execution:** Use "If/Else" logic within the workflow (e.g., "If the data quality check fails, run the Quarantine task; otherwise, run the Gold task").
* **Task Values:** Tasks can "talk" to each other. You can set a value in Task A (like a file path or a record count) and retrieve it in Task B using `dbutils.jobs.taskValues`.
* **Repair and Rerun:** If a workflow with 10 tasks fails at task #8, you don't have to restart the whole thing. You can **"Repair"** the run, and Databricks will only execute the failed task and those that follow it.
* **For Each Tasks:** Allows you to run the same task multiple times, optionally in parallel, with different parameters (e.g., running the same "Report" notebook for 50 different countries).
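
As a minimal sketch of the Task Values feature, the two helpers below wrap `dbutils.jobs.taskValues`. The function names, the `row_count` key, and the `bronze` task key are assumptions made for illustration; `dbutils` is passed in as an argument only so the functions can be exercised outside a notebook.

```python
def publish_row_count(dbutils, row_count):
    """Run in the upstream task: store a small value for later tasks."""
    dbutils.jobs.taskValues.set(key="row_count", value=row_count)


def read_row_count(dbutils, upstream_task="bronze"):
    """Run in a downstream task: read the value set by `upstream_task`."""
    return dbutils.jobs.taskValues.get(
        taskKey=upstream_task,
        key="row_count",
        default=0,     # returned when the key was never set
        debugValue=0,  # returned when the notebook runs outside a job
    )
```

In a notebook you would call `publish_row_count(dbutils, df.count())` in the first task and `read_row_count(dbutils)` in the second. Task values are meant for small items such as counts or paths, not for datasets.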

##### **3. Workflow Triggers**

* **Scheduled:** Traditional CRON-based timing (e.g., every day at 2:00 AM).
* **File Arrival:** Triggered automatically as soon as new files land in your cloud storage (S3/ADLS).
* **Continuous:** The job restarts immediately after it finishes, ideal for near real-time processing.
* **API-based:** Triggered via an external call (e.g., from an Azure Data Factory pipeline or a GitHub Action).

##### **4. Best Practices for Design**

* **Modularity:** Keep your notebooks small and focused on one goal (e.g., one notebook for Bronze, one for Silver). This makes debugging much easier.
* **Use Job Clusters:** Always use **New Job Clusters** for production to take advantage of the lower DBU pricing compared to All-Purpose clusters.
* **Notifications:** Configure **System Notifications** for specific events (e.g., On Failure, On Duration Exceeded) to ensure your team is alerted before the business notices a delay.
* **Timeouts:** Set a maximum completion time for each task to prevent a "stuck" process from wasting money and blocking the rest of the pipeline.

---

### **_3. Parameters & scheduling_**

In Databricks, **Parameters** allow you to make your code dynamic, while **Scheduling** ensures your logic runs at the right time without manual intervention.

##### **1. Parameters using Widgets**

Widgets are the primary way to pass variables into a Databricks notebook. They allow the same notebook to be reused for different dates, regions, or data sources.

* **Types of Widgets:** You can create text boxes, dropdown menus, or multi-select lists at the top of your notebook.
* **Default Values:** Always set a default value in your code so the notebook can run even if no parameter is provided.
* **Accessing Values:** Use the `dbutils.widgets` API to retrieve the value passed by the user or the Job.
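* **SQL Integration:** You can access these parameters directly in SQL cells using the `$parameter_name` syntax. This is legacy string substitution; recent runtimes recommend named parameter markers such as `:parameter_name`.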

**Example (Python):**

```python
from pyspark.sql import functions as F

# Create a text widget with a default value
dbutils.widgets.text("run_date", "2026-01-01")

# Read the value into a variable (widget values are always strings)
run_date = dbutils.widgets.get("run_date")

# Use it in a query
df = spark.table("sales").filter(F.col("date") == run_date)
```

##### **2. Scheduling Patterns**

Scheduling turns your interactive notebook into a production pipeline.

* **CRON Expressions:** Use Quartz CRON syntax, which adds a leading seconds field to the classic format, for complex timing (e.g., `0 30 2 * * ?` for every day at 2:30 AM).
* **Simple Scheduler:** A user-friendly UI for selecting frequency (Minutes, Hours, Days, Weeks).
* **Time Zone Awareness:** Ensure you set the schedule to the correct time zone (usually UTC for global teams) to avoid confusion around Daylight Saving Time changes.
* **Concurrent Runs:** You can control whether a second run may start while the first one is still running (the **Maximum concurrent runs** setting).
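
The scheduling settings above map onto a few fields of a job definition. The sketch below assumes the Jobs API 2.1 field names; `daily_quartz` is a hypothetical helper written for this note, not part of any Databricks library.

```python
def daily_quartz(hour, minute=0):
    """Build a Quartz cron expression for a once-a-day run.

    Field order: seconds minutes hours day-of-month month day-of-week.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # '?' means "no specific value" for the day-of-week field.
    return f"0 {minute} {hour} * * ?"


# Fragment of a job definition: run daily at 02:30 UTC, never overlapping.
schedule_settings = {
    "schedule": {
        "quartz_cron_expression": daily_quartz(2, 30),
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "max_concurrent_runs": 1,  # a new run waits or is skipped while one is active
}

print(schedule_settings["schedule"]["quartz_cron_expression"])
```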

##### **3. Advanced Triggering**

Beyond simple time-based schedules, Databricks supports event-driven triggers:

* **File Arrival Trigger:** The job starts as soon as a new file is detected in a specific S3 bucket or Azure Container. This is much more efficient than "polling" every hour.
* **Continuous Execution:** The job runs in a loop. As soon as one run finishes, the next one starts. This suits low-latency streaming workloads (Delta Live Tables pipelines also have their own continuous mode).
* **API Triggering:** Use the **Jobs API** to trigger runs from external tools like Airflow, Azure Data Factory, or GitHub Actions.
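
A rough sketch of these three trigger styles, assuming the Jobs API 2.1 field names. The bucket URL, workspace host, token, and job ID are hypothetical placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request

# Job-definition fragments for event-driven triggers (placeholder URL).
file_arrival_trigger = {
    "trigger": {"file_arrival": {"url": "s3://example-bucket/landing/"}}
}
continuous_trigger = {"continuous": {"pause_status": "UNPAUSED"}}


def build_run_now_request(host, token, job_id, job_parameters):
    """Build (but do not send) a POST request that starts a job run."""
    body = json.dumps({"job_id": job_id, "job_parameters": job_parameters})
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_now_request(
    host="https://example.cloud.databricks.com",  # hypothetical workspace URL
    token="<personal-access-token>",              # never hardcode a real token
    job_id=123,                                   # hypothetical job ID
    job_parameters={"run_date": "2026-01-01"},
)
# urllib.request.urlopen(request) would submit the run from an external tool.
```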

##### **4. Best Practices**

* **Parameterized Paths:** Don't hardcode file paths. Use a parameter for the "environment" (e.g., `/mnt/data/${env}/orders`) so the same job can run in Dev, Test, and Prod.
* **Job Parameters vs. Task Parameters:** You can set parameters at the **Job level** (available to all tasks) or the **Task level** (specific to one notebook).
* **Monitoring Alerts:** Always set an "On Failure" alert to notify your team via Email or Slack if a scheduled run fails.
* **Timeout Settings:** Define a timeout for scheduled jobs (e.g., 2 hours) so that a "stuck" job doesn't run forever and consume your entire budget.
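
The parameterized-path practice can be sketched with a small helper. `orders_path` and `ALLOWED_ENVS` are hypothetical names; in a notebook the `env` value would typically come from `dbutils.widgets.get("env")`.

```python
ALLOWED_ENVS = ("dev", "test", "prod")


def orders_path(env, base="/mnt/data"):
    """Return the orders path for one environment, rejecting unknown values."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"Unknown environment {env!r}; expected one of {ALLOWED_ENVS}")
    return f"{base}/{env}/orders"


print(orders_path("dev"))
```

Validating the value up front means a typo in a job parameter fails fast instead of silently reading from or writing to an unexpected location.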

---

### **_4. Error handling_**

Error handling in Databricks is critical for transitioning from "it works on my machine" to a robust production pipeline. It involves a mix of standard Python try-except blocks and Databricks-specific orchestration features.

##### **1. Notebook-Level Error Handling (Python)**

* **Try-Except Blocks:** Use these to catch specific Spark exceptions (like `AnalysisException` for missing tables) and log them, so you can add context or clean up before deciding whether to re-raise.
* **Graceful Exits:** `dbutils.notebook.exit("message")` stops a notebook early and returns a string to the caller (for example, a parent notebook using `dbutils.notebook.run`). The run is still marked **successful**, so to make a Job task fail, and to trigger retries and alerts, raise an exception instead.
* **Assertion Checks:** Use `assert` or `if/raise` to validate data before processing (e.g., "Raise error if the input DataFrame is empty").
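
A minimal sketch of the `if/raise` validation idea. `validate_input` is a hypothetical helper; it relies only on the `columns` attribute and the `limit()` and `count()` methods that Spark DataFrames provide.

```python
def validate_input(df, required_columns):
    """Raise ValueError if `df` lacks required columns or has no rows."""
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # limit(1) avoids counting the whole table just to detect emptiness.
    if df.limit(1).count() == 0:
        raise ValueError("Input DataFrame is empty")
    return df
```

Raising here (rather than printing a warning) makes the task fail, so the Job's retry and alert settings take over.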

##### **2. Workflow/Job-Level Handling**

* **Retries:** Configure the **Max Retries** setting in the Job UI. If a task fails due to a "spot instance" being reclaimed or a temporary network glitch, Databricks will automatically try again.
* **Timeout Limits:** Set a **Timeout** for every task. If a notebook gets stuck in an infinite loop or a "deadlock," Databricks will kill the task rather than letting it burn your budget.
* **On-Failure Tasks:** Create a specific task in your Workflow that only runs if a previous task fails. This is commonly used to send a custom Slack/Email alert or "clean up" temporary files.
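
The three settings above can be sketched as task definitions, assuming the Jobs API 2.1 field names. Task keys and notebook paths are hypothetical, and `failure_handlers` is an illustrative helper rather than a Databricks function.

```python
tasks = [
    {
        "task_key": "transform",
        "notebook_task": {"notebook_path": "/Workspace/pipelines/transform"},
        "max_retries": 2,                     # retry up to twice after a failure
        "min_retry_interval_millis": 60_000,  # wait one minute between attempts
        "retry_on_timeout": False,            # a timeout is not retried
        "timeout_seconds": 3600,              # stop the task after one hour
    },
    {
        "task_key": "notify_failure",
        "depends_on": [{"task_key": "transform"}],
        "run_if": "AT_LEAST_ONE_FAILED",      # runs only if a dependency failed
        "notebook_task": {"notebook_path": "/Workspace/pipelines/notify"},
    },
]


def failure_handlers(task_list):
    """Return the keys of tasks that run only when an upstream task fails."""
    return [t["task_key"] for t in task_list if t.get("run_if") == "AT_LEAST_ONE_FAILED"]


print(failure_handlers(tasks))
```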

##### **3. Delta Lake "ACID" Safety**

* **Atomic Writes:** Because Delta is ACID compliant, a write that fails midway is never committed to the transaction log, so readers never see partial data. Any orphaned data files are unreferenced and can be removed later by `VACUUM`; you don't need to write manual "cleanup" code for failed writes.
* **Expectations (DLT):** In Delta Live Tables, you can define **Expectations** to handle bad data:
    * `expect`: Record the violation in the pipeline metrics but keep the row.
    * `expect_or_drop`: Drop the specific rows that fail the check.
    * `expect_or_fail`: Fail the pipeline update as soon as a row violates the check.



##### **4. Logging and Monitoring**

* **Log4j / Python Logging:** Use the standard `logging` library to send custom logs to the **Driver Logs** tab in the cluster UI.
* **Query Watchdog:** Enable Query Watchdog through the cluster's Spark configuration (`spark.databricks.queryWatchdog.enabled`) to automatically stop queries that produce far more output rows than input rows, such as an accidental "Cartesian Product" (huge join).
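
A small sketch of the Python logging setup, using only the standard library. The logger name and message are placeholders, and the assumption is that output written to the driver's standard error stream shows up in the Driver Logs tab.

```python
import logging
import sys


def get_pipeline_logger(name="pipeline"):
    """Return an INFO-level logger that writes timestamped lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when a cell is re-run
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = get_pipeline_logger("silver_load")
log.info("Starting load for run_date=%s", "2026-01-01")
```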

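##### **Common Pattern: The "Try-Except-Raise" Template**

`dbutils.notebook.exit()` ends the notebook with a success status, so a template that only exits inside `except` would hide the failure from the Job. Re-raise after logging so the task is marked as failed and retries and alerts can fire.

```python
try:
    # Your Spark logic here
    df = spark.read.table("bronze_data")
    df.write.mode("append").saveAsTable("silver_data")

except Exception as e:
    # Log the error, then re-raise so the Job UI marks the task as failed
    error_msg = f"Task failed: {e}"
    print(error_msg)
    raise

# Reached only on success: return a status string to the caller
dbutils.notebook.exit("SUCCESS")
```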

---

### **Practice**

```python
from pyspark.sql import functions as F
```

```python
def load_ecommerce_dataset(Month_name):
    df = spark.read.csv(
        f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{Month_name}.csv",
        header=True,
        inferSchema=True,
    )
    return df
```

```python
# df_n = load_ecommerce_dataset("Nov")
df_o = load_ecommerce_dataset("Oct")
```

#### **1. Add parameter widgets to notebooks** 

```python
# Add widgets for parameters
dbutils.widgets.text("source_path", "workspace.silver.ecommerce_cleaned")
dbutils.widgets.dropdown("layer", "bronze", ["bronze","silver","gold"])
dbutils.widgets.dropdown("month", "Oct", ["Oct", "Nov"])
```

```python
# Use parameters
source = dbutils.widgets.get("source_path")
layer = dbutils.widgets.get("layer")
month = dbutils.widgets.get("month")
```

```python
print(source)
print(layer)
print(month)
```

Output:

```text
workspace.silver.ecommerce_cleaned
silver
Oct
```


---

### **Resources**
- [Databricks Jobs](https://docs.databricks.com/jobs/)
- [Parameters](https://docs.databricks.com/jobs/parameters.html)

----