# 🎼 **Lab 2 - Orchestrating Spark**
In this module, we will explore how to orchestrate Spark workloads using Data Factory, the Fabric Scheduler, and built-in orchestrator functions. We will also explore how to use resource files to make code more modular.

## 🎯 What You'll Learn 

By the end of this lab, you'll gain insights into:  

- Referencing a Notebook via ```%run```
- Referencing a Notebook via ```notebookutils.notebook.run```
- Referencing multiple Notebooks via ```notebookutils.notebook.runMultiple```
- Using Notebook resources
- Adding Notebooks to pipelines
- Running Notebooks in a High Concurrency (HC) session
- Scheduling Notebooks with the Fabric Scheduler
- Using Notebook/Environment resources to build and orchestrate modular code
- Using a Spark Job Definition as a batch job
---

**Get Ready to Code!**
Now that you have an overview, let's get started with hands-on exercises! 🚀


## 🚧 **2.1 Create and Prepare Code Assets**
### **2.1.1 Create Child Notebooks**
Create two notebooks: `childNotebook1` and `childNotebook2`. You can create them manually or download and upload the prebuilt versions:
- [childNotebook1.ipynb](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/childNotebook1.ipynb)
- [childNotebook2.ipynb](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/childNotebook2.ipynb)

<details> <summary><strong>🔑 childNotebook1:</strong> Click to reveal code</summary>

```python
# Code cell 1, marked as parameters cell
parameter1 = ''
parameter2 = ''

# Code cell 2
print(f'This is child notebook with parameter1 = {parameter1}, parameter2 = {parameter2}')

# Code cell 3
# Exit the notebook and return an exit value to the caller
notebookutils.notebook.exit(f'Exit with current Notebook Name: {notebookutils.runtime.context["currentNotebookName"]}')

```
</details>

<details> <summary><strong>🔑 childNotebook2:</strong> Click to reveal code</summary>

```python
# Code cell 1, marked as parameters cell
input1 = ''
input2 = ''

# Code cell 2
print("cell1 in childNotebook2")
print(f'input1 = {input1}\ninput2 = {input2}')

# Code cell 3
# Exit the notebook and return an exit value to the caller
notebookutils.notebook.exit(f'Exit with current Notebook Name: {notebookutils.runtime.context["currentNotebookName"]}')
```
</details>

### **2.1.2 Create Python Files**
Now we will create a simple Python module in the Notebook built-in resource.

Run the following to create an empty Python module file called `observations.py` (its code is added in section 2.3.1):


In [None]:
mssparkutils.fs.put("file:///synfs/nb_resource/builtin/observations.py", "", True)

## 🔗 **2.2 Run and Chain Notebooks**
### **2.2.1 Inject a Notebook with _%run_**
Use [`%run`](https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook#reference-run) to inject another Notebook's code into the current session:

```python
%run childNotebook1 { 'parameter1': 'value1', 'parameter2': 'value2' }
```
You can also reference Python or SQL files from Notebook or Environment resource folders:

```python
%run [-b/--builtin | -e/--environment | -c/--current] script_file.py/.sql [variables ...]
```

`%run` options:
- `-b` / `--builtin`: Built-in notebook resources
- `-e` / `--environment`: Environment resources
- `-c` / `--current`: Always uses the current Notebook's resources, even if the current Notebook is referenced by other Notebooks

📌 **Challenge:** Use `%run` to run the code from **childNotebook2**


In [None]:
%run

<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

```python
%run childNotebook2 { 'input1': 'foo', 'input2': 'bar' }
```

</details>

### **2.2.2 Run a Notebook Programmatically with _notebookutils.notebook.run_**
The [```notebookutils.notebook.run```](https://learn.microsoft.com/en-us/fabric/data-engineering/notebook-utilities#reference-a-notebook) function references a notebook and returns its exit value. You can nest these calls, and you can run them either interactively in a notebook or from a pipeline. The referenced notebook runs on the Spark pool of the notebook that calls the function. Unlike `%run`, a notebook run this way shows up as a distinct job in the Monitoring hub, with a Notebook snapshot available.

```python
notebookutils.notebook.run("notebook name", <timeoutSeconds>, <parameterMap>, <workspaceId>)
```

![nbutils.run](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Reference%20notebook%20via%20nbutils.jpg?raw=true)
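
The child notebooks in this lab exit with a plain string. If you want to hand back more than a single message, one option is to serialize a small dictionary to JSON in the child notebook and parse it in the caller. The sketch below illustrates the idea under that assumption; the helper names `build_exit_value` and `parse_exit_value` and the field names are hypothetical, and the `notebookutils` calls appear only in comments so the snippet also runs outside Fabric.

```python
import json

def build_exit_value(status: str, **details) -> str:
    """Serialize a status plus arbitrary details into a JSON string."""
    return json.dumps({"status": status, **details})

def parse_exit_value(exit_value: str) -> dict:
    """Parse a JSON exit value; wrap anything else so callers always get a dict."""
    try:
        parsed = json.loads(exit_value)
    except json.JSONDecodeError:
        return {"status": "unknown", "raw": exit_value}
    return parsed if isinstance(parsed, dict) else {"status": "unknown", "raw": exit_value}

# In the child notebook (hypothetical usage):
#   notebookutils.notebook.exit(build_exit_value("ok", rows_written=42))
# In the calling notebook:
#   raw = notebookutils.notebook.run("childNotebook1", 90, {"parameter1": "foo", "parameter2": "bar"})
#   result = parse_exit_value(raw)

example = build_exit_value("ok", rows_written=42)
print(parse_exit_value(example))
print(parse_exit_value("Exit with current Notebook Name: childNotebook1"))
```

The fallback branch keeps the caller working even when a child notebook still exits with a plain message, as the two child notebooks above do.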

### **2.2.3 Reference Multiple Notebooks via _notebookutils.notebook.runMultiple_**
The [`notebookutils.notebook.runMultiple`](https://learn.microsoft.com/en-us/fabric/data-engineering/notebook-utilities#reference-run-multiple-notebooks-in-parallel) function allows you to run multiple notebooks in parallel or according to a predefined DAG (directed acyclic graph). The API executes the child notebooks much like high-concurrency mode: they run in the same Spark session, so compute resources are shared.

```python
notebookutils.notebook.runMultiple(["NotebookSimple", "NotebookSimple2"])
```

📌 **Challenge:** Use `runMultiple` to run both **childNotebook1** and **childNotebook2**:

<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

~~~python
exitValues = notebookutils.notebook.runMultiple(["childNotebook1", "childNotebook2"])
print(exitValues)
~~~
</details>

---

💡 **Tip:** You can use the `json` Python module to format the exit values for easier reading:

<br>

```python
import json
print(json.dumps(exitValues, indent=4))
```

![nbutils.multirun](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Reference%20multi%20notebooks%20via%20nbutils.jpg?raw=true)
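
To act on the results rather than just print them, you can loop over the returned dictionary. The sketch below assumes each entry maps a notebook (or activity) name to a dictionary with `exitVal` and `exception` keys. That shape is an assumption here, so print your own `exitValues` first to confirm it. The sample data and the `summarize_results` helper are illustrative only.

```python
def summarize_results(results: dict) -> dict:
    """Split runMultiple-style results into succeeded and failed notebooks."""
    summary = {"succeeded": {}, "failed": {}}
    for name, outcome in results.items():
        if outcome.get("exception"):
            summary["failed"][name] = str(outcome["exception"])
        else:
            summary["succeeded"][name] = outcome.get("exitVal")
    return summary

# Illustrative stand-in for the value returned by runMultiple (assumed shape)
sample_results = {
    "childNotebook1": {"exitVal": "Exit with current Notebook Name: childNotebook1", "exception": None},
    "childNotebook2": {"exitVal": None, "exception": "Simulated failure"},
}

summary = summarize_results(sample_results)
print("Succeeded:", summary["succeeded"])
print("Failed:", summary["failed"])
```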


<br>

#### **2.2.3.1 Specifying a DAG for Additional Control**
Run the code below to see an example of how you can use a DAG to control the exact sequencing and Notebook-level configuration options:

In [None]:
DAG = {
    "activities": [
        {
            "name": "step1", # activity name, must be unique
            "path": "childNotebook1", # notebook path
            "timeoutPerCellInSeconds": 90, # max timeout for each cell, default to 90 seconds
            "args": {"parameter1": "foo", "parameter2": "bar"}, # notebook parameters
        },
        {
            "name": "step2",
            "path": "childNotebook2",
            "timeoutPerCellInSeconds": 120,
            "args": {"input1": "foo", "input2": "bar"},
            "dependencies": ["step1"]
        }
    ],
    "timeoutInSeconds": 43200, # max timeout for the entire DAG, default to 12 hours
    "concurrency": 50 # max number of notebooks to run concurrently, default to 50, this is limited by the number of executors in your Spark Pool.
}
results = notebookutils.notebook.runMultiple(DAG)
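
Before submitting a larger DAG, it can help to check it locally. The following is a minimal sketch and not part of `notebookutils`: the hypothetical helper `validate_dag` checks that activity names are unique and that every dependency refers to a defined activity, then uses Python's standard `graphlib` module to detect cycles and return one valid execution order.

```python
from graphlib import CycleError, TopologicalSorter

def validate_dag(dag: dict) -> list:
    """Return one valid execution order, or raise ValueError if the DAG is malformed."""
    activities = dag.get("activities", [])
    names = [activity["name"] for activity in activities]
    duplicates = {name for name in names if names.count(name) > 1}
    if duplicates:
        raise ValueError(f"Duplicate activity names: {sorted(duplicates)}")

    # Map each activity to the set of activities it depends on
    graph = {}
    for activity in activities:
        dependencies = activity.get("dependencies", [])
        unknown = [dep for dep in dependencies if dep not in names]
        if unknown:
            raise ValueError(f"{activity['name']} depends on undefined activities: {unknown}")
        graph[activity["name"]] = set(dependencies)

    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        raise ValueError(f"Dependency cycle detected: {error.args[1]}") from error

example_dag = {
    "activities": [
        {"name": "step1", "path": "childNotebook1"},
        {"name": "step2", "path": "childNotebook2", "dependencies": ["step1"]},
    ]
}
print(validate_dag(example_dag))  # ['step1', 'step2']
```

A check like this only catches structural mistakes; timeouts, paths, and parameters are still validated when the notebooks actually run.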

### **2.2.4 Manual Multithreading**
Manual multithreading puts **you** in control. Instead of relying on notebook chaining or orchestration tools, you can spin up lightweight concurrent task execution right inside your notebook using Python's built-in `concurrent.futures` module.

This approach is perfect for:

- 🔄 **I/O-bound operations** (e.g., API calls, file reads)
- ⏱️ **Parallelizing lightweight tasks** without leaving the notebook
- 🧪 Quick experiments where full orchestration would be overkill

You define a function, launch it in multiple threads, and collect the results—all in the same cell. No snapshotting, no external runners, no magic, just raw code.

> ⚠️ With great power comes great responsibility:  
> Multithreading gives you raw control, but it also means managing error handling, result collection, and potential thread safety issues yourself.

Used wisely, it's a powerful tool in your notebook arsenal—especially when speed and flexibility matter more than rich monitoring and debugging capabilities.

Run the code below, which simulates running 8 instances of a lightweight task.


In [None]:
import concurrent.futures
import time

# Example job function
def job(job_id):
    print(f"Starting job {job_id}")
    time.sleep(5)  # Simulate a task
    print(f"Finished job {job_id}")
    return f"Job {job_id} completed."

# Total jobs to execute
total_jobs = 8
# Maximum number of concurrent jobs
max_concurrency = 8

# Use ThreadPoolExecutor to manage concurrency, capped at max_concurrency threads
with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrency) as executor:
    # Submit all jobs to the executor
    futures = [executor.submit(job, job_id) for job_id in range(total_jobs)]
    
    # As each job completes, you can process its result (if needed) in the order they are completed
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
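
Because error handling is your responsibility with this approach, a common pattern is to map each future back to its input and wrap `future.result()` in a `try`/`except`, so that one failing task does not hide the results of the others. The sketch below is illustrative; `flaky_job`, `run_jobs`, and the choice to fail on job 3 are made up for the demonstration.

```python
import concurrent.futures

def flaky_job(job_id: int) -> str:
    """Simulated task that fails for one specific input."""
    if job_id == 3:
        raise ValueError(f"Job {job_id} hit a simulated error")
    return f"Job {job_id} completed."

def run_jobs(job_ids, max_concurrency: int = 4):
    """Run all jobs in a thread pool and collect successes and failures separately."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        # Map each future back to the job_id that produced it
        future_to_id = {executor.submit(flaky_job, job_id): job_id for job_id in job_ids}
        for future in concurrent.futures.as_completed(future_to_id):
            job_id = future_to_id[future]
            try:
                results[job_id] = future.result()  # re-raises any exception from the thread
            except Exception as error:
                errors[job_id] = str(error)
    return results, errors

results, errors = run_jobs(range(6))
print(f"Succeeded: {sorted(results)}")
print(f"Failed: {errors}")
```

Collecting failures in a dictionary instead of letting the first exception propagate lets you decide afterwards whether to retry, log, or fail the whole notebook.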

### **2.2.5 Data Factory Pipelines**
The Notebook activity in a pipeline allows you to run a Notebook created in Microsoft Fabric. You can create a Notebook activity directly through the Fabric user interface. This section walks through how to create a Notebook activity using the Data Factory user interface.

> ⚠️ **For this section there's no interactive lab material but feel free to explore during one of the breaks.**

#### **2.2.5.1 Orchestrating as Notebook Activity**
Add a notebook to a new or existing pipeline:

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Add%20to%20pipeline.jpg?raw=true)

<br>

#### **2.2.5.2 Add session tag for High Concurrency Session to reuse your sessions**

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/session-tag-001.png?raw=true)

<br>

Trigger a pipeline run:

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/trigger%20pipeline%20run.jpg?raw=true)

<br>


Select the Settings tab, select an existing notebook from the Notebook dropdown, and optionally specify any parameters to pass to the notebook.

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Pass%20parameters%20from%20Notebook%20activity%20.jpg?raw=true)

<br>




### **2.2.6 Four ways to run notebooks. One right choice—depending on your goal**

Notebooks are powerful—but once you need to **reuse logic**, **chain executions**, or **run tasks in parallel**, you're faced with a decision:

#### 🧩 `%run`
Injects and runs another notebook *inline* in the current session.

#### 📦 `notebookutils.notebook.run()` / `runMultiple()`
Executes notebooks as *isolated tasks*, optionally with return values.

#### ⚙️ Manual Multithreading
Uses Python’s `concurrent.futures` to run functions concurrently inside the same notebook.

#### 🛠️ Data Factory (DF) Pipeline
Orchestrates notebooks *from outside*, as part of a managed pipeline with retries, dependencies, and monitoring.

---

Each approach has its own superpowers—and limitations—when it comes to:

- 🔁 Execution context  
- 🛠 Parameter handling  
- 🧠 Return values  
- 👀 Output visibility  
- 🔐 Variable isolation  
- ⚡ Performance  
- 🔥 Error handling  
- 🔄 Parallel execution  

---

👇 The table below breaks it all down so you can pick the **right orchestration strategy**—whether you're building modular pipelines, triggering notebooks conditionally, or executing high-throughput tasks in parallel.



| Feature / Behavior                     | `%run`                        | `notebookutils.notebook.run/runMultiple()`       | Manual Multithreading (`concurrent.futures`)                     | Data Factory Activity                                 |
|----------------------------------------|-------------------------------|--------------------------------------------------|------------------------------------------------------------------|-------------------------------------------------------------|
| **Execution Context**                  | Inline in current session     | Separate notebook within same Spark session       | Threads within current notebook                                 | External orchestrator calling notebooks                    |
| **Parameterization**                   | ❌ Static only                | ✅ Dynamic supported                             | ✅ Fully dynamic                                                 | ✅ Pipeline parameters passed in                            |
| **Return Value**                       | ❌ None                       | ✅ via `exitValue`                               | ✅ via `.result()`                                               | ✅ via `exitValue` exposed in activity outputs              |
| **Output Visibility**                  | ✅ Inline                     | ✅ Visible in Monitoring snapshot                | ❌ Hidden unless logged                                          | ✅ Visible in Monitoring snapshot               |
| **Variable Sharing**                   | ✅ Full access                | ❌ Isolated                                      | ⚠️ Partial (globals/shared memory)                              | ❌ Fully isolated execution, even when using HC mode    |
| **Use Case**                           | Reusing setup/config code     | Modular notebook logic                            | Optimizing compute utilization: parallel I/O, API calls, Spark tasks  | UI based orchestration w/ robust out-of-the-box features (retries, conditional logic, etc.)   |
| **Execution Overhead**                 | 🟢 Low                        | 🟡 Medium                                        | 🟢 Ultra-Low                                                     | 🔴 Medium-High (launch, logging overhead, cross engine communication)   |
| **Error Handling**                     | ❌ Immediate stop             | ✅ Errors captured via `exitValue`               | ⚠️ Must handle manually in threads                              | ✅ Built-in retry, errors captured via `exitValue` |
| **Performance (Orchestration)**        | 🟢 Semi-fast                  | 🚀 Good for modular orchestration but doesn't scale well | ⚡ Fastest, particularly for I/O bound operations       | 🧱 Slower, built for reliability                            |
| **Parallel Execution Support**         | ❌ No                         | ✅ Yes via `runMultiple()`                       | ✅ Yes via threads                                              | ✅ Yes via parallel pipeline activities |


### **2.2.7 Enable Notebook schedule on Notebook settings page**

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Schedule%20Notebook.jpg?raw=true)

## **2.3 Using Resources to Execute Python Modules**
The notebook resource explorer provides a Unix-like file system to help you manage your folders and files. It offers a writable file system space where you can store small-sized files, such as code modules or any other code assets. You can easily access them with code in the notebook as if you were working with your local file system.

> Data files generally **should not** be stored in the Resource folder; use OneLake instead. Limit data files in resources to very small files needed to run unit tests.

### **2.3.1 Editing Resource Files**
Go to **Resources** in the left-side object explorer, expand **Built-in**, and select **View and Edit** to open the `observations.py` Python file for editing on the right side of your screen.

**Copy and Paste** the below code into the file editor and click the **save button**.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, expr, explode_outer
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, ArrayType

class Observations:
    def __init__(self, spark: SparkSession, raw_path: str, checkpoint_path: str, target_table: str):
        self.spark = spark
        self.raw_path = raw_path
        self.checkpoint_path = checkpoint_path
        self.target_table = target_table

        # Schema for the JSON files
        self.schema = StructType([
            StructField("id", StringType(), True),
            StructField("resourceType", StringType(), True),
            StructField("status", StringType(), True),
            StructField("subject", StructType([
                StructField("reference", StringType(), True)
            ]), True),
            StructField("encounter", StructType([
                StructField("reference", StringType(), True)
            ]), True),
            StructField("category", ArrayType(StructType([
                StructField("coding", ArrayType(StructType([
                    StructField("code", StringType(), True),
                    StructField("display", StringType(), True)
                ])), True)
            ])), True),
            StructField("code", StructType([
                StructField("coding", ArrayType(StructType([
                    StructField("code", StringType(), True),
                    StructField("display", StringType(), True)
                ])), True)
            ]), True),
            StructField("effectiveDateTime", StringType(), True),
            StructField("valueQuantity", StructType([
                StructField("value", DoubleType(), True),
                StructField("unit", StringType(), True)
            ]), True)
        ])

    def parseRaw(self, df):
        df = df.withColumn("category_array", explode_outer(col("category")))

        return df.select(
            col("id"),
            col("resourceType"),
            col("status"),
            col("subject.reference").alias("subject_reference"),
            split(col("subject.reference"), "/")[1].alias("patient_id"),
            col("encounter.reference").alias("encounter_reference"),
            col("category_array.coding")[0]["code"].alias("category_code"),
            col("category_array.coding")[0]["display"].alias("category_display"),
            col("code.coding")[0]["code"].alias("observation_code"),
            col("effectiveDateTime"),
            col("valueQuantity.value").alias("value_quantity"),
            col("valueQuantity.unit").alias("value_unit")
        )

    def streamToSilver(self):
        raw_df = self.spark.readStream.schema(self.schema).json(self.raw_path)

        parsed_df = self.parseRaw(raw_df)

        query = parsed_df.writeStream \
            .outputMode("append") \
            .format("delta") \
            .trigger(availableNow=True) \
            .option("checkpointLocation", self.checkpoint_path) \
            .toTable(self.target_table)

        query.awaitTermination()

    def updateGold(self):
        self.spark.sql("""
            INSERT INTO TABLE gold.dbo.patientobservations
            SELECT
                o.id AS observation_id,
                o.resourceType,
                o.status,
                o.patient_id,
                o.encounter_reference,
                
                -- Patient details from the patient_silver table
                p.gender,
                p.birthDate,
                p.deceasedDateTime,
                p.last_name,
                p.first_name,
            
                -- Observation specific details
                o.category_code,
                o.category_display,
                o.observation_code,
                o.effectiveDateTime,
                o.value_quantity,
                o.value_unit
            
            FROM silver.dbo.observations_streaming2 o
            JOIN silver.dbo.patient p
                ON o.patient_id = p.id
            -- Only insert observations that are not already in gold
            WHERE NOT EXISTS (SELECT 1 FROM gold.dbo.patientobservations g WHERE g.observation_id = o.id)
        """)

```

Python modules in the Notebook and Environment resources can be imported just like any other Python module (e.g., one installed from PyPI).
- **Notebook Resources**: use `from builtin.<optional_folder_path>.<module_name> import *`
- **Environment Resources**: use `from env.<optional_folder_path>.<module_name> import *`


📌 **Challenge:** Import the `observations.py` module aliased as `observations`. You can either manually type the import statement OR drag and drop the `observations.py` module onto the Notebook canvas.


<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

~~~python
import builtin.observations as observations
~~~
</details>

---

💡 **Tip:** Python modules, including those imported from resources, can be reloaded automatically by running the magic commands below. _You only need to run this once_: afterwards, whenever a module's classes or functions are referenced, changed source files are re-imported. This is extremely helpful during development and testing.

In [None]:
%load_ext autoreload
%autoreload 2
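
For comparison, the manual way to refresh a single module is `importlib.reload`. The sketch below is a standalone illustration that also runs outside Fabric: it writes a throwaway module named `demo_module` (a hypothetical name) to a temporary folder, imports it, changes the file, and reloads it to pick up the change.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # avoid stale cached bytecode in this quick demo

with tempfile.TemporaryDirectory() as module_dir:
    module_file = Path(module_dir) / "demo_module.py"
    module_file.write_text("GREETING = 'version one'\n")

    sys.path.insert(0, module_dir)  # make the temporary folder importable
    try:
        import demo_module
        first = demo_module.GREETING

        # Edit the source file, then reload the module to pick up the change
        module_file.write_text("GREETING = 'version two (edited)'\n")
        importlib.reload(demo_module)
        second = demo_module.GREETING
    finally:
        # Clean up so the demo leaves no trace in the session
        sys.path.remove(module_dir)
        sys.modules.pop("demo_module", None)

print(first)
print(second)
```

With `%autoreload 2` active you do not need to call `importlib.reload` yourself; the extension takes care of refreshing changed modules.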

Now run the code below to initialize an instance of the `Observations` class.

In [35]:
observations_elt = observations.Observations(
    spark=spark,
    raw_path="Files/observationsraw",
    checkpoint_path=f"abfss://{notebookutils.runtime.context['currentWorkspaceName']}@{spark.conf.get('fs.defaultFS').split('@')[1]}silver.Lakehouse/Files/checkpoints/observations/",
    target_table="silver.dbo.observations_streaming"
)

To perform **actions** with this instance, we can now call its methods (functions that belong to the class). Calling `streamToSilver()` runs our batch streaming job, which processes new data into silver.

In [36]:
observations_elt.streamToSilver()

We can do the same to incrementally update our gold _patientobservations_ table with the new data from silver.

📌 **Challenge:** Call the `updateGold()` method on the `observations_elt` instance to update the _patientobservations_ table.


<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

~~~python
observations_elt.updateGold()
~~~
</details>
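
An incremental update like `updateGold()` typically relies on an insert-only-what-is-new pattern: rows from silver are added to gold only if their ID is not already there, which makes the step safe to re-run. The sketch below illustrates that pattern with Python's built-in `sqlite3` module instead of Spark, so it can run anywhere. The two tiny tables and their columns are simplified stand-ins, not the lab's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE silver_observations (id TEXT PRIMARY KEY, value REAL);
    CREATE TABLE gold_observations (observation_id TEXT PRIMARY KEY, value REAL);
    INSERT INTO silver_observations VALUES ('obs-1', 98.6), ('obs-2', 120.0);
""")

INCREMENTAL_INSERT = """
    INSERT INTO gold_observations (observation_id, value)
    SELECT o.id, o.value
    FROM silver_observations o
    WHERE NOT EXISTS (
        SELECT 1 FROM gold_observations g WHERE g.observation_id = o.id
    )
"""

def update_gold(connection) -> int:
    """Insert silver rows missing from gold and return the resulting gold row count."""
    connection.execute(INCREMENTAL_INSERT)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM gold_observations").fetchone()[0]

first_run = update_gold(conn)    # both rows are new
second_run = update_gold(conn)   # nothing new, so no duplicates are inserted
conn.execute("INSERT INTO silver_observations VALUES ('obs-3', 72.0)")
third_run = update_gold(conn)    # only the newly arrived row is added
print(first_run, second_run, third_run)  # 2 2 3
```

The key detail is the equality comparison inside the subquery: a row is skipped exactly when a gold row with the same ID exists.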

## **2.4 Create an Apache Spark job definition in Fabric**
To create a Spark job definition for PySpark:
1. Download the sample Parquet file [yellow_tripdata_sampledata.parquet](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/yellow_tripdata_sampledata.parquet) and upload it to the files section of the lakehouse.
1. Create a new Spark job definition.
1. Select **PySpark (Python)** from the **Language** dropdown.
1. Download the [createTablefromParquet.py](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/createTablefromParquet.py) sample and upload it as the main definition file. The main definition file (*job.Main*) is the file that contains the application logic and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file.
1. Provide command line arguments for the job, if needed. Separate the arguments with spaces.
1. Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
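
Inside the main definition file, space-separated command line arguments should arrive in `sys.argv`, as in any Python script. The sketch below shows one way to parse them with `argparse`. It is not the content of `createTablefromParquet.py`, and the argument names `--input-path` and `--table-name` as well as their defaults are hypothetical.

```python
import argparse

def parse_job_args(argv=None) -> argparse.Namespace:
    """Parse the job's command line arguments (falls back to sys.argv when argv is None)."""
    parser = argparse.ArgumentParser(description="Example Spark job arguments")
    parser.add_argument(
        "--input-path",
        default="Files/yellow_tripdata_sampledata.parquet",
        help="Parquet file to read (hypothetical default)",
    )
    parser.add_argument(
        "--table-name",
        default="yellow_tripdata",
        help="Target table name (hypothetical default)",
    )
    return parser.parse_args(argv)

# Equivalent to entering "--input-path Files/sample.parquet --table-name trips"
# as the command line arguments of the Spark job definition
args = parse_job_args(["--input-path", "Files/sample.parquet", "--table-name", "trips"])
print(args.input_path, args.table_name)
```

Named arguments with defaults keep the job runnable without any arguments while still letting a schedule or pipeline override them.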

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/run%20sjd%20for%20debug.jpg?raw=true)

### **2.4.1 Enable Spark job definition schedule on settings page**

<br>

![](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/run%20sjd%20for%20schedule.jpg?raw=true)



## 🎉 Wrapping Up the Exercise: Orchestrating Spark

Congrats on completing this hands-on exercise, where we covered various options to orchestrate your Spark jobs. From pro-code to no-code, plenty of options exist to build robust data solutions.