sysopmatt/python-dab-demo
Python DABs Demo: Dynamic Job Generation from YAML Config

This project demonstrates how Python DABs (Databricks Asset Bundles with Python resource generation) can replace custom scripting workflows for managing Databricks jobs.

The Problem

A deployment pattern seen in the field is managing Databricks jobs by exporting them as JSON, then running a script that dynamically injects task blocks based on a config file. This works, but it means maintaining both the config and the script that translates it into job definitions. Environment-specific differences (cluster sizes, parameters) add more conditional logic to the script.

How This Solves It

Python DABs lets you write Python code that runs at deploy time to generate Databricks resources. Instead of a separate script that patches JSON exports, the bundle itself reads a YAML config and produces the job definition natively.

The workflow becomes:

  1. Drop a YAML file in config/ (the filename becomes the job name)
  2. Run databricks bundle deploy -t <target>
  3. The Python code in resources/jobs.py discovers all YAML files, builds a job for each one with environment-appropriate settings, and deploys them

No intermediate scripts, no JSON patching, no manual environment switching. Adding a new job is just adding a new YAML file.

Project Structure

python-dab-demo/
├── databricks.yml                          # Bundle config: enables Python resources, defines targets
├── pyproject.toml                          # Python dependencies
├── config/
│   ├── python_dab_demo_pipeline.yaml       # One job per YAML file (filename = job name)
│   └── daily_reporting_pipeline.yaml       # Add more YAML files to create more jobs
├── resources/
│   └── jobs.py                             # Discovers config/*.yaml and generates jobs at deploy time
└── src/
    └── sample_task.py                      # Parameterized notebook (placeholder for real task notebooks)

How It Works

databricks.yml

The python block is what enables Python resource generation:

python:
  venv_path: .venv
  resources:
    - resources.jobs:load_resources

This tells the bundle to call load_resources() from resources/jobs.py during validation and deployment. The function returns resource definitions (jobs, in this case) that get merged into the bundle just like YAML-defined resources would.

Three targets are defined: dev, stage, and prod. The target name is passed into the Python code so it can adjust cluster sizing and task parameters per environment.
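The targets stanza in databricks.yml might look like the following sketch. Only the target names (dev, stage, prod) come from this README; the mode settings and the choice of default target are assumptions:

```yaml
targets:
  dev:
    mode: development
    default: true
  stage:
    mode: production
  prod:
    mode: production
```

Whichever target is passed to `databricks bundle deploy -t <target>` is what the Python code sees and uses to pick cluster sizes and parameters.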

config/*.yaml

Each YAML file in config/ defines a separate job. The filename (minus .yaml) becomes the job name. A file just needs a list of tasks:

tasks:
  - name: ingest_raw_data
    notebook: src/sample_task.py
    description: Ingest raw data from source systems into bronze layer

This is the file you'd hand to someone and say "add your tasks here." No Databricks API knowledge needed. Want another job? Create another YAML file.

resources/jobs.py

This is where the generation happens. At deploy time:

  1. load_resources() globs all *.yaml files in config/
  2. For each file, build_job() creates a condition task (check_is_monday) that gates the pipeline on whether the trigger day is Monday, using {{job.trigger.time.iso_weekday}}
  3. It loops over the YAML tasks and builds notebook task dicts, each depending on the condition task's true outcome
  4. Cluster sizing scales per target: dev = 1 worker, stage = 2, prod = 5
  5. Each task receives task_name and environment as notebook parameters
  6. Job.from_dict() constructs the job (accepts the same structure as YAML job definitions, making it easy to translate between formats)

src/sample_task.py

A Databricks notebook that receives task_name and environment via widget parameters. In a real pipeline, each task entry in the YAML would point to its own notebook. Here they all share one notebook that branches on the task name for demonstration purposes.
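The branching might look like the sketch below. In the notebook itself the two values arrive via widgets (`dbutils.widgets.get("task_name")` and `dbutils.widgets.get("environment")`); the dispatch is shown here as a plain function so it runs outside Databricks, and the messages are illustrative assumptions:

```python
def run_task(task_name: str, environment: str) -> str:
    """Branch on the task name, as the shared demo notebook does.

    In src/sample_task.py the arguments come from widget parameters:
        task_name = dbutils.widgets.get("task_name")
        environment = dbutils.widgets.get("environment")
    """
    if task_name == "ingest_raw_data":
        # One branch per known task entry in the YAML config.
        return f"[{environment}] ingesting raw data into the bronze layer"
    # Fallback for any other task entry.
    return f"[{environment}] running task {task_name}"
```

In a real pipeline you would drop the branching entirely and point each YAML task entry at its own notebook.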

Setup

Prerequisites: Databricks CLI and uv installed, with a CLI profile configured.

# Clone and set up
git clone <repo-url>
cd python-dab-demo
uv venv && uv pip install -e .

# Validate (checks that Python resource generation works)
databricks bundle validate -t dev

# Deploy
databricks bundle deploy -t dev

If you use a non-default CLI profile, either configure it for the bundle or pass it as an environment variable:

DATABRICKS_CONFIG_PROFILE=myprofile databricks bundle deploy -t dev

Customizing

Add a new job: Create a new YAML file in config/ with a tasks list. The filename becomes the job name. Redeploy.

Add a task to an existing job: Add an entry to that job's YAML file and redeploy.

Change cluster sizing: Edit the worker_counts dict in get_cluster_config() inside resources/jobs.py.
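A minimal sketch of get_cluster_config(), assuming the following shape: the function name, the worker_counts dict, the per-target counts (dev = 1, stage = 2, prod = 5), and the Azure node type come from this README, while the spark_version and the fallback of 1 worker are assumptions:

```python
def get_cluster_config(target: str) -> dict:
    # Per-target worker counts; edit this dict to resize all generated jobs.
    worker_counts = {"dev": 1, "stage": 2, "prod": 5}
    return {
        "spark_version": "15.4.x-scala2.12",  # placeholder, not from the repo
        "node_type_id": "Standard_D4s_v3",    # Azure default noted below
        "num_workers": worker_counts.get(target, 1),  # assume 1 for unknown targets
    }
```

Because every generated job pulls its cluster from this one function, a single edit here changes sizing across all jobs in the next deploy.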

Change the condition logic: The check_is_monday condition task uses {{job.trigger.time.iso_weekday}} (1 = Monday, 7 = Sunday). Swap the operator or reference value to gate on a different day, or replace it with {{job.trigger.time.is_weekday}} to run on any weekday.

Different notebooks per task: Update the notebook field in each YAML task entry to point to different notebook paths.

Cloud provider: The default node_type_id is Standard_D4s_v3 (Azure). For AWS, use something like i3.xlarge. For GCP, use n1-standard-4.
