# 🚀 Fail Fast or Quarantine? How to Implement Data Quality Patterns with SparkDQ

In modern data pipelines, ensuring high data quality is critical to prevent the propagation of incorrect or incomplete information. SparkDQ is a modular, Spark-native data quality framework designed specifically for PySpark.
In this notebook, we’ll demonstrate how to implement three common data quality integration patterns using SparkDQ:

* Fail Fast – stop the pipeline on any critical violation
* Quarantine & Continue – isolate invalid rows, let the rest proceed
* Hybrid Threshold – tolerate limited errors, abort if a threshold is exceeded


## 🚖 Download NYC Yellow Taxi Dataset
To follow this demo, we’ll use a real-world dataset from the NYC Taxi & Limousine Commission.
This dataset contains detailed records of yellow taxi rides in January 2025 — including timestamps, distances, and fares.

The following cell will create a data/source/ directory and download the Parquet file into it using curl.

In [6]:
!mkdir -p data/source
!curl -L -o data/source/yellow_tripdata_2025.parquet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 56.4M  100 56.4M    0     0  1544k      0  0:00:37  0:00:37 --:--:-- 1306k


## 📂 Load NYC Taxi Dataset
Let’s load the Yellow Taxi dataset from Parquet format.
This data includes ride timestamps, trip distances, fare amounts, and other metadata.

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("./data/source/")
df = df.select(
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "trip_distance",
    "fare_amount",
    "passenger_count"
)
df.show(5)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/25 18:03:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/25 18:03:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


+--------------------+---------------------+-------------+-----------+---------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|passenger_count|
+--------------------+---------------------+-------------+-----------+---------------+
| 2025-01-01 00:18:38|  2025-01-01 00:26:59|          1.6|       10.0|              1|
| 2025-01-01 00:32:40|  2025-01-01 00:35:13|          0.5|        5.1|              1|
| 2025-01-01 00:44:04|  2025-01-01 00:46:01|          0.6|        5.1|              1|
| 2025-01-01 00:14:27|  2025-01-01 00:20:01|         0.52|        7.2|              3|
| 2025-01-01 00:21:34|  2025-01-01 00:25:06|         0.66|        5.8|              3|
+--------------------+---------------------+-------------+-----------+---------------+
only showing top 5 rows



## Install SparkDQ from PyPI
To get started, we'll install the sparkdq package directly from PyPI.
This gives us access to all core validation features — including row- and aggregate-level checks, declarative configs, and result routing.

You can install it using the following command:

In [None]:
!python -m ensurepip
!python -m pip install --quiet sparkdq pyyaml

To verify that SparkDQ is installed, import it and check the version:

In [2]:
import sparkdq
sparkdq.__version__

'0.7.1'

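## 📜 Declarative Data Quality Configuration (YAML)
The checks are defined declaratively in a YAML file (dq_checks.yaml). The cell below loads that file with yaml.safe_load and registers the resulting definitions on a CheckSet.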

In [4]:
import yaml
from sparkdq.management import CheckSet

with open("dq_checks.yaml") as f:
    config = yaml.safe_load(f)

check_set = CheckSet()
check_set.add_checks_from_dicts(config)
print(check_set)

CheckSet:
  - pickup-null (NullCheck)
  - dropoff-null (NullCheck)
  - trip-positive (NumericMinCheck)
  - fare-positive (NumericMinCheck)
  - chronological (ColumnLessThanCheck)
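
The contents of dq_checks.yaml are not shown in this notebook. As a rough sketch, the five checks printed above could correspond to a list of dictionaries like the one below, which is the shape add_checks_from_dicts receives after yaml.safe_load. Only the check IDs come from the printed CheckSet. The check type names (null-check, numeric-min-check, column-less-than-check), the parameter keys (columns, min_value, column, limit, severity), and the column assigned to each check are assumptions inferred from the check names, so verify them against the SparkDQ documentation for your installed version.

```python
import json

# Hypothetical reconstruction of dq_checks.yaml after yaml.safe_load().
# Check IDs match the printed CheckSet; every other key and value is an assumption.
assumed_config = [
    {"check-id": "pickup-null", "check": "null-check",
     "columns": ["tpep_pickup_datetime"], "severity": "critical"},
    {"check-id": "dropoff-null", "check": "null-check",
     "columns": ["tpep_dropoff_datetime"], "severity": "critical"},
    {"check-id": "trip-positive", "check": "numeric-min-check",
     "columns": ["trip_distance"], "min_value": 0.0, "severity": "critical"},
    {"check-id": "fare-positive", "check": "numeric-min-check",
     "columns": ["fare_amount"], "min_value": 0.0, "severity": "critical"},
    {"check-id": "chronological", "check": "column-less-than-check",
     "column": "tpep_pickup_datetime", "limit": "tpep_dropoff_datetime",
     "severity": "critical"},
]

# List the IDs so they can be compared with the CheckSet printout.
check_ids = [entry["check-id"] for entry in assumed_config]
print(json.dumps(check_ids, indent=2))
```

Keeping the definitions in a file rather than in code means the rules can be reviewed and versioned separately from the pipeline logic.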


## ✅ Run Validation with SparkDQ
Now that the checks are defined and the data is loaded, we can initialize the SparkDQ engine and run the validation.

The engine will apply all configured checks to the input DataFrame and return a structured result, which can be used to filter passing and failing records, inspect errors, or calculate summary statistics.

In [8]:
from sparkdq.engine import BatchDQEngine
engine = BatchDQEngine(check_set)
validation_result = engine.run_batch(df)

## 📊 Inspect Validation Summary
Once the checks are applied, we can print the validation summary to get a high-level view of the data quality.

In this case, the reported pass rate is 94% (a pass_rate of 0.94), meaning that roughly 6% of all records (222,712 of 3,475,226) failed at least one critical check.

That may not sound like much — but with millions of records, it represents a significant volume of problematic data that could distort downstream metrics or lead your ML models to learn from invalid inputs.

By making this visible, SparkDQ gives you the confidence to trust what moves forward — and to understand what gets filtered out.

In [13]:
summary = validation_result.summary()
print(summary)



Validation Summary (2025-05-25 18:06:27)
Total records:   3,475,226
Passed records:  3,252,514
Failed records:  222,712
Pass rate:       94.00%
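
The 94.00% in the summary appears to be a rounded figure. As a quick cross-check, the rates can be recomputed from the printed counts with plain Python (no SparkDQ calls involved); the variable names are illustrative only.

```python
# Counts copied from the validation summary printed above.
total_records = 3_475_226
passed_records = 3_252_514
failed_records = 222_712

# Sanity check: every record is either passed or failed.
assert passed_records + failed_records == total_records

pass_rate = passed_records / total_records
fail_rate = failed_records / total_records

print(f"pass rate: {pass_rate:.4f}")
print(f"fail rate: {fail_rate:.4f}")
```

The unrounded pass rate is about 93.6%, which rounds to 0.94. The difference is harmless here, but it matters when a threshold sits close to the observed value.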


                                                                                

## 🚨 Pattern 1: Fail Fast

Fail-Fast Validation is a strict and uncompromising data quality strategy. As the name suggests, the moment a critical rule is violated, the pipeline halts — no data is written, no transformation is applied, and no downstream step is triggered.

This approach is ideal for environments where data correctness is non-negotiable — such as financial transactions, medical records, or regulatory reporting. In these cases, it's better to stop early than to risk propagating flawed or incomplete data through the system.

### ✅ Key Principles:

* Checks must be clearly defined and produce as few false positives as possible
* Violations are treated as hard failures
* Developers and stakeholders get immediate feedback
* Encourages trust and accountability across teams

### Implementation in SparkDQ

In [15]:
if not summary.all_passed:
    raise RuntimeError("Critical checks failed — stopping pipeline.")

RuntimeError: Critical checks failed — stopping pipeline.
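
A bare RuntimeError tells the on-call engineer little about the scale of the problem. The sketch below wraps the same gate in a small function that raises a dedicated exception with the failure counts. SummaryStandIn is a hypothetical stand-in so the example runs without Spark; all_passed is used in the cell above, while total_records and failed_records are assumed attribute names and may differ on the real summary object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryStandIn:
    """Hypothetical stand-in for the SparkDQ summary; only the fields used here."""
    total_records: int
    failed_records: int
    all_passed: bool


class DataQualityError(RuntimeError):
    """Raised when the fail-fast gate rejects a batch."""


def fail_fast(summary: SummaryStandIn) -> None:
    """Raise if any record failed a check; otherwise return silently."""
    if not summary.all_passed:
        raise DataQualityError(
            f"{summary.failed_records:,} of {summary.total_records:,} records "
            "failed critical checks; stopping pipeline."
        )


# Counts taken from the validation summary in this notebook.
demo = SummaryStandIn(total_records=3_475_226, failed_records=222_712, all_passed=False)
try:
    fail_fast(demo)
except DataQualityError as exc:
    message = str(exc)
    print(message)
```

A dedicated exception type also lets an orchestrator distinguish data quality failures from infrastructure errors when deciding whether to retry.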

## 🧯 Pattern 2: Quarantine Strategy

The Quarantine Strategy takes a flexible approach by allowing data pipelines to continue even when some records fail validation. Invalid rows are separated into a quarantine zone, while valid data proceeds as usual.

Each quarantined record is enriched with metadata — including error types, severity levels, and timestamps — enabling teams to analyze issues, track trends, and improve data quality over time.

This pattern is ideal for fast-paced environments like data lake ingestion or machine learning pipelines, where partial success is preferable to complete failure.

### ✅ Key Principles of the Quarantine Strategy

* Graceful Degradation: The pipeline continues to operate even when some data fails validation.
* Data Separation: Valid and invalid records are clearly split into two distinct outputs.
* Error Transparency: Failed records carry rich metadata (e.g., failed checks, timestamps, severity).
* Operational Resilience: Prevents small issues from blocking critical processes.
* Continuous Improvement: Enables teams to iteratively clean, monitor, and improve quarantined data.

### Implementation in SparkDQ

In [17]:
# Write valid records to the trusted zone
validation_result.pass_df().write.parquet("data/trusted-zone/")

# Write invalid records to the quarantine zone
validation_result.fail_df().write.parquet("data/quarantine-zone/")
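
Once records are quarantined, a common first question is which checks fail most often. The sketch below shows that aggregation in plain Python on a few made-up rows, so it runs without Spark. The failed_checks field is a hypothetical name for the per-row error metadata; inspect the schema of fail_df() to find the actual column names before adapting this.

```python
from collections import Counter

# Made-up quarantined rows for illustration only.
# "failed_checks" is a hypothetical field name, not a confirmed SparkDQ column.
quarantined_rows = [
    {"trip_distance": -1.2, "fare_amount": 8.0,
     "failed_checks": ["trip-positive"]},
    {"trip_distance": -0.4, "fare_amount": -3.5,
     "failed_checks": ["trip-positive", "fare-positive"]},
    {"trip_distance": 2.4, "fare_amount": 11.0,
     "failed_checks": ["chronological"]},
]

# Count how often each check ID appears across all quarantined rows.
failures_per_check = Counter(
    check_id for row in quarantined_rows for check_id in row["failed_checks"]
)

for check_id, count in failures_per_check.most_common():
    print(f"{check_id}: {count}")
```

Tracking these counts per run makes it easier to tell a one-off glitch from a systematic upstream problem.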

                                                                                

## ⚖️ Pattern 3: Hybrid Strategy – Quality Threshold
The Hybrid Strategy, also known as the Quality Threshold approach, combines the strengths of both the Fail-Fast and Quarantine patterns. Instead of aborting immediately or accepting all records, it allows a limited amount of bad data, but enforces a strict upper threshold for how much is tolerable.

This pattern is useful in scenarios where small data issues are acceptable, but major violations must still trigger a failure. It offers a balanced trade-off between reliability and flexibility.

### ✅ Key Principles of the Hybrid Strategy

* Controlled Tolerance: Defines an acceptable error ratio (e.g., 20%) that should not be exceeded.
* Automated Decision Logic: Validation results are programmatically evaluated post-check.
* Conditional Continuation: The pipeline proceeds only if the error ratio is within safe limits.
* Balanced Risk Management: Avoids full pipeline stops due to a few bad records, while still protecting against large-scale issues.
* Customizable Thresholds: Different pipelines or datasets can define their own risk appetite.

### Implementation in SparkDQ


In [21]:
if summary.pass_rate < 0.8:
    raise RuntimeError("Pass ratio is too low.")

validation_result.pass_df().write.parquet("data/trusted-zone/", mode="overwrite")
validation_result.fail_df().write.parquet("data/quarantine-zone/", mode="overwrite")
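
To make the threshold customizable per pipeline, the decision can be pulled into a small pure function. This is a minimal sketch, not part of SparkDQ: route_batch, MIN_PASS_RATE, and the pipeline names are hypothetical, and the function assumes pass_rate is a fraction between 0 and 1, as in the cell above.

```python
def route_batch(pass_rate: float, min_pass_rate: float = 0.8) -> str:
    """Return 'abort' if pass_rate is below the threshold, else 'quarantine'."""
    # Guard against passing a percentage (e.g. 94.0) instead of a fraction.
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError(f"pass_rate must be within [0, 1], got {pass_rate}")
    return "abort" if pass_rate < min_pass_rate else "quarantine"


# Hypothetical per-pipeline thresholds reflecting different risk appetites.
MIN_PASS_RATE = {"taxi_trips": 0.80, "billing": 0.99}

# Apply the pass rate observed in this notebook to each pipeline's threshold.
decisions = {
    name: route_batch(0.94, threshold) for name, threshold in MIN_PASS_RATE.items()
}
print(decisions)
```

With the 0.94 pass rate from this notebook, the lenient pipeline continues with quarantine while the strict one aborts. Keeping the decision free of Spark calls also makes it easy to unit test.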

                                                                                

## 🔁 Conclusion
Data quality is not a one-size-fits-all discipline — it requires thoughtful strategies tailored to the context, data maturity, and business criticality of each pipeline.

With SparkDQ, you can implement a wide range of validation patterns natively in PySpark:

* Use Fail-Fast when correctness is paramount and every violation is a blocker.
* Apply the Quarantine Strategy to build resilient systems that isolate and enrich invalid records without halting execution.
* Leverage the Hybrid Quality Threshold to balance flexibility and control — accepting some bad data, but never too much.

These patterns are not mutually exclusive. A single pipeline might use all three at different stages, depending on the purpose and reliability requirements of each component.

Ultimately, SparkDQ empowers data teams to move from ad-hoc data checks to structured, testable, and observable validation logic — enabling safer pipelines, faster debugging, and better collaboration between producers and consumers.