# Kronodroid Pipeline Runner (Notebook)

Notebook equivalent of `tools/scripts/run_kronodroid_pipeline.py`.

Data flow:
```
Kaggle → dlt → Parquet/Avro → MinIO → Spark + dbt → Iceberg (LakeFS) → Feast → (optional) LakeFS commit
```

## 1) Environment setup

- Loads `.env` from repo root (same behavior as the script).
- Ensures the repo is importable from `notebooks/`.
- Sets `LAKEFS_BRANCH` from the parameters cell (done in section 2).
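
The `.env` loading mentioned in the first bullet can be pictured with a minimal sketch. The function below is a hypothetical stand-in, not the actual `krono.load_env_file`, whose implementation is not shown in this notebook. It assumes simple `KEY=VALUE` lines and, unless `override=True`, leaves variables that are already set in the environment untouched.

```python
import os
from pathlib import Path


def load_env_file_sketch(path: Path, override: bool = False) -> dict[str, str]:
    """Hypothetical stand-in for krono.load_env_file (assumed behavior only)."""
    loaded: dict[str, str] = {}
    if not path.exists():
        return loaded  # a missing .env is treated as "nothing to load"
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Returning the dictionary of values that were actually applied makes it easy to print what changed without echoing the whole environment.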

In [14]:
import os
import sys
from pathlib import Path

# Ensure repo root is importable when running from the notebooks directory.
REPO_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from tools.scripts import run_kronodroid_pipeline as krono

# Load env vars from `.env` (if present)
krono.load_env_file(REPO_ROOT / ".env")

print(f"Repo root: {REPO_ROOT}")
print(f"LAKEFS_REPOSITORY: {os.getenv('LAKEFS_REPOSITORY', 'kronodroid')}")

Repo root: /Users/benjaminbrown/Documents/GitHub/mlops
LAKEFS_REPOSITORY: kronodroid


## 2) Run parameters

Mirror the CLI flags from `tools/scripts/run_kronodroid_pipeline.py`.

**Important Note on Spark Connectivity:**

When using `TRANSFORM_RUNNER = "spark-operator"`:
- The transformations run in the Kind cluster using Spark Operator (no local Spark needed)
- However, **Feast with Spark offline store still requires Spark connectivity** to register features and materialize
- Setting `SKIP_SPARK_CHECK = True` bypasses the connectivity check, but Feast may still fail if Spark is unavailable

**Options:**
1. **Use Spark Thrift Server** (recommended): Deploy it even when using spark-operator
   ```bash
   kubectl apply -k infra/k8s/kind/addons/spark-thrift/
   kubectl -n dfp wait --for=condition=ready pod -l app=spark-thrift-server --timeout=120s
   ```
2. **Skip Feast steps**: Set `SKIP_FEAST_APPLY = True` and `SKIP_MATERIALIZE = True`
3. **Accept Feast failures**: Leave `SKIP_SPARK_CHECK = True` and catch any Feast errors
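
To choose between these options before running the Feast steps, a plain TCP probe can show whether a Thrift endpoint is reachable. This is only a sketch: the helper name `thrift_reachable` is hypothetical, and the defaults (`localhost`, port `10000`, the usual HiveServer2-compatible Thrift port) are assumptions that depend on how the server is exposed from the cluster. It tests that the port accepts connections, not that Spark can actually run queries.

```python
import socket


def thrift_reachable(host: str = "localhost", port: int = 10000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened (assumed endpoint)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

One possible use is to set `SKIP_FEAST_APPLY` and `SKIP_MATERIALIZE` to `not thrift_reachable()` in the parameters cell, so the Feast steps are skipped when the server is down rather than failing later.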

In [None]:
# Core parameters
BRANCH = "main"
TRANSFORM_RUNNER = "spark-operator"  # "dbt" or "spark-operator"
DBT_TARGET = "dev"  # "dev" (embedded Spark) or "thrift" (Spark server)
K8S_NAMESPACE = "dfp"  # used when TRANSFORM_RUNNER == "spark-operator"
SPARK_IMAGE = "apache/spark:3.5.7-python3"  # used when TRANSFORM_RUNNER == "spark-operator"
SPARK_TIMEOUT_SECONDS = 60 * 30  # used when TRANSFORM_RUNNER == "spark-operator"
FILE_FORMAT = "parquet"  # "parquet" or "avro" (dlt loader uses parquet for avro requests)

# Step toggles
SKIP_INGESTION = False
SKIP_DBT = False
SKIP_FEAST_APPLY = False
SKIP_MATERIALIZE = False
SKIP_COMMIT = False

# Spark check: True skips the connectivity check (the default when using spark-operator).
# Feast will still try to connect; if it fails, start Spark Thrift Server or skip the Feast steps.
SKIP_SPARK_CHECK = TRANSFORM_RUNNER == "spark-operator"

# Materialize-only mode (equivalent to `--materialize-only`)
MATERIALIZE_ONLY = False
MATERIALIZE_DAYS = 30

os.environ["LAKEFS_BRANCH"] = BRANCH

print(
    {
        "BRANCH": BRANCH,
        "TRANSFORM_RUNNER": TRANSFORM_RUNNER,
        "DBT_TARGET": DBT_TARGET,
        "FILE_FORMAT": FILE_FORMAT,
        "SKIP_SPARK_CHECK": SKIP_SPARK_CHECK,
        "MATERIALIZE_ONLY": MATERIALIZE_ONLY,
        "MATERIALIZE_DAYS": MATERIALIZE_DAYS,
    }
)

{'BRANCH': 'main', 'TRANSFORM_RUNNER': 'spark-operator', 'DBT_TARGET': 'dev', 'FILE_FORMAT': 'parquet', 'SKIP_SPARK_CHECK': True, 'MATERIALIZE_ONLY': False, 'MATERIALIZE_DAYS': 30}
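
Because the parameters are free-form strings, a typo such as `"spark_operator"` would silently fall through to the non-dbt branch in Step 2. A small validation pass can catch that early. The sketch below is not part of the script; the function name is hypothetical, and the allowed values are taken from the comments in the parameters cell.

```python
def validate_params(
    transform_runner: str,
    dbt_target: str,
    file_format: str,
    materialize_days: int,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: list[str] = []
    if transform_runner not in {"dbt", "spark-operator"}:
        errors.append(f"TRANSFORM_RUNNER must be 'dbt' or 'spark-operator', got {transform_runner!r}")
    if dbt_target not in {"dev", "thrift"}:
        errors.append(f"DBT_TARGET must be 'dev' or 'thrift', got {dbt_target!r}")
    if file_format not in {"parquet", "avro"}:
        errors.append(f"FILE_FORMAT must be 'parquet' or 'avro', got {file_format!r}")
    if materialize_days <= 0:
        errors.append(f"MATERIALIZE_DAYS must be positive, got {materialize_days}")
    return errors
```

Calling `validate_params(TRANSFORM_RUNNER, DBT_TARGET, FILE_FORMAT, MATERIALIZE_DAYS)` right after the parameters cell and raising if the list is non-empty would stop a misconfigured run before ingestion starts.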


## 3) Optional: quick dependency check

The pipeline uses the `dbt` and `feast` CLIs. The cell below also reports whether `kubectl` is on the `PATH`.

In [16]:
import shutil

print({"dbt": shutil.which("dbt"), "feast": shutil.which("feast"), "kubectl": shutil.which("kubectl")})

{'dbt': '/Users/benjaminbrown/Documents/GitHub/mlops/.venv/bin/dbt', 'feast': '/Users/benjaminbrown/Documents/GitHub/mlops/.venv/bin/feast', 'kubectl': '/usr/local/bin/kubectl'}
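
The check above only prints paths, so a missing tool shows up as `None` and is easy to overlook. A stricter variant could list what is missing for the chosen runner. In this sketch both helper names are hypothetical, and treating `kubectl` as required only for `"spark-operator"` is an assumption rather than something the script documents.

```python
import shutil


def required_tools(transform_runner: str) -> list[str]:
    """CLIs assumed to be needed for a given runner (assumption, not from the script)."""
    tools = ["dbt", "feast"]
    if transform_runner == "spark-operator":
        tools.append("kubectl")
    return tools


def missing_tools(names: list[str]) -> list[str]:
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

For example, `missing_tools(required_tools(TRANSFORM_RUNNER))` returns an empty list when everything is installed, and the notebook could raise a `RuntimeError` otherwise.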


## 4) Run the pipeline

Run the cells below step-by-step, or run the "Run full pipeline" cell to mirror the script's behavior.

### Step 1: dlt ingestion (Kaggle → MinIO)

In [4]:
if not MATERIALIZE_ONLY and not SKIP_INGESTION:
    ok = krono.run_dlt_ingestion(file_format=FILE_FORMAT)
    if not ok:
        raise RuntimeError("dlt ingestion failed")
else:
    print("Skipped")


Step 1: Running dlt ingestion from Kaggle → Parquet → MinIO
Created bucket: dlt-data
Dataset URL: https://www.kaggle.com/datasets/dhoogla/kronodroid-2021
Pipeline completed: Pipeline kronodroid_minio load step completed in 0.61 seconds
1 load package(s) were loaded to destination filesystem and into dataset kronodroid_raw
The filesystem destination used s3://dlt-data location to store data
Load package 1767916086.435871 is LOADED and contains no failed jobs
  - Format requested: parquet
  - Loader format used: parquet
  - Dataset: kronodroid_raw
dlt pipeline completed successfully
  - Dataset: kronodroid_raw
  - Format: parquet
  - Destination: MinIO


### Step 2: transformations (MinIO → Iceberg on LakeFS)

In [18]:
if not MATERIALIZE_ONLY and not SKIP_DBT:
    if TRANSFORM_RUNNER == "dbt":
        ok = krono.run_dbt_spark_transformations(target=DBT_TARGET)
    else:
        ok = krono.run_kubeflow_spark_operator_transformations(
            branch=BRANCH,
            namespace=K8S_NAMESPACE,
            spark_image=SPARK_IMAGE,
            timeout_seconds=SPARK_TIMEOUT_SECONDS,
        )
    if not ok:
        raise RuntimeError("transformations failed")
else:
    print("Skipped")


Step 2 (alt): Running Spark Operator transformations → Iceberg tables


KeyboardInterrupt: 

### Step 3: Feast apply (register feature definitions)

In [None]:
if not MATERIALIZE_ONLY and not SKIP_FEAST_APPLY:
    check_spark = not SKIP_SPARK_CHECK
    ok = krono.run_feast_apply(check_spark=check_spark)
    if not ok:
        raise RuntimeError("feast apply failed")
else:
    print("Skipped")

### Step 4: Feast materialize (offline → online store)

In [None]:
if not SKIP_MATERIALIZE:
    check_spark = not SKIP_SPARK_CHECK
    ok = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS, check_spark=check_spark)
    if not ok:
        print("WARNING: feature materialization failed")
else:
    print("Skipped")
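
The `days_back` argument presumably describes a look-back window ending at the current time. How `run_feast_materialize` actually derives its timestamps is not shown in this notebook, so the following is only a sketch of that arithmetic, with a hypothetical helper name and UTC chosen as an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def materialize_window(days_back: int, now: Optional[datetime] = None) -> tuple[datetime, datetime]:
    """Return (start, end) for a window of days_back days ending at now (UTC by default)."""
    if days_back <= 0:
        raise ValueError("days_back must be positive")
    end = now if now is not None else datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    return start, end
```

With `MATERIALIZE_DAYS = 30`, the window spans the 30 days before the moment the cell runs. Rows in the offline store with event timestamps outside that range would not be expected in the online store.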

### Step 5: Commit to LakeFS (Iceberg tables → commit)

Uses `engines.spark_engine.dfp_spark.iceberg_catalog.commit_iceberg_changes`.

In [None]:
from datetime import datetime

if not MATERIALIZE_ONLY and not SKIP_COMMIT:
    krono.commit_to_lakefs(
        branch=BRANCH,
        message=f"Notebook run: Iceberg tables updated {datetime.now().isoformat()}",
    )
else:
    print("Skipped")

## Run full pipeline (script-like)

Convenience cell mirroring the script's `main()` flow.

In [None]:
from datetime import datetime  # needed for the commit message when the Step 5 cell was not run

success = True
check_spark = not SKIP_SPARK_CHECK

if MATERIALIZE_ONLY:
    success = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS, check_spark=check_spark)
else:
    if not SKIP_INGESTION:
        success = krono.run_dlt_ingestion(file_format=FILE_FORMAT)
        if not success:
            raise RuntimeError("Pipeline failed at dlt ingestion")

    if not SKIP_DBT:
        if TRANSFORM_RUNNER == "dbt":
            success = krono.run_dbt_spark_transformations(target=DBT_TARGET)
        else:
            success = krono.run_kubeflow_spark_operator_transformations(
                branch=BRANCH,
                namespace=K8S_NAMESPACE,
                spark_image=SPARK_IMAGE,
                timeout_seconds=SPARK_TIMEOUT_SECONDS,
            )
        if not success:
            raise RuntimeError("Pipeline failed at transformations")

    if not SKIP_FEAST_APPLY:
        success = krono.run_feast_apply(check_spark=check_spark)
        if not success:
            raise RuntimeError("Pipeline failed at feast apply")

    if not SKIP_MATERIALIZE:
        ok = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS, check_spark=check_spark)
        if not ok:
            print("WARNING: feature materialization failed")

    if not SKIP_COMMIT:
        krono.commit_to_lakefs(
            branch=BRANCH,
            message=f"Notebook run: Iceberg tables updated {datetime.now().isoformat()}",
        )

print({"success": bool(success)})
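
The skip / run / raise-or-warn pattern repeats for every step, so it could be factored into one helper. The sketch below is not part of `run_kronodroid_pipeline.py`; `run_step` and its parameters are hypothetical names. It assumes each step function returns a truthy value on success, as the cells above do.

```python
from typing import Callable


def run_step(name: str, fn: Callable[[], bool], *, skip: bool = False, fatal: bool = True) -> bool:
    """Run one pipeline step; raise on failure if fatal, otherwise only warn."""
    if skip:
        print(f"{name}: skipped")
        return True
    ok = bool(fn())
    if not ok:
        if fatal:
            raise RuntimeError(f"Pipeline failed at {name}")
        print(f"WARNING: {name} failed")
    return ok
```

A step would then read, for instance, `run_step("dlt ingestion", lambda: krono.run_dlt_ingestion(file_format=FILE_FORMAT), skip=SKIP_INGESTION)`, with `fatal=False` for materialization to keep its warning-only behavior.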