# Kronodroid Pipeline Runner (Notebook)

Notebook equivalent of `tools/scripts/run_kronodroid_pipeline.py`.

Data flow:
```
Kaggle → dlt → Parquet/Avro → MinIO → Spark + dbt → Iceberg (LakeFS) → Feast → (optional) LakeFS commit
```

## 1) Environment setup

- Loads `.env` from the repo root (same behavior as the script).
- Makes the repo importable when the notebook is run from `notebooks/`.
- `LAKEFS_BRANCH` is not set in this cell; the parameters cell in section 2 sets it.

In [None]:
import os
import sys
from pathlib import Path

# Ensure repo root is importable when running from the notebooks directory.
REPO_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from tools.scripts import run_kronodroid_pipeline as krono

# Load env vars from `.env` (if present)
krono.load_env_file(REPO_ROOT / ".env")

print(f"Repo root: {REPO_ROOT}")
print(f"LAKEFS_REPOSITORY: {os.getenv('LAKEFS_REPOSITORY', 'kronodroid')}")
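
For readers without the script at hand, here is a minimal sketch of what a `.env` loader such as `krono.load_env_file` typically does. The name `load_env_file_sketch` and its parsing rules (skip blank lines and `#` comments, split on the first `=`, never override variables that are already set) are assumptions for illustration, not the script's actual implementation.

```python
import os
from pathlib import Path


def load_env_file_sketch(path, environ=None):
    """Hypothetical .env loader; returns the list of keys it set."""
    environ = os.environ if environ is None else environ
    path = Path(path)
    loaded = []
    if not path.exists():
        return loaded  # a missing .env is treated as "nothing to load"
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Assumption: values already present in the environment win.
        if key and key not in environ:
            environ[key] = value
            loaded.append(key)
    return loaded
```

Passing a plain `dict` as `environ` makes it possible to inspect what a `.env` file would set without touching the real process environment.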

## 2) Run parameters

Mirror the CLI flags from `tools/scripts/run_kronodroid_pipeline.py`.

In [None]:
# Core parameters
BRANCH = "main"
DBT_TARGET = "dev"  # "dev" (embedded Spark) or "thrift" (Spark server)
FILE_FORMAT = "parquet"  # "parquet" or "avro" (the dlt loader writes parquet even when avro is requested)

# Step toggles
SKIP_INGESTION = False
SKIP_DBT = False
SKIP_FEAST_APPLY = False
SKIP_MATERIALIZE = False
SKIP_COMMIT = False

# Materialize-only mode (equivalent to `--materialize-only`)
MATERIALIZE_ONLY = False
MATERIALIZE_DAYS = 30

os.environ["LAKEFS_BRANCH"] = BRANCH

print(
    {
        "BRANCH": BRANCH,
        "DBT_TARGET": DBT_TARGET,
        "FILE_FORMAT": FILE_FORMAT,
        "MATERIALIZE_ONLY": MATERIALIZE_ONLY,
        "MATERIALIZE_DAYS": MATERIALIZE_DAYS,
    }
)
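
Because these values are edited by hand, a typo only surfaces once a step fails. The sketch below checks them up front. The allowed values come from the comments in the parameters cell; the function name `validate_params` and the rule that `MATERIALIZE_DAYS` must be a positive integer are assumptions, not something the script is known to enforce.

```python
def validate_params(dbt_target, file_format, materialize_days):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if dbt_target not in {"dev", "thrift"}:
        problems.append(f"DBT_TARGET must be 'dev' or 'thrift', got {dbt_target!r}")
    if file_format not in {"parquet", "avro"}:
        problems.append(f"FILE_FORMAT must be 'parquet' or 'avro', got {file_format!r}")
    # bool is a subclass of int, so reject it explicitly.
    is_int = isinstance(materialize_days, int) and not isinstance(materialize_days, bool)
    if not is_int or materialize_days <= 0:
        problems.append(
            f"MATERIALIZE_DAYS must be a positive integer, got {materialize_days!r}"
        )
    return problems
```

In the notebook this could be called as `validate_params(DBT_TARGET, FILE_FORMAT, MATERIALIZE_DAYS)`, raising a `ValueError` if the returned list is non-empty.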

## 3) Optional: quick dependency check

The pipeline uses the `dbt` and `feast` CLIs. The cell below prints the resolved path of each executable, or `None` if it is not on `PATH`.

In [None]:
import shutil

print({"dbt": shutil.which("dbt"), "feast": shutil.which("feast")})
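
If a stricter check is preferred, the lookup can be turned into a list of missing tools. This is a small sketch built on `shutil.which`; the helper name `missing_clis` is hypothetical.

```python
import shutil


def missing_clis(names=("dbt", "feast")):
    """Return the names that shutil.which cannot resolve on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_clis()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
```

Replacing the `print` with `raise RuntimeError(...)` would stop the notebook before any pipeline step starts.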

## 4) Run the pipeline

Run the cells below step by step, or run the "Run full pipeline" cell to mirror the script's behavior. Pick one approach: doing both executes every enabled step twice.

### Step 1: dlt ingestion (Kaggle → MinIO)

In [None]:
if not MATERIALIZE_ONLY and not SKIP_INGESTION:
    ok = krono.run_dlt_ingestion(file_format=FILE_FORMAT)
    if not ok:
        raise RuntimeError("dlt ingestion failed")
else:
    print("Skipped")

### Step 2: dbt-spark transformations (MinIO → Iceberg on LakeFS)

In [None]:
if not MATERIALIZE_ONLY and not SKIP_DBT:
    ok = krono.run_dbt_spark_transformations(target=DBT_TARGET)
    if not ok:
        raise RuntimeError("dbt transformations failed")
else:
    print("Skipped")

### Step 3: Feast apply (register feature definitions)

In [None]:
if not MATERIALIZE_ONLY and not SKIP_FEAST_APPLY:
    ok = krono.run_feast_apply()
    if not ok:
        raise RuntimeError("feast apply failed")
else:
    print("Skipped")

### Step 4: Feast materialize (offline → online store)

In [None]:
if not SKIP_MATERIALIZE:
    ok = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS)
    if not ok:
        print("WARNING: feature materialization failed")
else:
    print("Skipped")
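
Feast materialization operates on a start and end timestamp, so `days_back` presumably selects a trailing window ending at the current time. The sketch below shows that arithmetic. Whether `run_feast_materialize` computes its window exactly this way (and in UTC) is an assumption, and `materialize_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def materialize_window(days_back, now=None):
    """Return (start, end) for a trailing window of `days_back` days."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    return start, end


start, end = materialize_window(30)
print(f"Would materialize from {start.isoformat()} to {end.isoformat()}")
```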

### Step 5: Commit to LakeFS (Iceberg tables → commit)

Uses `engines.spark_engine.dfp_spark.iceberg_catalog.commit_iceberg_changes`.

In [None]:
from datetime import datetime

if not MATERIALIZE_ONLY and not SKIP_COMMIT:
    krono.commit_to_lakefs(
        branch=BRANCH,
        message=f"Notebook run: Iceberg tables updated {datetime.now().isoformat()}",
    )
else:
    print("Skipped")
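
`datetime.now()` returns a naive local time, so commit messages written from machines in different time zones are not directly comparable. One possible alternative, sketched here with a hypothetical helper `build_commit_message`, is to stamp the message in UTC.

```python
from datetime import datetime, timezone


def build_commit_message(prefix="Notebook run: Iceberg tables updated", now=None):
    """Build a commit message with a timezone-aware UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix} {now.isoformat(timespec='seconds')}"
```

The result can be passed as the `message` argument of `krono.commit_to_lakefs` in place of the inline f-string.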

## Run full pipeline (script-like)

Convenience cell mirroring the script's `main()` flow.

In [None]:
# Imported here as well so this cell runs without the Step 5 cell.
from datetime import datetime

success = True

if MATERIALIZE_ONLY:
    success = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS)
else:
    if not SKIP_INGESTION:
        success = krono.run_dlt_ingestion(file_format=FILE_FORMAT)
        if not success:
            raise RuntimeError("Pipeline failed at dlt ingestion")

    if not SKIP_DBT:
        success = krono.run_dbt_spark_transformations(target=DBT_TARGET)
        if not success:
            raise RuntimeError("Pipeline failed at dbt transformations")

    if not SKIP_FEAST_APPLY:
        success = krono.run_feast_apply()
        if not success:
            raise RuntimeError("Pipeline failed at feast apply")

    if not SKIP_MATERIALIZE:
        ok = krono.run_feast_materialize(days_back=MATERIALIZE_DAYS)
        if not ok:
            print("WARNING: feature materialization failed")

    if not SKIP_COMMIT:
        krono.commit_to_lakefs(
            branch=BRANCH,
            message=f"Notebook run: Iceberg tables updated {datetime.now().isoformat()}",
        )

print({"success": bool(success)})
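
The repeated "run, check, raise" pattern above can also be written as a table of steps. The following is a sketch of that idea, not how the script's `main()` is implemented; `run_steps` and the `(name, enabled, func, fatal)` tuple layout are assumptions made for illustration.

```python
def run_steps(steps):
    """Run (name, enabled, func, fatal) steps in order.

    Returns a dict mapping step name to "ok", "failed", or "skipped".
    A failing step with fatal=True raises RuntimeError and stops the run.
    """
    results = {}
    for name, enabled, func, fatal in steps:
        if not enabled:
            results[name] = "skipped"
            continue
        ok = bool(func())
        results[name] = "ok" if ok else "failed"
        if not ok and fatal:
            raise RuntimeError(f"Pipeline failed at {name}")
    return results


# In the notebook, the first two steps might be wired up like this:
# run_steps([
#     ("dlt ingestion", not SKIP_INGESTION,
#      lambda: krono.run_dlt_ingestion(file_format=FILE_FORMAT), True),
#     ("dbt transformations", not SKIP_DBT,
#      lambda: krono.run_dbt_spark_transformations(target=DBT_TARGET), True),
# ])
```

Marking materialization as `fatal=False` reproduces the warning-only behavior above. The notebook ignores the return value of `commit_to_lakefs`, so that step would need a wrapper that returns `True` to fit this shape.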