# Compile Kronodroid Autoencoder Pipeline

This notebook compiles the Kronodroid Autoencoder pipelines to YAML files
that can be submitted to Kubeflow Pipelines.

## Pipeline Variants

There are two pipeline variants:

1. **Training Pipeline** (`kronodroid_autoencoder_training_pipeline`): Training-only, assumes data is already in LakeFS
2. **Full Pipeline** (`kronodroid_autoencoder_full_pipeline`): Spark transform -> LakeFS commit -> Training

## Setup Project Path

**Run this cell first** to ensure the project modules can be imported.

In [None]:
import os
import sys
from pathlib import Path

# Locate the project root (directory containing pyproject.toml).
# Assume the notebook lives one level below it; fall back to the current directory.
notebook_dir = Path.cwd()
project_root = notebook_dir.parent

# Verify we found the right directory
if not (project_root / "pyproject.toml").exists():
    # Try current directory
    if (notebook_dir / "pyproject.toml").exists():
        project_root = notebook_dir
    else:
        raise RuntimeError(f"Cannot find project root. Current dir: {notebook_dir}")

# Add to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Change working directory to project root for consistent paths
os.chdir(project_root)

print(f"Project root: {project_root}")
print(f"Working directory: {os.getcwd()}")
print(f"Python path includes project: {str(project_root) in sys.path}")
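The cell above checks only the notebook's parent directory and the current directory. If the notebook sits deeper in the tree, a more general lookup can walk up every ancestor until it finds the marker file. The helper below is a minimal sketch of that idea; `find_project_root` is a hypothetical name and is not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration; not part of the project.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise RuntimeError(f"Cannot find {marker} at or above {start}")
```

In this notebook it could be called as `find_project_root(Path.cwd())` in place of the two explicit checks.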

## Check KFP Version

In [None]:
import kfp
print(f"KFP version: {kfp.__version__}")

# Verify kfp-kubernetes is available
try:
    from kfp import kubernetes
    print("kfp-kubernetes: available")
except ImportError:
    print("kfp-kubernetes: NOT INSTALLED - run 'uv add kfp-kubernetes'")

## Import Pipelines

In [None]:
from orchestration.kubeflow.dfp_kfp.pipelines.kronodroid_autoencoder_pipeline import (
    kronodroid_autoencoder_training_pipeline,
    kronodroid_autoencoder_full_pipeline,
    compile_kronodroid_autoencoder_pipeline,
)

print("Pipelines imported successfully!")
print(f"Training pipeline: {kronodroid_autoencoder_training_pipeline.name}")
print(f"Full pipeline: {kronodroid_autoencoder_full_pipeline.name}")

## Compile Training Pipeline

This pipeline assumes data is already transformed and available in LakeFS.

In [None]:
from kfp import compiler

# Output path for the training pipeline
training_output = Path("kronodroid_autoencoder_pipeline.yaml")

# Compile the training-only pipeline
compiler.Compiler().compile(
    pipeline_func=kronodroid_autoencoder_training_pipeline,
    package_path=str(training_output),
)

print(f"Training pipeline compiled to: {training_output.absolute()}")
print(f"File size: {training_output.stat().st_size:,} bytes")

## Compile Full Pipeline (Optional)

This pipeline includes Spark transformation, LakeFS commit, and training.

In [None]:
# Output path for the full pipeline
full_output = Path("kronodroid_autoencoder_full_pipeline.yaml")

# Compile the full pipeline
compiler.Compiler().compile(
    pipeline_func=kronodroid_autoencoder_full_pipeline,
    package_path=str(full_output),
)

print(f"Full pipeline compiled to: {full_output.absolute()}")
print(f"File size: {full_output.stat().st_size:,} bytes")

## Preview Compiled YAML

In [None]:
# Show the first 80 lines of the compiled training pipeline
with open(training_output, encoding="utf-8") as f:
    lines = f.readlines()[:80]

print(f"First 80 lines of {training_output.name}:")
print("=" * 60)
print("".join(lines))
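Beyond eyeballing the preview, a quick structural check can confirm that the compiled file has the sections a pipeline spec is expected to contain. The sketch below assumes the KFP v2 output format, where `pipelineInfo`, `root`, and `components` appear as top-level keys. It scans for unindented keys with the standard library only and is not a YAML parser. The function names are hypothetical.

```python
from pathlib import Path


def top_level_keys(yaml_path: Path) -> list[str]:
    """Collect unindented `key:` names from a YAML file without parsing it."""
    keys = []
    for line in yaml_path.read_text(encoding="utf-8").splitlines():
        # Skip blank lines, comments, document markers, and anything indented.
        if not line or line[0] in " \t#-":
            continue
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip())
    return keys


def missing_sections(
    yaml_path: Path,
    expected: tuple[str, ...] = ("pipelineInfo", "root", "components"),  # assumed KFP v2 layout
) -> list[str]:
    """Return the expected top-level sections that are absent from the file."""
    found = set(top_level_keys(yaml_path))
    return [name for name in expected if name not in found]
```

For example, `missing_sections(training_output)` returning an empty list suggests the compile step produced a complete spec under these assumptions.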

## Submit to Kubeflow Pipelines (Optional)

If you have a KFP server running, you can submit the pipeline directly.

First, start the KFP port-forward:
```bash
task kfp:port-forward
```

In [None]:
# Configuration
KFP_HOST = "http://localhost:8080"
SUBMIT_PIPELINE = False  # Set to True to submit

if SUBMIT_PIPELINE:
    client = kfp.Client(host=KFP_HOST)
    
    # Create a run, overriding a few defaults for a quick test
    run = client.create_run_from_pipeline_func(
        kronodroid_autoencoder_training_pipeline,
        arguments={
            "lakefs_ref": "main",
            "max_epochs": 5,
            "batch_size": 256,
        },
        experiment_name="kronodroid-autoencoder",
        run_name="autoencoder-training-test",
    )
    
    print(f"Submitted run: {run.run_id}")
    print(f"View at: {KFP_HOST}/#/runs/details/{run.run_id}")
else:
    print("Set SUBMIT_PIPELINE = True to submit the pipeline")
    print(f"Or submit via CLI: kfp --endpoint {KFP_HOST} run create -e kronodroid-autoencoder -f {training_output}")

## Pipeline Parameters Reference

### Training Pipeline Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `lakefs_endpoint` | `http://lakefs:8000` | LakeFS endpoint |
| `lakefs_repository` | `kronodroid` | LakeFS repository |
| `lakefs_ref` | `main` | LakeFS branch/commit to read from |
| `mlflow_tracking_uri` | `http://mlflow:5000` | MLflow tracking server |
| `mlflow_experiment_name` | `kronodroid-autoencoder` | MLflow experiment |
| `mlflow_model_name` | `kronodroid_autoencoder` | Registered model name |
| `latent_dim` | `16` | Autoencoder latent dimension |
| `hidden_dims_json` | `[128, 64]` | Hidden layer dimensions (JSON-encoded list) |
| `batch_size` | `512` | Training batch size |
| `max_epochs` | `10` | Maximum training epochs |
| `seed` | `1337` | Random seed |
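The parameters above can be overridden through the `arguments` dict passed at submission time. The sketch below builds such a dict. It assumes, based only on the parameter name, that `hidden_dims_json` expects a JSON-encoded string rather than a Python list. The override values are illustrative, not recommendations.

```python
import json

# Illustrative override values; parameter names come from the table above.
hidden_dims = [256, 128, 64]

arguments = {
    "lakefs_ref": "main",
    "latent_dim": 32,
    # Assumption: the pipeline decodes this string with a JSON parser.
    "hidden_dims_json": json.dumps(hidden_dims),
    "batch_size": 256,
    "max_epochs": 5,
    "seed": 1337,
}

# Round-trip check: the encoded string decodes back to the original list.
assert json.loads(arguments["hidden_dims_json"]) == hidden_dims
```

Any parameter left out of the dict keeps the default listed in the table.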

### Full Pipeline Additional Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `minio_endpoint` | `http://minio:9000` | MinIO endpoint for raw data |
| `minio_bucket` | `dlt-data` | MinIO bucket |
| `minio_prefix` | `kronodroid_raw` | Raw data prefix |
| `spark_image` | `dfp-spark:latest` | Spark job Docker image |
| `target_branch` | `main` | LakeFS branch to merge into |