## 1. The Problem

### Initial State (December 2025)

The ETL pipeline had grown organically, leaving each step with a different command-line interface:

#### upload_pdfs.py
```bash
python scripts/upload_pdfs.py --input config/docs.yaml
```
- ✅ Simple and clear
- ❌ No citekey filtering support

#### run_ocr.py
```bash
python scripts/run_ocr.py /path/to/pdfs/
python scripts/run_ocr.py --citekeys dag_v01 dag_v02
```
- ✅ Flexible (PDFs or citekeys)
- ❌ No source job tracking
- ❌ Ambiguous when both provided

#### sync_ocr.py
```bash
python scripts/sync_ocr.py --ocr-job-id 2026-01-02_00-33-11
python scripts/sync_ocr.py --ocr-job-id latest
```
- ✅ Clear source reference
- ❌ "latest" breaks reproducibility
- ❌ No citekey filtering

#### parse_structure.py
```bash
python scripts/parse_structure.py --ocr-job-id 2026-01-02_00-33-11
python scripts/parse_structure.py --sync-job-id 2026-01-02_11-15-30
python scripts/parse_structure.py --citekeys dag_v01
```
- ❌ Multiple job ID options confusing
- ❌ Unclear which to use when
- ❌ No clear lineage chain

### Pain Points

#### 1. Reproducibility Issues
```bash
# Run 1 (January 2)
python scripts/sync_ocr.py --ocr-job-id latest
# → Uses job 2026-01-02_00-33-11

# Run 2 (January 3)
python scripts/sync_ocr.py --ocr-job-id latest
# → Now uses job 2026-01-03_08-15-00
# 🚨 Same command, different results!
```

#### 2. Lost Lineage
```bash
# Which upload job did this OCR come from?
python scripts/run_ocr.py --citekeys dagz_v01
# No way to track back to upload_pdfs job!
```

#### 3. No Standard Way to Resume
```bash
# Job 2026-01-02_11-32-58 had 3 failures
# How do I resume just the failures?
# → No built-in support, must manually identify failed citekeys
```

#### 4. Ambiguous Parameters
```bash
# parse_structure.py accepts both
python scripts/parse_structure.py \
    --ocr-job-id 2026-01-02_00-33-11 \
    --sync-job-id 2026-01-02_11-15-30
# Which one takes precedence? Not obvious!
```

## 2. Design Iterations

### Iteration 1: "Just Add Source Tracking"

Initial thought: Just add `--upload-job-id` to run_ocr, `--ocr-job-id` to sync_ocr, etc.

```bash
python scripts/run_ocr.py \
    --upload-job-id 2026-01-02_00-28-28 \
    --citekeys dagz_v01
```

**Problems:**
- Still inconsistent (some steps use paths, some use citekeys)
- Doesn't solve resume problem
- Different parameter names per step (upload-job-id, ocr-job-id, sync-job-id)

**Verdict:** ❌ Not enough

### Iteration 2: "Unified Input Pattern"

Insight: Every step needs the same three ways to specify input:
1. From a file (`--input`)
2. Explicit list (`--citekeys`)
3. Resume failures (`--resume-from`)

```bash
# Standard pattern
python scripts/{step}.py \
    [--input FILE | --citekeys CK1 CK2 ... | --resume-from JOB_ID] \
    --source-job-id JOB_ID
```

**Improvements:**
- ✅ Consistent across all steps
- ✅ Built-in resume support
- ✅ Clear lineage with unified `--source-job-id`

**Remaining issues:**
- What about "latest"?
- How to handle reprocessing?

**Verdict:** ✅ Getting closer

### Iteration 3: "Ban 'latest' and Add Force Rerun"

Key decisions:

#### Decision 1: No More "latest"
```python
def validate_job_id(job_id: str, arg_name: str):
    if job_id == "latest":
        raise ValueError(
            f"{arg_name} cannot be 'latest' (breaks reproducibility).\n"
            f"Use explicit timestamp like: 2026-01-02_00-33-11"
        )
```
- `latest` symlinks are still useful for reading results by hand
- They are never accepted as pipeline inputs

#### Decision 2: Add Force Rerun Flag
```bash
python scripts/parse_structure.py \
    --source-job-id 2026-01-02_11-15-30 \
    --citekeys dagz_v01 \
    --force-rerun
```
- Always creates new job
- Processes all citekeys (ignores existing results)
- Use when parameters change

**Verdict:** ✅ Solid foundation

### Iteration 4: "Add Power Features"

With the core standardization in place, add convenience features:

#### Wildcard Support (Limited)
```bash
# Process all volumes
python scripts/parse_structure.py \
    --source-job-id 2026-01-02_11-15-30 \
    --citekeys "dagz_v*"
```
- Only support `_v*` (volumes) and `_y*` (years)
- Prevents accidental broad matches

#### Dry Run
```bash
# Preview before executing
python scripts/parse_structure.py \
    --source-job-id 2026-01-02_11-15-30 \
    --citekeys "dagz_v*" \
    --dry-run
```
- Shows what would be processed
- Displays skip reasons
- No execution

**Verdict:** ✅ Complete solution

## 3. Final Solution

### Standardized Interface

All pipeline steps (except upload_pdfs) now follow this pattern:

```bash
python scripts/{step}.py \
    [--input FILE | --citekeys CK1 CK2 ... | --resume-from JOB_ID] \
    --source-job-id EXPLICIT_TIMESTAMP \
    [--force-rerun] \
    [--dry-run]
```

### Key Principles

1. **Explicit over implicit:** No "latest", always use timestamps
2. **Single source of truth:** Each step references only immediate prior step
3. **Mutually exclusive inputs:** Can't mix `--input`, `--citekeys`, `--resume-from`
4. **Required lineage:** `--source-job-id` always required (except upload_pdfs)
5. **Safe defaults:** Without `--force-rerun`, skip existing results
6. **Limited wildcards:** Only `_v*` and `_y*` to prevent errors
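
Principles 2 and 4 together mean lineage can be recovered by following `source_job_id` backwards, one step at a time. The following is a minimal sketch, not part of etl_metadata.py: `trace_lineage` and `STEP_ORDER` are hypothetical names, the step order is assumed to be upload_pdfs, run_ocr, sync_ocr, parse_structure, and each step is assumed to write `<root>/<step>/job_metadata/<job_id>.json` containing a `source_job_id` field (the layout that `get_failed_citekeys` reads).

```python
import json
from pathlib import Path
from typing import List, Tuple

# Assumed step order: each step's source job belongs to the step before it.
STEP_ORDER = ["upload_pdfs", "run_ocr", "sync_ocr", "parse_structure"]


def trace_lineage(root: Path, step_name: str, job_id: str) -> List[Tuple[str, str]]:
    """Return [(step, job_id), ...] from the given job back to the upload job."""
    chain = []
    index = STEP_ORDER.index(step_name)
    while True:
        chain.append((STEP_ORDER[index], job_id))
        if index == 0:
            return chain  # upload_pdfs is the origin and has no source job
        metadata_file = root / STEP_ORDER[index] / "job_metadata" / f"{job_id}.json"
        with metadata_file.open("r") as f:
            job_id = json.load(f)["source_job_id"]
        index -= 1
```

Because every step references only its immediate predecessor, the walk needs one metadata read per step and no searching.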

### Run Mode Semantics

| Mode | Creates New Job? | Processes |
|------|-----------------|----------|
| Regular run | If needed | Missing results only |
| Resume | No (reuses) | Failed items only |
| Force rerun | Always | All items |
| Dry run | Never | Nothing (preview only) |
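
The table can be read as a small decision function over the parsed flags. The sketch below is illustrative only: `RunMode`, `resolve_run_mode`, and `creates_new_job` are hypothetical names, and two behaviors are assumptions the table does not state: `--dry-run` takes precedence over other flags, and `--resume-from` combined with `--force-rerun` is rejected.

```python
from enum import Enum
from typing import Optional


class RunMode(Enum):
    REGULAR = "regular"
    RESUME = "resume"
    FORCE_RERUN = "force_rerun"
    DRY_RUN = "dry_run"


def resolve_run_mode(
    resume_from: Optional[str], force_rerun: bool, dry_run: bool
) -> RunMode:
    """Map parsed CLI flags onto one row of the run-mode table."""
    if dry_run:
        # Assumption: --dry-run wins over every other flag, since it must never execute.
        return RunMode.DRY_RUN
    if resume_from and force_rerun:
        # Assumption: the table does not define this combination, so reject it.
        raise ValueError("--resume-from and --force-rerun cannot be combined")
    if force_rerun:
        return RunMode.FORCE_RERUN
    if resume_from:
        return RunMode.RESUME
    return RunMode.REGULAR


def creates_new_job(mode: RunMode, items_to_process: int) -> bool:
    """Answer the 'Creates New Job?' column for a given mode."""
    if mode is RunMode.FORCE_RERUN:
        return True  # always
    if mode is RunMode.REGULAR:
        return items_to_process > 0  # "if needed": only when there is work to do
    return False  # resume reuses the old job; dry run never creates one
```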

## 4. Implementation

### Core Utilities (etl_metadata.py)

#### Validation
```python
import re


def validate_job_id(job_id: str, arg_name: str) -> None:
    """Reject 'latest', validate timestamp format."""
    if job_id == "latest":
        raise ValueError(
            f"{arg_name} cannot be 'latest' (breaks reproducibility).\n"
            f"Use explicit timestamp like: 2026-01-02_00-33-11"
        )
    # Validate format: YYYY-MM-DD_HH-MM-SS.
    # fullmatch (unlike match with "$") also rejects a trailing newline.
    pattern = r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}"
    if not re.fullmatch(pattern, job_id):
        raise ValueError(
            f"{arg_name} must be in format YYYY-MM-DD_HH-MM-SS, got: {job_id}"
        )
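
One way to apply this check as early as possible is to run it while argparse parses the command line. The sketch below is an assumption about wiring, not the project's actual code: `job_id_type` is a hypothetical helper that repeats the same two checks and raises `argparse.ArgumentTypeError`, so argparse prints the custom message instead of its generic "invalid value" text.

```python
import argparse
import re

# Same shape check as validate_job_id: YYYY-MM-DD_HH-MM-SS
JOB_ID_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")


def job_id_type(value: str) -> str:
    """argparse `type=` callable that accepts only explicit timestamps."""
    if value == "latest" or not JOB_ID_PATTERN.fullmatch(value):
        raise argparse.ArgumentTypeError(
            f"expected an explicit YYYY-MM-DD_HH-MM-SS job ID, got: {value!r}"
        )
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--source-job-id", required=True, type=job_id_type)

# A well-formed ID parses normally; "latest" would exit with a usage error.
args = parser.parse_args(["--source-job-id", "2026-01-02_00-33-11"])
print(args.source_job_id)
```

The pattern checks the shape of the ID only. It does not verify that the digits form a real date or that the job exists.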

#### Resume Support
```python
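import json
from typing import List

# ANALYTICS_ROOT (a pathlib.Path) is assumed to be defined at module level.


def get_failed_citekeys(
    step_name: str,
    resume_job_id: str,
    source_job_id: str
) -> List[str]:
    """Extract failed citekeys from a previous job of the same step."""
    metadata_file = ANALYTICS_ROOT / step_name / "job_metadata" / f"{resume_job_id}.json"

    with metadata_file.open("r") as f:
        metadata = json.load(f)

    # Verify the job being resumed read from the same source job
    recorded_source = metadata.get("source_job_id")
    if recorded_source != source_job_id:
        raise ValueError(
            f"Source job ID mismatch: job {resume_job_id} used "
            f"{recorded_source}, but --source-job-id is {source_job_id}"
        )

    # Extract failed items
    return [
        ck for ck, info in metadata["citekeys"].items()
        if info.get("status") == "failed"
    ]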
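
For reference, here is the metadata shape this function expects, inferred only from the fields it reads (`source_job_id`, `citekeys`, and a per-citekey `status`). The `"success"` status value and the `error` field are illustrative assumptions. The real files may hold more.

```python
import json

# Hypothetical contents of parse_structure/job_metadata/2026-01-02_11-32-58.json
metadata = json.loads("""
{
  "source_job_id": "2026-01-02_11-15-30",
  "citekeys": {
    "dagz_v01": {"status": "success"},
    "dagz_v02": {"status": "failed", "error": "timeout"},
    "dagz_v03": {"status": "failed", "error": "timeout"}
  }
}
""")

# The same filter get_failed_citekeys applies after the source check passes
failed = sorted(
    ck for ck, info in metadata["citekeys"].items()
    if info.get("status") == "failed"
)
print(failed)  # ['dagz_v02', 'dagz_v03']
```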

#### Force Rerun Logic
```python
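import json
from pathlib import Path
from typing import Optional, Tuple


def should_process_citekey(
    step_name: str,
    citekey: str,
    source_job_id: str,
    force_rerun: bool = False,
    output_dir: Optional[Path] = None
) -> Tuple[bool, str]:
    """Determine if citekey should be processed."""

    if force_rerun:
        return True, "Force rerun requested"

    if output_dir is None:
        # Assumed default (the original would fail on None here):
        # the step's own directory under ANALYTICS_ROOT
        output_dir = ANALYTICS_ROOT / step_name

    # Check if result exists for this citekey + source_job_id.
    # Reading through the "latest" symlink is fine; only inputs must be explicit.
    result_file = output_dir / citekey / "latest" / f"{citekey}.json"
    if result_file.exists():
        with result_file.open("r") as f:
            result = json.load(f)
        if result.get("source_job_id") == source_job_id:
            return False, "Result exists"

    return True, "No matching result"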
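
The dry run preview in this section calls `filter_citekeys_to_process`, which is not shown in this journal. A batch wrapper over the per-citekey check could look like the sketch below. It is a guess at the shape, not the actual implementation: `filter_citekeys` is a hypothetical name, and the decision function is passed in as a parameter so the example runs without any files on disk.

```python
from typing import Callable, Dict, List, Tuple

# A decision function takes a citekey and returns (should_process, reason)
Decision = Callable[[str], Tuple[bool, str]]


def filter_citekeys(
    citekeys: List[str], decide: Decision
) -> Tuple[List[str], Dict[str, str]]:
    """Split citekeys into those to process and those skipped, with reasons."""
    to_process: List[str] = []
    skip_reasons: Dict[str, str] = {}
    for citekey in citekeys:
        process, reason = decide(citekey)
        if process:
            to_process.append(citekey)
        else:
            skip_reasons[citekey] = reason
    return to_process, skip_reasons


# Stand-in decision: pretend dagz_v01 already has a matching result
already_done = {"dagz_v01"}


def fake_decide(citekey: str) -> Tuple[bool, str]:
    if citekey in already_done:
        return False, "Result exists"
    return True, "No matching result"


to_process, skip_reasons = filter_citekeys(["dagz_v01", "dagz_v02"], fake_decide)
print(to_process)    # ['dagz_v02']
print(skip_reasons)  # {'dagz_v01': 'Result exists'}
```

In the pipeline, the decision function would wrap `should_process_citekey` with the step name, source job ID, and force flag already bound.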

#### Wildcard Expansion
```python
import fnmatch
from typing import List


def expand_citekey_patterns(
    patterns: List[str],
    available_citekeys: List[str]
) -> List[str]:
    """Expand _v* and _y* patterns only."""

    expanded = set()

    for pattern in patterns:
        if "*" in pattern:
            # Validate: exactly one wildcard, as a trailing _v* or _y*.
            # A plain substring test would let broad patterns like "*_v*" through.
            if pattern.count("*") != 1 or not pattern.endswith(("_v*", "_y*")):
                raise ValueError(
                    f"Only _v* and _y* wildcards supported, got: {pattern}"
                )
            matches = fnmatch.filter(available_citekeys, pattern)
            expanded.update(matches)
        else:
            expanded.add(pattern)

    return sorted(expanded)

#### Dry Run Preview
```python
def preview_pipeline_run(
    step_name: str,
    citekeys: List[str],
    source_job_id: str,
    force_rerun: bool = False
) -> Dict:
    """Generate dry run preview."""
    
    # Filter what would be processed
    to_process, skip_reasons = filter_citekeys_to_process(
        step_name, citekeys, source_job_id, force_rerun
    )
    
    return {
        "step_name": step_name,
        "source_job_id": source_job_id,
        "total_citekeys": len(citekeys),
        "to_process": to_process,
        "to_skip": skip_reasons,
        "force_rerun": force_rerun,
        "will_create_new_job": len(to_process) > 0
    }

def print_dry_run_summary(preview: Dict):
    """Pretty print preview."""
    print("\n" + "="*70)
    print(f"🔍 DRY RUN PREVIEW: {preview['step_name']}")
    print("="*70)
    # ... detailed output ...
```

### Per-Step Changes

#### upload_pdfs.py
```python
# Before
parser.add_argument("--input", required=True)

# After
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument("--input")
input_group.add_argument("--citekeys", nargs="+")
# Note: No --source-job-id (it's the source!)
```

#### run_ocr.py
```python
# Before
parser.add_argument("pdf_dir", nargs="?")
parser.add_argument("--citekeys", nargs="*")

# After
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument("--input")
input_group.add_argument("--citekeys", nargs="+")
input_group.add_argument("--resume-from")
parser.add_argument("--source-job-id", required=True)  # Upload job
parser.add_argument("--force-rerun", action="store_true")
parser.add_argument("--dry-run", action="store_true")
```

#### sync_ocr.py
```python
# Before
parser.add_argument("--ocr-job-id", required=True)
# Supported --ocr-job-id latest

# After
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument("--input")
input_group.add_argument("--citekeys", nargs="+")
input_group.add_argument("--resume-from")
parser.add_argument("--source-job-id", required=True)  # OCR job
parser.add_argument("--force-rerun", action="store_true")
parser.add_argument("--dry-run", action="store_true")

# Removed resolve_job_id() function entirely
```

#### parse_structure.py
```python
# Before
parser.add_argument("--ocr-job-id")
parser.add_argument("--sync-job-id")
parser.add_argument("--citekeys", nargs="*")

# After
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument("--input")
input_group.add_argument("--citekeys", nargs="+")
input_group.add_argument("--resume-from")
parser.add_argument("--source-job-id", required=True)  # Sync job only!
parser.add_argument("--force-rerun", action="store_true")
parser.add_argument("--dry-run", action="store_true")
```
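
Since the "After" blocks for the three downstream steps are identical, the argument setup could live in one shared helper. This is a sketch of that idea rather than code from the project: `build_step_parser` and its `has_source` flag are hypothetical, with `has_source=False` covering upload_pdfs, which takes only `--input` or `--citekeys`.

```python
import argparse


def build_step_parser(description: str, has_source: bool = True) -> argparse.ArgumentParser:
    """Build the standard CLI for one pipeline step."""
    parser = argparse.ArgumentParser(description=description)

    # Exactly one way of choosing citekeys must be given
    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument("--input")
    input_group.add_argument("--citekeys", nargs="+")

    if has_source:
        # Only steps with an upstream job can resume, force a rerun, or preview
        input_group.add_argument("--resume-from")
        parser.add_argument("--source-job-id", required=True)
        parser.add_argument("--force-rerun", action="store_true")
        parser.add_argument("--dry-run", action="store_true")

    return parser


parser = build_step_parser("parse_structure")
args = parser.parse_args(
    ["--citekeys", "dagz_v01", "--source-job-id", "2026-01-02_11-15-30", "--dry-run"]
)
print(args.citekeys, args.source_job_id, args.dry_run)
```

Passing both `--input` and `--citekeys` makes argparse exit with a usage error, which is the behavior the mutually exclusive group is there to provide.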

## 5. Lessons Learned

### Design Principles

1. **Consistency is king**
   - Users develop muscle memory
   - Easier to document and teach
   - Reduces cognitive load

2. **Explicit over implicit**
   - "latest" seemed convenient but caused confusion
   - Explicit timestamps make everything traceable
   - Verbosity is better than ambiguity

3. **Single responsibility per parameter**
   - Don't mix concerns (--ocr-job-id vs --sync-job-id)
   - Use uniform names (--source-job-id everywhere)
   - Clear what each parameter controls

4. **Build for the 99% case**
   - Most runs are regular processing
   - Make common case easy, advanced cases possible
   - Default to safe behavior (skip existing results)

5. **Preview before execution**
   - Dry run prevents costly mistakes
   - Shows what *would* happen
   - Builds confidence

### Technical Decisions

#### Why Mutually Exclusive Input Groups?
```python
input_group = parser.add_mutually_exclusive_group(required=True)
input_group.add_argument("--input")
input_group.add_argument("--citekeys", nargs="+")
input_group.add_argument("--resume-from")
```
- Prevents ambiguous combinations
- argparse enforces exactly one
- Clear error messages

#### Why Ban "latest"?
- Same command → different results over time
- Hard to debug "why did this change?"
- Symlinks still useful for reading, just not pipeline inputs

#### Why Limit Wildcards?
- `*` or `dag*` could match hundreds of files
- `_v*` and `_y*` match our naming conventions
- Prevents accidental bulk operations
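
A small demonstration of the difference, using `fnmatch.filter` (the function the wildcard expansion relies on) against a made-up list of citekeys. The citekey names here are hypothetical.

```python
import fnmatch

# Hypothetical citekeys following the _v (volume) and _y (year) suffix convention
available = ["dagz_v01", "dagz_v02", "dagz_y1950", "dagz_index", "other_v01"]

narrow = fnmatch.filter(available, "dagz_v*")  # volumes of one work
broad = fnmatch.filter(available, "dag*")      # everything with the prefix

print(narrow)  # ['dagz_v01', 'dagz_v02']
print(broad)   # ['dagz_v01', 'dagz_v02', 'dagz_y1950', 'dagz_index']
```

The narrow pattern stays within one kind of item, while the prefix pattern also pulls in years and anything else that happens to share the prefix.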

#### Why Separate --source-job-id from --resume-from?
```bash
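# Resume failed items from job X, but process against source Y.
#   --source-job-id: sync job (data source)
#   --resume-from:   parse job (previous attempt)
# (Comments sit up here because text after a trailing "\" breaks the line continuation.)
python scripts/parse_structure.py \
    --source-job-id 2026-01-02_11-15-30 \
    --resume-from 2026-01-02_11-32-58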
```
- Different purposes: data lineage vs execution history
- Resume validates source matches
- Prevents accidentally mixing data versions

### Migration Strategy

How we rolled out the changes:

1. **Phase 1: Add utilities** (etl_metadata.py)
   - validate_job_id()
   - get_failed_citekeys()
   - No breaking changes yet

2. **Phase 2: Update one step** (run_ocr.py)
   - Full refactor to new interface
   - Test thoroughly
   - Document changes

3. **Phase 3: Roll out to remaining steps**
   - sync_ocr.py
   - parse_structure.py
   - upload_pdfs.py (minor changes)

4. **Phase 4: Add advanced features**
   - Wildcard support
   - Dry run preview
   - Force rerun logic

5. **Phase 5: Document**
   - Pipeline overview notebook
   - This design journal
   - Update README

### What We'd Do Differently

Looking back, here's what we'd change:

1. **Start with standardization**
   - Should have designed interface first
   - Then implemented steps
   - Refactoring later is harder than building it right the first time

2. **Document as we go**
   - Easy to forget design rationale
   - This notebook should have been written during design
   - Future-you will thank past-you

3. **Add validation earlier**
   - "latest" caused problems for months
   - Simple validation would have caught issues
   - Fail fast, fail loud

4. **Test with real users**
   - Our abstractions made sense to us
   - Users had different mental models
   - Dry run came from user feedback

### What Worked Well

1. **Incremental refactoring**
   - Didn't break everything at once
   - Could test each step independently
   - Maintained working system throughout

2. **Utility functions**
   - Centralized validation
   - Reusable across steps
   - Single source of truth

3. **Notebooks for documentation**
   - Runnable examples
   - Can show actual data
   - More engaging than plain docs

## Conclusion

The standardization effort took significant time but delivered:

- ✅ **Consistent interface** across all pipeline steps
- ✅ **Reproducible results** (no more "latest")
- ✅ **Clear lineage** (source_job_id chain)
- ✅ **Robust resume** (built into all steps)
- ✅ **Safe operations** (dry run, validation)
- ✅ **Power features** (wildcards, force rerun)

Most importantly: **Users can now predict how any pipeline step works** based on knowing one step.

That's the real win.