## Citekey Validation

A citekey is a structured identifier for archival documents. Format:

```
<slug>[_s<series>][_y<year>[_<year>]][_v<volume>]
```

### Examples

| Citekey | Meaning |
|---------|----------|
| `zywxhb_v01` | Zywxhb collection, volume 01 |
| `zywxhb_v002` | Zywxhb collection, volume 002 (3-digit for this series) |
| `zywxhb_y1900_v01` | Zywxhb, year 1900, volume 01 |
| `zywxhb_y1900_1905_v01` | Zywxhb, years 1900-1905, volume 01 |
| `zywxhb_s03_v01` | Zywxhb, series 03, volume 01 |

### Validation Rules

1. **Lowercase only**: `zywxhb_v01` ✓, `ZYWXHB_V01` ✗
2. **No hyphens**: `zywxhb_v01` ✓, `zywxhb-v01` ✗
3. **Volume padding** (width depends on the series):
   - Default: 2-digit volumes (`v01`, `v02`, ..., `v99`)
   - Series listed under `citekey_rules.three_digit_volume_series` in `config/upload_config.yaml`: 3-digit volumes (`v001`, `v002`, ..., `v999`)
4. **Year format**: 4 digits (YYYY)
5. **Year ranges**: Must be ascending (`y1900_1905` ✓, `y1905_1900` ✗)
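
The rules above can be sketched as a single regular expression plus a few explicit checks. This is a minimal illustration, not the script's own `validate_citekey`: the function name `check_citekey`, the `three_digit_slugs` parameter, the error wording, and the assumption that a slug consists of lowercase letters and digits are all hypothetical.

```python
import re

# Pattern mirrors <slug>[_s<series>][_y<year>[_<year>]][_v<volume>].
# Assumption: a slug is lowercase letters and digits, starting with a letter.
CITEKEY_RE = re.compile(
    r"(?P<slug>[a-z][a-z0-9]*)"
    r"(?:_s(?P<series>\d+))?"
    r"(?:_y(?P<year_start>\d{4})(?:_(?P<year_end>\d{4}))?)?"
    r"(?:_v(?P<volume>\d+))?"
)


def check_citekey(citekey: str, three_digit_slugs: frozenset[str] = frozenset()) -> None:
    """Raise ValueError if `citekey` breaks one of the rules listed above."""
    if citekey != citekey.lower():
        raise ValueError(f"{citekey}: citekey must be lowercase")
    if "-" in citekey:
        raise ValueError(f"{citekey}: citekey must not contain hyphens")
    match = CITEKEY_RE.fullmatch(citekey)
    if match is None:
        raise ValueError(f"{citekey}: does not match the citekey format")
    volume = match["volume"]
    if volume is not None:
        # Slugs configured for 3-digit volumes get width 3, all others width 2.
        width = 3 if match["slug"] in three_digit_slugs else 2
        if len(volume) != width:
            raise ValueError(f"{citekey}: volume must be {width} digits")
    start, end = match["year_start"], match["year_end"]
    # Assumption: a range with two equal years is also rejected.
    if end is not None and int(end) <= int(start):
        raise ValueError(f"{citekey}: year range must be ascending")
```

Calling `check_citekey("zywxhb_y1900_1905_v01")` returns silently, while `check_citekey("zywxhb_y1905_1900_v01")` raises a `ValueError`.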

## Setup

### Prerequisites

1. **Python environment** with uv:
   ```bash
   cd /path/to/etl
   uv venv
   source .venv/bin/activate
   uv pip install -r requirements.txt
   ```

2. **B2 Credentials** (set as environment variables):
   ```bash
   export B2_SOURCE_ACCESS_KEY_ID="your_app_key_id"
   export B2_SOURCE_SECRET_ACCESS_KEY="your_app_key"
   ```

3. **Optional overrides** (default values from `config/upload_config.yaml`):
   ```bash
   export B2_SOURCE_BUCKET="cna-sources"
   export B2_SOURCE_REGION="us-east-005"
   ```

Or create a `.env` file in the project root:
```
B2_SOURCE_ACCESS_KEY_ID=your_key_id
B2_SOURCE_SECRET_ACCESS_KEY=your_secret
```

## Basic Usage

### Upload a single PDF
```bash
uv run scripts/upload_pdfs.py --input data/sources/2026_project/zywxhb_v01.pdf
```

### Upload all PDFs from a directory
```bash
uv run scripts/upload_pdfs.py --input data/sources/2026_project/
```

### Upload from multiple sources
```bash
uv run scripts/upload_pdfs.py --input data/sources/2026_project1/ data/sources/2026_project2/ some_file.pdf
```

### Dry-run (validate without uploading)
```bash
uv run scripts/upload_pdfs.py --input data/sources/2026_project/ --dry-run
```

### Custom staging directory
```bash
uv run scripts/upload_pdfs.py --input data/sources/ --staging ./my_staging/
```

### Quiet mode (suppress progress output)
```bash
uv run scripts/upload_pdfs.py --input data/sources/ --quiet
```

### Skip metadata upload to B2
```bash
uv run scripts/upload_pdfs.py --input data/sources/ --no-metadata-sync
```

## How It Works

### Step 1: Collect PDFs

The script scans input paths for `.pdf` files. It accepts:
- **Directories**: recursively searches for PDFs
- **Individual files**: single PDF paths
- **Mixed**: both directories and files

```python
pdfs = collect_pdfs(input_paths)
# Returns: List of Path objects pointing to all found PDFs
```
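
As a rough sketch of what such a collection step might look like, the helper below walks directories recursively and accepts individual files. The name `find_pdfs` is hypothetical, and matching only the lowercase `.pdf` suffix is an assumption; the real `collect_pdfs` may behave differently.

```python
from pathlib import Path
from typing import Iterable


def find_pdfs(input_paths: Iterable[Path]) -> list[Path]:
    """Return every .pdf file under the given directories and file paths."""
    found: list[Path] = []
    for path in input_paths:
        if path.is_dir():
            # rglob descends into all subdirectories.
            found.extend(p for p in path.rglob("*.pdf") if p.is_file())
        elif path.is_file() and path.suffix == ".pdf":
            found.append(path)
        # Anything else (missing paths, other file types) is ignored here.
    return sorted(found)
```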

### Step 2: Validate Citekeys

Each PDF filename (without extension) is validated as a citekey:

```python
validate_citekey("zywxhb_v01")  # ✓ Valid
validate_citekey("ZYWXHB_V01")  # ✗ Error: must be lowercase
validate_citekey("zywxhb_v1")   # ✗ Error: volume must be 2 digits
```

### Step 3: Normalize Structure

PDFs are copied into a standard one-folder-per-citekey structure in the staging directory (`staging_normalized/` by default):

```
staging_normalized/
  zywxhb_v01/
    zywxhb_v01.pdf
  zywxhb_v02/
    zywxhb_v02.pdf
```

This ensures consistent structure regardless of input folder layout.
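
A minimal sketch of this normalization, assuming the citekey is simply the filename stem. The function name `stage_pdfs` is hypothetical; the `RuntimeError` message mirrors the one shown under Troubleshooting.

```python
import shutil
from pathlib import Path


def stage_pdfs(pdfs: list[Path], staging_dir: Path) -> dict[str, Path]:
    """Copy each PDF to <staging_dir>/<citekey>/<citekey>.pdf."""
    staged: dict[str, Path] = {}
    for pdf in pdfs:
        citekey = pdf.stem  # filename without the .pdf extension
        if citekey in staged:
            raise RuntimeError(f"Duplicate PDF for citekey {citekey}")
        target_dir = staging_dir / citekey
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / f"{citekey}.pdf"
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        staged[citekey] = target
    return staged
```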

### Step 4: Compute Checksums

A SHA-256 checksum is computed for each PDF:

```python
checksum = sha256(pdf_path)
# Returns: "a1b2c3d4e5f6..."
```
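
For reference, a file checksum of this kind can be computed with the standard library as sketched below. The helper name `sha256_of_file` and the 1 MiB chunk size are illustrative choices, not taken from the script.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for large scanned PDFs.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```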

### Step 5: Check Against History

Load checksums from all previous completed uploads:

```python
checksums = load_checksums_from_metadata(metadata_dir)
# Returns: {"zywxhb_v01": "abc123...", "zywxhb_v02": "def456..."}
```

A PDF is **skipped** if its checksum matches the one recorded for the same citekey, and **uploaded** if the citekey is new or the checksum has changed.
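
The history check could be sketched roughly as follows, assuming the metadata files have the JSON shape shown under Step 7. The function names are hypothetical, and letting later files override earlier ones is an assumption about how conflicts are resolved.

```python
import json
from pathlib import Path


def load_known_checksums(metadata_dir: Path) -> dict[str, str]:
    """Merge the checksums of all completed jobs found in metadata_dir."""
    known: dict[str, str] = {}
    # Timestamped filenames sort chronologically, so later jobs win.
    for meta_file in sorted(metadata_dir.glob("*.json")):
        job = json.loads(meta_file.read_text(encoding="utf-8"))
        if job.get("status") == "completed":
            known.update(job.get("checksums", {}))
    return known


def needs_upload(citekey: str, checksum: str, known: dict[str, str]) -> bool:
    """True when the citekey is new or its checksum differs from history."""
    return known.get(citekey) != checksum
```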

### Step 6: Upload to B2

For each PDF that needs uploading:

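```python
# Assumes `s3` is a boto3 S3 client pointed at the B2 S3 endpoint
s3.upload_file(
    Filename="staging_normalized/zywxhb_v01/zywxhb_v01.pdf",  # local file
    Bucket="cna-sources",
    Key="zywxhb_v01/zywxhb_v01.pdf",  # object key in the bucket
)
```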

B2 bucket structure:
```
cna-sources/
  zywxhb_v01/zywxhb_v01.pdf
  zywxhb_v02/zywxhb_v02.pdf
  run_metadata/2026-01-01_12-30-45.json
```
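
The object keys in the listing above follow a simple pattern, which can be written down as two small helpers. Both function names are hypothetical; only the key layout itself comes from the bucket structure shown.

```python
def pdf_object_key(citekey: str) -> str:
    """Key of a PDF inside the bucket: <citekey>/<citekey>.pdf."""
    return f"{citekey}/{citekey}.pdf"


def metadata_object_key(job_id: str) -> str:
    """Key of a job's metadata file: run_metadata/<job_id>.json."""
    return f"run_metadata/{job_id}.json"
```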

### Step 7: Save Metadata

Metadata is saved locally and uploaded to B2:

```json
{
  "job_id": "2026-01-01_12-30-45",
  "timestamp": "2026-01-01T12:30:45.123456",
  "bucket": "cna-sources",
  "status": "completed",
  "citekeys": {
    "total": 2,
    "uploaded": 2,
    "skipped": 0,
    "list": ["zywxhb_v01", "zywxhb_v02"]
  },
  "checksums": {
    "zywxhb_v01": "abc123...",
    "zywxhb_v02": "def456..."
  }
}
```

- **Local**: `data/sources/job_metadata/2026-01-01_12-30-45.json`
- **B2**: `cna-sources/run_metadata/2026-01-01_12-30-45.json`
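
A record with the shape shown above could be assembled and written locally along these lines. This is a sketch under assumptions: the function names, their parameters, and deriving `job_id` from the start time are illustrative and not confirmed by the script.

```python
import json
from datetime import datetime
from pathlib import Path


def build_job_metadata(bucket: str, checksums: dict[str, str],
                       uploaded: list[str], started_at: datetime) -> dict:
    """Assemble a job record; `checksums` covers every processed citekey."""
    return {
        "job_id": started_at.strftime("%Y-%m-%d_%H-%M-%S"),
        "timestamp": started_at.isoformat(),
        "bucket": bucket,
        "status": "completed",
        "citekeys": {
            "total": len(checksums),
            "uploaded": len(uploaded),
            "skipped": len(checksums) - len(uploaded),
            "list": sorted(checksums),
        },
        "checksums": dict(checksums),
    }


def write_job_metadata(job: dict, metadata_dir: Path) -> Path:
    """Write the record to <metadata_dir>/<job_id>.json and return the path."""
    metadata_dir.mkdir(parents=True, exist_ok=True)
    target = metadata_dir / f"{job['job_id']}.json"
    target.write_text(json.dumps(job, indent=2), encoding="utf-8")
    return target
```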

## Common Workflows

### Workflow 1: First-time Upload

A data worker prepares PDFs in `data/sources/2026_january_batch/`:

```bash
# Dry-run first to validate
uv run scripts/upload_pdfs.py --input data/sources/2026_january_batch/ --dry-run

# If validation passes, upload for real
uv run scripts/upload_pdfs.py --input data/sources/2026_january_batch/
```

Result:
- PDFs uploaded to `cna-sources/zywxhb_v01/`, etc.
- Metadata saved to `data/sources/job_metadata/2026-01-01_12-30-45.json`
- Metadata uploaded to `cna-sources/run_metadata/2026-01-01_12-30-45.json`

### Workflow 2: Incremental Upload

A data worker adds new PDFs to `data/sources/2026_january_batch/`:

```bash
# Script compares checksums against previous jobs
# Only uploads new files
uv run scripts/upload_pdfs.py --input data/sources/2026_january_batch/
```

Output:
```
Found 5 PDF(s).
Normalizing structure...
✓ normalized: zywxhb_v01
✓ normalized: zywxhb_v02
✓ normalized: zywxhb_v03
✓ normalized: zywxhb_v04
✓ normalized: zywxhb_v05

Uploading PDFs...
Uploading PDFs: 100%|██████████| 5/5 [uploaded=1, skipped=4]

Summary: 1 uploaded, 4 skipped.
```

### Workflow 3: Fix a Corrupted Upload

If a file was corrupted on B2 and needs re-uploading, note that a plain re-run will not replace it: the local PDF's checksum still matches the job history, so the script skips it.

```bash
# Re-running alone skips the file, because its checksum matches history
uv run scripts/upload_pdfs.py --input data/sources/2026_january_batch/

# To force a re-upload, you'd modify the local PDF or its checksum cache
# (This is intentional: it prevents accidental re-uploads)
```

## Configuration

Settings are in `config/upload_config.yaml`. Key options:

```yaml
directories:
  staging: ./staging_normalized      # Temp folder for normalized PDFs
  metadata: data/sources/job_metadata # Where to save job metadata

b2:
  default_bucket: cna-sources        # B2 bucket name
  default_region: us-east-005        # B2 region
  default_endpoint_url: https://...  # B2 S3 endpoint

citekey_rules:
  three_digit_volume_series:
    - zywxhb                         # This series uses 3-digit volumes
```

Override via environment variables:

```bash
export B2_SOURCE_BUCKET="my-bucket"
export B2_SOURCE_REGION="eu-central-003"
```
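
The override order (environment variable first, then the YAML default) can be sketched as below. The function name `resolve_b2_settings` is hypothetical, and `defaults` is assumed to be the already-parsed `b2` section of `config/upload_config.yaml`.

```python
import os
from typing import Mapping


def resolve_b2_settings(defaults: Mapping[str, str],
                        env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Pick each setting from the environment, else from the config defaults."""
    return {
        "bucket": env.get("B2_SOURCE_BUCKET", defaults["default_bucket"]),
        "region": env.get("B2_SOURCE_REGION", defaults["default_region"]),
    }
```

With `B2_SOURCE_BUCKET="my-bucket"` exported and no region override, this returns the overridden bucket together with the region from the config file.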

## Troubleshooting

### Error: Citekey validation failed

**Problem**: `zywxhb_V04` (uppercase)
```
ValueError: zywxhb_V04: citekey must be lowercase
```

**Solution**: Rename file to `zywxhb_v04.pdf` (lowercase)

### Error: Volume must be 3 digits

**Problem**: `zywxhb_v4` for the zywxhb series
```
ValueError: zywxhb_v4: volume must be 3 digits for series 'zywxhb'
```

**Solution**: Rename to `zywxhb_v004.pdf` (3 digits)

### Error: B2 configuration incomplete

**Problem**: Missing B2 credentials
```
❌ B2 source configuration incomplete. Required: B2_SOURCE_ACCESS_KEY_ID, ...
```

**Solution**: Set environment variables or `.env` file:
```bash
export B2_SOURCE_ACCESS_KEY_ID="your_key"
export B2_SOURCE_SECRET_ACCESS_KEY="your_secret"
```
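
To check the environment before running the script, something like the following sketch may help. It only covers the two variables documented in this guide; the script's actual list of required settings may be longer, and the helper name `missing_b2_vars` is hypothetical.

```python
import os
from typing import Mapping

# Only the two credentials documented in this guide (an assumption).
REQUIRED_B2_VARS = ("B2_SOURCE_ACCESS_KEY_ID", "B2_SOURCE_SECRET_ACCESS_KEY")


def missing_b2_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_B2_VARS if not env.get(name)]
```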

### Error: No PDFs found

**Problem**: Input directory doesn't contain `.pdf` files

**Solution**: Check path and ensure files end in `.pdf`

### Error: Duplicate PDF for citekey

**Problem**: Two input PDFs share the same citekey, so they collide in the staging directory
```
RuntimeError: Duplicate PDF for citekey zywxhb_v01
```

**Solution**: Ensure the input paths don't contain two PDFs with the same filename (for example, copies in different subfolders)
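
To locate the offending files before running the script, a small pre-check along these lines may help. It assumes the citekey is the filename stem; the helper name `find_duplicate_citekeys` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_citekeys(pdfs: list[Path]) -> dict[str, list[Path]]:
    """Group PDFs by filename stem and keep only stems seen more than once."""
    by_citekey: dict[str, list[Path]] = defaultdict(list)
    for pdf in pdfs:
        by_citekey[pdf.stem].append(pdf)
    return {key: paths for key, paths in by_citekey.items() if len(paths) > 1}
```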

## Best Practices

### For Data Workers

1. **Organize by batch**: Create folders like `data/sources/2026_january/`, `data/sources/2026_february/`
2. **Use valid citekeys**: Follow the naming convention strictly
3. **Always dry-run first**: Use `--dry-run` to validate before uploading
4. **Document your work**: Note which batch you uploaded and when

### For Developers

1. **Test with small files**: Use `tests/fixtures/sample_pdfs/` for testing, not `data/sources/`
2. **Keep metadata**: Don't delete `data/sources/job_metadata/` — it's your audit trail
3. **Check job history**: Look at metadata JSON to see what was uploaded when
4. **Monitor B2 costs**: Large uploads consume bandwidth; plan batches accordingly

### For the Pipeline

1. **Staging directory is temporary**: Gets deleted after upload; don't store important files there
2. **Checksums are immutable**: Once a citekey has been uploaded with checksum X, a file with the same citekey and checksum is skipped on later runs
3. **Metadata is versioned**: Each upload creates a new metadata file with a timestamped filename (`YYYY-MM-DD_HH-MM-SS.json`)
4. **B2 is source of truth**: Local metadata is a mirror for fast access