# Trusted Zone — Images Processing

This notebook handles the **image processing** step for the Trusted Zone of our data pipeline.  
Its goals are to:

1. **Extract recipe IDs** from image filenames in the Formatted Zone
2. **Identify which recipes have images** and build a mapping
3. **Copy filtered images** to the Trusted Zone (only images whose filename yields a recipe ID)
4. **Generate a recipe IDs file** for the documents processing step

This notebook works in conjunction with `documents.ipynb` to ensure data integrity in the Trusted Zone.


## 1. Setup and Configuration


In [17]:
import json
import os
import re
from pathlib import PurePosixPath
from datetime import datetime, timezone
from typing import Dict, List, Set, Iterable

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError
from dotenv import load_dotenv

load_dotenv()

# S3 / MinIO Configuration
MINIO_USER     = os.getenv("MINIO_USER")
MINIO_PASSWORD = os.getenv("MINIO_PASSWORD")
MINIO_ENDPOINT = os.getenv("MINIO_ENDPOINT")

session = boto3.session.Session(
    aws_access_key_id=MINIO_USER,
    aws_secret_access_key=MINIO_PASSWORD,
    region_name="us-east-1"
)
s3 = session.client(
    "s3",
    endpoint_url=MINIO_ENDPOINT,
    config=Config(signature_version="s3v4", s3={"addressing_style": "path"})
)

# Paths and Buckets
FORM_BUCKET         = "formatted-zone"
FORM_IMAGES_PREFIX  = "images"

TRUST_BUCKET        = "trusted-zone"
TRUST_IMAGES_PREFIX = "images"
TRUST_REPORT_PREFIX = "reports"

# Output file for documents processing
RECIPE_IDS_FILE = "recipe_ids_with_images.json"

# Behavior flags
DRY_RUN   = False
OVERWRITE = True

def utc_ts():
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
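
`os.getenv` returns `None` for unset variables, so a missing `.env` file may surface only at the first S3 call. A minimal sketch of an early check, assuming the three variable names used above; the helper `require_env` is hypothetical and not part of the notebook:

```python
import os

def require_env(names, environ=None):
    """Return {name: value} for the given variables; raise if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: environ[n] for n in names}
```

Calling `require_env(["MINIO_USER", "MINIO_PASSWORD", "MINIO_ENDPOINT"])` right after `load_dotenv()` would stop the notebook before any client is created.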


## 2. S3 Helper Functions

These utility functions provide a clean interface for S3 operations, handling common patterns like listing objects, checking existence, and copying files between buckets.


In [18]:
def s3_list_keys(bucket: str, prefix: str) -> Iterable[str]:
    """List all object keys in a bucket with the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []) or []:
            key = obj["Key"]
            if not key.endswith("/"):
                yield key

def s3_head(bucket: str, key: str):
    """Get object metadata, return None if not found."""
    try:
        return s3.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        if e.response.get("Error", {}).get("Code") in ("404", "NoSuchKey", "NotFound"):
            return None
        raise

def s3_copy_object(src_bucket: str, src_key: str, dst_bucket: str, dst_key: str, overwrite: bool = True):
    """Copy an object between buckets with optional overwrite control."""
    if not overwrite and s3_head(dst_bucket, dst_key) is not None:
        return "skip-exists"
    return s3.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
        MetadataDirective="COPY"
    )


## 3. Extract Recipe IDs from Image Filenames

Each image stored in the Formatted Zone follows a structured naming convention that encodes metadata, including the recipe identifier.  
The general pattern is:

**fileType$dataSource$ingestionTimestamp$hash__recipeId_positionOnImagesUrlArrayFromLayer2.extension**

From these filenames, we extract the `recipeId` using a regular expression.  
This allows us to associate every image with its corresponding recipe entry, even when multiple images exist for the same recipe.  
The result of this step is two structures:

- `img_ids`: a set of **unique recipe IDs** that have at least one image.
- `id_to_imgkeys`: a dictionary mapping each `recipeId` to **all its image keys** (to preserve one-to-many relationships).


In [19]:
# Regular expression to extract recipe ID from image filenames
# Recognizes names like:
#   images/type$src$ts$hash__000018c8a5_0.jpg
#   images/type$src$ts$hash__abcd_ef-12_3.JPEG
# ID part: letters, digits, underscore, dash
ID_REGEX = re.compile(
    r"__([A-Za-z0-9_\-]+)_(\d+)\.(?:jpe?g|png|webp|gif|bmp|tiff)$",
    re.IGNORECASE
)

def recipe_id_from_image_key(key: str) -> str | None:
    """Extract recipe ID from an image S3 key."""
    name = PurePosixPath(key).name
    m = ID_REGEX.search(name)
    return m.group(1) if m else None

print("Extracting recipe IDs from image filenames...")

img_ids: Set[str] = set()                    # unique IDs (for filtering)
id_to_imgkeys: Dict[str, List[str]] = {}     # ALL images per ID

count_keys = 0
for key in s3_list_keys(FORM_BUCKET, FORM_IMAGES_PREFIX + "/"):
    count_keys += 1
    rid = recipe_id_from_image_key(key)
    if not rid:
        continue
    img_ids.add(rid)
    id_to_imgkeys.setdefault(rid, []).append(key)

# Sort each ID's keys so the copy order is deterministic (optional)
for rid in id_to_imgkeys:
    id_to_imgkeys[rid].sort()

total_images = sum(len(v) for v in id_to_imgkeys.values())
print(f"[INFO] scanned image keys: {count_keys}")
print(f"[INFO] unique recipeIds with images: {len(img_ids)}")
print(f"[INFO] total image files matched to recipeIds: {total_images}")


Extracting recipe IDs from image filenames...
[INFO] scanned image keys: 11
[INFO] unique recipeIds with images: 7
[INFO] total image files matched to recipeIds: 11
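
A quick way to sanity-check the pattern is to run it against a few names, including ones that should not match. The sketch below repeats the regex so it runs on its own; the helper `parse_image_name` and the last two sample keys are made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Same pattern as in the cell above, repeated so the snippet is self-contained.
ID_REGEX = re.compile(
    r"__([A-Za-z0-9_\-]+)_(\d+)\.(?:jpe?g|png|webp|gif|bmp|tiff)$",
    re.IGNORECASE,
)

def parse_image_name(key):
    """Return (recipe_id, position) for a matching key, else None."""
    m = ID_REGEX.search(PurePosixPath(key).name)
    return (m.group(1), int(m.group(2))) if m else None

samples = [
    "images/type$src$ts$hash__000018c8a5_0.jpg",
    "images/type$src$ts$hash__abcd_ef-12_3.JPEG",
    "images/type$src$ts$hash__000018c8a5.jpg",  # hypothetical: no position suffix
    "images/readme.txt",                         # hypothetical: not an image
]
for key in samples:
    print(key, "->", parse_image_name(key))
```

Because the ID character class itself allows underscores, the regex backtracks so that only the final `_<digits>` before the extension is read as the position; everything between `__` and that suffix is the recipe ID.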


## 4. Save Recipe IDs for Documents Processing

We save the extracted recipe IDs to a local JSON file, which the `documents.ipynb` notebook reads to filter the recipe documents. This file is the only hand-off between the two notebooks, so image and document processing otherwise stay independent.


In [20]:
# Prepare data for documents processing
recipe_ids_data = {
    "timestamp": utc_ts(),
    "source": f"s3://{FORM_BUCKET}/{FORM_IMAGES_PREFIX}/",
    "total_images_scanned": count_keys,
    "unique_recipe_ids": len(img_ids),
    "total_images_matched": total_images,
    "recipe_ids_with_images": sorted(img_ids),
    "recipe_to_images": dict(id_to_imgkeys)
}

# Save to local file
with open(RECIPE_IDS_FILE, 'w', encoding='utf-8') as f:
    json.dump(recipe_ids_data, f, ensure_ascii=False, indent=2)

print(f"[OK] saved recipe IDs to {RECIPE_IDS_FILE}")
print(f"[INFO] {len(img_ids)} recipe IDs will be used for document filtering")


[OK] saved recipe IDs to recipe_ids_with_images.json
[INFO] 7 recipe IDs will be used for document filtering
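
`documents.ipynb` is not shown here, so the following is only a sketch of how a consumer could read this file and filter documents. The document shape (a list of dicts with an `id` field) and both function names are assumptions made for illustration.

```python
import json
import os
import tempfile

def load_recipe_ids(path):
    """Read the IDs file and return the set of recipe IDs that have images."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return set(data["recipe_ids_with_images"])

def filter_documents(docs, allowed_ids, id_field="id"):
    """Keep only documents whose ID field is in allowed_ids (field name is assumed)."""
    return [d for d in docs if d.get(id_field) in allowed_ids]

# Demo with a throwaway file so the snippet runs without the real pipeline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "recipe_ids_with_images.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"recipe_ids_with_images": ["a1", "b2"]}, f)
    allowed = load_recipe_ids(path)

docs = [{"id": "a1", "title": "Soup"}, {"id": "zz", "title": "Cake"}]
kept = filter_documents(docs, allowed)
print(kept)
```

Loading the IDs into a set keeps each membership test constant-time, which matters once the document collection is much larger than the image set.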


## 5. Copy Images to Trusted Zone

Now we copy every image whose filename yielded a recipe ID from the **Formatted Zone** to the **Trusted Zone**.  
We preserve the original filenames to maintain traceability and keep the image-to-recipe mapping intact.

This step ensures that:
- Only images whose filename yields a recipe ID are copied
- All images for a recipe are preserved (multiple images per recipe)
- Original filenames and object metadata are preserved (`MetadataDirective="COPY"`)


In [21]:
print("Copying images to Trusted Zone...")

copied = skipped = 0

if DRY_RUN:
    print("[DRY_RUN] Would copy the following images:")
    for rid, keys in id_to_imgkeys.items():
        for src_key in keys:
            dst_key = f"{TRUST_IMAGES_PREFIX}/{PurePosixPath(src_key).name}"
            print(f"  {src_key} -> {dst_key}")
    copied = total_images  # dry run: number of images that WOULD be copied; nothing is written
else:
    # Copy every image whose filename yielded a recipe ID
    for rid, keys in id_to_imgkeys.items():
        # keys are full 'images/...' relative keys in formatted-zone
        for src_key in keys:
            dst_key = f"{TRUST_IMAGES_PREFIX}/{PurePosixPath(src_key).name}"
            try:
                result = s3_copy_object(FORM_BUCKET, src_key, TRUST_BUCKET, dst_key, overwrite=OVERWRITE)
                if result == "skip-exists":
                    print(f"[SKIP] {dst_key} already exists")
                    skipped += 1
                else:
                    copied += 1
            except ClientError as e:
                print(f"[WARN] copy failed {src_key} -> {dst_key}: {e}")
                skipped += 1  # failed copies are counted together with skips

print(f"[STATS] images copied={copied} skipped={skipped}")


Copying images to Trusted Zone...
[STATS] images copied=11 skipped=0
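
Destination keys keep only the basename of the source key, so two source keys with the same filename under different sub-prefixes would map to the same destination and, with `OVERWRITE = True`, silently replace each other. A minimal sketch of a pre-copy check; `plan_copies`, `find_collisions`, and the sample keys are hypothetical:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def plan_copies(src_keys, dst_prefix="images"):
    """Map each destination key to the source keys that would be written to it."""
    plan = defaultdict(list)
    for src_key in src_keys:
        dst_key = f"{dst_prefix}/{PurePosixPath(src_key).name}"
        plan[dst_key].append(src_key)
    return dict(plan)

def find_collisions(plan):
    """Return only the destinations targeted by more than one source key."""
    return {dst: srcs for dst, srcs in plan.items() if len(srcs) > 1}

keys = [
    "images/a$x$t$h__r1_0.jpg",
    "images/batch2/a$x$t$h__r1_0.jpg",  # same basename, different sub-prefix
    "images/a$x$t$h__r2_0.jpg",
]
collisions = find_collisions(plan_copies(keys))
print(collisions)
```

In the notebook, the input would be the flattened values of `id_to_imgkeys`; an empty result means the flattening is safe for the current listing.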


## 6. Generate Processing Report

Finally, we write a JSON summary of this run (counts, source and destination prefixes, and the flag values) to the Trusted Zone for audit and monitoring purposes.


In [22]:
report = {
    "timestamp": utc_ts(),
    "processing_step": "images",
    "source_images_prefix": f"s3://{FORM_BUCKET}/{FORM_IMAGES_PREFIX}/",
    "destination_images_prefix": f"s3://{TRUST_BUCKET}/{TRUST_IMAGES_PREFIX}/",
    "total_images_scanned": count_keys,
    "unique_recipe_ids_with_images": len(img_ids),
    "total_images_matched": total_images,
    "images_copied": copied,
    "images_skipped": skipped,
    "recipe_ids_file": RECIPE_IDS_FILE,
    "dry_run": DRY_RUN,
    "overwrite": OVERWRITE
}

if not DRY_RUN:
    s3.put_object(
        Bucket=TRUST_BUCKET,
        Key=f"{TRUST_REPORT_PREFIX}/images_processing_{report['timestamp']}.json",  # reuse the report's own timestamp
        Body=json.dumps(report, ensure_ascii=False, indent=2).encode("utf-8"),
        ContentType="application/json"
    )
    print(f"[OK] wrote report -> s3://{TRUST_BUCKET}/{TRUST_REPORT_PREFIX}/")
else:
    print("[DRY_RUN] report:", json.dumps(report, indent=2))

print("\n" + "="*60)
print("IMAGES PROCESSING COMPLETE")
print("="*60)
print("Next step: Run documents.ipynb to filter recipe documents")
print(f"Recipe IDs file: {RECIPE_IDS_FILE}")


[OK] wrote report -> s3://trusted-zone/reports/

IMAGES PROCESSING COMPLETE
Next step: Run documents.ipynb to filter recipe documents
Recipe IDs file: recipe_ids_with_images.json
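
For monitoring, the counts in the report can be cross-checked against each other. The sketch below assumes the field names of the `report` dict built above; the function `check_report` is hypothetical, and the sample values are taken from the run shown in this notebook.

```python
def check_report(report):
    """Return a list of human-readable inconsistencies found in a report dict."""
    problems = []
    matched = report["total_images_matched"]
    if report["images_copied"] + report["images_skipped"] != matched:
        problems.append("copied + skipped does not equal total_images_matched")
    if matched > report["total_images_scanned"]:
        problems.append("more images matched than scanned")
    if report["unique_recipe_ids_with_images"] > matched:
        problems.append("more unique recipe IDs than matched images")
    return problems

# Counts from the run shown above.
sample = {
    "total_images_scanned": 11,
    "unique_recipe_ids_with_images": 7,
    "total_images_matched": 11,
    "images_copied": 11,
    "images_skipped": 0,
}
print(check_report(sample))
```

An empty list means the counts agree; any entry points at the specific relation that failed.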
