# Catalog/Batch Enrichment (Metadata Enrichment)

Goal: enrich a batch of catalog-like records (CSV) with summaries and tags **reliably**.

What you’ll practice:
- Batch prompting pattern (one item per call)
- Retries + error handling
- Caching (avoid paying twice)
- Cost awareness (rough token/cost estimate)


## 1. Setup and Installation

**Estimated time:** ~60–90 minutes (with exercises)

### Install
If needed, install dependencies:
```bash
pip install -U openai pydantic pandas numpy scikit-learn
```

### Environment
Set your API key:
```bash
export OPENAI_API_KEY="..."
```

> **Note:** All example data in this notebook is synthetic (safe to share in training).

In [None]:
import os

assert os.getenv('OPENAI_API_KEY'), "Set OPENAI_API_KEY in your environment"

## 2. Imports + API client

In [None]:
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from env

In [None]:
from pydantic import BaseModel, Field
from typing import List, Literal, Optional
import pandas as pd
import hashlib
import json
import time
import os


## 3. Sample catalog CSV (synthetic)


In [None]:
items = [
  {"id":"bk_1001","title":"The Secret Garden","author":"Frances Hodgson Burnett","year":1911},
  {"id":"bk_1002","title":"The Time Machine","author":"H. G. Wells","year":1895},
  {"id":"bk_1003","title":"A Study in Scarlet","author":"Arthur Conan Doyle","year":1887},
  {"id":"bk_1004","title":"Poems (Selected)","author":"Emily Dickinson","year":1890},
  {"id":"bk_1005","title":"Intro to Python for Library Data","author":"(Training Dept.)","year":2026},
]
df = pd.DataFrame(items)
df

## 4. Define an enrichment schema


In [None]:
class EnrichedRecord(BaseModel):
    id: str
    one_sentence_summary: str = Field(..., description="<= 25 words")
    subject_tags: List[str] = Field(..., description="3–8 tags, title case")
    audience: Literal["Kids","Teens","Adults","All"]
    tone: Optional[Literal["Informative","Playful","Serious"]] = None
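
The `description` strings in the schema are hints passed to the model, not constraints that Pydantic checks, so a 40-word summary or a single tag would still validate. A minimal sketch of a plain-Python post-check follows. `check_record` is a hypothetical helper (not part of the notebook), the limits are copied from the field descriptions above, and the sample record is hand-written rather than model output.

```python
def check_record(rec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    # Word limit taken from the one_sentence_summary description ("<= 25 words")
    n_words = len(rec["one_sentence_summary"].split())
    if n_words > 25:
        problems.append(f"summary has {n_words} words (limit 25)")
    # Tag count taken from the subject_tags description ("3–8 tags")
    n_tags = len(rec["subject_tags"])
    if not 3 <= n_tags <= 8:
        problems.append(f"{n_tags} tags (expected 3 to 8)")
    return problems

# Hand-written sample with too few tags
sample = {
    "id": "bk_1002",
    "one_sentence_summary": "A Victorian inventor travels to the far future.",
    "subject_tags": ["Science Fiction", "Time Travel"],
}
print(check_record(sample))
```

Running a check like this on each `model_dump()` result gives you a list of records to review or re-request, without making the schema itself stricter.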

## 5. Caching + request function

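Caching saves time and money: re-running the notebook skips any record that was already enriched. The cache key is a SHA-256 hash of the input payload, so a record is re-sent only when one of its fields changes.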


In [None]:
CACHE_DIR = ".cache_enrichment"
os.makedirs(CACHE_DIR, exist_ok=True)

def cache_key(payload: dict) -> str:
    s = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(s).hexdigest()

def cache_get(key: str):
    p = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(p):
        with open(p, "r", encoding="utf-8") as f:
            return json.load(f)
    return None

def cache_set(key: str, obj: dict):
    p = os.path.join(CACHE_DIR, key + ".json")
    with open(p, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
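
As a quick check of the hashing idea, the sketch below re-declares `cache_key` so it runs on its own. It shows that key order in the payload does not change the hash, while changing any field does. The sample payloads are made up for illustration.

```python
import hashlib
import json

def cache_key(payload: dict) -> str:
    # sort_keys=True makes the JSON text independent of dict insertion order
    s = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(s).hexdigest()

a = {"id": "bk_1001", "title": "The Secret Garden", "year": 1911}
b = {"year": 1911, "title": "The Secret Garden", "id": "bk_1001"}  # same data, different order
c = {"id": "bk_1001", "title": "The Secret Garden", "year": 1912}  # one field changed

print(cache_key(a) == cache_key(b))  # same payload, same key
print(cache_key(a) == cache_key(c))  # changed payload, different key
```

Note that the key covers only the payload. If you edit the system prompt, the schema, or the model name, previously cached results are still returned; one option is to include those values in the hashed dict, or simply delete the cache directory.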

## 6. Enrich one record (structured output)


In [None]:
SYSTEM = """You are helping enrich library catalog records.
Keep summaries short and factual. If unsure, be conservative.
Return only structured data that matches the schema."""

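def enrich_one(row: dict) -> EnrichedRecord:
    # Cast year to a built-in int so json.dumps never sees a NumPy scalar from pandas.
    payload = {"id": row["id"], "title": row["title"], "author": row["author"], "year": int(row["year"])}
    k = cache_key(payload)
    cached = cache_get(k)
    if cached is not None:
        # Cache hit: re-validate the stored JSON against the schema and skip the API call.
        return EnrichedRecord(**cached)

    resp = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role":"system","content": SYSTEM},
            {"role":"user","content": f"Enrich this record:\n{json.dumps(payload)}"}
        ],
        text_format=EnrichedRecord
    )
    out = resp.output_parsed
    if out is None:
        # No parsed object came back (for example, the model refused); do not cache it.
        raise ValueError(f"No structured output for record {payload['id']}")
    cache_set(k, out.model_dump())
    return out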

enrich_one(df.iloc[0].to_dict())

## 7. Batch enrichment with retries

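Wrap each call in retries so that transient errors (HTTP 429 rate limits, 5xx server errors) don't abort the batch. The simple pattern below uses exponential backoff and, for brevity, retries on any exception.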


In [None]:
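def enrich_with_retry(row: dict, max_retries: int = 3, base_sleep: float = 1.0) -> EnrichedRecord:
    last_err = None
    for attempt in range(max_retries):
        try:
            return enrich_one(row)
        except Exception as e:  # broad on purpose; narrow to transient errors in production
            last_err = e
            if attempt == max_retries - 1:
                break  # no attempts left, so skip the pointless final sleep
            sleep = base_sleep * (2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            print(f"Attempt {attempt+1}/{max_retries} failed: {e} (retrying in {sleep:.1f}s)")
            time.sleep(sleep)
    raise last_err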

out_rows=[]
for _, r in df.iterrows():
    out_rows.append(enrich_with_retry(r.to_dict()).model_dump())

df_enriched = pd.DataFrame(out_rows)
df_enriched
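
The loop above retries on every exception, including ones that will never succeed on a second try (a bad payload, a schema mismatch). A minimal sketch of a stricter variant follows: it retries only a designated error type and adds random jitter so parallel workers don't retry in lockstep. `TransientError`, `retry_transient`, and `flaky` are hypothetical names used only for this illustration; in the notebook you would catch the SDK's rate-limit and server-error exception classes instead (check the SDK documentation for the exact names).

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit (429) or server (5xx) error."""

def retry_transient(fn, max_retries: int = 3, base_sleep: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus up to 50% random jitter
            delay = base_sleep * (2 ** attempt) * (1 + random.random() / 2)
            time.sleep(delay)
    # Any other exception type propagates immediately, without a retry.

calls = {"n": 0}

def flaky():
    # Simulated call that fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return "ok"

result = retry_transient(flaky, base_sleep=0.0)  # base_sleep=0 keeps the demo instant
print(result, calls["n"])
```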

## 8. Join enriched fields back to your table


In [None]:
df_joined = df.merge(df_enriched, on="id", how="left")
df_joined
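
A left join keeps every catalog row, so a record whose enrichment is missing shows up silently as a row of `NaN` values. The sketch below, using small made-up frames rather than the notebook's data, shows one way to surface those rows with the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Made-up frames: bk_1002 has no enrichment
catalog = pd.DataFrame({"id": ["bk_1001", "bk_1002", "bk_1003"]})
enriched = pd.DataFrame({"id": ["bk_1001", "bk_1003"], "audience": ["Kids", "Adults"]})

joined = catalog.merge(
    enriched,
    on="id",
    how="left",
    indicator=True,          # adds a _merge column: "both" or "left_only"
    validate="one_to_one",   # raises if an id is duplicated on either side
)

# Catalog ids that found no enrichment row
missing = joined.loc[joined["_merge"] == "left_only", "id"].tolist()
print(missing)
```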

## 9. Cost awareness (very rough)

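In a real deployment you would count tokens with a tokenizer and multiply by the model's per-token price. Here we only show the idea: average characters per record, divided by four, gives a rough token count. The estimate covers the record fields only; the system prompt, the schema, and the model's output all add to the real total.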


In [None]:
def rough_tokens(text: str) -> int:
    # crude: ~4 chars/token
    return max(1, len(text)//4)

avg_prompt_tokens = int(df.apply(lambda r: rough_tokens(str(r.to_dict())), axis=1).mean())
print("Avg prompt tokens (rough):", avg_prompt_tokens)
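
To turn a token estimate into a dollar figure, multiply by batch size and price. The sketch below is only arithmetic: `estimate_cost` is a hypothetical helper, and the prices and token counts are placeholders chosen for round numbers, not any model's actual pricing. Look up current rates before relying on the result.

```python
def estimate_cost(
    n_items: int,
    prompt_tokens: int,
    output_tokens: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Rough batch cost in USD, given per-item token counts and per-million-token prices."""
    input_cost = n_items * prompt_tokens * usd_per_1m_input / 1_000_000
    output_cost = n_items * output_tokens * usd_per_1m_output / 1_000_000
    return input_cost + output_cost

# Placeholder numbers: 10,000 records, 150 input and 80 output tokens each
cost = estimate_cost(
    n_items=10_000,
    prompt_tokens=150,
    output_tokens=80,
    usd_per_1m_input=2.0,   # placeholder price
    usd_per_1m_output=8.0,  # placeholder price
)
print(f"Estimated cost: ${cost:.2f}")
```

With these placeholder values the input side is 1.5M tokens and the output side is 0.8M tokens. Cached records cost nothing on a re-run, so the figure is an upper bound for repeat batches.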

## 10. Exercises


In [None]:

# EXERCISE — SOLUTION
# Add a 'reading_level' field to the schema (e.g., 'K-2', '3-5', '6-8', '9-12', 'Adult'). Then re-run enrichment.

from typing import Literal

class EnrichedRecordV2(EnrichedRecord):
    reading_level: Literal["K-2","3-5","6-8","9-12","Adult"]

SYSTEM_V2 = SYSTEM + "\nAlso infer an approximate reading_level. If unsure, use Adult."

def enrich_one_v2(row: dict) -> EnrichedRecordV2:
    # Cast year to a built-in int so json.dumps never sees a NumPy scalar from pandas.
    payload = {"id": row["id"], "title": row["title"], "author": row["author"], "year": int(row["year"])}
    resp = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role":"system","content": SYSTEM_V2},
            {"role":"user","content": f"Enrich this record:\n{json.dumps(payload)}"}
        ],
        text_format=EnrichedRecordV2
    )
    return resp.output_parsed

out=[]
for _, r in df.iterrows():
    out.append(enrich_one_v2(r.to_dict()).model_dump())

pd.DataFrame(out)[["id","audience","reading_level","subject_tags"]]


In [None]:
# EXERCISE — SOLUTION
# Implement 'partial failure' handling: if one record fails after retries, store an error row and continue the batch.

out=[]
for _, r in df.iterrows():
    try:
        out.append(enrich_with_retry(r.to_dict()).model_dump() | {"error": None})
    except Exception as e:
        # Leave every enriched field empty so a failed row cannot be mistaken for a real result.
        out.append({"id": r["id"], "one_sentence_summary": None, "subject_tags": [], "audience": None, "tone": None, "error": str(e)})

df_out = pd.DataFrame(out)
df_out[["id","error"]]


In [None]:
# EXERCISE — SOLUTION
# Add a post-check that enforces subject_tags are unique and <= 8 items. If not, fix them.

def normalize_tags(tags: list[str]) -> list[str]:
    seen=[]
    for t in tags:
        t=t.strip()
        if t and t not in seen:
            seen.append(t)
    return seen[:8]

df_enriched2 = df_enriched.copy()
df_enriched2["subject_tags"] = df_enriched2["subject_tags"].apply(normalize_tags)
df_enriched2.head()
