# Jordan Peterson — Data Preparation Notebook

This notebook centralises all data-preparation steps for the Jordan Peterson
fine-tuning pipeline.  It is designed to run **once** before any fine-tuning
notebook and produces the Q&A dataset that all downstream notebooks consume.

## Pipeline Position

```
[This notebook]                  [Fine-tuning notebooks]
  PDF extraction                   Qwen3_14B_V2 / Qwen3_32B
      ↓                                    ↓
  Front-matter removal        ←  qa_dataset/peterson_qa.jsonl
      ↓
  Q&A generation
  (local model OR Anthropic API)
      ↓
  qa_dataset/peterson_qa.jsonl
```

## Why a Separate Notebook?

Previously the extraction + generation code was **duplicated** inside both the
V2 and 32B fine-tuning notebooks.  Two problems drove this refactor:

1. **Front-matter pollution**: extracted passages included title pages,
   copyright notices, table of contents, and forewords — publisher boilerplate,
   not Peterson's writing.
2. **API dependency**: question generation required a paid Anthropic API key.
   This notebook adds a **free local alternative** using a small HuggingFace
   model.


## Configuration File (`peterson_config.json`)

All tunable parameters live in `peterson_config.json` in this directory.
Edit that file to change paths, chunk sizes, or the generation backend —
no notebook code changes required.
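
As a rough illustration of the expected shape, the sketch below builds a config dictionary with the keys this notebook reads and checks a few invariants. Every value is a placeholder rather than the project's actual setting, and `validate_config` is a hypothetical helper that is not part of the notebook.

```python
import json

# Placeholder values for illustration only; the real peterson_config.json may differ.
example_config = {
    "paths": {"books_dir": "books", "qa_cache": "qa_dataset/peterson_qa.jsonl"},
    "extraction": {"chunk_words": 300, "overlap_words": 50, "min_chunk_words": 100},
    "generation": {
        "questions_per_passage": 2,
        "max_new_tokens": 256,
        "backend": "local",
        "local_model": "<local model id>",
        "anthropic_model": "claude-haiku-4-5-20251001",
    },
    "system_prompt": "<system prompt text>",
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for section in ("paths", "extraction", "generation", "system_prompt"):
        if section not in cfg:
            problems.append(f"missing top-level key: {section}")
    ext = cfg.get("extraction", {})
    # A non-positive step (chunk_words - overlap_words) would break chunking.
    if ext.get("overlap_words", 0) >= ext.get("chunk_words", 1):
        problems.append("overlap_words must be smaller than chunk_words")
    if cfg.get("generation", {}).get("backend") not in ("local", "anthropic"):
        problems.append("backend must be 'local' or 'anthropic'")
    return problems


print(json.dumps(example_config, indent=2))
print(validate_config(example_config))
```

Running a check like this right after loading the file surfaces a bad edit before any PDF is opened or any model is loaded.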

### Backend Comparison

| Backend | Model | VRAM | Approx. run time | Cost |
|---------|-------|------|------------------|------|
| `local` (default) | Qwen3-4B 4-bit | ~2.5 GB | ~45 min | Free |
| `local` | Qwen3-1.7B 4-bit | ~1.2 GB | ~25 min | Free |
| `local` | Qwen3-8B 4-bit | ~5 GB | ~80 min | Free |
| `local` | Phi-4-mini-instruct | ~2.5 GB | ~45 min | Free |
| `anthropic` | claude-haiku-4-5-20251001 | 0 GB | ~20 min | ~$1–3 |

To switch to the Anthropic backend, set `"backend": "anthropic"` in
`peterson_config.json` and ensure `ANTHROPIC_API_KEY` is set.
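
A minimal pre-flight check might look like the sketch below. It assumes the key is read from the process environment only (the notebook's own setup cell also falls back to `~/.env`), and `check_backend_ready` is a hypothetical helper name.

```python
import os


def check_backend_ready(backend: str, env=None) -> bool:
    """Return True if the chosen backend has what it needs to start.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    if backend == "local":
        return True  # the local model needs no API key
    if backend == "anthropic":
        return bool(env.get("ANTHROPIC_API_KEY", "").strip())
    raise ValueError(f"Unknown backend {backend!r}")


print(check_backend_ready("local"))
```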


In [None]:
import json
import re
import os
import sys
import time
import gc
from pathlib import Path

import fitz    # PyMuPDF
import torch

# ── Load config ───────────────────────────────────────────────────────────────
_config_path = Path(__file__).parent / "peterson_config.json" if "__file__" in dir() else Path("peterson_config.json")
with open(_config_path) as _f:
    config = json.load(_f)

# Paths
BOOKS_DIR = Path(config["paths"]["books_dir"])
QA_CACHE  = Path(config["paths"]["qa_cache"])
QA_DIR    = QA_CACHE.parent
QA_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

# Extraction params
CHUNK_WORDS     = config["extraction"]["chunk_words"]
OVERLAP_WORDS   = config["extraction"]["overlap_words"]
MIN_CHUNK_WORDS = config["extraction"]["min_chunk_words"]

# Generation params
QUESTIONS_PER_PASSAGE = config["generation"]["questions_per_passage"]
MAX_NEW_TOKENS        = config["generation"]["max_new_tokens"]
BACKEND               = config["generation"]["backend"]
LOCAL_MODEL_NAME      = config["generation"]["local_model"]
ANTHROPIC_MODEL       = config["generation"]["anthropic_model"]
SYSTEM_PROMPT         = config["system_prompt"]

print("Configuration loaded:")
print(f"  Books dir          : {BOOKS_DIR.resolve()}")
print(f"  Q&A cache          : {QA_CACHE.resolve()}")
print(f"  Chunk words        : {CHUNK_WORDS}  (overlap: {OVERLAP_WORDS}, min: {MIN_CHUNK_WORDS})")
print(f"  Questions/passage  : {QUESTIONS_PER_PASSAGE}")
print(f"  Backend            : {BACKEND}")
if BACKEND == "local":
    print(f"  Local model        : {LOCAL_MODEL_NAME}")
else:
    print(f"  Anthropic model    : {ANTHROPIC_MODEL}")
print(f"  CUDA available     : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU                : {torch.cuda.get_device_name(0)}")
    print(f"  VRAM               : {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")


---
# Part 1 — PDF Extraction with Front-Matter Removal

## The Front-Matter Problem

When PyMuPDF extracts a book PDF page-by-page, the first several pages contain
publisher boilerplate that does NOT reflect Peterson's writing:

- **Title page**: book title, author name, publisher logo
- **Copyright page**: ISBN, rights notices, Library of Congress data
- **Table of contents**: chapter names and page numbers (not prose)
- **Foreword**: written by someone else (e.g. Norman Doidge in *12 Rules*)
- **Preface / Overture**: sometimes counts as front matter

Including these in training data teaches the model to regurgitate publishing
metadata when asked Peterson-style questions.

## Detection Strategy (3 Tiers)

**Tier 1 — Chapter marker page** (works for 12 Rules, Beyond Order)

Search for a page that:
- Contains a chapter-1 marker (`Chapter 1`, `Rule 1`, `Rule I`, `Overture`, ...)
- Has ≥ 150 words (real content, not just a chapter heading on an otherwise blank page)
- Does NOT also contain a chapter-2 marker (filters out TOC entries)

**Tier 2 — Post-copyright content** (needed for We Who Wrestle with God)

Some books don't have explicit chapter-1 markers early enough. Fall back to:
- Find the last page containing copyright/publisher indicators (ISBN, etc.)
- Return the first subsequent page with ≥ 200 words

**Tier 3 — No-op** (fallback for Maps of Meaning, which starts immediately)

If neither tier succeeds, return page 0 — extract from the beginning.
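
To make the tier order concrete, here is a self-contained toy version run on synthetic pages. The patterns are deliberately simplified stand-ins for the notebook's regexes, and `first_content_page` is a hypothetical name used only for this demonstration.

```python
import re

# Simplified stand-ins for the notebook's patterns (assumptions for this demo).
CH1 = re.compile(r'\b(Chapter\s+1|Rule\s+1)\b', re.IGNORECASE)
CH2 = re.compile(r'\b(Chapter\s+2|Rule\s+2)\b', re.IGNORECASE)
FRONT = re.compile(r'ISBN|All rights reserved', re.IGNORECASE)


def first_content_page(pages: list[str]) -> int:
    # Tier 1: chapter-1 marker, enough words, no chapter-2 marker on the page
    for i, page in enumerate(pages):
        if CH1.search(page) and len(page.split()) >= 150 and not CH2.search(page):
            return i
    # Tier 2: first page with >= 200 words after the last copyright-style page
    front = [i for i, p in enumerate(pages) if FRONT.search(p)]
    if front:
        for i in range(front[-1] + 1, len(pages)):
            if len(pages[i].split()) >= 200:
                return i
    # Tier 3: nothing matched, so start at the beginning
    return 0


prose = "word " * 250  # stands in for a page of real text
book_with_chapters = [
    "Title Page",
    "ISBN 000 All rights reserved",
    "Contents Chapter 1 ... Chapter 2 ...",   # TOC: too short, has a chapter-2 marker
    "Chapter 1 " + prose,
]
book_without_markers = ["Title Page", "ISBN 000", "Short dedication", prose]
book_plain = [prose, prose]

print(first_content_page(book_with_chapters))    # resolved by Tier 1
print(first_content_page(book_without_markers))  # resolved by Tier 2
print(first_content_page(book_plain))            # falls through to Tier 3
```

The table-of-contents page is rejected twice over (word count and chapter-2 marker), which is why both conditions are part of Tier 1.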


In [None]:
def clean_text(raw: str) -> str:
    """Remove PDF artefacts: control chars, excess whitespace, ligatures."""
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', raw)
    text = re.sub(r'\s+', ' ', text)
    text = text.replace('\ufb01', 'fi').replace('\ufb02', 'fl')
    return text.strip()


# ── Front-matter detection patterns ──────────────────────────────────────────
# re.IGNORECASE already covers upper-case variants, so each marker is listed once.
_CHAPTER1_RE = re.compile(
    r'\b(Chapter\s+1|Rule\s+1|Rule\s+I'
    r'|Overture'
    r'|Cain\s+and\s+Abel)\b',
    re.IGNORECASE,
)

_CHAPTER2_RE = re.compile(
    r'\b(Chapter\s+2|Rule\s+2|Rule\s+II'
    r'|Noah)\b',
    re.IGNORECASE,
)

_FRONTMATTER_RE = re.compile(
    r'ISBN|All rights reserved|Library of Congress'
    r'|copyright.*\d{4}|\d{4}.*copyright'
    r'|Published by|penguinrandomhouse|routledge\.com',
    re.IGNORECASE,
)


def _find_chapter1_page(pages: list[str]) -> int:
    """
    Return the index of the first real-content page (skipping front matter).

    Uses a 3-tier heuristic — see the markdown cell above for rationale.
    """
    # Tier 1: page with chapter-1 marker, ≥150 words, no chapter-2 marker
    for i, page in enumerate(pages):
        if (_CHAPTER1_RE.search(page)
                and len(page.split()) >= 150
                and not _CHAPTER2_RE.search(page)):
            return i

    # Tier 2: find last copyright/publisher page, return first ≥200-word page after it
    fm_pages = [i for i, p in enumerate(pages) if _FRONTMATTER_RE.search(p)]
    if fm_pages:
        last_fm = max(fm_pages)
        for i in range(last_fm + 1, len(pages)):
            if len(pages[i].split()) >= 200:
                return i

    # Tier 3: no-op — start from beginning
    return 0


def extract_chunks(
    pdf_path: Path,
    chunk_words: int = CHUNK_WORDS,
    overlap_words: int = OVERLAP_WORDS,
    min_chunk_words: int = MIN_CHUNK_WORDS,
) -> tuple[list[str], int]:
    """
    Extract text from a PDF, skip front matter, and split into overlapping chunks.

    Returns (chunks, pages_skipped).
    """
    doc   = fitz.open(str(pdf_path))
    pages = [clean_text(page.get_text()) for page in doc]
    doc.close()

    start_page = _find_chapter1_page(pages)
    content_pages = pages[start_page:]
    full_text = ' '.join(content_pages)

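    words = full_text.split()
    step  = chunk_words - overlap_words
    if step <= 0:
        # range() would raise on step 0 and yield nothing on a negative step
        raise ValueError("overlap_words must be smaller than chunk_words")
    chunks = []
    for start in range(0, len(words), step):
        window = words[start: start + chunk_words]
        if len(window) >= min_chunk_words:
            chunks.append(' '.join(window))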

    return chunks, start_page


print("Extraction functions defined.")
print(f"  Patterns: _CHAPTER1_RE, _CHAPTER2_RE, _FRONTMATTER_RE")
print(f"  Functions: clean_text(), _find_chapter1_page(), extract_chunks()")


In [None]:
# ── Run extraction over all 4 books ──────────────────────────────────────────
pdf_files = sorted(BOOKS_DIR.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files in {BOOKS_DIR.resolve()}")
for p in pdf_files:
    print(f"  {p.name}")

all_chunks = []    # flat list of passage strings
chunk_meta = []    # parallel list of {"book": label} dicts

def _label_pdf(fname: str) -> str:
    fname = fname.lower()
    if "maps" in fname:
        return "Maps of Meaning"
    elif "12 rules" in fname or "antidote" in fname:
        return "12 Rules for Life"
    elif "beyond" in fname:
        return "Beyond Order"
    else:
        return "We Who Wrestle with God"

print()
for pdf in pdf_files:
    label = _label_pdf(pdf.name)
    chunks, skipped = extract_chunks(pdf)
    all_chunks.extend(chunks)
    chunk_meta.extend({"book": label} for _ in chunks)  # one dict per chunk, no shared object
    total_words = sum(len(c.split()) for c in chunks)
    print(f"  {label:<35}  skipped {skipped:2d} pages  |  "
          f"{len(chunks):4d} chunks  (~{total_words:,} words)")

print(f"\nTotal passages: {len(all_chunks):,}")


In [None]:
# ── Spot-check: first ~120 chars of first chunk per book ─────────────────────
# Should show actual prose, NOT copyright/ISBN/title-page text.
print("First chunk preview per book (verify no boilerplate):")
print("=" * 72)
seen = set()
for chunk, meta in zip(all_chunks, chunk_meta):
    book = meta["book"]
    if book not in seen:
        seen.add(book)
        preview = chunk[:120].replace("\n", " ")
        print(f"\n{book}:")
        print(f"  {preview}...")
    if len(seen) == 4:
        break


---
# Part 2 — Q&A Generation

## Backend Options

This notebook supports two question-generation backends, controlled by
`"backend"` in `peterson_config.json`:

| Setting | Description |
|---------|-------------|
| `"local"` (default) | Loads a small 4-bit model via Unsloth — **free, no API key** |
| `"anthropic"` | Calls Claude Haiku API — faster (~20 min), costs ~$1–3 |

### Why the Local Model Works

Question generation is not a demanding task. We only need plausible, on-topic
questions that the passage directly answers. The *training signal* is entirely
in the **answers** (Peterson's verbatim text), not the questions.

A Qwen3-4B model at 4-bit quantization (~2.5 GB VRAM) is more than capable of
this and runs on the same GPU as the fine-tuning pipeline.

### Speed Estimates (RTX 4090)

| Model | VRAM | Time for ~2,519 passages × 2 Q |
|-------|------|-------------------------------|
| Qwen3-1.7B 4-bit | ~1.2 GB | ~25 min |
| **Qwen3-4B 4-bit** | **~2.5 GB** | **~45 min** |
| Qwen3-8B 4-bit | ~5 GB | ~80 min |
| Phi-4-mini-instruct | ~2.5 GB | ~45 min |
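
To scale these estimates to a different corpus size, it can help to convert them to per-passage averages. The sketch below uses only the figures from the table; actual throughput will vary with hardware and passage length, and `seconds_per_passage` is a hypothetical helper.

```python
# Figures taken from the table: total minutes for ~2,519 passages.
PASSAGES = 2519
minutes_by_model = {
    "Qwen3-1.7B 4-bit": 25,
    "Qwen3-4B 4-bit": 45,
    "Qwen3-8B 4-bit": 80,
    "Phi-4-mini-instruct": 45,
}


def seconds_per_passage(total_minutes: float, passages: int = PASSAGES) -> float:
    """Average wall-clock seconds spent on one passage."""
    return total_minutes * 60 / passages


for model, minutes in minutes_by_model.items():
    print(f"{model:<22} ~{seconds_per_passage(minutes):.2f} s/passage")
```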


In [None]:
# ── Backend selection ─────────────────────────────────────────────────────────
print(f"Generation backend: {BACKEND!r}")

if BACKEND == "local":
    print(f"Local model       : {LOCAL_MODEL_NAME}")
    print(f"Max new tokens    : {MAX_NEW_TOKENS}")
    print("\nSkipping Anthropic setup — using local model.")
elif BACKEND == "anthropic":
    print(f"Anthropic model   : {ANTHROPIC_MODEL}")
    print("\nSkipping local model setup — using Anthropic API.")
else:
    raise ValueError(f"Unknown backend {BACKEND!r}. Must be 'local' or 'anthropic'.")


In [None]:
# ── Local model setup (skipped if backend=anthropic) ─────────────────────────
gen_model = None
gen_tokenizer = None

if BACKEND == "local":
    from unsloth import FastLanguageModel

    print(f"Loading {LOCAL_MODEL_NAME} for inference ...")
    gen_model, gen_tokenizer = FastLanguageModel.from_pretrained(
        model_name     = LOCAL_MODEL_NAME,
        max_seq_length = 1024,
        load_in_4bit   = True,
        dtype          = None,
    )
    FastLanguageModel.for_inference(gen_model)

    vram_used = torch.cuda.memory_reserved() / 1e9
    print(f"Model loaded. VRAM reserved: {vram_used:.1f} GB")
    print("gen_model and gen_tokenizer are ready.")
else:
    print("Local model setup skipped (backend=anthropic).")


In [None]:
# ── Anthropic setup (skipped if backend=local) ────────────────────────────────
anthropic_client = None

if BACKEND == "anthropic":
    import subprocess
    _result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "anthropic", "-q"],
        capture_output=True, text=True,
    )
    if _result.returncode != 0:
        print("pip install warning:", _result.stderr[:300])

    import anthropic as _anthropic_module

    # Load API key from env or ~/.env
    _api_key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not _api_key:
        _env_file = Path.home() / ".env"
        if _env_file.exists():
            for _line in _env_file.read_text().splitlines():
                if _line.startswith("ANTHROPIC_API_KEY="):
                    _api_key = _line.split("=", 1)[1].strip().strip('"').strip("'")
                    break

    if not _api_key:
        raise EnvironmentError(
            "ANTHROPIC_API_KEY not found.\n"
            "Set it with:  export ANTHROPIC_API_KEY='sk-ant-...'\n"
            "Or add it to ~/.env as:  ANTHROPIC_API_KEY=sk-ant-..."
        )

    anthropic_client = _anthropic_module.Anthropic(api_key=_api_key)
    print(f"Anthropic SDK ready (v{_anthropic_module.__version__}).")
    print(f"Using model: {ANTHROPIC_MODEL}")
else:
    print("Anthropic setup skipped (backend=local).")


In [None]:
# ── Generation functions ──────────────────────────────────────────────────────

def _make_prompt(passage: str, book: str) -> str:
    """Shared prompt template used by both backends."""
    return (
        f"You are building a training dataset for a Jordan B. Peterson AI model.\n\n"
        f"The passage below is from Peterson's book '{book}'. Generate exactly "
        f"{QUESTIONS_PER_PASSAGE} questions that:\n"
        f"1. This passage directly and substantively answers\n"
        f"2. Someone interested in Peterson's ideas might genuinely ask\n"
        f"3. Cover different angles of the passage (e.g. one concrete, one philosophical)\n\n"
        f"Peterson's topics include: order vs chaos, meaning, personal responsibility, "
        f"suffering, mythology, archetypes, the shadow, logos, truth, religion, "
        f"Jungian psychology, hierarchy, heroism, sacrifice, being.\n\n"
        f"Return ONLY a JSON array of exactly {QUESTIONS_PER_PASSAGE} question strings. "
        f"No other text. No markdown fences.\n"
        f'Example: ["Why is confronting chaos necessary for meaning?", '
        f'"What role does suffering play in personal development?"]\n\n'
        f"Passage:\n{passage}"
    )


def _parse_questions(raw: str) -> list[str]:
    """Strip <think>...</think> blocks and extract a JSON array of questions."""
    # Remove thinking blocks (Qwen3 with enable_thinking=True adds these)
    cleaned = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
    start = cleaned.find('[')
    if start == -1:
        raise ValueError(f"No JSON array found in: {cleaned[:120]!r}")
    # raw_decode parses one JSON value and ignores trailing text, so a ']'
    # inside a question string cannot truncate the array (the earlier
    # non-greedy regex r'\[.*?\]' stopped at the first ']').
    questions, _ = json.JSONDecoder().raw_decode(cleaned[start:])
    if not isinstance(questions, list) or not questions:
        raise ValueError(f"Expected a non-empty list, got: {questions!r}")
    return [str(q).strip() for q in questions[:QUESTIONS_PER_PASSAGE]]


def _generate_local(passage: str, book: str, max_retries: int = 3) -> list[str]:
    """Generate questions using the local gen_model."""
    prompt_text = _make_prompt(passage, book)
    messages = [{"role": "user", "content": prompt_text}]

    for attempt in range(max_retries):
        try:
            # Format with ChatML template (enable_thinking=False for structured output)
            text = gen_tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,
            )
            inputs = gen_tokenizer(text, return_tensors="pt").to("cuda")

            with torch.no_grad():
                out = gen_model.generate(
                    **inputs,
                    max_new_tokens   = MAX_NEW_TOKENS,
                    do_sample        = True,
                    temperature      = 0.7,
                    top_p            = 0.8,
                    top_k            = 20,
                    repetition_penalty = 1.1,
                )

            new_tokens = out[0][inputs["input_ids"].shape[1]:]
            raw = gen_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
            return _parse_questions(raw)

        except Exception as e:
            wait = 2 ** attempt
            if attempt < max_retries - 1:
                print(f"    [retry {attempt+1}] {e!s:.80} — wait {wait}s")
                time.sleep(wait)
            else:
                print(f"    [FAILED] {e!s:.80}")
                return []
    return []


def _generate_anthropic(passage: str, book: str, max_retries: int = 3) -> list[str]:
    """Generate questions using the Anthropic API."""
    prompt_text = _make_prompt(passage, book)

    for attempt in range(max_retries):
        try:
            response = anthropic_client.messages.create(
                model      = ANTHROPIC_MODEL,
                max_tokens = MAX_NEW_TOKENS,
                messages   = [{"role": "user", "content": prompt_text}],
            )
            raw = response.content[0].text.strip()
            return _parse_questions(raw)

        except Exception as e:
            wait = 2 ** attempt
            if attempt < max_retries - 1:
                print(f"    [retry {attempt+1}] {e!s:.80} — wait {wait}s")
                time.sleep(wait)
            else:
                print(f"    [FAILED] {e!s:.80}")
                return []
    return []


def generate_questions(passage: str, book: str) -> list[str]:
    """Dispatcher: route to local or Anthropic backend based on config."""
    if BACKEND == "local":
        return _generate_local(passage, book)
    else:
        return _generate_anthropic(passage, book)


print("Generation functions defined:")
print("  _make_prompt(), _parse_questions()")
print("  _generate_local(), _generate_anthropic(), generate_questions()")


In [None]:
# ── Cache-aware generation loop ───────────────────────────────────────────────
existing_records = []
if QA_CACHE.exists():
    with open(QA_CACHE) as _f:
        existing_records = [json.loads(line) for line in _f if line.strip()]

_expected  = len(all_chunks) * QUESTIONS_PER_PASSAGE
_coverage  = len(existing_records) / _expected if _expected else 0

print(f"Passages            : {len(all_chunks):,}")
print(f"Expected Q&A pairs  : {_expected:,}  ({QUESTIONS_PER_PASSAGE} per passage)")
print(f"Cached              : {len(existing_records):,}  ({100*_coverage:.1f}% coverage)")

if _coverage >= 0.90:
    print("\nCache is ≥90% complete — skipping generation.")
    print(f"Delete {QA_CACHE} to force regeneration.")
else:
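    # Resume by passage text, not record count: failed passages write no
    # records, so a count-based offset can drift and re-process passages.
    _done_answers  = {r["answer"] for r in existing_records}
    _todo          = [(c, m) for c, m in zip(all_chunks, chunk_meta)
                      if c not in _done_answers]
    already_done   = len(all_chunks) - len(_todo)
    remaining      = [c for c, _ in _todo]
    remaining_meta = [m for _, m in _todo]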
    skipped        = 0
    t_start        = time.time()

    print(f"\nGenerating questions for {len(remaining):,} passages "
          f"(resuming from passage {already_done + 1})...")
    if BACKEND == "anthropic":
        cost_est = len(remaining) * 0.00068
        print(f"Estimated API cost: ~${cost_est:.2f}")
    print()

    with open(QA_CACHE, "a") as out_f:
        for i, (passage, meta) in enumerate(zip(remaining, remaining_meta)):
            global_idx = already_done + i + 1

            if i % 50 == 0:
                elapsed = time.time() - t_start
                pct = 100 * (already_done + i) / len(all_chunks)
                print(f"  [{global_idx:4d}/{len(all_chunks):4d}]  {pct:5.1f}%  "
                      f"{elapsed:5.0f}s elapsed  book: {meta['book']}")

            questions = generate_questions(passage, meta["book"])

            if not questions:
                skipped += 1
                continue

            for q in questions:
                record = {"question": q, "answer": passage, "book": meta["book"]}
                out_f.write(json.dumps(record) + "\n")
            out_f.flush()

            # Rate-limit only for Anthropic (local GPU doesn't need this)
            if BACKEND == "anthropic":
                time.sleep(0.3)

    elapsed_total = time.time() - t_start
    print(f"\nGeneration complete in {elapsed_total/60:.1f} min.")
    print(f"  Passages processed : {len(remaining) - skipped:,}")
    print(f"  Passages skipped   : {skipped}")

# Re-read final dataset
with open(QA_CACHE) as _f:
    qa_records = [json.loads(line) for line in _f if line.strip()]

print(f"\nTotal Q&A pairs in cache: {len(qa_records):,}")


---
# Part 3 — Dataset Statistics and Cleanup

Verify the dataset before using it for fine-tuning.
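
Before looking at the statistics, a structural check of the JSONL file can catch truncated or malformed lines. The sketch below assumes each record carries the `question`, `answer`, and `book` keys written by the generation loop; `find_bad_records` is a hypothetical helper, not part of the notebook.

```python
import json

REQUIRED_KEYS = {"question", "answer", "book"}


def find_bad_records(lines: list[str]) -> list[int]:
    """Return 1-based line numbers of JSONL lines that fail basic checks."""
    bad = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored, as in the notebook's loader
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            bad.append(n)
            continue
        if not isinstance(rec, dict) or not REQUIRED_KEYS.issubset(rec):
            bad.append(n)
        elif not all(isinstance(rec[k], str) and rec[k].strip() for k in REQUIRED_KEYS):
            bad.append(n)
    return bad


sample = [
    json.dumps({"question": "Q?", "answer": "A passage.", "book": "Beyond Order"}),
    '{"question": "missing answer", "book": "Beyond Order"}',
    "not json at all",
]
print(find_bad_records(sample))
```

Against the real cache, the same function could be called as `find_bad_records(QA_CACHE.read_text().splitlines())`.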


In [None]:
from collections import Counter

# ── Distribution by book ──────────────────────────────────────────────────────
book_counts = Counter(r["book"] for r in qa_records)
print("Q&A pairs by book:")
for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
    print(f"  {book:<35}  {count:4d} pairs  ({100*count/len(qa_records):.1f}%)")

# ── Answer length stats ───────────────────────────────────────────────────────
ans_lengths = [len(r["answer"].split()) for r in qa_records]
q_lengths   = [len(r["question"].split()) for r in qa_records]
print(f"\nAnswer length (words):")
print(f"  Min: {min(ans_lengths)}  Max: {max(ans_lengths)}  "
      f"Mean: {sum(ans_lengths)/len(ans_lengths):.0f}  "
      f"Median: {sorted(ans_lengths)[len(ans_lengths)//2]}")
print(f"Question length (words):")
print(f"  Min: {min(q_lengths)}  Max: {max(q_lengths)}  "
      f"Mean: {sum(q_lengths)/len(q_lengths):.1f}")

# ── Sample Q&A per book ───────────────────────────────────────────────────────
print("\n" + "=" * 72)
print("SAMPLE Q&A PAIRS — first example per book (spot-check for boilerplate)")
print("=" * 72)
seen_books: set[str] = set()
for r in qa_records:
    if r["book"] not in seen_books:
        seen_books.add(r["book"])
        print(f"\nBook: {r['book']}")
        print(f"Q: {r['question']}")
        answer_preview = r["answer"][:200].replace("\n", " ")
        print(f"A: {answer_preview}...")
    if len(seen_books) == 4:
        break

print("\n" + "=" * 72)
print("If any answer above starts with ISBN/copyright/title text,")
print("the front-matter detection needs adjustment for that book.")


In [None]:
# ── Cleanup: unload local model to free VRAM ─────────────────────────────────
if BACKEND == "local" and gen_model is not None:
    print("Unloading local generation model to free VRAM...")
    del gen_model
    del gen_tokenizer
    gen_model = None
    gen_tokenizer = None
    gc.collect()
    torch.cuda.empty_cache()
    vram_after = torch.cuda.memory_reserved() / 1e9
    print(f"VRAM after unload: {vram_after:.1f} GB")
else:
    print("No local model to unload.")

# ── Final summary ─────────────────────────────────────────────────────────────
print()
print("=" * 60)
print("DATA PREPARATION COMPLETE")
print("=" * 60)
print(f"  Q&A cache path  : {QA_CACHE.resolve()}")
print(f"  Total Q&A pairs : {len(qa_records):,}")
print(f"  Books covered   : {len(book_counts)}")
print()
print("Next step: run a fine-tuning notebook that reads this cache.")
print("  Qwen3_14B_JordanPeterson_V2_FineTuning.ipynb")
print("  Qwen3_32B_JordanPeterson_FineTuning.ipynb")
