Scripts and Claude Code skills for digitising physical books from scans and preparing manuscripts for publication in Vellum.
scatterpub-toolchain-example is a ready-to-run project containing real sample scans and a step-by-step tutorial. Fork it, follow the tutorial, and you will have a complete working pipeline in around 30 minutes — without needing to set up a project from scratch.
| Component | Purpose |
|---|---|
| `scripts/extract-vellum.py` | Extract a Vellum project to Markdown |
| `scripts/ocr-to-markdown.py` | OCR clean page scans into raw Markdown |
| `scripts/clean-ocr.py` | Clean OCR artefacts, running headers, invisible characters |
| `scripts/clean-vellum.py` | Remove invisible artefacts from a `.vellum` package |
| `scripts/md-to-docx.py` | Convert Markdown to Word for Vellum import |
| `.claude/skills/copyeditor` | Claude Code skill: copy-edit; style guide driven by `language` in `book.md` |
| `.claude/skills/pdf` | Claude Code skill: general-purpose PDF processing |
| `.claude/skills/skill-creator` | Claude Code skill: create and improve skills |
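
As a rough illustration of the kind of clean-up the `clean-ocr.py` row refers to, the sketch below strips a few common invisible characters and rejoins words that were hyphenated across a line break. It is not the script's actual code: the names `strip_invisible` and `join_hyphens`, the character list, and the lowercase-only hyphen rule are all assumptions made for this example.

```python
import re

# Hypothetical selection of invisible characters that often survive OCR or copy-paste.
INVISIBLE = {
    "\u00ad": "",   # soft hyphen
    "\u200b": "",   # zero-width space
    "\u200c": "",   # zero-width non-joiner
    "\u200d": "",   # zero-width joiner
    "\ufeff": "",   # byte-order mark
    "\u00a0": " ",  # no-break space becomes a plain space
}
_TABLE = str.maketrans(INVISIBLE)

# A lowercase letter, a hyphen, a line break, then another lowercase letter.
_HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")


def strip_invisible(text: str) -> str:
    """Delete or normalise the characters listed in INVISIBLE."""
    return text.translate(_TABLE)


def join_hyphens(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    raw = "The digi\u00adtised text was care-\nfully checked.\ufeff"
    print(join_hyphens(strip_invisible(raw)))
    # prints: The digitised text was carefully checked.
```

A rule this simple also joins genuine compounds that happen to break at a line end (for example `well-` / `known`), so the output of any such pass still needs a human or copy-editing review.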
The toolchain is designed to be embedded in a book project repository as a git submodule.
git submodule add https://github.com/scattercode/scatterpub-toolchain.git toolchain
git submodule update --init

brew install poppler tesseract pandoc # system tools
cd toolchain && poetry install # Python deps

Back in the project root, symlink the skills so that Claude Code can discover them via the standard `.claude/skills/` path:
mkdir -p .claude/skills
ln -s ../../toolchain/.claude/skills/copyeditor .claude/skills/copyeditor
ln -s ../../toolchain/.claude/skills/pdf .claude/skills/pdf
ln -s ../../toolchain/.claude/skills/skill-creator .claude/skills/skill-creator

Commit the symlinks: git tracks them, so they are wired up automatically for every contributor.
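
The same links can also be created programmatically, which may be convenient if the set of skills in the submodule changes. The following is a minimal sketch, not part of the toolchain: the function name `link_skills` and its parameters are hypothetical, and it assumes the submodule lives in `toolchain/` as in the commands above.

```python
from pathlib import Path
from typing import List


def link_skills(project_root: Path, submodule: str = "toolchain") -> List[str]:
    """Symlink every skill folder in the submodule into <project_root>/.claude/skills."""
    source_dir = project_root / submodule / ".claude" / "skills"
    target_dir = project_root / ".claude" / "skills"
    target_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for skill in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        link = target_dir / skill.name
        if link.is_symlink() or link.exists():
            continue  # leave existing links or folders untouched
        # Relative target, so the link still resolves wherever the repository is cloned.
        target = Path("..") / ".." / submodule / ".claude" / "skills" / skill.name
        link.symlink_to(target, target_is_directory=True)
        created.append(skill.name)
    return created


if __name__ == "__main__":
    # Run from the project root.
    print(link_skills(Path(".")))
```

The relative target mirrors the `../../toolchain/...` form used in the `ln -s` commands, and running the function a second time creates nothing new.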
In each book folder, create a book.md with YAML front matter:
---
title: "Book Title"
author: Author Name
language: en-GB
---

The scripts read this file automatically and inject its contents as front matter into every generated Markdown file. The `language` field also tells the `/copyeditor` skill which style guide to apply:

| `language` | Style guide |
|---|---|
| en-GB | Hart's Rules (British English) |
| en-US | Chicago Manual of Style, 18th edition (US English) |
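
To make the role of `book.md` concrete, here is a minimal sketch of reading the front matter, looking up the style guide from the table above, and prepending the metadata to generated Markdown. It is an illustration under stated assumptions, not the scripts' actual code: the names `read_front_matter`, `with_front_matter`, and `STYLE_GUIDES` are hypothetical, and the parser handles only flat `key: value` pairs rather than full YAML.

```python
import json
from typing import Dict

# Mapping copied from the table above.
STYLE_GUIDES = {
    "en-GB": "Hart's Rules (British English)",
    "en-US": "Chicago Manual of Style, 18th edition (US English)",
}


def read_front_matter(text: str) -> Dict[str, str]:
    """Parse flat `key: value` pairs between the opening and closing `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}  # no closing delimiter: treat the text as having no front matter


def with_front_matter(meta: Dict[str, str], body: str) -> str:
    """Prepend the metadata to a Markdown body, double-quoting every value."""
    header = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in meta.items()
    )
    return f"---\n{header}\n---\n\n{body}"


if __name__ == "__main__":
    book = '---\ntitle: "Book Title"\nauthor: Author Name\nlanguage: en-GB\n---\n'
    meta = read_front_matter(book)
    print(STYLE_GUIDES[meta["language"]])
    print(with_front_matter(meta, "# Chapter 1\n"))
```

Quoting every value on output keeps titles containing a colon valid; nested YAML or multi-line values would need a real YAML parser.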
To use the toolchain directly (not as a submodule):
git clone https://github.com/scattercode/scatterpub-toolchain.git
cd scatterpub-toolchain
brew install poppler tesseract pandoc
poetry install

When the source is an existing Vellum project, the `.vellum` file is treated as final, manually edited content. Extract it and review it directly; no automated cleaning pass is needed.
# Extract to Markdown
python3 toolchain/scripts/extract-vellum.py "publishing/<title>/<title>.vellum" \
"publishing/<title>/draft/<slug>.md"
# Copy-edit review
# Load /copyeditor in Claude Code and provide the extracted Markdown path
# Generate Word document
python3 toolchain/scripts/md-to-docx.py "publishing/<title>/draft/<slug>.md"

When the source is a set of page scans, OCR and clean them before copy-editing:

# Step 1: OCR the clean scans
toolchain/.venv/bin/python toolchain/scripts/ocr-to-markdown.py \
"publishing/<title>/ocr/scans/clean"
# Step 2: Clean the raw output
python3 toolchain/scripts/clean-ocr.py "publishing/<title>/ocr/<slug>-raw.md" --join-hyphens
# Step 3: Copy-edit
# Load /copyeditor in Claude Code and provide the cleaned Markdown path

| Tool | Install | Required for |
|---|---|---|
| Python 3.6+ | system / pyenv | all scripts |
| Poetry | `pip install poetry` | Python dependency management |
| pandoc | `brew install pandoc` | .docx conversion |
| poppler | `brew install poppler` | PDF text extraction |
| tesseract | `brew install tesseract` | `--ocr tesseract` |
| marker-pdf | `poetry install` | `--ocr marker` (default) |

The .vellum package format is macOS-specific; extract-vellum.py and clean-vellum.py require macOS.
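
Before running the scripts, it can help to confirm that the command-line tools from the requirements table are on the `PATH`. The check below is a small sketch, not something shipped with the toolchain: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and it assumes poppler can be detected through its `pdftotext` binary.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Executable name -> what the requirements table says it is needed for.
# Assumption: poppler is detected via `pdftotext`, one of the binaries it installs.
REQUIRED_TOOLS = {
    "poetry": "Python dependency management",
    "pandoc": ".docx conversion",
    "pdftotext": "PDF text extraction (poppler)",
    "tesseract": "--ocr tesseract",
}


def missing_tools(
    tools: Dict[str, str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of the tools that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"missing: {name} (needed for {REQUIRED_TOOLS[name]})")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use `missing_tools()` is called with no arguments.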