Scatterpub Toolchain

Scripts and Claude Code skills for digitising physical books from scans and preparing manuscripts for publication in Vellum.

New here? Start with the example project

scatterpub-toolchain-example is a ready-to-run project containing real sample scans and a step-by-step tutorial. Fork it, follow the tutorial, and you will have a complete working pipeline in around 30 minutes — without needing to set up a project from scratch.

What's included

Component	Purpose
`scripts/extract-vellum.py`	Extract a Vellum project to Markdown
`scripts/ocr-to-markdown.py`	OCR clean page scans into raw Markdown
`scripts/clean-ocr.py`	Clean OCR artefacts, running headers, invisible characters
`scripts/clean-vellum.py`	Remove invisible artefacts from a `.vellum` package
`scripts/md-to-docx.py`	Convert Markdown to Word for Vellum import
`.claude/skills/copyeditor`	Claude Code skill: copy-edit; style guide driven by `language` in `book.md`
`.claude/skills/pdf`	Claude Code skill: general-purpose PDF processing
`.claude/skills/skill-creator`	Claude Code skill: create and improve skills

Using as a submodule

The toolchain is designed to be embedded in a book project repository as a git submodule.

1. Add the submodule

git submodule add https://github.com/scattercode/scatterpub-toolchain.git toolchain
git submodule update --init

2. Install dependencies

brew install poppler tesseract pandoc  # system tools
cd toolchain && poetry install         # Python deps

3. Symlink the skills

So that Claude Code can discover the skills via the standard .claude/skills/ path:

mkdir -p .claude/skills
ln -s ../../toolchain/.claude/skills/copyeditor .claude/skills/copyeditor
ln -s ../../toolchain/.claude/skills/pdf .claude/skills/pdf
ln -s ../../toolchain/.claude/skills/skill-creator .claude/skills/skill-creator

Commit the symlinks. Git tracks them as links, so every contributor gets them on checkout, and they resolve as soon as the submodule has been initialised.
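A quick way to confirm the links resolve after a fresh clone is sketched below. This is an illustrative helper, not part of the toolchain: the function name `broken_skill_links` is hypothetical, and the skill names are taken from the `ln -s` commands above.

```python
from pathlib import Path

# Skill names copied from the symlink commands above.
SKILLS = ("copyeditor", "pdf", "skill-creator")


def broken_skill_links(project_root):
    """Return the skills whose .claude/skills entry is missing or dangling."""
    skills_dir = Path(project_root) / ".claude" / "skills"
    broken = []
    for name in SKILLS:
        link = skills_dir / name
        # Path.exists() follows symlinks, so a link whose target is absent
        # (for example, an uninitialised submodule) reports False.
        if not link.exists():
            broken.append(name)
    return broken
```

Calling `broken_skill_links(".")` from the book project root returns an empty list when all three skills are reachable. A non-empty result usually means the submodule has not been initialised yet.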

4. Create book.md

In each book folder, create a book.md with YAML front matter:

---
title: "Book Title"
author: Author Name
language: en-GB
---

The scripts read this automatically and inject it as front matter in every generated Markdown file. The language field also tells the /copyeditor skill which style guide to apply during copy-editing:

`language`	Style guide
`en-GB`	Hart's Rules (British English)
`en-US`	Chicago Manual of Style, 18th edition (US English)
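To make the lookup concrete, here is a minimal sketch of reading the `language` field and mapping it to a style guide. It is not the toolchain's own parser, which is not shown here. The function names are hypothetical, and the parser handles only flat `key: value` pairs like the example above; a real implementation would more likely use a YAML library.

```python
# Mapping copied from the table above.
STYLE_GUIDES = {
    "en-GB": "Hart's Rules (British English)",
    "en-US": "Chicago Manual of Style, 18th edition (US English)",
}


def read_front_matter(text):
    """Parse flat `key: value` pairs between the opening and closing `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional quotes around the value.
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat as no front matter


def style_guide_for(text):
    """Return the style guide named by `language`, or None if it is unlisted."""
    return STYLE_GUIDES.get(read_front_matter(text).get("language"))
```

Passing the contents of the example `book.md` to `style_guide_for` yields the Hart's Rules entry. A missing or unlisted `language` yields `None`, so the caller can decide how to handle it.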

Standalone use

To use the toolchain directly (not as a submodule):

git clone https://github.com/scattercode/scatterpub-toolchain.git
cd scatterpub-toolchain
brew install poppler tesseract pandoc
poetry install

Publishing workflow

From a Vellum source file

The .vellum file is treated as final, manually edited content. Extract it and review directly — no automated cleaning pass is needed.

# Extract to Markdown
python3 toolchain/scripts/extract-vellum.py "publishing/<title>/<title>.vellum" \
  "publishing/<title>/draft/<slug>.md"

# Copy-edit review
# Load /copyeditor in Claude Code and provide the extracted Markdown path

# Generate Word document
python3 toolchain/scripts/md-to-docx.py "publishing/<title>/draft/<slug>.md"

From physical book scans

# Step 1: OCR the clean scans
toolchain/.venv/bin/python toolchain/scripts/ocr-to-markdown.py \
  "publishing/<title>/ocr/scans/clean"

# Step 2: Clean the raw output
python3 toolchain/scripts/clean-ocr.py "publishing/<title>/ocr/<slug>-raw.md" --join-hyphens

# Step 3: Copy-edit
# Load /copyeditor in Claude Code and provide the cleaned Markdown path
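If you run steps 1 and 2 for several books, the two commands can be generated from the title and slug. The sketch below only builds the argument lists and runs nothing. The function name and the `toolchain` parameter are hypothetical, and it assumes the folder layout and the `<slug>-raw.md` naming shown in the commands above.

```python
from pathlib import PurePosixPath


def scan_pipeline_commands(title, slug, toolchain="toolchain"):
    """Build the argument lists for steps 1 and 2 without running anything."""
    book = PurePosixPath("publishing") / title
    scripts = PurePosixPath(toolchain) / "scripts"
    # Step 1: OCR the clean scans with the interpreter from the toolchain's virtualenv.
    ocr = [
        str(PurePosixPath(toolchain) / ".venv" / "bin" / "python"),
        str(scripts / "ocr-to-markdown.py"),
        str(book / "ocr" / "scans" / "clean"),
    ]
    # Step 2: clean the raw Markdown, joining words hyphenated across line breaks.
    clean = [
        "python3",
        str(scripts / "clean-ocr.py"),
        str(book / "ocr" / f"{slug}-raw.md"),
        "--join-hyphens",
    ]
    return [ocr, clean]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so a failed OCR step stops the run before cleaning starts. Step 3 stays manual, since copy-editing happens inside Claude Code.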

Prerequisites

Tool	Install	Required for
Python 3.6+	system / pyenv	all scripts
Poetry	`pip install poetry`	Python dep management
pandoc	`brew install pandoc`	`.docx` conversion
poppler	`brew install poppler`	PDF text extraction
tesseract	`brew install tesseract`	`--ocr tesseract`
marker-pdf	`poetry install`	`--ocr marker` (default)

The .vellum package format is macOS-specific; extract-vellum.py and clean-vellum.py require macOS.
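Before a first run it can help to check that the command-line tools from the table are on `PATH`. The following is a small sketch, not something the toolchain ships. The executable names are assumptions: poppler is checked through its `pdftotext` binary, and the other tools are assumed to install a binary named after themselves.

```python
import shutil

# Tool name from the table -> executable to look for (assumed names).
REQUIRED_TOOLS = {
    "pandoc": "pandoc",
    "poppler": "pdftotext",
    "tesseract": "tesseract",
    "Poetry": "poetry",
}


def missing_tools(which=shutil.which):
    """Return the tools from the table whose executable is not on PATH."""
    return [tool for tool, exe in REQUIRED_TOOLS.items() if which(exe) is None]
```

`missing_tools()` returns an empty list when everything is installed. The `which` parameter exists only so the lookup can be replaced in tests.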
