
feat: add --extract support for local PDF files #203

Merged
steipete merged 2 commits into steipete:main from mvance:feat/pdf-extract-support
Apr 26, 2026

Conversation

Contributor

@mvance commented Apr 24, 2026

Summary

  • Extends --extract to accept local PDF files, routing them through the existing markitdown extraction path (same as remote PDF URLs)
  • Updates the validation guard in runner-plan.ts to allow .pdf extensions alongside audio/video
  • Adds isPdfExtension() helper to src/run/flows/asset/input.ts alongside isTranscribableExtension
  • Adds MAX_PDF_EXTRACT_BYTES constant (500 MB) to raise the default 50 MB limit for PDF extraction
  • Adds a progress spinner ("Loading file" → "Extracting text") around the load+extract sequence
  • Adds tests covering happy path, non-PDF rejection, and missing uvx error
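
The extension check and the relaxed validation guard described above could look roughly like the sketch below. Only the names isPdfExtension, isTranscribableExtension, and MAX_PDF_EXTRACT_BYTES come from this PR. The function bodies, the list of media extensions, the use of binary megabytes, and the assertExtractableLocalFile helper are assumptions made for illustration, not the code that was merged.

```typescript
// Sketch only: names come from the PR summary, bodies are assumed.

// 500 MB cap for PDF extraction (assumes binary megabytes).
const MAX_PDF_EXTRACT_BYTES = 500 * 1024 * 1024;

// Hypothetical list of audio/video extensions; the real list may differ.
const TRANSCRIBABLE_EXTENSIONS = [".mp3", ".wav", ".m4a", ".mp4", ".mov", ".webm"];

// True when the path ends in .pdf, ignoring case.
function isPdfExtension(filePath: string): boolean {
  return filePath.toLowerCase().endsWith(".pdf");
}

// True when the path ends in one of the assumed media extensions.
function isTranscribableExtension(filePath: string): boolean {
  const lower = filePath.toLowerCase();
  return TRANSCRIBABLE_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Hypothetical guard: accept PDFs alongside audio/video, reject everything else.
function assertExtractableLocalFile(filePath: string): void {
  if (isPdfExtension(filePath) || isTranscribableExtension(filePath)) {
    return;
  }
  throw new Error(`--extract does not support this local file type: ${filePath}`);
}
```

Keeping the extension test in one helper means the plan-time guard and the execution path share the same definition of "is a PDF", which is the duplication the second commit says it removed.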

Behaviour

# Before: rejected with an error
summarize --extract document.pdf

# After: extracts and prints markdown via markitdown, no LLM call
summarize --extract document.pdf

Note: image-based (scanned) PDFs with no embedded text layer will produce empty output from markitdown — this is a markitdown limitation, not a bug in this change. OCR plugin support would be a separate feature.

Test plan

  • pnpm test tests/cli.asset.local-pdf-extract.test.ts — 3 new tests pass
  • pnpm test — no regressions in existing tests
  • Manual: tested against a real multi-hundred-page PDF, output confirmed correct

🤖 Generated with Claude Code

mvance and others added 2 commits April 23, 2026 19:01
Extend the --extract flag to accept local PDF files, routing them
through the existing markitdown extraction path (same as remote PDF
URLs). Previously, --extract on a local file was only supported for
audio/video media files.

- Update validation in runner-plan.ts to allow .pdf extensions
- Add PDF extract path in runner-execution.ts before handleFileInput
- Add tests covering happy path, non-PDF rejection, and missing uvx

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add isPdfExtension() helper to input.ts (replaces duplicated .endsWith checks)
- Add MAX_PDF_EXTRACT_BYTES constant (500 MB) to raise the 50 MB limit
- Add progress spinner around loadLocalAsset + extractAssetContent
- Wrap test temp dirs in try/finally for cleanup
- Track execFileMock calls to verify local PDF branch is exercised

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
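
The try/finally cleanup mentioned in the second commit can be sketched as below. The withTempDir helper, the directory prefix, and the placeholder file contents are hypothetical and are not taken from the PR's test file. The sketch uses only Node's built-in fs, os, and path modules.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: run fn with a fresh temp dir and always remove it,
// even when fn throws.
function withTempDir<T>(fn: (dir: string) => T): T {
  const dir = mkdtempSync(join(tmpdir(), "pdf-extract-test-"));
  try {
    return fn(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

// Usage: create a placeholder file inside the temp dir and record its path.
let seenDir = "";
const pdfExisted = withTempDir((dir) => {
  seenDir = dir;
  const pdfPath = join(dir, "document.pdf");
  writeFileSync(pdfPath, "%PDF-1.4\n"); // placeholder bytes, not a valid PDF
  return existsSync(pdfPath);
});
```

Putting the removal in finally means a failing assertion inside the callback does not leave stray directories behind for later test runs.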
@steipete merged commit 760e6ce into steipete:main on Apr 26, 2026
4 checks passed

2 participants