Add Markdown export pipeline (RailReader.Export) #89

Merged
sjvrensburg merged 4 commits into main from feature/markdown-export on Apr 12, 2026

@sjvrensburg
Owner

Summary

  • New RailReader.Export library with IMarkdownExportService interface in Core and MarkdownExportService implementation — structured PDF-to-Markdown export using layout analysis, VLM transcription, heading resolution (outline fuzzy-match), and annotation blockquotes
  • New railreader2-cli export command with graceful degradation: ONNX+VLM → ONNX-only → plain text fallback
  • Shared helpers extracted to Core: VlmService.GetBlockAction, VlmEndpointConfig.FromAppConfigWithOverrides, LayoutConstants.GetClassName
  • 32 new tests in RailReader.Export.Tests, 0 regressions in existing 193 Core tests
  • Documentation updated across CLAUDE.md, README.md, user guide, and website
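
The degradation chain in the second bullet (ONNX+VLM → ONNX-only → plain text fallback) can be read as a small mode-selection step. The following is a minimal sketch of that idea, not the code from this PR: `ExportMode` and `ExportModeSelector` are hypothetical names, the VLM endpoint is modeled as a nullable string, and it is assumed that the plain-text fallback applies when the ONNX layout model is unavailable.

```csharp
#nullable enable

// Hypothetical illustration types; not taken from RailReader.Export.
public enum ExportMode { OnnxWithVlm, OnnxOnly, PlainText }

public static class ExportModeSelector
{
    // Chooses the richest mode the current environment supports.
    // vlmEndpoint is modeled as a nullable string here (an assumption);
    // null or blank means no VLM endpoint is configured.
    public static ExportMode Select(bool layoutModelAvailable, string? vlmEndpoint)
    {
        if (!layoutModelAvailable)
            return ExportMode.PlainText;   // no layout analysis: plain text fallback

        return string.IsNullOrWhiteSpace(vlmEndpoint)
            ? ExportMode.OnnxOnly          // layout analysis without VLM transcription
            : ExportMode.OnnxWithVlm;      // full pipeline
    }
}
```

With this sketch, `ExportModeSelector.Select(true, null)` yields `ExportMode.OnnxOnly`, which mirrors the `--no-vlm` case in the test plan.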

Test plan

  • dotnet build RailReader2.slnx -c Release — 0 warnings, 0 errors
  • dotnet test tests/RailReader.Export.Tests — 32/32 pass
  • dotnet test tests/RailReader.Core.Tests — 193/193 pass
  • railreader2-cli export --help displays correctly
  • railreader2-cli export <pdf> --no-vlm --output plain.md — verify heading hierarchy, [equation] placeholders, annotation blockquotes
  • railreader2-cli export <pdf> --pages 50-52 --endpoint ... --model ... --output rich.md — verify LaTeX equations, pipe tables, figure descriptions

🤖 Generated with Claude Code

sjvrensburg and others added 4 commits April 12, 2026 10:49

New library providing structured PDF-to-Markdown export using layout
analysis, VLM transcription, and annotation extraction. Includes CLI
`export` command with graceful degradation (ONNX+VLM → ONNX-only →
plain text fallback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds the new `export` CLI command and RailReader.Export library to all
documentation surfaces: architecture diagrams, feature lists, CLI
reference sections, and the guide.html website page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Extract VlmService.GetBlockAction and VlmEndpointConfig.FromAppConfigWithOverrides
  to Core, replacing duplicated logic in Export + VlmCommand + ExportCommand
- Add LayoutConstants.GetClassName helper, replacing scattered bounds-check patterns
- Cache ExtractBlockText results per block to avoid O(blocks * chars) repeated scans
- Flatten PDF outline once per document instead of per page
- Unify AppendAnnotations/AppendAnnotationsWithText into single method with optional
  PageText — enables rich highlight extraction in both layout and plain-text paths
- Remove dead annotations parameter from PageMarkdownBuilder.Build
- Remove redundant vlmAvailable bool (derive from vlmEndpoint nullness)
- Remove double VLM endpoint resolution (ExportCommand resolves, service trusts it)
- Strip unnecessary WHAT comments, keep WHY comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
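
The per-block caching of `ExtractBlockText` results mentioned in the commit above can be illustrated with a small memoization wrapper. This is a sketch under assumptions rather than the PR's implementation: `BlockTextCache` is a hypothetical name, blocks are identified by an integer index, and the extraction function is passed in as a delegate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: memoizes an expensive per-block text extraction
// so each block's text is computed at most once.
public sealed class BlockTextCache
{
    private readonly Dictionary<int, string> _cache = new();
    private readonly Func<int, string> _extract;

    // Number of times the underlying extraction actually ran.
    public int ExtractionCount { get; private set; }

    public BlockTextCache(Func<int, string> extract) => _extract = extract;

    public string Get(int blockIndex)
    {
        if (!_cache.TryGetValue(blockIndex, out var text))
        {
            text = _extract(blockIndex);   // the expensive scan happens only on a miss
            ExtractionCount++;
            _cache[blockIndex] = text;
        }
        return text;
    }
}
```

Repeated calls to `Get` with the same index return the stored string without re-running the scan, which is the effect the commit describes as avoiding O(blocks * chars) repeated work.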
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sjvrensburg sjvrensburg merged commit 3b4141e into main Apr 12, 2026
@sjvrensburg sjvrensburg deleted the feature/markdown-export branch April 12, 2026 09:11