Add Markdown export pipeline (RailReader.Export) #89
Merged
New library providing structured PDF-to-Markdown export using layout analysis, VLM transcription, and annotation extraction. Includes CLI `export` command with graceful degradation (ONNX+VLM → ONNX-only → plain text fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
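The degradation chain above (ONNX+VLM → ONNX-only → plain text) can be read as a single tier-selection step. The following is a minimal sketch, not the code from this PR: `ExportTier` and `ExportTierSelector` are hypothetical names, and it assumes a VLM is only useful on top of layout analysis.

```csharp
using System;

// Hypothetical names: ExportTier and ExportTierSelector are illustrative
// and are not taken from RailReader.Export.
public enum ExportTier
{
    PlainText,   // no layout model: emit extracted page text only
    OnnxOnly,    // layout analysis, placeholders where a VLM would transcribe
    OnnxWithVlm  // layout analysis plus VLM transcription
}

public static class ExportTierSelector
{
    // Pick the richest tier the environment supports.
    // Assumption: without the ONNX layout model there are no blocks to send
    // to a VLM, so the selector falls straight through to plain text.
    public static ExportTier Select(bool onnxModelAvailable, string? vlmEndpoint)
    {
        if (!onnxModelAvailable)
            return ExportTier.PlainText;

        // VLM availability is derived from the endpoint being set,
        // rather than from a separate boolean flag.
        return string.IsNullOrWhiteSpace(vlmEndpoint)
            ? ExportTier.OnnxOnly
            : ExportTier.OnnxWithVlm;
    }
}
```

Deriving the tier once, up front, keeps the rest of the pipeline free of repeated availability checks.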
Adds the new `export` CLI command and RailReader.Export library to all documentation surfaces: architecture diagrams, feature lists, CLI reference sections, and the guide.html website page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract `VlmService.GetBlockAction` and `VlmEndpointConfig.FromAppConfigWithOverrides` to Core, replacing duplicated logic in Export + VlmCommand + ExportCommand
- Add `LayoutConstants.GetClassName` helper, replacing scattered bounds-check patterns
- Cache `ExtractBlockText` results per block to avoid O(blocks * chars) repeated scans
- Flatten PDF outline once per document instead of per page
- Unify `AppendAnnotations`/`AppendAnnotationsWithText` into a single method with optional `PageText`, which enables rich highlight extraction in both layout and plain-text paths
- Remove dead `annotations` parameter from `PageMarkdownBuilder.Build`
- Remove redundant `vlmAvailable` bool (derive from `vlmEndpoint` nullness)
- Remove double VLM endpoint resolution (ExportCommand resolves, service trusts it)
- Strip unnecessary WHAT comments, keep WHY comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
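Two of the refactors above, the per-block text cache and the bounds-checked class-name lookup, follow common patterns. The sketch below illustrates them under stated assumptions: `BlockTextCache` and `LayoutClassNames` are hypothetical types, the label list and the `"unknown"` fallback are invented for illustration, and the delegate stands in for whatever `ExtractBlockText` actually does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: memoizes an expensive per-block text extraction so
// each block is scanned at most once, however often its text is requested.
public sealed class BlockTextCache
{
    private readonly Dictionary<int, string> _cache = new();
    private readonly Func<int, string> _extract;

    // The delegate is a stand-in for the real extraction routine.
    public BlockTextCache(Func<int, string> extract) => _extract = extract;

    public string Get(int blockIndex)
    {
        if (!_cache.TryGetValue(blockIndex, out var text))
        {
            text = _extract(blockIndex);   // runs once per block index
            _cache[blockIndex] = text;
        }
        return text;
    }
}

// Hypothetical sketch of a single bounds-checked lookup that replaces
// repeated "id >= 0 && id < names.Length" checks at call sites.
public static class LayoutClassNames
{
    // Illustrative labels only; the real list lives in LayoutConstants.
    private static readonly string[] Names =
        { "text", "title", "table", "figure", "equation" };

    // Assumption: out-of-range ids map to "unknown" instead of throwing.
    public static string GetClassName(int classId) =>
        classId >= 0 && classId < Names.Length ? Names[classId] : "unknown";
}
```

The cache is scoped to one instance, so creating a fresh `BlockTextCache` per page or per document bounds its lifetime without any explicit invalidation.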
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `RailReader.Export` library with `IMarkdownExportService` interface in Core and `MarkdownExportService` implementation: structured PDF-to-Markdown export using layout analysis, VLM transcription, heading resolution (outline fuzzy-match), and annotation blockquotes
- `railreader2-cli export` command with graceful degradation: ONNX+VLM → ONNX-only → plain text fallback
- Shared helpers: `VlmService.GetBlockAction`, `VlmEndpointConfig.FromAppConfigWithOverrides`, `LayoutConstants.GetClassName`
- `RailReader.Export.Tests` (32 tests), 0 regressions in existing 193 Core tests

Test plan
- `dotnet build RailReader2.slnx -c Release`: 0 warnings, 0 errors
- `dotnet test tests/RailReader.Export.Tests`: 32/32 pass
- `dotnet test tests/RailReader.Core.Tests`: 193/193 pass
- `railreader2-cli export --help` displays correctly
- `railreader2-cli export <pdf> --no-vlm --output plain.md`: verify heading hierarchy, `[equation]` placeholders, annotation blockquotes
- `railreader2-cli export <pdf> --pages 50-52 --endpoint ... --model ... --output rich.md`: verify LaTeX equations, pipe tables, figure descriptions

🤖 Generated with Claude Code
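The heading resolution by outline fuzzy-match mentioned in the summary could look roughly like the sketch below. This is one possible approach, not the method used in this PR: `OutlineEntry`, `HeadingResolver`, the normalized edit-distance score, and the 0.85 threshold are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a flattened PDF outline entry (Depth 0 = top level).
public record OutlineEntry(string Title, int Depth);

public static class HeadingResolver
{
    // Lowercase and keep only letters and digits, so differences in spacing
    // and numbering punctuation do not affect the comparison.
    private static string Normalize(string s) =>
        new string(s.Where(c => char.IsLetterOrDigit(c))
                    .Select(c => char.ToLowerInvariant(c))
                    .ToArray());

    // Classic two-row Levenshtein edit distance.
    private static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Returns a Markdown heading level (1 = "#") for the best-matching
    // outline entry, or null when nothing scores at or above the threshold.
    public static int? ResolveLevel(string blockText,
                                    IReadOnlyList<OutlineEntry> outline,
                                    double threshold = 0.85) // assumed value
    {
        var needle = Normalize(blockText);
        if (needle.Length == 0) return null;

        int? bestLevel = null;
        double bestScore = threshold;
        foreach (var entry in outline)
        {
            var candidate = Normalize(entry.Title);
            if (candidate.Length == 0) continue;
            double score = 1.0 - (double)EditDistance(needle, candidate)
                                 / Math.Max(needle.Length, candidate.Length);
            if (score >= bestScore)
            {
                bestScore = score;
                bestLevel = entry.Depth + 1;
            }
        }
        return bestLevel;
    }
}
```

Passing an outline that has already been flattened once per document avoids walking the outline tree again for every page.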