fix: recover hybrid xref offsets in pdf.js issue17147 #804

Closed
vitormattos wants to merge 1 commit into smalot:master from vitormattos:fix/pdfjs-issue17147-objref-string

@vitormattos vitormattos commented Apr 24, 2026

Summary

  • add regression fixture samples/bugs/PullRequest804-pdf.js.pdf from pdf.js issue17147
  • add integration test covering parsing when a hybrid trailer points to /XRefStm and /Prev
  • harden xref parsing to recover when startxref points near, but not exactly at, the xref keyword
  • avoid type errors when an xref-stream lookup lands on a trailer dictionary; follow the trailer's offsets instead

Root cause

issue17147.pdf contains a hybrid-reference structure in which the trailer references both /XRefStm and /Prev, and one xref offset points into the xref table entries rather than exactly at the xref keyword.

The parser raised a type error (an array was passed where an object reference was expected) and then failed with "Unable to find xref".
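
To make the hybrid trailer concrete, here is a minimal sketch in Python (not the parser's PHP code) that pulls the /XRefStm and /Prev offsets out of a trailer dictionary. The helper names `trailer_offsets` and `follow_order` and the sample trailer are hypothetical. The order shown (xref stream first, then the previous section) reflects my reading of how hybrid-reference files are meant to be resolved and is an assumption, not something this PR states.

```python
import re

# Hypothetical helpers for illustration only; not part of smalot/pdfparser.
# The trailing \s+ keeps names such as /Preview from matching /Prev.
_OFFSET_KEYS = re.compile(rb"/(XRefStm|Prev)\s+(\d+)")


def trailer_offsets(trailer: bytes) -> dict:
    """Return the byte offsets named by /XRefStm and /Prev, if present."""
    return {
        m.group(1).decode("ascii"): int(m.group(2))
        for m in _OFFSET_KEYS.finditer(trailer)
    }


def follow_order(trailer: bytes) -> list:
    """List the offsets to visit next: /XRefStm first, then /Prev (assumed order)."""
    offsets = trailer_offsets(trailer)
    return [offsets[key] for key in ("XRefStm", "Prev") if key in offsets]
```

A trailer that carries only /Prev yields a single offset, so the same helper covers classic incremental updates as well as hybrid files.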

Fix details

  • in decodeXref(), follow /XRefStm trailers so hybrid references are decoded
  • in decodeXrefStream(), guard the object-reference type and recover from trailer dictionaries by following /XRefStm and /Prev
  • in getXrefData(), search a small backward window for xref when offsets land shortly after the keyword
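
The backward-window recovery in the last bullet can be sketched as follows. This is an illustration in Python under stated assumptions, not the code in getXrefData(): the function name `find_xref_keyword` and the 32-byte default window are hypothetical, and the PR does not say how large its window is. The sketch also skips the `xref` that appears inside `startxref`, a pitfall any backward scan has to consider.

```python
def find_xref_keyword(data: bytes, offset: int, window: int = 32) -> int:
    """Return the position of the `xref` keyword at, or shortly before, offset.

    Returns -1 when no standalone keyword lies inside the window.
    Hypothetical helper; the window size is an assumed value.
    """
    keyword = b"xref"
    lo = max(0, offset - window)
    # rfind's end bound is exclusive, so this still admits a match at offset.
    hi = min(len(data), offset + len(keyword))
    pos = data.rfind(keyword, lo, hi)
    while pos != -1:
        # Ignore the tail of `startxref`; it is not a table header.
        if data[max(0, pos - 5):pos] != b"start":
            return pos
        # Shrink the range so the next hit must begin before this one.
        hi = pos + len(keyword) - 1
        pos = data.rfind(keyword, lo, hi)
    return -1
```

Scanning from the nearest candidate backwards means an exact offset is returned unchanged, so well-formed files take the same path as before.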

Validation

  • red/green regression on tests/PHPUnit/Integration/DocumentIssueFocusTest.php with filter testParseFileWithArrayXrefObjectReferenceInStream
  • full run of tests/PHPUnit/Integration/DocumentIssueFocusTest.php
  • direct parse smoke-check of issue17147.pdf returns 1 page

All validation was run in Docker (php:8.3-cli) because host PHP is unavailable.

Source PDF

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

@vitormattos (Author) commented

Superseded by the RawDataParser consolidation chain in the fork.

This fix (hybrid xref offset recovery, XRefStm handling, backward xref scan) is included in vitormattos#32, stacked on fix/invalid-object-reference-tolerant-parser.

@vitormattos vitormattos deleted the fix/pdfjs-issue17147-objref-string branch April 27, 2026 17:10