fix: require leading slash in trailer fields and tolerate malformed xref stream offset by vitormattos · Pull Request #796 · smalot/pdfparser

vitormattos · 2026-04-24T03:01:46Z

Bug fixed in this PR

The parser failed with Invalid object reference errors while resolving xref data, especially for PDFs with a fragile xref structure.

Fixture(s) and source

samples/bugs/PullRequestInvalidObjectReference.pdf
- https://github.com/veraPDF/veraPDF-corpus/blob/staging/PDF_A-1b/6.1%20File%20structure/6.1.2%20File%20header/veraPDF%20test%20suite%206-1-2-t01-fail-a.pdf
samples/bugs/PullRequest797-vera.pdf
- https://github.com/veraPDF/veraPDF-corpus/blob/staging/PDF_A-1b/6.1%20File%20structure/6.1.2%20File%20header/veraPDF%20test%20suite%206-1-2-t01-fail-a.pdf
samples/bugs/PullRequest797-pdf.js.pdf
- https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9252.pdf

Note to maintainer

You can merge this PR directly, or use it only as a code-review helper and merge the aggregator PR (#809).

Some PDFs include bytes before the %PDF- header while still using absolute xref offsets from the beginning of the file. The parser trimmed data before %PDF-, which shifted offsets and caused xref lookup failures. This manifested as an Invalid object reference error in the veraPDF corpus header case.

Changes:
- Keep original byte layout in RawDataParser::parseData
- Add stricter trailer key matching for /Size /Root /Encrypt /Info /Prev
- Add defensive handling in xref stream resolution when startxref is near, but not exactly at, the xref stream object
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestInvalidObjectReference.pdf

Test:
- DocumentIssueFocusTest::testParseFileWithCompressedObjRefInXrefStream

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

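The offset shift described in this commit can be reproduced with a few bytes. The following is a minimal Python sketch, not the library's PHP code: the byte string, the `keyword_at` helper, and the variable names are all made up for illustration.

```python
# Illustrative only: a tiny fake PDF with junk bytes before the %PDF- header.
JUNK = b"garbage\n"
BODY = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\nxref\n0 1\n0000000000 65535 f \n"
data = JUNK + BODY

# Assumption: the writer recorded startxref as an absolute offset,
# counted from the very first byte of the file (junk included).
startxref = data.index(b"xref")


def keyword_at(buf: bytes, offset: int) -> bytes:
    """Return the four bytes found at `offset` (hypothetical helper)."""
    return buf[offset:offset + 4]


# Keeping the original byte layout: the recorded offset still lands on "xref".
assert keyword_at(data, startxref) == b"xref"

# Trimming everything before %PDF- moves every byte len(JUNK) positions
# to the left, so the same offset no longer lands on the keyword.
trimmed = data[data.index(b"%PDF-"):]
assert keyword_at(trimmed, startxref) != b"xref"
```

Keeping the buffer untouched, as the first change in the list does, avoids having to rebase every offset read from the file.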

Some PDFs set startxref to the whitespace immediately before the xref keyword instead of the first letter of xref. The parser required an exact match and incorrectly switched to xref stream decoding, which then failed with Invalid object reference.

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
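
The whitespace tolerance described above could look roughly like this. It is a Python sketch under stated assumptions, not the parser's actual implementation: `skip_pdf_whitespace` and `points_to_classic_xref` are hypothetical names, and the six byte values are the whitespace characters defined by the PDF specification.

```python
# The six PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Advance `offset` past any run of PDF whitespace (hypothetical helper)."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def points_to_classic_xref(data: bytes, startxref: int) -> tuple[bool, int]:
    """Check for the xref keyword at or just after `startxref`.

    Returns whether a classic xref table starts there, plus the adjusted
    offset that later decoding should use instead of the raw value.
    """
    adjusted = skip_pdf_whitespace(data, startxref)
    return data.startswith(b"xref", adjusted), adjusted


# startxref = 6 points at the newline before " xref", not at the "x".
sample = b"endobj\n xref\n0 1\n"
exact_match = sample.startswith(b"xref", 6)          # strict check fails
tolerant_match = points_to_classic_xref(sample, 6)   # tolerant check succeeds
```

Returning the adjusted offset alongside the boolean mirrors the second change in the list: the table should be decoded from where the keyword really starts, not from the raw startxref value.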

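Both commits tighten the trailer key patterns so that /Size, /Root, /Encrypt, /Info and /Prev only match with their leading slash. The Python sketch below shows why that matters; the patterns, the `parse_trailer` helper, and the contrived `/XSize` key are assumptions for illustration and are not copied from the library. It also only handles indirect references such as `1 0 R`.

```python
import re

# Assumed patterns: each key must be preceded by its slash.
TRAILER_PATTERNS = {
    "Size": re.compile(r"/Size\s+(\d+)"),
    "Prev": re.compile(r"/Prev\s+(\d+)"),
    "Root": re.compile(r"/Root\s+(\d+)\s+(\d+)\s+R"),
    "Encrypt": re.compile(r"/Encrypt\s+(\d+)\s+(\d+)\s+R"),
    "Info": re.compile(r"/Info\s+(\d+)\s+(\d+)\s+R"),
}


def parse_trailer(trailer: str) -> dict:
    """Extract the known trailer fields as tuples of ints (hypothetical helper)."""
    found = {}
    for key, pattern in TRAILER_PATTERNS.items():
        match = pattern.search(trailer)
        if match:
            found[key] = tuple(int(group) for group in match.groups())
    return found


# Contrived trailer: "/XSize" ends with the text "Size" but is a different key.
trailer = "<< /XSize 99 /Size 12 /Root 1 0 R /Info 2 0 R >>"

# A pattern without the slash latches onto the tail of "/XSize".
loose = re.search(r"Size\s+(\d+)", trailer).group(1)

# Requiring the slash picks up the real /Size entry.
strict = parse_trailer(trailer)
```

Here `loose` captures the value that belongs to `/XSize`, while `strict["Size"]` holds the value of the real `/Size` entry.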

vitormattos · 2026-04-27T18:32:43Z

Superseded by the updated scope in #812.

vitormattos mentioned this pull request Apr 24, 2026

fix: aggregate pre-header xref offset robustness vitormattos/pdfparser#2

Merged

test: use assertCount for page count assertion

917ad5d

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos mentioned this pull request Apr 25, 2026

sync: include missing PR806 follow-up commit in integration vitormattos/pdfparser#24

Closed

vitormattos added 4 commits April 25, 2026 18:34

test: move PR796 regression to RawDataParserTest

b8ec7b3

test: add pdf.js compressed xref regression

edbacca

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: clarify pull request fixture provenance

cc85357

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos force-pushed the fix/invalid-object-reference-tolerant-parser branch from b1684d5 to cc85357 Compare April 25, 2026 23:27

test(rawdata): keep PR796/797 regressions in RawDataParserTest only

cbd0bbf

vitormattos mentioned this pull request Apr 26, 2026

integration: consolidated PDF.js parsing resilience fixes #809

Closed

vitormattos added 2 commits April 25, 2026 21:35

test(rawdata): add fixture source @see links for PR796

1f71566

style(test): fix @see indentation in RawDataParserTest

0cb2995

vitormattos changed the title ~~fix: preserve absolute xref offsets with pre-header bytes~~ fix: require leading slash in trailer fields and tolerate malformed xref stream offset Apr 26, 2026

fix(rawdata): recover xref_command_missing in PR796 stack

6e3695c

This was referenced Apr 27, 2026

fix(rawdata): recover malformed xref_command_missing startxref path #815

Closed

chore(integration): include PR815 xref_command_missing regression vitormattos/pdfparser#41

Closed

vitormattos closed this Apr 27, 2026

vitormattos wants to merge 10 commits into smalot:master from vitormattos:fix/invalid-object-reference-tolerant-parser
