fix: require leading slash in trailer fields and tolerate malformed xref stream offset#796

Closed
vitormattos wants to merge 10 commits into smalot:master from vitormattos:fix/invalid-object-reference-tolerant-parser

@vitormattos vitormattos commented Apr 24, 2026

Bug fixed in this PR

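The parser failed with an Invalid object reference error on PDFs whose xref structure is slightly off, such as files with bytes before the %PDF- header or a startxref offset that points at whitespace just before the xref keyword.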

Fixture(s) and source

Note to maintainer

You can merge this PR directly, or use it only as a code-review helper and merge the aggregator PR (#809).

Some PDFs include bytes before the %PDF- header while still using
absolute xref offsets from the beginning of the file.

The parser trimmed data before %PDF-, which shifted offsets and caused
xref lookup failures. This manifested as an Invalid object reference
error in the veraPDF corpus header case.
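
The offset shift can be reproduced outside the parser. The following is a minimal Python sketch, not the library's PHP code: the byte strings and the helper keyword_at are made up for illustration, and the sketch assumes the writer recorded startxref from the first byte of the file.

```python
# Illustration only: a tiny fake "PDF" with junk bytes before the header.
junk = b"\x00\x01garbage\n"
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
xref = b"xref\n0 1\n0000000000 65535 f \n"
data = junk + body + xref

# Assumption: the offset was measured from the very first byte of the file.
startxref = len(junk) + len(body)


def keyword_at(buf: bytes, offset: int) -> bytes:
    """Return the 4 bytes at offset, where the xref keyword is expected."""
    return buf[offset:offset + 4]


# Original layout: the absolute offset lands on the keyword.
original = keyword_at(data, startxref)

# Trimmed layout: everything before %PDF- is dropped, so the recorded
# offset now points len(junk) bytes past the keyword.
trimmed = data[data.index(b"%PDF-"):]
shifted = keyword_at(trimmed, startxref)

print(original, shifted)
```

Keeping the original byte layout means the recorded offsets stay valid without any per-lookup correction.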

Changes:
- Keep original byte layout in RawDataParser::parseData
- Add stricter trailer key matching for /Size /Root /Encrypt /Info /Prev
- Add defensive handling in xref stream resolution when startxref is near,
  but not exactly at, the xref stream object
- Add regression fixture and integration test
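
To illustrate why the leading slash matters in trailer key matching, here is a small Python sketch. It is not the regex used in the PR: the function trailer_int and the /XRefSize key are hypothetical, chosen only to show how a pattern without the slash can match inside a longer key name.

```python
import re


def trailer_int(trailer: bytes, key: str, require_slash: bool = True):
    """Return the integer that follows key in a trailer dictionary, or None."""
    prefix = rb"/" if require_slash else rb""
    pattern = prefix + re.escape(key.encode()) + rb"\s+(\d+)"
    match = re.search(pattern, trailer)
    return int(match.group(1)) if match else None


# Hypothetical trailer: /XRefSize is a made-up key that merely ends in "Size".
trailer = b"<< /XRefSize 99 /Size 12 /Root 1 0 R >>"

loose = trailer_int(trailer, "Size", require_slash=False)  # matches inside /XRefSize
strict = trailer_int(trailer, "Size")                      # matches only /Size

print(loose, strict)
```

With the slash required, the pattern can only start at a name token, so a key that ends in the same letters is no longer picked up.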

Regression fixture:
- samples/bugs/PullRequestInvalidObjectReference.pdf

Test:
- DocumentIssueFocusTest::testParseFileWithCompressedObjRefInXrefStream

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
Some PDFs set startxref to the whitespace immediately before the
xref keyword instead of the first letter of xref.

The parser required an exact match and incorrectly switched to xref
stream decoding, which then failed with Invalid object reference.
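
The whitespace tolerance can be sketched as follows. This is an illustrative Python version, not the PR's PHP implementation, and the helper names skip_pdf_whitespace and is_classic_xref are assumptions. The whitespace set is the six characters the PDF specification defines as white space (NUL, tab, line feed, form feed, carriage return, space).

```python
# The six PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Advance offset past any PDF white-space bytes."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def is_classic_xref(data: bytes, startxref: int) -> bool:
    """True if the xref keyword follows startxref, ignoring leading whitespace."""
    adjusted = skip_pdf_whitespace(data, startxref)
    return data.startswith(b"xref", adjusted)


# startxref (6) points at the newline before the keyword, not at the "x".
data = b"endobj\n\nxref\n0 1\n"
exact_match = data.startswith(b"xref", 6)
tolerant_match = is_classic_xref(data, 6)

print(exact_match, tolerant_match)
```

The adjusted offset, rather than the raw startxref value, is what a classic xref decoder would then start from.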

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
@vitormattos vitormattos changed the title from "fix: preserve absolute xref offsets with pre-header bytes" to "fix: require leading slash in trailer fields and tolerate malformed xref stream offset" on Apr 26, 2026
@vitormattos
Author

Superseded by the updated scope in #812.
