
fix(rawdata): consolidate malformed xref/startxref and page-tree recovery#816

Closed
vitormattos wants to merge 37 commits into smalot:master from vitormattos:fix/rawdata-next-xref-trailer-recovery


vitormattos commented Apr 27, 2026

Summary

Improve RawData and parser resilience for malformed PDF.js-style documents by combining xref/startxref recovery, tolerant object/header parsing, and page-tree/catalog recovery, backed by regression fixtures.

Included Coverage

  • Recover malformed or missing startxref/xref table placement.
  • Tolerate malformed xref-stream offsets and nearby indirect object headers.
  • Improve header parsing tolerance for malformed tokens and ObjStm edge-cases.
  • Keep page-tree/catalog extraction robust on malformed Kids/catalog structures.
  • Add and validate regression fixtures from PDF.js and related malformed corpora.
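The first coverage item (recovering a malformed or missing startxref) can be illustrated with a minimal sketch. It is not the code from this pull request: `locateXrefOffset` is a hypothetical helper, it only handles classic xref tables (not xref streams), and the fallback strategy of taking the last standalone `xref` keyword is an assumption made for illustration.

```php
<?php
// Hypothetical helper, not part of smalot/pdfparser.
// Returns the byte offset of a classic xref table, or null if none is found.
function locateXrefOffset(string $pdf): ?int
{
    // Prefer the last declared startxref value when it really points at "xref".
    if (preg_match_all('/startxref\s+(\d+)/', $pdf, $m) > 0) {
        $declared = (int) end($m[1]);
        if ($declared < strlen($pdf)
            && 1 === preg_match('/\Gxref\b/', $pdf, $unused, 0, $declared)) {
            return $declared;
        }
    }

    // Fallback: scan for the last "xref" keyword that is not the tail of "startxref".
    if (preg_match_all('/(?<![a-z])xref\s/', $pdf, $m, PREG_OFFSET_CAPTURE) > 0) {
        $last = end($m[0]);

        return $last[1];
    }

    return null;
}
```

A wrong startxref value (for example one past the end of the file) then falls through to the keyword scan instead of aborting the parse.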

Updated Areas

  • src/Smalot/PdfParser/RawData/RawDataParser.php
  • src/Smalot/PdfParser/Parser.php
  • src/Smalot/PdfParser/Pages.php
  • tests/PHPUnit/Integration/RawData/RawDataParserTest.php
  • tests/PHPUnit/Integration/DocumentIssueFocusTest.php
  • samples/bugs/rawdata/*

Validation

  • tests/PHPUnit/Integration/RawData/RawDataParserTest.php
  • tests/PHPUnit/Integration/DocumentIssueFocusTest.php

vitormattos changed the title from "fix(rawdata): recover malformed xref trailers and page trees" to "fix(rawdata): consolidate malformed xref/startxref and page-tree recovery" Apr 27, 2026
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 27, 2026
# Conflicts:
#	src/Smalot/PdfParser/RawData/RawDataParser.php
#	tests/PHPUnit/Integration/DocumentIssueFocusTest.php
#	tests/PHPUnit/Integration/RawData/RawDataParserTest.php
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 28, 2026
# Conflicts:
#	tests/PHPUnit/Integration/RawData/RawDataParserTest.php
vitormattos force-pushed the fix/rawdata-next-xref-trailer-recovery branch from 625e9d0 to a7cca15 on April 28, 2026 13:48
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 29, 2026
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 29, 2026
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 29, 2026
# Conflicts:
#	src/Smalot/PdfParser/RawData/RawDataParser.php
#	tests/PHPUnit/Integration/RawData/RawDataParserTest.php
Some PDFs include bytes before the %PDF- header while still using
absolute xref offsets from the beginning of the file.

The parser trimmed the data before %PDF-, which shifted every offset and
caused xref lookups to fail. This surfaced as an "Invalid object reference"
error in the veraPDF corpus header case.

Changes:
- Keep original byte layout in RawDataParser::parseData
- Add stricter trailer key matching for /Size /Root /Encrypt /Info /Prev
- Add defensive handling in xref stream resolution when startxref is near,
  but not exactly at, the xref stream object
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestInvalidObjectReference.pdf

Test:
- DocumentIssueFocusTest::testParseFileWithCompressedObjRefInXrefStream

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
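The offset shift described in this commit message can be reproduced on a toy buffer. The sketch below is an illustration only: `objectHeaderAt` is a hypothetical helper and the buffer is hand-built, not the regression fixture. It assumes, as the message states, that the recorded offset is absolute from the first byte of the file.

```php
<?php
// Hypothetical helper: does an indirect object header ("N G obj") start at $offset?
function objectHeaderAt(string $data, int $offset): bool
{
    if ($offset < 0 || $offset >= strlen($data)) {
        return false;
    }

    return 1 === preg_match('/\G\d+\s+\d+\s+obj\b/', $data, $unused, 0, $offset);
}

// Junk bytes precede the header, but offsets are counted from byte 0 of the file.
$pdf = "JUNK-BYTES\n" . "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
$offset = strpos($pdf, '1 0 obj'); // what an xref entry would record

// Trimming everything before %PDF- moves the object, so the offset no longer matches.
$trimmed = substr($pdf, strpos($pdf, '%PDF-'));

var_dump(objectHeaderAt($pdf, $offset));     // bool(true): original byte layout kept
var_dump(objectHeaderAt($trimmed, $offset)); // bool(false): offsets shifted by the trim
```

Keeping the buffer untouched, as the change list describes for RawDataParser::parseData, avoids having to rebase every offset after trimming.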
Some PDFs set startxref to the whitespace immediately before the
xref keyword instead of the first letter of xref.

The parser required an exact match, so it incorrectly fell back to xref
stream decoding, which then failed with "Invalid object reference".

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply the same whitespace tolerance to Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
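The whitespace tolerance described in this commit message might look roughly like the sketch below. The names `PDF_WHITESPACE`, `skipPdfWhitespace`, and `pointsAtClassicXref` are hypothetical and do not come from RawDataParser; the character set is the six PDF whitespace characters (NUL, tab, line feed, form feed, carriage return, space).

```php
<?php
// The six PDF whitespace characters: NUL, HT, LF, FF, CR, SP.
const PDF_WHITESPACE = "\x00\t\n\x0C\r ";

// Hypothetical helper: advance $offset past any run of PDF whitespace.
function skipPdfWhitespace(string $data, int $offset): int
{
    if ($offset < 0 || $offset >= strlen($data)) {
        return $offset; // out of range: leave the offset untouched
    }

    return $offset + strspn($data, PDF_WHITESPACE, $offset);
}

// Hypothetical helper: true when startxref lands on, or just before, "xref".
function pointsAtClassicXref(string $data, int $startxref): bool
{
    $adjusted = skipPdfWhitespace($data, $startxref);

    return 'xref' === substr($data, $adjusted, 4);
}
```

The adjusted offset, rather than the declared one, is what a classic xref decoder would then start from, so a startxref that points at the preceding whitespace is not mistaken for an xref stream.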
@vitormattos
Author

Replaced by #795
