Improve PDF.js corpus resilience with parser-native page recovery, dimensions API, and regression coverage #795
Conversation
Force-pushed from 4f628d3 to bbbd1d3.
@vitormattos Thank you for all these PRs 🚀 I'm really busy right now and don't have much time, but I'll try to make time for it over the next few weeks. What I've seen so far looks very solid. If @j0k3r or others from the community wanna step in, I'd be happy to assist.
Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
@vitormattos Several PRs were closed very recently and new ones replaced them (to some extent). It seems you're currently doing a deep dive into the library. I only have so much time to look into these. Which one(s) should I prioritize?
Short answer: Yes, there are two possible ways to review: either #809 as a single integration PR, or the smaller split PRs individually.
Reason: #809 is an integration PR that combines everything I've been working on.

More context

I used pdfparser in LibreSign to extract page dimensions and generate stamped PDFs with footers. Over time, we started receiving bug reports related to parsing inconsistencies. As a workaround, I switched to pdfinfo.

After about a year using this workaround, I decided to come back to pdfparser and improve it so it can behave closer to pdfinfo.

To do that, I built a test engine that compares outputs between pdfparser and pdfinfo using large PDF datasets (see the sketch after this comment). Then I started fixing cases where pdfparser behaves differently (mainly page count and dimensions).

About the PR structure

Initially, I tried a single integration PR, but it quickly became hard to maintain due to conflicts and overlapping changes. So I reorganized the work into smaller PRs grouped by related fixes, making them easier to review and safer to merge.

More details about the PR reorganization (optional to read)

At first, I had one large integration branch, but as fixes evolved, I started finding issues in very close or overlapping parts of the code. This created a lot of conflicts and made the integration PR unstable and hard to review. Because of that, I split and regrouped changes by file and responsibility. Some PRs were closed and replaced by cleaner versions to better isolate the fixes. That's why you may see several closed PRs; they were part of this restructuring process.

Next steps (planned)
Finally, thank you for maintaining this project. I'm not able to sponsor right now since LibreSign still doesn't have enough funding, but I hope to do that in the future. For now, contributing with code is my way to give back.
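For illustration, a minimal sketch of the comparison-harness idea described above, assuming poppler's pdfinfo CLI is installed; the corpus path and the comparison logic are simplified stand-ins for the real test engine:

```php
<?php

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

// Hypothetical corpus location; the real datasets are not part of this sketch.
$parser = new Parser();

foreach (glob('/path/to/corpus/*.pdf') as $file) {
    // Page count as seen by pdfparser.
    try {
        $parserPages = count($parser->parseFile($file)->getPages());
    } catch (\Exception $e) {
        $parserPages = null;
    }

    // Page count as reported by poppler's pdfinfo CLI.
    $output = (string) shell_exec('pdfinfo ' . escapeshellarg($file) . ' 2>/dev/null');
    $pdfinfoPages = preg_match('/^Pages:\s+(\d+)$/m', $output, $m) ? (int) $m[1] : null;

    // Report only the divergent cases, which become fix candidates.
    if ($parserPages !== $pdfinfoPages) {
        printf("%s: pdfparser=%s pdfinfo=%s\n", basename($file),
            var_export($parserPages, true), var_export($pdfinfoPages, true));
    }
}
```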
- narrow catch type to Exception in Document fallback flow
- narrow catch type to Exception in Page text extraction fallback
- narrow catch type to Exception in test dimension helper
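For illustration, a minimal sketch of what narrowing the catch type means in a fallback flow like these; the helper function is hypothetical, not the actual Document/Page code:

```php
use Smalot\PdfParser\Page;

// Hypothetical fallback helper in the spirit of the flows listed above.
function extractTextWithFallback(Page $page): string
{
    try {
        return $page->getText();
    } catch (\Exception $e) { // previously \Throwable
        // Narrowing to \Exception keeps the fallback for parse failures,
        // while engine-level \Error conditions (e.g. TypeError) now
        // surface instead of being silently swallowed.
        return '';
    }
}
```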
- use Page::getDimensions() for CropBox/MediaBox resolution
- remove legacy getDetails-based fallback parsing
- drop obsolete helper and unused import
- call Page::getDimensions directly in page dimension assertions
- remove redundant extractPageDimensions helper
Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
- Make config property always non-null after construction
- Align with RawDataParser pattern for consistency
- Remove defensive null coalescing operators where now unnecessary
- Improve type safety and eliminate defensive checks
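A minimal sketch of the constructor pattern this commit describes, assuming pdfparser's Config class; the class body shown is illustrative, not the actual diff:

```php
use Smalot\PdfParser\Config;

// Illustrative shape only; not the actual Parser code.
class ParserSketch
{
    private Config $config; // typed non-nullable: always set after construction

    public function __construct(?Config $config = null)
    {
        // Fall back to a default Config instead of keeping null around,
        // so downstream code needs no `$this->config ?? new Config()` guards.
        $this->config = $config ?? new Config();
    }

    public function getConfig(): Config
    {
        return $this->config; // no null check needed anymore
    }
}
```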
- Pop recursion stack at the end of getTextArray
- Prevent state leakage between calls/tests
- Keep circular-reference guard behavior unchanged
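A sketch of the push/guard/pop pattern described here; the class, property, and helper names are illustrative stand-ins, not the library's actual code:

```php
// Illustrative stand-in for the real object class; names are hypothetical.
class PdfObjectSketch
{
    /** @var string[] Object ids currently on the recursion path. */
    private static array $recursionStack = [];

    public function getTextArray(): array
    {
        // Circular-reference guard (behavior unchanged): bail out if this
        // object is already being processed somewhere up the call chain.
        $id = spl_object_hash($this);
        if (\in_array($id, self::$recursionStack, true)) {
            return [];
        }
        self::$recursionStack[] = $id;

        try {
            return $this->doExtractTextArray();
        } finally {
            // The fix: always pop on the way out, even when extraction
            // throws, so no stack state leaks into the next call or test.
            array_pop(self::$recursionStack);
        }
    }

    private function doExtractTextArray(): array
    {
        return []; // placeholder for the actual extraction logic
    }
}
```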
Revert "…o inlined closures"

This reverts commit 55dfe59.
@k00ni I completed the work proposed in this PR: improving the parser's resilience so it can read more PDFs from the PDF.js and veraPDF corpora. Any further changes would greatly expand this PR's scope and move us away from the original goal. I prefer to reserve future improvements (e.g., performance optimizations or refactoring other components) for separate PRs.
@vitormattos I really appreciate the amount of time you have invested in this library. Although this library is used by many projects, it's not developed any further. Therefore, if you plan to invest further time in it, you should first check out https://github.com/PrinsFrank/pdfparser by @PrinsFrank. It is currently under active development again and might be a better "bet" for future development. That said, I will review this PR as soon as I find the time (in May), unless you want to switch to https://github.com/PrinsFrank/pdfparser.
Thanks for the work @vitormattos. But as a "maintainer" (even if @k00ni does 99.9% of the job here) I find it very complicated to merge such a huge PR, which is going to introduce a lot of changes (in the right direction, I'm sure) that we won't be able to maintain without investing a lot of time in the near future. As @k00ni said, this library should only receive small fixes, as it's not actively maintained. Maybe you should focus your time on the other lib.
Thanks a lot @k00ni and @j0k3r for the comment, I really appreciate the feedback.

I completely understand the concern about the size of this PR and the maintenance implications. I'm also a maintainer of some open source projects, so I'm very aware of how important it is to keep things reviewable and sustainable. At first glance it does look quite large, and I agree that merging something like this all at once would be difficult to review safely.

Just to give a bit more context on why it ended up this way: a significant portion of the changes here are actually test fixtures (real-world PDFs from the PDF.js corpus), along with regression tests built on top of them. The goal was to ensure that every behavior change is backed by reproducible cases, especially for malformed PDFs, which are quite common in the real world. In other words, the size is less about added complexity in the parser itself, and more about increasing confidence and coverage around real-world cases.

That said, I fully agree that the current format is not ideal for review or long-term maintenance. I'm happy to restructure this in a way that better fits the project's expectations. For example, we could split this into smaller, focused PRs (e.g. page tree traversal, xref recovery, dimensions API, test coverage), so each part can be reviewed independently and safely.

Regarding the suggestion about https://github.com/PrinsFrank/pdfparser, thanks for pointing that out, I'll definitely take a look. For now, my main goal is still to improve this project, since there are still users relying on this package.

Happy to adapt to whichever direction you think is most appropriate 👍
@vitormattos for what it's worth: I am glad that you reached out to @PrinsFrank in #315. Judging by the way things are developing right now, would it be OK for you to close this PR, so you can focus your time and energy solely on https://github.com/PrinsFrank/pdfparser? Having a strong, reliable project is better for the community than a few half-baked ones.

It's unfortunate that I can't free up enough time right now to start reviewing. I want to at least participate in the dialogue. As mentioned before, this library is just being kept alive on the latest PHP version, without further development (except occasional contributions from the community). It would be a shame if all your effort went into a dead end. We don't have to rush here, but time for open source projects is valuable (especially yours) and I don't want to waste it here.

In case you want to stay with smalot/pdfparser for now, I will switch back to my plan to review your code. I'll get back to you about the option of having several smaller PRs instead of one large one.
Objective
Strengthen parser reliability on malformed real-world PDFs, and validate outcomes directly by whether the parser can read each file (rather than by comparison with an external tool).
What this PR changes
- Page::getDimensions($boxName = 'CropBox')
- Document::getPagesDimensions($boxName = 'CropBox')
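A usage sketch based on the two signatures above; the exact shape of the returned dimension values is an assumption for illustration:

```php
use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile('document.pdf');

// Per page: CropBox is the default box, MediaBox can be requested explicitly.
foreach ($pdf->getPages() as $i => $page) {
    $crop = $page->getDimensions();            // assumed to be a width/height pair
    $media = $page->getDimensions('MediaBox');
    printf("page %d: crop=%s media=%s\n", $i + 1, json_encode($crop), json_encode($media));
}

// Or for the whole document in one call.
$allDimensions = $pdf->getPagesDimensions('MediaBox');
```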
Validation approach

- Readable (plain)
- Readable (encryption-marked)
- Unreadable by parser
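A hedged sketch of how a corpus run might bucket files into these three categories; the /Encrypt substring heuristic is an assumption about the harness, not its actual code:

```php
use Smalot\PdfParser\Parser;

$buckets = ['readable (plain)' => 0, 'readable (encryption-marked)' => 0, 'unreadable' => 0];
$parser = new Parser();

foreach (glob('/path/to/pdfjs-corpus/*.pdf') as $file) {
    try {
        $pdf = $parser->parseFile($file);
        $pdf->getPages(); // force page-tree resolution
        // Assumed heuristic: a readable file that references /Encrypt
        // counts as "encryption-marked" rather than plain.
        $marked = strpos((string) file_get_contents($file), '/Encrypt') !== false;
        ++$buckets[$marked ? 'readable (encryption-marked)' : 'readable (plain)'];
    } catch (\Exception $e) {
        ++$buckets['unreadable'];
    }
}

print_r($buckets);
```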
Results (PDF.js corpus: 930 files / 929 unique hashes)

Additional validation
Notes