This issue was moved to a discussion.

Re-consider completely file-based testing? #2148

Closed
stefan6419846 opened this issue Sep 5, 2023 · 2 comments

Comments

@stefan6419846
Collaborator

At the moment, testing in pypdf is de facto file-based only. While this is completely fine and basically tests on the integration layer, standalone unit tests might make sense as well to ensure that individual methods work correctly.

Let's take #2147 or #2110 as an example: I have some PDF files which show issues, but I cannot provide them due to privacy reasons. There are all sorts of PDF generators available, some of them having bugs in certain versions, some of them not caring whether they generate completely valid PDF files at all; PDF/A might solve this, but you will usually not find many such files in everyday life. Generating crafted files which show such issues might be doable, but requires a deep understanding of the inner structure of PDF files. Using mocking, one might provide such problematic data more easily, which would allow for non-PDF-based tests of single methods (although this might impose some overhead when mocks have to be adapted during enhancements or refactoring).
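To make the mocking idea a bit more concrete, here is a minimal sketch of what such a non-PDF-based test could look like. The helper `count_nonempty_pages` is purely hypothetical and not part of pypdf; it only assumes an object exposing `.pages` whose items have an `extract_text()` method, and the `ValueError` is just a stand-in for whatever error a malformed page would trigger.

```python
from unittest import mock


def count_nonempty_pages(reader):
    """Hypothetical helper (not a pypdf API): count pages with non-blank text.

    `reader` is any object exposing `.pages`, where each page offers
    `.extract_text()`. Pages that fail to parse are skipped.
    """
    count = 0
    for page in reader.pages:
        try:
            text = page.extract_text()
        except ValueError:
            # Stand-in for a parsing error on malformed page content.
            continue
        if text and text.strip():
            count += 1
    return count


def make_fake_reader():
    """Build a mocked reader with one good, one blank and one broken page."""
    good = mock.Mock()
    good.extract_text.return_value = "Hello"
    blank = mock.Mock()
    blank.extract_text.return_value = "   "
    broken = mock.Mock()
    broken.extract_text.side_effect = ValueError("malformed content stream")

    reader = mock.Mock()
    reader.pages = [good, blank, broken]
    return reader


def test_count_nonempty_pages_skips_broken_pages():
    # No PDF file involved: the problematic data lives entirely in the mocks.
    assert count_nonempty_pages(make_fake_reader()) == 1
```

The downside mentioned above is visible here as well: if the method under test starts using another attribute of the page, the mocks have to be adapted along with it.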

In the past, I sometimes used PyMuPDF to mostly anonymize PDF files, but this only works to some extent as well: the error in #2147 is silently fixed in the process, PDFs of scanned images generally cannot really be anonymized, some images to extract might be scans or other private resources (like signatures), ...

One approach for me would be to stop reporting such issues and maintain my own patches instead (which would avoid these conflicts), but I generally appreciate the work which goes into such a freely available library and want to support development as far as I am able to.

If someone has realistic alternative approaches which avoid the aforementioned issues, I am open to them as well as to further discussions on this topic to enhance the general contribution experience of pypdf.

@pubpub-zz
Collaborator

@stefan6419846 / @MartinThoma
I propose to convert this into a discussion.

@pubpub-zz
Collaborator

My opinion
Without input data, investigation is impossible in almost all cases. Receiving test files through email seems a good option; we are doing our best to preserve privacy, and this seems to be accepted by many people.

The only alternative I see is to pay a company, which in any case will ask for input data, but up to now I've seen only one person willing to contribute for support, although I'm pretty sure this library is used in business projects.
Also, when we receive private data, once the issue is fixed, the test we build uses manually generated data to ensure the issue does not come back.

Finally just for general information, pypdf includes remove_text and remove_images that should wipe out most information.
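
For reference, a minimal sketch of how these two methods could be combined to strip a file before sharing it. The function name `strip_content` and its `src`/`dst` parameters are made up for illustration; only `PdfWriter(clone_from=...)`, `remove_text()`, `remove_images()` and `write()` are pypdf calls. Other data such as metadata or annotations is not touched by this sketch, so the result should still be checked manually.

```python
def strip_content(src, dst):
    """Write a copy of `src` to `dst` with text and images removed.

    Illustrative helper, not part of pypdf. Requires pypdf to be installed;
    it is imported here so that the requirement is explicit at the call site.
    """
    from pypdf import PdfWriter

    writer = PdfWriter(clone_from=src)  # copy the whole document
    writer.remove_text()    # drop text drawing operations from the pages
    writer.remove_images()  # drop images from the pages
    writer.write(dst)
```

Usage would be along the lines of `strip_content("private.pdf", "shareable.pdf")`, followed by checking whether the stripped file still reproduces the issue.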

@py-pdf py-pdf locked and limited conversation to collaborators Sep 5, 2023
@MartinThoma MartinThoma converted this issue into discussion #2152 Sep 5, 2023
