Re-consider completely file-based testing? #2148

stefan6419846 · 2023-09-05T09:30:42Z

At the moment, testing in pypdf is de facto file-based only. While this is completely fine and basically tests at the integration layer, standalone unit tests might make sense as well to ensure that individual methods work correctly.

Let's take #2147 or #2110 as an example: I have some PDF files which show issues, but I cannot really provide them due to privacy reasons. There are all sorts of PDF generators available, some of them having bugs in some versions, some of them not bothering to generate completely valid PDF files at all; PDF/A might solve this, but you usually will not find many such files in everyday life. Generating crafted files which show such issues might be doable, but requires a deep understanding of the inner structure of PDF files. Using mocking, one might provide such problematic data more easily, which would allow for non-PDF-based tests of single methods (although this might impose some overhead when mocks have to be adapted during enhancements or refactoring).
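To illustrate the mocking idea, here is a minimal sketch using only the standard library. The helper `safe_page_texts` is hypothetical and not part of pypdf; it merely assumes a reader-like object with a `pages` sequence whose items offer `extract_text()`. The `KeyError` stands in for whatever failure a problematic private file would trigger.

```python
from unittest import mock


def safe_page_texts(reader):
    """Hypothetical helper (not pypdf API): collect the text of every page,
    falling back to an empty string for pages that fail to parse."""
    texts = []
    for page in reader.pages:
        try:
            texts.append(page.extract_text())
        except KeyError:
            # Assumed failure mode: a required entry is missing on the page.
            texts.append("")
    return texts


def test_safe_page_texts_with_broken_page():
    # A well-behaved page stand-in.
    good = mock.Mock()
    good.extract_text.return_value = "hello"

    # A page stand-in that fails the way the private file would.
    broken = mock.Mock()
    broken.extract_text.side_effect = KeyError("/Contents")

    # No PDF file is involved: the reader itself is a mock.
    reader = mock.Mock(pages=[good, broken])

    assert safe_page_texts(reader) == ["hello", ""]
    broken.extract_text.assert_called_once_with()
```

The trade-off mentioned above is visible here: the test no longer needs a file, but the mocks encode assumptions about the interface and have to be updated whenever the method under test changes how it accesses its inputs.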

In the past, I sometimes used PyMuPDF to mostly anonymize PDF files, but this only works to some extent as well: the error in #2147 silently disappears, PDFs of scanned images generally cannot really be anonymized, some images to extract might be scans or other private resources (like signatures), ...

One approach for me would be to stop reporting such issues and maintain my own patches instead (which would avoid these conflicts), but I generally appreciate the work which goes into such a freely available library and want to support development as far as I am able to.

If someone has realistic alternative approaches which avoid the aforementioned issues, I am open to them as well as to further discussions on this topic to enhance the general contribution experience of pypdf.

pubpub-zz · 2023-09-05T17:51:54Z

@stefan6419846 / @MartinThoma
I propose to convert this into a discussion.

pubpub-zz · 2023-09-05T18:03:52Z

My opinion
Without input data, investigation is impossible in almost all cases. Sending test files by email seems a good option: we do our best to preserve privacy, and this seems to be accepted by many people.

The only alternative I see is to pay a company, which will in any case ask for input data; up to now I have seen only one person agreeing to contribute for support, although I'm pretty sure this library is used in business projects.
Also, when we receive private data, the test we build once the issue is fixed uses a manually generated reproduction of the issue to ensure that it does not come back.

Finally just for general information, pypdf includes remove_text and remove_images that should wipe out most information.
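As a rough sketch of how these two methods might be used before sharing a file: the function name `strip_text_and_images` and the file paths are made up for illustration, and the snippet assumes a pypdf version in which `PdfWriter` provides `remove_text()` and `remove_images()`. Whether the stripped file still reproduces a given bug has to be checked case by case.

```python
def strip_text_and_images(src_path, dst_path):
    """Copy a PDF while removing text and images (illustrative helper)."""
    # Imported lazily so that the module loads even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()

    # Copy all pages into the writer before stripping content.
    for page in reader.pages:
        writer.add_page(page)

    # Remove text-showing operations and image objects from the copied pages.
    writer.remove_text()
    writer.remove_images()

    with open(dst_path, "wb") as fh:
        writer.write(fh)


# Example call (paths are placeholders):
# strip_text_and_images("private.pdf", "shareable.pdf")
```

Note that this only targets page content; metadata, attachments and scanned pages would need separate handling.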

py-pdf locked and limited conversation to collaborators Sep 5, 2023

MartinThoma converted this issue into discussion #2152 Sep 5, 2023
