Ensure indirect sweep to handle all objects. #1064

Hatell · 2022-07-05T12:30:57Z

This fixes #1062

MartinThoma · 2022-07-05T12:38:41Z

You're quick!

I've just made a bugfix-release so that we have time to fix this. No need to rush :-)

Hatell · 2022-07-05T13:02:13Z

Could you run the tests? For some unknown reason they didn't run.

I found that some cross references were missing, so I added them as well.

PyPDF2/_writer.py

MartinThoma · 2022-07-05T13:04:04Z

@Hatell I've just added a tiny stylistic commit. I don't know why CI wasn't triggered before.

MartinThoma · 2022-07-05T13:16:08Z

@Hatell The failed test might actually be OK. It checks for file identity, so the file needs to be updated. We need to manually inspect the merged file to see whether it looks OK. Not ideal, but better than no check.

Hatell · 2022-07-05T13:45:09Z

Don't merge yet. There are still some issues left.

Hatell · 2022-07-05T14:04:28Z

I found that some of the data used to calculate the hash was ignored. The tests should pass now.
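A minimal sketch of the kind of hashing problem described here, assuming a stream object is a dictionary plus raw bytes and that its hash should cover both parts. The classes `DictLike` and `StreamLike` are hypothetical stand-ins for illustration, not the PyPDF2 classes:

```python
import hashlib


class DictLike(dict):
    """Hypothetical stand-in for a PDF dictionary object."""

    def hash_value_data(self) -> bytes:
        # Serialize entries in a stable order so equal dictionaries hash equally.
        return repr(sorted(self.items())).encode()

    def hash_value(self) -> bytes:
        digest = hashlib.sha1(self.hash_value_data()).hexdigest()
        return f"{self.__class__.__name__}:{digest}".encode()


class StreamLike(DictLike):
    """Hypothetical stand-in for a PDF stream: dictionary entries plus raw bytes."""

    def __init__(self, entries, data: bytes) -> None:
        super().__init__(entries)
        self.data = data

    def hash_value_data(self) -> bytes:
        # Cover the dictionary part as well as the stream bytes.
        # Hashing only self.data would make a and b below look identical.
        return super().hash_value_data() + self.data


a = StreamLike({"/Filter": "/FlateDecode"}, b"same bytes")
b = StreamLike({"/Filter": "/DCTDecode"}, b"same bytes")
c = StreamLike({"/Filter": "/FlateDecode"}, b"same bytes")
```

With both parts hashed, `a` and `c` count as duplicates while `a` and `b` do not, even though their stream bytes match.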

codecov · 2022-07-05T14:06:56Z

Codecov Report

Merging #1064 (b17baef) into main (a345690) will increase coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1064      +/-   ##
==========================================
+ Coverage   90.86%   90.95%   +0.09%     
==========================================
  Files          24       24              
  Lines        4508     4520      +12     
  Branches      920      923       +3     
==========================================
+ Hits         4096     4111      +15     
+ Misses        271      268       -3     
  Partials      141      141

| Impacted Files | Coverage Δ | |
|---|---|---|
| PyPDF2/_writer.py | `89.25% <100.00%> (+0.18%)` | ⬆️ |
| PyPDF2/generic.py | `91.57% <100.00%> (+0.33%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Hatell · 2022-07-05T15:47:46Z

Now all issues should be resolved.

I added a test because of the coverage checks, and found a bug in the process.

There is a bug in main, and this PR fixes it as well. From the previous PR #207 I took the files with NoneType errors and verified that the xref table is produced incorrectly.

MartinThoma · 2022-07-05T19:50:56Z

Thank you ❤️

MartinThoma · 2022-07-05T19:56:16Z

I'm running the compression checks again

MartinThoma · 2022-07-05T20:52:44Z

It looks OK to me. @MasterOdin what do you think?

MasterOdin · 2022-07-06T21:16:32Z

I'm still quite concerned that extern_map and self._idnum_hash do very similar things: both try to remove duplicate indirect objects during the sweep, but they can potentially mingle in weird ways. We only use self._idnum_hash when extern_map has a cache miss. When extern_map has a hit, I'm not sure there's a guarantee that what it has cached is the same as what we'd get from self._idnum_hash, so I guess we could still end up with repeated objects, though fewer than before.

Can we replace extern_map with self._idnum_hash?
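To make the suggestion concrete, here is a rough sketch of deduplication driven by a single hash-keyed table. `TinyWriter`, `add`, and `idnum_by_hash` are hypothetical names for illustration and do not reflect the actual PdfWriter internals:

```python
import hashlib
from typing import Dict, List


class TinyWriter:
    """Hypothetical writer that deduplicates objects with one hash-keyed table."""

    def __init__(self) -> None:
        self.objects: List[bytes] = []
        # Content hash -> object number in the output (1-based).
        self.idnum_by_hash: Dict[bytes, int] = {}

    def add(self, payload: bytes) -> int:
        """Return the object number for payload, reusing an existing one if possible."""
        key = hashlib.sha1(payload).digest()
        if key in self.idnum_by_hash:
            return self.idnum_by_hash[key]
        self.objects.append(payload)
        idnum = len(self.objects)
        self.idnum_by_hash[key] = idnum
        return idnum


writer = TinyWriter()
first = writer.add(b"<< /Type /Font >>")
second = writer.add(b"<< /Type /Font >>")  # same content, same object number
third = writer.add(b"<< /Type /Page >>")
```

Because there is only one table, there is no second cache that could disagree with it about which objects are duplicates.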

Hatell · 2022-07-07T06:32:42Z

It may work.

We only need to figure out how to transform extern_map, which is a tree-like structure based on pdf, generation and idnum, into a flat lookup table.
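One possible shape for that transformation, sketched under the assumption that the nested map is keyed by pdf, then generation, then idnum, and that each leaf is the new object number. The type aliases and the `flatten` helper are hypothetical:

```python
from typing import Dict, Hashable, Tuple

# Assumed nesting order: pdf -> generation -> idnum -> new object number.
NestedMap = Dict[Hashable, Dict[int, Dict[int, int]]]
FlatMap = Dict[Tuple[Hashable, int, int], int]


def flatten(nested: NestedMap) -> FlatMap:
    """Turn the three-level map into one dict keyed by (pdf, generation, idnum)."""
    flat: FlatMap = {}
    for pdf, by_generation in nested.items():
        for generation, by_idnum in by_generation.items():
            for idnum, new_idnum in by_idnum.items():
                flat[(pdf, generation, idnum)] = new_idnum
    return flat


nested = {"a.pdf": {0: {3: 1, 4: 2}}, "b.pdf": {0: {3: 5}}}
flat = flatten(nested)
```

A lookup then becomes a single `flat.get((pdf, generation, idnum))` call instead of three nested lookups.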

I would also change this recursive depth-first processing to an iterative version, maybe after this structure change.
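As an illustration of the recursive-to-iterative change, the sketch below walks a nested structure of dicts and lists with an explicit stack instead of recursion. It is a generic example with hypothetical function names, not the sweep code itself:

```python
from typing import Any, List


def collect_leaves_recursive(node: Any) -> List[Any]:
    """Depth-first traversal using the call stack."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return [node]
    leaves: List[Any] = []
    for child in children:
        leaves.extend(collect_leaves_recursive(child))
    return leaves


def collect_leaves_iterative(root: Any) -> List[Any]:
    """Same traversal order, but with an explicit stack."""
    leaves: List[Any] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Push children reversed so the first child is popped first.
            stack.extend(reversed(list(node.values())))
        elif isinstance(node, list):
            stack.extend(reversed(node))
        else:
            leaves.append(node)
    return leaves


tree = {"a": [1, {"b": 2}], "c": [3, [4, 5]]}
```

The iterative version does not depend on the interpreter's recursion limit, which matters for deeply nested object graphs.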

Hatell · 2022-07-07T13:04:34Z

I created PR #1072, which still needs testing.

Hatell marked this pull request as draft July 5, 2022 12:51

Ensure indirect sweep to handle all objects.

190479c

Hatell force-pushed the IS-1062 branch from 9312225 to 190479c Compare July 5, 2022 12:57

Hatell marked this pull request as ready for review July 5, 2022 12:57

MartinThoma reviewed Jul 5, 2022

Update PyPDF2/_writer.py

b6cc70e

Hatell closed this Jul 5, 2022

Hatell reopened this Jul 5, 2022

Hatell marked this pull request as draft July 5, 2022 13:46

StreamObject.hash_value_data get DictionaryObject data as well.

1ec14d1

Hatell marked this pull request as ready for review July 5, 2022 14:03

Harry Karvonen added 2 commits July 5, 2022 18:42

Added test to merge PDF with missing objects.

36aa111

Handle case when PdfWriter._sweep_indirect_references returns None.

b17baef

Hatell force-pushed the IS-1062 branch from 9d8e13d to b17baef Compare July 5, 2022 15:42

Hatell closed this Jul 7, 2022
