ENH: Process XRefStm #1297

Merged
merged 27 commits into from Sep 3, 2022

Collaborator

@pubpub-zz pubpub-zz commented Aug 28, 2022

Fixes #1273
Fixes #1279
Fixes #1292
Fixes #1294
Fixes #1295

ROB: Cope with xref starting on \r\n
ROB: Escaped octal code followed by decimal int
ROB: Cope with some corrupted entries in xref table
ROB: Extend xref autorepair cases
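
To illustrate the "escaped octal code followed by decimal int" case, here is a minimal sketch of the rule involved: in a PDF literal string, a backslash may be followed by one to three octal digits, so a digit 8 or 9 (or a fourth digit) right after the escape belongs to the text, not to the escape. The helper name `decode_octal_escapes` is hypothetical and this is not PyPDF2's actual code; it only handles octal escapes and ignores every other escape sequence.

```python
import re

# One backslash followed by one to three octal digits (0-7 only).
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def decode_octal_escapes(raw: bytes) -> bytes:
    """Replace octal escapes with the byte they encode (hypothetical helper).

    Other escapes such as an escaped backslash are deliberately not handled.
    """

    def _to_byte(match: "re.Match[bytes]") -> bytes:
        # Keep only the low 8 bits, so values above 0o377 wrap around.
        return bytes([int(match.group(1), 8) & 0xFF])

    return _OCTAL_ESCAPE.sub(_to_byte, raw)
```

For example, `decode_octal_escapes(rb"\0539")` returns `b"+9"`: `\053` is `+`, and the `9` stays a literal digit because it is not an octal digit.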

fixes py-pdf#1295
includes test file adjustment

codecov bot commented Aug 29, 2022

Codecov Report

Merging #1297 (4edf6f8) into main (3b74312) will decrease coverage by 0.40%.
The diff coverage is 82.01%.

@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
- Coverage   95.07%   94.67%   -0.41%     
==========================================
  Files          30       30              
  Lines        4973     5106     +133     
  Branches     1023     1052      +29     
==========================================
+ Hits         4728     4834     +106     
- Misses        139      157      +18     
- Partials      106      115       +9     
Impacted Files                    Coverage Δ
PyPDF2/_reader.py                 89.49% <72.52%> (-2.19%) ⬇️
PyPDF2/_page.py                   94.36% <100.00%> (+<0.01%) ⬆️
PyPDF2/_writer.py                 91.04% <100.00%> (-0.51%) ⬇️
PyPDF2/generic/_base.py           100.00% <100.00%> (+1.02%) ⬆️
PyPDF2/generic/_utils.py          100.00% <100.00%> (ø)
PyPDF2/types.py                   100.00% <100.00%> (ø)
PyPDF2/_codecs/adobe_glyphs.py    100.00% <0.00%> (ø)
... and 1 more

@pubpub-zz
Collaborator Author

@MartinThoma,
Ready for review

@pubpub-zz
Collaborator Author

Stand by

fixes py-pdf#1279 / Status_v1_Reviewers-Guide.pdf
* if chained xref/trailer are not good
* if the object header ('id' 'gen' obj) is not good, or if the object is not present in the xref table, the reader will search the file for the object.
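
A minimal sketch of that fallback search, assuming the whole file is available as bytes. The name `find_object_offset` is hypothetical and is not part of PyPDF2's API; it only shows how an `'id' 'gen' obj` header can be located with a regular expression when the xref table cannot be trusted.

```python
import re
from typing import Optional


def find_object_offset(buf: bytes, idnum: int, generation: int) -> Optional[int]:
    """Return the byte offset of '<idnum> <generation> obj' in buf, or None."""
    # Require leading whitespace so "12 0 obj" is not matched inside "112 0 obj".
    # Limitation of this sketch: an object at offset 0 would not be found.
    pattern = rf"\s{idnum}\s+{generation}\s+obj".encode()
    match = re.search(pattern, buf)
    if match is None:
        return None
    # Skip the single leading whitespace character consumed by the pattern.
    return match.start() + 1
```

A caller could store the returned offset as the repaired xref entry and then read the object from there; a `None` result means the object really is missing from the file.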

fixes  py-pdf#1273
reader.xmp_metadata
assert exc.value.args[0].startswith("XML in XmpInformation was invalid")
assert exc.value.args[0].startswith("Stream length not defined")
Member

Why did this change? I guess the reader.xmp_metadata isn't even touched, is it?

Member

Before this PR, one could at least get the number of pages:

assert len(reader.pages) == 5

I guess with this PR it no longer works?

Collaborator Author

I had to modify the test result. I did not analyze further

Collaborator Author

Before this PR, one could at least get the number of pages:

assert len(reader.pages) == 5

I guess with this PR it no longer works?

under analysis

Collaborator Author

The PDF was corrupted: the XRef object had a corrupted /Length key. I've changed the code to skip loading the XRef object so that the main program can recover as much information as possible: you can now get the metadata 😊
Access to the number of pages is (still?) possible.

@pubpub-zz
Collaborator Author

I had to merge iss_1292 to have a global PR.

This PR is now complete.

@MartinThoma MartinThoma changed the title ENH : Process XRefStm ENH: Process XRefStm Sep 2, 2022
tests/test_xmp.py Outdated
tests/test_reader.py Outdated
pubpub-zz and others added 2 commits September 3, 2022 09:41
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@pubpub-zz
Collaborator Author

5 sec before me 😝

@MartinThoma
Member

I'll look into applying black automatically in the CI as an extra commit today 😄

Also, I want to make flake8 run in parallel to the tests and mypy after pytest so that I can still see issues there in a failed run.

@pubpub-zz
Collaborator Author

I don't think it's worth it.
The missing line came from the code review.
One thing I've noticed is that 3.10 check is performed twice. Do you know why? (for energy saving)

@MartinThoma
Member

One thing I've noticed is that 3.10 check is performed twice.

It's a different test scenario. pycryptodome is removed in that test run.

@MartinThoma MartinThoma merged commit 1252a49 into py-pdf:main Sep 3, 2022
@pubpub-zz pubpub-zz deleted the XRefStm branch September 3, 2022 19:53
MartinThoma added a commit that referenced this pull request Sep 4, 2022
Version 2.10.5, 2022-09-04
--------------------------

New Features (ENH):
-  Process XRefStm (#1297)
-  Auto-detect RTL for text extraction (#1309)

Bug Fixes (BUG):
-  Avoid scaling cropbox twice (#1314)

Robustness (ROB):
-  Fix offset correction in revised PDF (#1318)
-  Crop data of /U and /O in encryption dictionary to 48 bytes (#1317)
-  MultiLine bfrange in cmap (#1299)
-  Cope with 2 digit codes in bfchar (#1310)
-  Accept '/annn' charset as ASCII code (#1316)
-  Log errors during Float / NumberObject initialization (#1315)
-  Cope with corrupted entries in xref table (#1300)

Documentation (DOC):
-  Migration guide (PyPDF2 1.x ➔ 2.x) (#1324)
-  Creating a coverage report (#1319)
-  Fix AnnotationBuilder.free_text example (#1311)
-  Fix usage of page.scale by replacing it with page.scale_by (#1313)

Developer Experience (DEV):
-  Only run coverage for PyPDF2

Maintenance (MAINT):
-  PdfReaderProtocol (#1303)
-  Throw PdfReadError if Trailer can't be read (#1298)
-  Remove catching OverflowException (#1302)

Full Changelog: 2.10.4...2.10.5