_cmap.py: avoid IndexError in parse_to_unicode #1110

Merged Jul 14, 2022 (1 commit)

@dkg (Contributor) commented Jul 14, 2022

The code within the if block assumes that lst has index 0 and index 1.
So the predicate should depend on lst having at least two elements.

This resolves the error I described at
#1091 (comment)
(I'm not sure that it would resolve the other issue raised by
@MartinThoma)
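
To illustrate the point about the predicate, here is a minimal sketch of the pattern, not the actual PyPDF2 source. The function name `consume_pairs` and its structure are hypothetical; the only thing taken from the description above is that the body reads `lst[0]` and `lst[1]`, so the guard has to require at least two elements.

```python
def consume_pairs(lst):
    """Hypothetical helper: read (source, destination) tokens two at a time."""
    mapping = {}
    # A guard such as `len(lst) > 0` would let a single leftover token
    # through and raise IndexError on lst[1]; requiring two elements
    # matches the indices actually used in the body.
    while len(lst) >= 2:
        mapping[lst[0]] = lst[1]
        lst = lst[2:]
    return mapping


# A trailing unpaired token is ignored instead of crashing.
print(consume_pairs([b"0018", b"05e8", b"000c"]))
```

Note that a guard like this only avoids the crash; an unpaired token is silently dropped.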

@codecov (bot) commented Jul 14, 2022

Codecov Report

Merging #1110 (b6a8fb5) into main (682eff9) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1110   +/-   ##
=======================================
  Coverage   91.94%   91.94%           
=======================================
  Files          24       24           
  Lines        4642     4642           
  Branches      957      957           
=======================================
  Hits         4268     4268           
  Misses        229      229           
  Partials      145      145           
Impacted Files     Coverage Δ
PyPDF2/_cmap.py    93.54% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dkg (Contributor, Author) commented Jul 14, 2022

FWIW, in the dump.pdf where I ran into this problem, the first page has a character resource that is described this way:

stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
15 beginbfchar
<0018> <05e805d505e005d9002005d0>
<000c> <>
<0003> <>
<006a> <>
<008b> <>
<0043> <>
<007c> <>
<0056> <05e805d505e005d9002005d005d105df>
<0015> <>
<0037> <>
<003d> <>
<002a> <>
<0010> <>
<006b> <05ea05dc002005d005d105d905d1>
<0060> <>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

So I think the problem in parse_to_unicode is deeper than this particular off-by-one error: the textual manipulation that drops the angle brackets, newlines, etc. doesn't seem to account for empty angle brackets in the table.
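
A rough sketch of why empty angle brackets matter, under stated assumptions. `naive_tokens` is a hypothetical stand-in for "strip the brackets, then split on spaces" (it is not copied from PyPDF2), and `parse_bfchar_lines` is one possible alternative that matches each `<...>` group explicitly, so an empty destination survives as an empty string. Decoding destinations as UTF-16BE follows the usual ToUnicode convention.

```python
import re

# Matches one <...> group; the hex payload may be empty.
HEX_TOKEN = re.compile(rb"<([0-9a-fA-F]*)>")


def naive_tokens(line):
    """Hypothetical: drop angle brackets, split on spaces, discard empties."""
    stripped = line.replace(b"<", b"").replace(b">", b"")
    return [x for x in stripped.split(b" ") if x]


def parse_bfchar_lines(lines):
    """Map source codes to destination strings, keeping empty destinations."""
    mapping = {}
    for line in lines:
        tokens = HEX_TOKEN.findall(line)
        if len(tokens) != 2:
            continue  # not a well-formed <src> <dst> pair; skip it
        src, dst = (t.decode("ascii") for t in tokens)
        # bytes.fromhex("") is b"", which decodes to the empty string.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


lines = [b"<0018> <05e805d505e005d9002005d0>", b"<000c> <>"]
print(naive_tokens(lines[1]))     # only one token survives the split
print(parse_bfchar_lines(lines))  # 0x000c maps to an empty string
```

With the naive approach the `<000c> <>` line yields a single token, so any code that then reads a second element fails or misaligns later pairs. Matching bracket groups keeps the pairing intact.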

@dkg (Contributor, Author) commented Jul 14, 2022

The fix proposed in this MR is definitely incomplete -- please see #1111

@MartinThoma (Member) commented

Thank you so much 🤗

I've just merged it and I will create a release on PyPI on Sunday (17.07.2022)

@MartinThoma added labels on Jul 14, 2022:
- is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
- workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this pull request Jul 15, 2022
The code within the if block assumes that `lst` has index 0 and index 1.

Fixes py-pdf#1091
Related to py-pdf#1111
MartinThoma added a commit that referenced this pull request Jul 17, 2022
New Features (ENH):
-  Add color and font_format to PdfReader.outlines[i] (#1104)
-  Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
-  Use `build_destination` for named destination outlines (#1128)
-  Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
-  Prevent deduplication of PageObject (#1105)
-  None-check in DictionaryObject.read_from_stream (#1113)
-  Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
-  Explanation for git submodule
-  Watermark and stamp (#1095)

Maintenance (MAINT):
-  Text extraction improvements (#1126)
-  Destination.color returns ArrayObject instead of tuple as fallback (#1119)
-  Use add_bookmark_destination in add_bookmark (#1100)
-  Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
-  Remove xfail from test_outline_title_issue_1121
-  Add test for arab text (#1127)
-  Add xfail for decryption fail (#1125)
-  Add xfail test for IndexError when extracting text (#1124)
-  Add MCVE showing outline title issue (#1123)

Code Style (STY):
-  Apply black and isort
-  Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
-  Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0