_cmap.py: avoid IndexError in parse_to_unicode #1110

dkg · 2022-07-14T17:42:04Z

The code within the if block assumes that lst has index 0 and index 1.
So the predicate should depend on lst having at least two elements.
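
As a rough illustration of the point above (a minimal sketch, not the actual body of parse_to_unicode; `consume_pairs` and its token list are hypothetical names), a guard that only checks for a non-empty list still lets a single leftover token through, while a guard on two elements does not:

```python
from typing import Dict, List


def consume_pairs(lst: List[str]) -> Dict[str, str]:
    """Pair up tokens as (source, destination).

    Hypothetical helper for illustration only; it is not PyPDF2 code.
    """
    mapping: Dict[str, str] = {}
    # A guard of `len(lst) > 0` would admit a list with one leftover token,
    # and lst[1] below would then raise IndexError.
    while len(lst) >= 2:
        mapping[lst[0]] = lst[1]
        lst = lst[2:]
    return mapping


# Odd token count: the last source code has no destination next to it.
tokens = ["0018", "05e8", "000c"]
result = consume_pairs(tokens)
```

With the two-element guard, the dangling `"000c"` is skipped instead of raising.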

This resolves the error I described at
#1091 (comment)
(I'm not sure that it would resolve the other issue raised by
@MartinThoma)


codecov · 2022-07-14T17:47:21Z

Codecov Report

Merging #1110 (b6a8fb5) into main (682eff9) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1110   +/-   ##
=======================================
  Coverage   91.94%   91.94%           
=======================================
  Files          24       24           
  Lines        4642     4642           
  Branches      957      957           
=======================================
  Hits         4268     4268           
  Misses        229      229           
  Partials      145      145

Impacted Files	Coverage Δ
PyPDF2/_cmap.py	`93.54% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

dkg · 2022-07-14T18:09:23Z

fwiw, the first page of the dump.pdf where I ran into this problem has a character resource described this way:

stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
15 beginbfchar
<0018> <05e805d505e005d9002005d0>
<000c> <>
<0003> <>
<006a> <>
<008b> <>
<0043> <>
<007c> <>
<0056> <05e805d505e005d9002005d005d105df>
<0015> <>
<0037> <>
<003d> <>
<002a> <>
<0010> <>
<006b> <05ea05dc002005d005d105d905d1>
<0060> <>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

So I think the problem in parse_to_unicode is deeper than this particular off-by-one error. The textual manipulation that drops the angle brackets, newlines, etc. doesn't seem to account for empty angle brackets in the table.
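
To make the concern concrete, here is a minimal sketch of one way to read the beginbfchar pairs without stripping the angle brackets first, so that an empty `<>` stays an explicit empty destination. This is an assumption-laden illustration, not the approach PyPDF2 takes: `BFCHAR_PAIR` and `parse_bfchar` are hypothetical names, and the sketch assumes destinations are UTF-16BE hex strings with an even number of digits.

```python
import re
from typing import Dict

# Matches "<src> <dst>" where either hex string may be empty, e.g. "<000c> <>".
BFCHAR_PAIR = re.compile(rb"<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>")


def parse_bfchar(block: bytes) -> Dict[int, str]:
    """Map source codes to destination strings (hypothetical helper)."""
    mapping: Dict[int, str] = {}
    for src, dst in BFCHAR_PAIR.findall(block):
        if not src:
            # No source code to key on; skip the pair.
            continue
        code = int(src.decode("ascii"), 16)
        # An empty destination decodes to the empty string instead of
        # shifting the remaining tokens out of alignment.
        mapping[code] = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
    return mapping


sample = b"""3 beginbfchar
<0018> <05e805d505e005d9002005d0>
<000c> <>
<0003> <>
endbfchar"""
table = parse_bfchar(sample)
```

Because each pair is matched with its brackets intact, the empty entries cannot be mistaken for missing tokens.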

dkg · 2022-07-14T18:57:26Z

The fix proposed in this MR is definitely incomplete -- please see #1111

MartinThoma · 2022-07-14T18:58:01Z

Thank you so much 🤗

I've just merged it and I will create a release on PyPI on Sunday (17.07.2022)

The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091 Related to py-pdf#1111

New Features (ENH):
- Add color and font_format to PdfReader.outlines[i] (#1104)
- Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
- Use `build_destination` for named destination outlines (#1128)
- Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
- Prevent deduplication of PageObject (#1105)
- None-check in DictionaryObject.read_from_stream (#1113)
- Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
- Explanation for git submodule
- Watermark and stamp (#1095)

Maintenance (MAINT):
- Text extraction improvements (#1126)
- Destination.color returns ArrayObject instead of tuple as fallback (#1119)
- Use add_bookmark_destination in add_bookmark (#1100)
- Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
- Remove xfail from test_outline_title_issue_1121
- Add test for arab text (#1127)
- Add xfail for decryption fail (#1125)
- Add xfail test for IndexError when extracting text (#1124)
- Add MCVE showing outline title issue (#1123)

Code Style (STY):
- Apply black and isort
- Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
- Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0

dkg mentioned this pull request Jul 14, 2022

cmap handling: failure when a beginbfchar pair has a zero-length second element #1111

Closed

MartinThoma merged commit bb2d1db into py-pdf:main Jul 14, 2022

MartinThoma added the is-bug and workflow-text-extraction labels Jul 14, 2022

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this pull request Jul 15, 2022

BUG: Avoid IndexError in _cmap.parse_to_unicode (py-pdf#1110)

2ba30d6

