local variable 'cm' referenced before assignment #2702

thelazydogsback · 2024-06-05T16:19:51Z

Trying to extract text from page.
Tested in Win11 & Linux container.
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.5'), PIL=none

Traceback

File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2052, in extract_text
    return self._layout_mode_text(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1950, in _layout_mode_text
    fonts = self._layout_mode_fonts()
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1902, in _layout_mode_fonts
    *cmap, font_dict_obj = build_char_map(font_name, 200.0, self)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 58, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 235, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 260, in prepare_cm
    if isinstance(cm, str):
UnboundLocalError: local variable 'cm' referenced before assignment
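
For readers unfamiliar with this error class, the pattern in the traceback can be reproduced without pypdf. The sketch below is a hypothetical stand-in, not pypdf's actual code: `prepare_cm_like`, `fails_with_unbound_local`, and their arguments are invented for illustration. It only shows how a variable assigned in some branches of an `if`/`elif` chain raises `UnboundLocalError` when no branch matches.

```python
def prepare_cm_like(to_unicode):
    """Hypothetical stand-in that mimics the control flow seen in the traceback."""
    if isinstance(to_unicode, bytes):
        cm = to_unicode
    elif isinstance(to_unicode, str) and to_unicode.startswith("/Identity"):
        cm = b"identity"
    # There is no else branch: any other input (for example None)
    # leaves `cm` unassigned, so the next line raises UnboundLocalError.
    if isinstance(cm, str):
        cm = cm.encode()
    return cm


def fails_with_unbound_local(value):
    """Return True if the stand-in raises UnboundLocalError for `value`."""
    try:
        prepare_cm_like(value)
    except UnboundLocalError:
        return True
    return False
```

Under these assumptions, a font whose `/ToUnicode` entry is neither a stream nor an `/Identity` name would be enough to trigger the error.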


pubpub-zz · 2024-06-05T18:35:06Z

please provide code and input file

pubpub-zz · 2024-06-09T07:52:47Z

@thelazydogsback please update the issue with code and input file, else we will have to close the issue as "can't reproduce"

pubpub-zz · 2024-06-17T20:29:18Z

@thelazydogsback
Please update the issue with code and input file.
Otherwise, the issue will be closed as "cannot be reproduced".

pubpub-zz · 2024-06-21T19:10:34Z

Closing this dead issue.

bazinga014 · 2024-07-14T08:48:10Z

I encountered the same issue! Here is the code and the PDF file.

from pypdf import PdfReader

for idx, page in enumerate(PdfReader(pdf_path).pages):
    page_content = ""
    # Raises UnboundLocalError: local variable 'cm' referenced before assignment
    text = page.extract_text()

Here is some information printed when I run the code:
Ignoring wrong pointing object 1758 0 (offset 12558676)
Ignoring wrong pointing object 1759 0 (offset 12561201)
Object 548 0 not defined.
Object 557 0 not defined.
Object 562 0 not defined.
[Uploading train_a.pdf…]()

pubpub-zz · 2024-07-14T08:51:34Z

@bazinga014
the file seems to be missing.

bazinga014 · 2024-07-14T08:54:23Z

train_a.pdf

bazinga014 · 2024-07-14T08:55:26Z

@pubpub-zz I just uploaded the file in the issue

pubpub-zz · 2024-07-14T09:06:38Z

@bazinga014
At first glance, opening the file with pypdf reports some errors, and Acrobat Reader confirms them. This could explain the issue.

bazinga014 · 2024-07-14T09:09:00Z

@pubpub-zz Yes, I can open it normally with Chrome's built-in PDF parser, but there are errors when opening it with Acrobat Reader. So how should I handle this situation? How can I parse it with pypdf?

stefan6419846 · 2024-07-14T09:58:45Z

Apparently, opening with pypdf is already possible - there just is some issue with the text extraction if this is about the undefined cm variable. I am going to re-open this issue for now to evaluate if there is a suitable solution for this case.
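
In the meantime, one possible workaround is to catch the failure per page, so that a single damaged font does not abort extraction for the whole document. This is a minimal sketch, not a pypdf feature: `extract_text_per_page` is a hypothetical helper, and it only assumes that each page object exposes an `extract_text()` method, as pypdf pages do.

```python
def extract_text_per_page(pages):
    """Extract text page by page, recording failures instead of raising.

    Returns a tuple (texts, failures): `texts` has one entry per page
    (an empty string where extraction failed), and `failures` lists
    (page_index, error_description) pairs.
    """
    texts = []
    failures = []
    for idx, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:  # broad on purpose: damaged PDFs fail in many ways
            texts.append("")
            failures.append((idx, repr(exc)))
    return texts, failures
```

With pypdf this could be called as `extract_text_per_page(PdfReader(pdf_path).pages)`. Pages listed in `failures` would still need a fix on the pypdf side or a different tool.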

pubpub-zz · 2024-07-14T12:39:49Z

@bazinga014

In _cmap.py, can you try this patch:

def prepare_cm(ft: DictionaryObject) -> bytes:
    tu = ft["/ToUnicode"]
    cm: bytes
    if isinstance(tu, StreamObject):
        cm = b_(cast(DecodedStreamObject, ft["/ToUnicode"]).get_data())
    elif (tu is None) or (isinstance(tu, str) and tu.startswith("/Identity")):  # <- replaces the existing elif line
        # the full range 0000-FFFF will be processed
        cm = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"

Please confirm that the output is OK.
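
As far as it can be read from the snippet, the intent of the patch is that a missing `/ToUnicode` entry falls back to the same identity range as an `/Identity` name. Below is a simplified sketch of that decision using plain Python values instead of pypdf objects. `choose_cmap_bytes` and `IDENTITY_CMAP` are hypothetical names, and `bytes` stands in for a stream's decoded data. Raising `ValueError` for any other input is a choice of this sketch, not something the patch does.

```python
# Fallback CMap text copied from the patch above.
IDENTITY_CMAP = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"


def choose_cmap_bytes(to_unicode):
    """Pick the CMap bytes to parse, mirroring the branches of the patch."""
    if isinstance(to_unicode, bytes):
        # Stand-in for a stream object: use its data as-is.
        return to_unicode
    if to_unicode is None or (
        isinstance(to_unicode, str) and to_unicode.startswith("/Identity")
    ):
        # Missing entry or identity encoding: use the fallback range.
        return IDENTITY_CMAP
    # Assumption of this sketch: fail loudly rather than leave the result unset.
    raise ValueError(f"unsupported /ToUnicode value: {to_unicode!r}")
```

Returning from every branch (or raising) means the result can never be left unassigned, which is the failure mode reported in this issue.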

bazinga014 · 2024-07-14T13:58:26Z

@pubpub-zz OK! I will try it! thanks a lot!

thelazydogsback · 2024-07-15T17:53:00Z

Thanks @bazinga014 for following up -- I was unable to provide the file because it is a customer file that I was unable to share.

pubpub-zz · 2024-07-16T17:13:10Z

@thelazydogsback can you also try the patch ?

naktinis · 2024-07-18T12:46:26Z

@pubpub-zz also had the same issue (with a customer's private file) and the patch seems to help!

stefan6419846 added the workflow-text-extraction, needs-pdf, and needs-example-code labels Jun 5, 2024

pubpub-zz closed this as completed Jun 21, 2024

stefan6419846 closed this as not planned Jun 22, 2024

stefan6419846 reopened this Jul 14, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 18, 2024

ROB: fix extract_text issues on damaged PDFs

eb1e776

closes py-pdf#2702

pubpub-zz linked a pull request Jul 18, 2024 that will close this issue

ROB: fix extract_text() issues on damaged PDFs #2760

Open
