local variable 'cm' referenced before assignment #2702

Open
thelazydogsback opened this issue Jun 5, 2024 · 16 comments · May be fixed by #2760
Labels
needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem needs-pdf The issue needs a PDF file to show the problem workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@thelazydogsback

Trying to extract text from page.
Tested in Win11 & Linux container.
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.5'), PIL=none

Traceback

File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2052, in extract_text
    return self._layout_mode_text(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1950, in _layout_mode_text
    fonts = self._layout_mode_fonts()
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1902, in _layout_mode_fonts
    *cmap, font_dict_obj = build_char_map(font_name, 200.0, self)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 58, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 235, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/usr/local/lib/python3.10/site-packages/pypdf/_cmap.py", line 260, in prepare_cm
    if isinstance(cm, str):
UnboundLocalError: local variable 'cm' referenced before assignment
@pubpub-zz
Collaborator

Please provide the code and the input file.

@stefan6419846 stefan6419846 added workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow needs-pdf The issue needs a PDF file to show the problem needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem labels Jun 5, 2024
@pubpub-zz
Collaborator

@thelazydogsback Please update the issue with the code and the input file; otherwise we will have to close the issue as "can't reproduce".

@pubpub-zz
Collaborator

@thelazydogsback
Please update the issue with the code and the input file.
Otherwise, the issue will be closed as not reproducible.

@pubpub-zz
Collaborator

I am closing this issue as stale.

@stefan6419846 stefan6419846 closed this as not planned Jun 22, 2024
@bazinga014

I encountered the same issue! Here is the code and the PDF file.

from pypdf import PdfReader

pdf_path = "train_a.pdf"  # path to the file attached to this issue

for idx, page in enumerate(PdfReader(pdf_path).pages):
    page_content = ""
    text = page.extract_text()  # UnboundLocalError: local variable 'cm' referenced before assignment

Here is some additional information that is printed when I run the code:

Ignoring wrong pointing object 1758 0 (offset 12558676)
Ignoring wrong pointing object 1759 0 (offset 12561201)
Object 548 0 not defined.
Object 557 0 not defined.
Object 562 0 not defined.

[Uploading train_a.pdf…]()

@pubpub-zz
Collaborator

@bazinga014
the file seems to be missing.

@bazinga014

train_a.pdf

@bazinga014

@pubpub-zz I just uploaded the file in the issue

@pubpub-zz
Collaborator

@bazinga014
At first glance, the file itself contains errors: pypdf reports them when opening it, and Acrobat Reader confirms them. This could explain the issue.

@bazinga014

@pubpub-zz Yes, I can open it normally with Chrome's built-in PDF parser, but there are errors when opening it with Acrobat Reader. So how should I handle this situation? How can I parse it with pypdf?

@stefan6419846
Collaborator

Apparently, opening with pypdf is already possible - there just is some issue with the text extraction if this is about the undefined cm variable. I am going to re-open this issue for now to evaluate if there is a suitable solution for this case.
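
Until the extraction error is resolved, one possible workaround is to extract text page by page and record the pages that fail instead of aborting the whole run. The following is only a minimal sketch: `extract_text_per_page` is a hypothetical helper, not part of pypdf, and it assumes nothing more than that each page object offers an `extract_text()` method.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_text_per_page(
    pages: Iterable[Any],
) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Hypothetical helper: call extract_text() on every page, keep going on errors.

    Returns the per-page texts (None where extraction failed) and a list of
    (page index, error description) pairs for the failed pages.
    """
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for idx, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:  # includes the UnboundLocalError reported here
            texts.append(None)
            failures.append((idx, f"{type(exc).__name__}: {exc}"))
    return texts, failures
```

With pypdf, this could be called as `extract_text_per_page(PdfReader(pdf_path).pages)`. Pages affected by the `cm` error would then appear in the failure list while the remaining pages are still extracted.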

@stefan6419846 stefan6419846 reopened this Jul 14, 2024
@pubpub-zz
Collaborator

@bazinga014

In _cmap.py, can you try this patch:

def prepare_cm(ft: DictionaryObject) -> bytes:
    tu = ft["/ToUnicode"]
    cm: bytes
    if isinstance(tu, StreamObject):
        cm = b_(cast(DecodedStreamObject, ft["/ToUnicode"]).get_data())
    elif (tu is None) or (isinstance(tu, str) and tu.startswith("/Identity")):  # <- replacement for the existing elif line
        # the full range 0000-FFFF will be processed
        cm = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"
    # ... rest of prepare_cm unchanged

and confirm that the output is OK?
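
To see why the extra `tu is None` check in the patch above matters, here is a small stand-alone sketch of the control flow. It does not use pypdf: `bytes` stands in for a `/ToUnicode` stream and `None` for an entry whose object could not be resolved. The function names and the exact pre-patch condition are assumptions made for illustration.

```python
# Stand-in sketch: plain Python values replace pypdf's object types.
# bytes plays the role of a /ToUnicode stream, None a missing object.
IDENTITY_CMAP = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"


def prepare_cm_before(tu):
    """Assumed pre-patch logic: no branch handles tu being None."""
    if isinstance(tu, bytes):
        cm = tu
    elif isinstance(tu, str) and tu.startswith("/Identity"):
        cm = IDENTITY_CMAP
    # If neither branch ran, cm was never assigned and the next line raises.
    return cm.strip()


def prepare_cm_after(tu):
    """Logic with the patched condition: None falls back to the identity map."""
    if isinstance(tu, bytes):
        cm = tu
    elif tu is None or (isinstance(tu, str) and tu.startswith("/Identity")):
        cm = IDENTITY_CMAP
    return cm.strip()


try:
    prepare_cm_before(None)
except UnboundLocalError as exc:
    print("before:", type(exc).__name__)

print("after:", prepare_cm_after(None))
```

Note that a value that is neither a stream, `None`, nor an `/Identity` name would still leave `cm` unassigned in both variants; the patch only covers the missing-object case.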

@bazinga014

@pubpub-zz OK! I will try it! Thanks a lot!

@thelazydogsback
Author

Thanks @bazinga014 for following up -- I was unable to provide the file because it is a customer file that I was unable to share.

@pubpub-zz
Collaborator

@thelazydogsback Can you also try the patch?

@naktinis

@pubpub-zz I also had the same issue (with a customer's private file), and the patch seems to help!

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 18, 2024
@pubpub-zz pubpub-zz linked a pull request Jul 18, 2024 that will close this issue

5 participants