TypeError while loading some documents, caused by _cmap.py, line 93 #2286

Closed
elhele opened this issue Nov 8, 2023 · 4 comments · Fixed by #2288

@elhele

elhele commented Nov 8, 2023

While using the library I'm getting the following error:
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'int'

Environment

Which environment were you using when you encountered the problem?

macOS-10.16-x86_64-i386-64bit
pypdf==3.17.0, crypt_provider=('cryptography', '37.0.4'), PIL=9.0.1

It happens locally as well as during Azure-deployment.

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader(file)
pages = reader.pages

Unfortunately I cannot share the document that causes the problem as it contains sensitive information. I also couldn't reproduce it with other documents. This adjustment after line 89, however, solves the problem:

import pypdf
...
sp_width = compute_space_width(ft, sp, space_width)
sp_width = sp_width if type(sp_width) != pypdf.generic._base.IndirectObject else sp_width.get_object()

Traceback

This is the complete Traceback I see:

  File "..././scripts/prepdocs.py", line 262, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File ".../opt/anaconda3/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "..././scripts/prepdocs.py", line 137, in main
    await strategy.run(search_info)
  File ".../scripts/prepdocslib/filestrategy.py", line 58, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File ".../scripts/prepdocslib/filestrategy.py", line 58, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File ".../scripts/prepdocslib/pdfparser.py", line 52, in parse
    page_text = p.extract_text()
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_page.py", line 2284, in extract_text
    return self._extract_text(
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_page.py", line 1903, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_cmap.py", line 93, in build_char_map_from_dict
    float(sp_width / 2),
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'int'
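
The last frame shows the cause: `sp_width` is still an indirect reference rather than a number when it is divided by 2. The following is a minimal sketch of that failure mode. It uses a hypothetical stand-in class (`FakeIndirectObject`, which is not pypdf's real `IndirectObject`) so that it runs without pypdf or the affected PDF:

```python
class FakeIndirectObject:
    """Hypothetical stand-in for an unresolved PDF indirect reference."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Resolve the reference to the value it points at.
        return self._target


def half_width(sp_width):
    # Same arithmetic as the failing line in the traceback.
    return float(sp_width / 2)


ref = FakeIndirectObject(500.0)

try:
    half_width(ref)
    error_message = None
except TypeError as exc:
    # The stand-in defines no division operator, so "/" fails.
    error_message = str(exc)

# Resolving the reference first makes the division work.
resolved_half = half_width(ref.get_object())

print(error_message)
print(resolved_half)
```

The division only succeeds once the reference has been resolved with `get_object()`, which is what the adjustment above does.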

@pubpub-zz
Collaborator

@elhele
Can you add the two marked lines below to the _cmap.py file, at the end of the function compute_space_width (normally line 460),

    if isinstance(sp_width,indirect_object):    ## to be added
        sp_width = sp_width.get_object()      ## to be added
    return sp_width

and check whether it fixes the error?

@elhele
Author

elhele commented Nov 9, 2023

Hello @pubpub-zz thank you so much for such a fast reply!

"indirect_object" in the code above doesn't seem to be defined. If I change it as follows, it works perfectly and fixes the error:

    indirect_object = pypdf.generic._base.IndirectObject  ## added
    if isinstance(sp_width, indirect_object):  ## added
        sp_width = sp_width.get_object()  ## added
    return sp_width
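
For reference, `IndirectObject` can also be imported with `from pypdf.generic import IndirectObject`, which avoids reaching into the private `_base` module. The sketch below shows the same guard as a small helper. It is an illustration only, not the merged fix: `resolve_width` and `StubReference` are hypothetical names, and the helper checks for a `get_object` method instead of using `isinstance` so that it runs without pypdf installed.

```python
def resolve_width(sp_width):
    """Return the underlying value, following an indirect reference if needed."""
    # Assumption: an unresolved reference exposes get_object(), as in the
    # snippet above. Plain numbers have no such method and pass through.
    get_object = getattr(sp_width, "get_object", None)
    if callable(get_object):
        sp_width = get_object()
    return sp_width


class StubReference:
    """Hypothetical stand-in used only to exercise resolve_width."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


plain = resolve_width(600)
resolved = resolve_width(StubReference(600))
print(plain / 2, resolved / 2)
```

Both calls return a plain number, so the later division by 2 works in either case.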

@MartinThoma
Member

@elhele A PR with the fix was created. I expect the fix to be on PyPI by the end of the week.

I'm curious which program created the bad PDF. Can you share that?

from pypdf import PdfReader
reader = PdfReader('example.pdf')
print(reader.metadata)

It should be something like:

{'/ModDate': "D:20220901234405+02'00'",
 '/Creator': 'pdftk-java 3.2.2',
 '/CreationDate': "D:20220901234405+02'00'",
 '/Producer': 'itext-paulo-155 (itextpdf.sf.net-lowagie.com)'}

I'm especially interested in the /Creator and /Producer.

@elhele
Author

elhele commented Nov 9, 2023

@MartinThoma thank you very much!

That's what I'm getting:

{'/Creator': 'Sejda Console 3.2.75',
 '/Producer': 'SAMBox 1.1.53 (www.sejda.org)',
 '/ModDate': "D:20230531143456+02'00'"}

I'm getting this error with only one page of one document in my database. That page looks somewhat out of place, so I don't think this problem occurs very often.
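
When only the two fields asked about are needed, they can be picked out of the metadata mapping directly. This is a minimal sketch under stated assumptions: `creator_and_producer` is a hypothetical helper, the metadata is treated as a plain dict copied from the output above, and `None` is handled because a document may carry no metadata at all.

```python
def creator_and_producer(metadata):
    """Pick out /Creator and /Producer; missing values become None."""
    if metadata is None:
        # No metadata available for this document.
        return None, None
    return metadata.get("/Creator"), metadata.get("/Producer")


# Values copied from the output reported in this thread.
metadata = {
    "/Creator": "Sejda Console 3.2.75",
    "/Producer": "SAMBox 1.1.53 (www.sejda.org)",
    "/ModDate": "D:20230531143456+02'00'",
}

creator, producer = creator_and_producer(metadata)
print(f"Creator: {creator}")
print(f"Producer: {producer}")
```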

MartinThoma added a commit that referenced this issue Nov 14, 2023
Fixes #2286

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>