Wrong text bboxes for some Type 3 font #1433

Closed
LaiSongxuan opened this issue Dec 2, 2021 · 16 comments
Labels: upstream bug (bug outside this package)

Comments

@LaiSongxuan

LaiSongxuan commented Dec 2, 2021

Hello, JorjMcKie~

PyMuPDF extracts wrong bbox for the following pdf:
https://influencermarketinghub.com/ebooks/IMH_SOCIAL_BENCHMARK_REPORT_2021.pdf

The bbox is like this:
[(177.83700561523438, -85899337728.0, 287.6545104980469, 85899345920.0, 'Social', 0, 0, 0), ...]
(However, pdfplumber gives correct results for this document)

Another example with wrong bbox:
http://iapsop.com/ssoc/1902__ione___food_studies.pdf

Could you please give a workaround for this bug?
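
Until a fixed version is available, one possible stopgap is to flag words whose rectangle cannot fit on the page. The sketch below is only an illustration: `split_sane_words`, its `tolerance` parameter, the A4 page size, and the second sample word are assumptions made for the example, not part of PyMuPDF or of this report. The first tuple is the one quoted above.

```python
def split_sane_words(words, page_width, page_height, tolerance=1.0):
    """Split word tuples into (sane, suspicious) lists.

    Each word is assumed to be (x0, y0, x1, y1, text, block_no, line_no, word_no).
    A word counts as sane when its rectangle is well ordered and lies within
    the page, allowing `tolerance` points of slack on every side.
    """
    sane, suspicious = [], []
    for word in words:
        x0, y0, x1, y1 = word[:4]
        inside = (
            -tolerance <= x0 <= x1 <= page_width + tolerance
            and -tolerance <= y0 <= y1 <= page_height + tolerance
        )
        (sane if inside else suspicious).append(word)
    return sane, suspicious


# The tuple reported in this issue, plus one made-up plausible word.
reported = (177.83700561523438, -85899337728.0, 287.6545104980469,
            85899345920.0, "Social", 0, 0, 0)
plausible = (72.0, 100.0, 110.0, 112.0, "Report", 0, 0, 1)  # hypothetical

# 595 x 842 points (A4) is an assumed page size for the example.
sane, suspicious = split_sane_words([reported, plausible], 595.0, 842.0)
print([w[4] for w in suspicious])
```

With PyMuPDF, the inputs would come from page.get_text("words"), page.rect.width and page.rect.height. Text that legitimately runs off the page would be flagged too, so this is a diagnostic rather than a fix.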

@JorjMcKie
Collaborator

JorjMcKie commented Dec 2, 2021

The following info is missing:

  • output of fitz.__doc__
  • page number where this occurs
  • I had to guess that you used page.get_text("words"). Is that correct?

The second example file cannot be downloaded because of security issues!

@LaiSongxuan
Author

LaiSongxuan commented Dec 2, 2021

Thank you for your quick reply.

fitz.__doc__ is as follows:
'\nPyMuPDF 1.19.2: Python bindings for the MuPDF 1.19.0 library.\nVersion date: 2021-11-20 00:00:01.\nBuilt for Python 3.7 on linux (64-bit).\n'

The bug occurs throughout the whole document, and yes, page.get_text("words") returns the wrong coordinates as well.

@JorjMcKie
Collaborator

PyMuPDF cannot yet deal with fonts having a bbox of [0 0 0 0] - which is the case here.
I am working on 1.19.3 which resolves this.
Let me finish a few things first, then you can download a preliminary wheel. Should be within the next few hours.
I will keep you posted.
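
To check whether a given PDF is affected, one could look for fonts whose dictionary carries an all-zero FontBBox. This is a rough sketch under stated assumptions: `parse_number_array`, `is_all_zero_font_bbox` and `fonts_with_zero_bbox` are hypothetical helper names, and the lookup only reads the font dictionary itself (where Type 3 fonts keep FontBBox), so it does not follow indirect references or font descriptors.

```python
def parse_number_array(value):
    """Parse PDF array source such as '[0 0 0 0]' into floats, or None."""
    tokens = value.strip().strip("[]").split()
    try:
        return [float(token) for token in tokens]
    except ValueError:
        return None


def is_all_zero_font_bbox(value):
    """True if `value` is a four-number array whose entries are all zero."""
    numbers = parse_number_array(value)
    return (
        numbers is not None
        and len(numbers) == 4
        and all(n == 0 for n in numbers)
    )


def fonts_with_zero_bbox(pdf_path, page_number=0):
    """List (xref, basefont) of fonts on a page with FontBBox [0 0 0 0].

    Hypothetical helper; PyMuPDF is imported lazily so the pure functions
    above can be used without it.
    """
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    hits = []
    for font in page.get_fonts():
        xref = font[0]
        kind, value = doc.xref_get_key(xref, "FontBBox")
        if kind == "array" and is_all_zero_font_bbox(value):
            hits.append((xref, font[3]))
    return hits


print(is_all_zero_font_bbox("[0 0 0 0]"))
```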

@LaiSongxuan
Author

Great. Looking forward to the new version.

@JorjMcKie
Collaborator

Done. Download from here.

@LaiSongxuan
Author

LaiSongxuan commented Dec 2, 2021

Thank you! The new wheel works for most text lines, but there are still some bad bboxes, e.g., the last several lines ("the total world's population ...") of page 3 of the above PDF.

PS: page.get_text_words() returns the correct bboxes, but span['bbox'] does not.
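
Since the word bboxes are reported as correct here, line rectangles could be rebuilt from them as a stopgap. The following is a minimal sketch, assuming the usual word tuple layout (x0, y0, x1, y1, text, block_no, line_no, word_no); `line_bboxes_from_words` is a hypothetical helper and the sample coordinates are made up.

```python
from collections import defaultdict


def line_bboxes_from_words(words):
    """Group word tuples by (block_no, line_no) and union their rectangles.

    Returns a dict mapping (block_no, line_no) to
    (x0, y0, x1, y1, text), where text joins the words in word_no order.
    """
    lines = defaultdict(list)
    for word in words:
        lines[(word[5], word[6])].append(word)

    result = {}
    for key, line_words in lines.items():
        line_words.sort(key=lambda w: w[7])
        result[key] = (
            min(w[0] for w in line_words),
            min(w[1] for w in line_words),
            max(w[2] for w in line_words),
            max(w[3] for w in line_words),
            " ".join(w[4] for w in line_words),
        )
    return result


# Made-up sample words for illustration only.
sample_words = [
    (10.0, 20.0, 40.0, 32.0, "the", 0, 0, 0),
    (44.0, 20.0, 80.0, 32.0, "total", 0, 0, 1),
    (10.0, 36.0, 60.0, 48.0, "world's", 0, 1, 0),
]
line_boxes = line_bboxes_from_words(sample_words)
print(line_boxes[(0, 0)])
```

This trades span-level detail (font, size, color) for rectangles that do not depend on the faulty span bbox.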

@JorjMcKie
Collaborator

BTW: this actually is an upstream bug (MuPDF): the XML output produced by mutool draw -o page1.xml -F stext x.pdf 1 delivers the same results.
You should file a bug at their issue site: https://bugs.ghostscript.com/enter_bug.cgi

@JorjMcKie
Collaborator

I will look into the remaining bad bboxes on page 3.

@JorjMcKie
Collaborator

Found the problem:
MuPDF returns crazy x coordinates for the newline character \n. And because I compute all bboxes as the union of single character bboxes, those crazy x coordinates percolate up the hierarchy.
So I added another correction, which is about to be published in the folder I mentioned before. Should be finished in a minute.
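
The effect described above can be illustrated with a small sketch. It is not the actual PyMuPDF correction, which is not shown in this thread: `union`, `span_bbox` and the `skip_newlines` flag are hypothetical names, the character dictionaries only mimic a "c" plus "bbox" layout, and all coordinates are made up.

```python
def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def span_bbox(chars, skip_newlines=True):
    """Union of the character rectangles, optionally ignoring '\n'."""
    bbox = None
    for ch in chars:
        if skip_newlines and ch["c"] == "\n":
            continue  # do not let the newline's rectangle leak into the union
        bbox = ch["bbox"] if bbox is None else union(bbox, ch["bbox"])
    return bbox


# Two ordinary characters and a newline with absurd x coordinates (made up).
chars = [
    {"c": "a", "bbox": (10.0, 20.0, 16.0, 32.0)},
    {"c": "b", "bbox": (16.0, 20.0, 22.0, 32.0)},
    {"c": "\n", "bbox": (-8.5e10, 20.0, 8.5e10, 32.0)},
]

poisoned = span_bbox(chars, skip_newlines=False)
corrected = span_bbox(chars)
print(poisoned)
print(corrected)
```

Because line and block rectangles are themselves unions of their children, a single bad character rectangle would widen every level above it in the same way.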

@JorjMcKie
Collaborator

I do recommend that you submit a bug report to MuPDF!
I corrected the error in PyMuPDF text processing, but I can do nothing in terms of (X)HTML / XML output and also in terms of the redaction logic ... this is MuPDF's code which I must use blindly ...

@JorjMcKie
Collaborator

New wheels are now ready to download.

@JorjMcKie
Collaborator

When you submit the bug report to MuPDF, you must include an example PDF page and a script (not Python PyMuPDF!). It is best to use the CLI mutool example I mentioned in a previous post.

@LaiSongxuan
Author

Thank you. I will report this bug to MuPDF later as suggested.

JorjMcKie added the "upstream bug" label (bug outside this package) and removed the "bug" label on Dec 2, 2021
@JorjMcKie
Collaborator

@LaiSongxuan - please let me have the bug id assigned by MuPDF's bug tracker. I want to put myself on the CC list there.

JorjMcKie changed the title from "Wrong bbox coordinates" to "Wrong text bboxes for a Type 3 font" on Dec 7, 2021
@LaiSongxuan
Author

@JorjMcKie The bug id is 704747.

JorjMcKie changed the title from "Wrong text bboxes for a Type 3 font" to "Wrong text bboxes for some Type 3 font" on Dec 8, 2021
@JorjMcKie
Collaborator

New version 1.19.3 is being uploaded to PyPI.
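
To confirm that an installation already includes the fix, the version can be read from the fitz.__doc__ text quoted earlier in this thread. A minimal sketch, assuming the doc string keeps the "PyMuPDF X.Y.Z" wording shown there; `pymupdf_version` is a hypothetical helper name.

```python
import re


def pymupdf_version(doc_string):
    """Extract (major, minor, patch) from a fitz.__doc__-style string."""
    match = re.search(r"PyMuPDF (\d+)\.(\d+)\.(\d+)", doc_string)
    if match is None:
        raise ValueError("no PyMuPDF version found in doc string")
    return tuple(int(part) for part in match.groups())


# The doc string reported earlier in this issue.
reported_doc = (
    "\nPyMuPDF 1.19.2: Python bindings for the MuPDF 1.19.0 library.\n"
    "Version date: 2021-11-20 00:00:01.\n"
    "Built for Python 3.7 on linux (64-bit).\n"
)

# Tuples compare element by element, so this is a proper version comparison.
has_fix = pymupdf_version(reported_doc) >= (1, 19, 3)
print(pymupdf_version(reported_doc), has_fix)
```

In a live session the argument would be fitz.__doc__ itself rather than the pasted string.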
