Wrong text bboxes for some Type 3 font #1433

LaiSongxuan · 2021-12-02T08:45:32Z

Hello, JorjMcKie~

PyMuPDF extracts wrong bboxes for the following PDF:
https://influencermarketinghub.com/ebooks/IMH_SOCIAL_BENCHMARK_REPORT_2021.pdf

The bbox is like this:
[(177.83700561523438, -85899337728.0, 287.6545104980469, 85899345920.0, 'Social', 0, 0, 0), ...]
(However, pdfplumber gives correct results for this document)

Another example with wrong bbox:
http://iapsop.com/ssoc/1902__ione___food_studies.pdf

Could you please give a workaround for this bug?
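
Until a fixed version is available, one possible stopgap is to flag the word tuples whose coordinates fall far outside the page. This is a minimal sketch, assuming the 8-tuple layout (x0, y0, x1, y1, word, block_no, line_no, word_no) seen in the output above. The helper name split_words_by_bbox, the tolerance parameter, the A4-sized page rectangle, and the second sample tuple are all made up for illustration:

```python
import math

def split_words_by_bbox(words, page_rect, tolerance=50.0):
    """Split word tuples into (plausible, suspicious) lists.

    words:     tuples shaped like the items of page.get_text("words")
    page_rect: (x0, y0, x1, y1) of the page, e.g. tuple(page.rect)
    tolerance: hypothetical slack in points allowed around the page
    """
    px0, py0, px1, py1 = page_rect
    good, bad = [], []
    for w in words:
        x0, y0, x1, y1 = w[:4]
        ok = (
            all(math.isfinite(v) for v in (x0, y0, x1, y1))
            and px0 - tolerance <= x0 <= x1 <= px1 + tolerance
            and py0 - tolerance <= y0 <= y1 <= py1 + tolerance
        )
        (good if ok else bad).append(w)
    return good, bad

# The first tuple is the one reported above; the second is invented sample data.
words = [
    (177.83700561523438, -85899337728.0, 287.6545104980469, 85899345920.0, "Social", 0, 0, 0),
    (72.0, 100.0, 120.0, 112.0, "sample", 0, 1, 0),
]
# Assumed page size (A4 in points); a real script would use the page's own rectangle.
good, bad = split_words_by_bbox(words, (0.0, 0.0, 595.0, 842.0))
```

This only hides the symptom: the flagged words still need corrected coordinates from a fixed release.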

JorjMcKie · 2021-12-02T10:14:12Z

The following info is missing:

output of fitz.__doc__
page number where this occurs
you made me guess you used page.get_text("words")- correct?

The second example file cannot be downloaded b/o security issues!

LaiSongxuan · 2021-12-02T10:29:39Z

> The following info is missing:
>
> output of fitz.__doc__
> page number where this occurs
> you made me guess you used page.get_text("words")- correct?
>
> The second example file cannot be downloaded b/o security issues!

Thank you for your quick reply.

fitz.__doc__ is as follows:
'\nPyMuPDF 1.19.2: Python bindings for the MuPDF 1.19.0 library.\nVersion date: 2021-11-20 00:00:01.\nBuilt for Python 3.7 on linux (64-bit).\n'

The bug occurs throughout the whole document, and page.get_text("words") returns wrong coordinates as well.

JorjMcKie · 2021-12-02T10:46:31Z

PyMuPDF cannot yet deal with fonts having a bbox of [0 0 0 0] - which is the case here.
I am working on 1.19.3 which resolves this.
Let me finish a few things first, then you can download a preliminary wheel. Should be within the next few hours.
I will keep you posted.
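
To illustrate why a font bbox of [0 0 0 0] causes trouble: if the vertical extent of a glyph is derived from the font bbox, an all-zero bbox yields no usable height. The following is a rough sketch of one way to guard against that, not PyMuPDF's actual code. The function name, the 1/1000 em units, and the fallback values 0.9 and -0.1 are assumptions chosen for illustration (a Type 3 font defines its own glyph space through its FontMatrix):

```python
def glyph_vertical_extent(font_bbox, fontsize, origin_y,
                          fallback_asc=0.9, fallback_desc=-0.1):
    """Return (top, bottom) page y-coordinates for a glyph on baseline origin_y.

    font_bbox: (llx, lly, urx, ury), assumed here to be in 1/1000 em units
    Page y is assumed to grow downwards, as in PyMuPDF coordinates.
    """
    llx, lly, urx, ury = font_bbox
    asc, desc = ury / 1000.0, lly / 1000.0
    if asc - desc < 1e-3:
        # Degenerate bbox such as [0 0 0 0]: use illustrative fallback values.
        asc, desc = fallback_asc, fallback_desc
    return origin_y - asc * fontsize, origin_y - desc * fontsize

# Hypothetical glyph: size 10 on a baseline at y = 100, font bbox all zeros.
top, bottom = glyph_vertical_extent((0, 0, 0, 0), 10.0, 100.0)
```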

LaiSongxuan · 2021-12-02T11:01:52Z

Great. Looking forward to the new version.

JorjMcKie · 2021-12-02T11:19:30Z

Done. Download from here.

LaiSongxuan · 2021-12-02T11:32:09Z

Thank you! The new wheel works for most text lines, but there are still some bad bboxes, e.g., the last several lines ("the total world's population ...") of page 3 of the above PDF.

PS, page.get_text_words() returns the correct bboxes, but span['bbox'] does not.
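
Since the word bboxes are reported as correct, one conceivable interim workaround is to rebuild line-level boxes from the word tuples instead of relying on span['bbox']. A minimal sketch, assuming the word tuple layout (x0, y0, x1, y1, word, block_no, line_no, word_no); the function name and the sample values are invented for illustration:

```python
from collections import defaultdict

def line_bboxes_from_words(words):
    """Map (block_no, line_no) to the union bbox of the words on that line."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, _text, block_no, line_no, _word_no in words:
        lines[(block_no, line_no)].append((x0, y0, x1, y1))
    return {
        key: (
            min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes),
        )
        for key, boxes in lines.items()
    }

# Invented sample data standing in for the output of page.get_text("words").
words = [
    (72.0, 100.0, 100.0, 112.0, "the", 0, 0, 0),
    (104.0, 100.0, 140.0, 112.0, "total", 0, 0, 1),
    (72.0, 116.0, 130.0, 128.0, "world's", 0, 1, 0),
]
result = line_bboxes_from_words(words)
```

The result gives one box per line rather than per span, so it is only a substitute where line granularity is sufficient.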

JorjMcKie · 2021-12-02T11:35:17Z

BTW: this actually is an upstream bug (MuPDF): the XML output of mutool draw -o page1.xml -F stext x.pdf 1 delivers the same results.
You should file a bug at their issue site: https://bugs.ghostscript.com/enter_bug.cgi

JorjMcKie · 2021-12-02T11:35:50Z

> Thank you! The new wheel works for most text lines, but there are still some bad bboxes, e.g., the last several lines ("the total world's population ...") of page 3 of the above PDF.

I will look into this.

JorjMcKie · 2021-12-02T14:01:17Z

Found the problem:
MuPDF returns crazy x coordinates for the newline character \n. And because I compute all bboxes as the union of single character bboxes, those crazy x coordinates percolate up the hierarchy.
So I added another correction, which is about to be published in the folder I pointed you to before. Should be finished in a minute.
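
The percolation effect described above can be reproduced with plain tuples. This is a minimal sketch, not PyMuPDF's implementation: it builds a bbox as the union of character bboxes and, as one possible correction, skips newline characters and non-finite coordinates. The (char, bbox) pair layout, the function name, and the sample values are assumptions for illustration:

```python
import math

def union_bbox(chars, skip_newline=True):
    """Union of character bboxes; chars is a list of (c, (x0, y0, x1, y1))."""
    boxes = [
        b for c, b in chars
        if not (skip_newline and c == "\n") and all(math.isfinite(v) for v in b)
    ]
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes), min(b[1] for b in boxes),
        max(b[2] for b in boxes), max(b[3] for b in boxes),
    )

# Invented sample data: two normal glyphs plus a newline with a wild x range.
chars = [
    ("a", (10.0, 20.0, 15.0, 30.0)),
    ("b", (15.0, 20.0, 20.0, 30.0)),
    ("\n", (-8.0e10, 20.0, 8.0e10, 30.0)),
]
naive = union_bbox(chars, skip_newline=False)  # the newline box dominates
fixed = union_bbox(chars)                      # the newline box is ignored
```

Because every higher-level box (span, line, block) is again a union of the level below, a single bad character box is enough to spoil all of them.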

JorjMcKie · 2021-12-02T14:03:45Z

I do recommend that you submit a bug report to MuPDF!
I corrected the error in PyMuPDF text processing, but I can do nothing about the (X)HTML / XML output or the redaction logic ... this is MuPDF's code, which I must use blindly ...

JorjMcKie · 2021-12-02T14:04:15Z

New wheels are now ready to download.

JorjMcKie · 2021-12-02T14:16:52Z

> I do recommend that you submit a bug report to MuPDF!

When you do that, you must include an example PDF page and a script (not Python PyMuPDF!). Best use the CLI mutool example I mentioned in a previous post.
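
When preparing such a report, it may help to confirm the problem in the mutool output itself, without going through PyMuPDF. The sketch below scans structured-text XML for implausible boxes. It assumes that elements carry a bbox attribute holding four space-separated numbers, which should be verified against the actual page1.xml. The function name, the limit threshold, and the sample XML are hypothetical:

```python
import xml.etree.ElementTree as ET

def find_bad_bboxes(xml_text, limit=1.0e6):
    """Return (tag, bbox) for every element whose bbox exceeds +/- limit."""
    bad = []
    for elem in ET.fromstring(xml_text).iter():
        raw = elem.get("bbox")
        if raw is None:
            continue
        bbox = tuple(float(v) for v in raw.split())
        if len(bbox) != 4 or any(abs(v) > limit for v in bbox):
            bad.append((elem.tag, bbox))
    return bad

# Invented sample; a real check would read the file written by mutool instead.
sample = """<page>
<block bbox="72 100 300 112"><line bbox="72 100 300 112"/></block>
<block bbox="177.8 -85899337728 287.7 85899345920"/>
</page>"""
bad = find_bad_bboxes(sample)
```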

LaiSongxuan · 2021-12-02T15:03:38Z

Thank you. I will report this bug to MuPDF later as suggested.

JorjMcKie · 2021-12-07T10:31:34Z

> Thank you. I will report this bug to MuPDF later as suggested.

@LaiSongxuan - please let me have the bug id assigned by MuPDF's bug tracker. I want to put myself on the CC list there.

LaiSongxuan · 2021-12-07T10:37:40Z

@JorjMcKie The bug id is 704747.

JorjMcKie · 2021-12-12T12:03:06Z

New version 1.19.3 is being uploaded to PyPI.

LaiSongxuan added the bug label Dec 2, 2021

LaiSongxuan assigned JorjMcKie Dec 2, 2021

JorjMcKie added the upstream bug (bug outside this package) label and removed the bug label Dec 2, 2021

JorjMcKie changed the title ~~Wrong bbox coordinates~~ Wrong text bboxes for a Type 3 font Dec 7, 2021

JorjMcKie changed the title ~~Wrong text bboxes for a Type 3 font~~ Wrong text bboxes for some Type 3 font Dec 8, 2021

JorjMcKie closed this as completed Dec 12, 2021
