
pymupdf can't get the correct text from cjk pdf #87

Closed
liaicheng opened this issue Mar 21, 2017 · 18 comments

@liaicheng

@JorjMcKie hi, I'm here! Coming over from issue #42.
When I use PyMuPDF like below:
```python
import fitz
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

ENCODING = "utf-8"
file = 'test_pdf/test.pdf'
ofile = "test_pdf/test.txt"

doc = fitz.open(file)
pages = len(doc)

fout = open(ofile, "w")

for i in range(pages):
    text = doc.getPageText(i)
    fout.write(text.encode(ENCODING, "ignore"))

fout.close()
```

PDF link: https://www.dropbox.com/s/3jxutny7v17gzet/test.pdf?dl=0
When I parse this PDF, these warnings occur:

warning: not building glyph bbox table for font 'WJZZHG+SimSun' with 22141 glyphs
warning: not building glyph bbox table for font 'WXWEBB+SimHei' with 22021 glyphs

It seems MuPDF can't parse the PDF, but it can actually render it correctly. I don't have much experience with the C language. Some other Python packages I used, like pdfminer and pdfplumber, can't handle this PDF either.
Hope I can get a response, thanks!

@JorjMcKie
Collaborator

@liaicheng -
I did quite some investigation yesterday. All of it led to more or less the same result as what you are showing above.
Here is what I did:

  1. re-generated MuPDF to include all fonts and ran the text extraction scripts again
  2. used mutool draw to extract text (this MuPDF tool also contains all fonts)
  3. used a couple of other tools, among them Word 2016, Nitro, XPDF, Able Word

All of these experiments end in the same failure. So I am out of advice, currently.

It seems that Type0 fonts can be processed correctly, but (some, not all) TrueType fonts can't.
A PDF with Chinese text generated from a Wikipedia page, for example, works correctly. The method doc.getPageFontList(0) shows something like

[122, 0, b'Type0', b'LiberationSerif-Bold', b'']
[127, 0, b'Type0', b'LiberationSerif', b'']
[132, 0, b'Type0', b'WenQuanYiZenHei', b'']
[137, 0, b'Type0', b'LiberationSerif-Italic', b'']
[142, 0, b'Type0', b'DejaVuSans', b'']

for this page and text extraction just works fine.

An example from @liaicheng however gave me

[19, 0, b'TrueType', b'OCVNVZ+KaiTi_GB2312', b'']
[8, 0, b'TrueType', b'JSRZNG+SimSun', b'']

and text extraction (plain text, JSON, XML, ...) will not produce anything useful. However, PyMuPDF and other PDF readers will correctly render this type of PDF.

So that's where I am :-(
@rk700 - any comments?

@mozbugbox
Contributor

mozbugbox commented Mar 21, 2017

You will get the same warning by opening the doc with mupdf.

I guess it's a feature of MuPDF that it won't pre-build/cache the bbox table if an embedded font has too many glyphs in it, maybe to speed up the initial loading of PDF files.

Hmm... from the code of fitz/font.c, it seems the threshold for the glyph count is 4096.

There may be some way to turn off warnings if you don't want to see the messages.

@JorjMcKie
Collaborator

by the way, messages like

warning: not building glyph bbox table for font 'WJZZHG+SimSun' with 22141 glyphs

are not errors, just warnings (about a performance optimization being skipped in this case). They occur as well when I extract text from a PDF that has been exported from a Word document (e.g. in German). Text extraction works fine in that case, even if TrueType fonts are used in the document.

@JorjMcKie
Collaborator

@mozbugbox - realized too late that you had already answered.

@liaicheng
Author

thanks @JorjMcKie @mozbugbox

yes, this warning doesn't seem serious. But the PDF I provided seems really strange: it can be rendered by Adobe Acrobat DC and by mupdf.exe, but no tool can extract any correct text from it.

In my opinion, if it can be rendered, its text should be extractable, and MuPDF should handle TrueType fonts. So maybe it is an encoding problem, or something like that?

@JorjMcKie
Collaborator

@liaicheng - I don't believe it is an encoding issue. I tried literally every codec available in Python during yesterday's experiments. None yielded a useful result. Of course, most of the codecs raised exceptions, but those that worked (GB18030, UTF-16xx, UTF-32xx, ...) did not produce recognizable text either.

@mozbugbox
Contributor

If you extract the embedded fonts from the PDF, it appears that the fonts are not encoded normally.

It seems the program that generated the PDF file used its own encoding scheme for the embedded fonts and text.

@JorjMcKie
Collaborator

@mozbugbox - any chance to heal this ...?

@mozbugbox
Contributor

I have no idea how to work around this. It could depend on whatever PDF generation program was used.

@liaicheng
Author

I just have many types of PDFs; some can be parsed, some not. Do parsing and rendering use different methods?
When using pdfminer to extract text, this PDF returned CID numbers. I guess CMaps are missing - is that correct?

@liaicheng
Author

It's really a very strange PDF. Just before, I exported it as Word in Adobe Acrobat, and some text is right (not all - other tools can't extract any correct characters).

@rk700
Contributor

rk700 commented Mar 22, 2017

I opened the PDF with Adobe, selected some text, and then copy-and-paste. What I got was some meaningless text, so I think text extraction might be unavailable for this PDF.

@mozbugbox
Contributor

mozbugbox commented Mar 22, 2017

Since pdffonts shows that the embedded fonts have a ToUnicode map, it could be that the fonts use some customized CID map, and most PDF parsers don't automatically convert text to Unicode with that ToUnicode map while extracting text. Does the MuPDF text extraction API have any tounicode flags?

I really don't know - file a bug report against the mupdf lib?

see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Edited: Or, if the author doesn't want to let people copy & paste the text from the PDF, maybe the content of the ToUnicode map is intentional garbage?
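For readers unfamiliar with ToUnicode CMaps: a well-formed bfrange maps a run of character codes onto consecutive Unicode values. A minimal, hypothetical sketch of how such entries could be parsed and applied with only the Python stdlib (an illustration only, not MuPDF's actual implementation, and it ignores bfchar entries and array destinations):

```python
import re

def parse_bfrange(cmap_text):
    """Parse `beginbfrange` entries of a ToUnicode CMap into a code->char dict.

    Handles only the simple `<src_lo> <src_hi> <dst>` form: consecutive
    source codes map to consecutive Unicode values starting at dst.
    """
    mapping = {}
    entry = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")
    for lo, hi, dst in entry.findall(cmap_text):
        lo_i, hi_i, dst_i = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(hi_i - lo_i + 1):
            mapping[lo_i + offset] = chr(dst_i + offset)
    return mapping

# Two ranges in the normal, well-formed style.
sample = """
2 beginbfrange
<41><43><0041>
<4e00><4e02><4e00>
endbfrange
"""
m = parse_bfrange(sample)
print(m[0x41], m[0x43])   # A C
print(m[0x4E01])          # 丁
```

With a healthy map like this, extraction recovers each distinct character; the problem with the test.pdf is what the embedded map actually contains.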

@liaicheng
Author

Thanks all! I will get better acquainted with CIDs and CMaps; hopefully that helps me make some sense of this.

@JorjMcKie
Collaborator

@mozbugbox - I think you are right. To my knowledge, MuPDF does not have any such option.
And I don't know of any reader or other PDF software that handles this better - including Adobe's Acrobat.

So I am about to suggest closing this issue with the won't fix flag, if @liaicheng would agree ...

@mozbugbox
Contributor

Looking closer at the test.pdf, the two embedded fonts' CIDs indeed have an intentionally bad ToUnicode mapping.

The codespacerange fields were:

1 begincodespacerange
<00><ff>
endcodespacerange

for both fonts. This code space is the single-byte, 256-code Latin-1 range, which obviously cannot represent thousands of Chinese characters. Normally it should be at least 2 bytes long for UCS-2, or 4 bytes for UTF-32.

As for the beginbfrange fields for the SimSun font, it's:

100 beginbfrange
<09><09><0009>
<0a><0a><0009>
<0b><0b><0009>
<0c><0c><0009>
<0d><0d><0009>
<0e><0e><0009>
<0f><0f><0009>
<10><10><0009>
<11><11><0009>
<12><12><0009>
<13><13><0009>
<14><14><0002>
<15><15><0002>
<16><16><0002>
<17><17><0002>
<18><18><0018>
<19><19><0018>
<1a><1a><0018>
<1b><1b><0018>
<1c><1c><0018>
<1d><1d><0018>
<1e><1e><0018>
<1f><1f><0018>
<20><20><0018>
<21><21><0018>
<22><22><0018>
<23><23><0023>
<24><24><0023>
<25><25><0023>
<26><26><0023>
<27><27><0023>
<28><28><0023>
<29><29><0023>
<2a><2a><002a>
<2b><2b><002a>
<2c><2c><002a>
<2d><2d><002a>
<2e><2e><002a>
<2f><2f><002a>
<30><30><002a>
<31><31><002a>
<32><32><002a>
<33><33><002a>
<34><34><0015>
<35><35><0015>
<36><36><0014>
<37><37><0014>
<38><38><000d>
<39><39><000d>
<3a><3a><000d>
<3b><3b><000d>
<3c><3c><000d>
<3d><3d><003d>
<3e><3e><003d>
<3f><3f><003d>
<40><40><003d>
<41><41><003d>
<42><42><003d>
<43><43><0043>
<44><44><0043>
<45><45><0043>
<46><46><0043>
<47><47><0043>
<48><48><0043>
<49><49><0043>
<4a><4a><0043>
<4b><4b><0002>
<4c><4c><0002>
<4d><4d><0002>
<4e><4e><0002>
<4f><4f><0047>
<50><50><0047>
<51><51><0047>
<52><52><0047>
<53><53><0047>
<54><54><0047>
<55><55><0047>
<56><56><0047>
<57><57><0047>
<58><58><0047>
<59><59><0047>
<5a><5a><0047>
<5b><5b><0047>
<5c><5c><0047>
<5d><5d><0047>
<5e><5e><0047>
<5f><5f><0047>
<60><60><0047>
<61><61><0047>
<62><62><0047>
<63><63><0047>
<64><64><0047>
<65><65><0065>
<66><66><0065>
<67><67><0065>
<68><68><0065>
<69><69><0065>
<6a><6a><0065>
<6b><6b><0065>
<6c><6c><0065>
endbfrange
47 beginbfrange
<6d><6d><0065>
<6e><6e><0065>
<6f><6f><0065>
<70><70><0065>
<71><71><0065>
<72><72><0072>
<73><73><0072>
<74><74><0072>
<75><75><0075>
<76><76><0075>
<77><77><0075>
<78><78><0075>
<79><79><0075>
<7a><7a><0075>
<7b><7b><0075>
<7c><7c><0075>
<7d><7d><007d>
<7e><7e><007d>
<7f><7f><007d>
<80><80><007d>
<81><81><007d>
<82><82><007d>
<83><83><007d>
<84><84><007d>
<85><85><007d>
<86><86><007d>
<87><87><0047>
<88><88><0047>
<89><89><0047>
<8a><8a><0047>
<8b><8b><008b>
<8c><8c><008b>
<8d><8d><008d>
<8e><8e><008d>
<8f><8f><008d>
<90><90><008d>
<91><91><008d>
<92><92><008d>
<93><93><008d>
<94><94><008d>
<95><95><008d>
<96><96><008d>
<97><97><008d>
<98><98><0009>
<99><99><0099>
<9a><9a><0099>
<9b><9b><0099>
endbfrange

I guess that's why we saw so many GGG (uni0047), eee (uni0065) and so on in the extracted text.

@liaicheng you can try to decode the text by mapping the CID values with a common CID map like Identity-H, GB1 or CNS instead of the embedded CID map - you may have better luck.
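To see concretely why extraction degenerates into runs of identical letters, here is a small stdlib-only sketch that applies two of the collapsed ranges from the SimSun dump above (the range boundaries are copied from that listing; the decoder itself is hypothetical):

```python
# Two entries reproduced from the broken ToUnicode map: whole runs of
# source codes collapse onto a single Unicode value instead of mapping
# to consecutive values.
degenerate = {
    **{code: 0x0047 for code in range(0x4F, 0x65)},  # <4f>..<64> -> 'G'
    **{code: 0x0065 for code in range(0x65, 0x72)},  # <65>..<71> -> 'e'
}

def extract(codes):
    """Apply the ToUnicode map the way a text extractor would."""
    return "".join(chr(degenerate[c]) for c in codes)

# Five different source codes, but only two distinct output characters --
# exactly the "GGG"/"eee" garbage seen in the extracted text.
print(extract([0x4F, 0x50, 0x51, 0x65, 0x66]))  # GGGee
```

Rendering is unaffected because the glyphs themselves are intact; only the code-to-Unicode mapping that extractors rely on has been destroyed.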

@JorjMcKie
Copy link
Collaborator

@liaicheng - any news or reaction from your side?
I would like to either close the issue or get a handle to what else we can do ...

@liaicheng
Copy link
Author

@JorjMcKie yes, no further progress on it. My takeaway is that it's hard to solve. You can close it, thanks.
