
pymupdf can't get the correct text from cjk pdf #87

Closed
liaicheng opened this issue Mar 21, 2017 · 18 comments

@liaicheng

@JorjMcKie hi, I'm here! Coming over from issue #42.
When I use PyMuPDF like below:
```python
import fitz
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

ENCODING = "utf-8"
file = 'test_pdf/test.pdf'
ofile = "test_pdf/test.txt"

doc = fitz.open(file)
pages = len(doc)

fout = open(ofile, "w")

for i in range(pages):
    text = doc.getPageText(i)
    fout.write(text.encode(ENCODING, "ignore"))

fout.close()
```

PDF link: https://www.dropbox.com/s/3jxutny7v17gzet/test.pdf?dl=0
When I parse this PDF, these warnings occur:

warning: not building glyph bbox table for font 'WJZZHG+SimSun' with 22141 glyphs
warning: not building glyph bbox table for font 'WXWEBB+SimHei' with 22021 glyphs

It seems MuPDF can't parse the PDF, but it can actually render it correctly. I don't have much experience with the C language. Some other Python packages I used, like pdfminer and pdfplumber, can't handle this PDF either.
Hope I can get a response, thanks!

@JorjMcKie
Collaborator

@liaicheng -
I did quite some investigation yesterday. All of it led to more or less the same result as what you are showing above.
Here is what I did:

  1. re-generated MuPDF to include all fonts and ran the text extraction scripts again
  2. used mutool draw to extract text (this MuPDF tool also contains all fonts)
  3. used a couple of other tools, among them Word 2016, Nitro, XPDF, Able Word

All of these experiments end in the same failure. So I am out of advice, currently.

It seems that Type0 fonts can be processed correctly, but (some, not all) TrueType fonts can't.
A PDF with Chinese text generated from a Wikipedia page, for example, works correctly. The method doc.getPageFontList(0) shows something like

[122, 0, b'Type0', b'LiberationSerif-Bold', b'']
[127, 0, b'Type0', b'LiberationSerif', b'']
[132, 0, b'Type0', b'WenQuanYiZenHei', b'']
[137, 0, b'Type0', b'LiberationSerif-Italic', b'']
[142, 0, b'Type0', b'DejaVuSans', b'']

for this page and text extraction just works fine.

An example from @liaicheng however gave me

[19, 0, b'TrueType', b'OCVNVZ+KaiTi_GB2312', b'']
[8, 0, b'TrueType', b'JSRZNG+SimSun', b'']

and text extraction (plain text, JSON, XML, ...) will not produce anything useful. However, PyMuPDF and other PDF readers will correctly render this type of PDF.

So that's where I am :-(
@rk700 - any comments?

@mozbugbox
Contributor

mozbugbox commented Mar 21, 2017

You will get the same warning by opening the doc with mupdf.

I guess it's a feature of MuPDF that it won't pre-build/cache the bbox table if an embedded font has too many glyphs in it, maybe to speed up the initial loading of PDF files.

Hmm... from the code of fitz/font.c, it seems the threshold for the glyph count is 4096.

There may be some way to turn off warnings if you don't want to see the messages.

@JorjMcKie
Collaborator

by the way, messages like

warning: not building glyph bbox table for font 'WJZZHG+SimSun' with 22141 glyphs

are not errors, just warnings (about a performance optimization being skipped in this case). They occur as well when I extract text from a PDF that has been exported from a Word document (e.g. in German). Text extraction works fine in that case, even if TrueType fonts are used in the document.

@JorjMcKie
Collaborator

@mozbugbox - realized too late that you had already answered.

@liaicheng
Author

thanks @JorjMcKie @mozbugbox

yes, this warning doesn't seem serious. But the PDF I provided seems really strange: it can be rendered by Adobe Acrobat DC and by mupdf.exe, but no tool can extract any correct text from it.

In my opinion, if it can be rendered, its text should be extractable, and MuPDF should handle TrueType fonts. So maybe it is an encoding problem, or something like that?

@JorjMcKie
Collaborator

@liaicheng - I don't believe it is an encoding issue. I tried literally every codec available in Python during yesterday's experiments. None yielded a useful result. Of course, most of the codecs raised exceptions, but those that worked (GB18030, UTF-16xx, UTF-32xx, ...) did not produce recognizable text either.

@mozbugbox
Contributor

If you extract the embedded fonts from the PDF, it appears that the fonts are not encoded normally.

It seems the program that generated the PDF file used its own encoding scheme for the embedded fonts and text.

@JorjMcKie
Collaborator

@mozbugbox - any chance to heal this ...?

@mozbugbox
Contributor

I have no idea how to work around this. It could depend on whatever PDF generation program was used.

@liaicheng
Author

I just have many types of PDFs; some can be parsed, some not. Do parsing and rendering use different methods?
When using pdfminer to extract text, this PDF returned CID numbers. I guess CMaps are missing - is that correct?

@liaicheng
Author

It's really a very strange PDF. Just before, I exported it as Word in Adobe Acrobat, and some text is right (not all - other tools can't extract any correct characters).

@rk700
Contributor

rk700 commented Mar 22, 2017

I opened the PDF with Adobe, selected some text, and then copy-and-paste. What I got was some meaningless text, so I think text extraction might be unavailable for this PDF.

@mozbugbox
Contributor

mozbugbox commented Mar 22, 2017

Since pdffonts shows that the embedded fonts have a ToUnicode map, it could be that the fonts use some customized CID map, and most PDF parsers don't automatically convert text to Unicode with that ToUnicode map while extracting text. Does the MuPDF text extraction API have any tounicode flags?

I really don't know - file a bug report against the mupdf lib?

see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Edited: Or, if the author doesn't want to let people copy & paste the text from the PDF, maybe the content of the ToUnicode map is intentional garbage?
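For readers unfamiliar with ToUnicode CMaps: a well-formed bfrange maps a run of character codes onto consecutive Unicode values. A minimal, hypothetical sketch of how such entries could be parsed and applied with only the Python stdlib (an illustration only, not MuPDF's actual implementation, and it ignores bfchar entries and array destinations):

```python
import re

def parse_bfrange(cmap_text):
    """Parse `beginbfrange` entries of a ToUnicode CMap into a code->char dict.

    Handles only the simple `<src_lo> <src_hi> <dst>` form: consecutive
    source codes map to consecutive Unicode values starting at dst.
    """
    mapping = {}
    entry = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")
    for lo, hi, dst in entry.findall(cmap_text):
        lo_i, hi_i, dst_i = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(hi_i - lo_i + 1):
            mapping[lo_i + offset] = chr(dst_i + offset)
    return mapping

# Two ranges in the normal, well-formed style.
sample = """
2 beginbfrange
<41><43><0041>
<4e00><4e02><4e00>
endbfrange
"""
m = parse_bfrange(sample)
print(m[0x41], m[0x43])   # A C
print(m[0x4E01])          # 丁
```

With a healthy map like this, extraction recovers each distinct character; the problem with the test.pdf is what the embedded map actually contains.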

@liaicheng
Author

Thanks all! I will get better acquainted with CIDs and CMaps; hopefully that helps me make some sense of this.

@JorjMcKie
Collaborator

@mozbugbox - I think you are right. To my knowledge, MuPDF does not have any such option.
And I don't know of any reader or other PDF software that handles this better - including Adobe's Acrobat.

So I am about to suggest closing this issue with the won't fix flag, if @liaicheng would agree ...

@mozbugbox
Contributor

Looking closer at the test.pdf, the two embedded fonts' CIDs indeed have an intentionally bad ToUnicode mapping.

The codespacerange fields were:

1 begincodespacerange
<00><ff>
endcodespacerange

for both fonts. This code space is the single-byte, 256-code Latin-1 range, which obviously cannot represent thousands of Chinese characters. Normally it should be at least 2 bytes long for UCS-2, or 4 bytes for UTF-32.

As for the beginbfrange fields for the SimSun font, it's:

100 beginbfrange
<09><09><0009>
<0a><0a><0009>
<0b><0b><0009>
<0c><0c><0009>
<0d><0d><0009>
<0e><0e><0009>
<0f><0f><0009>
<10><10><0009>
<11><11><0009>
<12><12><0009>
<13><13><0009>
<14><14><0002>
<15><15><0002>
<16><16><0002>
<17><17><0002>
<18><18><0018>
<19><19><0018>
<1a><1a><0018>
<1b><1b><0018>
<1c><1c><0018>
<1d><1d><0018>
<1e><1e><0018>
<1f><1f><0018>
<20><20><0018>
<21><21><0018>
<22><22><0018>
<23><23><0023>
<24><24><0023>
<25><25><0023>
<26><26><0023>
<27><27><0023>
<28><28><0023>
<29><29><0023>
<2a><2a><002a>
<2b><2b><002a>
<2c><2c><002a>
<2d><2d><002a>
<2e><2e><002a>
<2f><2f><002a>
<30><30><002a>
<31><31><002a>
<32><32><002a>
<33><33><002a>
<34><34><0015>
<35><35><0015>
<36><36><0014>
<37><37><0014>
<38><38><000d>
<39><39><000d>
<3a><3a><000d>
<3b><3b><000d>
<3c><3c><000d>
<3d><3d><003d>
<3e><3e><003d>
<3f><3f><003d>
<40><40><003d>
<41><41><003d>
<42><42><003d>
<43><43><0043>
<44><44><0043>
<45><45><0043>
<46><46><0043>
<47><47><0043>
<48><48><0043>
<49><49><0043>
<4a><4a><0043>
<4b><4b><0002>
<4c><4c><0002>
<4d><4d><0002>
<4e><4e><0002>
<4f><4f><0047>
<50><50><0047>
<51><51><0047>
<52><52><0047>
<53><53><0047>
<54><54><0047>
<55><55><0047>
<56><56><0047>
<57><57><0047>
<58><58><0047>
<59><59><0047>
<5a><5a><0047>
<5b><5b><0047>
<5c><5c><0047>
<5d><5d><0047>
<5e><5e><0047>
<5f><5f><0047>
<60><60><0047>
<61><61><0047>
<62><62><0047>
<63><63><0047>
<64><64><0047>
<65><65><0065>
<66><66><0065>
<67><67><0065>
<68><68><0065>
<69><69><0065>
<6a><6a><0065>
<6b><6b><0065>
<6c><6c><0065>
endbfrange
47 beginbfrange
<6d><6d><0065>
<6e><6e><0065>
<6f><6f><0065>
<70><70><0065>
<71><71><0065>
<72><72><0072>
<73><73><0072>
<74><74><0072>
<75><75><0075>
<76><76><0075>
<77><77><0075>
<78><78><0075>
<79><79><0075>
<7a><7a><0075>
<7b><7b><0075>
<7c><7c><0075>
<7d><7d><007d>
<7e><7e><007d>
<7f><7f><007d>
<80><80><007d>
<81><81><007d>
<82><82><007d>
<83><83><007d>
<84><84><007d>
<85><85><007d>
<86><86><007d>
<87><87><0047>
<88><88><0047>
<89><89><0047>
<8a><8a><0047>
<8b><8b><008b>
<8c><8c><008b>
<8d><8d><008d>
<8e><8e><008d>
<8f><8f><008d>
<90><90><008d>
<91><91><008d>
<92><92><008d>
<93><93><008d>
<94><94><008d>
<95><95><008d>
<96><96><008d>
<97><97><008d>
<98><98><0009>
<99><99><0099>
<9a><9a><0099>
<9b><9b><0099>
endbfrange

I guess that's why we saw so many GGG (uni0047), eee (uni0065) and so on in the extracted text.

@liaicheng you can try to decode the text by mapping the CID values with a common CID map like Identity-H, GB1 or CNS instead of the embedded CID map - you may have better luck.
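To see concretely why extraction degenerates into runs of identical letters, here is a small stdlib-only sketch that applies two of the collapsed ranges from the SimSun dump above (the range boundaries are copied from that listing; the decoder itself is hypothetical):

```python
# Two entries reproduced from the broken ToUnicode map: whole runs of
# source codes collapse onto a single Unicode value instead of mapping
# to consecutive values.
degenerate = {
    **{code: 0x0047 for code in range(0x4F, 0x65)},  # <4f>..<64> -> 'G'
    **{code: 0x0065 for code in range(0x65, 0x72)},  # <65>..<71> -> 'e'
}

def extract(codes):
    """Apply the ToUnicode map the way a text extractor would."""
    return "".join(chr(degenerate[c]) for c in codes)

# Five different source codes, but only two distinct output characters --
# exactly the "GGG"/"eee" garbage seen in the extracted text.
print(extract([0x4F, 0x50, 0x51, 0x65, 0x66]))  # GGGee
```

Rendering is unaffected because the glyphs themselves are intact; only the code-to-Unicode mapping that extractors rely on has been destroyed.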

@JorjMcKie
Copy link
Collaborator

@liaicheng - any news or reaction from your side?
I would like to either close the issue or get a handle to what else we can do ...

@liaicheng
Copy link
Author

@JorjMcKie yes, no further progress on it. My takeaway is that it's hard to solve. You can close it, thanks.
