pymupdf can't get the correct text from cjk pdf #87
Comments
@liaicheng -
All of these experiments result in the same failure, so I am currently out of advice. It seems that
for this page and text extraction just works fine. An example from @liaicheng however gave me
and text extraction (plain text, JSON, XML ...) will not produce anything useful. However, PyMuPDF and other PDF readers render this type of PDF correctly. So that's where I am :-( |
You will get the same warning by opening the doc by I guess it's a feature of MuPDF that it doesn't pre-build/cache the bbox table if an embedded font has too many glyphs in it, maybe to speed up the initial loading of PDF files. Hmm... from the code of Maybe there is some way to turn off the warning if you don't want to see the messages. |
By the way, messages like
are not errors, just warnings (about a performance optimization not being done in this case). They also occur when I extract text from a PDF that has been exported from a Word document (e.g. in German). Text extraction works fine in this case, even if |
@mozbugbox - I realized too late that you had already answered. |
Thanks @JorjMcKie @mozbugbox. Yes, this warning does not seem serious. But the PDF I offered seems really strange: it can be rendered by Adobe Acrobat DC and by mupdf.exe, yet no other tool can extract any correct text from it. In my opinion, if it can be rendered, its text should be extractable, and MuPDF should handle TrueType fonts. So maybe it is an encoding problem, or something like that? |
@liaicheng - I don't believe it is an encoding issue. I tried literally every codec available in Python during yesterday's experiments. None yielded a useful result. Of course, most of the codecs produced exceptions, but all those that worked ( |
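The kind of experiment described above can be sketched as follows. This is not the code actually used in the thread (that was not posted); it is a minimal illustration, assuming the raw bytes of the extracted text are available. The helper name `try_all_codecs` is hypothetical, and the sample bytes are made up for the demonstration.

```python
import encodings.aliases


def try_all_codecs(raw):
    """Try to decode `raw` with every codec name Python lists; keep the successes.

    Returns a dict mapping codec name -> decoded string.
    """
    results = {}
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            results[name] = raw.decode(name)
        except Exception:
            # Codecs fail in different ways: wrong byte sequences, codecs that
            # are not text encodings, or codecs missing on this platform.
            continue
    return results


# Made-up sample: two Chinese characters encoded as GB2312.
sample = "中文".encode("gb2312")
decoded = try_all_codecs(sample)
print(decoded.get("gb2312"))
```

For a correctly encoded string, at least one codec returns the expected characters. For the PDF discussed here, no codec can help if the font's mapping to Unicode is itself wrong, because the problem is then not one of byte encoding.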
If you extract the embedded fonts from the PDF, it seems that the font is not encoded normally. It seems the program that generated the PDF file used its own encoding scheme for the embedded font and text. |
@mozbugbox - any chance to heal this ...? |
I have no idea how to work around this. It probably depends on whichever program was used to generate the PDF. |
I have many types of PDFs; some can be parsed, some cannot. Do parsing and rendering use different methods? |
It's really a very strange PDF. Just now I exported it to Word in Adobe Acrobat, and some of the text is right (not all of it; other tools can't extract a single correct character). |
I opened the PDF with Adobe, selected some text, and then copy-and-paste. What I got was some meaningless text, so I think text extraction might be unavailable for this PDF. |
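A quick way to flag this kind of meaningless output automatically is to measure how much of the extracted text is actually Chinese. The sketch below is only a heuristic under stated assumptions: the function name `cjk_fraction` is hypothetical, and it only counts the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is enough for a rough check but not a complete definition of CJK text.

```python
def cjk_fraction(text):
    """Return the share of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Count characters in the basic CJK Unified Ideographs block only.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


# Text extracted from a Chinese document should score well above zero;
# output consisting of Latin letters scores 0.0.
print(cjk_fraction("GGG eee"))
print(cjk_fraction("中文"))
```

A value near zero for a document that visibly renders as Chinese suggests that the font's character mapping, not the extraction tool, is at fault.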
Since Really don't know, file a bug report with the MuPDF library? See http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf Edited: Or, if the author doesn't want to let people copy and paste the text from the PDFs, the content of |
Thanks, all! I will read up on CIDs and CMaps; I hope that will help make sense of this. |
@mozbugbox - I think you are right. To my knowledge, MuPDF does not have any such option. So I am about to suggest closing this issue with the |
Looking more closely at test.pdf, the CIDs for the two embedded fonts indeed have an intentionally bad ToUnicode mapping. The
for both fonts. The code space is that of a single-byte, 256-entry Latin-1 code page, which obviously cannot represent the thousands of characters in Chinese. Normally, codes should be at least 2 bytes long for UCS-2 or 4 bytes for UTF-32. As for the
I guess that's why we saw so many GGG (uni0047), eee (uni0065) or whatever in the text extraction. @liaicheng, you may have better luck decoding the text by mapping the CID values with a common CID map like IDENTITY-H, GB1 or CNS instead of the embedded CID map. |
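The code space check described above can be illustrated with a short sketch. The CMap text quoted in the original comment was not preserved, so the two samples below are hypothetical fragments written in standard CMap syntax, and `codespace_widths` is an illustrative helper, not part of MuPDF or PyMuPDF.

```python
import re
from typing import List


def codespace_widths(cmap_text: str) -> List[int]:
    """Return the byte width of each range declared in a CMap's codespace blocks."""
    widths = []
    blocks = re.findall(
        r"begincodespacerange(.*?)endcodespacerange", cmap_text, flags=re.S
    )
    for block in blocks:
        for low, _high in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Two hex digits per byte.
            widths.append(len(low) // 2)
    return widths


# Hypothetical samples (the real CMap from test.pdf is not shown in the thread).
SINGLE_BYTE_CMAP = "1 begincodespacerange\n<00> <FF>\nendcodespacerange"
TWO_BYTE_CMAP = "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange"

for label, cmap in (("single-byte", SINGLE_BYTE_CMAP), ("two-byte", TWO_BYTE_CMAP)):
    width = codespace_widths(cmap)[0]
    print(label, "code space holds at most", 256 ** width, "codes")
```

A one-byte code space allows at most 256 distinct codes, far fewer than the roughly 22,000 glyphs reported for each of the two fonts in the warnings, so such a ToUnicode map cannot give every glyph its own Unicode value.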
@liaicheng - any news or reaction from your side? |
@JorjMcKie Yes, no further progress on it. What I have learned is that it's hard to solve. You can close it, thanks. |
@JorjMcKie Hi, I'm here, coming from issue #42.
When I use PyMuPDF as below:
```python
import fitz
import sys

# Python 2 workaround: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')

ENCODING = "utf-8"
file = 'test_pdf/test.pdf'
ofile = "test_pdf/test.txt"

doc = fitz.open(file)
pages = len(doc)
fout = open(ofile, "w")
for i in range(pages):
    text = doc.getPageText(i)  # plain text of page i
    fout.write(text.encode(ENCODING, "ignore"))
fout.close()
```
PDF link: https://www.dropbox.com/s/3jxutny7v17gzet/test.pdf?dl=0
When I parse this PDF, the following messages appear:
warning: not building glyph bbox table for font 'WJZZHG+SimSun' with 22141 glyphs
warning: not building glyph bbox table for font 'WXWEBB+SimHei' with 22021 glyphs
It seems MuPDF can't parse the PDF, although it can actually display it correctly. I don't have much experience with the C language. I also tried other Python packages such as pdfminer and pdfplumber, and they can't handle this PDF either.
I hope to get a response, thanks!