cannot find builtin CJK font #189

Closed
HazyGuo opened this issue Jul 23, 2018 · 15 comments

Comments

@HazyGuo

HazyGuo commented Jul 23, 2018

I got this error while running the pdfviewer(wx).py file. What is the CJK font?

@JorjMcKie
Collaborator

CJK = China, Japan, Korea
Can you send me an example PDF? Maybe there is a bigger issue underneath ...

@JorjMcKie
Collaborator

Does PDFdisplay.py work OK?

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

Sorry, the PDF contains confidential data from a financial customer, so I cannot share it.

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

No error appears, but the text content is not displayed.
image

@JorjMcKie
Collaborator

uhhh that's bad!!!

I am afraid I have to change the way I generate PyMuPDF and MuPDF then.
Currently I exclude fonts from the MuPDF generation wherever I thought I could. Obviously I have to reconsider this ...

How urgent is your issue? I need a few days to change and upload a new version ...

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

Thanks for your reply. I have about one week left to resolve this issue.

@JorjMcKie
Collaborator

I just checked what will happen if I include minimal CJK support in my MuPDF and subsequent PyMuPDF generation.

  1. I do not know if this already solves your problem
  2. It increases the size of the PyMuPDF binary (_fitz<...>.so or _fitz.pyd) from about 3 or 4 MB to 8 or 9 MB.

Are you using Windows? In that case I could send you a wheel to check if your issue is already solved ...

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

Yes, I am writing my test code on Windows.
I would be glad if you could send it to my email (guoqinghe99@gmail.com).

@JorjMcKie
Collaborator

JorjMcKie commented Jul 23, 2018 via email

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

My Python version is 3.5, and the operating system is 64-bit Windows.

@JorjMcKie
Collaborator

You should have received PyMuPDF-1.13.15-cp35-cp35m-win_amd64.whl by now. Please install it via

python -m pip install --upgrade PyMuPDF-1.13.15-cp35-cp35m-win_amd64.whl
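
A wheel only installs if the tags in its filename match the interpreter. The following is a minimal sketch for reading those tags, assuming the standard wheel naming scheme (name-version-python tag-ABI tag-platform tag). The helper names `parse_wheel_tags` and `cpython_tag` are hypothetical and not part of pip or PyMuPDF.

```python
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into its components.

    Assumes the layout name-version[-build]-python-abi-platform.whl.
    """
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def cpython_tag(version_info=sys.version_info):
    """Return the CPython tag for an interpreter version, e.g. cp35 for 3.5."""
    return "cp{}{}".format(version_info[0], version_info[1])


tags = parse_wheel_tags("PyMuPDF-1.13.15-cp35-cp35m-win_amd64.whl")
print(tags["python"], tags["platform"])
print("matches this interpreter:", tags["python"] == cpython_tag())
```

Here the `cp35` tag corresponds to Python 3.5 and `win_amd64` to 64-bit Windows, which is the combination reported above.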

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

Thank you very much, it can now extract the text content from my PDF.
However, the structure of the HTML page rendered from the extracted content does not look like the original in the PDF.
image
In addition, can I distinguish which content comes from a table and which does not?

@JorjMcKie
Collaborator

Do you mean the HTML page created from page.getText("html")?
This is the same output that mutool draw ... creates.
This is a different issue and has nothing to do with the font support. The same problem arises with HTML output for ASCII or Latin fonts. The basic reason seems to be that the generated HTML contains incomplete font specifications.

To your final question:
No, there is no direct way to find out whether some text comes from a table or not. The only approach is to use text position information. This type of information is contained in page.getText("dict"), page.getTextBlocks(), page.getTextWords() and page.getText("xml").
These methods provide different levels of detail. By using them you can determine where a line ("dict" method), a paragraph ("blocks" method), a word ("words" method), or even a character ("xml" method) is located on the page.
You may want to have a look at the most current Wiki about extracting text from rectangles.
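
As a rough illustration of the position-based approach, here is a minimal sketch that selects the words lying inside a given rectangle. It assumes each word is a tuple starting with (x0, y0, x1, y1, text), which is how the items of page.getTextWords() begin; any further fields are ignored. The function names, the sample data and the rectangle are made up for the example.

```python
def words_in_rect(words, rect):
    """Return the words whose bounding box lies fully inside rect.

    words: sequence of tuples beginning with (x0, y0, x1, y1, text).
    rect:  (x0, y0, x1, y1) of the area of interest, e.g. a table region.
    """
    rx0, ry0, rx1, ry1 = rect
    return [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]


def rect_text(words, rect):
    """Join the selected words in reading order (top to bottom, left to right)."""
    selected = sorted(words_in_rect(words, rect),
                      key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in selected)


# Made-up sample data standing in for the output of a word extraction.
sample_words = [
    (200, 100, 230, 112, "1,200"),
    (50, 100, 90, 112, "Revenue"),
    (50, 300, 120, 312, "Footnote"),
]
table_area = (40, 90, 300, 120)  # hypothetical table rectangle
print(rect_text(sample_words, table_area))  # Revenue 1,200
```

Whether a rectangle really corresponds to a table still has to be decided by you, for instance from known page layouts; the code only filters by position.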

@HazyGuo
Author

HazyGuo commented Jul 23, 2018

Thanks for your careful answer, I understand it now.

@HazyGuo HazyGuo closed this as completed Jul 23, 2018
@JorjMcKie
Collaborator

@HazyGuo - Hi again!
I have written a little script that takes the words of a document page and creates a new word list with the following properties:

  1. duplicate words omitted (i.e. words with same text and page position - which happened in your example).

  2. sorted by ascending vertical position, then by horizontal position

  3. combined words into one if the gap between them is very small

In your example document, this script creates a list of 70 words, all numbers are correctly produced, no duplicate words, etc.

Hope it helps!
unify-words.zip
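
For readers without the attachment, the three steps listed above could be sketched roughly as follows. This is not the attached script, only an illustration under assumptions: words are tuples beginning with (x0, y0, x1, y1, text), and the thresholds `gap` and `line_tol` are hypothetical values in page units.

```python
def unify_words(words, gap=1.0, line_tol=2.0):
    """Deduplicate, sort and merge near-adjacent words.

    words: iterable of tuples beginning with (x0, y0, x1, y1, text).
    gap: maximum horizontal distance for merging two words (assumed value).
    line_tol: maximum difference in y0 to treat words as on one line (assumed).
    """
    # Step 1: omit duplicates, i.e. same text at the same (rounded) position.
    seen = set()
    unique = []
    for w in words:
        key = tuple(round(c, 1) for c in w[:4]) + (w[4],)
        if key not in seen:
            seen.add(key)
            unique.append(tuple(w[:5]))

    # Step 2: sort by vertical position, then by horizontal position.
    unique.sort(key=lambda w: (w[1], w[0]))

    # Step 3: merge a word into its predecessor if the gap is very small.
    merged = []
    for x0, y0, x1, y1, text in unique:
        if merged:
            px0, py0, px1, py1, ptext = merged[-1]
            if abs(y0 - py0) <= line_tol and abs(x0 - px1) <= gap:
                merged[-1] = (px0, min(py0, y0), x1, max(py1, y1), ptext + text)
                continue
        merged.append((x0, y0, x1, y1, text))
    return merged


# Made-up sample: a number split in two pieces, plus one duplicate.
sample = [
    (100, 50, 120, 62, "1,2"),
    (120.5, 50, 135, 62, "00"),
    (100, 50, 120, 62, "1,2"),
    (40, 50, 80, 62, "Total"),
    (40, 20, 90, 32, "Header"),
]
for word in unify_words(sample):
    print(word)
```

One simplification to be aware of: the sort uses the exact y0 value, so words on the same visual line with slightly different baselines may not end up adjacent.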
