Seeking advice regarding classification problem only present with Chinese #49
Comments
Are you sure the documents are in UTF-8? Windows software would often default to UTF-16 (if not some legacy code page).
This definitely sounds like an encoding issue on the document side. When we trained …
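A minimal sketch of the kind of check being suggested, decoding the raw bytes explicitly instead of trusting the platform default (the file name is illustrative):

# Read the raw bytes and decode explicitly rather than relying on the
# Windows default encoding.
import langid

with open('sample.txt', 'rb') as f:
    raw = f.read()

# UTF-16 files written by Windows tools usually start with a byte-order mark.
if raw[:2] in (b'\xff\xfe', b'\xfe\xff'):
    text = raw.decode('utf-16')
else:
    text = raw.decode('utf-8')

print(langid.classify(text))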
For what it's worth, I see the opposite issue: bias towards Chinese.
All are identified as Chinese, generally with > 98% probability. Perhaps the Chinese data is actually all in the Latin alphabet? This should be the easiest language to keep separate, so it reeks of a fundamental bug or preprocessing issue.
Pardon, looks like in most cases it is the result of invisible chars in dirty data. (But …
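A minimal sketch of what stripping such invisible characters might look like; the sample string and the choice to drop everything in Unicode category 'C' are illustrative:

# Zero-width spaces, BOMs and other formatting code points (Unicode
# category 'Cf'/'Cc') can pollute extracted text; drop them before classifying.
import unicodedata
import langid

dirty = u'\ufeff\u200b\u4e2d\u6587\u6587\u672c\u200b'  # BOM and zero-width spaces around Chinese text

def strip_invisible(text):
    return u''.join(ch for ch in text if unicodedata.category(ch)[0] != 'C')

print(langid.classify(strip_invisible(dirty)))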
Hello,
I have some sample texts, originating as PDFs, and my goal is to classify their language automatically. I've extracted the text content with pdfminer, and while langid works excellently with all my samples in a variety of languages, it has problems with Chinese (I have samples in both simplified and traditional): it always suggests 'en'.
Does anyone have any advice on how I should approach investigating what the problem might be?
Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?
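For reference, a minimal sketch of the pipeline described above. It assumes pdfminer.six's high-level extract_text helper (the older Python 2 pdfminer instead needs the PDFResourceManager/TextConverter route), and the file name is made up:

# Extract the text of one PDF and classify its language.
import langid
from pdfminer.high_level import extract_text

text = extract_text('sample_zh.pdf')
print(repr(text[:200]))        # inspect what actually came out of the PDF
print(langid.classify(text))   # (language code, score)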
I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).
The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.
There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples, and I have tried focusing on pages where the body of the text is entirely Chinese and present in significantly larger quantities than in the header.
It also makes no difference if I preselect the languages (unfortunately the false suggestion of English needs to stay in the list, as there are likely to be samples in English present):

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja','zh'])

Even if I try taking out English, it merely suggests a different wrong language (e.g. German), although the confidence level is fairly low (typically 0.16 to 0.25, whether it guesses English or German).
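A minimal sketch of one way to investigate this: check how much of the extracted text actually falls in the CJK Unified Ideographs range before classifying. page_text stands in for whatever pdfminer returns for a page:

import langid

langid.set_languages(['en', 'es', 'pt', 'fr', 'ru', 'pl', 'de', 'it', 'ja', 'zh'])

def cjk_ratio(text):
    # Count characters in the CJK Unified Ideographs block (U+4E00..U+9FFF).
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    return float(cjk) / max(len(text), 1)

page_text = u'...'  # text extracted from one page by pdfminer
print(cjk_ratio(page_text))       # a value near zero points at the extraction step
print(langid.classify(page_text))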
My setup is Windows 7 with Python 2.7 (needed because of PDFMiner, although I could try Python 3.5 if that were thought likely to solve the issue).
Many thanks,
Neil