Seeking advice regarding classification problem only present with Chinese #49
Comments
Are you sure the documents are in UTF-8? Windows software would often default to UTF-16 (if not some legacy code page).
This definitely sounds like an encoding issue on the document side. When we trained …
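A minimal sketch of the kind of check being suggested, decoding the raw bytes explicitly instead of trusting the platform default (the file name is illustrative):

# Read the raw bytes and decode explicitly rather than relying on the
# Windows default encoding.
import langid

with open('sample.txt', 'rb') as f:
    raw = f.read()

# UTF-16 files written by Windows tools usually start with a byte-order mark.
if raw[:2] in (b'\xff\xfe', b'\xfe\xff'):
    text = raw.decode('utf-16')
else:
    text = raw.decode('utf-8')

print(langid.classify(text))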
For what it's worth, I see the opposite issue: bias towards Chinese.
All are identified as Chinese, generally with > 98% probability. Perhaps the Chinese data is actually all in the Latin alphabet? This should be the easiest language to keep separate, so it reeks of a fundamental bug or preprocessing issue.
Pardon, looks like in most cases it is the result of invisible chars in dirty data. (But …
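A minimal sketch of what stripping such invisible characters might look like; the sample string and the choice to drop everything in Unicode category 'C' are illustrative:

# Zero-width spaces, BOMs and other formatting code points (Unicode
# category 'Cf'/'Cc') can pollute extracted text; drop them before classifying.
import unicodedata
import langid

dirty = u'\ufeff\u200b\u4e2d\u6587\u6587\u672c\u200b'  # BOM and zero-width spaces around Chinese text

def strip_invisible(text):
    return u''.join(ch for ch in text if unicodedata.category(ch)[0] != 'C')

print(langid.classify(strip_invisible(dirty)))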
Hello,
I have some sample texts, originating as PDFs, and my goal is to classify their language automatically. I've extracted the text content with pdfminer, and while langid works excellently with all my samples in a variety of languages, it has problems with Chinese (I have samples in both simplified and traditional): it always suggests 'en'.
Does anyone have any advice on how I should approach investigating what the problem might be?
Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?
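For reference, a minimal sketch of the pipeline described above. It assumes pdfminer.six's high-level extract_text helper (the older Python 2 pdfminer instead needs the PDFResourceManager/TextConverter route), and the file name is made up:

# Extract the text of one PDF and classify its language.
import langid
from pdfminer.high_level import extract_text

text = extract_text('sample_zh.pdf')
print(repr(text[:200]))        # inspect what actually came out of the PDF
print(langid.classify(text))   # (language code, score)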
I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).
The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.
There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples, and I have tried focusing on pages where the body of the text is entirely Chinese and present in significantly larger quantities than in the header.
It also makes no difference if I preselect the languages (unfortunately the false suggestion of English needs to stay in the list, as there are likely to be samples in English present):

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja','zh'])

Even if I try taking out English, it merely suggests a different wrong language (e.g. German), although the confidence level is fairly low (typically 0.16 to 0.25, whether it guesses English or German).
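A minimal sketch of one way to investigate this: check how much of the extracted text actually falls in the CJK Unified Ideographs range before classifying. page_text stands in for whatever pdfminer returns for a page:

import langid

langid.set_languages(['en', 'es', 'pt', 'fr', 'ru', 'pl', 'de', 'it', 'ja', 'zh'])

def cjk_ratio(text):
    # Count characters in the CJK Unified Ideographs block (U+4E00..U+9FFF).
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    return float(cjk) / max(len(text), 1)

page_text = u'...'  # text extracted from one page by pdfminer
print(cjk_ratio(page_text))       # a value near zero points at the extraction step
print(langid.classify(page_text))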
My setup is Windows 7 with Python 2.7 (needed because of PDFMiner, although I could try Python 3.5 if that were thought likely to solve the issue).
Many thanks,
Neil