Hindi, Arabic, Korean and Japanese #2
Hello Sina! The model is trained on Hindi and Korean, and my testing shows that both of your example strings are classified correctly.

My testing was done with langid.py used interactively in a terminal. Cheers
Hey Marco,

Thanks for the quick reply! It would seem I, just like the only other issue on this project, was having encoding problems! I was passing the string in as a unicode object, so the ordinal encoding of the string s consisted of unicode code points rather than byte values. As you trained your naive Bayes classifier on byte features, this resulted in all-zero feature vectors. The problem goes away if s is passed as a plain byte string with the original characters, such as:

s = "यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"

and then the features are correct. I think there is a reasonable fix for this: you can encode an incoming string at the top of the tokenize function if it is a unicode object, as sketched below. Knowing in advance whether a string is unicode is difficult, but you can use the technique outlined here: http://code.activestate.com/recipes/466341-guaranteed-conversion-to-unicode-or-byte-string/ which should let you encode any string as bytes safely.

Thanks very much for your help. On a side note, I am taking your great project and porting it to Java for my work. I'd really love to feed back into your project; shall I do this by branching and adding the Java classes? I haven't ported the feature selector and trainer yet, so I was converting the language model to a format readable in Java. There are some kinks to iron out, but I will let you know what I come up with if you are interested :-). Of course, my Java classes attribute you as the original author, etc.

Thanks again!
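To make that concrete, here is a minimal sketch of the pitfall and the fix (Python 2 semantics assumed, where str is a byte string and unicode is a separate type; the to_bytes helper is just an illustration, not langid.py code):

# -*- coding: utf-8 -*-
# Python 2 semantics assumed: str is a byte string, unicode is a separate type.
# langid.py's features are byte n-grams, so unicode input has to be encoded
# to UTF-8 bytes before tokenization sees it.

def to_bytes(s, encoding='utf-8'):
    # Illustrative helper (not part of langid.py), after the recipe linked above.
    if isinstance(s, unicode):
        return s.encode(encoding)
    return s

s = u"यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
print(map(ord, s)[:3])            # unicode code points: [2351, 2361, 32]
print(map(ord, to_bytes(s))[:3])  # UTF-8 byte values:   [224, 164, 175]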
Hello Sina,

Thank you for your input. It does seem that encoding has caused problems for some users, but I think that encoding detection is beyond the scope of langid.py. There are other modules out there to perform this function, such as chardet. If this continues to give users difficulty, I may consider attempting to detect the situation and warn the user.

I wish you all the best in your Java implementation, but I think it is best that you maintain your own repository. I would be happy to provide a link to your project when it is complete. The compiled model should not be difficult to convert; it is essentially just a large heap of numbers.

I am planning to write a paper soon to describe the implementation of langid.py from a more technical perspective.

All the best! Cheers
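For anyone who does need encoding detection before calling langid.py, a minimal sketch along the lines Marco suggests, using the chardet module (the file name is made up, and detection is a best-effort guess, not a guarantee):

import chardet

raw = open('input.txt', 'rb').read()  # 'input.txt' is a hypothetical example file
guess = chardet.detect(raw)           # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])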
Hello!
Thanks so much for the great tool and paper, it is really helping me learn about this stuff.
I had a question about the language model provided in the code and what it was trained on.
I'm finding that strings from languages like Korean and Hindi are not classified correctly. For example:
Hindi: यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
Korean: 이것은 아랍어 문자열입니다
Both get incorrectly matched as English with a confidence of 0.0196.
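For reference, the calls I'm making look roughly like this (classify is langid.py's module-level API; note the unicode literals, which the replies above identify as the culprit):

# -*- coding: utf-8 -*-
import langid

hindi = u"यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
korean = u"이것은 아랍어 문자열입니다"

print(langid.classify(hindi))   # observed: ('en', 0.0196...) rather than ('hi', ...)
print(langid.classify(korean))  # observed: ('en', 0.0196...) rather than ('ko', ...)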
Upon further inspection, I find that the dot product in the nb_classify function is returning a vector of zeros. I would take this to mean that the model simply wasn't trained on these languages. However, upon closer inspection I found that the nb_pc vector (which I think holds the prior probabilities for each class?) is non-zero for Hindi.
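To spell out my mental model of that step, here is a simplified sketch (toy shapes and values only, not langid.py's actual code; I'm assuming nb_ptc holds per-class log feature probabilities and nb_pc the log class priors):

import numpy as np

n_feats, n_langs = 4, 3
nb_ptc = np.log(np.full((n_feats, n_langs), 0.25))  # toy log P(feature | class)
nb_pc = np.log(np.array([0.5, 0.3, 0.2]))           # toy log P(class) priors

fv = np.zeros(n_feats)    # no byte features matched: the zero vector I'm seeing
pdc = np.dot(fv, nb_ptc)  # all zeros, so the features contribute no evidence
pd = pdc + nb_pc          # only the priors are left to decide
print(pd.argmax())        # winning class index, regardless of the input text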
Am I misunderstanding something? Was the basic model trained on Hindi etc.?
Thanks