Hindi, Arabic, Korean and Japanese #2
Hello Sina! The model is trained on Hindi and Korean, and my testing shows that both of your example strings are classified correctly.

My testing was done with langid.py used interactively in a terminal. Cheers
Hey Marco,

Thanks for the quick reply! It would seem I, just like the only other issue on this project, was having encoding problems! I was passing the string in as a unicode object, so the ordinal encoding of the string s consisted of unicode code points rather than byte values. As you trained your naive Bayes classifier on byte features, this resulted in all-zero feature vectors. The problem goes away if s is passed as a plain byte string with the original characters, such as:

s = "यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"

and then the features are correct. I think there is a reasonable fix for this: you can encode an incoming string at the top of the tokenize function if it is a unicode object, as sketched below. Knowing in advance whether a string is unicode is difficult, but you can use the technique outlined here: http://code.activestate.com/recipes/466341-guaranteed-conversion-to-unicode-or-byte-string/ which should let you encode any string as bytes safely.

Thanks very much for your help. On a side note, I am taking your great project and porting it to Java for my work. I'd really love to feed back into your project; shall I do this by branching and adding the Java classes? I haven't ported the feature selector and trainer yet, so I was converting the language model to a format readable in Java. There are some kinks to iron out, but I will let you know what I come up with if you are interested :-). Of course, my Java classes attribute you as the original author, etc.

Thanks again!
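To make that concrete, here is a minimal sketch of the pitfall and the fix (Python 2 semantics assumed, where str is a byte string and unicode is a separate type; the to_bytes helper is just an illustration, not langid.py code):

# -*- coding: utf-8 -*-
# Python 2 semantics assumed: str is a byte string, unicode is a separate type.
# langid.py's features are byte n-grams, so unicode input has to be encoded
# to UTF-8 bytes before tokenization sees it.

def to_bytes(s, encoding='utf-8'):
    # Illustrative helper (not part of langid.py), after the recipe linked above.
    if isinstance(s, unicode):
        return s.encode(encoding)
    return s

s = u"यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
print(map(ord, s)[:3])            # unicode code points: [2351, 2361, 32]
print(map(ord, to_bytes(s))[:3])  # UTF-8 byte values:   [224, 164, 175]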
Hello Sina,

Thank you for your input. It does seem that encoding has caused problems for some users, but I think that encoding detection is beyond the scope of langid.py. There are other modules out there to perform this function, such as chardet. If this continues to give users difficulty, I may consider attempting to detect the situation and warn the user.

I wish you all the best in your Java implementation, but I think it is best that you maintain your own repository. I would be happy to provide a link to your project when it is complete. The compiled model should not be difficult to convert; it is essentially just a large heap of numbers.

I am planning to write a paper soon to describe the implementation of langid.py from a more technical perspective.

All the best! Cheers
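For anyone who does need encoding detection before calling langid.py, a minimal sketch along the lines Marco suggests, using the chardet module (the file name is made up, and detection is a best-effort guess, not a guarantee):

import chardet

raw = open('input.txt', 'rb').read()  # 'input.txt' is a hypothetical example file
guess = chardet.detect(raw)           # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])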
Hello!
Thanks so much for the great tool and paper, it is really helping me learn about this stuff.
I had a question about the language model provided in the code and what it was trained on.
I'm finding that strings from languages like Korean and Hindi are not classified correctly. For example:
Hindi: यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
Korean: 이것은 아랍어 문자열입니다
Both get incorrectly matched as English with a confidence of 0.0196.
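For reference, the calls I'm making look roughly like this (classify is langid.py's module-level API; note the unicode literals, which the replies above identify as the culprit):

# -*- coding: utf-8 -*-
import langid

hindi = u"यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
korean = u"이것은 아랍어 문자열입니다"

print(langid.classify(hindi))   # observed: ('en', 0.0196...) rather than ('hi', ...)
print(langid.classify(korean))  # observed: ('en', 0.0196...) rather than ('ko', ...)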
Upon further inspection, I find that the dot product in the nb_classify function is returning a vector of zeros. I would take this to mean that the model simply wasn't trained on these languages. However, upon closer inspection I found that the nb_pc vector (which I think holds the prior probabilities for each class?) is non-zero for Hindi.
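To spell out my mental model of that step, here is a simplified sketch (toy shapes and values only, not langid.py's actual code; I'm assuming nb_ptc holds per-class log feature probabilities and nb_pc the log class priors):

import numpy as np

n_feats, n_langs = 4, 3
nb_ptc = np.log(np.full((n_feats, n_langs), 0.25))  # toy log P(feature | class)
nb_pc = np.log(np.array([0.5, 0.3, 0.2]))           # toy log P(class) priors

fv = np.zeros(n_feats)    # no byte features matched: the zero vector I'm seeing
pdc = np.dot(fv, nb_ptc)  # all zeros, so the features contribute no evidence
pd = pdc + nb_pc          # only the priors are left to decide
print(pd.argmax())        # winning class index, regardless of the input text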
Am I misunderstanding something? Was the basic model trained on Hindi etc.?
Thanks