Class probability computation is very inefficient (patch enclosed) #11
Oops, let's try that again (first-time poster on GitHub):
Hi Ralf! Thanks for your patch. I've applied it to a new branch. Dawid Weiss addresses the same issue (a sparse feature vector) in his Java implementation (https://github.com/carrotsearch/langid-java), though his solution uses a different data structure. I appreciate the simplicity of your patch, handling "long" and "short" strings separately.
I did some experimentation to try to determine a suitable threshold for transitioning from "short" to "long" handling. Your initial estimate using the built-in feature set was num_feats / 10, which worked out to 748. I've made a slight change to that, allowing the user to specify the document length in bytes at which the changeover should occur. I experimented with samples from a European government dataset, sampling 1000 strings of n bytes for 100 <= n <= 3000 in increments of 100. The break-even point in my experiments appears to be right around 650 bytes, and I've made this change in the aforementioned branch.
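The break-even sweep described above can be sketched roughly as follows. Everything here is a synthetic stand-in (model dimensions, random "documents", and the toy tokenization are invented for illustration, not langid.py's real model or the experiment's actual data); the point is only to show how the dense and sparse scoring paths can be timed against each other across document lengths.

```python
# Time the dense matrix product against the sparse per-feature lookup
# for documents of increasing length; the crossover is where the dense
# path starts to win.  All dimensions below are hypothetical stand-ins.
import time
import numpy as np

rng = np.random.default_rng(0)
n_feats, n_langs = 7480, 97                  # rough stand-in dimensions
nb_ptc = rng.random((n_feats, n_langs))      # per-feature log-prob stand-in

def dense_logprobs(feat_ids):
    # "long" path: full count vector, then one matrix product
    counts = np.bincount(feat_ids, minlength=n_feats)
    return counts @ nb_ptc

def sparse_logprobs(feat_ids):
    # "short" path: touch only the rows for features actually present
    acc = np.zeros(n_langs)
    for f in feat_ids:
        acc += nb_ptc[f]
    return acc

def elapsed(fn, arg, reps=20):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(arg)
    return time.perf_counter() - t0

for doc_len in (100, 500, 1000, 3000):
    feats = rng.integers(0, n_feats, size=doc_len)
    td = elapsed(dense_logprobs, feats)
    ts = elapsed(sparse_logprobs, feats)
    print(f"{doc_len:5d} features: dense {td:.4f}s  sparse {ts:.4f}s")
```

Both paths compute the same class scores; only the cost profile differs, which is what makes a length-based changeover sensible.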
I'm glad I could help to improve your code. The speed issue came up […] The solution in my patch is in fact the same approach I use in my own […] I took a quick look at langid-java, but it seems to be a raw library […] BTW, I set the switchover point relative to num_feats because it will […]
I'm familiar with your work; I read your paper when it came out. Great to see others releasing open-source tools! I've never pushed this implementation of langid.py to the limits you're taking it to, so I'll be very interested to see the final outcome of your comparison. I hope you are planning to publish it?

I'm not sure server mode will help; its main advantage is avoiding unpacking the model multiple times. Are you doing that? If so, it would be faster to access langid.py as a library, creating a single instance of the LanguageIdentifier class and then reusing it. I think this implementation of langid.py is as fast as I'm able to make it while sticking with numpy; I suspect I could do a fair bit better if I implemented the core estimation as a C extension module.

Ah, I see your rationale for num_feats / 10, that is quite sensible.

As an aside, I've experimented with la-strings briefly. Is there any way to access the language identifier component directly? What I ended up doing was invoking it as "la-strings -i -I 1 ", then taking the most common prediction as the prediction for the whole document, which seems like a rather roundabout way of doing it.
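A minimal sketch of the single-instance pattern recommended above, with a toy class standing in for langid.py's LanguageIdentifier (the real class is constructed once, e.g. via its from_modelstring factory, and then reused). The model contents and the classify logic below are invented purely to illustrate why paying the unpacking cost once matters.

```python
# Toy illustration: the expensive model unpacking happens once in the
# constructor, then classify() is cheap and can be called many times.
import time

def unpack_model():
    # stand-in for decompressing/deserializing the packed model string
    time.sleep(0.01)
    return {"en": {"the", "and"}, "de": {"und", "der"}}

class ToyIdentifier:
    def __init__(self):
        self.model = unpack_model()          # pay the cost exactly once

    def classify(self, text):
        words = set(text.lower().split())
        # pick the language sharing the most marker words with the text
        return max(self.model, key=lambda lang: len(words & self.model[lang]))

ident = ToyIdentifier()                      # one instance, reused below
print(ident.classify("the cat and the dog"))     # -> en
print(ident.classify("der Hund und die Katze"))  # -> de
```

Calling `unpack_model()` inside `classify()` instead would repeat the fixed cost per document, which is exactly the overhead the comment warns about when shelling out to the script repeatedly.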
That's in fact what is happening, because my evaluation script calls […] The language identification code is all in the langident/ subdirectory […] I have a paper at the Text, Speech, and Discourse conference next […]
I've added handling of the --line option in --batch mode, this should be the most effective way for you to avoid unpacking the model over and over. You can find the modified version of langid.py in the new branch I opened for this issue. The file itself is here. Please let me know if this is helpful. Congratulations on your publication! I will look for it when the proceedings come online. |
The following patch produces the same output with a 4.4-fold speedup for language identification (not counting startup time) in --line mode given 650-byte average line lengths, and a 33-fold speedup with 62-byte average line lengths when using the default language model. Larger models with more features show an even larger speedup.
The speedup results from avoiding a matrix multiplication against a feature-count vector that is mostly zeros. You may wish to tweak the cut-over from "short" to "long" texts by adjusting the self.nb_numfeats / 10 threshold; it could probably be moved higher, but I was being conservative.
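The core idea of the patch can be sketched like this, assuming a per-feature log-probability matrix shaped like langid.py's nb_ptc (features × languages). The class and method names below are illustrative, not the project's actual code; only the short/long dispatch mirrors the approach the patch describes.

```python
# Sketch of the short/long dispatch: for short documents, sum only the
# rows of the log-probability matrix for features that actually occur;
# for long documents, build the dense count vector and do one matrix
# product.  Both paths return identical class scores.
import numpy as np

class NBScorer:
    def __init__(self, nb_ptc, switch_bytes=650):
        self.nb_ptc = nb_ptc                     # shape: (n_feats, n_langs)
        self.nb_numfeats = nb_ptc.shape[0]
        self.switch_bytes = switch_bytes         # short/long changeover point

    def logprobs(self, feat_ids, doc_bytes):
        if doc_bytes < self.switch_bytes:
            # "short" path: gather only the rows for present features,
            # skipping the mostly-zero count vector entirely
            acc = np.zeros(self.nb_ptc.shape[1])
            for f in feat_ids:
                acc += self.nb_ptc[f]
            return acc
        # "long" path: dense count vector and a full matrix product
        counts = np.bincount(feat_ids, minlength=self.nb_numfeats)
        return counts @ self.nb_ptc
```

Since the two branches are numerically equivalent, the changeover point is purely a performance knob, which is why it can safely be exposed as a user-settable byte length.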