Improving language detection #929
Testing different language detection libraries.

Language detection tests - requirements per library:

- fasttext: pybind11, numpy
- spacy_langdetect: spacy, langdetect
- langdetect: six

Libraries compared: langdetect, spacy, fasttext, dateparser.
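Since the comparison plugs several libraries into the same test, the evaluation loop itself can be sketched with a plug-in detector interface. The stub detector below is purely illustrative; real runs would call `langdetect.detect` or a fasttext model instead.

```python
# Sketch of an evaluation harness: any detector is a callable
# text -> language code, so different libraries can be plugged in.
def evaluate(detect, dataset):
    """Return accuracy of `detect` over (text, expected_code) pairs."""
    correct = sum(detect(text) == expected for text, expected in dataset)
    return correct / len(dataset)

# Hypothetical stub detector, standing in for a real library call.
def stub_detect(text):
    return "fr" if "é" in text or "ç" in text else "en"

dataset = [
    ("This is clearly an English sentence.", "en"),
    ("Ceci est une phrase en français.", "fr"),
]
print(evaluate(stub_detect, dataset))  # 1.0
```

Swapping the detector callable is all that is needed to benchmark another library on the same data.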
Other observations:
Thanks @gavishpoddar. On which dataset did you measure the quality?
The accuracy of langdetect is ~0.24 when the input text is short (1-3 words). A few language detection libraries were removed from the results or not tested due to speed, maintenance, or installation-related issues.

Other notes: the issues discussed with langdetect are confirmed.
Thanks @gavishpoddar, that is test data from the dateparser library, right? Do you have accuracy values for other libraries on that data?
Accuracy scores change drastically if we discard the test data not supported by a given library; the scores below were observed after the data was cleaned.

Accuracy values, using test_languages data (partial)**:

- fasttext accuracy score: 0.7823 (data cleaning stats - mapping not found: 204)
- langdetect accuracy score: 0.6608 (data cleaning stats - mapping not found: 299)

** Data was cleaned by removing examples in languages not supported by the above libraries; the cleaning code is shared in the comment below.
Code for cleaning supported languages:
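A minimal sketch of such a cleaning step, assuming the dataset is a list of (text, language-code) pairs; the supported-language set and the example data here are illustrative placeholders, not the actual benchmark data.

```python
# Keep only test examples whose language code the library supports,
# counting discarded examples as "mapping not found".
SUPPORTED = {"en", "fr", "de", "es", "ru"}  # placeholder set

def clean(dataset, supported):
    kept, mapping_not_found = [], 0
    for text, lang in dataset:
        if lang in supported:
            kept.append((text, lang))
        else:
            mapping_not_found += 1
    return kept, mapping_not_found

dataset = [("hello", "en"), ("zdravo", "sr-Cyrl"), ("bonjour", "fr")]
kept, dropped = clean(dataset, SUPPORTED)
print(len(kept), dropped)  # 2 1
```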
@gavishpoddar Thanks for posting the code - that's really helpful. And it's useful to know how many of the examples are in languages missing from the library. But this way to evaluate quality does not sound fair to me, especially comparing accuracy on different datasets between libraries. I don't think we should discard examples just because a library does not support them - after all, we want them to work, don't we? I'd rather run evaluation on the whole dataset, doing re-mapping between languages if needed.

Also, I see that even fasttext has a lot of "Mapping not found" - is that because languages are really missing, or because the names of the languages are different?
Hi @lopuhin, thanks for the review 😀 I will try to remap as many of them as possible and run the tests again without discarding values. I am also working on the remapping part, not just for the tests. In some cases the language is really missing; in others it's a mapping issue, because fasttext uses ISO-639 codes while our data uses CLDR language codes. I will try to get more insight into that. Additionally, a few languages supported by CLDR appear in our test data in less commonly written scripts, e.g. 'sr-Cyrl', 'uz-Cyrl', 'az-Latn', and are not packaged with the model.
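The remapping discussed above can be sketched as a small translation table from bare ISO-639 codes to the CLDR-style codes used in the test data. The entries below are assumptions for illustration, not a complete mapping.

```python
# Map a library's bare ISO-639 output onto the dataset's CLDR-style
# codes, instead of discarding unmatched examples.
REMAP = {
    "sr": "sr-Cyrl",  # assumption: dataset distinguishes scripts
    "az": "az-Latn",
    "uz": "uz-Cyrl",
}

def normalize(code, remap=REMAP):
    """Return the dataset's code for a detector's output code."""
    return remap.get(code, code)

print(normalize("sr"))  # sr-Cyrl
print(normalize("en"))  # en (unchanged: no remap entry needed)
```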
Accuracy Score without removing language filters:
Other observations:
With PR #932 and the fasttext function, the issues are solved.
Language Detection: Speed, for 10,000 calls:

Please find the code below.
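A harness for the "10,000 calls" measurement can be sketched with `timeit`; the stub detector below stands in for the real library calls (`langdetect.detect`, a fasttext model's `predict`, etc.) that the actual benchmark would time.

```python
# Time 10,000 detection calls on a fixed input.
import timeit

def stub_detect(text):
    # Placeholder for a real detector call being benchmarked.
    return "en"

elapsed = timeit.timeit(lambda: stub_detect("some sample text"),
                        number=10_000)
print(f"10,000 calls took {elapsed:.4f}s")
```

Timing each library on the same inputs with the same `number` keeps the speed comparison fair.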
WIP: Language Detection: Accuracy Score
These scores are preliminary and are shared just for discussion.
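Since an overall accuracy score hides which languages drive it, a per-language breakdown can make such discussion numbers easier to interpret. This is a sketch; the sample pairs are illustrative.

```python
# Per-language accuracy from (expected, predicted) code pairs.
from collections import defaultdict

def per_language_accuracy(pairs):
    totals, hits = defaultdict(int), defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        hits[expected] += expected == predicted
    return {lang: hits[lang] / totals[lang] for lang in totals}

pairs = [("en", "en"), ("en", "en"), ("fr", "en"), ("fr", "fr")]
print(per_language_accuracy(pairs))  # {'en': 1.0, 'fr': 0.5}
```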
Thanks, maintainers - this issue discussion is concluded. Thanks for your time.
Improving language detection using optional language detection library
Related Issues #567, #575, #612.
Implementing pluggable wrappers for an optional language detection library, a wrapper template, and docs on implementing a custom wrapper.
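The wrapper idea can be sketched as a minimal common interface that optional detection backends implement. The class and method names below are assumptions for illustration, not dateparser's actual API.

```python
# Hypothetical common interface for swappable detection backends.
class LanguageDetector:
    def detect(self, text: str) -> str:
        """Return a language code for `text`."""
        raise NotImplementedError

class FixedLanguageDetector(LanguageDetector):
    """Trivial example backend used here instead of a real library;
    a langdetect or fasttext wrapper would subclass the same way."""
    def __init__(self, code: str):
        self.code = code

    def detect(self, text: str) -> str:
        return self.code

detector: LanguageDetector = FixedLanguageDetector("en")
print(detector.detect("hello"))  # en
```

With such an interface, a real wrapper would import its optional dependency lazily inside `detect`, so installing the library stays optional.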