Add language autodetect plugin #1969
Conversation
Is this not getting merged because I have to change something, or do I simply need to wait a bit more? If you need any further changes, please tell me.
Hi @ArtikusHG, thanks for your PR, a very interesting plugin 👍
Wait a bit more :-) .. TBH I haven't had time to review and I am sorry about that. Without a deep review I can only do some bikeshedding and scratch the surface .. three things come to my mind:
I want to see this PR get merged, but first we should think about these topics. Update:
The language and locale (aka region) detection is done on different criteria:
- Lines 540 to 552 in a8359dd
- Lines 221 to 231 in a8359dd
- Lines 448 to 464 in a8359dd
I will answer all the questions a bit later; however, langdetect is already a dependency of SearXNG (requirements.txt: langdetect==1.0.9), so I am not adding any new dependencies.
Oops, sorry.
Okay, back at my laptop, gonna try to answer all the questions.
As far as I know, langdetect falls back to English. I didn't think of that for some reason; I will implement a check so that if the language isn't detected (or is detected with a confidence level that indicates it's clearly wrong), it falls back to the user's default language.
I don't think it's accurate enough, and I don't think it will affect search results enough for us to care. As far as I understand you're not talking about location-based results (e.g. local shops, etc.), but even if you are, implementing this based on language is an even worse idea.
I believe the autodetected language (if detected) should be the one that is chosen for searching. Sourcing the language and region from the browser works great as long as the user is only searching in one language - the one that is selected in the browser. This plugin is targeted at users who search in different languages. As I mentioned, I often switch between searching in Russian and English (and sometimes Romanian). The whole point of automatically detecting the language is that the language in which the search query is written is much more important than the one that is statically set in the browser. It doesn't matter that the user has their browser in English as long as they're searching in French, you know? Sorry for getting a bit too elaborate, I just wanted to make sure I get my points across properly. I am open to further discussion, but I think the only valid concern is that we should implement a fallback to the original language from the settings if no language is accurately detected.
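The fallback idea described above can be sketched like this (a minimal sketch; the function name, the 0.3 threshold, and the `(language, probability)` pair format are illustrative assumptions, not the PR's actual code):

```python
def choose_language(detections, fallback="en", threshold=0.3):
    """Pick the detected language only when the detector is confident.

    `detections` is a list of (language, probability) pairs sorted by
    probability, loosely modelled on langdetect.detect_langs() output.
    Anything below the threshold falls back to the user's setting.
    """
    if detections:
        lang, prob = detections[0]
        if prob >= threshold:
            return lang
    return fallback

# A confident detection overrides the browser/profile language ...
print(choose_language([("ru", 0.97)], fallback="en"))  # -> ru
# ... an unsure one keeps the user's configured default.
print(choose_language([("fr", 0.12)], fallback="en"))  # -> en
```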
Okay, I did a bit more testing, and I've found that the langdetect library performs extremely poorly with short search queries ("apples" gets identified as French). Will adding a second language detection library to the project be a huge problem? Probably the best contender so far is Google's CLD3: it supports 107 languages (langdetect supports only 57), does not require any additional downloads (e.g. language identification models for FastText), and is the only decently maintained modern library for language detection in Python at the moment (langdetect, langid and others are dead projects, and the other libraries all require additional language models). Also, CLD3 provides an …

EDIT: CLD3 seems to have similar problems with short queries, and as far as I can tell, most other libraries (even outdated ones) have the same problem. As for the model-based ones, most instance maintainers are probably not going to bother setting up language detection models for their instance, so this feature would not be available to most users. Including the models with SearXNG would make the download massive (even though there are some decently small ones). What do we do? To be clear, for long-ish queries langdetect performs mostly fine, and this is still better in most cases than switching the language manually, but we should definitely consider using something better.
Okay, last comment: fasttext has a ~917 kB model that does much better than langdetect. It is not perfect, but it is a lot better than langdetect (and the other standalone libraries, for that matter). Paired with fallback detection (e.g. not applying results that have a <0.3 confidence level) this gives, on average, pretty good results. I think this is a great solution: include the 917 kB model in the repo and use it as the default one, and add an option (either in settings.yml or in the plugin directly) to use a different model if the maintainer wishes. FYI, the model is distributed under the Creative Commons Attribution-ShareAlike 3.0 license, so including it in the repo should be no legal problem.
@ArtikusHG If you think you can implement it, then go for it. I like the functionality of this plugin, and it would be great if it also worked reliably for short search phrases.
Implementing language detection with fasttext instead of langdetect is not a hard task. If you guys approve of the fasttext dependency addition, I will implement it tomorrow, along with fallback detection.
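For reference, fasttext's `predict()` returns labels of the form `__label__xx` plus confidence scores. A hedged sketch of turning that output into a plain language code with the fallback behaviour discussed above (the helper name and threshold are illustrative, not the PR's code):

```python
def parse_fasttext_prediction(labels, probs, threshold=0.3):
    """Turn fasttext predict() output into a plain language code.

    fasttext labels look like '__label__en'; below the confidence
    threshold we report no detection (None) so the caller can fall
    back to the user's configured language.
    """
    if not labels or probs[0] < threshold:
        return None
    return labels[0].replace("__label__", "")

# With the real library it would be driven roughly like this
# (not executed here, since it needs the lid.176.ftz model file):
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   labels, probs = model.predict("bonjour tout le monde", k=1)
#   lang = parse_fasttext_prediction(labels, probs)

print(parse_fasttext_prediction(["__label__fr"], [0.85]))  # -> fr
print(parse_fasttext_prediction(["__label__fr"], [0.12]))  # -> None
```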
I did the required changes to use fasttext. The plugin is now much better at detecting languages. It still gets confused here and there, but this is not a huge issue. A couple of notes:
FYI, before commit you should run … To get the reported issue fixed you can run … Thanks a lot for your efforts on this PR 👍 I hope I have time in the next days to review / if not, please be patient with the maintainers. One question I have ..
I assume it's a training file .. right? .. where did you get it from? (Maybe we can regularly update this file from CI, like we do with other data files.)
Whoops, sorry for the messy formatting, I will fix it tomorrow. The source of lid.176.ftz is https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz - however, it's not a model that is updated regularly. As far as I'm aware, no changes have been made to it since it was published, so it won't make any sense to update it from CI.
@ArtikusHG @tiekoetter is it OK if I squash the commits and rebase the branch on master?
I'm fine with that
Thanks 👍 .. I also placed a commit on top to improve the 'Autodetect search language' plugin:
IMO this PR is a good starting point to get language autodetection into SearXNG, but there is also some space left for improvements. If you look through the docs I added, you will find the following passage:
In this screenshot the user selects the … We can't solve this conflict in this PR; to solve it, some SearXNG core changes are needed, and I don't want to do that here. Since it is only a plugin that the user can disable, I think we can merge this PR. Only one question I have: @ArtikusHG, you set the threshold to 0.3 .. did you have any reason to choose this value .. should we think about the value again and possibly increase it?
The reason for the 0.3 threshold was that from my testing (loading the model into Python and just trying it on some common words from languages I know) it seemed like 0.3 was the best option. Many queries I typed in were identified correctly with a confidence of around 0.35-0.4. However, many non-standard queries (e.g. just smashing the keyboard, aka "asojfdowqireuowqeuroqidalkjfas") were assigned seemingly random languages with confidence just below 0.3 (I remember most were around 0.25 or 0.28). The same thing also happens with brand names: typing "LineageOS", "GitHub", etc. can sometimes identify the language as non-English because the word sounds like it, when actually it's just a brand name.

If you feel like increasing the threshold, go for it, but I personally would never put it above 0.5, perhaps not even above 0.4. I feel like 0.3 is a comfortable balance for most regular search queries. If you believe this is too low, I'm fine with you setting it to 0.4, but I think 0.3 is alright.

Also, part of this is related to me using the smaller, compressed version of the lid model. The ~100 MB model in most cases identifies the language as the same one, but gives slightly higher confidence numbers. The model can easily be replaced by the instance maintainer, but I don't think many (if any) of them will bother. We could think about adding a section to the docs about improving language detection accuracy by swapping the compressed model for the full one.
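To make the trade-off concrete, here is an illustrative comparison; the confidence values below are invented to mirror the cases described above, not real fasttext output:

```python
# Invented confidence values mirroring the discussion above.
cases = {
    "regular query": 0.38,             # typically detected around 0.35-0.4
    "keyboard smash": 0.27,            # random language just below 0.3
    "brand name ('LineageOS')": 0.33,  # sounds like a language, isn't one
}

for threshold in (0.3, 0.4):
    accepted = [name for name, prob in cases.items() if prob >= threshold]
    print(threshold, accepted)
# 0.3 accepts the regular query but also the brand name;
# 0.4 rejects the brand name, at the cost of rejecting the regular query too.
```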
With your experience in the background I do not want to change the value; it was just good to hear your experiences / thanks!
Not at this early stage .. let's get some experience in practice first. We're just getting some thoughts on the memory footprint and package size ... numpy alone takes up 69 MB (all packages from SearXNG take just 109 MB) .. we have to see what the impact on the deployment is ... to speed up and simplify the deployment we are thinking about using fasttext-wheel.

Memory footprint:

```
import fasttext                                      # --> +10 MB
fasttext.load_model(str(data_dir / 'lid.176.ftz'))   # --> +4 MB
```

A common default SearXNG installation has 4 workers --> +56 MB. On my instance I deployed this PR with a …
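The worker arithmetic, spelled out (a back-of-envelope estimate, assuming each worker independently imports fasttext and loads its own copy of the compressed model):

```python
import_cost_mb = 10   # importing fasttext (figure quoted above)
model_cost_mb = 4     # loading lid.176.ftz, per process
workers = 4           # common default SearXNG installation

total_mb = workers * (import_cost_mb + model_cost_mb)
print(total_mb)  # -> 56
```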
First of all, I just wanted to say thank you for helping get this feature merged. You've probably done more work than I did at this point :p If fasttext-wheel improves the deployment experience, then why not? I see no reason not to use it in this case.

Regarding the dependencies and memory footprint: this is clearly something to do outside of this PR, but we could move all language identification code in SearXNG to fasttext to get rid of unnecessary dependencies. langdetect is hugely out of date, not very accurate compared to fasttext, and seems to have the biggest memory footprint of all Python language identification models: https://modelpredict.com/language-identification-survey. If you think this is a good idea, I can get a draft PR going in a few days (I believe not a lot of code has to be changed to achieve this) and we can work on this change further. (I understand that this is a more fundamental change to how SearXNG works and it will take even more time than this PR, but I am willing to work on it, and I think it will benefit both users and maintainers, since fasttext is both more accurate and less memory-hungry in terms of RAM usage.)
Using …
- Add documentation to the plugin
- Harmonize FastText language model with SearXNG's language model

Resources::

```
import fasttext                                      # --> +10 MB
fasttext.load_model(str(data_dir / 'lid.176.ftz'))   # --> +4 MB
```

Suggested-by: @dalf

- To speed up and simplify the deployment, use fasttext-wheel instead of fasttext
- Building numpy on the Alpine Linux of the docker images takes ages --> install py3-numpy from Alpine's package manager (apk)
- Alpine Linux on the docker images (musl libc) does not support fasttext-wheel (gnu libc) --> patch the Dockerfile and build from fasttext:

```
sed -i s/fasttext-wheel/fasttext/ requirements.txt
```

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
It's not only me; many suggestions are coming from @dalf :-) We made some (hopefully) last changes to my commit on this branch / all changes are documented in the commit message.
Yes, this is a topic @dalf and I also considered .. if you have time, we are happy to see your PR :-)
Thanks a lot for adding lang-detection to SearXNG 👍
Thanks a lot! I will most likely make a PR to replace langdetect today.
Nice to hear / thanks! .. one recommendation I have .. if you implement a PR, create a branch for the PR. This PR was on your master branch .. and the master branch is the holy grail we normally don't want to touch in the development phase :-)
Replace langdetect with fasttext (followup of #1969)
[1] searxng#2027 (review) [2] searxng#1969 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
What does this PR do?
Adds a plugin which allows the search engine to autodetect the language of the search query and search in that language.
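A minimal sketch of the idea, assuming SearXNG's `pre_search` plugin hook and using a toy detector in place of the real langdetect/fasttext call (all names and the detection logic here are illustrative, not the PR's actual code):

```python
from types import SimpleNamespace

def detect(query):
    """Toy stand-in detector: the real plugin asks a language
    identification library and applies a confidence threshold.
    Here we just look for Cyrillic letters."""
    if any("а" <= ch <= "я" for ch in query.lower()):
        return "ru"
    return None  # unsure -> keep the user's configured language

def pre_search(request, search):
    """SearXNG-style pre_search hook (sketch): override the query
    language with the detected one, leaving it untouched when unsure."""
    detected = detect(search.search_query.query)
    if detected is not None:
        search.search_query.lang = detected
    return True  # True lets the search continue

# Simulate a search object: browser set to English, query in Russian.
search = SimpleNamespace(search_query=SimpleNamespace(query="привет мир", lang="en"))
pre_search(None, search)
print(search.search_query.lang)  # -> ru
```

When the detector is unsure, the user's configured language is used unchanged, which is the fallback behaviour discussed in the conversation above.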
Why is this change important?
Because it improves results drastically for people who use the search engine in different languages.
How to test this PR locally?
Just enable the plugin, and do some search queries in different languages.