
Add language autodetect plugin #1969

Merged · 2 commits · Dec 12, 2022

Conversation

ArtikusHG
Contributor

What does this PR do?

Adds a plugin which allows the search engine to autodetect the language of the search query and search in that language.

Why is this change important?

Because it improves results drastically for people who use the search engine in different languages.

How to test this PR locally?

Just enable the plugin, and do some search queries in different languages.

@ArtikusHG
Contributor Author

Is this not getting merged because I have to change something, or do I simply need to wait a bit more? If you need any further changes, please tell me.

@return42
Member

return42 commented Nov 29, 2022

Hi @ArtikusHG, thanks for your PR, a very interesting plugin 👍

Is this not getting merged because I have to change something, or do I simply need to wait a bit more?

Wait a bit more :-) .. TBH I haven't had time to review and I am sorry about that. Without a deep review I can only do some bikeshedding and scratch the surface .. three things come to mind:

  • langdetect should be added to the requirements.txt
  • what happens if the text is too ambiguous or too short? Will search.search_query.lang = detect(search.search_query.query) be set to None or an empty string?
  • some engines (most common search engines?) return best results when a region is selected (de-DE, fr-BE, en-US ..) / should we take care about this?

I want to see this PR gets merged, but before we should think about these topics.


Update:

  • about region (and script): langdetect seems to support zh-cn, zh-tw (I assume it's zh-Hans and zh-Hant) .. besides the lower/upper case of the region part (in SearXNG: zh-CN, zh-TW): what about zh-HK (zh-Hant)?
  • which language setting should win? .. is it OK that the plugin (if activated) wins over:

The language and locale (aka region) detection is done on different criteria:

searxng/searx/webapp.py

Lines 540 to 552 in a8359dd

    # language is defined neither in settings nor in preferences
    # use browser headers
    if not preferences.get_value("language"):
        language = _get_browser_language(request, settings['search']['languages'])
        preferences.parse_dict({"language": language})
        logger.debug('set language %s (from browser)', preferences.get_value("language"))

    # locale is defined neither in settings nor in preferences
    # use browser headers
    if not preferences.get_value("locale"):
        locale = _get_browser_language(request, LOCALE_NAMES.keys())
        preferences.parse_dict({"locale": locale})
        logger.debug('set locale %s (from browser)', preferences.get_value("locale"))

searxng/searx/webapp.py

Lines 221 to 231 in a8359dd

    def _get_browser_language(req, lang_list):
        for lang in req.headers.get("Accept-Language", "en").split(","):
            if ';' in lang:
                lang = lang.split(';')[0]
            if '-' in lang:
                lang_parts = lang.split('-')
                lang = "{}-{}".format(lang_parts[0], lang_parts[-1].upper())
            locale = match_language(lang, lang_list, fallback=None)
            if locale is not None:
                return locale
        return 'en'
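The header-parsing logic above can be read as a standalone sketch (simplified: SearXNG's match_language() helper is replaced here by a plain membership check, so this is an approximation of the behavior, not the real helper):

```python
# Simplified, standalone sketch of SearXNG's _get_browser_language():
# the real code matches candidates via match_language(); here a plain
# membership check against lang_list stands in for it.
def get_browser_language(accept_language: str, lang_list: list) -> str:
    for lang in accept_language.split(","):
        # strip the quality value: "de;q=0.8" -> "de"
        if ";" in lang:
            lang = lang.split(";")[0]
        # normalize region casing: "de-de" -> "de-DE"
        if "-" in lang:
            parts = lang.split("-")
            lang = "{}-{}".format(parts[0], parts[-1].upper())
        if lang in lang_list:
            return lang
    return "en"

# e.g. a German browser with an English fallback:
print(get_browser_language("de-de,de;q=0.9,en;q=0.8", ["de-DE", "en"]))  # de-DE
```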

    def parse_dict(self, input_data: Dict[str, str]):
        """parse preferences from request (``flask.request.form``)"""
        for user_setting_name, user_setting in input_data.items():
            if user_setting_name in self.key_value_settings:
                if self.key_value_settings[user_setting_name].locked:
                    continue
                self.key_value_settings[user_setting_name].parse(user_setting)
            elif user_setting_name == 'disabled_engines':
                self.engines.parse_cookie(input_data.get('disabled_engines', ''), input_data.get('enabled_engines', ''))
            elif user_setting_name == 'disabled_plugins':
                self.plugins.parse_cookie(input_data.get('disabled_plugins', ''), input_data.get('enabled_plugins', ''))
            elif user_setting_name == 'tokens':
                self.tokens.parse(user_setting)
            elif not any(
                user_setting_name.startswith(x) for x in ['enabled_', 'disabled_', 'engine_', 'category_', 'plugin_']
            ):
                self.unknown_params[user_setting_name] = user_setting

@ArtikusHG
Contributor Author

I will answer all the questions a bit later, however: langdetect is already a dependency of SearXNG (requirements.txt: langdetect==1.0.9), so I am not adding any new dependencies.

@return42
Member

(requirements.txt: langdetect==1.0.9), so I am not adding any new dependencies.

Oops, sorry.

@ArtikusHG
Contributor Author

ArtikusHG commented Nov 29, 2022

Okay, back at my laptop, gonna try to answer all the questions.

what happens if the text is too ambiguous or too short?

As far as I know, langdetect falls back to English. I didn't think of that for some reason; I will implement a check so that if the language isn't detected (or is detected with a confidence level that indicates it's clearly wrong) it falls back to the user's default language.
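That fallback check could look roughly like this (a sketch, not the PR's code: detect_with_confidence is a hypothetical stand-in for whichever detector ends up being used, and the threshold value is illustrative):

```python
# Sketch of confidence-based fallback: keep the user's configured
# language unless the detector is reasonably sure about another one.
# detect_with_confidence() is a hypothetical stand-in for the detector.
def choose_language(query: str, user_lang: str, detect_with_confidence,
                    threshold: float = 0.3) -> str:
    detected, confidence = detect_with_confidence(query)
    if detected and confidence >= threshold:
        return detected
    return user_lang  # too short / too ambiguous -> fall back

# With a toy detector that is unsure about short queries:
toy = lambda q: ("fr", 0.9) if len(q) > 10 else ("fr", 0.1)
print(choose_language("apples", "en", toy))  # en (low confidence)
```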

some engines (most common search engines?) return best results when a region is selected (de-DE, fr-BE, en-US ..) / should we take care about this?

I don't think it's accurate enough, and I don't think it will affect search results enough for us to care. As far as I understand you're not talking about location-based results (e.g. local shops, etc.), but even if you are, implementing this based on language is an even worse idea.

which language setting should be win? .. is it OK that the plugin (if activated) wins over: [...]

I believe the autodetected language (if detected) should be the one that is being chosen for searching. Sourcing the language and region from the browser works great as long as the user is only searching in one language - the one that is selected in the browser. This plugin is targeted at users who search in different languages. As I mentioned, I often switch between searching in Russian and English (and sometimes Romanian). The whole point of automatically detecting the language is that the language in which the search query is done is much more important than the one that is statically set in the browser. It doesn't matter that the user has their browser in English as long as they're searching in French, you know?

Sorry for getting a bit too elaborate, I just wanted to make sure I get my points across properly. I am open to further discussion, but I think the only valid concern is that we should implement a fallback to the original language from the settings if no language is accurately detected.

@ArtikusHG
Contributor Author

ArtikusHG commented Nov 29, 2022

Okay, I did a bit more testing, and I've found that the langdetect library performs extremely poorly with short search queries ("apples" gets identified as French). Will adding a second language detection library into the project be a huge problem? Probably the best contender so far is Google's CLD3 - it supports 107 languages (langdetect supports only 57), does not require any additional downloads (e.g. language identification models for FastText) and is the only decently maintained modern library for language detection in Python at the moment (langdetect, langid and others are dead projects, and other libraries all require additional language models). Also, CLD3 provides an is_reliable boolean which returns false when the library cannot accurately identify the language, which allows us to easily implement a fallback.

EDIT: CLD3 seems to have similar problems regarding short queries, and as far as I can tell, most other libraries (even outdated ones) have the same problem. As for the model-based ones, most instance maintainers are probably not gonna bother setting up language detection models for their instance, so this feature would not be available to most users. Including the models with SearXNG would make the download massive (even though there are some decently small ones). What do we do?

To be clear, for long-ish queries langdetect performs mostly fine, and this is still better in most cases than switching the language manually, but we should definitely consider using something better.

@ArtikusHG
Contributor Author

ArtikusHG commented Nov 29, 2022

Okay, last comment: fasttext has a model of ~917 kb that does much better than langdetect. It is not perfect, but it is a lot better than langdetect (and other standalone libraries too for that matter). Paired with fallback detection (e.g. not applying results that have a <0.3 confidence level) this results in, on average, pretty good results. I think this is a great solution - include the 917 kb model in the repo and use it as the default one, and add an option (either in settings.yml or in the plugin directly) to use a different model if the maintainer wishes.

FYI, the model is distributed under Creative Commons Attribution-Share-Alike License 3.0, so including it in the repo should be no legal problem.
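fastText's predict() returns labels of the form __label__en together with a probability, so the thresholding described above is a small post-processing step. A sketch of that step (pure Python so it reads without fastText installed; the 0.3 cut-off is the value discussed in this thread):

```python
# Post-process a fastText-style prediction: fastText returns labels
# like "__label__en" plus a probability; below the threshold we
# report no detection so the caller can fall back to the user's language.
def parse_prediction(labels, scores, threshold=0.3):
    if not labels or scores[0] < threshold:
        return None
    return labels[0].replace("__label__", "")

print(parse_prediction(("__label__en",), (0.95,)))  # en
print(parse_prediction(("__label__fr",), (0.25,)))  # None
```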

@tiekoetter
Member

@ArtikusHG If you think you can implement it, then go for it.

I like the functionality of this plugin and it would be great if it would also work reliably for short search phrases.

@ArtikusHG
Contributor Author

ArtikusHG commented Nov 29, 2022

Implementing language detection with fasttext instead of langdetect is not a hard task. If you guys approve of the fasttext dependency addition, I will implement it tomorrow, alongside fallback detection.

@ArtikusHG
Contributor Author

I made the required changes to use fasttext. The plugin is now much better at detecting languages. It still gets confused here and there at times, but this is not a huge issue. A couple of notes:

  • I wasn't sure where to put the language identification model, so I put it in the data folder. If it's supposed to go elsewhere, please move it (or tell me where to move it)
  • fasttext.FastText.eprint = lambda x: None is needed to prevent fasttext from showing a (useless) warning when loading a model - it's just an informational warning about some changes in the API which does not affect functionality, but it stops the plugin from being loaded
  • I used confidence > 0.3 as the threshold for changing the language - if you feel this is too low, feel free to change it

@return42
Member

FYI, before committing you should run make test

To get the reported issue fixed you can run make format.python.

Thanks a lot for your efforts on this PR 👍

I hope I have time in the next days to review / if not, please be patient with the maintainers.

One question I have ..

I wasn't sure where to put the language identification model, so I put it in the data folder.

I assume it's a training file .. right? .. where did you get it from (maybe we can regularly update this file from CI, like we do with other data files).

@ArtikusHG
Contributor Author

Whoops, sorry for the messy formatting, I will fix it tomorrow.

The source of lid.176.ftz is: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz - however, it's not a model that is being updated regularly. As far as I'm aware, no changes have been made to it since it was published, so it wouldn't make sense to update it from CI.

@return42
Member

return42 commented Dec 5, 2022

@ArtikusHG @tiekoetter is it OK when I squash the commits and rebase the branch on master?

@ArtikusHG
Contributor Author

I'm fine with that

@return42 return42 self-requested a review December 9, 2022 19:07
@return42
Member

I'm fine with that

Thanks 👍 .. I also placed a commit on top to improve 'Autodetect search language' plugin:

  • add documentation
  • harmonize FastText language model with SearXNG's language model

IMO this PR is a good starting point to get language autodetection into SearXNG, but there is also some space left for improvements.

If you look through the docs I added, you will find the following passage:

SearXNG's locale of a query comes from (highest wins):

  1. The Accept-Language header from the user's HTTP client.
  2. The user selects a locale in the preferences.
  3. The user selects a locale from the menu in the query form (e.g. :zh-TW)
  4. This plugin is activated in the preferences and the locale (only the language
    code / no region code) comes from fastText's language detection.

Conclusion: There is a conflict between the language selected by the user and
the language from language detection of this plugin. For example, the user
explicitly selects the German locale via the search syntax to search for a term
that is identified as an English term (try :de-DE thermomix, for example).

In this screenshot the user selects the de-DE locale, but the autodetect plugin has switched the language to en because the name of the product the user looks for is an English word:

(screenshot)

We can't solve this conflict in this PR; it needs some SearXNG core changes, and I don't want to do them here.

Since it is only a plugin that the user can disable I think we can merge this PR.

Only one question I have: @ArtikusHG you set the threshold at 0.3 .. did you have any reason to choose this value .. should we think about the value again and possibly increase it?

@ArtikusHG
Contributor Author

ArtikusHG commented Dec 10, 2022

The reason for the 0.3 threshold was that from my testing (loading the model into Python and just trying it on some common words from languages I know) it seemed like 0.3 was the best option. Many queries I typed in were identified correctly with a confidence of around 0.35-0.4. However, many non-standard queries (e.g. just smashing the keyboard, aka "asojfdowqireuowqeuroqidalkjfas") were assigned seemingly random languages with confidence just below 0.3 (I remember most were like 0.25 or 0.28). The same thing also happens with brand names - typing "LineageOS", "GitHub", etc. can sometimes identify the language as non-English because the word sounds like it, but actually it's just a brand name.

If you feel like increasing the threshold, go for it - but I personally would never put it above 0.5, perhaps even 0.4. I feel like 0.3 is a comfortable balance for most regular search queries. If you believe this is too low, I'm fine with you setting it to 0.4, but I think 0.3 is alright.

Also, part of this is related to me using the smaller, compressed version of the lid model. The ~100 MB model in most cases identifies the language as the same one, but gives slightly higher confidence numbers. The model can easily be replaced by the instance maintainer, but I don't think many (if any) of them will bother. We could think about adding a section to the docs about improving language detection accuracy and swapping the compressed model for the full one.

@return42
Member

return42 commented Dec 10, 2022

If you feel like increasing the threshold, go for it

with your experience in the background I do not want to change the value, it was just good to hear your experiences / thanks!

We could think about adding a section to the docs about improving language detection accuracy and swapping the compressed model for the full one.

Not at this early stage .. lets get some experience in practice first.

We're just getting some thoughts on the memory footprint and package size ... numpy alone takes up 69MB (all packages from searxng take just 109MB) .. we have to see what the impact on deployment is ... to speed up and simplify the deployment we are thinking about using fasttext-wheel instead of fasttext ..


Memory footprint:

import fasttext                                    # --> +10 MB
fasttext.load_model(str(data_dir / 'lid.176.ftz')) # --> +4MB

A common default SearXNG installation has 4 workers --> +56MB


On my instance I deployed this PR with a fasttext-wheel --> https://darmarit.org/searx/preferences

@ArtikusHG
Contributor Author

First of all, just wanted to say thank you for helping get this feature merged. You've probably done more work than I did at this point :p

If fasttext-wheel improves the deploying experience, then why not? I see no reason not to use it in this case.

Regarding the dependencies and memory footprint: this is clearly something to do outside of this PR, but we could move all language identification code in SearXNG to fasttext to get rid of unnecessary dependencies. Langdetect is hugely out of date, not very accurate compared to fasttext, and seems to have the biggest memory footprint of all Python language identification libraries: https://modelpredict.com/language-identification-survey. If you think this is a good idea, I can get a draft PR going in a few days (I believe there's not a lot of code that has to be changed to achieve this) and we can work on this change further. I understand that this is a more fundamental change to how SearXNG works and it will take even more time than this PR, but I am willing to work on it, and I think it will benefit both users and maintainers, since fasttext is both more accurate and less memory-hungry in terms of RAM usage.

@ArtikusHG
Contributor Author

Using grep -R -i langdetect ./* in the SearXNG repo directory, I found that langdetect is only used in two files:

  • searx/search/checker/impl.py
  • searxng_extra/update/update_engine_descriptions.py

In both of these files, replacing langdetect with fasttext is trivial - simply replacing the imports and the few calls to langdetect with calls to fasttext will be enough. If you think this is a good idea, I can start a PR for this today (on a side note: should we get this merged first? Or should I move the lid training file to the fasttext replacement PR, we merge that first, and then the autodetect plugin?)
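The call-site change is indeed small: langdetect's detect(text) returns a bare code like "en", while fastText's predict(text) returns ("__label__en",) plus a probability, so the replacement mostly needs a tiny shim. A sketch (the model argument is assumed to be a loaded lid.176.ftz; a fake model is used here so the example runs without fastText):

```python
# Sketch of a langdetect-compatible detect() built on a loaded
# fastText model: langdetect returns "en", fastText returns
# (("__label__en",), [0.95]), so we unwrap and strip the prefix.
def detect(text, model):
    # fastText rejects newlines in input, so flatten them first
    labels, _scores = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

# Works with anything exposing fastText's predict() signature:
class FakeModel:
    def predict(self, text):
        return ("__label__en",), (0.95,)

print(detect("hello world", FakeModel()))  # en
```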

@return42
Member

First of all, just wanted to say thank you for helping get this feature merged. You've probably done more work than I did at this point :p

It's not only me, many suggestions are coming from @dalf :-)

We had some (hopefully) last changes made to my commit on this branch / all changes are documented in the commit message:

[mod] improve 'Autodetect search language' plugin

- Add documentation to the plugin
- Harmonize FastText language model with SearXNG's language model

Resources::

    import fasttext                                    # --> +10 MB
    fasttext.load_model(str(data_dir / 'lid.176.ftz')) # --> +4MB

Suggested-by: @dalf

- To speed up and simplify the deployment, use fasttext-wheel instead of fasttext
- Building numpy on the Alpine Linux of docker-images takes ages --> install
  py3-numpy from Alpine's package manager (apk)
- Alpine Linux on docker-images (musl libc) does not support fasttext-wheel (gnu
  libc) --> patch Dockerfile and build from fasttext:

     sed -i s/fasttext-wheel/fasttext/ requirements.txt

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

Langdetect is hugely out of date ... If you think this is a good idea, I can get a draft PR going in a few days

Yes, this is a topic @dalf and I also considered .. if you have time, we are happy to see your PR :-)

@return42 return42 left a comment

Thanks a lot for adding lang-detection to SearXNG 👍

@ArtikusHG
Contributor Author

Thanks a lot! I will most likely make a PR for replacing langdetect today.

@return42
Member

return42 commented Dec 11, 2022

Thanks a lot! I will most likely make a PR for replacing langdetect today.

Nice to hear / thanks! .. one recommendation I have .. if you implement a PR, create a branch for the PR.

This PR was on your master branch .. and the master branch is the holy grail we normally don't want to touch during the development phase :-)

dalf added a commit that referenced this pull request Dec 16, 2022
Replace langdetect with fasttext (followup of #1969)
return42 added a commit to return42/searxng that referenced this pull request Mar 5, 2023
[1] searxng#2027 (review)
[2] searxng#1969 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>