This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

reduce the number of external bangs #2045

Closed
return42 opened this issue Jul 5, 2020 · 10 comments

@return42
Contributor

return42 commented Jul 5, 2020

We have 7438 external bangs plus some localized URLs; I guess we have roughly 8k search URLs to maintain.

This is too much, and ATM we do not know which of them are already broken or dead. I also have some privacy doubts about redirecting our users to URLs we have never visited.

We should reduce the external bangs significantly; I could imagine starting with roughly 10 or 20 major bangs.

@return42
Contributor Author

return42 commented Jul 5, 2020

Some thoughts about localisation (l10n), using Wikipedia and DE as an example:

What we have is more or less a mess ..

    {
      "favicon": "http://de.wikipedia.org/static/favicon/wikipedia.ico",
      "name": "de.wikipedia.org",
      "triggers": [
        "dewiki"
      ],
      "regions": {
        "default": "http://de.wikipedia.org/wiki/Special:Search?search={{{term}}}&go=Go"
      }
    },
    {
      "favicon": "https://de.wikipedia.org/static/favicon/wikipedia.ico",
      "name": "de.wikipedia.org",
      "triggers": [
        "dew"
      ],
      "regions": {
        "default": "https://de.wikipedia.org/wiki/{{{term}}}"
      }
    },
...
    {
      "favicon": "http://de.wikipedia.org/static/favicon/wikipedia.ico",
      "name": "Wikipedia",
      "triggers": [
        "wikide"
      ],
      "regions": {
        "default": "http://de.wikipedia.org/w/index.php?search={{{term}}} "
      }
    },
...
    {
      "favicon": "https://de.wikipedia.org/static/favicon/wikipedia.ico",
      "name": "Wikipedia (DE)",
      "triggers": [
        "wde",
        "w.de",
        "wiki.de",
        "wge"
      ],
      "regions": {
        "default": "https://de.wikipedia.org/w/index.php?search={{{term}}}"
      }
    },
    {
      "favicon": "http://de.wikipedia.org/static/favicon/wikipedia.ico",
      "name": "Wikipedia Deutschland",
      "triggers": [
        "wikipediade"
      ],
      "regions": {
        "default": "http://de.wikipedia.org/w/index.php?search={{{term}}}"
      }
    },

and finally we have some kind of redundancy...

    {
      "favicon": "https://en.wikipedia.org/favicon.ico",
      "functions": [
        "wikipediaCanonical"
      ],
      "name": "Wikipedia",
      "triggers": [
        "w",
        "wikipedia",
        "wiki",
        "encyclopedia",
        "wen"
      ],
      "regions": {
        "de": "https://de.wikipedia.org/wiki/{{{term}}}",
        "default": "https://en.wikipedia.org/wiki/{{{term}}}",
        "es": "https://es.wikipedia.org/wiki/{{{term}}}",
        "fr": "https://fr.wikipedia.org/wiki/{{{term}}}"
      }
    },

from which we can drop functions:

      "functions": [
        "wikipediaCanonical"
      ],

and also drop the bangs (trigger):

        "wikipedia",
        "wiki",
        "encyclopedia",
        "wen"

My suggestion is to remove all entries shown in the topmost code block and use only the final block, so we only have:

    {
      "favicon": "https://en.wikipedia.org/favicon.ico",
      "name": "Wikipedia",
      "triggers": [
        "w"
      ],
      "regions": {
        "de": "https://de.wikipedia.org/wiki/{{{term}}}",
        "default": "https://en.wikipedia.org/wiki/{{{term}}}",
        "es": "https://es.wikipedia.org/wiki/{{{term}}}",
        "fr": "https://fr.wikipedia.org/wiki/{{{term}}}"
      }
    },

And to have more flexibility in l10n, we implement a bang syntax that allows the user to choose the localization explicitly.

!!<trigger>.<l10n>
!!w.de
!!w.fr

So even if the user's browser is localized to DE, the user can search the French Wikipedia by explicitly using !!w.fr. If a localization does not exist (e.g. !!w.cn), the user will be redirected to the default Wikipedia.
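A minimal sketch of how such an explicit-localization bang could be resolved, assuming the `regions` layout of the reduced entry above. The names `BANGS` and `resolve_bang` are hypothetical and not part of searx, and the browser-locale case (no suffix given) is simplified to "use the default".

```python
from urllib.parse import quote

# Hypothetical in-memory table keyed by trigger; the values mirror the
# "regions" mapping of the reduced Wikipedia entry.
BANGS = {
    "w": {
        "de": "https://de.wikipedia.org/wiki/{{{term}}}",
        "default": "https://en.wikipedia.org/wiki/{{{term}}}",
        "es": "https://es.wikipedia.org/wiki/{{{term}}}",
        "fr": "https://fr.wikipedia.org/wiki/{{{term}}}",
    },
}


def resolve_bang(query):
    """Return the redirect URL for '!!<trigger>[.<l10n>] <term>', or None."""
    if not query.startswith("!!"):
        return None
    bang, _, term = query[2:].partition(" ")
    trigger, _, l10n = bang.partition(".")
    regions = BANGS.get(trigger)
    if regions is None:
        return None
    # An unknown or missing localization falls back to the default URL.
    template = regions.get(l10n, regions["default"])
    return template.replace("{{{term}}}", quote(term.strip()))
```

With this table, `!!w.fr Paris` resolves to the French URL, while `!!w.cn Paris` falls back to the `default` entry.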

@lukasvdberk
Contributor

This is too much, and ATM we do not know which of them are already broken or dead.

Now that I have been thinking about it, I agree, since this project is privacy focused. More bangs are of course more convenient, but a pain to maintain, like you said. If we have fewer, we can also create unit tests for every bang (something like a JSON file with the external bang, the query, and a text that should be included on the page).

Maybe I can help with creating these tests and with simplifying the external bang JSON file? @return42

Some bangs I currently use a lot and which I think should be included:
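A rough sketch of what such a test case and its check could look like. The field names (`bang`, `query`, `expected_text`) and the function `check_case` are assumptions for illustration only; the resolver and the page fetcher are passed in so the check itself can be exercised without network access.

```python
# Hypothetical test-case format: the bang, a query, and a text that
# should appear on the page the bang redirects to.
TEST_CASES = [
    {"bang": "w", "query": "searx", "expected_text": "searx"},
]


def check_case(case, resolve, fetch):
    """Return True if the page behind the bang contains the expected text.

    `resolve(bang, query)` maps a bang and a query to a URL (or None) and
    `fetch(url)` returns the page body as text. Both are injected by the
    caller, so a unit test can pass in fakes.
    """
    url = resolve(case["bang"], case["query"])
    if url is None:
        return False
    try:
        body = fetch(url)
    except OSError:
        # Network failures (urllib.error.URLError is an OSError) count as broken.
        return False
    return case["expected_text"].lower() in body.lower()
```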

  • !!yt (YouTube)
  • !!gh (GitHub)
  • !!w (Wikipedia)
  • !!ae (AliExpress)
  • !!spt (Spotify)
  • !!g (Google)
  • !!ddg (DuckDuckGo)
  • !!gm (Google Maps)
    (let me know what you guys think should be removed/added)

And to have more flexibility in l10n, we implement a bang syntax that allows the user to choose the localization explicitly.

I think that is really a great idea!

@lukasvdberk
Contributor

If we drop the number of bangs to 20, maybe we can create an ExternalBang class in Python instead of a JSON file, with fields like domain, regions, and trigger, or something like that.
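One possible shape for such a class, as a sketch only. The field names and the `url_for` method are hypothetical; they loosely follow the keys of the existing JSON entries (`name`, `triggers`, `regions`).

```python
from dataclasses import dataclass, field
from urllib.parse import quote


@dataclass
class ExternalBang:
    """Hypothetical container for one external bang."""
    name: str
    triggers: tuple
    default_url: str
    # Maps a region code such as "de" to a localized URL template.
    regions: dict = field(default_factory=dict)

    def url_for(self, term, region=None):
        """Fill the search term into the regional template, or the default."""
        template = self.regions.get(region, self.default_url)
        return template.replace("{{{term}}}", quote(term))


WIKIPEDIA = ExternalBang(
    name="Wikipedia",
    triggers=("w",),
    default_url="https://en.wikipedia.org/wiki/{{{term}}}",
    regions={"de": "https://de.wikipedia.org/wiki/{{{term}}}"},
)
```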

@return42
Contributor Author

return42 commented Jul 6, 2020

maybe we can create an ExternalBang class in Python

I vote for a YAML config file placed next to the settings.yml file.

Maybe I can help ..

Before we start to implement, let's hear what others say .. but yes, your contributions are welcome :)

with creating these tests and with simplifying the external bang JSON file? @return42

A unit test is nothing an admin can run; I vote for a command line tool that checks the configured external bangs from the YAML file. I haven't had time to look deeper, but we have a searx-checker, and maybe it's best to implement it there. Hint: @dalf suggests embedding this tool into searx.
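To make the idea of an admin-runnable check more concrete, here is a small sketch of a command line tool. It is not searx-checker and does not use its API; it only does offline checks, and it assumes the bangs are stored as a JSON list of entries in the format shown earlier (JSON rather than YAML, to stay within the standard library). The names `lint_entry` and `main` are made up for this example.

```python
import argparse
import json

PLACEHOLDER = "{{{term}}}"


def lint_entry(entry):
    """Return a list of problems found in one bang entry (offline checks only)."""
    problems = []
    if not entry.get("triggers"):
        problems.append("no triggers")
    for region, url in entry.get("regions", {}).items():
        if not url.startswith("https://"):
            problems.append(f"{region}: URL is not https")
        if PLACEHOLDER not in url:
            problems.append(f"{region}: missing {PLACEHOLDER}")
        if url != url.strip():
            problems.append(f"{region}: stray whitespace in URL")
    return problems


def main(argv=None):
    """Lint every entry of the given file; return 1 if any problem was found."""
    parser = argparse.ArgumentParser(description="Check an external bangs file.")
    parser.add_argument("path", help="JSON file with a list of bang entries")
    args = parser.parse_args(argv)
    with open(args.path, encoding="utf-8") as f:
        entries = json.load(f)
    failed = 0
    for entry in entries:
        for problem in lint_entry(entry):
            failed += 1
            print(f"{entry.get('name', '?')}: {problem}")
    return 1 if failed else 0
```

A script would call `sys.exit(main())` under an `if __name__ == "__main__":` guard. A check for dead URLs would need network access and could be added as a further step.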

Many ideas and a lot of work :)

As a first step, I think we should simply reduce and clean up the JSON file as it is, as shown in my last example above.

@dalf
Contributor

dalf commented Jul 6, 2020

Some users may expect to have the same bangs in duckduckgo and searx; but at the same time, it is clearly a mess. It seems the list is based on the duckduckgo bangs, where the autocompletion UI is really helpful.

I like the suggestion, it is much clearer. Why not match the searx bangs: wp and wikipedia?


I think the file should remain in searx/data because it has a different life cycle than settings.yml


Source: DuckDuckGo

If you go to https://duckduckgo.com/newbang you will see that /bang.v255.js is downloaded; this is the actual list of duckduckgo bangs, and the result needs some cleanup. Example (notice the different values for the sc field, the subcategory):

  {
    "s": "Debian Packages",
    "c": "Tech",
    "t": "debpackages",
    "u": "https://packages.debian.org/search?keywords={{{s}}}",
    "d": "packages.debian.org",
    "sc": "Downloads (software)",
    "r": 10
  },
...
  {
    "r": 15,
    "t": "dpackages",
    "u": "http://packages.debian.org/search?keywords={{{s}}}",
    "sc": "Sysadmin (debian)",
    "d": "packages.debian.org",
    "c": "Tech",
    "s": "Debian Packages"
  },

The same kind of duplicate entries exists in the searx external bangs.
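Duplicates like the two Debian entries above could be found mechanically. The following sketch groups entries in the DuckDuckGo format by their search URL, ignoring the scheme so that the http and https variants compare equal. The helper names are hypothetical; only the keys `t` (trigger) and `u` (URL) from the example are used.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Drop the scheme so http:// and https:// variants compare equal."""
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path + "?" + parts.query


def find_duplicates(entries):
    """Group triggers ('t') by normalized search URL ('u'); keep groups > 1."""
    groups = defaultdict(list)
    for entry in entries:
        groups[normalize(entry["u"])].append(entry["t"])
    return {url: triggers for url, triggers in groups.items() if len(triggers) > 1}
```

Applied to the two entries shown above, this reports `debpackages` and `dpackages` as triggers for the same search URL.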


Source: Wikidata

We can use this query (press Ctrl-Enter to run the query).

This one may be off topic: this query gets the URL linked to an ID (whatever the ID is).


And this is where I start to think that a tool similar to Transifex could be useful to decrease the maintenance cost, not globally but per person, as on Wikipedia. This tool would manage the data that is now in searx/data (the bangs, the currencies). It could be similar to a wiki: anyone can contribute, but here we need some moderators (so we can't use Wikidata directly). It is no longer about code, it is about managing data. To make it clearer, we would have an entry for each bang: external bang, favicon, search URL for each language. The thing is, any of these entries can later become a searx engine. So this tool could also manage the "requested engine" issues. And with some UI/UX, it can be a sandbox to define the right parameters for the xpath engine. I know this is not for tomorrow.

@return42
Contributor Author

return42 commented Jul 9, 2020

@dalf thanks for your additional hints ..

Some users may expect to have the same bangs in duckduckgo and searx
I think the file should remain in searx/data because it has a different life cycle

I agree with you; my first suggestion of having a config for bangs was not a good idea. The bangs should be the same on all instances and therefore must not be configurable.

And this is where I start to think that a tool similar to Transifex could be useful to decrease the maintenance cost,

I'm not so happy with solutions that need several accounts to maintain searx development. Nevertheless, your ideas are very interesting!

I know this is not for tomorrow.

:) .. yes, let's start with the most obvious first. First we need to tidy up the mess; this could be done very simply by building up a Python dictionary in the data folder. Otherwise, I fear that some users will get used to the wrong bangs.

@return42
Contributor Author

return42 commented Jul 9, 2020

I know this is not for tomorrow.

OT: You have a lot of ideas and your considerations are often strategic. Most of your considerations are spread around in gh-issues. Does it make sense to use the gh-wiki to collect such remarks and order them by subject?

I mean, should we start using gh-wiki for strategic thoughts?

@bat999

bat999 commented Jul 9, 2020

Before we start to implement, let's hear what others say ..

Hi
If Searx tries to create and maintain its own set of bangs then I think that it will be a pita.
Why not "scrape" the DDG bangs webpage?

Then we could make the statement...

Searx does not provide bangs, but it will honour any of those that are shown on the excellent DuckDuckGo website.
https://duckduckgo.com/bang
But remember folks, with Searx you must use double exclamation points like this !!w foo.

@return42
Contributor Author

return42 commented Jul 9, 2020

Why not "scrape" the DDG bangs webpage?

This is more or less what we want .. we should use known bang names from DDG ... BUT: searx is about privacy .. DDG has 16,000 bangs and doesn't care where you are redirected. We shouldn't do that, as this could cause a loss of trust in searx.

That is also one more reason for me to vote against any solution ...

It could be similar to a wiki: anyone can contribute,

... where bangs are coming from outside without any quality gate ..

But remember folks, with Searx you must use double exclamation points like this !!w foo.

We have a user base that has learned over the years to use a single exclamation point to select engines. We will never change this!

@return42
Contributor Author

There is another problem: the current bangs contain images (#2076).

A first step could be to clean up the current JSON file; on that, @dalf made some good suggestions in #2045 (comment) and #2052 (comment)

I still don't know if I have the time to implement a PR. Unfortunately, probably not. So if someone has the time ... your PR is welcome :)
