What is the data lifecycle ? #2052

dalf · 2020-07-08T14:32:05Z

Maybe I'm overthinking.

Which data ?

DOI
- https://github.com/asciimoo/searx/blob/52eba0c7210c53230fb574176d4bf1af771bf0b4/searx/settings.yml#L905-L911
- Update script: No
- Data source: various, https://www.wikidata.org/wiki/Q21980377 (the "official website" property) may help.
- When it should be updated ? check every day (?)
- When it is updated: never.
- Is it a problem to not update ? ❌ outdated URL, disappointing user experience.
- Is it a problem to update ? No.
searx/data/bangs.json:
- Update script: none (could be updated automatically, followed by some manual cleanup)
- Data source: jivesearch (useful sources are DuckDuckGo bangs and Wikidata)
- When it should be updated ? requires manual checking, or perhaps automatic checking (automatically extract the opensearch.xml, check the URL, etc.). Related to my off-topic comment in "reduce the number of external bangs" #2045 (comment).
- When is it updated ? N/A
- Is it a problem to not update ? ❌ outdated URL, disappointing user experience.
- Is it a problem to update ? No.
searx/data/currencies.json :
- Update script: https://github.com/asciimoo/searx/blob/master/utils/fetch_currencies.py
- Data source: https://github.com/asciimoo/searx/pull/993/files
- When it should be updated: ?? check every month / year.
- When it is updated: never
- Is it a problem to not update ? ❔ should not be a problem.
- Is it a problem to update ? No.
searx/data/useragents.json:
- Update script: https://github.com/asciimoo/searx/blob/master/utils/fetch_firefox_version.py
- Data source: https://ftp.mozilla.org/pub/firefox/releases/
- When it should be updated: as soon as there is a new Firefox version, but engine compatibility must be checked first.
- When it is updated: sometimes.
- Is it a problem to not update ? ❔ An old Firefox version may be a problem with some engines.
- Is it a problem to update ? Some engines may stop working.
searx/data/engines_languages.json:
- Update script: https://github.com/asciimoo/searx/blob/master/utils/fetch_languages.py
- Data source: the source is the results of the fetch_supported_languages / _fetch_supported_languages functions.
- When it should be updated: ?? check every week / month / year ??
- When it is updated: sometimes when an engine is updated.
- Is it a problem to not update ? ❔ I don't know. Most probably it doesn't change too much.
- Is it a problem to update ? If the fetch_supported_languages function no longer matches the actual website, the updated data may be worse than the old data.
searx/engines
- It is code but it is related to searx/data/engines_languages.json and the life cycle is different from the core.
- related to Embedded searx-checker #1559
certifi package
- https://github.com/asciimoo/searx/blob/52eba0c7210c53230fb574176d4bf1af771bf0b4/requirements.txt#L1
- No update script
- When it should be updated: as soon as there is a new version (is there a reason not to update?)
- When it is updated: rarely.
- Is it a problem to not update ? ❔ from a security point of view, it would be better to update.
- Is it a problem to update ? No.
HSTS preload package (if httpx replaces requests)
- When it should be updated ? as soon as there is a new version.
- Is it a problem to not update ? ✔️ No, since engines use the HTTPS protocol (the package can serve as a safety net).
- Is it a problem to update ? No.
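
The inventory above can also be written down as data, which makes it possible to ask "what is due for an update today?". The following is only a minimal sketch: the `DataSource` class, its field names, the check intervals and the `last_updated` dates are all hypothetical placeholders (the intervals echo the tentative "every week / month" suggestions above), not something that exists in searx.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DataSource:
    """One entry of the inventory (hypothetical structure, not searx code)."""
    name: str
    update_script: Optional[str]    # None when there is no update script
    check_every: timedelta          # tentative interval, to be discussed
    last_updated: Optional[date]    # None means "never updated"

    def is_due(self, today: date) -> bool:
        # A source that was never updated is always due.
        if self.last_updated is None:
            return True
        return today - self.last_updated >= self.check_every


def due_sources(sources: List[DataSource], today: date) -> List[str]:
    """Return the names of all sources whose check interval has elapsed."""
    return [source.name for source in sources if source.is_due(today)]


# Placeholder values for illustration only.
INVENTORY = [
    DataSource("searx/data/currencies.json", "utils/fetch_currencies.py",
               timedelta(days=30), None),
    DataSource("searx/data/engines_languages.json", "utils/fetch_languages.py",
               timedelta(days=7), date(2020, 7, 1)),
]
```

Calling `due_sources(INVENTORY, date.today())` would list the files to refresh. Entries without an update script (the DOI resolvers, bangs.json) could be added with `update_script=None` so that they at least show up as needing a manual check.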

Data and searx installation

The data is updated in the git repository. Once searx is cloned or installed, the data stays the same for as long as searx itself is not updated.

How to update the data more often ?

1. Do nothing, keep the same process.
2. Keep the data in the searx git repository, add make data.update to update everything.
   - When to call it ?
     - manually : same problem as now.
     - cron in travis / github action : the script can create a PR.
3. Create a different package searx-data, automatically updated. It requires trust in this process.
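
As a rough illustration of what a `make data.update` target could call, here is a minimal sketch that runs every `utils/fetch_*.py` script and reports which ones succeeded. The function name `run_fetch_scripts` and the timeout value are assumptions made for this example. It also assumes each script can be launched as a standalone program from the given working directory, which may not hold for the real scripts without extra setup.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, Optional


def run_fetch_scripts(utils_dir: Path, cwd: Optional[Path] = None,
                      timeout: float = 600.0) -> Dict[str, bool]:
    """Run every fetch_*.py script in utils_dir; return {script name: succeeded}."""
    results: Dict[str, bool] = {}
    for script in sorted(utils_dir.glob("fetch_*.py")):
        try:
            # Use the current interpreter so the scripts see the same environment.
            proc = subprocess.run([sys.executable, str(script)],
                                  cwd=cwd, timeout=timeout, check=False)
            results[script.name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            # A hanging script must not block the other updates.
            results[script.name] = False
    return results
```

A cron job (Travis or a GitHub action) could call `run_fetch_scripts(Path("utils"))`, then open a PR only if the working tree changed and no script failed. One failing script does not stop the others, so a single broken data source would not block the remaining updates.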


asciimoo · 2020-07-08T15:57:42Z

> Maybe I'm overthinking.

I don't think that you're overthinking, this is a real issue that we need to address.

I'd pick the second option from your suggested solutions, with periodic automatic updates.

dalf · 2020-07-16T15:57:11Z

Some brainstorming: https://github.com/asciimoo/searx/wiki/Brainstorming:-IDE-&-database

return42 · 2020-07-19T14:03:12Z

Brainstorming: IDE & database

@dalf, thanks for your article. Here are my 5 cents:

IDE: for me, developing or bug fixing a searx engine is a very individual task, where I want to have the maximum degree of freedom to use the Swiss army knife which fits at its best to the context.

An IDE helps flatten the learning curve. The flip side of the coin is that the quality of the contributions regresses and the maintainers have to discuss the same subjects again and again. I remember all the contributions with the "Update <filename>" commit messages (e.g. #1941). I mean, IDEs are really good at flattening the learning curve, but they don't help if the know-how is missing.

Database: mostly the same as what I said about IDEs. Besides that, decoupling engine development from the searx kernel could be a solution for regular updates, and for this a git repository seems to be a more suitable solution. But #2052 (comment) says he wants to keep the data in the searx git repository.

dalf · 2020-07-20T08:04:56Z

> I want to have the maximum degree of freedom to use the Swiss army knife which fits at its best to the context.

The purpose is not to enforce a way to develop, it is to suggest a quick and easy way to develop engines/update data (and nothing about the core).

My ideas are not crystal clear, but if I sum up an example:

1. A user sees a requested engine (the "DB" can be a git repository, or at some point a [No]SQL database).
2. The user clicks "Edit".
3. A web version of wuzz allows setting the URL template and HTTP headers from a searx request.
4. The data extraction is done with something similar to https://jqplay.org/ (based on the results from wuzz). The user can check the output (right template, etc.) for different simulated user requests (if the json engine uses https://pypi.org/project/pyjq/).
5. The user validates and can see the generated change request (code, the configuration in settings.yml, etc.).
6. The result is saved to create a PR.
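
To make the "URL template plus data extraction" steps a bit more concrete, here is a minimal standard-library sketch of a declarative engine definition. Everything in it is hypothetical: the `EXAMPLE_ENGINE` keys, the example.org URL and the tiny dotted-path syntax are invented for illustration. The path syntax is a stand-in for a real jq expression, not pyjq and not how searx engines are written today.

```python
import json
from string import Template
from typing import Any, Dict, List
from urllib.parse import quote_plus

# Hypothetical declarative engine definition (all names are made up).
EXAMPLE_ENGINE = {
    "url_template": "https://example.org/api/search?q=$query&p=$page",
    "results_path": "hits.[]",
    "url_path": "link",
    "title_path": "title",
}


def build_url(url_template: str, query: str, page: int = 1) -> str:
    """Fill the URL template with the URL-encoded query and the page number."""
    return Template(url_template).substitute(query=quote_plus(query), page=page)


def extract(data: Any, path: str) -> List[Any]:
    """Follow a dotted path; the segment '[]' fans out over list items."""
    current = [data]
    for key in path.split("."):
        following = []
        for item in current:
            if key == "[]":
                if isinstance(item, list):
                    following.extend(item)
            elif isinstance(item, dict) and key in item:
                following.append(item[key])
        current = following
    return current


def parse_results(body: str, definition: Dict[str, str]) -> List[Dict[str, Any]]:
    """Turn a JSON response body into a list of {url, title} results."""
    results = []
    for item in extract(json.loads(body), definition["results_path"]):
        urls = extract(item, definition["url_path"])
        titles = extract(item, definition["title_path"])
        if urls:  # skip entries without a URL
            results.append({"url": urls[0], "title": titles[0] if titles else ""})
    return results
```

With such a definition, the web tool would only need to edit a handful of strings and replay `parse_results` on a captured response to preview the output, and the generated PR would mostly be a configuration change.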

A reviewer can use the same tool to check the PR, but once again the usual tools work too (git, make, etc...).

The purpose here is to allow contributions with just a browser (*). And for sure:

it will require more reviews (my wish is to allow more people to check them).
it is a huge task to develop something like this.

so, this idea is not for tomorrow.

(*) I say a browser because it is an easy way to have a rich UI, but a console UI is also a solution.

dalf added the data label Jul 8, 2020

return42 mentioned this issue Jul 15, 2020

reduce the number of external bangs #2045

Closed

dalf mentioned this issue Jul 28, 2020

feature request: private user blocklists/blacklists #2001

Open

This was referenced Oct 26, 2020

[mod] ahmia_filter.py: minor changes #2275

Merged

sci-hub.tw DOI resolver nxdomain #2260

Open

dalf mentioned this issue Jan 24, 2021

[enh] every Sunday, call utils/fetch_*.py scripts and create a PR automatically #2500

Merged

This was referenced Feb 19, 2021

[mod] update currencies.json and fetch_currencies.py #2585

Merged

Add some documentation about Github-Actions and dependabot #2599

Open
