This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

[enh] add checker #2419

Merged
merged 12 commits into from
Jan 13, 2021

Conversation

Contributor

@dalf dalf commented Dec 24, 2020

What does this PR do?

This PR:


An engine can define a new global variable tests.
This variable can be defined in the code ( searx/engines/*.py or ) or in settings.yml.

Example content, written in YAML:

- matrix:
    # 3 requests search for "test" in english, pages 1, 2 and 3.
    query: test
    pageno: [1, 2, 3]
    lang: en
- result_container:
    # check each page is not empty
    - not_empty
    # check the results are in english
    - [ "has_lang", "en" ]
- test:
    # check each result is unique (page 1 is not page 2)
    - unique_results
  • each entry in result_container is the name of a method of searx.search.checker.ResultContainerTests, optionally followed by its arguments.
  • each entry in test is the name of a method of searx.search.checker.CheckerTests.
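To illustrate how a matrix like the one above turns into three requests, here is a minimal sketch. It assumes that list values are combined as a cartesian product and that scalar values are held constant; expand_matrix is a hypothetical helper name, not a function taken from this PR.

```python
import itertools


def expand_matrix(matrix):
    """Yield one dict of search parameters per combination in the matrix.

    Hypothetical helper: the real checker may expand the matrix differently.
    """
    keys = list(matrix)
    # Treat scalars as single-element lists so every key can be iterated.
    values = [v if isinstance(v, list) else [v] for v in matrix.values()]
    for combination in itertools.product(*values):
        yield dict(zip(keys, combination))


matrix = {"query": "test", "pageno": [1, 2, 3], "lang": "en"}
requests = list(expand_matrix(matrix))
for params in requests:
    print(params)
```

With the example matrix, only pageno is a list, so the sketch produces one request per page.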

There are default tests:

  • see searx.search.processors.abstract.EngineProcessor
    • the get_tests method returns the value defined in the engine (or in settings.yml)
    • if there is no value, the get_default_tests method returns some default tests.
  • searx.search.processors.online.OnlineProcessor.get_default_tests contains some generic tests.
    • The tests vary according to what the engine supports. For example, there is a paging test only when the engine supports paging.
    • the 'rosebud' query idea comes from https://searx.neocities.org/changelog.html
  • searx.search.processors.online_currency.OnlineCurrencyProcessor.get_default_tests contains one specific test.
  • searx.search.processors.online_dictionary.OnlineDictionaryProcessor.get_default_tests contains one specific test.
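The lookup order described above can be sketched as follows. Only the method names get_tests and get_default_tests come from the PR; the class name, the constructor arguments, and the contents of the test dictionaries are assumptions made for illustration.

```python
class EngineProcessorSketch:
    """Hypothetical stand-in for EngineProcessor; only the lookup order is shown."""

    def __init__(self, engine_tests=None, supports_paging=False):
        # engine_tests stands for the `tests` value from the engine or settings.yml.
        self.engine_tests = engine_tests
        self.supports_paging = supports_paging

    def get_default_tests(self):
        # Illustrative defaults; the real default tests are defined in the PR.
        tests = {
            "simple": {
                "matrix": {"query": "rosebud"},
                "result_container": ["not_empty"],
            }
        }
        if self.supports_paging:
            # A paging test is only added when the engine supports paging.
            tests["paging"] = {
                "matrix": {"query": "rosebud", "pageno": [1, 2, 3]},
                "result_container": ["not_empty"],
                "test": ["unique_results"],
            }
        return tests

    def get_tests(self):
        # Tests defined for the engine win; otherwise fall back to the defaults.
        if self.engine_tests is not None:
            return self.engine_tests
        return self.get_default_tests()
```

An engine with its own tests never sees the defaults, which keeps custom definitions in settings.yml authoritative.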

For now, there is a dedicated main program: python -u -m searx.search.checker

Output: https://gist.github.com/dalf/3457519d264384bfc7f234b7d8ec91b3 (slightly modified using sort)

It is possible to specify one or more engine names:
python -u -m searx.search.checker "bing images" "google images"


What is missing:

  • fix CI
  • check install with arch
  • check install with fedora
  • check install with centos
  • scheduling & API (scheduling might be clunky as long searx relies on uwsgi)
  • proper tool to help development
  • infobox checks
  • make sure _is_url_image works as intended.
  • adjust the tests for different engines.

Why is this change important?

For reference #1559

Currently it is difficult to know which engine is broken.

#2332 helps to detect runtime errors, but it doesn't detect, for example, whether one of these features is broken: paging, time_range, safesearch.

This PR brings an easy way (?) to add tests, including for engines that are only defined in settings.yml.

How to test this PR locally?

  • For now python -u -m searx.search.checker

Author's checklist

Related issues

Close #1559

@dalf dalf force-pushed the checker branch 3 times, most recently from d0b9705 to c937140 Compare December 29, 2020 10:10
Contributor

@return42 return42 left a comment

My first test on fedora ..

sudo -H ./utils/lxc.sh build searx-fedora31
sudo -H ./utils/lxc.sh cmd searx-fedora31 ./utils/searx.sh install packages
sudo -H ./utils/lxc.sh cmd searx-fedora31 make clean pyenvinstall
sudo -H ./utils/lxc.sh cmd searx-fedora31 make searx.checker

looks good. I also tested searx-archlinux and searx-centos7

The searx.checker itself seems to work on all platforms without any issue in the installation procedure (package additions in utils/searx.sh are OK).

On some platforms searx.checker reports more errors and raises more exceptions compared to my very first test on the desktop of my ubu2004.

There are issues like SearxEngineTooManyRequestsException and CAPTCHA which might be related to the mass testing. I will take a deeper look at the reasons .. coming soon.

Here is the branch with additional commits, I tested on: https://github.com/return42/searx/commits/checker

searx.search.initialize()
broken_urls = []
for name, processor in iter_processor():
    if sys.stdout.isatty():
Contributor

In the long term we should add a -v option to the command line. But first let's see how the usage of main() evolves.

Contributor Author

What would the verbose mode do?

Contributor

The condition if sys.stdout.isatty(): prints "verbose" messages

 print(BOLD_SEQ, 'Engine ', '%-30s' % name, RESET_SEQ, WHITE, ' Checking', RESET_SEQ)

only when the command is started with a tty output, not when piping the output to a file.

My suggestion was to add a "-v" flag instead of this tty condition. But it is not mandatory.

Contributor Author

The new version:

  • always write "Engine ... checking" to stdout
  • also write "Engine ... checking" to stderr when stdout is not a terminal (sys.stdout.isatty() is false), so the progress stays visible when stdout is redirected

See below #2419 (comment)



if __name__ == '__main__':
    main()
Contributor

I added a makefile target which uses python from ./local
If you think it has its value, cherry pick from 190fa23

Contributor Author

If there are different options / parameters, isn't a Makefile target inconvenient?

python -m searx.search.checker duckduckgo "google images"

What about an entrypoint ?

searx/setup.py

Lines 52 to 56 in 14a395a

entry_points={
    'console_scripts': [
        'searx-run = searx.webapp:run'
    ]
},

Collaborator

HLFH commented Jan 5, 2021

I have tested on ArchLinux as I maintain searx-git on the AUR.

In my searx-git PKGBUILD, I changed:

source=(git+https://github.com/asciimoo/searx

to:

source=(git+https://github.com/dalf/searx#branch=checker

And then, I tested the PR.

python -u -m searx.search.checker
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.9/runpy.py", line 147, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/usr/lib/python3.9/site-packages/searx/search/checker/__init__.py", line 1, in <module>
    from .impl import Checker
  File "/usr/lib/python3.9/site-packages/searx/search/checker/impl.py", line 9, in <module>
    import cld3
ModuleNotFoundError: No module named 'cld3'

I installed the dependency python-cld3-git.

Output: https://gist.github.com/HLFH/7779f70d2b091b0066384c0d1e2e9f0e

@dalf dalf force-pushed the checker branch 4 times, most recently from 94df0d1 to 6630893 Compare January 8, 2021 11:12
Contributor Author

dalf commented Jan 8, 2021

To schedule the checker, there is a new module, searx.shared:

  • searx.shared.storage gets and sets int and str values.
  • searx.shared.schedule(delay, func, *args) schedules a call to func(*args) every delay seconds.

If searx detects uwsgi, searx uses the uwsgi cache2.

This requires adding the following line to uwsgi.ini:
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1

I think the uwsgi.ini is ossified (there are too many ways to install searx), so if searx detects uwsgi but uwsgi.ini is not updated, the checker is disabled.

Without uwsgi, searx uses threading.Timer(delay, wrapper).
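As an illustration of the non-uwsgi case, here is a minimal sketch of a periodic schedule(delay, func, *args) built on threading.Timer. The signature and the name wrapper come from the comment above; the body, including the daemon flag and the re-arming in a finally block, is an assumption and not the code from this PR.

```python
import threading


def schedule(delay, func, *args):
    """Call func(*args) every `delay` seconds (sketch, not the PR's code)."""

    def wrapper():
        try:
            func(*args)
        finally:
            # Re-arm the timer after each run, even if func raised.
            timer = threading.Timer(delay, wrapper)
            timer.daemon = True
            timer.start()

    timer = threading.Timer(delay, wrapper)
    # Daemon timers do not keep the interpreter alive at shutdown.
    timer.daemon = True
    timer.start()
```

A threading.Timer fires only once, so a periodic schedule has to create a new timer after each call. Because the timers are daemon threads, a short-lived script must stay alive long enough for them to fire.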

Contributor Author

dalf commented Jan 8, 2021

@HLFH thank you for the test !

I hope the report is clear.

Contributor Author

dalf commented Jan 8, 2021

Usage cases of searx-checker command line:

$ make pyenvinstall
$ . ./local/py3/bin/activate

$ searx-checker google bing
Engine google                        Checking
Engine google                        OK
Engine bing                          Checking
Engine bing                          ErrorError ['paging: No result']

$ searx-checker -v google bing
Engine google                        Checking
Engine google                        OK
    found languages: en es et fr lb
Engine bing                          Checking
Engine bing                          Error
    found languages: en fr pt
    simple         : No result (query='computer' lang='all' pageno=1 safesearch=0 time_range=None)
    paging         : No result (query='news' lang='all' pageno=2 safesearch=0 time_range=None)
    paging         : No result (query='news' lang='all' pageno=3 safesearch=0 time_range=None)
    paging         : results are identitical for pageno=2 and pageno=3 (query='news' lang='all' safesearch=0 time_range=None)


$ searx-checker -v bing > result.txt
Engine bing                          Checking

$ cat result.txt
Engine bing                          Checking
Engine bing                          Error
    found languages: en
    simple         : No result (query='life' lang='all' pageno=1 safesearch=0 time_range=None)
    paging         : No result (query='news' lang='all' pageno=1 safesearch=0 time_range=None)

Engines can be listed by name or by shortcut.
An empty list means all the engines.
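The selection rule can be sketched like this. The function select_engines and the name-to-shortcut mapping are hypothetical and only illustrate the rule; the real command resolves engines from the loaded searx configuration.

```python
def select_engines(requested, engines):
    """Resolve names or shortcuts to engine names.

    `engines` maps engine name -> shortcut (hypothetical data layout).
    An empty request selects every engine.
    """
    if not requested:
        return list(engines)
    by_shortcut = {shortcut: name for name, shortcut in engines.items()}
    selected = []
    for item in requested:
        if item in engines:
            selected.append(item)
        elif item in by_shortcut:
            selected.append(by_shortcut[item])
        else:
            raise ValueError("unknown engine: %r" % item)
    return selected


# Illustrative mapping only.
ENGINES = {"google": "go", "bing": "bi", "duckduckgo": "ddg"}
print(select_engines(["bi", "google"], ENGINES))
print(select_engines([], ENGINES))
```

Raising on an unknown name is a design choice of this sketch; silently skipping unknown engines would hide typos on the command line.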

@return42 I don't know how to make this convenient with make.

@dalf dalf marked this pull request as ready for review January 8, 2021 18:31
Contributor Author

dalf commented Jan 8, 2021

Perhaps some engines require adjustments, but I think the foundation is here to:

  • run the checker automatically (the result is available at /stats/checker).
    • Later searx-stats2 can fetch this URL (and /stats/errors).
    • Later it will be possible to display the health status of the engines in the preferences.
  • check / add the tests during the development phase of an engine.
  • quickly see which engines don't work.

Collaborator

HLFH commented Jan 8, 2021

I now use a searx private instance.

I tested in command line.

➜ sites-enabled searx-checker google bing
Engine google Checking
Engine google OK
Engine bing Checking
Engine bing ErrorError ['simple: No result', 'paging: No result', "paging: results are identitical for pageno=2 and pageno=3 (query='news' lang='all' safesearch=0 time_range=None)"]

And this new UI feature:

Display the health status of the engines in the preferences

... this would be awesome! A clearer engine report in a neat UI would make it easier to submit and check on engine issues.

Shall I add the line cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1 to my /etc/uwsgi/vassals/searx.ini? And do systemctl restart uwsgi@searx.service?
Or after like 10 mins without editing the file, my results on "/searx/stats/checker" will be generated instead of having "null"?
Do the results on "/searx/stats/checker" show the entry "checked_at" as well?

I think the uwsgi.ini is ossified (there are too many way to install searx), so if searx detect uwsgi but uwsgi.ini is not updated, the checker is disabled.

Does the command line searx-checker status exist? So I can check if it is enabled or disabled?

Contributor Author

dalf commented Jan 8, 2021

Shall I add the line cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1 to my /etc/uwsgi/vassals/searx.ini? And do systemctl restart uwsgi@searx.service?

Yes ! Without this line, the checker is disabled if you are using uwsgi.

Also adjust the checker section in settings.yml to your needs:

searx/searx/settings.yml

Lines 100 to 102 in d347d24

scheduling:
    start_after: [300, 1800] # delay to start the first run of the checker
    every: [86400, 90000] # how often the checker runs

The checker starts after a random delay once searx has started. By default this delay is between 300 and 1800 seconds (5 and 30 minutes).

After that, by default the checker runs about every 86400 seconds (one day).
More precisely, it runs every 86400 seconds, and for each run it waits an additional random delay between 0 and 90000 - 86400 = 3600 seconds (for the default configuration).
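The delay arithmetic can be sketched as below. The default ranges are the ones from settings.yml quoted above; the function names and the use of random.randint are assumptions, so the real implementation may draw its random values differently.

```python
import random


def first_run_delay(start_after=(300, 1800)):
    """Seconds to wait after startup before the first checker run."""
    return random.randint(start_after[0], start_after[1])


def next_run_delay(every=(86400, 90000)):
    """Fixed period plus a random extra delay in [0, every[1] - every[0]]."""
    return every[0] + random.randint(0, every[1] - every[0])


print(first_run_delay())  # between 300 and 1800
print(next_run_delay())   # between 86400 and 90000
```

The random component spreads the runs out, so that several workers or instances started at the same time do not all query the engines at the same moment.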

Notes:

  • if there is no checker.scheduling section in settings.yml, then the checker is disabled.
  • you can start the checker manually by sending a SIGUSR1 signal to any of the uwsgi processes.
  • if you are running searx in debug mode, the checker is disabled by default:
    off_when_debug: True

@@ -0,0 +1,15 @@
# SPDX-License-Identifier: AGPL-3.0-or-later

class SharedDict:
Member

Why don't you use an Abstract Base Class here?

Contributor Author

Fixed in this commit eca2004

checker = Checker(processor)
checker.run()
if checker.test_results.succesfull:
    result[name] = {'status': True}
Member

Nit: maybe instead of status we could name it success, so the boolean value makes more sense.

Contributor Author

Fixed in this commit a7e9802

This also answers @HLFH's question:

Does the command line searx-checker status exist? So I can check if it is enabled or disabled?

If the checker is disabled, /stats/checker returns:

{
  "status": "disabled",
  "timestamp": 1610388000
}
  • timestamp is rounded to the hour. Basically, here it shows when the instance was started. Is that an issue? It can be rounded to the day if necessary.
  • kill -SIGUSR1 ... can still trigger the checker.

The status is:

  • ok after an update.
  • error if an exception stops the checker.
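For the rounded timestamp, a minimal sketch follows. It assumes the value is rounded down to the start of the hour; the helper names round_to_hour and disabled_status are hypothetical, and the input timestamp is made up so that it falls in the hour shown in the example response.

```python
import json


def round_to_hour(timestamp):
    """Round a Unix timestamp down to the start of its hour."""
    return int(timestamp) // 3600 * 3600


def disabled_status(started_at):
    """Build a response shaped like the /stats/checker example above."""
    return {"status": "disabled", "timestamp": round_to_hour(started_at)}


print(json.dumps(disabled_status(1610389234), indent=2))
```

Rounding hides the exact start time of the instance while still giving a rough idea of when the value was produced.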

from .shared_uwsgi import UwsgiCacheSharedDict as SharedDict, schedule
logger.info('Use shared_uwsgi implementation')

storage = SharedDict()
Member

Why is it a dict? What else are you planning to store in storage that you need to have in the same key-value storage?

Contributor Author

  • SharedDict encoding / decoding is simple: uwsgi deals with bytes, not Python objects.
  • the caller declares what kind of data is expected (not in the Python spirit, I know, but here it is about cooperation between multiple processes, so I prefer something more declarative)

The alternative is to serialize the Python object using pickle or something equivalent, so that SharedDict supports storage[key] = value and storage[key].

What else are you planning to store in storage that you need to have in the same key-value storage?

Response times: #477 : here pickling each number is not very efficient.

Important note: the uwsgi implementation won't work with #2313

An alternative implementation can use anonymous memory map which also provides bytes (without a hash table).
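To make the "declarative" typed interface concrete, here is a minimal sketch of an abstract SharedDict with a bytes-backed implementation. Only SharedDict, abc.ABC, and get_str appear in this thread; get_int, set_int, set_str, the class BytesBackedSharedDict, and the 8-byte integer encoding are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class SharedDict(ABC):
    """Typed key-value interface: the caller states which type it expects."""

    @abstractmethod
    def get_int(self, key): ...

    @abstractmethod
    def set_int(self, key, value): ...

    @abstractmethod
    def get_str(self, key): ...

    @abstractmethod
    def set_str(self, key, value): ...


class BytesBackedSharedDict(SharedDict):
    """Stores every value as bytes, as a byte-oriented cache would."""

    def __init__(self):
        self._data = {}

    def get_int(self, key):
        raw = self._data.get(key)
        return None if raw is None else int.from_bytes(raw, "big", signed=True)

    def set_int(self, key, value):
        # Fixed 8-byte signed encoding (an assumption of this sketch).
        self._data[key] = value.to_bytes(8, "big", signed=True)

    def get_str(self, key):
        raw = self._data.get(key)
        return None if raw is None else raw.decode("utf-8")

    def set_str(self, key, value):
        self._data[key] = value.encode("utf-8")


storage = BytesBackedSharedDict()
storage.set_str("CHECKER_RESULT", '{"status": "ok"}')
storage.set_int("run_count", 3)
print(storage.get_str("CHECKER_RESULT"), storage.get_int("run_count"))
```

Because each value has a known type, no pickling is needed, which matters when many small numbers such as response times are stored.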



def get_result():
    serialized_result = storage.get_str('CHECKER_RESULT')
Member

Nit: use the constant CHECKER_RESULT you have defined above in line 17.

Contributor Author

Done in the commit a7e9802

See settings.yml for the options
SIGUSR1 signal starts the checker.
The result is available at /stats/checker
* output is unbuffered
* verbose mode describes the errors more precisely
the query "time" is convenient because most search engines will return some results,
but some engines in the general category will return documentation about the HTML tags <time> or <input type="time">
for each engine: replace status by success
searx.shared.shared_abstract.SharedDict inherits from abc.ABC
searx.shared.shared_uwsgi.schedule can schedule multiple functions without issue
HLFH previously approved these changes Jan 12, 2021
Collaborator

@HLFH HLFH left a comment

If all the nice suggestions from the reviewers have been taken into account, this PR is good to be merged in my view.

Once merged, for the ArchLinux AUR searx-git package, I will:

  • add the python-cld3-git dependency ;
  • remove the python-pyopenssl dependency.

Contributor Author

dalf commented Jan 12, 2021

The last thing I would like to check is why you got error instead of unknown.

Most probably this code doesn't work as expected:

# uwsgi.ini configuration problem: disable all scheduling
logger.error('uwsgi.ini configuration error, add this line to your uwsgi.ini\n'
             'cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1')
from .shared_simple import SimpleSharedDict as SharedDict

def schedule(delay, func, *args):
    pass

Collaborator

HLFH commented Jan 13, 2021

The last thing I would like to check is why you got error instead of unknown.

Most probably this code doesn't work as expected:

# uwsgi.ini configuration problem: disable all scheduling
logger.error('uwsgi.ini configuration error, add this line to your uwsgi.ini\n'
             'cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1')
from .shared_simple import SimpleSharedDict as SharedDict

def schedule(delay, func, *args):
    pass

If I don't have this line in /etc/uwsgi/vassals/searx.ini, when I am quickly refreshing the stats/checker web page, I get:

  • sometimes error ;
  • sometimes unknown ;
  • sometimes the result itself ;
  • always (I think - or sometimes) the error log you mentioned.

So it seems it is mandatory to get this line to get things going.
If this requirement is documented properly in the searx doc, I'd say there is no problem.

Contributor Author

dalf commented Jan 13, 2021

[EDIT]

When I remove cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1 from uwsgi.ini, I get this log:

*** Operational MODE: preforking ***
added /usr/local/searx/ to pythonpath.
spawned uWSGI master process (pid: 7)
spawned uWSGI worker 1 (pid: 15, cores: 1)
spawned uWSGI worker 2 (pid: 16, cores: 1)
spawned uWSGI worker 3 (pid: 19, cores: 1)
spawned uWSGI worker 4 (pid: 24, cores: 1)
spawned 4 offload threads for uWSGI worker 1
spawned 4 offload threads for uWSGI worker 2
spawned 4 offload threads for uWSGI worker 4
spawned 4 offload threads for uWSGI worker 3
ERROR:searx.shared:uwsgi.ini configuration error, add this line to your uwsgi.ini
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1
ERROR:searx.shared:uwsgi.ini configuration error, add this line to your uwsgi.ini
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1
ERROR:searx.shared:uwsgi.ini configuration error, add this line to your uwsgi.ini
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1
ERROR:searx.shared:uwsgi.ini configuration error, add this line to your uwsgi.ini
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x55756b2761c0 pid: 15 (default app)
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x55756b2761c0 pid: 16 (default app)
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x55756b2761c0 pid: 24 (default app)
WSGI app 0 (mountpoint='') ready in 6 seconds on interpreter 0x55756b2761c0 pid: 19 (default app)

I can partially reproduce what you are describing: the checker runs despite the log.

Collaborator

HLFH commented Jan 13, 2021

@dalf Yes, the checker runs despite the log and without the recommended line, but with some various stuff shown in stats/checker: error/unknown/or the result itself. Shall this weird behaviour (without the line) be fixed somehow?

Contributor Author

dalf commented Jan 13, 2021

I found the issue:

def _start_scheduling():
    every = _get_every()
    schedule(every[0], _run_with_delay)
    run()

The first call to run() doesn't use schedule.

…gured

Before this commit, even with the scheduler disabled, the checker was running
at least once for each uwsgi worker.
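The problem can be reproduced in isolation with the sketch below. It is not the code from the fixing commit: the function names are hypothetical, and the guarded variant assumes a schedule function that returns False when scheduling is disabled, which is only one possible way to fix it.

```python
runs = []


def run():
    # Stand-in for the checker run.
    runs.append("run")


def schedule_disabled(delay, func, *args):
    # No-op scheduler, like the fallback used when uwsgi.ini lacks cache2.
    # Returning False is an assumption added for the guarded variant.
    return False


def start_scheduling_buggy(schedule):
    schedule(86400, run)
    run()  # direct call: executes even when the scheduler is a no-op


def start_scheduling_guarded(schedule):
    # Only run immediately when scheduling is actually enabled.
    if schedule(86400, run):
        run()


start_scheduling_buggy(schedule_disabled)
print(len(runs))  # the checker ran once although scheduling is disabled
```

With a no-op scheduler the buggy variant still runs the checker once per caller, which matches the "at least once for each uwsgi worker" behaviour described in the commit message.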
Contributor Author

dalf commented Jan 13, 2021

It should be fixed now: the checker is really disabled if uwsgi is not properly configured.

This time, I think it is ready.

Collaborator

@HLFH HLFH left a comment

The issue is fixed.


@dalf dalf merged commit 484dc99 into searx:master Jan 13, 2021
Collaborator

HLFH commented Jan 13, 2021

New searx-git package released on ArchLinux AUR with updated deps: https://aur.archlinux.org/packages/searx-git/.

Member

unixfox commented Jan 13, 2021

Sorry I haven't had the time to test this PR, but I just would like to know if as an instance maintainer I've to do something so that searx.space is able to use searx-checker on my instance.

Contributor Author

dalf commented Jan 13, 2021

@unixfox for now, searx-stats2 still queries the API of the searx-checker project, but with a few changes it will be okay.

So as an instance maintainer, make sure to have the checker section in your settings.yml, and as soon as searx-stats2 is updated, searx.space will show the results for your instance.

@unixfox Also, be sure to add
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1
to your uwsgi.ini

Successfully merging this pull request may close these issues.

Embedded searx-checker
5 participants