d0b9705
to
c937140
Compare
My first test on fedora ..
sudo -H ./utils/lxc.sh build searx-fedora31
sudo -H ./utils/lxc.sh cmd searx-fedora31 ./utils/searx.sh install packages
sudo -H ./utils/lxc.sh cmd searx-fedora31 make clean pyenvinstall
sudo -H ./utils/lxc.sh cmd searx-fedora31 make searx.checker
looks good. I also tested searx-archlinux and searx-centos7.
The searx.checker itself seems to work on all platforms without any issue in the installation procedure (package additions in utils/searx.sh are OK).
On some platforms searx.checker reports more errors and raises more exceptions compared to my very first test on the desktop of my ubu2004.
There are issues like SearxEngineTooManyRequestsException and CAPTCHA which might be related to the mass testing. I will have a deeper look at what the reasons are .. coming soon.
Here is the branch with additional commits, I tested on: https://github.com/return42/searx/commits/checker
searx/search/checker/__main__.py
Outdated
searx.search.initialize()
broken_urls = []
for name, processor in iter_processor():
    if sys.stdout.isatty():
In the long term we should add a -v option to the command line. But first let's see how the usage of main() will evolve.
What would the verbose mode do?
The condition if sys.stdout.isatty(): prints "verbose" messages
print(BOLD_SEQ, 'Engine ', '%-30s' % name, RESET_SEQ, WHITE, ' Checking', RESET_SEQ)
only when the command is started with a tty as output, not when the output is piped to a file.
My suggestion was to add a "-v" flag instead of this tty condition. But this is not mandatory.
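A minimal sketch of how an explicit flag could sit next to the tty check, assuming argparse is used for the command line. The names `build_parser` and `should_print_progress` are hypothetical and not taken from the PR; the PR itself may wire this differently.

```python
import argparse
import sys


def build_parser():
    """Build a parser with a hypothetical -v flag (an assumption, not the PR's CLI)."""
    parser = argparse.ArgumentParser(description='Check searx engines (sketch).')
    parser.add_argument('engines', nargs='*', help='engine names to check')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print progress messages and detailed errors')
    return parser


def should_print_progress(verbose, stdout_is_tty):
    """Print progress when asked explicitly, or when writing to a terminal."""
    return verbose or stdout_is_tty


def main(argv=None):
    args = build_parser().parse_args(argv)
    for name in args.engines:
        if should_print_progress(args.verbose, sys.stdout.isatty()):
            print('Engine ', '%-30s' % name, ' Checking')
```

With this shape, `main(['-v', 'google'])` prints the progress line even when the output is piped to a file, while `main(['google'])` only prints it on a terminal.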
The new version:
- always write to stdout "Engine ... checking"
- write to stderr "Engine ... checking" too when sys.stdout.isatty()
See below #2419 (comment)
if __name__ == '__main__':
    main()
I added a Makefile target which uses python from ./local.
If you think it is useful, cherry-pick it from 190fa23
If there are different options / parameters, isn't a Makefile rather inconvenient?
python -m searx.search.checker duckduckgo "google images"
What about an entry point?
Lines 52 to 56 in 14a395a
entry_points={
    'console_scripts': [
        'searx-run = searx.webapp:run'
    ]
},
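As a sketch of the entry point idea, the existing `console_scripts` list could gain a second entry for the checker. The target `searx.search.checker.__main__:main` is an assumption based on the `main()` function reviewed above, and `parse_console_script` is a hypothetical helper that only shows how such a spec is read.

```python
# Assumption: the checker exposes main() in searx/search/checker/__main__.py.
ENTRY_POINTS = {
    'console_scripts': [
        'searx-run = searx.webapp:run',
        'searx-checker = searx.search.checker.__main__:main',
    ]
}


def parse_console_script(spec):
    """Split a 'name = module:function' spec into its three parts."""
    name, target = (part.strip() for part in spec.split('=', 1))
    module, function = target.split(':', 1)
    return name, module, function
```

Once installed, such an entry would let the checker be started as `searx-checker duckduckgo "google images"`, with options passed directly instead of through make variables.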
I have tested on ArchLinux as I maintain searx-git on the AUR. In my searx-git PKGBUILD, I changed:
to:
And then, I tested the PR.
I installed the dependency python-cld3-git. Output: https://gist.github.com/HLFH/7779f70d2b091b0066384c0d1e2e9f0e
94df0d1
to
6630893
Compare
To schedule the checker, there is a new module
If searx detects uwsgi, searx uses the uwsgi cache2:
But this requires adding this line to uwsgi.ini :
I think uwsgi.ini is ossified (there are too many ways to install searx), so if searx detects uwsgi but uwsgi.ini has not been updated, the checker is disabled.
Without uwsgi, searx uses
@HLFH thank you for the test! I hope the report is clear.
Use cases of the searx-checker command line:
$ make pyenvinstall
$ . ./local/py3/bin/activate
$ searx-checker google bing
Engine google Checking
Engine google OK
Engine bing Checking
Engine bing ErrorError ['paging: No result']
$ searx-checker -v google bing
Engine google Checking
Engine google OK
found languages: en es et fr lb
Engine bing Checking
Engine bing Error
found languages: en fr pt
simple : No result (query='computer' lang='all' pageno=1 safesearch=0 time_range=None)
paging : No result (query='news' lang='all' pageno=2 safesearch=0 time_range=None)
paging : No result (query='news' lang='all' pageno=3 safesearch=0 time_range=None)
paging : results are identitical for pageno=2 and pageno=3 (query='news' lang='all' safesearch=0 time_range=None)
$ searx-checker -v bing > result.txt
Engine bing Checking
$ cat result.txt
Engine bing Checking
Engine bing Error
found languages: en
simple : No result (query='life' lang='all' pageno=1 safesearch=0 time_range=None)
    paging : No result (query='news' lang='all' pageno=1 safesearch=0 time_range=None)
It is possible to list different engines using either their names or their shortcuts.
@return42 I don't know how to make this convenient using make.
Perhaps some engines require some adjustments, but I think the foundation is here to:
I now use a private searx instance. I tested on the command line.
And this new UI feature:
... this would be awesome! With this clearer engine report, thanks to a neat UI, it will be easier to submit and check engine issues. Shall I add the line
Does the command line searx-checker status exist? So I can check if it is enabled or disabled?
Yes! Without this line, the checker is disabled if you are using uwsgi. Also adjust the
Lines 100 to 102 in d347d24
The checker starts after a random delay once searx has started. By default this delay is between 5 and 30 minutes (
After that, by default the checker runs every
Notes:
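A small sketch of the random start delay described above. Only the 5 to 30 minute window comes from the comment; the constant names and `pick_start_delay` are hypothetical, and the repeat interval is left out because it is not stated here.

```python
import random

# Window taken from the comment above: first run 5 to 30 minutes after startup.
START_AFTER_MIN_SECONDS = 5 * 60
START_AFTER_MAX_SECONDS = 30 * 60


def pick_start_delay(rng=random):
    """Return a random delay in seconds inside the default start window."""
    return rng.randint(START_AFTER_MIN_SECONDS, START_AFTER_MAX_SECONDS)
```

Passing a seeded `random.Random` instance as `rng` makes the delay reproducible in tests.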
searx/shared/shared_abstract.py
Outdated
@@ -0,0 +1,15 @@
# SPDX-License-Identifier: AGPL-3.0-or-later

class SharedDict:
Why don't you use an Abstract Base Class here?
Fixed in this commit eca2004
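For readers unfamiliar with the suggestion, here is a rough sketch of what a SharedDict built on abc.ABC can look like. Only `get_str` appears in the code under review; `set_str` and the `InMemorySharedDict` backend are assumptions added for illustration.

```python
from abc import ABC, abstractmethod


class SharedDict(ABC):
    """Typed key-value store shared between processes (sketch)."""

    @abstractmethod
    def get_str(self, key):
        """Return the string stored under key, or None."""

    @abstractmethod
    def set_str(self, key, value):
        """Store a string under key."""


class InMemorySharedDict(SharedDict):
    """Hypothetical single-process backend, for illustration only."""

    def __init__(self):
        self._data = {}

    def get_str(self, key):
        return self._data.get(key)

    def set_str(self, key, value):
        self._data[key] = value
```

The benefit of the ABC is that a backend missing one of the methods fails at instantiation time with a TypeError, instead of failing later at the first call.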
searx/search/checker/background.py
Outdated
checker = Checker(processor)
checker.run()
if checker.test_results.succesfull:
    result[name] = {'status': True}
Nit: maybe instead of status we could name it success, so the boolean value makes more sense.
Fixed in this commit a7e9802
This also answers @HLFH's question:
Does the command line searx-checker status exist? So I can check if it is enabled or disabled?
If the checker is disabled, /stats/checker returns:
{
"status": "disabled",
"timestamp": 1610388000
}
- timestamp is rounded to the hour. Basically, here it shows when the instance was started. Is it an issue? It can be rounded to the day if necessary.
- kill -SIGUSR1 ... can still trigger the checker.
The status is:
- ok after an update.
- error if an exception stops the checker.
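The "disabled" payload and the rounding to the hour can be sketched as follows. The function names are hypothetical; only the JSON keys and the rounding behaviour come from the comment above.

```python
import time

SECONDS_PER_HOUR = 3600


def round_to_hour(timestamp):
    """Drop minutes and seconds from a Unix timestamp."""
    return int(timestamp // SECONDS_PER_HOUR) * SECONDS_PER_HOUR


def disabled_status(now=None):
    """Build the payload returned when the checker is disabled (sketch)."""
    if now is None:
        now = time.time()
    return {'status': 'disabled', 'timestamp': round_to_hour(now)}
```

Rounding to the day would only require replacing the constant with 86400.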
from .shared_uwsgi import UwsgiCacheSharedDict as SharedDict, schedule
logger.info('Use shared_uwsgi implementation')

storage = SharedDict()
Why is it a dict? What else are you planning to store in storage that you need to have in the same key-value storage?
SharedDict:
- encoding / decoding is simple: uwsgi deals with bytes, not Python objects.
- the caller declares what kind of data is expected (not in the Python spirit, I know, but here it is about cooperation between multiple processes, so I prefer something more declarative).
The alternative is to serialize the Python object using pickle or something equivalent, so that SharedDict can support storage[key] = value and storage[key].
What else are you planning to store in storage that you need to have in the same key-value storage?
Response times: #477 : here pickling each number is not very efficient.
Important note: the uwsgi implementation won't work with #2313
An alternative implementation could use an anonymous memory map, which also provides bytes (without a hash table).
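To illustrate the "typed accessors over bytes" argument, here is a sketch in which a plain dict of bytes stands in for the uwsgi cache. The class name, the `get_int` / `set_int` / `set_str` methods and the 8-byte signed integer encoding are all assumptions; only `get_str` is visible in the reviewed code.

```python
class BytesBackedDict:
    """Stand-in for a bytes-only store such as the uwsgi cache (assumption)."""

    def __init__(self):
        self._raw = {}  # maps str keys to bytes values

    def set_int(self, key, value):
        # Fixed-width big-endian encoding, no pickling involved.
        self._raw[key] = value.to_bytes(8, 'big', signed=True)

    def get_int(self, key):
        raw = self._raw.get(key)
        return None if raw is None else int.from_bytes(raw, 'big', signed=True)

    def set_str(self, key, value):
        self._raw[key] = value.encode('utf-8')

    def get_str(self, key):
        raw = self._raw.get(key)
        return None if raw is None else raw.decode('utf-8')
```

Because the caller picks `get_int` or `get_str`, the store never has to record or guess the type of a value, which is the declarative trade-off described above.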
searx/search/checker/background.py
Outdated
def get_result():
    serialized_result = storage.get_str('CHECKER_RESULT')
Nit: use the constant CHECKER_RESULT you have defined above in line 17.
Done in the commit a7e9802
See settings.yml for the options.
The SIGUSR1 signal starts the checker.
The result is available at /stats/checker
* output is unbuffered
* verbose mode describes the errors more precisely
the query "time" is convenient because most search engines will return some results, but some engines in the general category will return documentation about the HTML tags <time> or <input type="time">
for each engine: replace status by success
searx.shared.shared_abstract.SharedDict inherits from abc.ABC
searx.shared.shared_uwsgi.schedule can schedule multiple functions without issue
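One of the commits above lets the SIGUSR1 signal start the checker. A minimal sketch of that mechanism with the standard signal module follows; `run_checker` is a hypothetical placeholder that only records the call, and the PR's real handler may be structured differently.

```python
import signal

triggered = []  # records received signal numbers, for illustration


def run_checker(signum=None, frame=None):
    """Placeholder for the real checker run (hypothetical)."""
    triggered.append(signum)


def install_handler():
    """Register the handler; call this from the main thread on a POSIX system."""
    signal.signal(signal.SIGUSR1, run_checker)
```

After `install_handler()` has been called, sending `kill -SIGUSR1 <pid>` to the process invokes `run_checker`.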
If all the nice suggestions from the reviewers have been taken into account, this PR is good to be merged in my view.
Once merged, for the ArchLinux AUR searx-git package, I will:
- add the python-cld3-git dependency;
- remove the python-pyopenssl dependency.
The last thing I would like to check is why you got an
Most probably this code doesn't work as expected:
searx/searx/shared/__init__.py
Lines 19 to 25 in 7f0c508
If I don't have this line in
So it seems this line is mandatory to get things going.
[EDIT] When I remove
I CAN partially reproduce what you are describing: the checker runs despite the log.
@dalf Yes, the checker runs despite the log and without the recommended line, but with some various stuff shown in
I found the issue:
searx/searx/search/checker/background.py
Lines 83 to 86 in 7f0c508
The first call to
…gured
Before this commit, even with the scheduler disabled, the checker was running at least once for each uwsgi worker.
It should be fixed now: the checker is really disabled if uwsgi is not properly configured.
This time, I think it is ready.
New
Sorry, I haven't had the time to test this PR, but I would just like to know whether, as an instance maintainer, I have to do something so that searx.space is able to use searx-checker on my instance.
@unixfox for now, searx-stats2 still queries the API for the searx-checker project, but with a few changes it will be okay. So as an instance maintainer, make sure to have the checker section in your settings.yml, and as soon as searx-stats2 is updated, searx.space will show the results for your instance.
@unixfox Also, be sure to add
What does this PR do?
This PR:
has_lang)
An engine can define a new global variable tests.
This variable can be defined in the code (searx/engines/*.py or ) or in settings.yml.
Content example written in YAML:
- result_container call is a method name of searx.search.checker.ResultContainerTests.
- test call is a method name of searx.search.checker.CheckerTests.
There are default tests:
- searx.search.processors.abstract.EngineProcessor
  - get_tests returns the value defined in the engine (or in settings.yml)
  - get_default_tests method.
- searx.search.processors.online.OnlineProcessor.get_default_tests contains some generic tests.
- searx.search.processors.online_currency.OnlineCurrencyProcessor.get_default_tests contains one specific test.
- searx.search.processors.online_dictionary.OnlineDictionaryProcessor.get_default_tests contains one specific test.
For now, there is a dedicated main program:
python -u -m searx.search.checker
Output: https://gist.github.com/dalf/3457519d264384bfc7f234b7d8ec91b3 (slightly modified using sort)
It is possible to specify some engine names:
`python -u -m searx.search.checker "bing images" "google images"`
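The relationship between engine-defined tests and the processors' default tests can be sketched like this. These classes are simplified stand-ins, not the real searx processors; the fallback from `get_tests` to `get_default_tests` and the shape of the test dictionary are assumptions made for illustration.

```python
from types import SimpleNamespace


class EngineProcessor:
    """Simplified stand-in for searx.search.processors.abstract.EngineProcessor."""

    def __init__(self, engine):
        self.engine = engine

    def get_default_tests(self):
        # In this sketch the base class has no generic tests.
        return {}

    def get_tests(self):
        # Prefer tests declared by the engine (module variable or settings.yml).
        tests = getattr(self.engine, 'tests', None)
        if tests is None:
            return self.get_default_tests()
        return tests


class OnlineProcessor(EngineProcessor):
    """Simplified stand-in for the online processor."""

    def get_default_tests(self):
        # Hypothetical test definition; the real structure may differ.
        return {'simple': {'query': 'computer'}}


# An engine object without a `tests` attribute falls back to the defaults.
engine_without_tests = SimpleNamespace(name='example')
default_tests = OnlineProcessor(engine_without_tests).get_tests()
```

An engine that sets `tests` gets exactly its own definition back, so an engine declared only in settings.yml can carry its tests alongside the rest of its configuration.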
What is missing:
_is_url_image works as intended.
Why is this change important?
For reference #1559
Currently it is difficult to know which engine is broken.
#2332 helps to detect runtime errors, but, for example, it doesn't detect whether one of these features is broken: paging, time_range, safesearch.
This PR brings an easy way (?) to add tests, including for engines only defined in settings.yml.
How to test this PR locally?
python -u -m searx.search.checker
Author's checklist
Related issues
Close #1559