Replies: 6 comments 6 replies
-
Update https://gist.github.com/dalf/b3728182ef69f855c9103db6e31f38ed With indexes, SQLite is actually only 10 to 20 times slower than a memory access, and with a small cache it can be similar. Of course, it does not solve the statistics issue (sharing the stats between the workers). ping @return42 [EDIT] Anyway, if memory is the issue, the translations load about 60 MB in RAM (checked
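The "small cache" idea above can be sketched like this: a read-only SQLite table with an indexed key, fronted by `functools.lru_cache` so repeated lookups never hit the database. The table name and schema here are made up for illustration, not SearXNG's actual data layout.

```python
import sqlite3
from functools import lru_cache

# Illustrative in-memory database; in practice this would be a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO data VALUES ('q', 'example')")
conn.commit()

@lru_cache(maxsize=1024)
def lookup(key):
    # PRIMARY KEY implies an index, so this is an indexed B-tree lookup;
    # the lru_cache absorbs the SQLite overhead for hot keys.
    row = conn.execute("SELECT value FROM data WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

With a cache hit the cost is a plain dict lookup; only cold keys pay the ~300 ns SQLite access mentioned later in the thread.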
-
I think access to individual records is not quite as critical, since we need those records relatively infrequently (we will rarely do a SQL SELECT on these tables). I would be more concerned that, with an SQL solution, initializing the SQL databases (connect and first query) takes more time than loading a Python module with a big dictionary in it. Do you have any experience with this?
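That startup concern is easy to measure. This is a rough, self-contained sketch comparing the two costs: deserializing a big dict (a stand-in for importing a module-level dictionary) vs. connecting to an SQLite file and running a first query. Sizes and schema are invented for the benchmark.

```python
import os
import pickle
import sqlite3
import tempfile
import time

# Build a throwaway database with 50k rows.
data = {str(i): str(i * 2) for i in range(50_000)}
path = os.path.join(tempfile.mkdtemp(), "data.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)", data.items())
conn.commit()
conn.close()

# Cost 1: load the whole dict into memory (stand-in for a module import).
t0 = time.perf_counter()
loaded = pickle.loads(pickle.dumps(data))
load_dict = time.perf_counter() - t0

# Cost 2: connect to SQLite and run the first query (nothing is preloaded).
t0 = time.perf_counter()
conn = sqlite3.connect(path)
first = conn.execute("SELECT value FROM kv WHERE key = ?", ("42",)).fetchone()
connect_and_query = time.perf_counter() - t0

print(f"dict load: {load_dict:.4f}s  sqlite connect+query: {connect_and_query:.4f}s")
```

Connecting to SQLite is lazy (the file is not read in full), so the connect-and-query path would typically be faster at startup; the dict pays the full load cost up front in exchange for faster individual lookups.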
-
Actually, the line that allocates the memory is Line 157 in 11c0651. The purpose of this code is to do that:
So, one idea is to store these values in
-
Wow, that's a lot 👍 .. may we have a chance to remove that monkey patching of flask_babel? Line 134 in 11c0651 From the .. Lines 37 to 41 in 11c0651 we can at least remove
-
This is unrelated to my comment. I can't say if this is possible. I've just focused on
-
The "issue" with SQLite is readability: it requires an external program to read the content, compared to JSON. We can keep an SQL dump in the repository: it won't be used, but it makes the PRs easy to compare. SQLite supports multiple tables per database; in the case of SearXNG it can make sense to create one database file per data type, so we keep the update workflow.
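The "keep an SQL dump in the repository" idea maps directly onto the standard library: `sqlite3.Connection.iterdump()` yields the `CREATE TABLE` / `INSERT` statements as plain text, which diffs cleanly in a PR. The table name and row below are hypothetical examples, not SearXNG's real schema.

```python
import sqlite3

# Build a small example database (stand-in for one data-type database file).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE engine_descriptions (name TEXT PRIMARY KEY, description TEXT)"
)
conn.execute(
    "INSERT INTO engine_descriptions VALUES ('wikipedia', 'free encyclopedia')"
)
conn.commit()

# iterdump() yields SQL statements reproducing the database; writing this
# string to a .sql file next to the binary .sqlite keeps PR diffs readable.
dump = "\n".join(conn.iterdump())
```

The binary database stays the artifact that ships; the text dump exists only so reviewers can see what changed between two versions.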
-
For reference:
SQLite is available in the standard library of all supported Python versions, and multiple workers can access the same read-only database. It would decrease the memory footprint by ~10 MB per worker according to tracemalloc. The drawback is the access time: about 300 nanoseconds instead of a few nanoseconds. A small cache can help, and it would not be a problem for engine_descriptions.json for example. external_bangs.json might be different, since there are a lot of accesses and they need to be fast.