
Database is written as a whole to disk on each run --> SSD wear-out, especially SD cards on Raspberry Pi #690

Closed
JsBergbau opened this issue Jan 15, 2022 · 11 comments



JsBergbau commented Jan 15, 2022

Command
./urlwatch --urls sonstiges.url --cache sonstigesWatch.db

File size: sonstigesWatch.db is about 12 MB

In the Prometheus stats you can see the disk writes, marked as 1:

[screenshot: Prometheus disk-write graph with the writes marked]

Now I use another DB of about 35 MB; in the graph you can see its disk writes marked with number 2.

The job with the 12 MB DB runs every 5 minutes. That's 8760 hours per year * 60 minutes / 5 minutes * 12 MB = 1,261,440 MB, or about 1.2 TB written per year.
The job with the 35 MB DB runs every 30 minutes; that's about 600 GB written per year.
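
As a quick sanity check, the same back-of-the-envelope numbers in Python (same figures as above, nothing new):

minutes_per_year = 8760 * 60                     # 525,600 minutes per year
print(minutes_per_year / 5 * 12 / 1024 / 1024)   # ~1.2 TB per year for the 12 MB DB every 5 minutes
print(minutes_per_year / 30 * 35 / 1024)         # ~599 GB per year for the 35 MB DB every 30 minutes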

That is a lot of write load per year just because of urlwatch, and for the SD card of a Raspberry Pi it is a really huge workload.

The DB is even rewritten when no change is detected, because the timestamp of the last run is updated.

Since this is basically SQLite, I can't imagine there is no alternative to rewriting the whole file.

Update: This also affects urlwatch performance with large databases, since rewriting the database on every run slows down the urlwatch process.


JsBergbau commented Jan 15, 2022

I did some research.
If I understand https://www.sqlite.org/tempfiles.html#write_ahead_log_wal_files correctly, the SQLite database should be opened in WAL mode; then much less data should be written.

According to https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ it is sufficient to do:

# Set journal mode to WAL.
conn.execute('pragma journal_mode=wal')

Now the database should be in WAL mode.
I hope this can be implemented / checked soon.
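
Whether WAL mode actually took effect can be verified by querying the pragma without an argument; a minimal sketch (the filename is just my example database):

import sqlite3

conn = sqlite3.connect('sonstigesWatch.db')
conn.execute('pragma journal_mode=wal')
# returns the journal mode now in effect, e.g. ('wal',)
print(conn.execute('pragma journal_mode').fetchone())
conn.close()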

Update:
I checked this by editing minidb.py:

    def __init__(self, filename=':memory:', debug=False, smartupdate=False):
        self.db = sqlite3.connect(filename, check_same_thread=False)
        self.debug = debug
        self.smartupdate = smartupdate
        self.registered = {}
        self.lock = threading.RLock()
        self.db.execute('pragma journal_mode=wal')      # added: switch to WAL journaling
        self.db.execute('pragma temp_store = memory;')  # added: keep temporary tables in memory

WAL mode is working, as can be seen from the temporary file names, but still the same amount of data is written.
I've also tried the settings from https://phiresky.github.io/blog/2020/sqlite-performance-tuning/:

pragma temp_store = memory;
pragma mmap_size = 30000000000;

But that didn't help either.

So is this a general SQLite problem that can't be avoided?


JsBergbau commented Jan 16, 2022

Further investigation led to this test code:

#!/usr/bin/python3

import sqlite3
import random
import os
import time


filename = "example.db"
targetSizeMB = 20
numBytesPerEntry = 1024


con = sqlite3.connect(filename)
con.execute('pragma journal_mode=wal')
cur = con.cursor()
targetSizeBytes = targetSizeMB * 1024 * 1024

def createDB():
    print("Creating DB, this can take some time")
    # id is a PRIMARY KEY so that the INSERT OR REPLACE below really replaces an existing row
    cur.execute("CREATE TABLE test (id integer PRIMARY KEY, data text)")
    iterCount = int(targetSizeMB * 1024 * 1024 / numBytesPerEntry / 1.33)
    for i in range(0, iterCount):
        randomValue = ''.join(random.choice([chr(c) for c in range(ord('a'), ord('z'))]) for _ in range(numBytesPerEntry))
        cur.execute(f'INSERT INTO test VALUES({i},"{randomValue}")')
        if i % 1000 == 0:
            pct = round(i / iterCount * 100, 2)
            print(f"Progress {pct} %")


# Fill the database with random data on the first run only
dbSize = os.path.getsize(filename)
if dbSize < targetSizeBytes:
    createDB()

# Every run inserts one new entry with a random id well above the ids created above ...
randID = random.randint(10000000000, 100000000000)
randomValue = ''.join(random.choice([chr(c) for c in range(ord('a'), ord('z'))]) for _ in range(numBytesPerEntry))
cur.execute(f'INSERT INTO test VALUES({randID},"{randomValue}")')

# ... and replaces one fixed entry, so an already existing row is also updated
cur.execute(f'INSERT OR REPLACE INTO test VALUES (900000000000,"{randomValue}")')
con.commit()

wait = 10
print("Waiting", wait, "seconds to examine files")
time.sleep(wait)


con.close()

First an SQLite database is created and filled with random data.
Then, from the second run on, every execution adds a new entry and additionally updates an already existing entry, so it is basically the same as when urlwatch checks a website for updates.
With this test code the whole database is not rewritten. Having the line con.execute('pragma journal_mode=wal') does not make any difference.

So this is either a urlwatch problem or a minidb problem.

@JsBergbau

Further investigation:

cp sonstigesWatch.db sonstigesWatch.db.prev
./urlwatch --urls sonstiges.url --cache sonstigesWatch.db
cmp -l sonstigesWatch.db sonstigesWatch.db.prev | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}'

This prints each differing byte as offset, value in the new file, value in the old copy (all in hex):

0000001C 9B 9A
0000002C 30 2F
00000060 9B 9A

So there are really only minor changes in the DB, but the whole file is still rewritten.

As you can see, the WAL file has about the same size as the DB:

[screenshot: directory listing showing the WAL file at roughly the same size as the database]

Whereas with the example code the WAL file is only 16 KB:

[screenshot: directory listing showing a WAL file of only 16 KB for the example database]

So probably not only is the DB rewritten on disk, but the database is also refreshed completely internally. Fixing this bug will significantly improve urlwatch performance with larger databases.

@mborsetti

Hello @JsBergbau ,

I maintain a backward-compatible fork called webchanges that's optimized for web browsing here. That fork uses SQLite natively (instead of through the minidb package), and this may help you in your quest since you have direct access to the SQLite code.

The class you're looking for is here. I note that every time the database is closed there is a call to VACUUM; that may have something to do with the rewriting of the entire database, but I am no expert in SQLite.

I think the broader goal is to find a solution that strikes a balance between not writing the database as a whole to disk on each run and preventing the database from growing indefinitely. Suggestions welcome.

Feel free to open an issue on that project.


JsBergbau commented Jan 27, 2022

Hello @mborsetti

thank you very much for your answer.

By the way, what do you mean by "optimized for web browsing"?

The class you're looking for is here. I note that every time the database is closed there is a call to VACUUM; that may have something to do with the rewriting of the entire database, but I am no expert in SQLite.

I didn't expect it to be so easy, and I didn't expect that the database would be vacuumed on every write, so I didn't search for VACUUM. Yes, of course VACUUM rewrites the database.
https://www.sqlite.org/lang_vacuum.html

The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.

I've commented out the line https://github.com/thp/minidb/blob/0dbecfa68f34199ccae9cb9f4201dc35e2a3ec67/minidb.py#L171 and now the rewriting of the database is gone. Thank you so much. That also speeds up urlwatch a lot for me (because the SD card on the Raspberry Pi is quite slow).
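
For context, the linked line sits in minidb's close(); paraphrased (not the exact source), it does roughly the following, and commenting out the VACUUM line is what stops the rewrite:

def close(self):
    with self.lock:
        self.db.execute('VACUUM')  # rewrites the whole database file on every close
        self.db.close()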

I think the broader goal is to find a solution that strikes a balance between not writing the database as a whole to disk on each run and preventing the database from growing indefinitely. Suggestions welcome.

VACUUM is then needed if you delete entries and want to reclaim the space. Since urlwatch only appends data (except with the --gc-cache option), the database does not grow at all if there is no change. I've just tested this; even the modified date of the database stays the same. Just try it yourself.

So VACUUM only makes sense for the --gc-cache option. The code changes should be quite easy: add an option for VACUUM in minidb which is only called when --gc-cache is used.
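
A minimal sketch of what such an opt-in VACUUM could look like (class and method names here are illustrative only, not the actual minidb or urlwatch code):

import sqlite3
import threading

class Store:
    """Toy stand-in for minidb's SQLiteStore."""

    def __init__(self, filename):
        self.db = sqlite3.connect(filename)
        self.lock = threading.RLock()

    def vacuum(self):
        # Called explicitly, e.g. from the --gc-cache handling after old snapshots were deleted
        with self.lock:
            self.db.execute('VACUUM')

    def close(self):
        with self.lock:
            self.db.close()  # no unconditional VACUUM here anymore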

@mborsetti

Hello @JsBergbau,

By the way, what do you mean by "optimized for web browsing"?

Check out https://webchanges.readthedocs.io/en/stable/migration.html

The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.

Thanks for the quote and explanation.

VACUUM is then needed if you delete entries and want to reclaim the space. Since urlwatch only appends data (except with the --gc-cache option), the database does not grow at all if there is no change. I've just tested this; even the modified date of the database stays the same. Just try it yourself.

The database in urlwatch by default grows indefinitely because new snapshots are added and old ones are never deleted.

The one in webchanges does not grow indefinitely, as it only keeps the last 4 snapshots by default (adjustable, see here). What I am struggling with is that after deleting unneeded snapshots it feels like VACUUM needs to be called or the file continues to grow, because the deletion is only a flag and does not actually free space, or does it not?

@JsBergbau

The database in urlwatch by default grows indefinitely because new snapshots are added and old ones are never deleted.

Exactly, and that's why a VACUUM command does not gain any additional space; only when running with --gc-cache is a VACUUM needed.

What I am struggling with is that after deleting unneeded snapshots it feels like VACUUM needs to be called or the file continues to grow, because the deletion is only a flag and does not actually free space, or does it not?

In your case, when deleting old data, then yes, a VACUUM is needed to make the DB smaller. Deleting in SQLite is like a flag, just a logical delete; only with VACUUM is the data really removed. According to https://stackoverflow.com/questions/26812249/sqlite-vacuuming-fragmentation-and-performance-degradation the free space is reused, so even when keeping only the last 4 snapshots your database won't grow forever.
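
This can be observed with SQLite's freelist_count pragma; a small sketch on a throwaway file (numbers will vary):

import sqlite3

con = sqlite3.connect('freelist-test.db')
con.execute('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, data TEXT)')
con.executemany('INSERT INTO t (data) VALUES (?)', [('x' * 1024,)] * 1000)
con.commit()

con.execute('DELETE FROM t')  # logical delete: pages move to the freelist, file size stays the same
con.commit()
print(con.execute('pragma freelist_count').fetchone())  # > 0 free pages

con.executemany('INSERT INTO t (data) VALUES (?)', [('y' * 1024,)] * 1000)
con.commit()
print(con.execute('pragma freelist_count').fetchone())  # lower again: free pages were reused
con.close()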

@mborsetti

In your case, when deleting old data, then yes, a VACUUM is needed to make the DB smaller. Deleting in SQLite is like a flag, just a logical delete; only with VACUUM is the data really removed. According to https://stackoverflow.com/questions/26812249/sqlite-vacuuming-fragmentation-and-performance-degradation the free space is reused, so even when keeping only the last 4 snapshots your database won't grow forever.

Thank you so much for your expertise and research. I will remove (in webchanges) the VACUUM that is run every single time (copied from minidb) and only use it with --gc-cache and --clean-cache.

Appreciate the help very much!

@JsBergbau

Thank you for pointing me to the right place.
Now let's hope this is also fixed in urlwatch soon.

thp closed this as completed in e945a6f on Feb 17, 2022

thp commented Feb 17, 2022

This should be fixed in the master branch now (the mentioned commit + cfee541). VACUUM will still be executed with --gc-cache (where one wants to shrink the DB), but not by default in normal runs.

@JsBergbau

Thanks for fixing.
