double-linked SQLite libs cause locking/corruption issues #1467
Comments
At this point, we have a few paths forward available:
A) Consolidate onto a single copy of SQLite. This has three sub-options:
A1) Use TF's copy of SQLite. This is a non-starter; we are aiming to eliminate the TF dependency anyway in the long term, and using the TF copy would require adding to TF either our own python sqlite wrapper or an entire existing one like pysqlite.
A2) Use python's copy of SQLite. The primary concern here is performance - it would require routing the DB import logic through python, and deserializing all the event protos into their constituent summary elements in python. We do this already today, but it's slow. Using the C++ import_event() op lets us avoid going into python for this. The secondary concern is duplication, since we have eventfile -> sqlite conversion logic in summary_db_writer.cc already, and this would all need to be reimplemented in python, and we would need to keep them in sync if we want to continue supporting the DB writer summary ops in TF (without resorting to something like tf.py_func to call back into python).
A3) Use a new, dedicated, TensorBoard copy of SQLite. This requires figuring out a way to build SQLite for all the platforms we target, increasing the complexity of TensorBoard builds/releases, but we may need to do this anyway if we want to build our own C++ code in general. This also shares the duplication concern with A2, unless we just define all the summary ops ourselves as custom ops and have them replace the current built-in ones.
B) Attempt to make two copies of SQLite in the same process work. This would require fixing at least one of the two issues above - preferably issue 2, since WAL mode is better for our use case because it allows interleaving a single writer + multiple readers without contention. That means understanding the WAL-index mmap failure and working around it somehow. We've tried a few things to fix it; there are more we could try like:
C) Put the two SQLite copies in separate processes. This means spawning a child process to do the reload/writer side, while the main process serves the HTTP requests that query the DB.
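Option C can be illustrated with a minimal sketch. It is not TensorBoard's actual design: the table name `scalars`, the `WRITER_SCRIPT` body, and the helper names are hypothetical, and both sides use python's `sqlite3` here purely to keep the example runnable. In the real setup the child process would host the TF copy of SQLite. For simplicity the writer runs to completion before the reader queries, rather than concurrently.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

# Hypothetical writer script. It runs in a child process, so whatever SQLite
# library it loads never shares global state with the reader in the parent.
WRITER_SCRIPT = """
import sqlite3, sys
conn = sqlite3.connect(sys.argv[1])
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS scalars (step INTEGER, value REAL)")
conn.executemany("INSERT INTO scalars VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10)])
conn.commit()
conn.close()
"""


def run_writer(db_path):
    """Run the reload/writer side in a separate process and wait for it."""
    subprocess.run([sys.executable, "-c", WRITER_SCRIPT, db_path], check=True)


def count_rows(db_path):
    """Reader side: query the DB from the main (serving) process."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM scalars").fetchone()[0]
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
run_writer(db_path)
row_count = count_rows(db_path)
print(row_count)
```

Because the two processes only share the database file, coordination between them falls back to SQLite's ordinary cross-process file locking, which is the case SQLite is designed for.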
I am facing a similar issue: one process like TF doing writes and another one doing reads, both in WAL mode.
We ended up doing A2 (we just use python SQLite for everything), so we didn't solve this problem - sorry we can't be more help.
Thank you!
Could you please move this to closed status since the PR is merged? Thanks.
This issue is obsolete since we no longer support the SQLite mode.
This issue documents problems arising from TensorBoard attempting to use two different copies of the SQLite library within a single process. One copy of SQLite is the one provided in the python standard library via `import sqlite3`. The other is compiled into TensorFlow, and used by the DB summary writer to write directly to SQLite from C++ code.
SQLite expects that all its operations within a single process use the same copy of the library with the same shared global state, and apparently you can get undefined behavior if that's not the case. The primary documentation for this is the warning section "Multiple copies of SQLite linked into the same application" from the article here:
https://www.sqlite.org/howtocorrupt.html#_posix_advisory_locks_canceled_by_a_separate_thread_doing_close_
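As a small aid when debugging this kind of setup, the snippet below shows one way to see which SQLite build the python side is using. It only inspects python's copy; it cannot detect a second copy linked in elsewhere (such as the one inside TF), so treat it as a partial check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Version reported by the library that this connection is actually running on.
runtime_version = conn.execute("SELECT sqlite_version()").fetchone()[0]

# Compile-time options of python's SQLite copy (may be empty on some builds).
compile_options = [row[0] for row in conn.execute("PRAGMA compile_options")]

conn.close()

print("sqlite3 module reports:", sqlite3.sqlite_version)
print("connection reports:", runtime_version)
print("compile options:", compile_options)
```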
Issues we've seen so far:
Rollback journal mode results in spurious errors about the database file having been deleted. It's unknown why this specific error happens but it might be related to the problem report here: "For the record, this problem was due to a locking problem between two processes trying to access the same SQLite database, in which a locked database was being mis-reported by SQLite as a disk I/O error." http://sqlite.1065341.n5.nabble.com/random-infrequent-disk-I-O-errors-td66986.html
Using WAL mode instead appeared to work on Linux (debian), but fails on macOS with a "Bus Error: 10" crashing the process as soon as the python sqlite accesses the DB while it's being updated in the background by the TF sqlite. (It's not known exactly what stage of access causes the crash.)
Sample macOS crash log under python 3.6:
Another couple samples from python 2.7:
My best guess is that the WAL mode crash has to do with the two libraries interfering with each other when accessing the mmap'd WAL-index file, and the TF library ends up trying to read into invalid memory. I don't know whether this issue happens when the two libraries are present at all, or if it's a race condition that only happens when multiple threads are involved. Cursory searching does show some possible results that additionally associate Bus Error 10 with mmap issues on macOS: https://stackoverflow.com/questions/49860015/anomalous-bus-error-for-accessing-protected-memory-on-macos
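For context on the files involved, here is a minimal sketch using only python's `sqlite3` (a single library copy, so it does not reproduce the crash). It shows that WAL mode creates a `-wal` file and a `-shm` file next to the database, the latter being the shared-memory WAL-index, and that a reader can query while a writer has an uncommitted transaction. The table name `t` and the temp path are arbitrary choices for the example.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "wal_example.sqlite")

writer = sqlite3.connect(db_path)
# The pragma returns the journal mode that is now in effect.
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

# While a connection is open in WAL mode, both side files exist on disk.
wal_exists = os.path.exists(db_path + "-wal")
shm_exists = os.path.exists(db_path + "-shm")

# Start a second write but do not commit it yet.
writer.execute("INSERT INTO t VALUES (2)")

# A separate reader connection is not blocked and sees only committed rows.
reader = sqlite3.connect(db_path)
count_during_write = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

writer.commit()
count_after_commit = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

reader.close()
writer.close()
print(mode, wal_exists, shm_exists, count_during_write, count_after_commit)
```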
I've tried using `PRAGMA query_only=1` and `PRAGMA mmap_size=0` on the python SQLite side to see if they can prevent the WAL-index mmap errors, but neither one works. Using `env TF_SQLITE_MMAP_SIZE=0` on the TF SQLite side is a little more effective in that I get `UnavailableError (see above for traceback): Step() failed: [261] database is locked` errors from ImportEvent sometimes instead, but other times I still get the same `Bus Error 10` (and even getting a locking error, while better, still would leave plenty of issues).
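For reference, this is roughly how the two pragmas can be applied and verified on the python side. It is a sketch with an arbitrary temp database, not the exact code used in TensorBoard. As far as I understand, `mmap_size` governs memory-mapped I/O for the main database file rather than the WAL-index shared memory, which may be part of why it did not help here.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "pragma_example.sqlite")
conn = sqlite3.connect(db_path)

# Reject any statement that would modify the database on this connection.
conn.execute("PRAGMA query_only=1")
# Turn off memory-mapped I/O for the database file on this connection.
conn.execute("PRAGMA mmap_size=0")

query_only = conn.execute("PRAGMA query_only").fetchone()[0]

# Some builds compile mmap support out, in which case the pragma returns no row.
row = conn.execute("PRAGMA mmap_size").fetchone()
mmap_size = row[0] if row else 0

# With query_only set, a write attempt fails instead of touching the file.
write_error = None
try:
    conn.execute("CREATE TABLE t (x INTEGER)")
except sqlite3.OperationalError as exc:
    write_error = str(exc)

conn.close()
print(query_only, mmap_size, write_error)
```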