double-linked SQLite libs cause locking/corruption issues #1467

nfelt · 2018-09-27T23:40:37Z

This issue documents problems arising from TensorBoard attempting to use two different copies of the SQLite library within a single process. One copy of SQLite is the one provided in the python standard library via import sqlite3. The other is compiled into TensorFlow, and used by the DB summary writer to write directly to SQLite from C++ code.

SQLite expects that all its operations within a single process use the same copy of the library with the same shared global state, and apparently you can get undefined behavior if that's not the case. The primary documentation for this is the warning section "Multiple copies of SQLite linked into the same application" from the article here:
https://www.sqlite.org/howtocorrupt.html#_posix_advisory_locks_canceled_by_a_separate_thread_doing_close_

Issues we've seen so far:

Rollback journal mode results in spurious errors about the database file having been deleted. It's unknown why this specific error happens but it might be related to the problem report here: "For the record, this problem was due to a locking problem between two processes trying to access the same SQLite database, in which a locked database was being mis-reported by SQLite as a disk I/O error." http://sqlite.1065341.n5.nabble.com/random-infrequent-disk-I-O-errors-td66986.html

Using WAL mode instead appeared to work on Linux (debian), but fails on macOS with a "Bus Error: 10" crashing the process as soon as the python sqlite accesses the DB while it's being updated in the background by the TF sqlite. (It's not known exactly what stage of access causes the crash.)

Sample macOS crash log under python 3.6:

Crashed Thread:        7

Exception Type:        EXC_BAD_ACCESS (SIGBUS)
Exception Codes:       KERN_MEMORY_ERROR at 0x0000000108dbe1ee
Exception Note:        EXC_CORPSE_NOTIFY

...

Thread 7 Crashed:
0   libsystem_kernel.dylib        	0x00007fff53645b66 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff53810080 pthread_kill + 333
2   libsystem_c.dylib             	0x00007fff535536fe raise + 26
3   libsystem_platform.dylib      	0x00007fff53803f5a _sigtramp + 26
4   ???                           	0x0000000100000005 0 + 4294967301
5   _pywrap_tensorflow_internal.so	0x000000010aae9a0f pagerWalFrames + 2052
6   _pywrap_tensorflow_internal.so	0x000000010aad6b27 sqlite3PagerCommitPhaseOne + 208
7   _pywrap_tensorflow_internal.so	0x000000010aad70ab sqlite3BtreeCommitPhaseOne + 119
8   _pywrap_tensorflow_internal.so	0x000000010aaf423e sqlite3VdbeHalt + 1710
9   _pywrap_tensorflow_internal.so	0x000000010ab00436 sqlite3VdbeExec + 39819
10  _pywrap_tensorflow_internal.so	0x000000010aad8853 sqlite3_step + 540
11  _pywrap_tensorflow_internal.so	0x000000010aacc957 tensorflow::SqliteStatement::Step(bool*) + 71
12  _pywrap_tensorflow_internal.so	0x000000010aacd162 tensorflow::SqliteStatement::StepAndReset() + 50
13  _pywrap_tensorflow_internal.so	0x000000010aa99f50 tensorflow::(anonymous     namespace)::SeriesWriter::Update(tensorflow::Sqlite*, long long, double, tensorflow::Tensor const&, tensorflow::StringPiece const&, long long) + 656
14  _pywrap_tensorflow_internal.so	0x000000010aa97c78 tensorflow::(anonymous     namespace)::RunWriter::Append(tensorflow::Sqlite*, long long, long long, unsigned long long, double, tensorflow::Tensor const&) +     2088
15  _pywrap_tensorflow_internal.so	0x000000010aa9abbe tensorflow::(anonymous namespace)::SummaryDbWriter::MigrateEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) + 1822
16  _pywrap_tensorflow_internal.so	0x000000010aa96b25 tensorflow::(anonymous     namespace)::SummaryDbWriter::WriteEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) +     37
17  _pywrap_tensorflow_internal.so	0x000000010aa92db7 tensorflow::ImportEventOp::Compute(tensorflow::OpKernelContext*) + 263

Another couple samples from python 2.7:

0   libsystem_platform.dylib      	0x00007fff53802bc1 _platform_memcmp + 33
1   _pywrap_tensorflow_internal.so	0x000000011140391c pagerWalFrames + 209
2   _pywrap_tensorflow_internal.so	0x00000001113f1167 sqlite3PagerCommitPhaseOne + 208
3   _pywrap_tensorflow_internal.so	0x00000001113f16eb sqlite3BtreeCommitPhaseOne + 119
4   _pywrap_tensorflow_internal.so	0x000000011140e87e sqlite3VdbeHalt + 1710
5   _pywrap_tensorflow_internal.so	0x000000011141aa76 sqlite3VdbeExec + 39819
6   _pywrap_tensorflow_internal.so	0x00000001113f2e93 sqlite3_step + 540
7   _pywrap_tensorflow_internal.so	0x00000001113e7097 tensorflow::SqliteStatement::Step(bool*) + 71
8   _pywrap_tensorflow_internal.so	0x00000001113e782e tensorflow::SqliteStatement::StepAndReset() + 30
9   _pywrap_tensorflow_internal.so	0x00000001113b4640 tensorflow::(anonymous namespace)::SeriesWriter::Update(tensorflow::Sqlite*, long long, double, tensorflow::Tensor const&, absl::string_view const&, long long) + 656
10  _pywrap_tensorflow_internal.so	0x00000001113b2378 tensorflow::(anonymous namespace)::RunWriter::Append(tensorflow::Sqlite*, long long, long long, unsigned long long, double, tensorflow::Tensor const&) + 2088
11  _pywrap_tensorflow_internal.so	0x00000001113b516e tensorflow::(anonymous namespace)::SummaryDbWriter::MigrateEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) + 1822
12  _pywrap_tensorflow_internal.so	0x00000001113b1215 tensorflow::(anonymous namespace)::SummaryDbWriter::WriteEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) + 37
13  _pywrap_tensorflow_internal.so	0x00000001113ad2e7 tensorflow::ImportEventOp::Compute(tensorflow::OpKernelContext*) + 263

0   _pywrap_tensorflow_internal.so	0x00000001196ab184 walIndexAppend + 185
1   _pywrap_tensorflow_internal.so	0x00000001196aa04f pagerWalFrames + 2052
2   _pywrap_tensorflow_internal.so	0x0000000119697167 sqlite3PagerCommitPhaseOne + 208
3   _pywrap_tensorflow_internal.so	0x00000001196976eb sqlite3BtreeCommitPhaseOne + 119
4   _pywrap_tensorflow_internal.so	0x00000001196b487e sqlite3VdbeHalt + 1710
5   _pywrap_tensorflow_internal.so	0x00000001196c0a76 sqlite3VdbeExec + 39819
6   _pywrap_tensorflow_internal.so	0x0000000119698e93 sqlite3_step + 540
7   _pywrap_tensorflow_internal.so	0x000000011968d097 tensorflow::SqliteStatement::Step(bool*) + 71
8   _pywrap_tensorflow_internal.so	0x000000011968d82e tensorflow::SqliteStatement::StepAndReset() + 30
9   _pywrap_tensorflow_internal.so	0x000000011965a640 tensorflow::(anonymous namespace)::SeriesWriter::Update(tensorflow::Sqlite*, long long, double, tensorflow::Tensor const&, absl::string_view const&, long long) + 656
10  _pywrap_tensorflow_internal.so	0x0000000119658378 tensorflow::(anonymous namespace)::RunWriter::Append(tensorflow::Sqlite*, long long, long long, unsigned long long, double, tensorflow::Tensor const&) + 2088
11  _pywrap_tensorflow_internal.so	0x000000011965b16e tensorflow::(anonymous namespace)::SummaryDbWriter::MigrateEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) + 1822
12  _pywrap_tensorflow_internal.so	0x0000000119657215 tensorflow::(anonymous namespace)::SummaryDbWriter::WriteEvent(std::__1::unique_ptr<tensorflow::Event, std::__1::default_delete<tensorflow::Event> >) + 37
13  _pywrap_tensorflow_internal.so	0x00000001196532e7 tensorflow::ImportEventOp::Compute(tensorflow::OpKernelContext*) + 263

My best guess is that the WAL mode crash has to do with the two libraries interfere when accessing the mmap'd WAL-index file, and the TF library ends up trying to read into invalid memory. I don't know whether this issue happens when the two libraries are present at all, or if it's a race condition that only happens when multiple threads are involved. Cursory searching does show some possible results that additionally associate Bus Error 10 with mmap issues on macOS: https://stackoverflow.com/questions/49860015/anomalous-bus-error-for-accessing-protected-memory-on-macos

I've tried using PRAGMA query_only=1 and PRAGMA mmap_size=0 on the python SQLite side to see if they can prevent the WAL-index mmap errors, but neither one works. Using env TF_SQLITE_MMAP_SIZE=0 on the TF SQLite side is a little more effective in that I get UnavailableError (see above for traceback): Step() failed: [261] database is locked errors from ImportEvent sometimes instead, but other times I still get the same Bus Error 10 (and even getting a locking error, while better, still would leave plenty of issues).

The text was updated successfully, but these errors were encountered:

nfelt · 2018-10-01T19:29:43Z

At this point, we have a few paths forward available:

A) Consolidate onto a single copy of SQLite. This has three sub-options:

A1) Use TF's copy of SQLite. This is a non-starter; we are aiming to eliminate the TF dependency anyway in the long term, and using the TF copy would require adding to TF either our own python sqlite wrapper or an entire existing one like pysqlite.

A2) Use python's copy of SQLite. The primary concern here is performance - it would require routing the DB import logic through python, and deserializing all the event protos into their constituent summary elements in python. We do this already today, but it's slow. Using the C++ import_event() op lets us avoid going into python for this. The secondary concern is duplication, since we have eventfile -> sqlite conversion logic in summary_db_writer.cc already, and this would all need to be reimplemented in python, and we would need to keep them in sync if we want to continue supporting the DB writer summary ops in TF (without resorting to something like tf.py_func to call back into python).

A3) Use a new, dedicated, TensorBoard copy of SQLite. This requires figuring out a way to build SQLite for all the platforms we target, increasing the complexity of TensorBoard builds/releases, but we may need to do this anyway if we want to build our own C++ code in general. This also shares the duplication concern with A2, unless we just define all the summary ops ourselves as custom ops and have them replace the current built-in ones.

B) Attempt to make two copies of SQLite in the same process work. This would require fixing at least one of the two issues above - preferably issue 2, since WAL mode is better for our use case because it allows interleaving a single writer + multiple readers without contention. That means understanding the WAL-index mmap failure and working around it somehow.

We've tried a few things to fix it; there are more we could try like:

Using locking to avoid multiple DB connections working at once (though this negates the main WAL mode advantage above and adds complexity), in case the issue is only due to multithreading
Attempting to avoid ever closing DB connections (and hence losing the POSIX advisory locking) in case that's the issue - e.g. we would use a shared pool of DB connections, and would have to ensure we close them all together at shutdown, only after none of them are being used - also somewhat complicated

C) Put the two SQLite copies in separate processes. This means spawning a child process to do the reload/writer side, while the main process serves the HTTP requests that query the DB.

continuum-avinash-jha · 2019-12-02T09:43:33Z

I am facing a similar issue. 1 Process like TF (doing writes) and another one(doing reads) in WAL mode.
Stacktrace appears similar. Tried to avoid closing DB connections, but still crash is seen. What was the fix done to resolve this?

nfelt · 2019-12-03T22:41:02Z

We ended up doing A2 (we just use python SQLite for everything), so we didn't solve this problem - sorry we can't be more help.

continuum-avinash-jha · 2019-12-05T06:39:41Z

Thank you!

pindinagesh · 2021-11-07T14:00:42Z

@nfelt

Could you please move this to closed status since PR is merged. Thanks

nfelt · 2021-11-09T01:06:51Z

This issue is obsolete since we no longer support the SQLite mode.

nfelt added the core:backend label Sep 27, 2018

nfelt mentioned this issue Sep 28, 2018

Add ability to run reload logic in a child process #1469

Merged

pindinagesh self-assigned this Nov 7, 2021

pindinagesh added the stat:awaiting response label Nov 7, 2021

nfelt closed this as completed Nov 9, 2021

nfelt removed the stat:awaiting response label Nov 9, 2021

nfelt unassigned pindinagesh Nov 9, 2021

Mytherin mentioned this issue Mar 23, 2024

SQLite/DuckDB Concurrency Errors Within a Single Process duckdb/sqlite_scanner#82

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

double-linked SQLite libs cause locking/corruption issues #1467

double-linked SQLite libs cause locking/corruption issues #1467

nfelt commented Sep 27, 2018 •

edited

nfelt commented Oct 1, 2018

continuum-avinash-jha commented Dec 2, 2019

nfelt commented Dec 3, 2019

continuum-avinash-jha commented Dec 5, 2019

pindinagesh commented Nov 7, 2021

nfelt commented Nov 9, 2021

double-linked SQLite libs cause locking/corruption issues #1467

double-linked SQLite libs cause locking/corruption issues #1467

Comments

nfelt commented Sep 27, 2018 • edited

nfelt commented Oct 1, 2018

continuum-avinash-jha commented Dec 2, 2019

nfelt commented Dec 3, 2019

continuum-avinash-jha commented Dec 5, 2019

pindinagesh commented Nov 7, 2021

nfelt commented Nov 9, 2021

nfelt commented Sep 27, 2018 •

edited