
Server hang on parallel execution of queries to named in-memory databases #2189

Closed
simonw opened this issue Sep 18, 2023 · 31 comments

@simonw
Owner

simonw commented Sep 18, 2023

I've started to encounter a bug where queries to tables inside named in-memory databases sometimes trigger server hangs.

I'm still trying to figure out what's going on here - on one occasion I managed to Ctrl+C the server and saw an exception that mentioned a thread lock, but usually hitting Ctrl+C does nothing and I have to kill -9 the PID instead.

This is all running on my M2 Mac.

I've seen the bug in the Datasette 1.0 alphas and in Datasette 0.64.3 - but reverting to 0.61 appeared to fix it.

@simonw simonw added the bug label Sep 18, 2023

simonw commented Sep 18, 2023

I need reliable steps to reproduce, then I can bisect and figure out which exact version of Datasette introduced the problem.

I have a hunch that it relates to changes made to the datasette/database.py module, maybe one of these changes here: 0.61...0.63.1#diff-4e20309c969326a0008dc9237f6807f48d55783315fbfc1e7dfa480b550e16f9


simonw commented Sep 18, 2023

The good news is that this bug is currently unlikely to affect most users, since named in-memory databases (created using datasette.add_memory_database("airtable_refs") - see docs) are a pretty obscure feature, only available to plugins.
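For anyone unfamiliar with the feature, the underlying SQLite mechanism is a shared-cache in-memory URI. A minimal sketch using plain sqlite3 rather than the Datasette API (the "airtable_refs" name here is just illustrative):

```python
# Minimal sketch of the mechanism behind named in-memory databases:
# every connection to the same shared-cache memory URI sees the same data.
import sqlite3

uri = "file:airtable_refs?mode=memory&cache=shared"
conn1 = sqlite3.connect(uri, uri=True, check_same_thread=False)
conn2 = sqlite3.connect(uri, uri=True, check_same_thread=False)

conn1.execute("create table t (id integer primary key)")
conn1.execute("insert into t values (1)")
conn1.commit()  # commit so the second connection can read the table

print(conn2.execute("select id from t").fetchall())  # [(1,)]
```

The database survives only as long as at least one connection to that URI stays open, which is why Datasette keeps a connection around for each named in-memory database.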


simonw commented Sep 18, 2023

Just managed to get this exception trace:

    return await self.route_path(scope, receive, send, path)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/app.py", line 1354, in route_path
    response = await view(request, send)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 134, in view
    return await self.dispatch_request(request)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 91, in dispatch_request
    return await handler(request)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 361, in get
    response_or_template_contexts = await self.data(request, **data_kwargs)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 158, in data
    return await self._data_traced(request, default_labels, _next, _size)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 568, in _data_traced
    await gather(execute_facets(), execute_suggested_facets())
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 177, in _gather_parallel
    return await asyncio.gather(*args)
asyncio.exceptions.CancelledError
INFO:     127.0.0.1:64109 - "GET /airtable_refs/airtable_refs?_facet=table_name&table_name=Sessions HTTP/1.1" 500 Internal Server Error
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
    t.join()
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt


simonw commented Sep 18, 2023

The bug exhibits when I try to add a facet. I think it's caused by the parallel query execution I added to facets at some point.

http://127.0.0.1:8045/airtable_refs/airtable_refs - no error
http://127.0.0.1:8045/airtable_refs/airtable_refs?_facet=table_name#facet-table_name - hangs the server

Crucial line in the traceback:

    await gather(execute_facets(), execute_suggested_facets())



simonw commented Sep 18, 2023

That line was added in 942411e which first shipped in 0.62a0.


simonw commented Sep 18, 2023

I wrote this test, but it passes:

@pytest.mark.asyncio
async def test_facet_against_in_memory_database():
    ds = Datasette()
    db = ds.add_memory_database("mem")
    await db.execute_write("create table t (id integer primary key, name text)")
    await db.execute_write_many(
        "insert into t (name) values (?)", [["one"], ["one"], ["two"]]
    )
    response1 = await ds.client.get("/mem/t.json")
    assert response1.status_code == 200
    response2 = await ds.client.get("/mem/t.json?_facet=name")
    assert response2.status_code == 200
    assert response2.json() == {
        "ok": True,
        "next": None,
        "facet_results": {
            "results": {
                "name": {
                    "name": "name",
                    "type": "column",
                    "hideable": True,
                    "toggle_url": "/mem/t.json",
                    "results": [
                        {
                            "value": "one",
                            "label": "one",
                            "count": 2,
                            "toggle_url": "http://localhost/mem/t.json?_facet=name&name=one",
                            "selected": False,
                        },
                        {
                            "value": "two",
                            "label": "two",
                            "count": 1,
                            "toggle_url": "http://localhost/mem/t.json?_facet=name&name=two",
                            "selected": False,
                        },
                    ],
                    "truncated": False,
                }
            },
            "timed_out": [],
        },
        "rows": [
            {"id": 1, "name": "one"},
            {"id": 2, "name": "one"},
            {"id": 3, "name": "two"},
        ],
        "truncated": False,
    }

@simonw
Copy link
Owner Author

simonw commented Sep 18, 2023

Landing a version of that test anyway.

simonw added a commit that referenced this issue Sep 18, 2023
This is meant to illustrate a crashing bug but it does not trigger it.
simonw added a commit that referenced this issue Sep 18, 2023
This now executes two facets, in the hope that parallel facet execution
would illustrate the bug - but it did not illustrate the bug.

simonw commented Sep 18, 2023

Maybe it's not related to faceting - I just got it on a hit to http://127.0.0.1:8045/airtable_refs/airtable_refs instead.


simonw commented Sep 18, 2023

Sometimes it takes a few clicks for the bug to occur, but it does seem to always be within the in-memory database.


simonw commented Sep 18, 2023

I managed to trigger it by loading http://127.0.0.1:8045/airtable_refs/airtable_refs - which worked - and then hitting refresh on that page a bunch of times until it hung.


simonw commented Sep 18, 2023

I switched that particular implementation to using an on-disk database instead of an in-memory database and could no longer recreate the bug.


simonw commented Sep 18, 2023

OK, I can trigger the bug like this:

datasette pottery2.db -p 8045 --get /airtable_refs/airtable_refs

Can I write a bash script that fails (and terminates the process) if it takes longer than X seconds?


simonw commented Sep 18, 2023

This worked, including on macOS even though GPT-4 thought timeout would not work there: https://chat.openai.com/share/cc4628e9-5240-4f35-b640-16a9c178b315

#!/bin/bash

# Run the command with a timeout of 5 seconds
timeout 5s datasette pottery2.db -p 8045 --get /airtable_refs/airtable_refs

# Check the exit code from timeout
if [ $? -eq 124 ]; then
    echo "Error: Command timed out after 5 seconds."
    exit 1
fi


simonw commented Sep 18, 2023

I'm now trying this test script:

#!/bin/bash

port=8064
# Start datasette server in the background and get its PID
datasette pottery2.db -p $port &
server_pid=$!

# Wait for a moment to ensure the server has time to start up
sleep 2

# Initialize counters and parameters
retry_count=0
max_retries=3
success_count=0
path="/airtable_refs/airtable_refs"

# Function to run curl with a timeout
function test_curl {
    # Run the curl command with a timeout of 3 seconds
    timeout 3s curl -s "http://localhost:${port}${path}" > /dev/null
    if [ $? -eq 0 ]; then
        # Curl was successful
        ((success_count++))
    fi
}

# Try three parallel curl requests
while [[ $retry_count -lt $max_retries ]]; do
    # Reset the success counter
    success_count=0

    # Run the curls in parallel
    echo "  Running curls"
    test_curl
    test_curl
    test_curl #  & test_curl & test_curl &

    # Wait for all curls to finish
    #wait

    # Check the success count
    if [[ $success_count -eq 3 ]]; then
        # All curls succeeded, break out of the loop
        echo "  All curl succeeded"
        break
    fi

    ((retry_count++))
done

# Kill the datasette server
echo "Killing datasette server with PID $server_pid"
kill -9 $server_pid
sleep 2

# Print result
if [[ $success_count -eq 3 ]]; then
    echo "All three curls succeeded."
    exit 0
else
    echo "Error: Not all curls succeeded after $retry_count attempts."
    exit 1
fi

I run it like this:

git bisect reset
git bisect start
git bisect good 0.59.4
git bisect bad 1.0a6
git bisect run ../airtable-export/testit.sh

But... it's not having the desired result, I think because the bug is intermittent so each time I run it the bisect spits out a different commit as the one that is to blame.
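One standard trick for bisecting an intermittent bug (a hypothetical wrapper, not something used in this issue): repeat the test several times and report failure if any run fails, since git bisect run treats any non-zero exit code as "bad". A sketch:

```python
# Hypothetical wrapper for `git bisect run` around a flaky test:
# repeat the check and treat any single failure as marking the commit bad.
import subprocess
import sys


def run_flaky(cmd, attempts=5):
    """Return 0 only if every one of `attempts` runs of cmd exits 0."""
    for _ in range(attempts):
        if subprocess.run(cmd).returncode != 0:
            return 1  # git bisect run reads a non-zero exit as "bad"
    return 0


if __name__ == "__main__":
    sys.exit(run_flaky(sys.argv[1:]))
```

Run as `git bisect run python wrapper.py ../airtable-export/testit.sh` - a commit only passes if the test succeeds five times in a row, which sharply reduces the chance of a flaky pass sending the bisect down the wrong path.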


simonw commented Sep 18, 2023

Output while it is running looks like this:

running  '../airtable-export/testit.sh'
INFO:     Started server process [75649]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
  Running curls
  Running curls
Killing datasette server with PID 75649
../airtable-export/testit.sh: line 54: 75649 Killed: 9               datasette pottery2.db -p $port
Error: Not all curls succeeded after 3 attempts.
Bisecting: 155 revisions left to test after this (roughly 7 steps)
[247e460e08bf823142f7b84058fe44e43626787f] Update beautifulsoup4 requirement (#1703)
running  '../airtable-export/testit.sh'
INFO:     Started server process [75722]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
  Running curls
  Running curls
Killing datasette server with PID 75722
../airtable-export/testit.sh: line 54: 75722 Killed: 9               datasette pottery2.db -p $port
Error: Not all curls succeeded after 3 attempts.
Bisecting: 77 revisions left to test after this (roughly 6 steps)
[3ef47a0896c7e63404a34e465b7160c80eaa571d] Link rel=alternate header for tables and rows
running  '../airtable-export/testit.sh'
INFO:     Started server process [75818]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls


simonw commented Sep 18, 2023

I tried it with a path of / and everything passed - so it's definitely the path of /airtable_refs/airtable_refs (an in-memory database created by an experimental branch of https://github.com/simonw/airtable-export) that triggers the problem.


simonw commented Sep 18, 2023

I knocked it down to 1 retry just to see what happened.


simonw commented Sep 18, 2023

Turned out I wasn't running the datasette from the current directory, so it was not testing what I intended.

Fixed that with pip install -e . in the datasette/ directory.

Now I'm seeing some passes, which look like this:

running  '../airtable-export/testit.sh'
INFO:     Started server process [77810]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
INFO:     127.0.0.1:59439 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
INFO:     127.0.0.1:59440 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
INFO:     127.0.0.1:59441 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
  All curl succeeded
Killing datasette server with PID 77810
../airtable-export/testit.sh: line 54: 77810 Killed: 9               datasette pottery2.db -p $port
All three curls succeeded.
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[7463b051cf8d7f856df5eba9f7aa944183ebabe5] Cosmetic tweaks after blacken-docs, refs #1718
running  '../airtable-export/testit.sh'
INFO:     Started server process [77826]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls


simonw commented Sep 18, 2023

OK it looks like it found it!

942411ef946e9a34a2094944d3423cddad27efd3 is the first bad commit

Author: Simon Willison <swillison@gmail.com>
Date:   Tue Apr 26 15:48:56 2022 -0700

    Execute some TableView queries in parallel
    
    Use ?_noparallel=1 to opt out (undocumented, useful for benchmark comparisons)
    
    Refs #1723, #1715

 datasette/views/table.py | 93 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 67 insertions(+), 26 deletions(-)
bisect found first bad commit

942411e does look like the cause of this problem.


simonw commented Sep 18, 2023

I also confirmed that http://127.0.0.1:8064/airtable_refs/airtable_refs?_noparallel=1 does not trigger the bug but http://127.0.0.1:8064/airtable_refs/airtable_refs does.

@simonw simonw changed the title Server hang on some queries to named in-memory databases Server hang on parallel execution of queries to named in-memory databases Sep 18, 2023

simonw commented Sep 18, 2023

Now that I've confirmed that parallel query execution of the kind introduced in 942411e can cause hangs (presumably some kind of locking issue) against in-memory databases, some options:

  1. Disable parallel execution entirely and rip out the related code.
  2. Disable parallel execution by leaving that code in place but having it always behave as if _noparallel=1 was passed.
  3. Continue digging and try to find some way to avoid this problem.

The parallel execution work is something I was playing with last year in the hope of speeding up Datasette pages like the table page which need to execute a bunch of queries - one for each facet, plus one for each column to see if it should be suggested as a facet.

I wrote about this at the time here: https://simonwillison.net/2022/May/6/weeknotes/

My hope was that despite Python's GIL this optimization would still help, because the SQLite C module releases the GIL once it gets to SQLite.
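The idea can be sketched like this (illustrative only - it opens one fresh connection per query, which is simpler than what Datasette actually does with its connection-per-thread executors):

```python
# Sketch of the optimization idea: run independent read queries on
# threads via run_in_executor; sqlite3's C layer releases the GIL
# while SQLite itself is doing the work.
import asyncio
import sqlite3


async def parallel_reads():
    def query(sql):
        conn = sqlite3.connect(":memory:")  # one connection per thread
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    loop = asyncio.get_running_loop()
    return await asyncio.gather(
        loop.run_in_executor(None, query, "select 1"),
        loop.run_in_executor(None, query, "select 2"),
    )


print(asyncio.run(parallel_reads()))  # [[(1,)], [(2,)]]
```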

But... that didn't hold up. It looked like enough work was happening in Python land with the GIL that the optimization didn't improve things.

Running the nogil fork of Python DID improve things though! I left the code in partly on the hope that the nogil fork would be accepted into Python core.

... which it now has! But it will still be a year or two before it fully lands: https://discuss.python.org/t/a-steering-council-notice-about-pep-703-making-the-global-interpreter-lock-optional-in-cpython/30474

So I'm not particularly concerned about dropping the parallel execution. If I do drop it, though, should I leave in the potentially complex code that relates to it?


simonw commented Sep 18, 2023

Looking again at this code:

def connect(self, write=False):
    if self.memory_name:
        uri = "file:{}?mode=memory&cache=shared".format(self.memory_name)
        conn = sqlite3.connect(
            uri,
            uri=True,
            check_same_thread=False,
        )
        if not write:
            conn.execute("PRAGMA query_only=1")
        return conn
    if self.is_memory:
        return sqlite3.connect(":memory:", uri=True)
    # mode=ro or immutable=1?
    if self.is_mutable:
        qs = "?mode=ro"
        if self.ds.nolock:
            qs += "&nolock=1"
    else:
        qs = "?immutable=1"
    assert not (write and not self.is_mutable)
    if write:
        qs = ""
    if self.mode is not None:
        qs = f"?mode={self.mode}"
    conn = sqlite3.connect(
        f"file:{self.path}{qs}", uri=True, check_same_thread=False
    )
    self._all_file_connections.append(conn)
    return conn

check_same_thread=False really stands out here.

Python docs at https://docs.python.org/3/library/sqlite3.html

check_same_thread (bool) -- If True (default), ProgrammingError will be raised if the database connection is used by a thread other than the one that created it. If False, the connection may be accessed in multiple threads; write operations may need to be serialized by the user to avoid data corruption. See threadsafety for more information.

I think I'm playing with fire by allowing multiple threads to access the same connection without doing my own serialization of those requests.

I do do that using the write connection - and in this particular case the bug isn't coming from write queries, it's coming from read queries - but perhaps SQLite has issues with threading for reads, too.
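A minimal sketch of what that serialization could look like - a hypothetical LockedConnection helper, not Datasette's actual code:

```python
# Sketch: serialize access to a single check_same_thread=False connection
# with a lock, since a Python sqlite3 connection must not be used from
# two threads at once unless SQLite is in serialized mode.
import sqlite3
import threading


class LockedConnection:
    def __init__(self, uri):
        self.conn = sqlite3.connect(uri, uri=True, check_same_thread=False)
        self.lock = threading.Lock()

    def fetchall(self, sql, params=()):
        with self.lock:  # only one thread touches the connection at a time
            return self.conn.execute(sql, params).fetchall()


db = LockedConnection("file:demo?mode=memory&cache=shared")
db.fetchall("create table t (id integer)")
db.fetchall("insert into t values (1)")

# Hammer the shared connection from several threads at once
threads = [
    threading.Thread(target=db.fetchall, args=("select * from t",))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db.fetchall("select count(*) from t"))  # [(1,)]
```

Datasette's write path already funnels all writes through a single dedicated thread; the sketch above is the equivalent discipline applied to reads.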


simonw commented Sep 18, 2023

Using SQLite In Multi-Threaded Applications

That indicates that there's a SQLite option for "Serialized" mode where it's safe to access anything SQLite provides from multiple threads, but as far as I can tell Python doesn't give you an option to turn that mode on or off for a connection - you can read sqlite3.threadsafety to see if that mode was compiled in or not, but not actually change it.

On my Mac sqlite3.threadsafety returns 1 which means https://docs.python.org/3/library/sqlite3.html#sqlite3.threadsafety "Multi-thread: In this mode, SQLite can be safely used by multiple threads provided that no single database connection is used simultaneously in two or more threads." - it would need to return 3 for that serialized mode.
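A quick way to check what a given build supports, with the values as documented for the sqlite3 module:

```python
# Check which SQLite threading mode this Python build was compiled with:
# 0 = single-thread, 1 = multi-thread (one connection per thread),
# 3 = serialized (a single connection may be shared across threads).
import sqlite3

print(sqlite3.threadsafety)
if sqlite3.threadsafety < 3:
    print("Sharing one connection across threads needs external locking")
```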


simonw commented Sep 18, 2023

My current hunch is that SQLite gets unhappy if multiple threads access the same underlying C object - which sometimes happens with in-memory connections and Datasette presumably because they are faster than file-backed databases.

I'm going to remove the asyncio.gather() code from the table view. I'll ship a 0.x release with that fix too.
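The shape of that fix is straightforward - a sketch with stand-in coroutines rather than the real execute_facets()/execute_suggested_facets():

```python
# Sketch of the fix: await the two facet steps one after the other
# instead of concurrently with asyncio.gather().
import asyncio


async def execute_facets():  # stand-in for the real coroutine
    return "facet results"


async def execute_suggested_facets():  # stand-in for the real coroutine
    return "suggested facets"


async def data():
    # Before: await gather(execute_facets(), execute_suggested_facets())
    facets = await execute_facets()
    suggested = await execute_suggested_facets()
    return facets, suggested


print(asyncio.run(data()))  # ('facet results', 'suggested facets')
```

Sequential awaits give up the (in practice negligible, under the GIL) speedup but mean only one query ever runs against a connection at a time.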


simonw commented Sep 18, 2023

The one other thing affected by this change is this documentation, which suggests a not-actually-safe pattern:

datasette/docs/internals.rst

Lines 1292 to 1321 in 6ed7908

This example uses the :ref:`register_routes() <plugin_register_routes>` plugin hook to add a page at ``/parallel-queries`` which executes two SQL queries in parallel using ``asyncio.gather()`` and returns their results.

.. code-block:: python

    from datasette import hookimpl
    from datasette import tracer


    @hookimpl
    def register_routes():
        async def parallel_queries(datasette):
            db = datasette.get_database()
            with tracer.trace_child_tasks():
                one, two = await asyncio.gather(
                    db.execute("select 1"),
                    db.execute("select 2"),
                )
            return Response.json(
                {
                    "one": one.single_value(),
                    "two": two.single_value(),
                }
            )

        return [
            (r"/parallel-queries$", parallel_queries),
        ]

Adding ``?_trace=1`` will show that the trace covers both of those child tasks.


simonw commented Sep 20, 2023

Added a note to that example in the documentation:

Note that running parallel SQL queries in this way has `been known to cause problems in the past <https://github.com/simonw/datasette/issues/2189>`__, so treat this example with caution.

@yozlet

yozlet commented Sep 20, 2023

/me munches popcorn at a furious rate, utterly enthralled

simonw added a commit that referenced this issue Sep 20, 2023

simonw commented Sep 21, 2023

I'm going to release this in 1.0a7, and I'll backport it to a 0.64.4 release too.

simonw added a commit that referenced this issue Sep 21, 2023

simonw commented Sep 21, 2023

We're planning a breaking change in 1.0a7:

Since that's a breaking change I'm going to ship 1.0a7 right now with this fix, then ship that breaking change as 1.0a8 instead.

simonw added a commit that referenced this issue Sep 21, 2023
simonw added a commit that referenced this issue Sep 21, 2023

simonw commented Sep 21, 2023

1.0a7 is out with this fix as well now: https://docs.datasette.io/en/1.0a7/changelog.html#a7-2023-09-21

@simonw simonw closed this as completed Sep 21, 2023
simonw added a commit that referenced this issue Sep 21, 2023