
Server hang on parallel execution of queries to named in-memory databases #2189

Closed
simonw opened this issue Sep 18, 2023 · 31 comments

@simonw
Owner

simonw commented Sep 18, 2023

I've started to encounter a bug where queries to tables inside named in-memory databases sometimes trigger server hangs.

I'm still trying to figure out what's going on here - on one occasion I managed to Ctrl+C the server and saw an exception that mentioned a thread lock, but usually hitting Ctrl+C does nothing and I have to kill -9 the PID instead.

This is all running on my M2 Mac.

I've seen the bug in the Datasette 1.0 alphas and in Datasette 0.64.3 - but reverting to 0.61 appeared to fix it.

@simonw simonw added the bug label Sep 18, 2023

simonw commented Sep 18, 2023

I need reliable steps to reproduce, then I can bisect and figure out which exact version of Datasette introduced the problem.

I have a hunch that it relates to changes made to the datasette/database.py module, maybe one of these changes here: 0.61...0.63.1#diff-4e20309c969326a0008dc9237f6807f48d55783315fbfc1e7dfa480b550e16f9


simonw commented Sep 18, 2023

The good news is that this bug is currently unlikely to affect most users, since named in-memory databases (created using datasette.add_memory_database("airtable_refs") - see docs) are a pretty obscure feature, only available to plugins.
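For anyone unfamiliar with the feature, the underlying SQLite mechanism is a shared-cache in-memory URI. A minimal sketch using plain sqlite3 rather than the Datasette API (the "airtable_refs" name here is just illustrative):

```python
# Minimal sketch of the mechanism behind named in-memory databases:
# every connection to the same shared-cache memory URI sees the same data.
import sqlite3

uri = "file:airtable_refs?mode=memory&cache=shared"
conn1 = sqlite3.connect(uri, uri=True, check_same_thread=False)
conn2 = sqlite3.connect(uri, uri=True, check_same_thread=False)

conn1.execute("create table t (id integer primary key)")
conn1.execute("insert into t values (1)")
conn1.commit()  # commit so the second connection can read the table

print(conn2.execute("select id from t").fetchall())  # [(1,)]
```

The database survives only as long as at least one connection to that URI stays open, which is why Datasette keeps a connection around for each named in-memory database.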


simonw commented Sep 18, 2023

Just managed to get this exception trace:

    return await self.route_path(scope, receive, send, path)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/app.py", line 1354, in route_path
    response = await view(request, send)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 134, in view
    return await self.dispatch_request(request)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 91, in dispatch_request
    return await handler(request)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/base.py", line 361, in get
    response_or_template_contexts = await self.data(request, **data_kwargs)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 158, in data
    return await self._data_traced(request, default_labels, _next, _size)
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 568, in _data_traced
    await gather(execute_facets(), execute_suggested_facets())
  File "/Users/simon/.local/share/virtualenvs/airtable-export-Ca4U-3qk/lib/python3.8/site-packages/datasette/views/table.py", line 177, in _gather_parallel
    return await asyncio.gather(*args)
asyncio.exceptions.CancelledError
INFO:     127.0.0.1:64109 - "GET /airtable_refs/airtable_refs?_facet=table_name&table_name=Sessions HTTP/1.1" 500 Internal Server Error
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
    t.join()
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/Users/simon/.pyenv/versions/3.8.17/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt


simonw commented Sep 18, 2023

The bug exhibits when I try to add a facet. I think it's caused by the parallel query execution I added to facets at some point.

http://127.0.0.1:8045/airtable_refs/airtable_refs - no error
http://127.0.0.1:8045/airtable_refs/airtable_refs?_facet=table_name#facet-table_name - hangs the server

Crucial line in the traceback:

    await gather(execute_facets(), execute_suggested_facets())



simonw commented Sep 18, 2023

That line was added in 942411e which first shipped in 0.62a0.


simonw commented Sep 18, 2023

I wrote this test, but it passes:

@pytest.mark.asyncio
async def test_facet_against_in_memory_database():
    ds = Datasette()
    db = ds.add_memory_database("mem")
    await db.execute_write("create table t (id integer primary key, name text)")
    await db.execute_write_many(
        "insert into t (name) values (?)", [["one"], ["one"], ["two"]]
    )
    response1 = await ds.client.get("/mem/t.json")
    assert response1.status_code == 200
    response2 = await ds.client.get("/mem/t.json?_facet=name")
    assert response2.status_code == 200
    assert response2.json() == {
        "ok": True,
        "next": None,
        "facet_results": {
            "results": {
                "name": {
                    "name": "name",
                    "type": "column",
                    "hideable": True,
                    "toggle_url": "/mem/t.json",
                    "results": [
                        {
                            "value": "one",
                            "label": "one",
                            "count": 2,
                            "toggle_url": "http://localhost/mem/t.json?_facet=name&name=one",
                            "selected": False,
                        },
                        {
                            "value": "two",
                            "label": "two",
                            "count": 1,
                            "toggle_url": "http://localhost/mem/t.json?_facet=name&name=two",
                            "selected": False,
                        },
                    ],
                    "truncated": False,
                }
            },
            "timed_out": [],
        },
        "rows": [
            {"id": 1, "name": "one"},
            {"id": 2, "name": "one"},
            {"id": 3, "name": "two"},
        ],
        "truncated": False,
    }

@simonw
Copy link
Owner Author

simonw commented Sep 18, 2023

Landing a version of that test anyway.

simonw added a commit that referenced this issue Sep 18, 2023
This is meant to illustrate a crashing bug but it does not trigger it.
simonw added a commit that referenced this issue Sep 18, 2023
This now executes two facets, in the hope that parallel facet execution
would illustrate the bug - but it did not illustrate the bug.

simonw commented Sep 18, 2023

Maybe it's not related to faceting - I just got it on a hit to http://127.0.0.1:8045/airtable_refs/airtable_refs instead.


simonw commented Sep 18, 2023

Sometimes it takes a few clicks for the bug to occur, but it does seem to always be within the in-memory database.


simonw commented Sep 18, 2023

I managed to trigger it by loading http://127.0.0.1:8045/airtable_refs/airtable_refs - which worked - and then hitting refresh on that page a bunch of times until it hung.


simonw commented Sep 18, 2023

I switched that particular implementation to using an on-disk database instead of an in-memory database and could no longer recreate the bug.


simonw commented Sep 18, 2023

OK, I can trigger the bug like this:

datasette pottery2.db -p 8045 --get /airtable_refs/airtable_refs

Can I write a bash script that fails (and terminates the process) if it takes longer than X seconds?


simonw commented Sep 18, 2023

This worked, including on macOS even though GPT-4 thought timeout would not work there: https://chat.openai.com/share/cc4628e9-5240-4f35-b640-16a9c178b315

#!/bin/bash

# Run the command with a timeout of 5 seconds
timeout 5s datasette pottery2.db -p 8045 --get /airtable_refs/airtable_refs

# Check the exit code from timeout
if [ $? -eq 124 ]; then
    echo "Error: Command timed out after 5 seconds."
    exit 1
fi


simonw commented Sep 18, 2023

I'm now trying this test script:

#!/bin/bash

port=8064
# Start datasette server in the background and get its PID
datasette pottery2.db -p $port &
server_pid=$!

# Wait for a moment to ensure the server has time to start up
sleep 2

# Initialize counters and parameters
retry_count=0
max_retries=3
success_count=0
path="/airtable_refs/airtable_refs"

# Function to run curl with a timeout
function test_curl {
    # Run the curl command with a timeout of 3 seconds
    timeout 3s curl -s "http://localhost:${port}${path}" > /dev/null
    if [ $? -eq 0 ]; then
        # Curl was successful
        ((success_count++))
    fi
}

# Try three parallel curl requests
while [[ $retry_count -lt $max_retries ]]; do
    # Reset the success counter
    success_count=0

    # Run the curls in parallel
    echo "  Running curls"
    test_curl
    test_curl
    test_curl #  & test_curl & test_curl &

    # Wait for all curls to finish
    #wait

    # Check the success count
    if [[ $success_count -eq 3 ]]; then
        # All curls succeeded, break out of the loop
        echo "  All curl succeeded"
        break
    fi

    ((retry_count++))
done

# Kill the datasette server
echo "Killing datasette server with PID $server_pid"
kill -9 $server_pid
sleep 2

# Print result
if [[ $success_count -eq 3 ]]; then
    echo "All three curls succeeded."
    exit 0
else
    echo "Error: Not all curls succeeded after $retry_count attempts."
    exit 1
fi

I run it like this:

git bisect reset
git bisect start
git bisect good 0.59.4
git bisect bad 1.0a6
git bisect run ../airtable-export/testit.sh

But... it's not having the desired result, I think because the bug is intermittent so each time I run it the bisect spits out a different commit as the one that is to blame.
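One standard trick for bisecting an intermittent bug (a hypothetical wrapper, not something used in this issue): repeat the test several times and report failure if any run fails, since git bisect run treats any non-zero exit code as "bad". A sketch:

```python
# Hypothetical wrapper for `git bisect run` around a flaky test:
# repeat the check and treat any single failure as marking the commit bad.
import subprocess
import sys


def run_flaky(cmd, attempts=5):
    """Return 0 only if every one of `attempts` runs of cmd exits 0."""
    for _ in range(attempts):
        if subprocess.run(cmd).returncode != 0:
            return 1  # git bisect run reads a non-zero exit as "bad"
    return 0


if __name__ == "__main__":
    sys.exit(run_flaky(sys.argv[1:]))
```

Run as `git bisect run python wrapper.py ../airtable-export/testit.sh` - a commit only passes if the test succeeds five times in a row, which sharply reduces the chance of a flaky pass sending the bisect down the wrong path.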


simonw commented Sep 18, 2023

Output while it is running looks like this:

running  '../airtable-export/testit.sh'
INFO:     Started server process [75649]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
  Running curls
  Running curls
Killing datasette server with PID 75649
../airtable-export/testit.sh: line 54: 75649 Killed: 9               datasette pottery2.db -p $port
Error: Not all curls succeeded after 3 attempts.
Bisecting: 155 revisions left to test after this (roughly 7 steps)
[247e460e08bf823142f7b84058fe44e43626787f] Update beautifulsoup4 requirement (#1703)
running  '../airtable-export/testit.sh'
INFO:     Started server process [75722]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
  Running curls
  Running curls
Killing datasette server with PID 75722
../airtable-export/testit.sh: line 54: 75722 Killed: 9               datasette pottery2.db -p $port
Error: Not all curls succeeded after 3 attempts.
Bisecting: 77 revisions left to test after this (roughly 6 steps)
[3ef47a0896c7e63404a34e465b7160c80eaa571d] Link rel=alternate header for tables and rows
running  '../airtable-export/testit.sh'
INFO:     Started server process [75818]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls


simonw commented Sep 18, 2023

I tried it with a path of / and everything passed - so it's definitely the path of /airtable_refs/airtable_refs (an in-memory database created by an experimental branch of https://github.com/simonw/airtable-export) that triggers the problem.


simonw commented Sep 18, 2023

I knocked it down to 1 retry just to see what happened.


simonw commented Sep 18, 2023

Turned out I wasn't running the datasette from the current directory, so it was not testing what I intended.

Fixed that with pip install -e . in the datasette/ directory.

Now I'm seeing some passes, which look like this:

running  '../airtable-export/testit.sh'
INFO:     Started server process [77810]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls
INFO:     127.0.0.1:59439 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
INFO:     127.0.0.1:59440 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
INFO:     127.0.0.1:59441 - "GET /airtable_refs/airtable_refs HTTP/1.1" 200 OK
  All curl succeeded
Killing datasette server with PID 77810
../airtable-export/testit.sh: line 54: 77810 Killed: 9               datasette pottery2.db -p $port
All three curls succeeded.
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[7463b051cf8d7f856df5eba9f7aa944183ebabe5] Cosmetic tweaks after blacken-docs, refs #1718
running  '../airtable-export/testit.sh'
INFO:     Started server process [77826]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8064 (Press CTRL+C to quit)
  Running curls


simonw commented Sep 18, 2023

OK it looks like it found it!

942411ef946e9a34a2094944d3423cddad27efd3 is the first bad commit

Author: Simon Willison <swillison@gmail.com>
Date:   Tue Apr 26 15:48:56 2022 -0700

    Execute some TableView queries in parallel
    
    Use ?_noparallel=1 to opt out (undocumented, useful for benchmark comparisons)
    
    Refs #1723, #1715

 datasette/views/table.py | 93 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 67 insertions(+), 26 deletions(-)
bisect found first bad commit

942411e does look like the cause of this problem.


simonw commented Sep 18, 2023

I also confirmed that http://127.0.0.1:8064/airtable_refs/airtable_refs?_noparallel=1 does not trigger the bug but http://127.0.0.1:8064/airtable_refs/airtable_refs does.

@simonw simonw changed the title Server hang on some queries to named in-memory databases Server hang on parallel execution of queries to named in-memory databases Sep 18, 2023

simonw commented Sep 18, 2023

Now that I've confirmed that parallel query execution of the kind introduced in 942411e can cause hangs (presumably some kind of locking issue) against in-memory databases, some options:

  1. Disable parallel execution entirely and rip out the related code.
  2. Disable parallel execution by leaving that code in place but having it always behave as if _noparallel=1 was passed.
  3. Continue digging and try to find some way to avoid this problem.

The parallel execution work is something I was playing with last year in the hope of speeding up Datasette pages like the table page which need to execute a bunch of queries - one for each facet, plus one for each column to see if it should be suggested as a facet.

I wrote about this at the time here: https://simonwillison.net/2022/May/6/weeknotes/

My hope was that despite Python's GIL this optimization would still help, because the SQLite C module releases the GIL once it gets to SQLite.
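The idea can be sketched like this (illustrative only - it opens one fresh connection per query, which is simpler than what Datasette actually does with its connection-per-thread executors):

```python
# Sketch of the optimization idea: run independent read queries on
# threads via run_in_executor; sqlite3's C layer releases the GIL
# while SQLite itself is doing the work.
import asyncio
import sqlite3


async def parallel_reads():
    def query(sql):
        conn = sqlite3.connect(":memory:")  # one connection per thread
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    loop = asyncio.get_running_loop()
    return await asyncio.gather(
        loop.run_in_executor(None, query, "select 1"),
        loop.run_in_executor(None, query, "select 2"),
    )


print(asyncio.run(parallel_reads()))  # [[(1,)], [(2,)]]
```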

But... that didn't hold up. It looked like enough work was happening in Python land with the GIL that the optimization didn't improve things.

Running the nogil fork of Python DID improve things though! I left the code in partly on the hope that the nogil fork would be accepted into Python core.

... which it now has! But it will still be a year or two before it fully lands: https://discuss.python.org/t/a-steering-council-notice-about-pep-703-making-the-global-interpreter-lock-optional-in-cpython/30474

So I'm not particularly concerned about dropping the parallel execution. If I do drop it, though, should I leave in the potentially complex code that relates to it?


simonw commented Sep 18, 2023

Looking again at this code:

def connect(self, write=False):
    if self.memory_name:
        uri = "file:{}?mode=memory&cache=shared".format(self.memory_name)
        conn = sqlite3.connect(
            uri,
            uri=True,
            check_same_thread=False,
        )
        if not write:
            conn.execute("PRAGMA query_only=1")
        return conn
    if self.is_memory:
        return sqlite3.connect(":memory:", uri=True)
    # mode=ro or immutable=1?
    if self.is_mutable:
        qs = "?mode=ro"
        if self.ds.nolock:
            qs += "&nolock=1"
    else:
        qs = "?immutable=1"
    assert not (write and not self.is_mutable)
    if write:
        qs = ""
    if self.mode is not None:
        qs = f"?mode={self.mode}"
    conn = sqlite3.connect(
        f"file:{self.path}{qs}", uri=True, check_same_thread=False
    )
    self._all_file_connections.append(conn)
    return conn

check_same_thread=False really stands out here.

Python docs at https://docs.python.org/3/library/sqlite3.html

check_same_thread (bool) -- If True (default), ProgrammingError will be raised if the database connection is used by a thread other than the one that created it. If False, the connection may be accessed in multiple threads; write operations may need to be serialized by the user to avoid data corruption. See threadsafety for more information.

I think I'm playing with fire by allowing multiple threads to access the same connection without doing my own serialization of those requests.

I do do that using the write connection - and in this particular case the bug isn't coming from write queries, it's coming from read queries - but perhaps SQLite has issues with threading for reads, too.
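A minimal sketch of what that serialization could look like - a hypothetical LockedConnection helper, not Datasette's actual code:

```python
# Sketch: serialize access to a single check_same_thread=False connection
# with a lock, since a Python sqlite3 connection must not be used from
# two threads at once unless SQLite is in serialized mode.
import sqlite3
import threading


class LockedConnection:
    def __init__(self, uri):
        self.conn = sqlite3.connect(uri, uri=True, check_same_thread=False)
        self.lock = threading.Lock()

    def fetchall(self, sql, params=()):
        with self.lock:  # only one thread touches the connection at a time
            return self.conn.execute(sql, params).fetchall()


db = LockedConnection("file:demo?mode=memory&cache=shared")
db.fetchall("create table t (id integer)")
db.fetchall("insert into t values (1)")

# Hammer the shared connection from several threads at once
threads = [
    threading.Thread(target=db.fetchall, args=("select * from t",))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db.fetchall("select count(*) from t"))  # [(1,)]
```

Datasette's write path already funnels all writes through a single dedicated thread; the sketch above is the equivalent discipline applied to reads.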


simonw commented Sep 18, 2023

Using SQLite In Multi-Threaded Applications

That indicates that there's a SQLite option for "Serialized" mode where it's safe to access anything SQLite provides from multiple threads, but as far as I can tell Python doesn't give you an option to turn that mode on or off for a connection - you can read sqlite3.threadsafety to see if that mode was compiled in or not, but not actually change it.

On my Mac sqlite3.threadsafety returns 1 which means https://docs.python.org/3/library/sqlite3.html#sqlite3.threadsafety "Multi-thread: In this mode, SQLite can be safely used by multiple threads provided that no single database connection is used simultaneously in two or more threads." - it would need to return 3 for that serialized mode.
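A quick way to check what a given build supports, with the values as documented for the sqlite3 module:

```python
# Check which SQLite threading mode this Python build was compiled with:
# 0 = single-thread, 1 = multi-thread (one connection per thread),
# 3 = serialized (a single connection may be shared across threads).
import sqlite3

print(sqlite3.threadsafety)
if sqlite3.threadsafety < 3:
    print("Sharing one connection across threads needs external locking")
```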


simonw commented Sep 18, 2023

My current hunch is that SQLite gets unhappy if multiple threads access the same underlying C object - which sometimes happens with in-memory connections and Datasette presumably because they are faster than file-backed databases.

I'm going to remove the asyncio.gather() code from the table view. I'll ship a 0.x release with that fix too.
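The shape of that fix is straightforward - a sketch with stand-in coroutines rather than the real execute_facets()/execute_suggested_facets():

```python
# Sketch of the fix: await the two facet steps one after the other
# instead of concurrently with asyncio.gather().
import asyncio


async def execute_facets():  # stand-in for the real coroutine
    return "facet results"


async def execute_suggested_facets():  # stand-in for the real coroutine
    return "suggested facets"


async def data():
    # Before: await gather(execute_facets(), execute_suggested_facets())
    facets = await execute_facets()
    suggested = await execute_suggested_facets()
    return facets, suggested


print(asyncio.run(data()))  # ('facet results', 'suggested facets')
```

Sequential awaits give up the (in practice negligible, under the GIL) speedup but mean only one query ever runs against a connection at a time.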


simonw commented Sep 18, 2023

The one other thing affected by this change is this documentation, which suggests a not-actually-safe pattern:

datasette/docs/internals.rst

Lines 1292 to 1321 in 6ed7908

This example uses the :ref:`register_routes() <plugin_register_routes>` plugin hook to add a page at ``/parallel-queries`` which executes two SQL queries in parallel using ``asyncio.gather()`` and returns their results.

.. code-block:: python

    from datasette import hookimpl
    from datasette import tracer


    @hookimpl
    def register_routes():
        async def parallel_queries(datasette):
            db = datasette.get_database()
            with tracer.trace_child_tasks():
                one, two = await asyncio.gather(
                    db.execute("select 1"),
                    db.execute("select 2"),
                )
            return Response.json(
                {
                    "one": one.single_value(),
                    "two": two.single_value(),
                }
            )

        return [
            (r"/parallel-queries$", parallel_queries),
        ]

Adding ``?_trace=1`` will show that the trace covers both of those child tasks.


simonw commented Sep 20, 2023

Added a note to that example in the documentation:

Note that running parallel SQL queries in this way has `been known to cause problems in the past <https://github.com/simonw/datasette/issues/2189>`__, so treat this example with caution.

@yozlet

yozlet commented Sep 20, 2023

/me munches popcorn at a furious rate, utterly enthralled

simonw added a commit that referenced this issue Sep 20, 2023

simonw commented Sep 21, 2023

I'm going to release this in 1.0a7, and I'll backport it to a 0.64.4 release too.

simonw added a commit that referenced this issue Sep 21, 2023

simonw commented Sep 21, 2023

We're planning a breaking change in 1.0a7:

Since that's a breaking change I'm going to ship 1.0a7 right now with this fix, then ship that breaking change as 1.0a8 instead.

simonw added a commit that referenced this issue Sep 21, 2023
simonw added a commit that referenced this issue Sep 21, 2023

simonw commented Sep 21, 2023

1.0a7 is out with this fix as well now: https://docs.datasette.io/en/1.0a7/changelog.html#a7-2023-09-21

@simonw simonw closed this as completed Sep 21, 2023
simonw added a commit that referenced this issue Sep 21, 2023