Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize all those calls to index_list and foreign_key_list #1555

Closed
simonw opened this issue Dec 13, 2021 · 27 comments
Closed

Optimize all those calls to index_list and foreign_key_list #1555

simonw opened this issue Dec 13, 2021 · 27 comments

Comments

@simonw
Copy link
Owner

simonw commented Dec 13, 2021

On the first hit to a restarted index I'm seeing this in the SQL traces: https://latest-with-plugins.datasette.io/github/commits?_trace=1

image

I imagine this could be sped up a lot using tricks like this one from the SQLite documentation: https://sqlite.org/pragma.html#pragfunc

SELECT DISTINCT m.name || '.' || ii.name AS 'indexed-columns'
  FROM sqlite_schema AS m,
       pragma_index_list(m.name) AS il,
       pragma_index_info(il.name) AS ii
 WHERE m.type='table'
 ORDER BY 1;

https://latest-with-plugins.datasette.io/fixtures?sql=SELECT+DISTINCT+m.name+%7C%7C+%27.%27+%7C%7C+ii.name+AS+%27indexed-columns%27%0D%0A++FROM+sqlite_schema+AS+m%2C%0D%0A+++++++pragma_index_list%28m.name%29+AS+il%2C%0D%0A+++++++pragma_index_info%28il.name%29+AS+ii%0D%0A+WHERE+m.type%3D%27table%27%0D%0A+ORDER+BY+1%3B

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

I think all of these queries happen in one place - in the populate_schema_tables() function - so optimizing them might be localized to just that area of the code, which would be nice:

async def populate_schema_tables(internal_db, db):
database_name = db.name
await internal_db.execute_write(
"DELETE FROM tables WHERE database_name = ?", [database_name], block=True
)
tables = (await db.execute("select * from sqlite_master WHERE type = 'table'")).rows
for table in tables:
table_name = table["name"]
await internal_db.execute_write(
"""
INSERT INTO tables (database_name, table_name, rootpage, sql)
values (?, ?, ?, ?)
""",
[database_name, table_name, table["rootpage"], table["sql"]],
block=True,
)
# And the columns
await internal_db.execute_write(
"DELETE FROM columns WHERE database_name = ? and table_name = ?",
[database_name, table_name],
block=True,
)
columns = await db.table_column_details(table_name)
for column in columns:
params = {
**{"database_name": database_name, "table_name": table_name},
**column._asdict(),
}
await internal_db.execute_write(
"""
INSERT INTO columns (
database_name, table_name, cid, name, type, "notnull", default_value, is_pk, hidden
) VALUES (
:database_name, :table_name, :cid, :name, :type, :notnull, :default_value, :is_pk, :hidden
)
""",
params,
block=True,
)
# And the foreign_keys
await internal_db.execute_write(
"DELETE FROM foreign_keys WHERE database_name = ? and table_name = ?",
[database_name, table_name],
block=True,
)
foreign_keys = (
await db.execute(f"PRAGMA foreign_key_list([{table_name}])")
).rows
for foreign_key in foreign_keys:
params = {
**{"database_name": database_name, "table_name": table_name},
**dict(foreign_key),
}
await internal_db.execute_write(
"""
INSERT INTO foreign_keys (
database_name, table_name, "id", seq, "table", "from", "to", on_update, on_delete, match
) VALUES (
:database_name, :table_name, :id, :seq, :table, :from, :to, :on_update, :on_delete, :match
)
""",
params,
block=True,
)
# And the indexes
await internal_db.execute_write(
"DELETE FROM indexes WHERE database_name = ? and table_name = ?",
[database_name, table_name],
block=True,
)
indexes = (await db.execute(f"PRAGMA index_list([{table_name}])")).rows
for index in indexes:
params = {
**{"database_name": database_name, "table_name": table_name},
**dict(index),
}
await internal_db.execute_write(
"""
INSERT INTO indexes (
database_name, table_name, seq, name, "unique", origin, partial
) VALUES (
:database_name, :table_name, :seq, :name, :unique, :origin, :partial
)
""",
params,
block=True,
)

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

I was thinking it might even be possible to convert this into a insert into tables select from ... query:

tables = (await db.execute("select * from sqlite_master WHERE type = 'table'")).rows
for table in tables:
table_name = table["name"]
await internal_db.execute_write(
"""
INSERT INTO tables (database_name, table_name, rootpage, sql)
values (?, ?, ?, ?)
""",
[database_name, table_name, table["rootpage"], table["sql"]],
block=True,
)

But the SELECT runs against a separate database from the INSERT INTO, so I would have to setup a cross-database connection for this which feels a little too complicated.

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

I wonder why the INSERT INTO queries don't show up in that ?trace=1 view?

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Here's why - trace only applies to read, not write SQL operations:

with trace("sql", database=self.name, sql=sql.strip(), params=params):
results = await self.execute_fn(sql_operation_in_thread)
return results

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Maybe it would be worth exploring attaching each DB in turn to the _internal connection in order to perform these queries faster.

I'm a bit worried about leaks though: the internal database isn't meant to be visible, even temporarily attaching another DB to it could cause SQL queries against that DB to be able to access the internal data.

So maybe instead the _internal connection gets to connect to the other DBs? There's a maximum of ten there I think, which is good for most but not all cases. But the cases with the most connected databases will see the worst performance!

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Now that trace sees write queries (#1568) it's clear that there is a whole lot more DB activity then I had realized:

image

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

A simpler optimization would be just to turn all of those column and index reads into a single efficient UNION query against each database, then figure out the most efficient pattern to send them all as writes in one go as opposed to calling .execute_write() in a loop.

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Also: running all of those CREATE TABLE IF NOT EXISTS in a single call to conn.executescript() rather than as separate queries may speed things up too.

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

One downside of conn.executescript() is that it won't be picked up by the tracing mechanism - in fact nothing that uses await db.execute_write_fn(fn, block=True) or await db.execute_fn(fn, block=True) gets picked up by tracing.

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Idea: teach execute_write to accept an optional executescript=True parameter, like this:

diff --git a/datasette/database.py b/datasette/database.py
index 468e936..1a424f5 100644
--- a/datasette/database.py
+++ b/datasette/database.py
@@ -94,10 +94,14 @@ class Database:
             f"file:{self.path}{qs}", uri=True, check_same_thread=False
         )
 
-    async def execute_write(self, sql, params=None, block=False):
+    async def execute_write(self, sql, params=None, executescript=False, block=False):
+        assert not executescript and params, "Cannot use params with executescript=True"
         def _inner(conn):
             with conn:
-                return conn.execute(sql, params or [])
+                if executescript:
+                    return conn.executescript(sql)
+                else:
+                    return conn.execute(sql, params or [])
 
         with trace("sql", database=self.name, sql=sql.strip(), params=params):
             results = await self.execute_write_fn(_inner, block=block)

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

image

Using executescript=True that call now takes 1.89ms to create all of those tables.

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

The implementation of cursor.executemany() looks very efficient - it turns into a call to this C function with multiple set to 1: https://github.com/python/cpython/blob/e002bbc6cce637171fb2b1391ffeca8643a13843/Modules/_sqlite/cursor.c#L468-L469

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

Rather than adding a executemany=True parameter, I'm now thinking a better design might be to have three methods:

  • db.execute_write(sql, params=None, block=False)
  • db.execute_writescript(sql, block=False)
  • db.execute_writemany(sql, params_seq, block=False)

@simonw
Copy link
Owner Author

simonw commented Dec 18, 2021

That's a good optimization. Still need to deal with the huge flurry of PRAGMA queries though before I can consider this done.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

To list all indexes: https://latest.datasette.io/fixtures?sql=SELECT%0D%0A++sqlite_master.name%2C%0D%0A++index_list.*%0D%0AFROM%0D%0A++sqlite_master%2C%0D%0A++pragma_index_list%28sqlite_master.name%29+AS+index_list%0D%0AWHERE%0D%0A++sqlite_master.type+%3D+%27table%27

SELECT
  sqlite_master.name,
  index_list.*
FROM
  sqlite_master,
  pragma_index_list(sqlite_master.name) AS index_list
WHERE
  sqlite_master.type = 'table'

Foreign keys: https://latest.datasette.io/fixtures?sql=SELECT%0D%0A++sqlite_master.name%2C%0D%0A++foreign_key_list.*%0D%0AFROM%0D%0A++sqlite_master%2C%0D%0A++pragma_foreign_key_list%28sqlite_master.name%29+AS+foreign_key_list%0D%0AWHERE%0D%0A++sqlite_master.type+%3D+%27table%27

SELECT
  sqlite_master.name,
  foreign_key_list.*
FROM
  sqlite_master,
  pragma_foreign_key_list(sqlite_master.name) AS foreign_key_list
WHERE
  sqlite_master.type = 'table'

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Table columns is a bit harder, because table_xinfo is only in SQLite 3.26.0 or higher:

def table_column_details(conn, table):
if supports_table_xinfo():
# table_xinfo was added in 3.26.0
return [
Column(*r)
for r in conn.execute(
f"PRAGMA table_xinfo({escape_sqlite(table)});"
).fetchall()
]
else:
# Treat hidden as 0 for all columns
return [
Column(*(list(r) + [0]))
for r in conn.execute(
f"PRAGMA table_info({escape_sqlite(table)});"
).fetchall()
]

So if that function is available: https://latest.datasette.io/fixtures?sql=SELECT%0D%0A++sqlite_master.name%2C%0D%0A++table_xinfo.*%0D%0AFROM%0D%0A++sqlite_master%2C%0D%0A++pragma_table_xinfo%28sqlite_master.name%29+AS+table_xinfo%0D%0AWHERE%0D%0A++sqlite_master.type+%3D+%27table%27

SELECT
  sqlite_master.name,
  table_xinfo.*
FROM
  sqlite_master,
  pragma_table_xinfo(sqlite_master.name) AS table_xinfo
WHERE
  sqlite_master.type = 'table'

And otherwise, using table_info: https://latest.datasette.io/fixtures?sql=SELECT%0D%0A++sqlite_master.name%2C%0D%0A++table_info.*%2C%0D%0A++0+as+hidden%0D%0AFROM%0D%0A++sqlite_master%2C%0D%0A++pragma_table_info%28sqlite_master.name%29+AS+table_info%0D%0AWHERE%0D%0A++sqlite_master.type+%3D+%27table%27

SELECT
  sqlite_master.name,
  table_info.*,
  0 as hidden
FROM
  sqlite_master,
  pragma_table_info(sqlite_master.name) AS table_info
WHERE
  sqlite_master.type = 'table'

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

One thing to watch out for though, from https://sqlite.org/pragma.html#pragfunc

The table-valued functions for PRAGMA feature was added in SQLite version 3.16.0 (2017-01-02). Prior versions of SQLite cannot use this feature.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

This caught me out once before in:

Turns out Glitch was running SQLite 3.11.0 from 2016-02-15.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

If I want to continue supporting SQLite prior to 3.16.0 (2017-01-02) I'll need this optimization to only kick in with versions that support table-valued PRAGMA functions, while keeping the old PRAGMA foreign_key_list(table) stuff working for those older versions.

That's feasible, but it's a bit more work - and I need to make sure I have robust testing in place for SQLite 3.15.0.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

On that same documentation page I just spotted this:

This feature is experimental and is subject to change. Further documentation will become available if and when the table-valued functions for PRAGMAs feature becomes officially supported.

This makes me nervous to rely on pragma function optimizations in Datasette itself.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

I wonder how much overhead there is switching between the async event loop main code and the thread that runs the SQL queries.

Would there be a performance boost if I gathered all of the column/index information in a single function run on the thread using db.execute_fn() I wonder? It would eliminate a bunch of switching between threads.

Would be great to understand how much of an impact that would have.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Using the prototype of this:

I'm seeing about 180ms spent running all of these queries on startup!

CleanShot 2021-12-18 at 19 38 37@2x

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Here's a hacked together prototype of running all of that stuff inside a single function passed to .execute_fn():

diff --git a/datasette/utils/internal_db.py b/datasette/utils/internal_db.py
index 95055d8..58f9982 100644
--- a/datasette/utils/internal_db.py
+++ b/datasette/utils/internal_db.py
@@ -1,4 +1,5 @@
 import textwrap
+from datasette.utils import table_column_details
 
 
 async def init_internal_db(db):
@@ -70,49 +71,70 @@ async def populate_schema_tables(internal_db, db):
         "DELETE FROM tables WHERE database_name = ?", [database_name], block=True
     )
     tables = (await db.execute("select * from sqlite_master WHERE type = 'table'")).rows
-    tables_to_insert = []
-    columns_to_delete = []
-    columns_to_insert = []
-    foreign_keys_to_delete = []
-    foreign_keys_to_insert = []
-    indexes_to_delete = []
-    indexes_to_insert = []
 
-    for table in tables:
-        table_name = table["name"]
-        tables_to_insert.append(
-            (database_name, table_name, table["rootpage"], table["sql"])
-        )
-        columns_to_delete.append((database_name, table_name))
-        columns = await db.table_column_details(table_name)
-        columns_to_insert.extend(
-            {
-                **{"database_name": database_name, "table_name": table_name},
-                **column._asdict(),
-            }
-            for column in columns
-        )
-        foreign_keys_to_delete.append((database_name, table_name))
-        foreign_keys = (
-            await db.execute(f"PRAGMA foreign_key_list([{table_name}])")
-        ).rows
-        foreign_keys_to_insert.extend(
-            {
-                **{"database_name": database_name, "table_name": table_name},
-                **dict(foreign_key),
-            }
-            for foreign_key in foreign_keys
-        )
-        indexes_to_delete.append((database_name, table_name))
-        indexes = (await db.execute(f"PRAGMA index_list([{table_name}])")).rows
-        indexes_to_insert.extend(
-            {
-                **{"database_name": database_name, "table_name": table_name},
-                **dict(index),
-            }
-            for index in indexes
+    def collect_info(conn):
+        tables_to_insert = []
+        columns_to_delete = []
+        columns_to_insert = []
+        foreign_keys_to_delete = []
+        foreign_keys_to_insert = []
+        indexes_to_delete = []
+        indexes_to_insert = []
+
+        for table in tables:
+            table_name = table["name"]
+            tables_to_insert.append(
+                (database_name, table_name, table["rootpage"], table["sql"])
+            )
+            columns_to_delete.append((database_name, table_name))
+            columns = table_column_details(conn, table_name)
+            columns_to_insert.extend(
+                {
+                    **{"database_name": database_name, "table_name": table_name},
+                    **column._asdict(),
+                }
+                for column in columns
+            )
+            foreign_keys_to_delete.append((database_name, table_name))
+            foreign_keys = conn.execute(
+                f"PRAGMA foreign_key_list([{table_name}])"
+            ).fetchall()
+            foreign_keys_to_insert.extend(
+                {
+                    **{"database_name": database_name, "table_name": table_name},
+                    **dict(foreign_key),
+                }
+                for foreign_key in foreign_keys
+            )
+            indexes_to_delete.append((database_name, table_name))
+            indexes = conn.execute(f"PRAGMA index_list([{table_name}])").fetchall()
+            indexes_to_insert.extend(
+                {
+                    **{"database_name": database_name, "table_name": table_name},
+                    **dict(index),
+                }
+                for index in indexes
+            )
+        return (
+            tables_to_insert,
+            columns_to_delete,
+            columns_to_insert,
+            foreign_keys_to_delete,
+            foreign_keys_to_insert,
+            indexes_to_delete,
+            indexes_to_insert,
         )
 
+    (
+        tables_to_insert,
+        columns_to_delete,
+        columns_to_insert,
+        foreign_keys_to_delete,
+        foreign_keys_to_insert,
+        indexes_to_delete,
+        indexes_to_insert,
+    ) = await db.execute_fn(collect_info)
+
     await internal_db.execute_write_many(
         """
         INSERT INTO tables (database_name, table_name, rootpage, sql)

First impressions: it looks like this helps a lot - as far as I can tell this is now taking around 21ms to get to the point at which all of those internal databases have been populated, where previously it took more than 180ms.

CleanShot 2021-12-18 at 19 47 22@2x

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

It's a bit annoying that the queries no longer show up in the trace at all now, thanks to running in .execute_fn(). I wonder if there's something smart I can do about that - maybe have trace() record that function with a traceback even though it doesn't have the executed SQL string?

5fac26a has the same problem.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Another option would be to provide an abstraction that makes it easier to run a group of SQL queries in the same thread at the same time, and have them traced correctly.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Using #1571 showed me that the DELETE FROM columns/foreign_keys/indexes WHERE database_name = ? and table_name = ? queries were running way more times than I expected. I came up with a new optimization that just does DELETE FROM columns/foreign_keys/indexes WHERE database_name = ? instead.

@simonw
Copy link
Owner Author

simonw commented Dec 19, 2021

Closing this issue because I've optimized this a whole bunch, and it's definitely good enough for the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant