Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix all the places that currently use .inspect() data #420

Closed
simonw opened this issue Mar 17, 2019 · 13 comments

Comments

Projects
None yet
1 participant
@simonw
Copy link
Owner

commented Mar 17, 2019

See #419: if Datasette is going to work against mutable SQLite files it can no longer assume that the .inspect() method will have cached the correct schema for all tables in all attached databases.

So everywhere in the code at the moment that relies on .inspect() data needs to be modified to use live introspection of the schema instead.

From a comment later on: here are the uses I need to fix as a checklist:

  • table_exists()
  • info["file"] in .execute()
  • resolve_db_name()
  • .database_url(database)
  • DatabaseDownload file path
  • sortable_columns_for_table() uses it to find the columns in a table
  • expandable_columns() uses it to find foreign keys
  • expand_foreign_keys() uses it to find foreign keys
  • display_columns_and_rows() uses it to find primary keys and foreign keys... but also has access to a cursor.description which it uses to list the columns
  • TableView.data uses it to lookup columns and primary keys and the table_rows_count (used if the thing isn't a view) and probably a few more things, this method is huge!
  • RowView.data uses it for primary keys
  • foreign_key_tables() uses it for foreign keys
  • DatabaseView list of tables
  • IndexView
@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 17, 2019

Some examples:

def sortable_columns_for_table(self, database, table, use_rowid):
table_metadata = self.table_metadata(database, table)
if "sortable_columns" in table_metadata:
sortable_columns = set(table_metadata["sortable_columns"])
else:
table_info = self.ds.inspect()[database]["tables"].get(table) or {}
sortable_columns = set(table_info.get("columns", []))

def expandable_columns(self, database, table):
# Returns list of (fk_dict, label_column-or-None) pairs for that table
tables = self.ds.inspect()[database].get("tables", {})
table_info = tables.get(table)

async def expand_foreign_keys(self, database, table, column, values):
"Returns dict mapping (column, value) -> label"
labeled_fks = {}
tables_info = self.ds.inspect()[database]["tables"]

async def display_columns_and_rows(
self,
database,
table,
description,
rows,
link_column=False,
truncate_cells=0,
):
"Returns columns, rows for specified table - including fancy foreign key treatment"
table_metadata = self.table_metadata(database, table)
info = self.ds.inspect()[database]

class IndexView(RenderMixin):
def __init__(self, datasette):
self.ds = datasette
async def get(self, request, as_format):
databases = []
for key, info in sorted(self.ds.inspect().items()):
tables = [t for t in info["tables"].values() if not t["hidden"]]

def resolve_db_name(self, db_name, **kwargs):
databases = self.ds.inspect()
hash = None
name = None
if "-" in db_name:

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 17, 2019

Needed for #419

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 18, 2019

Maybe this is a good opportunity to improve the introspection capabilities in sqlite-utils and add it as a dependency.

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 19, 2019

I systematically reviewed the codebase for things that .inspect() is used for:

In app.py:

  • table_exists() uses table in self.inspect().get(database, {}).get("tables")
  • .execute() looks up the database name to get the info["file"] (the correct filename with the .db extension)

In cli.py:

  • The datasette inspect command dumps it to JSON
  • datasette skeleton iterates over it
  • datasette serve calls it on startup (to populate static cache of inspect data)

In base.py:

  • .database_url(database) calls it to lookup the hash (if hash_urls config turned on)
  • .resolve_db_name() uses it to lookup the hash

In database.py:

  • DatabaseView uses it to find up the list of tables and views to display, plus the size of the DB file in bytes
  • DatabaseDownload uses it to get the filepath for download

In index.py:

  • IndexView uses it extensively - to loop through every database and every table. This would make a good starting point for the refactor.

In table.py:

  • sortable_columns_for_table() uses it to find the columns in a table
  • expandable_columns() uses it to find foreign keys
  • expand_foreign_keys() uses it to find foreign keys
  • display_columns_and_rows() uses it to find primary keys and foreign keys... but also has access to a cursor.description which it uses to list the columns
  • TableView.data uses it to lookup columns and primary keys and the table_rows_count (used if the thing isn't a view) and probably a few more things, this method is huge!
  • RowView.data uses it for primary keys
  • foreign_key_tables() uses it for foreign keys

In the tests it's used by test_api.test_inspect_json() and by a couple of tests in test_inspect.

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 19, 2019

Most of these can be replaced with relatively straight-forward direct introspection of the SQLite table.

The one exception is the incoming foreign keys: these can only be found by inspecting ALL of the other tables.

This requires running PRAGMA foreign_key_list([table_name]) against every other table in the database. How expensive is doing this on a database with hundreds of tables?

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 19, 2019

A microbenchmark against fivethirtyeight.db (415 tables):

In [1]: import sqlite3                                                                                              
In [2]: c = sqlite3.connect("fivethirtyeight.db")                                                                   
In [3]: %timeit c.execute("select name from sqlite_master where type = 'table'").fetchall()                         
283 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: tables = [r[0] for r in c.execute("select name from sqlite_master where type = 'table'").fetchall()]        
In [5]: len(tables)                                                                                                 
Out[5]: 415
In [6]: %timeit [c.execute("pragma foreign_keys([{}])".format(t)).fetchall() for t in tables]                       
1.81 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So running pragma foreign_keys() against 415 tables only takes 1.81ms. This is going to be fine.

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 28, 2019

I started looking at how I would implement table_exists() with a direct call that uses sqlite-utils to see if a table exists.

datasette/datasette/app.py

Lines 303 to 304 in 82fec60

def table_exists(self, database, table):
return table in self.inspect().get(database, {}).get("tables")

sqlite-utils needs access to the database connection - but the database connection itself is currently only available in code that runs in a thread inside the .execute() method:

datasette/datasette/app.py

Lines 413 to 426 in 82fec60

async def execute(
self,
db_name,
sql,
params=None,
truncate=False,
custom_time_limit=None,
page_size=None,
):
"""Executes sql against db_name in a thread"""
page_size = page_size or self.page_size
def sql_operation_in_thread():
conn = getattr(connections, db_name, None)

So I'm going to need to refactor this a bit. I think I need a way to say "here is a function which needs access to the connection object for database named X - run that function in a thread, give it access to that connection and then give me back the result".

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 28, 2019

Even more tricky: table_exists() is currently a synchronous function. If it's going to be executing a SQL query it needs to become an async function.

simonw added a commit that referenced this issue Mar 31, 2019

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 31, 2019

Next I need to fix this:

datasette/datasette/app.py

Lines 420 to 435 in 0209a0a

conn = getattr(connections, db_name, None)
if not conn:
info = self.inspect()[db_name]
if info["file"] == ":memory:":
conn = sqlite3.connect(":memory:")
else:
# mode=ro or immutable=1?
if info["file"] in self.immutables:
qs = "immutable=1"
else:
qs = "mode=ro"
conn = sqlite3.connect(
"file:{}?{}".format(info["file"], qs),
uri=True,
check_same_thread=False,
)

Given the name of the database (from the URL e.g. https://latest.datasette.io/fixtures) I need to figure out what name I used to cache the collection.

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Mar 31, 2019

This means the Datasette class needs a new property, keeping track of all of the connected databases.

ds.databases = {
    "name_used_in_urls": {
        "type": "file", # or "memory"
        "path": filepath # or None if memory
        "mutable": True # or False,
        "hash": "..." # or None if mutable
    }
}

Maybe these should be objects, not dictionaries.

simonw added a commit that referenced this issue Mar 31, 2019

simonw added a commit that referenced this issue Mar 31, 2019

simonw added a commit that referenced this issue Apr 1, 2019

simonw added a commit that referenced this issue Apr 7, 2019

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Apr 7, 2019

expand_foreign_keys() relies on the .inspect() command having automatically derived the label_column for a table, which it does using this code:

def detect_label_column(column_names):
""" Detect the label column - which we display as the label for a joined column.
If a table has two columns, one of which is ID, then label_column is the other one.
"""
if (column_names and len(column_names) == 2 and "id" in column_names):
return [c for c in column_names if c != "id"][0]
return None

This needs access to the column names for the table. I think we can drop this entirely in favour of a new utility function - and that function can incorporate the metadata check as well.

simonw added a commit that referenced this issue Apr 7, 2019

simonw added a commit that referenced this issue Apr 7, 2019

simonw added a commit that referenced this issue Apr 7, 2019

simonw added a commit that referenced this issue Apr 7, 2019

@simonw

This comment has been minimized.

Copy link
Owner Author

commented Apr 7, 2019

Still need to solve: TableView.data() - but this is the one with a row count in hence the need to solve #422

simonw added a commit that referenced this issue Apr 7, 2019

simonw added a commit that referenced this issue Apr 7, 2019

TableView.data() no longer uses .inspect, refs #420
BUT... it does a count(*) against the whole table which may take unbounded time.

Fixing this is part of #422
@simonw

This comment has been minimized.

Copy link
Owner Author

commented Apr 9, 2019

Efficient row counts are even more important for the DatabaseView and IndexView pages.

The row counts on those pages don't have to be precise, so one option is for me to calculate them and cache them occasionally. I could even have a dedicated thread which just does the counting?

In #422 I've figured out a mechanism for getting accurate or lower-bound counts within a time limit (accurate if possible, lower-bound otherwise).

simonw added a commit that referenced this issue May 2, 2019

Index page no longer uses inspect data - refs #420
Also introduced a mechanism whereby table counts are calculated against a time limit
but immutable databases have their table counts calculated on server startup.

@simonw simonw closed this in 033cf0b May 2, 2019

simonw added a commit that referenced this issue May 2, 2019

simonw added a commit that referenced this issue May 2, 2019

Entirely removed table_rows_count table property
We were not displaying this anywhere, and it is now expensive to calculate.

Refs #419, #420

@simonw simonw added this to the 0.28 milestone May 19, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.