Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a row has a primary key of null various things break #2145

Open
simonw opened this issue Aug 18, 2023 · 23 comments
Open

If a row has a primary key of null various things break #2145

simonw opened this issue Aug 18, 2023 · 23 comments
Labels

Comments

@simonw
Copy link
Owner

simonw commented Aug 18, 2023

Stumbled across this while experimenting with datasette-write-ui. The error I got was a 500 on the /db page:

'NoneType' object has no attribute 'encode'

Tracked it down to this code, which assembles the URL for a row page:

def path_from_row_pks(row, pks, use_rowid, quote=True):
"""Generate an optionally tilde-encoded unique identifier
for a row from its primary keys."""
if use_rowid:
bits = [row["rowid"]]
else:
bits = [
row[pk]["value"] if isinstance(row[pk], dict) else row[pk] for pk in pks
]
if quote:
bits = [tilde_encode(str(bit)) for bit in bits]
else:
bits = [str(bit) for bit in bits]
return ",".join(bits)

That's because tilde_encode can't handle None:

@documented
def tilde_encode(s: str) -> str:
"Returns tilde-encoded string - for example ``/foo/bar`` -> ``~2Ffoo~2Fbar``"
return "".join(_tilde_encoder(char) for char in s.encode("utf-8"))

@simonw simonw added the bug label Aug 18, 2023
@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

The big challenge here is what the URL to that row page should look like. How can I encode a None in a form that can be encoded and decoded without clashing with primary keys that are the string None or null?

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Oh wow, null primary keys are bad news... SQLite lets you insert multiple rows with the same null value!

>>> import sqlite_utils
>>> db = sqlite_utils.Database(memory=True)
>>> db["foo"].insert({"id": None, "name": "No ID"}, pk="id")
<Table foo (id, name)>
>>> db.schema
'CREATE TABLE [foo] (\n   [id] TEXT PRIMARY KEY,\n   [name] TEXT\n);'
>>> db["foo"].insert({"id": None, "name": "No ID"}, pk="id")
<Table foo (id, name)>
>>> db.schema
'CREATE TABLE [foo] (\n   [id] TEXT PRIMARY KEY,\n   [name] TEXT\n);'
>>> list(db["foo"].rows)
[{'id': None, 'name': 'No ID'}, {'id': None, 'name': 'No ID'}]
>>> list(db.query('select * from foo where id = null'))
[]
>>> list(db.query('select * from foo where id is null'))
[{'id': None, 'name': 'No ID'}, {'id': None, 'name': 'No ID'}]

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

https://www.sqlite.org/lang_createtable.html#the_primary_key says:

According to the SQL standard, PRIMARY KEY should always imply NOT NULL. Unfortunately, due to a bug in some early versions, this is not the case in SQLite. Unless the column is an INTEGER PRIMARY KEY or the table is a WITHOUT ROWID table or a STRICT table or the column is declared NOT NULL, SQLite allows NULL values in a PRIMARY KEY column. SQLite could be fixed to conform to the standard, but doing so might break legacy applications. Hence, it has been decided to merely document the fact that SQLite allows NULLs in most PRIMARY KEY columns.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

So it sounds like SQLite does ensure that a rowid before it allows a primary key to be null.

So one solution here would be to detect a null primary key and switch that table over to using rowid URLs instead. The key problem we're trying to solve here after all is how to link to a row:

https://latest.datasette.io/fixtures/infinity/1

But when would we run that check? And does every row in the table get a new /rowid/ URL just because someone messed up and inserted a null by mistake?

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Here's a potential solution: make it so ALL rowid tables in SQLite can be optionally addressed by their rowid instead of by their primary key.

Then teach the code that outputs the URL to a row page to spot if there are null primary keys and switch to that alternative addressing mechanism instead.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

The most interesting row URL in the fixtures database right now is this one:

https://latest.datasette.io/fixtures/compound_primary_key/a~2Fb,~2Ec-d

CleanShot 2023-08-18 at 15 34 52@2x

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Looking at the way these URLs work: because the components themselves in a~2Fb,~2Ec-d are tilde-encoded, any character that's "safe" in tilde-encoding could be used to indicate "this is actually a rowid".

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

I just found this and panicked, thinking maybe tilde encoding is a bad idea after all! https://jkorpela.fi/tilde.html

But... "Date of last update: 1999-08-27" - I think I'm OK.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

From reviewing https://simonwillison.net/2022/Mar/19/weeknotes/

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

That's how I chose the tilde character - but it also suggests that I could use - or . or _ for my new rowid encoding.

So maybe /database/table/_4 could indicate "the row with rowid of 4".

No, that doesn't work:

>>> from datasette.utils import tilde_encode
>>> tilde_encode("_")
'_'

I need a character which tilde-encoding does indeed encode.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

>>> tilde_encode("~")
'~7E'
>>> tilde_encode(".")
'~2E'
>>> tilde_encode("-")
'-'

I think . might be the way to do this:

 /database/table/.4

But... I worry about that colliding with my URL routing code that spots the difference between these:

 /database/table/.4
 /database/table/.4.json
 /database/table/.4.csv

etc.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

I could set a rule that extensions (including custom render extensions set by plugins) must not be valid integers, and teach Datasette that /\.\d+ is the indication of a rowid.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Here's the regex in question at the moment:

datasette/datasette/app.py

Lines 1387 to 1390 in 943df09

add_route(
RowView.as_view(self),
r"/(?P<database>[^\/\.]+)/(?P<table>[^/]+?)/(?P<pks>[^/]+?)(\.(?P<format>\w+))?$",
)

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

(?P<pks>[^/]+?) could instead be a regex that is restricted to the tilde-encoded set of characters, or \.\d+.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Also relevant:

_TILDE_ENCODING_SAFE = frozenset(
b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
b"abcdefghijklmnopqrstuvwxyz"
b"0123456789_-"
# This is the same as Python percent-encoding but I removed
# '.' and '~'
)

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Here's a prototype of that:

diff --git a/datasette/app.py b/datasette/app.py
index b2644ace..acc55249 100644
--- a/datasette/app.py
+++ b/datasette/app.py
@@ -1386,7 +1386,7 @@ class Datasette:
         )
         add_route(
             RowView.as_view(self),
-            r"/(?P<database>[^\/\.]+)/(?P<table>[^/]+?)/(?P<pks>[^/]+?)(\.(?P<format>\w+))?$",
+            r"/(?P<database>[^\/\.]+)/(?P<table>[^/]+?)/(?P<pks>[A-Za-z0-9\_\-\~]+|\.\d+)(\.(?P<format>\w+))?$",
         )
         add_route(
             TableInsertView.as_view(self),
@@ -1440,7 +1440,15 @@ class Datasette:
     async def resolve_row(self, request):
         db, table_name, _ = await self.resolve_table(request)
         pk_values = urlsafe_components(request.url_vars["pks"])
-        sql, params, pks = await row_sql_params_pks(db, table_name, pk_values)
+
+        if len(pk_values) == 1 and pk_values[0].startswith("."):
+            # It's a special .rowid value
+            pk_values = (pk_values[0][1:],)
+            sql, params, pks = await row_sql_params_pks(
+                db, table_name, pk_values, rowid=True
+            )
+        else:
+            sql, params, pks = await row_sql_params_pks(db, table_name, pk_values)
         results = await db.execute(sql, params, truncate=True)
         row = results.first()
         if row is None:
diff --git a/datasette/utils/__init__.py b/datasette/utils/__init__.py
index c388673d..96669281 100644
--- a/datasette/utils/__init__.py
+++ b/datasette/utils/__init__.py
@@ -1206,9 +1206,12 @@ def truncate_url(url, length):
     return url[: length - 1] + "…"
 
 
-async def row_sql_params_pks(db, table, pk_values):
+async def row_sql_params_pks(db, table, pk_values, rowid=False):
     pks = await db.primary_keys(table)
-    use_rowid = not pks
+    if rowid:
+        use_rowid = True
+    else:
+        use_rowid = not pks
     select = "*"
     if use_rowid:
         select = "rowid, *"

It works:

CleanShot 2023-08-18 at 15 57 36@2x CleanShot 2023-08-18 at 15 57 53@2x

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Except it looks like the Links from other tables section is broken:

CleanShot 2023-08-18 at 15 58 54@2x

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Creating a quick test database:

sqlite-utils create-table nulls.db nasty id text --pk id
sqlite-utils nulls.db 'insert into nasty (id) values (null)'

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

This is hard. I tried this:

def path_from_row_pks(row, pks, use_rowid, quote=True):
    """Generate an optionally tilde-encoded unique identifier
    for a row from its primary keys."""
    if use_rowid or any(row[pk] is None for pk in pks):
        bits = [row["rowid"]]
    else:
        bits = [
            row[pk]["value"] if isinstance(row[pk], dict) else row[pk] for pk in pks
        ]
    if quote:
        bits = [tilde_encode(str(bit)) for bit in bits]
    else:
        bits = [str(bit) for bit in bits]

    return ",".join(bits)

The if use_rowid or any(row[pk] is None for pk in pks) bit is new.

But I got this error on http://127.0.0.1:8003/nulls/nasty :

  File "/Users/simon/Dropbox/Development/datasette/datasette/views/table.py", line 1364, in run_display_columns_and_rows
    display_columns, display_rows = await display_columns_and_rows(
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/table.py", line 186, in display_columns_and_rows
    pk_path = path_from_row_pks(row, pks, not pks, False)
  File "/Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py", line 124, in path_from_row_pks
    bits = [row["rowid"]]
IndexError: No item with that key

Because the SQL query I ran to populate the page didn't know that it would need to select rowid as well.

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

How expensive is it to detect if a SQLite table contains at least one null primary key?

@simonw
Copy link
Owner Author

simonw commented Aug 18, 2023

Ran a quick benchmark on ChatGPT Code Interpreter: https://chat.openai.com/share/8357dc01-a97e-48ae-b35a-f06249935124

Conclusion from there is that this query returns fast no matter how much the table grows:

SELECT EXISTS(SELECT 1 FROM "nasty" WHERE "id" IS NULL)

So detecting if a table contains any null primary keys is definitely feasible without a performance hit.

@pkulchenko
Copy link

@simonw, since you're referencing "rowid" column by name, I just want to note that there may be an existing rowid column with completely different semantics (https://www.sqlite.org/lang_createtable.html#rowid), which is likely to break this logic. I don't see a good way to detect a proper "rowid" name short of checking if there is a field with that name and using the alternative (_rowid_ or oid), which is not ideal, but may work.

In terms of the original issue, maybe a way to deal with it is to use rowid by default and then use primary key for WITHOUT ROWID tables (as they are guaranteed to be not null), but I suspect it may require significant changes to the API (and doesn't fully address the issue of what value to pass to indicate NULL when editing records). Would it make sense to generate a random string to indicate NULL values when editing?

@simonw
Copy link
Owner Author

simonw commented Aug 21, 2023

Suggestion from @asg017 is that we say that if your row has a null primary key you don't get a link to a row page for that row.

Which has some precedent, because our SQL view display doesn't link to row pages at all (since they don't make sense for views): https://latest.datasette.io/fixtures/simple_view

@asg017
Copy link
Contributor

asg017 commented Aug 21, 2023

Another point: The new Datasette write API should refuse to insert a row with a NULL primary key. That will likely decrease the likelihood someone find themselves with NULLs in their primary keys, at least with Datasette users. Especially buggy code that uses the write API, like our datasette-write-ui bug that led to this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants