Refactor out the keyset pagination code #2019

simonw · 2023-02-06T23:04:00Z

While working on:

?_extra= support (draft) #1999

I noticed that some of the most complex code in the existing table view is the code that implements keyset pagination:

Lines 417 to 493 in 0b4a286

    
           # Handle pagination driven by ?_next= 
        
           _next = _next or request.args.get("_next") 
        
           offset = "" 
        
           if _next: 
        
               sort_value = None 
        
               if is_view: 
        
                   # _next is an offset 
        
                   offset = f" offset {int(_next)}" 
        
               else: 
        
                   components = urlsafe_components(_next) 
        
                   # If a sort order is applied and there are multiple components, 
        
                   # the first of these is the sort value 
        
                   if (sort or sort_desc) and (len(components) > 1): 
        
                       sort_value = components[0] 
        
                       # Special case for if non-urlencoded first token was $null 
        
                       if _next.split(",")[0] == "$null": 
        
                           sort_value = None 
        
                       components = components[1:] 
        
                   # Figure out the SQL for next-based-on-primary-key first 
        
                   next_by_pk_clauses = [] 
        
                   if use_rowid: 
        
                       next_by_pk_clauses.append(f"rowid > :p{len(params)}") 
        
                       params[f"p{len(params)}"] = components[0] 
        
                   else: 
        
                       # Apply the tie-breaker based on primary keys 
        
                       if len(components) == len(pks): 
        
                           param_len = len(params) 
        
                           next_by_pk_clauses.append( 
        
                               compound_keys_after_sql(pks, param_len) 
        
                           ) 
        
                           for i, pk_value in enumerate(components): 
        
                               params[f"p{param_len + i}"] = pk_value 
        
                   # Now add the sort SQL, which may incorporate next_by_pk_clauses 
        
                   if sort or sort_desc: 
        
                       if sort_value is None: 
        
                           if sort_desc: 
        
                               # Just items where column is null ordered by pk 
        
                               where_clauses.append( 
        
                                   "({column} is null and {next_clauses})".format( 
        
                                       column=escape_sqlite(sort_desc), 
        
                                       next_clauses=" and ".join(next_by_pk_clauses), 
        
                                   ) 
        
                               ) 
        
                           else: 
        
                               where_clauses.append( 
        
                                   "({column} is not null or ({column} is null and {next_clauses}))".format( 
        
                                       column=escape_sqlite(sort), 
        
                                       next_clauses=" and ".join(next_by_pk_clauses), 
        
                                   ) 
        
                               ) 
        
                       else: 
        
                           where_clauses.append( 
        
                               "({column} {op} :p{p}{extra_desc_only} or ({column} = :p{p} and {next_clauses}))".format( 
        
                                   column=escape_sqlite(sort or sort_desc), 
        
                                   op=">" if sort else "<", 
        
                                   p=len(params), 
        
                                   extra_desc_only="" 
        
                                   if sort 
        
                                   else " or {column2} is null".format( 
        
                                       column2=escape_sqlite(sort or sort_desc) 
        
                                   ), 
        
                                   next_clauses=" and ".join(next_by_pk_clauses), 
        
                               ) 
        
                           ) 
        
                           params[f"p{len(params)}"] = sort_value 
        
                       order_by = f"{order_by}, {order_by_pks}" 
        
                   else: 
        
                       where_clauses.extend(next_by_pk_clauses) 
        
           where_clause = "" 
        
           if where_clauses: 
        
               where_clause = f"where {' and '.join(where_clauses)} " 
        
           if order_by: 
        
               order_by = f"order by {order_by}"

Extracting that into a utility function would simplify that code a lot.

simonw · 2023-02-06T23:09:51Z

The inputs and outputs for this are pretty complex.

Inputs:

?_next= from the query string
is_view - is this for a table or view? If it's a view it uses offset/limit pagination - which could actually work for arbitrary queries too. Also views could have keyset pagination if they are known to be sorted by a particular column.
sort and sort_desc reflecting the current sort order
use_rowid for if the table is a rowid table with no primary key of its own
pks - the primary keys for the table
params - the current set of parameters, I think used just to count their length so new params can be added as p5 etc without collisions. This could be handled with a s0, s1 etc naming convention instead.

Outputs:

where_clauses - a list of where clauses to add to the query
params - additional parameters to use with the query due to the new where clauses
order_by - the order by clause

simonw · 2023-02-06T23:10:51Z

I should turn sort and sort_desc into an object representing the sort order earlier in the code.

I should also create something that bundles together pks and use_rowid and maybe is_view as well.

simonw · 2023-02-06T23:14:15Z

Crucial utility function:

datasette/datasette/utils/__init__.py

Lines 137 to 160 in 0b4a286

    
           def compound_keys_after_sql(pks, start_index=0): 
        
               # Implementation of keyset pagination 
        
               # See https://github.com/simonw/datasette/issues/190 
        
               # For pk1/pk2/pk3 returns: 
        
               # 
        
               # ([pk1] > :p0) 
        
               #   or 
        
               # ([pk1] = :p0 and [pk2] > :p1) 
        
               #   or 
        
               # ([pk1] = :p0 and [pk2] = :p1 and [pk3] > :p2) 
        
               or_clauses = [] 
        
               pks_left = pks[:] 
        
               while pks_left: 
        
                   and_clauses = [] 
        
                   last = pks_left[-1] 
        
                   rest = pks_left[:-1] 
        
                   and_clauses = [ 
        
                       f"{escape_sqlite(pk)} = :p{i + start_index}" for i, pk in enumerate(rest) 
        
                   ] 
        
                   and_clauses.append(f"{escape_sqlite(last)} > :p{len(rest) + start_index}") 
        
                   or_clauses.append(f"({' and '.join(and_clauses)})") 
        
                   pks_left.pop() 
        
               or_clauses.reverse() 
        
               return "({})".format("\n  or\n".join(or_clauses))

simonw · 2023-02-06T23:21:50Z

Found more logic relating to this:

datasette/datasette/views/table.py

Lines 684 to 732 in 0b4a286

    
           # Pagination next link 
        
           next_value = None 
        
           next_url = None 
        
           if 0 < page_size < len(rows): 
        
               if is_view: 
        
                   next_value = int(_next or 0) + page_size 
        
               else: 
        
                   next_value = path_from_row_pks(rows[-2], pks, use_rowid) 
        
               # If there's a sort or sort_desc, add that value as a prefix 
        
               if (sort or sort_desc) and not is_view: 
        
                   try: 
        
                       prefix = rows[-2][sort or sort_desc] 
        
                   except IndexError: 
        
                       # sort/sort_desc column missing from SELECT - look up value by PK instead 
        
                       prefix_where_clause = " and ".join( 
        
                           "[{}] = :pk{}".format(pk, i) for i, pk in enumerate(pks) 
        
                       ) 
        
                       prefix_lookup_sql = "select [{}] from [{}] where {}".format( 
        
                           sort or sort_desc, table_name, prefix_where_clause 
        
                       ) 
        
                       prefix = ( 
        
                           await db.execute( 
        
                               prefix_lookup_sql, 
        
                               { 
        
                                   **{ 
        
                                       "pk{}".format(i): rows[-2][pk] 
        
                                       for i, pk in enumerate(pks) 
        
                                   } 
        
                               }, 
        
                           ) 
        
                       ).single_value() 
        
                   if isinstance(prefix, dict) and "value" in prefix: 
        
                       prefix = prefix["value"] 
        
                   if prefix is None: 
        
                       prefix = "$null" 
        
                   else: 
        
                       prefix = tilde_encode(str(prefix)) 
        
                   next_value = f"{prefix},{next_value}" 
        
                   added_args = {"_next": next_value} 
        
                   if sort: 
        
                       added_args["_sort"] = sort 
        
                   else: 
        
                       added_args["_sort_desc"] = sort_desc 
        
               else: 
        
                   added_args = {"_next": next_value} 
        
               next_url = self.ds.absolute_url( 
        
                   request, self.ds.urls.path(path_with_replaced_args(request, added_args)) 
        
               ) 
        
               rows = rows[:page_size]

simonw · 2023-02-06T23:42:56Z

Relevant issue:

500 error if sorted by a column not in the ?_col= list #1773

Explains this comment:

datasette/datasette/views/table.py

Line 697 in 0b4a286

# sort/sort_desc column missing from SELECT - look up value by PK instead

simonw · 2023-02-07T02:14:35Z

Maybe the correct level of abstraction here is that pagination is something that happens to a SQL query that is defined as SQL and params, without an order by or limit. That's then wrapped in a sub-select and those things are added to it, plus the necessary where clauses depending on the page.

Need to check that the query plan for pagination of a subquery isn't slower than the plan for pagination as it works today.

simonw · 2023-02-07T02:18:11Z

For the SQL underlying this page (the second page in that compound primary key paginated sequence): https://latest.datasette.io/fixtures/compound_three_primary_keys?_next=a%2Cd%2Cv

The explain for the default query: https://latest.datasette.io/fixtures?sql=explain+select%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%2C%0D%0A++content%0D%0Afrom%0D%0A++compound_three_primary_keys%0D%0Awhere%0D%0A++%28%0D%0A++++%28pk1+%3E+%3Ap0%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3E+%3Ap1%0D%0A++++%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3D+%3Ap1%0D%0A++++++and+pk3+%3E+%3Ap2%0D%0A++++%29%0D%0A++%29%0D%0Aorder+by%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%0D%0Alimit%0D%0A++101&p0=a&p1=d&p2=v

The explain for that query rewritten as this:

explain
select
  *
from
  (
    select
      pk1,
      pk2,
      pk3,
      content
    from
      compound_three_primary_keys
  )
where
  (
    (pk1 > :p0)
    or (
      pk1 = :p0
      and pk2 > :p1
    )
    or (
      pk1 = :p0
      and pk2 = :p1
      and pk3 > :p2
    )
  )
order by
  pk1,
  pk2,
  pk3
limit
  101

https://latest.datasette.io/fixtures?sql=explain+select+*+from+%28select+%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%2C%0D%0A++content%0D%0Afrom%0D%0A++compound_three_primary_keys%0D%0A%29%0D%0A++where%0D%0A++%28%0D%0A++++%28pk1+%3E+%3Ap0%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3E+%3Ap1%0D%0A++++%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3D+%3Ap1%0D%0A++++++and+pk3+%3E+%3Ap2%0D%0A++++%29%0D%0A++%29%0D%0Aorder+by%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%0D%0Alimit%0D%0A++101&p0=a&p1=d&p2=v

Both explains have 31 steps and look pretty much identical.

simonw · 2023-02-07T02:22:11Z

A more complex example: https://latest.datasette.io/fixtures/sortable?_next=0~2E2650566289400591%2Ca%2Cu&_sort=sortable_with_nulls_2

SQL:

select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

https://latest.datasette.io/fixtures?sql=select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable+where+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0A++or%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p0=a&p1=u&p2=0.2650566289400591

Here's the explain: 49 steps long https://latest.datasette.io/fixtures?sql=explain+select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable+where+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0D%0A++or%0D%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

Rewritten with a subselect:

select * from (
  select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable
)
where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

https://latest.datasette.io/fixtures?sql=select+*+from+(%0D%0A++select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable%0D%0A)%0D%0Awhere+(sortable_with_nulls_2+%3E+%3Ap2+or+(sortable_with_nulls_2+%3D+%3Ap2+and+((pk1+%3E+%3Ap0)%0D%0A++or%0D%0A(pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1))))+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

And here's the explain for that - also 49 steps: https://latest.datasette.io/fixtures?sql=explain+select+*+from+%28%0D%0A++select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable%0D%0A%29%0D%0Awhere+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0D%0A++or%0D%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

simonw · 2023-02-07T02:24:46Z

Even more complicated: https://latest.datasette.io/fixtures/sortable?sortable_with_nulls__notnull=1&_next=0~2E692704598586882%2Ce%2Cr&_sort=sortable_with_nulls_2

The rewritten SQL for that is:

select * from (select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where "sortable_with_nulls" is not null)
  where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

And it still has the same number of explain steps as the current SQL witohut the subselect.

simonw · 2023-02-07T02:28:03Z

So I think I can write an abstraction that applies keyset pagination to ANY arbitrary SQL query provided it is given the query, the existing params (so it can pick names for the new params that won't overlap with them), the desired sort order, any existing _next token AND the columns that should be used to tie-break any duplicates.

Those tie breakers will be either the primary key(s) or rowid if none are provided.

What about the case of SQL views, where offset/limit should be used instead? I'm inclined to have that as a separate pagination abstraction entirely, with the calling code deciding which pagination helper to use based on if keyset pagination makes sense or not.

Might be easier to design a class structure for this starting with OffsetPaginator, then using that to inform the design of KeysetPaginator.

Might put these in datasette.utils.pagination to start off with, then maybe extract them out to sqlite-utils later once they've proven themselves.

simonw · 2023-02-07T02:32:36Z

Doing this as a class makes sense to me. There are a few steps:

Instantiate the class with the information it needs, which includes sort order, page size, tiebreaker columns and SQL query and parameters
Generate the new SQL query that will actually be executed - maybe this takes the optional _next parameter? This returns the SQL and params that should be executed, where the SQL now includes pagination logic plus order by and limit
The calling code then gets to execute the SQL query to fetch the rows
Last step: those rows are passed to a paginator method which returns (rows, next) - where rows is the rows truncated to the correct length (really just with the last one cut off if it's too long for the length) and next is either None or a token, depending on if there should be a next page.

simonw · 2023-02-07T18:42:42Z

I'm going to build completely separate tests for this in test_pagination.py.

simonw · 2023-02-07T23:12:40Z

Most complicated example of a paginated query: https://latest.datasette.io/fixtures?sql=select%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++content%2C%0D%0A++sortable%2C%0D%0A++sortable_with_nulls%2C%0D%0A++sortable_with_nulls_2%2C%0D%0A++text%0D%0Afrom%0D%0A++sortable%0D%0Awhere%0D%0A++(%0D%0A++++sortable_with_nulls+is+null%0D%0A++++and+(%0D%0A++++++(pk1+%3E+%3Ap0)%0D%0A++++++or+(%0D%0A++++++++pk1+%3D+%3Ap0%0D%0A++++++++and+pk2+%3E+%3Ap1%0D%0A++++++)%0D%0A++++)%0D%0A++)%0D%0Aorder+by%0D%0A++sortable_with_nulls+desc%2C%0D%0A++pk1%2C%0D%0A++pk2%0D%0Alimit%0D%0A++101&p0=h&p1=r

select
  pk1,
  pk2,
  content,
  sortable,
  sortable_with_nulls,
  sortable_with_nulls_2,
  text
from
  sortable
where
  (
    sortable_with_nulls is null
    and (
      (pk1 > :p0)
      or (
        pk1 = :p0
        and pk2 > :p1
      )
    )
  )
order by
  sortable_with_nulls desc,
  pk1,
  pk2
limit
  101

Generated by this page: https://latest.datasette.io/fixtures/sortable?_next=%24null%2Ch%2Cr&_sort_desc=sortable_with_nulls

The _next= parameter there decodes as $null,h,r - and those components are tilde-encoded, so this can be distinguished from an actual $null value which would be represented as ~24null.

simonw · 2023-02-08T01:28:25Z

Rather than duplicate this rather awful hack:

datasette/datasette/views/table.py

Lines 694 to 714 in 0b4a286

    
           try: 
        
               prefix = rows[-2][sort or sort_desc] 
        
           except IndexError: 
        
               # sort/sort_desc column missing from SELECT - look up value by PK instead 
        
               prefix_where_clause = " and ".join( 
        
                   "[{}] = :pk{}".format(pk, i) for i, pk in enumerate(pks) 
        
               ) 
        
               prefix_lookup_sql = "select [{}] from [{}] where {}".format( 
        
                   sort or sort_desc, table_name, prefix_where_clause 
        
               ) 
        
               prefix = ( 
        
                   await db.execute( 
        
                       prefix_lookup_sql, 
        
                       { 
        
                           **{ 
        
                               "pk{}".format(i): rows[-2][pk] 
        
                               for i, pk in enumerate(pks) 
        
                           } 
        
                       }, 
        
                   ) 
        
               ).single_value()

I'm tempted to say that the code that calls the new pagination helper needs to ensure that the sort or sort_desc columns are selected. If it wants to ditch them later (e.g. because they were not included in ?_col=) it can do that later once the results have come back.

simonw added the refactor label Feb 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor out the keyset pagination code #2019

Refactor out the keyset pagination code #2019

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023 •

edited

Loading

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023 •

edited

Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 8, 2023 •

edited

Loading

Refactor out the keyset pagination code #2019

Refactor out the keyset pagination code #2019

Comments

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023 • edited Loading

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023

simonw commented Feb 6, 2023 • edited Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 • edited Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 • edited Loading

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023

simonw commented Feb 7, 2023 • edited Loading

simonw commented Feb 8, 2023 • edited Loading

simonw commented Feb 6, 2023 •

edited

Loading

simonw commented Feb 6, 2023 •

edited

Loading

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 7, 2023 •

edited

Loading

simonw commented Feb 8, 2023 •

edited

Loading