Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to sort (and paginate) by column #189

Closed
7 tasks done
simonw opened this issue Mar 28, 2018 · 31 comments
Closed
7 tasks done

Ability to sort (and paginate) by column #189

simonw opened this issue Mar 28, 2018 · 31 comments
Assignees
Labels

Comments

@simonw
Copy link
Owner

simonw commented Mar 28, 2018

As requested in #185 (comment)

I've previously avoided this for performance reasons: sort-by-column on a column without an index is likely to perform badly for hundreds of thousands of rows.

That's not a good enough reason to avoid the feature entirely though. A few options:

  • Allow sort-by-column by default, give users the option to disable it for specific tables/columns
  • Disallow sort-by-column by default, give users option (probably in metadata.json) to enable it for specific tables/columns
  • Automatically detect if a column either has an index on it OR a table has less than X rows in it

We already have the mechanism in place to cut off SQL queries that take more than X seconds, so if someone DOES try to sort by a column that's too expensive it won't actually hurt anything - but it would be nice to not show people a "sort" option which is guaranteed to throw a timeout error.

The vast majority of datasette usage that I've seen so far is on smaller datasets where the performance penalties of sort-by-column are extremely unlikely to show up.


Still left to do:

  • UI that shows which sort order is currently being applied (in HTML and in JSON)
  • UI for applying a sort order (with rel=nofollow to avoid Google crawling it)
  • Sort column names should be escaped correctly in generated SQL
  • Validation that the selected sort order is a valid column
  • Throw error if user attempts to apply _sort AND _sort_desc at the same time
  • Ability to disable sorting (or sort only for specific columns) in metadata.json
  • Fix "201 rows where sorted by sortable_with_nulls " bug
@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

Given how unlikely it is that this will pose a real problem I think I like option 1: enable sort-by-column by default for all tables, then allow power users to instead switch to explicit enabling of the functionality in their metadata.json if they know their data is too big.

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

I think this can work with a ?_sort=xxx parameter - and ?_sort=-xxx to sort in the opposite direction.

I'd like to support "sort by X descending, then by Y ascending if there are dupes for X" as well. Two ways that could work:

?_sort=-xxx,yyy

Or...

?_sort=-xxx&_sort=yyy

The second option is probably better in that it makes it easier for columns to have a comma in their name.

Is it possible for a SQLite column to start with a - character?

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

Might have to do something special to get sort-by-nulls-last: https://stackoverflow.com/questions/12503120/how-to-do-nulls-last-in-sqlite

order by ifnull(column_name, -999999)

Would need to figure out a smart way to get the default value - maybe by running a min() or max() against the column first?

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

This is a better pattern as you don't have to pick a minimum value:

ORDER BY CASE WHEN SOMECOL IS NULL THEN 1 ELSE 0 END, SOMECOL

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

I think there are actually four kinds of sort order we need to support;

  • ascending
  • descending
  • ascending, nulls last
  • descending, nulls last

It looks like [-blah] is a valid SQLite table name, so mark I descending with a hyphen prefix isn't good.

Instead, maybe this:

?_sort_asc=col1&_sort_desc_nulls_last=col2

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

I'd like to continue to support _next=token pagination even for custom sort orders.

To do that I should include rowid (or general primary key) as the tie breaker on all sorts so I can incorporate that it into the _next= token.

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

In terms of user interface: the obvious place to put this is as a drop down menu on the column headers.

This also means the UI can support combined sort orders. Assuming you are already sorted by county descending and you select the candidate column header, the options could be:

  • sort all by candidate
  • sort all by candidate, descending
  • sort by county descending, then by candidate
  • sort by county descending, then by candidate descending

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

I'm tempted to put these verbose sorting options inline in the page HTML but have them in the table footer so they don't clog up the top half of the page with uninteresting links - then use JavaScript to hoik them out into a dropdown menu attached to each column header.

@simonw
Copy link
Owner Author

simonw commented Mar 28, 2018

There is one other interesting option for auto-enabling/disabling sort: the inspect command could include data about column index presence and whether or not a column has any null values in it.

This would allow us to dynamically include a "nulls last" option but only for columns that contain at least one null.

It's quite a lot of additional engineering for a very minor feature though, so I think I'll punt on that for the moment.

We may find that the _group_count feature can benefit from column value statistics later on though.

@simonw
Copy link
Owner Author

simonw commented Mar 29, 2018

Alternative idea: by default enable all sorting in the UI.

If a table has more than 100,000 rows disable sorting UI except for columns that have an index.

Allow this to be overridden in metadata.json

@simonw simonw added the medium label Mar 30, 2018
@simonw
Copy link
Owner Author

simonw commented Mar 30, 2018

I'm not entirely sure how to get _next= pagination working against sorted collections when a tie-breaker is needed.

Consider this data:

https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+rowid%2C+*+from+%5Bnfl-wide-receivers%2Fadvanced-historical%5D%0D%0Aorder+by+case+when+career_ranypa+is+null+then+1+else+0+end%2C+career_ranypa%2C+rowid+limit+11

2018-03-29 at 11 46 pm

If the page size was set to 9 rather than 11, the page divide would be between those two rows with the same value in the career_ranypa column. What would the ?_next= token look like such that the correct row would be returned?

@simonw simonw changed the title Ability to sort by column Ability to sort (and paginate) by column Mar 30, 2018
@simonw
Copy link
Owner Author

simonw commented Mar 30, 2018

The problem is that our _next= pagination currently works based on a > - but for this case a >= for the value is needed combined with a > on the tie-breaker (which would be the rowid column).

So I think this is the right SQL:

select rowid, * from [nfl-wide-receivers/advanced-historical]
where career_ranypa >= -6.331167749 and rowid > 2736
order by case when career_ranypa is null then 1 else 0 end, career_ranypa, rowid limit 11

https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+rowid%2C+*+from+%5Bnfl-wide-receivers%2Fadvanced-historical%5D%0D%0Awhere+career_ranypa+%3E%3D+-6.331167749+and+rowid+%3E+2736%0D%0Aorder+by+case+when+career_ranypa+is+null+then+1+else+0+end%2C+career_ranypa%2C+rowid+limit+11

But how do I encode a _next token that means ">= X and > Y"?

@simonw
Copy link
Owner Author

simonw commented Mar 30, 2018

Maybe the answer here is that anything that's encoded in the next token is treated as >= with the exception of columns known to be primary keys, which are treated as >

@simonw
Copy link
Owner Author

simonw commented Mar 30, 2018

Pushed some work-in-progress with failing unit tests here: 2f8359c

Here's a demo: https://datasette-column-sort-wip.now.sh/sortable-4bbaa6f/sortable?_sort=sortable - note that the _sort_desc and _sort_nulls_last options aren't done yet, plus it doesn't correctly paginate (the _next tokens do not yet take sorting into account).

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2018

I'm going to combine the code for explicit sorting with the existing code for _next= pagination - so even tables without an explicit sort order will run through the same code since they are ordered and paginated by primary key.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2018

A common problem with keyset pagination is that it can distort the "total number of rows" logic - every time you navigate to a further page the total rows count can decrease due to the extra arguments in the where clause. The filtered_table_rows value (see #194) calculated using count_sql currently has this problem.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2018

A note about views: a view cannot be paginated using keyset pagination because records returned from a view don't have a primary key - so there's no way to reliably distinguish between _next= records when the sorted column has duplicates with the same value.

Datasette already takes this into account: views are paginated using offset/limit instead. We can continue to do that even for views that have been sorted using a _sort parameter.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2018

To break this up into smaller units, the first implementation of this will only support a single _sort or _sort_desc querystring parameter.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2018

Actually next page SQL when sorting looks more like this:

select rowid, * from [alcohol-consumption/drinks]
where "country" like :p0
and (
    beer_servings > 111
    or (beer_servings = 111 and rowid > 190)
)
order by beer_servings, rowid limit 101

The next page after row 190 with sortable value 111 should show either records that are greater than 111 or records that match 111 but have a greater primary key than the last one seen.

https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+rowid%2C+*+from+%5Balcohol-consumption%2Fdrinks%5D%0D%0Awhere+%22country%22+like+%3Ap0%0D%0Aand+%28%0D%0A++++beer_servings+%3E+111%0D%0A++++or+%28beer_servings+%3D+111+and+rowid+%3E+190%29%0D%0A%29%0D%0Aorder+by+beer_servings%2C+rowid+limit+101&p0=%25a%25

simonw added a commit that referenced this issue Apr 9, 2018
Allows for paginated sorted results based on a specified column.

Refs #189
@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

Demo: senator tweets ordered by number of replies:

https://datasette-issue-189-demo.now.sh/fivethirtyeight-2628db9/twitter-ratio%2Fsenators?_sort_desc=replies

Page 2 (note that since Senators retweet things there are tweets with the same text/number-of-replies but retweeted by different senators that span the page break): https://datasette-issue-189-demo.now.sh/fivethirtyeight-2628db9/twitter-ratio%2Fsenators?_next=8556%2C121799&_sort_desc=replies

@simonw simonw self-assigned this Apr 9, 2018
@simonw simonw added this to the Advanced JSON edition milestone Apr 9, 2018
simonw added a commit that referenced this issue Apr 9, 2018
Plus renamed human_description to human_description_en

Refs #189
@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

Small bug: "201 rows where sorted by sortable_with_nulls" shouldn't have the word "where" in it.

@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

I'm going to split the following out into separate tickets:

  • Ability to sort by multiple columns e.g. ?_sort=name&sort_desc=age&_sort=height
  • Ability to specify nulls last e.g. ?_sort_desc_nulls_last=age

@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

Actually I think I always want nulls last when ordering asc, nulls first when ordering desc.

@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

simonw added a commit that referenced this issue Apr 9, 2018
Verifies that they match an existing column, and only one or the other option
is provided - refs #189

Eses a new DatasetteError exception that closes #193
simonw added a commit that referenced this issue Apr 9, 2018
You can now explicitly set which columns in a table can be used for sorting
using the _sort and _sort_desc arguments using metadata.json:

    {
        "databases": {
            "database1": {
                "tables": {
                    "example_table": {
                        "sortable_columns": [
                            "height",
                            "weight"
                        ]
                    }
                }
            }
        }
    }

Refs #189
simonw added a commit that referenced this issue Apr 9, 2018
simonw added a commit that referenced this issue Apr 9, 2018
We were showing this:

    201 rows where sorted by sortable_with_nulls

We now show this:

    201 rows sorted by sortable_with_nulls
simonw added a commit that referenced this issue Apr 9, 2018
Allows for paginated sorted results based on a specified column.

Refs #189
simonw added a commit that referenced this issue Apr 9, 2018
Plus renamed human_description to human_description_en

Refs #189
simonw added a commit that referenced this issue Apr 9, 2018
Verifies that they match an existing column, and only one or the other option
is provided - refs #189

Eses a new DatasetteError exception that closes #193
simonw added a commit that referenced this issue Apr 9, 2018
You can now explicitly set which columns in a table can be used for sorting
using the _sort and _sort_desc arguments using metadata.json:

    {
        "databases": {
            "database1": {
                "tables": {
                    "example_table": {
                        "sortable_columns": [
                            "height",
                            "weight"
                        ]
                    }
                }
            }
        }
    }

Refs #189
simonw added a commit that referenced this issue Apr 9, 2018
simonw added a commit that referenced this issue Apr 9, 2018
We were showing this:

    201 rows where sorted by sortable_with_nulls

We now show this:

    201 rows sorted by sortable_with_nulls
@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

I've merged this into master.

@simonw simonw closed this as completed Apr 9, 2018
@earthboundkid
Copy link

Awesome!

@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

This is now released in Datasette 0.15 https://github.com/simonw/datasette/releases/tag/0.15

@simonw
Copy link
Owner Author

simonw commented Apr 9, 2018

@earthboundkid
Copy link

I think I found a bug. I tried to sort by middle initial in my salaries set, and many middle initials are null. The next_url gets set by Datasette to:

http://localhost:8001/salaries-d3a5631/2017+Maryland+state+salaries?_next=None%2C391&_sort=middle_initial

But then None is interpreted literally and it tries to find a name with the middle initial "None" and ends up skipping ahead to O on page 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants