Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research using CTEs for faster facet counts #1259

Open
simonw opened this issue Mar 12, 2021 · 5 comments
Open

Research using CTEs for faster facet counts #1259

simonw opened this issue Mar 12, 2021 · 5 comments

Comments

@simonw
Copy link
Owner

simonw commented Mar 12, 2021

https://www.sqlite.org/changes.html#version_3_35_0

Add support for the MATERIALIZED and NOT MATERIALIZED hints when specifying common table expressions. The default behavior was formerly NOT MATERIALIZED, but is now changed to MATERIALIZED for CTEs that are used more than once.

If a CTE creates a table that is used multiple time in that query, SQLite will now default to creating a materialized table for the duration of that query.

This could be a big performance boost when applying faceting multiple times against the same query. Consider this example query:

WITH data as (
  select
    *
  from
    [global-power-plants]
),
country_long as (select 
  'country_long' as col, country_long as value, count(*) as c from data group by country_long
  order by c desc limit 10
),
primary_fuel as (
select
  'primary_fuel' as col, primary_fuel as value, count(*) as c from data group by primary_fuel
  order by c desc limit 10
)
select * from primary_fuel union select * from country_long order by col, c desc

https://global-power-plants.datasettes.com/global-power-plants?sql=WITH+data+as+%28%0D%0A++select%0D%0A++++*%0D%0A++from%0D%0A++++%5Bglobal-power-plants%5D%0D%0A%29%2C%0D%0Acountry_long+as+%28select+%0D%0A++%27country_long%27+as+col%2C+country_long+as+value%2C+count%28*%29+as+c+from+data+group+by+country_long%0D%0A++order+by+c+desc+limit+10%0D%0A%29%2C%0D%0Aprimary_fuel+as+%28%0D%0Aselect%0D%0A++%27primary_fuel%27+as+col%2C+primary_fuel+as+value%2C+count%28*%29+as+c+from+data+group+by+primary_fuel%0D%0A++order+by+c+desc+limit+10%0D%0A%29%0D%0Aselect+*+from+primary_fuel+union+select+*+from+country_long+order+by+col%2C+c+desc

Outputs:

col value c
country_long United States of America 8688
country_long China 4235
country_long United Kingdom 2603
country_long Brazil 2360
country_long France 2155
country_long India 1590
country_long Germany 1309
country_long Canada 1159
country_long Spain 829
country_long Russia 545
primary_fuel Solar 9662
primary_fuel Hydro 7155
primary_fuel Wind 5188
primary_fuel Gas 3922
primary_fuel Coal 2390
primary_fuel Oil 2290
primary_fuel Biomass 1396
primary_fuel Waste 1087
primary_fuel Nuclear 198
primary_fuel Geothermal 189
@simonw
Copy link
Owner Author

simonw commented Mar 12, 2021

https://sqlite.org/lang_with.html

Prior to SQLite 3.35.0, all CTEs where treated as if the NOT MATERIALIZED phrase was present

It looks like this optimization is completely unavailable on SQLite prior to 3.35.0 (released 12th March 2021). But I could still rewrite the faceting to work in this way, using the exact same SQL - it would just be significantly faster on 3.35.0+ (assuming it's actually faster in practice - would need to benchmark).

@simonw
Copy link
Owner Author

simonw commented Mar 12, 2021

OK, a better comparison:

https://global-power-plants.datasettes.com/global-power-plants?sql=WITH+data+as+%28%0D%0A++select%0D%0A++++*%0D%0A++from%0D%0A++++%5Bglobal-power-plants%5D%0D%0A%29%2C%0D%0Acountry_long+as+%28select+%0D%0A++%27country_long%27+as+col%2C+country_long+as+value%2C+count%28*%29+as+c+from+data+group+by+country_long%0D%0A++order+by+c+desc+limit+31%0D%0A%29%2C%0D%0Aprimary_fuel+as+%28%0D%0Aselect%0D%0A++%27primary_fuel%27+as+col%2C+primary_fuel+as+value%2C+count%28*%29+as+c+from+data+group+by+primary_fuel%0D%0A++order+by+c+desc+limit+31%0D%0A%29%2C%0D%0Aowner+as+%28%0D%0Aselect%0D%0A++%27owner%27+as+col%2C+owner+as+value%2C+count%28*%29+as+c+from+data+group+by+owner%0D%0A++order+by+c+desc+limit+31%0D%0A%29%0D%0Aselect+*+from+primary_fuel+union+select+*+from+country_long%0D%0Aunion+select+*+from+owner+order+by+col%2C+c+desc calculates facets against three columns. It takes 78.5ms (and 34.5ms when I refreshed it, presumably after warming some SQLite caches of some sort).

https://global-power-plants.datasettes.com/global-power-plants/global-power-plants?_facet=country_long&_facet=primary_fuel&_trace=1&_size=0 shows those facets with size=0 on the SQL query - and shows a SQL trace at the bottom of the page.

The country_long facet query takes 45.36ms, owner takes 38.45ms, primary_fuel takes 49.04ms - so a total of 132.85ms

That's against https://global-power-plants.datasettes.com/-/versions says SQLite 3.27.3 - so even on a SQLite version that doesn't materialize the CTEs there's a significant performance boost to doing all three facets in a single CTE query.

@simonw simonw changed the title Research using materialized CTEs for faster facet counts Research using CTEs for faster facet counts Mar 12, 2021
@simonw
Copy link
Owner Author

simonw commented Mar 12, 2021

I wonder if I could optimize facet suggestion in the same way?

One challenge: the query time limit will apply to the full CTE query, not to the individual columns.

@simonw
Copy link
Owner Author

simonw commented Mar 13, 2021

If all of the facets were being calculated in a single query, I'd be willing to bump the facet time limit up to something a lot higher, maybe even a full second. There's a chance that could work amazingly well with a materialized CTE.

@simonw
Copy link
Owner Author

simonw commented Mar 21, 2021

CTEs were added in 2014-02-03 SQLite 3.8.3 - so I think it's OK to depend on them for Datasette.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant