[bug] Analyze does not push down the LIMIT clause and results a full scan #276

mmigdiso · 2021-03-02T09:10:44Z

Describe the bug
In the analyze phase, the DatasetAnalyzer runs some count queries using a LIMIT clause to avoid the full scan.
But when I check the query plan, I see that the limit is applied after the results are calculated and that causes a big performance issue for big tables.
Limit should be applied before executing the count/sum operators

To Reproduce
Steps to reproduce the behavior:

run soda analyze
Run explain from a sql client for the query generated by the analyzer.

This is the current plan: (for demodata dataset)

Limit  (cost=2.30..2.31 rows=1 width=16)
--
->  Aggregate  (cost=2.30..2.31 rows=1 width=16)
->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=13)

It should be:

Aggregate  (cost=0.05..0.06 rows=1 width=16)
--
->  Limit  (cost=0.00..0.03 rows=1 width=1569)
->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=1569)

One way of fixing the problem is changing the FROM demodata limit 1000 to from (select * from demodata limit 1) T;

Context
Include your scan.yml or warehouse.yml when relevant

OS:
Python Version:
Soda SQL Version: 2.0.0.b15
Warehouse Type: Redshift + postgresql

The text was updated successfully, but these errors were encountered:

tombaeyens · 2021-03-02T09:50:23Z

Murat, do you think the same solution with the subselect is going to work for most other warehouses as well? Would you expect it to be a generic solution or only for redshift?

tombaeyens · 2021-03-02T09:53:21Z

Thanks for the PR! With our CI we'll just see if it works on the other warehouses

mmigdiso · 2021-03-02T10:00:20Z

Good question @tombaeyens , I can say that this would work for sparksql, hive, mysql. But I don't have experience with Athena, snowflake, bigquery and sql server. Unfortunately I don't have the environment to test the behavior on other engines :(
The sytnax should be ok for any database supporting the subqueries. It would be surprising to see different behaviours in the execution plan of such a straightforward query across different engines.

abuckenheimer · 2021-03-02T18:35:14Z

I was thinking about this for bigquery today, because you pay per terabyte scanned if you soda analyze on a big dataset you can rack up a large bill without knowing. The proposed subquery solution won't break anything but it also wouldn't help either, to reduce the bytes scanned you'd have to be able to pick a partition to limit your selection to.

mmigdiso · 2021-03-02T20:05:13Z

hey @abuckenheimer I see your point. I think your problem can easily be addressed with this feature request: #136. wdyt?

abuckenheimer · 2021-03-02T22:07:31Z

@mmigdiso great idea, I was looking at your PR sodadata/soda-core#135 and figured you could kind of merge that with the idea in #277 if you could interpolate the table name. So instead of:

analyze_templates:
    table1: "select * from table1 where id > 1000 limit 10"

it be:

analyze_templates:
    default: "select * from {} limit 10"
    table1: "select * from table1 where id > 1000 limit 10"

Then you could pick whatever works best for your base case.

mmigdiso added bug Something isn't working soda-sql labels Mar 2, 2021

tombaeyens mentioned this issue Mar 3, 2021

Support providing a filter (or custom query) which is used when analyzing a table #277

Open

vijaykiran closed this as completed in sodadata/soda-core@16430a3 Mar 11, 2021

vijaykiran transferred this issue from sodadata/soda-core Mar 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] Analyze does not push down the LIMIT clause and results a full scan #276

[bug] Analyze does not push down the LIMIT clause and results a full scan #276

mmigdiso commented Mar 2, 2021

tombaeyens commented Mar 2, 2021

tombaeyens commented Mar 2, 2021

mmigdiso commented Mar 2, 2021

abuckenheimer commented Mar 2, 2021

mmigdiso commented Mar 2, 2021

abuckenheimer commented Mar 2, 2021

[bug] Analyze does not push down the LIMIT clause and results a full scan #276

[bug] Analyze does not push down the LIMIT clause and results a full scan #276

Comments

mmigdiso commented Mar 2, 2021

tombaeyens commented Mar 2, 2021

tombaeyens commented Mar 2, 2021

mmigdiso commented Mar 2, 2021

abuckenheimer commented Mar 2, 2021

mmigdiso commented Mar 2, 2021

abuckenheimer commented Mar 2, 2021