Skip to content
This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

[bug] Analyze does not push down the LIMIT clause and results a full scan #276

Closed
mmigdiso opened this issue Mar 2, 2021 · 6 comments
Closed
Labels
bug Something isn't working soda-sql

Comments

@mmigdiso
Copy link

mmigdiso commented Mar 2, 2021

Describe the bug
In the analyze phase, the DatasetAnalyzer runs some count queries using a LIMIT clause to avoid the full scan.
But when I check the query plan, I see that the limit is applied after the results are calculated and that causes a big performance issue for big tables.
Limit should be applied before executing the count/sum operators

To Reproduce
Steps to reproduce the behavior:

  1. run soda analyze
  2. Run explain from a sql client for the query generated by the analyzer.

This is the current plan: (for demodata dataset)

Limit  (cost=2.30..2.31 rows=1 width=16)
--
->  Aggregate  (cost=2.30..2.31 rows=1 width=16)
->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=13)

It should be:

Aggregate  (cost=0.05..0.06 rows=1 width=16)
--
->  Limit  (cost=0.00..0.03 rows=1 width=1569)
->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=1569)

One way of fixing the problem is changing the FROM demodata limit 1000 to from (select * from demodata limit 1) T;

Context
Include your scan.yml or warehouse.yml when relevant

OS:
Python Version:
Soda SQL Version: 2.0.0.b15
Warehouse Type: Redshift + postgresql

@mmigdiso mmigdiso added bug Something isn't working soda-sql labels Mar 2, 2021
@tombaeyens
Copy link
Contributor

Murat, do you think the same solution with the subselect is going to work for most other warehouses as well? Would you expect it to be a generic solution or only for redshift?

@tombaeyens
Copy link
Contributor

Thanks for the PR! With our CI we'll just see if it works on the other warehouses

@mmigdiso
Copy link
Author

mmigdiso commented Mar 2, 2021

Good question @tombaeyens , I can say that this would work for sparksql, hive, mysql. But I don't have experience with Athena, snowflake, bigquery and sql server. Unfortunately I don't have the environment to test the behavior on other engines :(
The sytnax should be ok for any database supporting the subqueries. It would be surprising to see different behaviours in the execution plan of such a straightforward query across different engines.

@abuckenheimer
Copy link

I was thinking about this for bigquery today, because you pay per terabyte scanned if you soda analyze on a big dataset you can rack up a large bill without knowing. The proposed subquery solution won't break anything but it also wouldn't help either, to reduce the bytes scanned you'd have to be able to pick a partition to limit your selection to.

@mmigdiso
Copy link
Author

mmigdiso commented Mar 2, 2021

hey @abuckenheimer I see your point. I think your problem can easily be addressed with this feature request: #136. wdyt?

@abuckenheimer
Copy link

@mmigdiso great idea, I was looking at your PR sodadata/soda-core#135 and figured you could kind of merge that with the idea in #277 if you could interpolate the table name. So instead of:

analyze_templates:
    table1: "select * from table1 where id > 1000 limit 10"

it be:

analyze_templates:
    default: "select * from {} limit 10"
    table1: "select * from table1 where id > 1000 limit 10"

Then you could pick whatever works best for your base case.

@vijaykiran vijaykiran transferred this issue from sodadata/soda-core Mar 22, 2022
@vijaykiran vijaykiran transferred this issue from sodadata/soda-core Mar 22, 2022
@vijaykiran vijaykiran transferred this issue from sodadata/soda-core Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working soda-sql
Projects
None yet
Development

No branches or pull requests

3 participants