[bug] Analyze does not push down the LIMIT clause and results in a full scan #276
Comments
Murat, do you think the same solution with the subselect is going to work for most other warehouses as well? Would you expect it to be a generic solution, or one specific to Redshift?
Thanks for the PR! With our CI we'll just see if it works on the other warehouses.
Good question @tombaeyens. I can say that this would work for SparkSQL, Hive, and MySQL, but I don't have experience with Athena, Snowflake, BigQuery, or SQL Server. Unfortunately I don't have an environment to test the behavior on the other engines :(
I was thinking about this for BigQuery today, because you pay per terabyte scanned if you
Hey @abuckenheimer, I see your point. I think your problem can easily be addressed with this feature request: #136. wdyt?
@mmigdiso great idea, I was looking at your PR sodadata/soda-core#135 and figured you could kind of merge that with the idea in #277 if you could interpolate the table name. So instead of:
it would be:
Then you could pick whatever works best for your base case. |
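The merge suggested in the comment above could look something like this minimal sketch. The function name `render_custom_sql` and the `{table}` placeholder are hypothetical illustrations, not Soda SQL's actual API:

```python
# Hypothetical sketch of combining table-name interpolation (the PR above)
# with the limited-subselect idea from this issue. The function name and the
# {table} placeholder are made up for illustration; they are not part of
# Soda SQL's real configuration.

def render_custom_sql(template, table, limit=None):
    """Substitute the table name; optionally wrap it in a limited subselect."""
    source = "(SELECT * FROM {} LIMIT {}) T".format(table, limit) if limit else table
    return template.format(table=source)

template = "SELECT COUNT(*) FROM {table}"

# Plain interpolation:
print(render_custom_sql(template, "demodata"))
# SELECT COUNT(*) FROM demodata

# With the limit pushed into a subselect:
print(render_custom_sql(template, "demodata", limit=1000))
# SELECT COUNT(*) FROM (SELECT * FROM demodata LIMIT 1000) T
```

This way the same custom-metric template works for any table, with or without sampling.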
Describe the bug
In the analyze phase, the DatasetAnalyzer runs some count queries with a LIMIT clause to avoid a full scan.
But when I check the query plan, I see that the limit is applied after the results are calculated, which causes a big performance issue for large tables.
The limit should be applied before the count/sum operators are executed.
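The effect of the two LIMIT placements can be sketched with SQLite standing in for Redshift (the table and data here are illustrative):

```python
import sqlite3

# Minimal illustration of why LIMIT placement matters: applied outside an
# aggregate it only trims the single result row, while applied inside a
# subselect it bounds the rows the aggregate actually reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demodata (size INTEGER)")
conn.executemany("INSERT INTO demodata VALUES (?)", [(i,) for i in range(10_000)])

# LIMIT after aggregation: COUNT still scans all 10,000 rows.
outer = conn.execute("SELECT COUNT(*) FROM demodata LIMIT 1000").fetchone()[0]
print(outer)   # 10000

# LIMIT pushed into a subselect: COUNT sees at most 1,000 rows.
inner = conn.execute(
    "SELECT COUNT(*) FROM (SELECT * FROM demodata LIMIT 1000) T"
).fetchone()[0]
print(inner)   # 1000
```

Whether the optimizer can push the outer LIMIT down on its own is engine-specific; the subselect form makes the bound explicit regardless.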
To Reproduce
Steps to reproduce the behavior:
This is the current plan (for the demodata dataset):
It should be:
One way of fixing the problem is changing

```sql
FROM demodata limit 1000
```

to

```sql
from (select * from demodata limit 1) T;
```
Context
Include your scan.yml or warehouse.yml when relevant
OS:
Python Version:
Soda SQL Version: 2.0.0.b15
Warehouse Type: Redshift + postgresql