Reuse table scan results when the same table is used in different parts of query #5880
Comments
A similar approach could be used to cache CTEs (#5878) in the future.
Maybe prestodb/presto#15155 helps.
Might be addressed by #14271, cc @lukasz-stec
Hi @sopel39, otherwise scan/fragment caching would be a good feature, WDYT? We can also refer to prestodb/presto#15155, which @tooptoop4 also shared. It seems to cache fragments/pages rather than the whole TableScan.
#14271 was a simplified version of the original fusing rules PR. IIRC the original fusing rules had good gains but also some non-trivial regressions. More generally, we lean towards implementing something like fusing rules in a more rule-based, iterative approach rather than with plan rewriters.
Definitely. In fact, such technology powers the Starburst multi-level cache (https://www.starburst.io/blog/introducing-multilayer-caching/). However, there are many ways this can be achieved.
@sopel39 thanks! We are also very interested in ways to improve query performance, to make it more attractive for wider use cases, like Tardigrade is for ETL. The Presto implementation also looks decent.
I'm in the process of open-sourcing the subquery cache feature from Starburst. Please be patient, as it's a pretty large feature and consists of multiple parts. Once there is a PR, I will close this issue and #5878 and create a new epic issue with future improvements.
Here is the PR for subquery cache: #21888
Superseded by #22114
In tpcds/q95 the web_sales table is scanned multiple times. Additionally, that table is then distributed across nodes using the same hash column.

web_sales is a large table. Instead of reading it multiple times, it should be possible to cache TableScan results in output buffers and have multiple downstream stages read them. Note that some web_sales scans have DF applied, so such an optimization should not increase query wall time.
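
To make the proposal concrete, here is a minimal Python sketch of the idea of reading a table once and letting several consumers share the buffered result. It is a toy model under stated assumptions, not Trino code: `ScanCache`, `read_web_sales`, the two-column row shape, and the sample rows are all hypothetical, and the self-join is only loosely modeled on how a query might use web_sales twice. Dynamic filtering (DF) is not modeled.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical row shape: (ws_order_number, ws_warehouse_sk).
Row = Tuple[int, int]
# A scan is identified here by (table name, projected columns); this key is an assumption.
ScanKey = Tuple[str, Tuple[str, ...]]


class ScanCache:
    """Toy stand-in for an output buffer that several consumers can read."""

    def __init__(self) -> None:
        self._buffers: Dict[ScanKey, List[Row]] = {}
        self.physical_scans = 0  # number of times a table was actually read

    def scan(self, table: str, columns: Tuple[str, ...],
             reader: Callable[[], Iterator[Row]]) -> List[Row]:
        key = (table, columns)
        if key not in self._buffers:
            # The first consumer pays for the read; later ones reuse the buffer.
            self._buffers[key] = list(reader())
            self.physical_scans += 1
        return self._buffers[key]


def read_web_sales() -> Iterator[Row]:
    # Hypothetical tiny stand-in for the web_sales table.
    yield from [(1, 10), (1, 20), (2, 10), (3, 30)]


cache = ScanCache()
cols = ("ws_order_number", "ws_warehouse_sk")

# Two "stages" of the same query request the same scan.
left = cache.scan("web_sales", cols, read_web_sales)
right = cache.scan("web_sales", cols, read_web_sales)

# Self-join on order number, keeping orders shipped from more than one warehouse.
multi_warehouse_orders = sorted({
    l[0] for l in left for r in right if l[0] == r[0] and l[1] != r[1]
})
```

Both sides of the join read the same buffered rows, so `cache.physical_scans` stays at 1 even though the table is referenced twice. A real engine would also have to decide when two scans are equivalent (for example, when they carry different predicates or dynamic filters), which this sketch leaves out.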