
Broadcast the larger side during join if colA + colB >5 condition is specified #15605

Closed
ubyyj opened this issue Jan 5, 2023 · 10 comments · Fixed by #15642
Labels
bug Something isn't working

Comments

@ubyyj
Contributor

ubyyj commented Jan 5, 2023

We are using Trino version 377 and found that it generates a very inefficient plan, broadcasting the much larger table, if a condition like `colA + colB > 5` is given.

Below is the query:

```sql
select account_id,
       sum(read_time_web + read_time_smart) as score
from hive_ad.ml.sn_result_hourly b
join hive_ad.ml.sn_articles_export_hourly a
  using (dt, edition, link_id)
where dt between
        format_datetime(date_add('day', -3, now()), 'YYYY-MM-dd')
      and
        format_datetime(date_add('day', -1, now()), 'YYYY-MM-dd')
  and a.hh = '23' and edition = 'en_US'
  and b.hh = '23'
  and (read_time_web + read_time_smart) >= 5 -- this line matters
group by 1
```

Table sn_result_hourly is much larger, yet the query plan shows this larger table being broadcast during the join. If the condition `(read_time_web + read_time_smart) >= 5` is removed, the plan looks good.
Please find the incorrect query plan in the attached file.

broadcast_larger_side_plan.txt
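For anyone reproducing this, the join distribution chosen by the planner can be inspected without running the full query. A minimal sketch against the same tables (same filter, simplified projection):

```sql
-- Show the distributed plan; the join node indicates whether the
-- build side is BROADCAST (replicated) or PARTITIONED (hashed)
EXPLAIN (TYPE DISTRIBUTED)
SELECT count(*)
FROM hive_ad.ml.sn_result_hourly b
JOIN hive_ad.ml.sn_articles_export_hourly a
  USING (dt, edition, link_id)
WHERE (read_time_web + read_time_smart) >= 5;
```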

For your convenience, here is the key part of the plan:

[Screenshot 2023-01-05 11:35:19: key part of the query plan]

@findepi
Member

findepi commented Jan 5, 2023

cc @sopel39 @raunaqmorarka

@sopel39
Member

sopel39 commented Jan 5, 2023

What is the correct estimate?

@ubyyj
Contributor Author

ubyyj commented Jan 6, 2023

The desired plan should have the table scan of sn_result_hourly on the left side of the join, since it has many more rows than sn_articles_export_hourly.
@sopel39 does this answer your question?

@sopel39
Member

sopel39 commented Jan 6, 2023

@ubyyj could you also run `EXPLAIN ANALYZE`?

@ubyyj
Contributor Author

ubyyj commented Jan 8, 2023

run_analyze_bad_query_plan.txt

Attached is the result of `EXPLAIN ANALYZE`.

@raunaqmorarka
Member

Could you also provide the output of `SHOW STATS FOR sn_result_hourly`?

@ubyyj
Contributor Author

ubyyj commented Jan 9, 2023

```
show stats for hive_ad.ml.sn_result_hourly;
  column_name   |       data_size       | distinct_values_count | nulls_fraction |      row_count       | low_value | high_value
----------------+-----------------------+-----------------------+----------------+----------------------+-----------+------------
 read_time_web  | NULL                  | NULL                  | NULL           | NULL                 | NULL      | NULL
 ... (rows containing only NULL values omitted)
 dt             | 1.9505585659707977E13 | 1329.0                | 0.0            | NULL                 | NULL      | NULL
 hh             | 3.9011171319429473E12 | 24.0                  | 0.0            | NULL                 | NULL      | NULL
 NULL           | NULL                  | NULL                  | NULL           | 1.950558565970551E12 | NULL      | NULL
```

@raunaqmorarka
Member

I think this is due to the optimizer not handling unknown stats correctly in the relevant part of the code.
Could you try running `ANALYZE hive_ad.ml.sn_result_hourly WITH (columns = ARRAY['read_time_web', 'read_time_smart'])` and see if that solves it?
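(As a stopgap while column stats are missing, the broadcast decision can also be overridden per session; this assumes Trino's standard `join_distribution_type` session property:)

```sql
-- Force a partitioned (hash) join instead of broadcasting either side;
-- valid values are AUTOMATIC (the default), PARTITIONED, and BROADCAST
SET SESSION join_distribution_type = 'PARTITIONED';
```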

@raunaqmorarka raunaqmorarka self-assigned this Jan 9, 2023
@raunaqmorarka raunaqmorarka added the bug Something isn't working label Jan 9, 2023
@ubyyj
Contributor Author

ubyyj commented Jan 9, 2023

@raunaqmorarka ANALYZE fixes the issue.
Still, since the optimizer knows the total row count of each side, it should be able to choose the build side based on those row counts; this still sounds like a bug to me.
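For reference, the statistics collection that resolved it (per the suggestion above) looks roughly like this; table and column names as in the query:

```sql
-- Populate NDV / null-fraction stats for the two filter columns,
-- then verify that the previously-NULL stats rows are now filled in
ANALYZE hive_ad.ml.sn_result_hourly
WITH (columns = ARRAY['read_time_web', 'read_time_smart']);

SHOW STATS FOR hive_ad.ml.sn_result_hourly;
```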

@raunaqmorarka
Member

It is a bug; I just wanted to confirm that explicitly generating column stats solves the problem.
