Limit and topn pushdown for Phoenix. #7490
Conversation
Why wasn't it caught by
I assume because the result was functionally correct: the limit would not be applied at the Phoenix level, but Trino would still apply it (isLimitGuaranteed returns false for the PhoenixConnector).
Huh? Looks like it is trying to allocate over 390GB of memory in Phoenix?!
Is that a negative test? Update: Phoenix estimates the needed memory as (limit + offset) * row size; it does not actually allocate that memory, it just guards against it.
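As a hedged sketch of the estimate described above (the row size is an assumed number, and `PhoenixMemoryEstimate`/`estimateBytes` are illustrative names, not Phoenix's actual code):

```java
// Illustrative only: Phoenix guards memory roughly as (limit + offset) * estimated row size.
public class PhoenixMemoryEstimate {
    static long estimateBytes(long limit, long offset, long rowSizeBytes) {
        return (limit + offset) * rowSizeBytes;
    }

    public static void main(String[] args) {
        // testLimitMax effectively pushes LIMIT Integer.MAX_VALUE; with an assumed
        // ~200-byte row the guard's estimate lands around 400 GB, tripping the
        // memory chunk manager even though nothing is actually allocated.
        long estimate = estimateBytes(Integer.MAX_VALUE, 0, 200);
        System.out.println(estimate / (1L << 30) + " GB estimated");
    }
}
```

This is why a huge pushed-down LIMIT can fail the query even though no real allocation of that size ever happens.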
It's testLimitMax:
Should I override that test with a comment in
See also https://issues.apache.org/jira/browse/PHOENIX-6436. I feel I do not want to fix this at the Trino level, so I would just override the test.
That's only hiding a problem... if a problem exists. Otherwise, what's the safe LIMIT or TopN value that can be pushed down safely into Phoenix?
That's the "thing"... It depends on the H/W. The highest that value can be set to is 100% of the available heap (again, it does not actually allocate that RAM, it just guards against abuse with a memory chunk manager, but it will block other tasks that also try to reserve memory chunks). Default is 15% of the heap. Phoenix is not very smart about this: a query might lead to thousands of scans, each of which is handed the same overall limit, with a final pass. To make that test pass the limit is about 2,000,000. We could avoid pushing down topN beyond that limit and let Trino do the work in that case, but that's not right either. I tried checking that in supportsTopN(...), but when that method is called the passed JdbcTableHandle does not yet have the limit applied to it. So I'd have to change the TopNFunction itself.
Unfortunately the logic is currently hardcoded in Phoenix. This is a problem in Phoenix... Still not convinced that we want to solve it in Trino.
It's not hard to put a 1m or 2m max on the topN limit and have Trino not push it into Phoenix beyond that. Assuming you run the HBase region servers with 32GB of heap (that's common), the default 15% gives a maximum allocation of 4.8GB. Either is good enough, IMHO. @findepi Let me know whether you want to proceed in that direction. One more update: if there is an index that matches the sort order there is no limit in Phoenix, since it just streams out the result in that case, but I cannot tell whether that is the case.
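For concreteness, the arithmetic behind those numbers (the heap size and the 15% default are taken from the comment above; the class and method names are made up for illustration):

```java
// Illustrative arithmetic: Phoenix's global memory chunk pool is a fraction of the heap.
public class PhoenixHeapBudget {
    static long maxChunkPoolBytes(long heapBytes, double fraction) {
        return (long) (heapBytes * fraction);
    }

    public static void main(String[] args) {
        long heap = 32L << 30;                       // assumed 32 GB region server heap
        long budget = maxChunkPoolBytes(heap, 0.15); // default 15% -> ~4.8 GB
        System.out.printf("%.1f GB budget%n", budget / (double) (1L << 30));
    }
}
```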
Latest change does that. Maximum limit hardcoded to 2m. Could make it configurable, but that seems overly complicated for a user to understand. Still not sure whether we shouldn't just let Phoenix decide whether it can run the query or not.
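The cap described here could look roughly like this (a sketch only; `TopNPushdownGuard`, `MAX_PUSHDOWN_LIMIT`, and `shouldPushDownTopN` are illustrative names, not the actual PhoenixClient code):

```java
// Sketch: decline topN pushdown past a hardcoded cap and let Trino sort/limit instead.
public class TopNPushdownGuard {
    static final long MAX_PUSHDOWN_LIMIT = 2_000_000L;

    static boolean shouldPushDownTopN(long limit) {
        // Small topN goes to Phoenix; anything beyond the cap stays in Trino,
        // avoiding Phoenix's (limit + offset) * rowSize memory guard.
        return limit > 0 && limit <= MAX_PUSHDOWN_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(shouldPushDownTopN(1_000));
        System.out.println(shouldPushDownTopN(5_000_000L));
    }
}
```

Making the threshold a session property would be the obvious alternative to hardcoding, at the cost of one more knob to explain.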
Tests pass. But I don't like this very much. I would rather have some queries fail outright and get streaming of any size when there's a proper index (in fact, that way I can get ORDER BY pushdown by just passing a very large LIMIT). The user can always disable topN pushdown via the topn_pushdown_enabled session variable. Update: I provided a fix for Phoenix that will presumably be in 5.1.2.
Apologies for the barrage of comments here... Even with the size accounting fixed in Phoenix, it turns out that Phoenix always performs a merge sort on the client, i.e. the Trino worker, and for larger limits Trino's own method is just faster. So I think the patch as is is good. For small values we get the benefit of Phoenix doing the work, and for larger values we defer to Trino. The right threshold depends a bit, but from my testing it is somewhere between 1m and 10m rows, so the 2m in this patch is a good enough heuristic. I'll stop here and wait for comments :)
Friendly ping
@lhofhansl Do we need to change the
@vincentpoon Thanks for taking a look.
Oh, that's weird. Is there another bug? The test clearly failed when it pushed down topN. I'll double check. What's your opinion on only pushing topN when the limit is <= 2,000,000? (Interestingly, the client-side merge sort, even when the sort order matches an index on the server, is slower than letting Trino do it beyond a certain size.)
Looking at test failures... For the Phoenix connector. What now? It seems I should revert that last change suggested by @vincentpoon. This is the generated plan:
Hence, I could also override the test and then match Output -> * -> TableScanNode -> SortNode, to verify that the sorting is pushed down.
Ok, wasn't aware that
Wondering how you're testing this, e.g. is your Trino cluster larger than your HBase cluster (more workers than region servers)?
Looks good overall % some nits about comments.
Are the buildSql changes necessary or are they a cleanup? Could be a separate commit.
EDIT: Nevermind, saw the PR description - should be a separate commit then "Fix limit pushdown".
plugin/trino-phoenix5/src/main/java/io/trino/plugin/phoenix5/PhoenixClient.java
I isolated this on a single machine. Since the post-op client merge happens on the Phoenix client (i.e. on the Trino worker), that seemed like a good way to test this. When the limit is larger (and there are a lot of values to be passed) it kinda makes sense that Trino could be faster: Phoenix would have to ship all the rows to the Phoenix client and then do the merge, which seems fairly inefficient for large sets (past a few million). I'm also debugging that as I find time, but as is, Trino seems faster. When Phoenix can limit that work, most notably when there's an index with the right order and the set is small'ish, then Phoenix is (much) faster.
Whoops I messed something up in git. That looks different from my local history. |
LIMIT pushdown did not work since the HBase scans are created during the split phase based on the query before it is transformed. With this commit splits are created from the transformed query.
Looks good to me.
// TODO: Remove when this is improved in Phoenix.
// Phoenix performs a client-side merge sort.
// For large sets this is slower than passing
// the unsorted data to Trino and sorting it there.
I would expect this comment to talk about memory implications of changing (or not having) the limit, or a link to Phoenix issue describing that.
Also, if the Trino-side sorting is faster, I am not sure the TODO is actionable. If it is not, please just remove the "TODO" and please retain the rest of the info.
// TODO: Remove when this is improved in Phoenix.
// Phoenix performs a client-side merge sort.
// For large sets this is slower than passing
// the unsorted data to Trino and sorting it there.
obviously same here
@findepi The situation is not that simple in Phoenix, unfortunately. With Phoenix 5.1.1 (the current version) queries with large LIMITs will just fail due to the overestimation. Upon running the Phoenix side through the profiler I found that Phoenix is only slower when using a local index that does not cover all the columns requested in the query; in other cases sorting in Phoenix is faster. Apologies for the confusion; I had to get clarity myself, too. So I'll revise my statement: for Phoenix 5.1.1 we need to put a maximum on the LIMIT, as this PR does. Once Phoenix 5.1.2 is out, we should push all topN to Phoenix. Again, my apologies for the back-and-forth. Should I update the comment in the code?
BTW, we obviously get much better performance when we can return
I'll file a reminder issue for when Phoenix 5.1.2 is released.
lgtm
I reran CI since the Phoenix tests failed to start.
columns,
ImmutableMap.of(),
split);
return new QueryBuilder(this).prepareStatement(session, connection, preparedQuery);
Does this change mean we can return true from isLimitGuaranteed in Phoenix?
No. If you pass Phoenix a limit > Integer.MAX_VALUE it will still completely ignore it, so isLimitGuaranteed has to return false.
But that's JdbcMetadata's responsibility, not PhoenixClient's, right? PhoenixClient doesn't ignore anything.
Depending on the answer to #7490 (comment), please follow up.
Thanks @findepi @vincentpoon @hashhar. I'll follow up when Phoenix 5.1.2 is out.
@lhofhansl I have a simple query as follows: select * from table_a where ROW = 'xxx' limit 1. The table holds about 1.1T of data; the query takes more than 2 seconds, while running it directly against Phoenix takes about 0.1 seconds. Can this be improved by introducing this patch?
This adds topn pushdown for the Phoenix connector.
It turns out limit pushdown never worked; this PR fixes that as well.
The current code worked by "luck" only. When calculating the splits, the PhoenixConnector also generates the relevant HBase scans to execute as part of a split. However, those scans were created based on the original query, not the rewritten one, so the scans would miss some of the annotations required by the rewritten query. It happened to work, but the limit was never applied.
When I first pushed down topN the results were nonsensical, since the scans and the expected results disagreed on the shape. The change now uses the same code to generate the split scans and the expected results during execution.
Only made the phoenix5 change; if the tests pass and we agree with the implementation I'll make the same change on Phoenix. @vincentpoon @findepi