-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: remove semi and anti logical join types and clarify null handling in hash join #467
feat: remove semi and anti logical join types and clarify null handling in hash join #467
Conversation
…hash equijoin. Added 'null is match' property to hash equijoin
I ended up opting for something slightly more drastic than my previous PR. I have now been convinced that semi join and anti join do not make sense as logical join types. Rationale:
Counterpoint:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Small question about whether we need so much detail about why semi/anti joins were removed.
It appears the |
@mbasmanova would this be an acceptable solution for #325 ? |
This makes sense to me. Do we see a future where Substrait can be used to represent the physical plan as well? |
I would say "more physical". At the end of the day I think every engine will have some hints / features / functions that can only be expressed with extensions. However, we can add the more well understood concepts. For example, there is already a "hash equijoin" and "merge equijoin" (and anti/semi are join types for the hash equijoin). This will be driven more as more physical producers like spark gluten (or hopefully someday optimizers) produce plans. |
@thisisnic @gforsyth do you have any input on this from the perspective of the dplyr and Ibis Substrait producers? Both dplyr and Ibis have anti join and semi join methods in their user APIs. |
It's slightly more work for the producer, but not a huge amount -- we can express both operations as subqueries. cc @pdet to see if DuckDB substrait has opinions here. |
Hi @gforsyth, thanks for tagging me :-) DuckDB flattens subqueries when consuming the parser tree and transforming it into a Logical Plan. Since no other representations of subqueries exist internally, with this change, we will not be able to produce substrait plans with subqueries. These operators are also crucial for efficient subquery execution, and removing them caps one of IMHO substrait's strengths, of allowing systems with advanced optimization techniques to produce efficient plans. @Mytherin what do you think? |
With this change, Substrait can no longer represent any type of flattened subquery. That means Substrait can no longer represent all logical plans that DuckDB emits, and neither can it represent all physical plans. This would significantly complicate the life of consumers. Executing subqueries as actual subqueries absolutely kills performance. As such, with this change, any consumer of Substrait will have to implement a subquery flattening optimizer in order to efficiently support execution of subqueries (or it will just not support subqueries, or support them extremely inefficiently). I have implemented subquery flattening before - it is not particularly easy. From DuckDB's perspective this will also significantly limit what we can do with Substrait. As Pedro mentioned, DuckDB has no support for unflattened subqueries outside of the parser stage, as executing unflattened subqueries does not make sense from a performance perspective. That means that either (1) we have to stop supporting subqueries in Substrait, or (2) generation of Substrait plans will have to be moved to the pre-planner and pre-optimizer stages, so DuckDB would stop emitting optimized plans in Substrait. We have also recently introduced syntax for semi and anti joins (duckdb/duckdb#6480), which would also no longer be compatible with Substrait either. As for #325, SEMI/ANTI joins have well defined semantics. There is a join condition, any row that has a matching tuple according to that join condition is either (SEMI) emitted, (ANTI) not emitted. A null-aware anti join seems like a more restrictive variant of the MARK join, which is what is used in the current literature to model |
@pdet / @Mytherin thanks for the input!
I agree that we want to find a good fit here. I'd rather accept "semi/anti" join as a logical redundancy than break compatibility. I had mistakenly assumed that DuckDb was using Substriat for logical plans (and, when I checked, DuckDb did not have a logical anti-join). It sounds like DuckDb is using Substrait to represent some kind of optimized plan. Is DuckDb using Substrait for the physical plan? And, if so, do you have any interest in proposing messages for other physical operators, such as mark join? Or is this some middle layer between "pure logical" (input to the parser) and "physical" (where things like mark join live)
If DuckDb is considering semi/anti join as "logical" then that is probably pretty compelling and I will update the PR to add this back in. I'm curious the motivation for this. Was this in response to user ask (e.g. users didn't like writing subqueries even if they knew they would be flattened?)
This is where things get fun 😀. This is a more interesting question and one we are going to have to inevitably tackle if we want to succeed modeling more physical operators. Let me briefly summarize some assumptions (all of which may be completely wrong):
We can argue about which is the more correct physical operator or approach but, at some point, we are going to have to deal with a certain inevitable truth with physical plans: At the most physical representation almost every engine is going to have something unique to that engine. For a canonical example consider an engine that relies on physical relations which rely on specialized hardware (e.g. a query engine running on a hard disk). So I think the criteria for including something should be "Are there two projects (producers, consumers, optimizers, etc.) which support the relation"? For both mark join and "null aware anti-join" I think the answer is yes:
Unfortunately, by the above criteria, I think there is room in Substrait for both mark join AND a null-aware join as physical operators. However, this means we will start to have dialects of Substrait. In a way, we already have dialects in Substrait because functions have options. TL;DR / next steps:
|
DuckDB has four types of plans.
Note that both the optimized and unoptimized variants of the plan are logical plans - there is no distinction in types there. Joins might get re-ordered, and there are some operators that are modified (for example, an Subqueries do not exist as subqueries anymore in the logical plan layer. They are flattened and turned into joins during the planning process. Subqueries as they are modeled in substrait only exist in the parse tree. For us, this is desirable because it means all optimizers can see the full unflattened tree, which makes writing optimizers that deal with subqueries easier.
Agreed, the further down you go (parsed -> logical -> physical) the more uniqueness you encounter.
I would say we want to emit our logical plans, as that allows us to emit plans that have been planned and optimized. The only alternative would be the parse tree - but that would introduce many limitations and complications.
Yes, that would be great. |
FYI, the SQL dialects for Hive and Impala also support |
CC @paleolimbot |
I just implemented joins in the R package and I did find it odd that the two were grouped together - as Weston noted, anti joins and semi joins don't actually join two tables. In dplyr (which I would argue is a logical query plan language), there are |
In Velox, we found it necessary to implement "project" flavor of semi join to support queries like
|
Hi, |
Yes, I think this was the conclusion we reached. I've been pondering how best to approach this PR and it hasn't been a priority recently. Partly there are several different domains in which a join might need to exist, e.g. "velox" (or, perhaps more generally, "presto"-style), "duckdb" (mark-join style), "unoptimized logical" and so I think, to express all the nuances of behavior, we may want to end up spreading things across multiple relations/hints/constraints/extensions. |
Closing in favor of #585 |
BREAKING CHANGE: The semi and anti join types are removed
from the logical join operator. These join types do not make sense
in a logical context since they do not join two tables. Subqueries
should be used instead.
Closes #325