feat: remove semi and anti logical join types and clarify null handling in hash join #467

Closed

Conversation

westonpace
Member

BREAKING CHANGE: The semi and anti join types are removed
from the logical join operator. These join types do not make sense
in a logical context since they do not join two tables. Subqueries
should be used instead.

Closes #325

…hash equijoin. Added 'null is match' property to hash equijoin
@westonpace
Member Author

I ended up opting for something slightly more drastic than my previous PR. I have now been convinced that semi join and anti join do not make sense as logical join types.

Rationale:

  • These join types do not actually join two tables together
  • The queries can more naturally be expressed with subqueries
  • Major implementations of logical join do not have this capability
    • DuckDb does not support this in SQL
    • Postgres does not support this in SQL
    • MySQL does not support this in SQL
    • Trino does not support this

Counterpoint:

  • Some implementations do express this at a logical level
    • Spark SQL does support ANTI/SEMI

Contributor

@cpcloud cpcloud left a comment

LGTM!

Small question about whether we need so much detail about why semi/anti joins were removed.

@westonpace
Member Author

It appears the proto-prefix.py script is not expecting "reserved" keywords in an enum. I can fix that.

@westonpace
Member Author

@mbasmanova would this be an acceptable solution for #325 ?

@mbasmanova

@westonpace

The semi and anti join types are removed
from the logical join operator. These join types do not make sense
in a logical context since they do not join two tables. Subqueries
should be used instead.

This makes sense to me. Do we see a future where Substrait can be used to represent the physical plan as well?

@westonpace
Member Author

@westonpace

The semi and anti join types are removed
from the logical join operator. These join types do not make sense
in a logical context since they do not join two tables. Subqueries
should be used instead.

This makes sense to me. Do we see a future where Substrait can be used to represent the physical plan as well?

I would say "more physical". At the end of the day I think every engine will have some hints / features / functions that can only be expressed with extensions.

However, we can add the better-understood concepts. For example, there is already a "hash equijoin" and a "merge equijoin" (and anti/semi are join types for the hash equijoin). This work will move forward as more physical producers like Spark Gluten (or hopefully someday optimizers) produce plans.

@ianmcook
Contributor

ianmcook commented Mar 16, 2023

@thisisnic @gforsyth do you have any input on this from the perspective of the dplyr and Ibis Substrait producers? Both dplyr and Ibis have anti join and semi join methods in their user APIs.

@gforsyth
Member

It's slightly more work for the producer, but not a huge amount -- we can express both operations as subqueries.

cc @pdet to see if DuckDB substrait has opinions here.

@pdet
Contributor

pdet commented Mar 16, 2023

Hi @gforsyth, thanks for tagging me :-)

DuckDB flattens subqueries when consuming the parse tree and transforming it into a Logical Plan. Since no other representation of subqueries exists internally, with this change we will not be able to produce Substrait plans with subqueries.

These operators are also crucial for efficient subquery execution, and removing them caps what is, IMHO, one of Substrait's strengths: allowing systems with advanced optimization techniques to produce efficient plans.

@Mytherin what do you think?

@Mytherin

Mytherin commented Mar 16, 2023

With this change, Substrait can no longer represent any type of flattened subquery. That means Substrait can no longer represent all logical plans that DuckDB emits, and neither can it represent all physical plans.

This would significantly complicate the life of consumers. Executing subqueries as actual subqueries absolutely kills performance. As such, with this change, any consumer of Substrait will have to implement a subquery flattening optimizer in order to efficiently support execution of subqueries (or it will just not support subqueries, or support them extremely inefficiently). I have implemented subquery flattening before - it is not particularly easy.
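
To illustrate the performance point, here is a minimal sketch (not DuckDB code) comparing an `EXISTS`-style subquery that is re-run for every outer row with a flattened version that builds a hash set once. The function names and the `scanned` counter are hypothetical, rows are assumed to be plain dicts, and NULL keys are ignored.

```python
def exists_per_row(outer, inner, key):
    """Unflattened execution: re-scan `inner` for every outer row."""
    scanned = 0
    result = []
    for o in outer:
        found = False
        for i in inner:  # the "subquery" is evaluated again for each outer row
            scanned += 1
            if i[key] == o[key]:
                found = True
                break
        if found:
            result.append(o)
    return result, scanned


def exists_flattened(outer, inner, key):
    """Flattened execution: a hash semi join that scans `inner` once."""
    build = {i[key] for i in inner}  # single pass over the inner side
    scanned = len(inner)
    result = [o for o in outer if o[key] in build]
    return result, scanned


orders = [{"cust": c} for c in range(100)]
customers = [{"cust": c} for c in range(0, 200, 2)]

naive_rows, naive_scanned = exists_per_row(orders, customers, "cust")
flat_rows, flat_scanned = exists_flattened(orders, customers, "cust")
```

Both functions return the same rows, but the per-row version touches the inner side many more times; a consumer without a flattening optimizer would be stuck with something like the first form.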

From DuckDB's perspective this will also significantly limit what we can do with Substrait. As Pedro mentioned, DuckDB has no support for unflattened subqueries outside of the parser stage, as executing unflattened subqueries does not make sense from a performance perspective. That means that either (1) we have to stop supporting subqueries in Substrait, or (2) generation of Substrait plans will have to be moved to the pre-planner and pre-optimizer stages, so DuckDB would stop emitting optimized plans in Substrait.

We have also recently introduced syntax for semi and anti joins (duckdb/duckdb#6480), which would no longer be compatible with Substrait either.

As for #325, SEMI/ANTI joins have well defined semantics. There is a join condition, any row that has a matching tuple according to that join condition is either (SEMI) emitted, (ANTI) not emitted.
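
A minimal sketch of these semantics, assuming rows are Python dicts, the join condition is an arbitrary two-argument predicate, and NULL handling is ignored entirely. The function names are hypothetical and this is not taken from any engine.

```python
def semi_join(left, right, condition):
    """Emit each left row that has at least one matching right row."""
    return [l for l in left if any(condition(l, r) for r in right)]


def anti_join(left, right, condition):
    """Emit each left row that has no matching right row."""
    return [l for l in left if not any(condition(l, r) for r in right)]


left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 2}, {"id": 4}]


def same_id(l, r):
    return l["id"] == r["id"]


semi_rows = semi_join(left, right, same_id)
anti_rows = anti_join(left, right, same_id)
```

Note that a left row is emitted at most once even when several right rows match, and only left columns appear in the output.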

A null-aware anti join seems like a more restrictive variant of the MARK join, which is what is used in the current literature to model IN/ANY/ALL clauses. How would you correctly model an IN clause in a SELECT (not in a WHERE) with a null-aware anti join? Unless I am missing something it seems that that is not possible?
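
As a rough sketch of why a mark join is more general, the following models `x IN (subquery)` under SQL three-valued logic, with Python `None` standing in for NULL. Every left row is kept and receives a mark of `True`, `False`, or `None`, which is what a `SELECT`-list `IN` needs; a join that only filters rows cannot produce that column. The function name `mark_join` and the tuple output are assumptions for illustration, not DuckDB's implementation.

```python
def mark_join(left_keys, right_keys):
    """Return (key, mark) for every left key, where mark models `key IN right_keys`."""
    right_values = {r for r in right_keys if r is not None}
    right_has_null = any(r is None for r in right_keys)
    right_empty = len(right_keys) == 0

    out = []
    for x in left_keys:
        if right_empty:
            mark = False   # IN over an empty set is FALSE, even for NULL
        elif x is None:
            mark = None    # NULL compared with anything is unknown
        elif x in right_values:
            mark = True
        elif right_has_null:
            mark = None    # no match, but a NULL on the right makes it unknown
        else:
            mark = False
        out.append((x, mark))
    return out


marked = mark_join([1, 2, None], [1, None])
marked_empty = mark_join([1, None], [])
```

A `WHERE x IN (...)` keeps rows whose mark is `True`, and a `WHERE x NOT IN (...)` keeps rows whose mark is `False`, so both filtering forms can be derived from the mark column.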

@westonpace
Member Author

@pdet / @Mytherin thanks for the input!

With this change, Substrait can no longer represent any type of flattened subquery. That means Substrait can no longer represent all logical plans that DuckDB emits, and neither can it represent all physical plans.

I agree that we want to find a good fit here. I'd rather accept "semi/anti" join as a logical redundancy than break compatibility. I had mistakenly assumed that DuckDb was using Substrait for logical plans (and, when I checked, DuckDb did not have a logical anti-join). It sounds like DuckDb is using Substrait to represent some kind of optimized plan.

Is DuckDb using Substrait for the physical plan? And, if so, do you have any interest in proposing messages for other physical operators, such as mark join? Or is this some middle layer between "pure logical" (input to the parser) and "physical" (where things like mark join live)?

We have also recently introduced syntax for semi and anti joins (duckdb/duckdb#6480), which would no longer be compatible with Substrait either.

If DuckDb is considering semi/anti join as "logical" then that is probably pretty compelling and I will update the PR to add this back in. I'm curious about the motivation for this. Was this in response to a user ask (e.g. users didn't like writing subqueries even if they knew they would be flattened)?

As for #325, SEMI/ANTI joins have well defined semantics. There is a join condition, any row that has a matching tuple according to that join condition is either (SEMI) emitted, (ANTI) not emitted.

A null-aware anti join seems like a more restrictive variant of the MARK join, which is what is used in the current literature to model IN/ANY/ALL clauses. How would you correctly model an IN clause in a SELECT (not in a WHERE) with a null-aware anti join? Unless I am missing something it seems that that is not possible?

This is where things get fun 😀. This is a more interesting question and one we will inevitably have to tackle if we want to succeed in modeling more physical operators. Let me briefly summarize some assumptions (all of which may be completely wrong):

  • Neither DuckDb nor Velox evaluates "unflattened" subqueries
  • Some naive systems can handle many (but not all) correlated subqueries using semi/anti join without any concept of null awareness.
  • Velox can handle a larger set of correlated subqueries because it introduces null awareness in anti join
  • DuckDb can handle an even larger set (all correlated subqueries?) because it introduces mark join, and this removes the need for dealing with null awareness in anti join
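
To make "null awareness" concrete, here is a minimal sketch contrasting a plain anti join on key equality with an anti join that follows SQL `NOT IN` semantics. Python `None` stands in for NULL, and the function names are hypothetical; this illustrates the SQL semantics only and is not Velox's or any other engine's implementation.

```python
def naive_anti_join(left_keys, right_keys):
    """Keep a left key when no right key equals it (NULL never equals anything)."""
    right_values = {r for r in right_keys if r is not None}
    return [x for x in left_keys if x is None or x not in right_values]


def null_aware_anti_join(left_keys, right_keys):
    """Keep a left key only when `key NOT IN right_keys` is definitely TRUE."""
    if not right_keys:
        return list(left_keys)   # NOT IN over an empty set is TRUE for every row
    if any(r is None for r in right_keys):
        return []                # a NULL on the right makes NOT IN never TRUE
    right_values = set(right_keys)
    return [x for x in left_keys if x is not None and x not in right_values]


left_keys = [1, 2, None]
naive_result = naive_anti_join(left_keys, [2, None])
aware_result = null_aware_anti_join(left_keys, [2, None])
```

The naive version behaves like `NOT EXISTS` with an equality predicate, which is why it handles many, but not all, correlated subqueries correctly.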

We can argue about which is the more correct physical operator or approach but, at some point, we are going to have to deal with a certain inevitable truth with physical plans:

At the most physical representation almost every engine is going to have something unique to that engine.

For a canonical example consider an engine that relies on physical relations which rely on specialized hardware (e.g. a query engine running on a hard disk).

So I think the criteria for including something should be "Are there two projects (producers, consumers, optimizers, etc.) which support the relation"?

For both mark join and "null aware anti-join" I think the answer is yes:

  • Mark join
    • Both the DuckDb producer and the DuckDb consumer support mark join (is this true?)
  • Null aware anti-join
    • Velox, as a consumer, supports this
    • Acero, as a consumer, might support this (there is a property JoinKeyCompare which I think equates to null awareness, but I need to do some further testing)
    • Gluten, as a producer, supports setting this (or I assume they will given their behavior so far)

Unfortunately, by the above criteria, I think there is room in Substrait for both mark join AND a null-aware join as physical operators. However, this means we will start to have dialects of Substrait. In a way, we already have dialects in Substrait because functions have options.

TL;DR / next steps:

  • Does DuckDb want to use Substrait at the logical or physical level? (both is a fine answer too)
  • If DuckDb wants to use Substrait at the logical level then we should leave semi/anti join in the logical join operator (given spark also considers this logical then I think we have enough precedent to justify it)
  • If DuckDb wants to use Substrait at the physical level then we need to start getting proposals for physical operators that match what DuckDb can consume & produce.
  • This all probably warrants a general ML discussion about the idea of dialects in Substrait. I'll start something (hopefully I can be more concise there).

@Mytherin

Is DuckDb using Substrait for the physical plan? And, if so, do you have any interest in proposing messages for other physical operators, such as mark join? Or is this some middle layer between "pure logical" (input to the parser) and "physical" (where things like mark join live)?

DuckDB has four types of plans.

  • Parse Tree
  • Logical Plan (Unoptimized)
  • Logical Plan (Optimized)
  • Physical Plan

Note that both the optimized and unoptimized variants of the plan are logical plans - there is no distinction in types there. Joins might get re-ordered, and there are some operators that are modified (for example, an ORDER + LIMIT might get turned into a Top-N) - but these all still strictly describe at a logical level what should happen to the data. Anti joins, mark joins, etc live in the logical level as well and I would consider all of these logical operators, not physical operators. An anti join or mark join is merely a different kind of join, similar to an inner or left join. The physical plan exists to make concrete implementation choices (e.g. hash versus merge) and is separate from join types.
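
The ORDER + LIMIT to Top-N rewrite mentioned above can be sketched as follows. Both functions describe the same logical result; only the way it is computed differs. The function names are hypothetical, and `heapq.nsmallest` from the Python standard library stands in for a Top-N operator.

```python
import heapq


def order_then_limit(rows, key, n):
    """Unoptimized plan: sort everything, then keep the first n rows."""
    return sorted(rows, key=key)[:n]


def top_n(rows, key, n):
    """Optimized plan: track only the n smallest rows while scanning."""
    return heapq.nsmallest(n, rows, key=key)


def by_score(row):
    return row["score"]


# Scores are distinct here, so tie-breaking order does not matter.
rows = [{"id": i, "score": (i * 37) % 101} for i in range(50)]

unoptimized = order_then_limit(rows, by_score, 5)
optimized = top_n(rows, by_score, 5)
```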

Subqueries do not exist as subqueries anymore in the logical plan layer. They are flattened and turned into joins during the planning process. Subqueries as they are modeled in Substrait only exist in the parse tree. For us, this is desirable because it means all optimizers can see the full flattened tree, which makes writing optimizers that deal with subqueries easier.

At the most physical representation almost every engine is going to have something unique to that engine.

Agreed, the further down you go (parsed -> logical -> physical) the more uniqueness you encounter.

Does DuckDb want to use Substrait at the logical or physical level? (both is a fine answer too)

I would say we want to emit our logical plans, as that allows us to emit plans that have been planned and optimized. The only alternative would be the parse tree - but that would introduce many limitations and complications.

If DuckDb wants to use Substrait at the logical level then we should leave semi/anti join in the logical join operator (given spark also considers this logical then I think we have enough precedent to justify it)

Yes, that would be great.

@ianmcook
Contributor

ianmcook commented Mar 16, 2023

Spark SQL does support ANTI/SEMI

FYI, the SQL dialects for Hive and Impala also support ANTI JOIN and SEMI JOIN

@thisisnic
Contributor

CC @paleolimbot

@paleolimbot

I just implemented joins in the R package and I did find it odd that the two were grouped together - as Weston noted, anti joins and semi joins don't actually join two tables. In dplyr (which I would argue is a logical query plan language), there are semi_join() and anti_join() but the documentation is grouped into "mutating joins" (left, right, etc.) and "filtering joins" (semi and anti). I wonder if that would be a good split here, too (or maybe that ship has sailed).

@mbasmanova

"mutating joins" (left, right, etc.) and "filtering joins" (semi and anti).

In Velox, we found it necessary to implement a "project" flavor of semi join to support queries like

SELECT * FROM t WHERE t.a IN (<subquery>) OR t.b > 10
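
A minimal sketch of why a query of this shape needs a projecting variant: a filtering semi join would drop unmatched rows before the `OR t.b > 10` branch could be evaluated, so the join instead appends a boolean column and a later filter combines it with the other predicate. The function name `semi_project` and the column name `match` are hypothetical, rows are assumed to be dicts, NULL handling is ignored, and this is not Velox's implementation.

```python
def semi_project(rows, key, subquery_values):
    """Keep every row and append a boolean `match` column."""
    values = set(subquery_values)
    return [dict(row, match=row[key] in values) for row in rows]


t = [
    {"a": 1, "b": 5},
    {"a": 2, "b": 50},
    {"a": 3, "b": 7},
]

flagged = semi_project(t, "a", [1])

# WHERE match OR b > 10, then project the helper column away again.
result = [
    {k: v for k, v in row.items() if k != "match"}
    for row in flagged
    if row["match"] or row["b"] > 10
]
```

Here the row with `a = 2` survives only because of `b > 10`; a filtering semi join on `a` alone would have removed it.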

@zeodtr

zeodtr commented May 11, 2023

Hi,
I think it can be important to be able to represent an optimized logical plan with Substrait (so that an intelligent planner can send it to simpler executors). So, if anti-join and semi-join are crucial for representing an optimized logical plan, they should remain, with semantics clarified if necessary.

@westonpace
Member Author

I think it can be important to be able to represent an optimized logical plan with Substrait (so that an intelligent planner can send it to simpler executors). So, if anti-join and semi-join are crucial for representing an optimized logical plan, they should remain, with semantics clarified if necessary.

Yes, I think this was the conclusion we reached. I've been pondering how best to approach this PR and it hasn't been a priority recently. Part of the difficulty is that there are several different domains in which a join might need to exist, e.g. "velox" (or, perhaps more generally, "presto"-style), "duckdb" (mark-join style), and "unoptimized logical", so I think, to express all the nuances of behavior, we may end up spreading things across multiple relations/hints/constraints/extensions.

@gforsyth gforsyth added the spec label Jun 27, 2023
@EpsilonPrime EpsilonPrime added the awaiting-user-input This issue is waiting on further input from users label Aug 16, 2023
@westonpace
Member Author

Closing in favor of #585

@westonpace westonpace closed this Jan 3, 2024
Labels
awaiting-user-input This issue is waiting on further input from users spec
Successfully merging this pull request may close these issues.

What are the semantics of the ANTI join?