Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal (frontend): Prune duplicate columns #8277

Closed
jon-chuang opened this issue Mar 1, 2023 · 2 comments
Closed

Proposal (frontend): Prune duplicate columns #8277

jon-chuang opened this issue Mar 1, 2023 · 2 comments

Comments

@jon-chuang
Copy link
Contributor

jon-chuang commented Mar 1, 2023

Similar to ColPrunable, which prunes top down based on output columns, we should have a trait DuplicateColPrunable which prunes bottom up based on duplicate columns originating in various operators.

This idea has been proposed several times (references: ?) but has not been implemented. However, ExprRewritable machinery makes it easier to solve this problem easily.

Motivations

  1. Handle aliasing/Col mapping non-injectivity: Handle non-injective column mappings in a non-hacky way. Both (fix(frontend): Fix non-injectivity of ColIndexMapping for LogicalJoin #8267, fix(optimizer): handle non-injective changes of logical_rewrite_for_stream #8269) do not fix the root cause, which is properly handling non-injective col mappings and efficiently dealing with aliasing.
  2. Performance: It has been shown that removing duplicate agg calls can lead to performance improvements: perf: nexmark q17 #7351 (comment). Removing duplicate columns can merge agg calls that are falsely marked as distinct due to aliasing, in addition to being expected to reduce other forms of unnecessary work done in other operators. Furthermore, we can save storage.

Notes:

  1. Duplicate columns can originate in various operator's:
    • output_indices
    • Project operator equal exprs
    • eq keys of inner joins
    • Duplicate agg calls (While aggs may no longer be a source after perf(agg): reuse existing agg calls while building LogicalAgg #8200, pruning duplicate columns in Agg's input may still cause us to handle duplicate agg calls after inputs have been de-aliased. See below.)
    • naively adding stream key
  2. Whenever we deduplicate, we will pass col_index_mapping to the parent. The col_index_mapping need not be injective; an old column index can map to a single column multiple times. The upstream operator needs to rewrite all its expressions as well as depend on changed input schema.

Examples:

Prune based on duplicate output_indices:

Project { exprs: [InputRef(0), InputRef(1)] }
  Join { output: [0, 0] }
->
Project { exprs: [InputRef(0), InputRef(0)] }
  Join { output: [0] }

Project can also be pruned

Join { eq_keys: [(0, 5)],  other_cond: ($1 > 100:Int64) }
  Project { exprs: [InputRef(0), InputRef(0)] }
->
Join { eq_keys: [(0, 5)],  other_cond: ($0 > 100:Int64) }
  Project { exprs: [InputRef(0)] }

Pruning eq keys in inner joins

Project { exprs: [$0 * 2, $1] }
  Join { eq_keys: [(0, 1)], output: [0, 1] }
    Scan { [x] }
    Scan { [x] }
-> /* Internal step to rewrite output indices */
Project { exprs: [$0 * 2, $1] }
  Join { eq_keys: [(0, 1)], output: [0, 0] }
    Scan { [x] }
    Scan { [x] }
->
Project { exprs: [$0 * 2, $0] }
  Join { eq_keys: [(0, 1)], output: [0] }
    Scan { [x] }
    Scan { [x] }
->

Pruning agg calls falsely considered unique

Agg { agg_calls: [sum($0), sum($1)] }
  Project { exprs: [InputRef(0), InputRef(0)] }
->
Agg { agg_calls: [sum($0), sum($0)] }
  Project { exprs: [InputRef(0)] }
->
Agg { agg_calls: [sum($0)] }
  Project { exprs: [InputRef(0)] }
->

Handling of Materialized View

MaterializedView { output: [0, 1] }
   Join { .., output: [0, 0] }
-> /* In this case, we prefer duplicate expressions to always be absorbed by Project */
MaterializedView { output: [0, 1] }
   Project { exprs: [$0, $0] }
     Join { .., output: [0] }
-> 

Implementation details:

  1. In order to rewrite expressions' InputRef to the alias, we will use ExprRewritable to walk all the expressions in a given parent operator.
  2. Since const_eval can reveal new expressions that are equivalent, perhaps we should run this optimization pass after const_eval

Ease of Implementation

The generic implementation is expected to be simple:

impl DuplicateColPrunable for Logical_ {
  fn prune_duplicate_cols(self) -> (PlanRef, ColIndexMapping) {
    let (new_input, map) = self.inputs.prune_duplicate_cols();
    let plan = self.rewrite_with_inputs(new_input, map);
    let mut new plan = plan.rewrite_exprs(&mut InputRefRewriter::new(map));
    // Now that we have propagated both schema changes in inputs as well as 
    // de-aliased any input cols, we check if we have any duplicate cols that can be removed.
    if let Some(rewritten, mapping) =  new_plan.rewrite_without_duplicates() {
      (rewritten, mapping)
    } else {
      (new_plan, identity)
  }
}

Alternately, to enable Logical_ to handle rewriting input with non-injective mapping more generally, we should move plan.rewrite_exprs(&mut InputRefRewriter::new(map)); into rewrite_with_inputs.

Questions:

  1. How do we deal with LogicalShare operator? A: we need to propagate the result of .prune_duplicate_cols() to each parent simultaneously, so that each parent can propagate pruning of duplicated cols. We can cache this result in a DuplicateColPruningContext, similar to ColPruningContext.
@fuyufjh
Copy link
Contributor

fuyufjh commented Apr 6, 2023

The idea sounds good. Do we have any plan to carry it out someday?

@github-actions
Copy link
Contributor

github-actions bot commented Jun 6, 2023

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@fuyufjh fuyufjh closed this as not planned Won't fix, can't repro, duplicate, stale Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants