feat: subquery to join case 0 #17617

Draft
wants to merge 1 commit into
base: main

KKould
Member

@KKould KKould commented Mar 18, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This rule comes from the subquery optimization mentioned in the WeTune paper: https://ipads.se.sjtu.edu.cn:1312/opensource/wetune/-/blob/main/wtune_data/issues/issues?ref_type=heads#L8

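It can remove the redundant scan of the left (outer) table when an IN subquery already joins against that same table. In the example below, topics is scanned only once: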

explain
SELECT
  COUNT(topics.id)
FROM
  topics
WHERE
  id IN (
    SELECT
      topic_id
    FROM
      posts AS p
      INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE
      p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
  )

-[ EXPLAIN ]-----------------------------------
AggregateFinal
├── output columns: [COUNT(topics.id) (#45)]
├── group by: []
├── aggregate functions: [count()]
├── estimated rows: 1.00
└── AggregatePartial
    ├── group by: []
    ├── aggregate functions: [count()]
    ├── estimated rows: 1.00
    └── AggregateFinal
        ├── output columns: [p.topic_id (#48)]
        ├── group by: [topic_id]
        ├── aggregate functions: []
        ├── estimated rows: 0.00
        └── AggregatePartial
            ├── group by: [topic_id]
            ├── aggregate functions: []
            ├── estimated rows: 0.00
            └── HashJoin
                ├── output columns: [p.topic_id (#48)]
                ├── join type: INNER
                ├── build keys: [t2.id (#97)]
                ├── probe keys: [p.topic_id (#48)]
                ├── keys is null equal: [false]
                ├── filters: [t2.user_id (#104) <> p.user_id (#47)]
                ├── estimated rows: 0.00
                ├── TableScan(Build)
                │   ├── table: default.public.topics
                │   ├── output columns: [id (#97), user_id (#104)]
                │   ├── read rows: 0
                │   ├── read size: 0
                │   ├── partitions total: 0
                │   ├── partitions scanned: 0
                │   ├── push downs: [filters: [], limit: NONE]
                │   └── estimated rows: 0.00
                └── Filter(Probe)
                    ├── output columns: [p.user_id (#47), p.topic_id (#48)]
                    ├── filters: [is_true(p.user_id (#47) = 9627), NOT is_not_null(p.deleted_at (#57))]
                    ├── estimated rows: 0.00
                    └── TableScan
                        ├── table: default.public.posts
                        ├── output columns: [user_id (#47), topic_id (#48), deleted_at (#57)]
                        ├── read rows: 0
                        ├── read size: 0
                        ├── partitions total: 0
                        ├── partitions scanned: 0
                        ├── push downs: [filters: [and_filters(posts.user_id (#47) = 9627, NOT is_not_null(posts.deleted_at (#57)))], limit: NONE]
                        └── estimated rows: 0.00
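
Read as SQL, the plan above groups the join result by topic_id and counts the groups, without scanning topics a second time. The following is a minimal sketch of that equivalence, not Databend code: it uses SQLite through Python's sqlite3 module purely for illustration, with a hypothetical two-table schema and made-up rows. It assumes topics.id is unique (declared here as a primary key); without that assumption the two queries could return different counts.

```python
import sqlite3

# Shared body of the IN subquery, copied from the PR description.
SUBQUERY_BODY = """
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
"""

# Original form: scans topics once outside and once inside the subquery.
ORIGINAL = f"""
    SELECT COUNT(topics.id) FROM topics
    WHERE id IN (SELECT topic_id {SUBQUERY_BODY})
"""

# Hand-written equivalent of the optimized plan (an assumption of this sketch):
# deduplicate topic_id with GROUP BY, then count, with a single scan of topics.
REWRITTEN = f"""
    SELECT COUNT(topic_id) FROM (
        SELECT p.topic_id AS topic_id {SUBQUERY_BODY}
        GROUP BY p.topic_id
    )
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topics (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            user_id INTEGER,
            topic_id INTEGER,
            deleted_at TEXT
        );
        INSERT INTO topics VALUES (1, 100), (2, 9627), (3, 200), (4, 300);
        INSERT INTO posts VALUES
            (1, 9627, 1, NULL),          -- matches topic 1
            (2, 9627, 1, NULL),          -- duplicate match for topic 1
            (3, 9627, 2, NULL),          -- same author as topic 2, filtered out
            (4, 9627, 3, '2025-01-01'),  -- deleted, filtered out
            (5, 42, 4, NULL),            -- other user, filtered out
            (6, 9627, 3, NULL);          -- matches topic 3
    """)
    return conn


def count(conn: sqlite3.Connection, sql: str) -> int:
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]


conn = build_db()
print(count(conn, ORIGINAL), count(conn, REWRITTEN))
```

Both queries return the same count on this data. The duplicate post for topic 1 is why the GROUP BY step is needed before counting.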

Two issues with this optimization are still waiting to be resolved:

  1. When the column required by the downstream node is not the child expression of the subquery, a schema mapping error occurs. For example, this query needs topics.user_id rather than topics.id:
explain SELECT COUNT(topics.user_id)
FROM topics
WHERE id IN (SELECT topic_id
             FROM posts AS p
                      INNER JOIN topics AS t2 ON t2.id = p.topic_id
             WHERE p.deleted_at IS NULL
               AND t2.user_id <> p.user_id
               AND p.user_id = 9627);
  2. When the left side contains multiple tables, or in more complex situations such as a LIMIT, the rewrite is not handled yet.
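
The first issue can be illustrated outside Databend. The sketch below again uses SQLite via Python's sqlite3 with a hypothetical schema and made-up rows, and assumes topics.id is unique. The "naive" rewrite only exposes topic_id, so the outer COUNT(user_id) has nothing to bind to. The "mapped" form is one possible shape of a fix, not the approach this PR takes: it carries t2.user_id through the GROUP BY so the outer column can be mapped onto it. The SQLite error is only an analogy for the schema mapping error described above.

```python
import sqlite3

# Shared body of the IN subquery, copied from the PR description.
SUBQUERY_BODY = """
    FROM posts AS p
    INNER JOIN topics AS t2 ON t2.id = p.topic_id
    WHERE p.deleted_at IS NULL
      AND t2.user_id <> p.user_id
      AND p.user_id = 9627
"""

# Query from issue 1: the aggregate needs topics.user_id, not topics.id.
ORIGINAL = f"""
    SELECT COUNT(topics.user_id) FROM topics
    WHERE id IN (SELECT topic_id {SUBQUERY_BODY})
"""

# Naive rewrite: the derived table exposes topic_id only, so user_id is missing.
NAIVE = f"""
    SELECT COUNT(user_id) FROM (
        SELECT p.topic_id AS topic_id {SUBQUERY_BODY}
        GROUP BY p.topic_id
    )
"""

# Hypothetical mapped rewrite: also project t2.user_id (assumes topics.id is
# unique, so user_id is determined by id and the grouping stays one row per topic).
MAPPED = f"""
    SELECT COUNT(user_id) FROM (
        SELECT t2.id AS id, t2.user_id AS user_id {SUBQUERY_BODY}
        GROUP BY t2.id, t2.user_id
    )
"""


def make_conn() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topics (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            user_id INTEGER,
            topic_id INTEGER,
            deleted_at TEXT
        );
        INSERT INTO topics VALUES (1, 100), (2, 9627), (3, 200);
        INSERT INTO posts VALUES
            (1, 9627, 1, NULL),
            (2, 9627, 1, NULL),
            (3, 9627, 2, NULL),
            (4, 9627, 3, NULL);
    """)
    return conn


def naive_fails(conn: sqlite3.Connection) -> bool:
    """Return True if the naive rewrite cannot resolve the user_id column."""
    try:
        conn.execute(NAIVE)
    except sqlite3.OperationalError:
        return True
    return False


conn = make_conn()
print(naive_fails(conn))
print(conn.execute(ORIGINAL).fetchone()[0] == conn.execute(MAPPED).fetchone()[0])
```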

Here are the table-creation statements and related SQL to help you test (please manually rename the .txt file to .sql):
subquery.txt

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould marked this pull request as draft March 18, 2025 02:40
Contributor

github-actions bot commented Mar 18, 2025

At least one test kind must be checked in the PR description.
@KKould please update it 🙏.

@github-actions github-actions bot added the pr-feature (this PR introduces a new feature to the codebase) label Mar 18, 2025