Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

L025 bug fix: stop incorrectly flagging on nested inner joins #3145

Merged
merged 5 commits into from
Apr 22, 2022

Conversation

tunetheweb
Copy link
Member

Brief summary of the change made

Fixes #3144

Are there any other side effects of this change that we should be aware of?

Adds a new, optional, parameter to recursive_crawl to allow you to provide a type that should not be recursed into.

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@codecov
Copy link

codecov bot commented Apr 21, 2022

Codecov Report

Merging #3145 (07c1f5e) into main (2c62def) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3145   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          165       165           
  Lines        12278     12279    +1     
=========================================
+ Hits         12278     12279    +1     
Impacted Files Coverage Δ
src/sqlfluff/core/parser/segments/base.py 100.00% <100.00%> (ø)
src/sqlfluff/dialects/dialect_ansi.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c62def...07c1f5e. Read the comment docs.

@@ -1460,7 +1460,10 @@ def get_eventual_aliases(self) -> List[Tuple[BaseSegment, AliasInfo]]:
buff.append((from_expression, alias))

# In some dialects, like TSQL, join clauses can have nested join clauses
for join_clause in self.recursive_crawl("join_clause"):
# recurse into them - but not if part of a sub-select statement (see #3144)
for join_clause in self.recursive_crawl(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're fixing the real issue here. It seems to me that we don't need to ignore select_statement. Instead, we need to detect that the on clause is using the alias, therefore it's not unused.

Do we know why the behavior changed from previous versions?

Copy link
Member Author

@tunetheweb tunetheweb Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was discussed more in this comment in #3144

Basically, previously it was this:

for join_clause in self.get_children("join_clause"):

That is, we only got the direct join children, rather than using recursive_crawl.

However #2993 was added to make it recursive to allow SQL like this:

    SELECT
        1 AS RegionCode
    FROM BA
    LEFT OUTER JOIN (
        I LEFT OUTER JOIN P
            ON I.Pcd = P.Iid
    ) ON BA.Iid = I.Bcd

Note that the outermost ON clause is referencing the inner table I. This is a specific syntax allowed in T-SQL (but no where else I don't think?), so the above code needed to change to be recursive_crawl instead of get_children.

Unfortunately there is no easier way of identifying just this case as there will be brackets, and there could be nested, nested calls. But recursive_crawl will find them so all good.

That solve that problem, but then broke it for this SQL - hence this PR to partially revert that:

select col1
from table1 as abg
inner join (
    select
        ceb.emailaddresskey,
        dac.accountkey
    from table2 as dac
    inner join table3 as ceb
        on ceb.col2 = dac.col2
) as abh
on abg.col1 = abh.col1

Here the subquery should not be accessible at all to the outer query, and only the results of that (as abh) as it's not that specific T-SQL instance above, which is the only one that should allow nesting. So effectively it should be a black box, and should look like this (at least while while the rule process the outer select - it'll get to the nested one later):

select col1
from table1 as abg
inner join (
    ---something we don't need to worry about now
) as abh
on abg.col1 = abh.col1

Then we process that inner join later in the rule.

That is what used to happen before #2993 , and then, while processing the outer select statement, it would see table references abg and abh and then uses of both those in the ON clause - great, no rule to fire!

Instead what happens (and why we need this PR) is we now pick up recursive "joins" so we ALSO pick up table reference ceb when we shouldn't. And there is no reference to ceb in the outer select, hence the rule fires complaining about that.

Note that we don't pick up the inner table reference dac btw cause that is an FROM clause and not a JOIN clause and the bug is specific to JOIN clauses. Again, showing that this is wrong to traverse into the inner select for just JOINs.

As the nested SELECT should be a black box (except in very specific use cases for T-SQL, when it's an nested JOIN), I've added functionality to not recurse into it anymore, if it is it's own select statement (which is exactly what happened before), but still recurse for anything else (so the original T-SQL bug is still resolved). There may be other instances we might need to do that, but that's the only one I can think of now.

The alternative is to gather ALL the table references for the whole query (even the inner, nested ones that should not be referenceable, and including dac in this example) and ALL the column references, and then decide which are used and which are not. But that is not the correct thing to do here, since the inner, nested table references are not be accessible outside of it.

Here's a BigQuery example showing the inner table canot be referenced in the outer select:

image

Only via the outer alias:

image

Does that explain it all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, but I'll leave it to @barrywhart to ✅ if he is happy with the explanation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

In general, visibility goes upwards, but not downwards.

@tunetheweb tunetheweb merged commit 4d2cd04 into sqlfluff:main Apr 22, 2022
@tunetheweb tunetheweb deleted the 3144-l025-bug-nested-inner-joins branch April 22, 2022 14:53
pwildenhain pushed a commit that referenced this pull request Apr 22, 2022
* L025 bug fix: stop incorrectly flagging on nested inner joins

* Comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

L025 flagged incorrectly for nested selects
3 participants