Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lineage is incorrect for 2 unnest #2745

Closed
wizardxz opened this issue Jan 1, 2024 · 3 comments
Closed

lineage is incorrect for 2 unnest #2745

wizardxz opened this issue Jan 1, 2024 · 3 comments

Comments

@wizardxz
Copy link

wizardxz commented Jan 1, 2024

Before you file an issue

  • Make sure you specify the "read" dialect eg. parse_one(sql, read="spark")
  • Check if the issue still exists on main

Fully reproducible code snippet

>>> lineage(
...   "b",
...   """
...   WITH _data AS (
...     SELECT array[
...         CAST(ROW(1, 2) AS ROW(a INTEGER, b INTEGER)),
...     ] AS col1,
...     array[
...         CAST(ROW(1, 2) AS ROW(c INTEGER, d INTEGER)),
...     ] AS col2
... )
... SELECT b
... FROM _data
... CROSS JOIN UNNEST(col1)
... CROSS JOIN UNNEST(col2)
...   """,
...   dialect="trino"
... )
Node(name='b', expression=Alias(
  this=Column(
    this=Identifier(this=b, quoted=False)),
  alias=Identifier(this=b, quoted=False)), source=Select(
  expressions=[
    Alias(
      this=Column(
        this=Identifier(this=b, quoted=False)),
      alias=Identifier(this=b, quoted=False))],
  from=From(
    this=Table(
      this=Identifier(this=_data, quoted=False))),
  joins=[
    Join(
      this=Unnest(
        expressions=[
          Column(
            this=Identifier(this=col1, quoted=False),
            table=Identifier(this=_data, quoted=False))],
        alias=TableAlias(
          this=Identifier(this=_q_0, quoted=False))),
      kind=CROSS),
    Join(
      this=Unnest(
        expressions=[
          Column(
            this=Identifier(this=col2, quoted=False),
            table=Identifier(this=_data, quoted=False))],
        alias=TableAlias(
          this=Identifier(this=_q_1, quoted=False))),
      kind=CROSS)],
  with=With(
    expressions=[
      CTE(
        this=Select(
          expressions=[
            Alias(
              this=Array(
                expressions=[
                  Cast(
                    this=Struct(
                      expressions=[
                        Literal(this=1, is_string=False),
                        Literal(this=2, is_string=False)]),
                    to=DataType(
                      this=Type.STRUCT,
                      expressions=[
                        ColumnDef(
                          this=Identifier(this=a, quoted=False),
                          kind=DataType(this=Type.INT, nested=False, prefix=False)),
                        ColumnDef(
                          this=Identifier(this=b, quoted=False),
                          kind=DataType(this=Type.INT, nested=False, prefix=False))],
                      nested=True,
                      prefix=False))]),
              alias=Identifier(this=col1, quoted=False)),
            Alias(
              this=Array(
                expressions=[
                  Cast(
                    this=Struct(
                      expressions=[
                        Literal(this=1, is_string=False),
                        Literal(this=2, is_string=False)]),
                    to=DataType(
                      this=Type.STRUCT,
                      expressions=[
                        ColumnDef(
                          this=Identifier(this=c, quoted=False),
                          kind=DataType(this=Type.INT, nested=False, prefix=False)),
                        ColumnDef(
                          this=Identifier(this=d, quoted=False),
                          kind=DataType(this=Type.INT, nested=False, prefix=False))],
                      nested=True,
                      prefix=False))]),
              alias=Identifier(this=col2, quoted=False))]),
        alias=TableAlias(
          this=Identifier(this=_data, quoted=False)))])), downstream=[Node(name='b', expression=Placeholder(), source=Placeholder(), downstream=[], alias='')], alias='')

The expected the result should be

b -> _q_0.b -> _data.col1

Official Documentation
Please include links to official SQL documentation related to your issue.

@tobymao
Copy link
Owner

tobymao commented Jan 2, 2024

not planned, feel free to submit a PR if you want to tackle this

@tobymao tobymao closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2024
@wizardxz
Copy link
Author

wizardxz commented Jan 2, 2024

@tobymao
I think it is not easy because we need to implement type checking or validation and make the column lineage based on Logical plan, instead of AST. This is a fundamental change but the right direction.

@tobymao
Copy link
Owner

tobymao commented Jan 2, 2024

we don't need logical plan for this, we have type checking already based on ast

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants