Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Diff Checker unexpected output #3581

Closed
1fanwang opened this issue Jun 3, 2024 · 2 comments
Closed

Semantic Diff Checker unexpected output #3581

1fanwang opened this issue Jun 3, 2024 · 2 comments

Comments

@1fanwang
Copy link

1fanwang commented Jun 3, 2024

Before you file an issue
Hi team, I am trying the diff part of sqlglot for the first time, have a few questions on some experiments I have done, i.e. for a few situations, the diff tool is showing different but was expecting same, wondering if these are expected

Fully reproducible code snippet

  1. Expected semantically cosmetic changes in the Trino SQL. New SQL have aliases for the table names and fully qualified names for the selected columns -> Expect this to be recognized as semantically same, but the diff tool is showing different
>>> diff(parse_one("SELECT colA from default.foo"), parse_one("SELECT foo.colA from default.foo AS foo"))
[Remove(expression=Table(
  this=Identifier(this=foo, quoted=False),
  db=Identifier(this=default, quoted=False))), Remove(expression=From(
  this=Table(
    this=Identifier(this=foo, quoted=False),
    db=Identifier(this=default, quoted=False)))), Insert(expression=From(
  this=Table(
    this=Identifier(this=foo, quoted=False),
    db=Identifier(this=default, quoted=False),
    alias=TableAlias(
      this=Identifier(this=foo, quoted=False))))), Insert(expression=Table(
  this=Identifier(this=foo, quoted=False),
  db=Identifier(this=default, quoted=False),
  alias=TableAlias(
    this=Identifier(this=foo, quoted=False)))), Insert(expression=TableAlias(
  this=Identifier(this=foo, quoted=False))), Keep(source=Select(
  expressions=[
    Column(
      this=Identifier(this=colA, quoted=False))],
  from=From(
    this=Table(
      this=Identifier(this=foo, quoted=False),
      db=Identifier(this=default, quoted=False)))), target=Select(
  expressions=[
    Column(
      this=Identifier(this=colA, quoted=False),
      table=Identifier(this=foo, quoted=False))],
  from=From(
    this=Table(
      this=Identifier(this=foo, quoted=False),
      db=Identifier(this=default, quoted=False),
      alias=TableAlias(
        this=Identifier(this=foo, quoted=False)))))), Update(source=Column(
  this=Identifier(this=colA, quoted=False)), target=Column(
  this=Identifier(this=colA, quoted=False),
  table=Identifier(this=foo, quoted=False)))]
>>> d = diff(parse_one("SELECT colA from default.foo"), parse_one("SELECT foo.colA from default.foo AS foo"))
>>> sum(0 if isinstance(e, Keep) else 1 for e in d)
6
  1. Expected semantically same for: In the new SQL, strings are wrapped with a CAST operator:
>>> d1 = diff(parse_one("SELECT 'Sample String' AS 'strCol' from default.foo"), parse_one("SELECT CAST('Sample String' AS VARCHAR(X)) AS 'strCol' from default.foo AS foo"))
>>> d1
[Remove(expression=Select(
  expressions=[
    Alias(
      this=Literal(this=Sample String, is_string=True),
      alias=Identifier(this=strCol, quoted=True))],
  from=From(
    this=Table(
      this=Identifier(this=foo, quoted=False),
      db=Identifier(this=default, quoted=False))))), Remove(expression=From(
  this=Table(
    this=Identifier(this=foo, quoted=False),
    db=Identifier(this=default, quoted=False)))), Remove(expression=Table(
  this=Identifier(this=foo, quoted=False),
  db=Identifier(this=default, quoted=False))), Insert(expression=Select(
  expressions=[
    Alias(
      this=Cast(
        this=Literal(this=Sample String, is_string=True),
        to=DataType(
          this=Type.VARCHAR,
          expressions=[
            DataTypeParam(
              this=Var(this=X))],
          nested=False)),
      alias=Identifier(this=strCol, quoted=True))],
  from=From(
    this=Table(
      this=Identifier(this=foo, quoted=False),
      db=Identifier(this=default, quoted=False),
      alias=TableAlias(
        this=Identifier(this=foo, quoted=False)))))), Insert(expression=From(
  this=Table(
    this=Identifier(this=foo, quoted=False),
    db=Identifier(this=default, quoted=False),
    alias=TableAlias(
      this=Identifier(this=foo, quoted=False))))), Insert(expression=Cast(
  this=Literal(this=Sample String, is_string=True),
  to=DataType(
    this=Type.VARCHAR,
    expressions=[
      DataTypeParam(
        this=Var(this=X))],
    nested=False))), Insert(expression=TableAlias(
  this=Identifier(this=foo, quoted=False))), Insert(expression=DataTypeParam(
  this=Var(this=X))), Insert(expression=Table(
  this=Identifier(this=foo, quoted=False),
  db=Identifier(this=default, quoted=False),
  alias=TableAlias(
    this=Identifier(this=foo, quoted=False)))), Insert(expression=DataType(
  this=Type.VARCHAR,
  expressions=[
    DataTypeParam(
      this=Var(this=X))],
  nested=False)), Insert(expression=Var(this=X)), Keep(source=Literal(this=Sample String, is_string=True), target=Literal(this=Sample String, is_string=True)), Move(expression=Literal(this=Sample String, is_string=True)), Keep(source=Alias(
  this=Literal(this=Sample String, is_string=True),
  alias=Identifier(this=strCol, quoted=True)), target=Alias(
  this=Cast(
    this=Literal(this=Sample String, is_string=True),
    to=DataType(
      this=Type.VARCHAR,
      expressions=[
        DataTypeParam(
          this=Var(this=X))],
      nested=False)),
  alias=Identifier(this=strCol, quoted=True)))]
>>> sum(0 if isinstance(e, Keep) else 1 for e in d1)
12

Official Documentation
Please include links to official SQL documentation related to your issue.
https://github.com/tobymao/sqlglot/blob/main/posts/sql_diff.md

@tobymao
Copy link
Owner

tobymao commented Jun 3, 2024

this is expected, the differ only diffs asts. if you want to normalize the sql before diffing, try optimizing, specifically, the qualify function

@tobymao tobymao closed this as not planned Won't fix, can't repro, duplicate, stale Jun 3, 2024
@1fanwang
Copy link
Author

1fanwang commented Jun 3, 2024

Thanks @tobymao!

Was able to test it as expected

>>> schema = {"foo": {"colA": "INT"}}
>>> source = "SELECT colA from default.foo"
>>> target = "SELECT foo.colA from default.foo AS foo"
>>> 
>>> optimized_source = optimize(parse_one(source), schema=schema)
>>> optimized_target = optimize(parse_one(target), schema=schema)
>>> edit_script = diff(optimized_source, optimized_target)
>>> 
>>> # Calculate the difference in edits, excluding 'Keep' operations
>>> edit_difference = sum(0 if isinstance(e, Keep) else 1 for e in edit_script)
>>> print(edit_difference)
0
>>> schema = {"foo": {"dummy_col": "VARCHAR"}}
>>> 
>>> # SQL statements to be compared
>>> sql1 = "SELECT 'Sample String' AS 'strCol' from default.foo"
>>> sql2 = "SELECT CAST('Sample String' AS VARCHAR(X)) AS 'strCol' from default.foo AS foo"
>>> 
>>> # Parsing the SQL statements into ASTs
>>> optimized_source = optimize(parse_one(sql1), schema=schema)
>>> optimized_target = optimize(parse_one(sql2), schema=schema)
>>> 
>>> # Computing the differences between the two optimized ASTs
>>> edit_script = diff(optimized_source, optimized_target)
>>> 
>>> # Calculate the difference in edits, excluding 'Keep' operations
>>> edit_difference = sum(0 if isinstance(e, Keep) else 1 for e in edit_script)
>>> 
>>> # Print the result
>>> print("Number of changes:", edit_difference)
Number of changes: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants