Optimizer produces a group by with duplicated columns #3056

Closed
ggadon opened this issue Feb 29, 2024 · 1 comment

ggadon commented Feb 29, 2024

  1. Parse TPC-H query 18 (https://github.com/dragansah/tpch-dbgen/blob/master/queries/18.sql), with its SQL text stored in QUERY:
import sqlglot

q = sqlglot.parse_one(QUERY)

Use the following schema (from tpch):

SCHEMA_DICT = {
    "REGION": {
        "R_REGIONKEY": "SERIAL PRIMARY KEY",
        "R_NAME": "CHAR(25)",
        "R_COMMENT": "VARCHAR(152)"
    },
    "NATION": {
        "N_NATIONKEY": "SERIAL PRIMARY KEY",
        "N_NAME": "CHAR(25)",
        "N_REGIONKEY": "BIGINT NOT NULL",
        "N_COMMENT": "VARCHAR(152)"
    },
    "LINEITEM": {
        "L_ORDERKEY": "BIGINT NOT NULL",
        "L_PARTKEY": "BIGINT NOT NULL",
        "L_SUPPKEY": "BIGINT NOT NULL",
        "L_LINENUMBER": "INTEGER",
        "L_QUANTITY": "DECIMAL",
        "L_EXTENDEDPRICE": "DECIMAL",
        "L_DISCOUNT": "DECIMAL",
        "L_TAX": "DECIMAL",
        "L_RETURNFLAG": "CHAR(1)",
        "L_LINESTATUS": "CHAR(1)",
        "L_SHIPDATE": "DATE",
        "L_COMMITDATE": "DATE",
        "L_RECEIPTDATE": "DATE",
        "L_SHIPINSTRUCT": "CHAR(25)",
        "L_SHIPMODE": "CHAR(10)",
        "L_COMMENT": "VARCHAR(44)",
    },
    "ORDERS": {
        "O_ORDERKEY": "SERIAL PRIMARY KEY",
        "O_CUSTKEY": "BIGINT NOT NULL",
        "O_ORDERSTATUS": "CHAR(1)",
        "O_TOTALPRICE": "DECIMAL",
        "O_ORDERDATE": "DATE",
        "O_ORDERPRIORITY": "CHAR(15)",
        "O_CLERK": "CHAR(15)",
        "O_SHIPPRIORITY": "INTEGER",
        "O_COMMENT": "VARCHAR(79)"
    },
    "CUSTOMER": {
        "C_CUSTKEY": "SERIAL PRIMARY KEY",
        "C_NAME": "VARCHAR(25)",
        "C_ADDRESS": "VARCHAR(40)",
        "C_NATIONKEY": "BIGINT NOT NULL",
        "C_PHONE": "CHAR(15)",
        "C_ACCTBAL": "DECIMAL",
        "C_MKTSEGMENT": "CHAR(10)",
        "C_COMMENT": "VARCHAR(117)"
    }
}
  2. Then run the optimizer:
from sqlglot.optimizer import optimize
print(optimize(q, SCHEMA_DICT).sql(pretty=True))

The output is:

WITH "_u_0" AS (
  SELECT
    "lineitem"."l_orderkey" AS "l_orderkey"
  FROM "lineitem" AS "lineitem"
  GROUP BY
    "lineitem"."l_orderkey",
    "lineitem"."l_orderkey"
  HAVING
    SUM("lineitem"."l_quantity") > 10
)
SELECT
  "customer"."c_custkey" AS "c_custkey",
  "orders"."o_orderkey" AS "o_orderkey",
  "orders"."o_orderdate" AS "o_orderdate",
  "orders"."o_totalprice" AS "o_totalprice",
  SUM("lineitem"."l_quantity") AS "_col_4"
FROM "customer" AS "customer"
JOIN "orders" AS "orders"
  ON "customer"."c_custkey" = "orders"."o_custkey"
LEFT JOIN "_u_0" AS "_u_0"
  ON "_u_0"."l_orderkey" = "orders"."o_orderkey"
JOIN "lineitem" AS "lineitem"
  ON "lineitem"."l_orderkey" = "orders"."o_orderkey"
WHERE
  NOT "_u_0"."l_orderkey" IS NULL
GROUP BY
  "customer"."c_name",
  "customer"."c_custkey",
  "orders"."o_orderkey",
  "orders"."o_orderdate",
  "orders"."o_totalprice"
ORDER BY
  "o_totalprice" DESC,
  "o_orderdate"

Honestly, I don't understand why the GROUP BY inside the "_u_0" CTE lists "lineitem"."l_orderkey" twice.
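For anyone who wants to check for this programmatically, here is a rough sketch of how the repeated key could be detected. The helper names `duplicated_keys` and `duplicated_group_keys` are made up for illustration and are not part of sqlglot. The second helper assumes it is handed a sqlglot expression such as the result of `optimize(...)`, and walks its `exp.Group` nodes. This only detects the symptom; it is not the fix the maintainers applied.

```python
from collections import Counter


def duplicated_keys(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]


def duplicated_group_keys(expression):
    """Map each GROUP BY clause in a sqlglot tree to its repeated keys.

    Assumption: `expression` is a sqlglot Expression, e.g. optimize(q, SCHEMA_DICT).
    """
    # Imported lazily so duplicated_keys() stays usable without sqlglot installed.
    from sqlglot import exp

    report = {}
    for group in expression.find_all(exp.Group):
        keys = [key.sql() for key in group.expressions]
        repeated = duplicated_keys(keys)
        if repeated:
            report[group.sql()] = repeated
    return report


# Keys copied by hand from the "_u_0" CTE in the output above.
cte_keys = ['"lineitem"."l_orderkey"', '"lineitem"."l_orderkey"']
print(duplicated_keys(cte_keys))
```

With the optimizer output from this report, `duplicated_group_keys(optimize(q, SCHEMA_DICT))` would be expected to flag only the GROUP BY inside "_u_0", since the outer GROUP BY has five distinct keys.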

Thanks!

georgesittas (Collaborator) commented

Interesting, thanks for reporting - I'll take a look.

tobymao self-assigned this Feb 29, 2024
tobymao added a commit that referenced this issue Feb 29, 2024