Optimizer produces a group by with duplicated columns #3056

ggadon · 2024-02-29T15:35:58Z

Use TPC-H query 18: https://github.com/dragansah/tpch-dbgen/blob/master/queries/18.sql

import sqlglot
q = sqlglot.parse_one(QUERY)  # QUERY is the SQL text of the linked query

Use the following schema (from tpch):

SCHEMA_DICT = {
    "REGION": {
        "R_REGIONKEY": "SERIAL PRIMARY KEY",
        "R_NAME": "CHAR(25)",
        "R_COMMENT": "VARCHAR(152)"
    },
    "NATION": {
        "N_NATIONKEY": "SERIAL PRIMARY KEY",
        "N_NAME": "CHAR(25)",
        "N_REGIONKEY": "BIGINT NOT NULL",
        "N_COMMENT": "VARCHAR(152)"
    },
    "LINEITEM": {
        "L_ORDERKEY": "BIGINT NOT NULL",
        "L_PARTKEY": "BIGINT NOT NULL",
        "L_SUPPKEY": "BIGINT NOT NULL",
        "L_LINENUMBER": "INTEGER",
        "L_QUANTITY": "DECIMAL",
        "L_EXTENDEDPRICE": "DECIMAL",
        "L_DISCOUNT": "DECIMAL",
        "L_TAX": "DECIMAL",
        "L_RETURNFLAG": "CHAR(1)",
        "L_LINESTATUS": "CHAR(1)",
        "L_SHIPDATE": "DATE",
        "L_COMMITDATE": "DATE",
        "L_RECEIPTDATE": "DATE",
        "L_SHIPINSTRUCT": "CHAR(25)",
        "L_SHIPMODE": "CHAR(10)",
        "L_COMMENT": "VARCHAR(44)",
    },
    "ORDERS": {
        "O_ORDERKEY": "SERIAL PRIMARY KEY",
        "O_CUSTKEY": "BIGINT NOT NULL",
        "O_ORDERSTATUS": "CHAR(1)",
        "O_TOTALPRICE": "DECIMAL",
        "O_ORDERDATE": "DATE",
        "O_ORDERPRIORITY": "CHAR(15)",
        "O_CLERK": "CHAR(15)",
        "O_SHIPPRIORITY": "INTEGER",
        "O_COMMENT": "VARCHAR(79)"
    },
    "CUSTOMER": {
        "C_CUSTKEY": "SERIAL PRIMARY KEY",
        "C_NAME": "VARCHAR(25)",
        "C_ADDRESS": "VARCHAR(40)",
        "C_NATIONKEY": "BIGINT NOT NULL",
        "C_PHONE": "CHAR(15)",
        "C_ACCTBAL": "DECIMAL",
        "C_MKTSEGMENT": "CHAR(10)",
        "C_COMMENT": "VARCHAR(117)"
    }
}

Then use the optimizer:

from sqlglot.optimizer import optimize
print(optimize(q, SCHEMA_DICT).sql(pretty=True))

The output is:

WITH "_u_0" AS (
  SELECT
    "lineitem"."l_orderkey" AS "l_orderkey"
  FROM "lineitem" AS "lineitem"
  GROUP BY
    "lineitem"."l_orderkey",
    "lineitem"."l_orderkey"
  HAVING
    SUM("lineitem"."l_quantity") > 10
)
SELECT
  "customer"."c_custkey" AS "c_custkey",
  "orders"."o_orderkey" AS "o_orderkey",
  "orders"."o_orderdate" AS "o_orderdate",
  "orders"."o_totalprice" AS "o_totalprice",
  SUM("lineitem"."l_quantity") AS "_col_4"
FROM "customer" AS "customer"
JOIN "orders" AS "orders"
  ON "customer"."c_custkey" = "orders"."o_custkey"
LEFT JOIN "_u_0" AS "_u_0"
  ON "_u_0"."l_orderkey" = "orders"."o_orderkey"
JOIN "lineitem" AS "lineitem"
  ON "lineitem"."l_orderkey" = "orders"."o_orderkey"
WHERE
  NOT "_u_0"."l_orderkey" IS NULL
GROUP BY
  "customer"."c_name",
  "customer"."c_custkey",
  "orders"."o_orderkey",
  "orders"."o_orderdate",
  "orders"."o_totalprice"
ORDER BY
  "o_totalprice" DESC,
  "o_orderdate"

What I don't understand is why the GROUP BY inside the "_u_0" CTE lists "lineitem"."l_orderkey" twice.

Thanks!


georgesittas · 2024-02-29T16:04:46Z

Interesting, thanks for reporting - I'll take a look.

tobymao self-assigned this Feb 29, 2024

tobymao added a commit that referenced this issue Feb 29, 2024

fix!: handle unnesting groups closes #3056

964b43c

tobymao closed this as completed in 08bafbd Feb 29, 2024
