Open
Description
Describe the problem
For query 63 in TPC-DS benchmark:
SELECT
*
FROM
(
SELECT
i_manager_id,
sum(ss_sales_price) AS sum_sales,
avg(sum(ss_sales_price)) OVER (
PARTITION BY i_manager_id
)
AS avg_monthly_sales
FROM
item, store_sales, date_dim, store
WHERE
ss_item_sk = i_item_sk
AND ss_sold_date_sk = d_date_sk
AND ss_store_sk = s_store_sk
AND d_month_seq
IN (
1222,
1222 + 1,
1222 + 2,
1222 + 3,
1222 + 4,
1222 + 5,
1222 + 6,
1222 + 7,
1222 + 8,
1222 + 9,
1222 + 10,
1222 + 11
)
AND (
(
i_category
IN (
'Books',
'Children',
'Electronics'
)
AND i_class
IN (
'personal',
'portable',
'reference',
'self-help'
)
AND i_brand
IN (
'scholaramalgamalg #14',
'scholaramalgamalg #7',
'exportiunivamalg #9',
'scholaramalgamalg #9'
)
)
OR (
i_category
IN ('Women', 'Music', 'Men')
AND i_class
IN (
'accessories',
'classical',
'fragrances',
'pants'
)
AND i_brand
IN (
'amalgimporto #1',
'edu packscholar #1',
'exportiimporto #1',
'importoamalg #1'
)
)
)
GROUP BY
i_manager_id, d_moy
)
AS tmp1
WHERE
CASE
WHEN avg_monthly_sales > 0
THEN abs(sum_sales - avg_monthly_sales)
/ avg_monthly_sales
ELSE NULL
END
> 0.1
ORDER BY
i_manager_id, avg_monthly_sales, sum_sales
LIMIT
100;
Its execution time is 732ms, as shown in its query plan:
origin.txt
I found that enforcing the following IF code block brings a significant performance improvement due to a change in the estimated cost for span:
diff --git a/pkg/sql/opt/xform/coster.go b/pkg/sql/opt/xform/coster.go
index ff7609c89ae..0481c04006b 100644
--- a/pkg/sql/opt/xform/coster.go
+++ b/pkg/sql/opt/xform/coster.go
@@ -846,9 +846,7 @@ func (c *coster) computeScanCost(scan *memo.ScanExpr, required *physical.Require
// the scan. See issue #68556.
baseCost += c.largeCardinalityCostPenalty(scan.Relational().Cardinality, rowCount)
- if required.LimitHint != 0 {
rowCount = math.Min(rowCount, required.LimitHint)
- }
cost := baseCost + memo.Cost(rowCount)*(seqIOCostFactor+perRowCost)
Its execution time is reduced to 298ms, as shown in the new query plan:
derived.txt
I wonder whether it is possible to change any logic here to estimate the cost more accurately.
To Reproduce
I attached my entire data folder here: https://drive.google.com/file/d/1IVDVYpzBxptBEjJ_IT_gCfAes4JEzEQQ/view?usp=sharing
tar -zxvf cockroachdb_tpcds_1g.tar.gz
cockroach start-single-node --insecure --store=/path/to/data --listen-addr=0.0.0.0:36257 --sql-addr=0.0.0.0:26257
Then execute the above query on the this version and patched version of CockroachDB to compare the performance.
Environment:
- CockroachDB version [dcb0d27]
- Server OS: [Ubuntu]
- Client app [
psql "postgresql://root@127.0.0.1:26257/tpcds"
]
Jira issue: CRDB-44984
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Backlog