Skip to content

Performance of Query 63 in TPC-DS Benchmark #136350

Open
@bajinsheng

Description

@bajinsheng

Describe the problem

For query 63 in TPC-DS benchmark:

SELECT
 *
FROM
 (
  SELECT
   i_manager_id,
   sum(ss_sales_price) AS sum_sales,
   avg(sum(ss_sales_price)) OVER (
    PARTITION BY i_manager_id
   )
    AS avg_monthly_sales
  FROM
   item, store_sales, date_dim, store
  WHERE
   ss_item_sk = i_item_sk
   AND ss_sold_date_sk = d_date_sk
   AND ss_store_sk = s_store_sk
   AND d_month_seq
    IN (
      1222,
      1222 + 1,
      1222 + 2,
      1222 + 3,
      1222 + 4,
      1222 + 5,
      1222 + 6,
      1222 + 7,
      1222 + 8,
      1222 + 9,
      1222 + 10,
      1222 + 11
     )
   AND (
     (
      i_category
      IN (
        'Books',
        'Children',
        'Electronics'
       )
      AND i_class
       IN (
         'personal',
         'portable',
         'reference',
         'self-help'
        )
      AND i_brand
       IN (
         'scholaramalgamalg #14',
         'scholaramalgamalg #7',
         'exportiunivamalg #9',
         'scholaramalgamalg #9'
        )
     )
     OR (
       i_category
       IN ('Women', 'Music', 'Men')
       AND i_class
        IN (
          'accessories',
          'classical',
          'fragrances',
          'pants'
         )
       AND i_brand
        IN (
          'amalgimporto #1',
          'edu packscholar #1',
          'exportiimporto #1',
          'importoamalg #1'
         )
      )
    )
  GROUP BY
   i_manager_id, d_moy
 )
  AS tmp1
WHERE
 CASE
 WHEN avg_monthly_sales > 0
 THEN abs(sum_sales - avg_monthly_sales)
 / avg_monthly_sales
 ELSE NULL
 END
 > 0.1
ORDER BY
 i_manager_id, avg_monthly_sales, sum_sales
LIMIT
 100;

Its execution time is 732ms, as shown in its query plan:
origin.txt

I found that enforcing the following IF code block brings a significant performance improvement due to a change in the estimated cost for span:

diff --git a/pkg/sql/opt/xform/coster.go b/pkg/sql/opt/xform/coster.go
index ff7609c89ae..0481c04006b 100644
--- a/pkg/sql/opt/xform/coster.go
+++ b/pkg/sql/opt/xform/coster.go
@@ -846,9 +846,7 @@ func (c *coster) computeScanCost(scan *memo.ScanExpr, required *physical.Require
        // the scan. See issue #68556.
        baseCost += c.largeCardinalityCostPenalty(scan.Relational().Cardinality, rowCount)
 
-       if required.LimitHint != 0 {
                rowCount = math.Min(rowCount, required.LimitHint)
-       }
 
        cost := baseCost + memo.Cost(rowCount)*(seqIOCostFactor+perRowCost)
 

Its execution time is reduced to 298ms, as shown in the new query plan:
derived.txt

I wonder whether it is possible to change any logic here to estimate the cost more accurately.

To Reproduce
I attached my entire data folder here: https://drive.google.com/file/d/1IVDVYpzBxptBEjJ_IT_gCfAes4JEzEQQ/view?usp=sharing

tar -zxvf cockroachdb_tpcds_1g.tar.gz 
cockroach start-single-node --insecure --store=/path/to/data --listen-addr=0.0.0.0:36257 --sql-addr=0.0.0.0:26257

Then execute the above query on the this version and patched version of CockroachDB to compare the performance.

Environment:

  • CockroachDB version [dcb0d27]
  • Server OS: [Ubuntu]
  • Client app [psql "postgresql://root@127.0.0.1:26257/tpcds"]

Jira issue: CRDB-44984

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-sql-optimizerSQL logical planning and optimizations.C-performancePerf of queries or internals. Solution not expected to change functional behavior.O-communityOriginated from the communityT-sql-queriesSQL Queries TeamX-blathers-triagedblathers was able to find an ownerbranch-masterFailures and bugs on the master branch.

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions