[BACKPORT 2.20][#12676] YSQL: Add Storage Rows Scanned to output of E… · yugabyte/yugabyte-db@b5c632c

Commit

[BACKPORT 2.20][#12676] YSQL: Add Storage Rows Scanned to output of E…

…XPLAIN(ANALYZE, DIST).

Summary:
Original commit: a302b50 / D31111
This revision adds Storage Row statistics to the output of `EXPLAIN (ANALYZE, DIST)`.
These stats help explain the selectivity of remote filters and index conditions, thereby allowing the user to distinguish between the work done by the storage layer vs the query layer.

Consider this example:
```
yugabyte=# \d filter_test
                      Table "public.filter_test"
 Column |            Type             | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
 h      | integer                     |           | not null |
 v1     | integer                     |           |          |
 ts     | timestamp without time zone |           |          |
Indexes:
    "filter_test_pkey" PRIMARY KEY, lsm (h ASC), colocation: true
    "filter_test_v1" lsm (v1 ASC), colocation: true
Colocation: true

yugabyte=# SELECT COUNT(*) FROM filter_test;
 count
-------
 10000
(1 row)

yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM filter_test WHERE h <= 1000 AND h % 2 = 0 AND ts >= NOW() - (interval '30 days');
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using filter_test_pkey on filter_test  (cost=0.00..4.14 rows=1 width=16) (actual time=30.375..31.272 rows=146 loops=1)
   Index Cond: (h <= 1000)
   Remote Filter: ((h % 2) = 0)
   Filter: (ts >= (now() - '30 days'::interval))
   Rows Removed by Filter: 354
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 29.232 ms
   Storage Table Rows Scanned: 1000
 Planning Time: 0.311 ms
 Execution Time: 31.549 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 29.232 ms
 Storage Rows Scanned: 1000
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 29.232 ms
 Peak Memory Usage: 24 kB
(19 rows)
```

From the output of EXPLAIN for the above query, the user understands that:
 - 1000 rows out of 10000 rows in the table filter_test were scanned by Index Scan in DocDB. This represents the selectivity of the Index Condition
 - 146 + 354 = 500 rows out of 1000 matched the modulo (remote) filter and were returned to Postgres
 - 146 rows out of 500 matched the timestamp filter and were returned as the result of the query

This revision adds the following stats to scan plan nodes:

  # `Storage Table Rows Scanned`: The number of logical rows scanned in the main table
  # `Storage Index Rows Scanned`: The number of logical rows scanned in secondary indexes

**Can existing DocDB metrics be used instead?**
The difference between the DocDB metrics `docdb_keys_found` and `docdb_obsolete_keys_found` for a given query/plan node provides a good approximation of the number of scanned rows. However, this approach suffers from the following pitfalls:
 - It is an approximation and does not provide the exact the number.
 - Relying on the implementation of key retrieval in DocDB causes a loss of abstraction.
 - More work has to be done to distinguish between Table Rows and Index Rows.

**Is the distinction between Table and Index Rows scanned important?**
In most cases, one can expect to see identical number of scanned rows for tables and indexes. However the two can differ when an index has both a condition and a remote filter on it. Other applicable scenarios include lossy bitmap scans.

Consider this example:
```
yugabyte=# CREATE TABLE filter_test(a int, b int, ts timestamp);
CREATE TABLE
yugabyte=# INSERT INTO filter_test (SELECT i, i, NOW() FROM generate_series(1, 1000) AS i);
INSERT 0 1000
yugabyte=# CREATE INDEX ON filter_test(a ASC);
CREATE INDEX

yugabyte=# \d filter_test
                      Table "public.filter_test"
 Column |            Type             | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
 a      | integer                     |           |          |
 b      | integer                     |           |          |
 ts     | timestamp without time zone |           |          |
Indexes:
    "filter_test_a_idx" lsm (a ASC)

yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM filter_test WHERE a < 100 AND a % 2 = 0;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using filter_test_a_idx on filter_test  (cost=0.00..5.33 rows=10 width=16) (actual time=6.279..6.328 rows=49 loops=1)
   Index Cond: (a < 100)
   Remote Index Filter: ((a % 2) = 0)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.207 ms
   Storage Table Rows Scanned: 49
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 2.639 ms
   Storage Index Rows Scanned: 99
 Planning Time: 17.864 ms
 Execution Time: 6.852 ms
 Storage Read Requests: 2
 Storage Read Execution Time: 3.845 ms
 Storage Rows Scanned: 148
 Storage Write Requests: 0
 Catalog Read Requests: 20
 Catalog Read Execution Time: 54.329 ms
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 58.174 ms
 Peak Memory Usage: 24 kB
(21 rows)
```
In this example:
 - The index condition `a < 100` is applied to selectively scan 99 rows out of 1000 in DocDB.
 - The index filter `(a % 2) = 0` then filters out 49 rows of 99 in which the value of `a` is even.
 - The corresponding 49 rows are scanned on the main table (`filter_test`) and returned to Postgres which in turn returns them as the result of the query.

Notes:
 - The number of rows scanned by Catalog requests are not shown.
 - The rows scanned by an IndexScan on the Primary Key of a user table are counted as Storage Table Rows Scanned.

**Upgrade/Rollback safety:**
This revision does not alter persisted data. New fields are added to the tserver response (to pggate) proto.
Checks are present in pggate to handle both the presence and absence of these fields in the response proto.
This is upgrade and rollback safe.
No new flags are added to guard the feature.

Jira: DB-569

Test Plan:
```
./yb_build.sh ---java-test org.yb.pgsql.TestPgExplainAnalyze
./yb_build.sh --java-test org.yb.pgsql.TestPgExplainAnalyzeColocated
./yb_build.sh --java-test org.yb.pgsql.TestPgExplainAnalyzeScans#testIndexScanConditionAndFilter
```

Reviewers: esheng, telgersma, amartsinchyk

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31931

Loading branch information

karthik-ramanathan-3006 committed Feb 8, 2024

1 parent e555060 commit b5c632c

java/yb-pgsql/src/test/java/org/yb/pgsql/ExplainAnalyzeUtils.java

-Original file line number
+Diff line change
@@ Expand Up @@
         TopLevelCheckerBuilder plan(ObjectChecker checker);
         TopLevelCheckerBuilder storageReadRequests(ValueChecker<Long> checker);
         TopLevelCheckerBuilder storageReadExecutionTime(ValueChecker<Double> checker);
+        TopLevelCheckerBuilder storageRowsScanned(ValueChecker<Long> checker);
         TopLevelCheckerBuilder storageWriteRequests(ValueChecker<Long> checker);
         TopLevelCheckerBuilder catalogReadRequests(ValueChecker<Long> checker);
         TopLevelCheckerBuilder catalogReadExecutionTime(ValueChecker<Double> checker);
@@ Expand Down Expand Up @@
         // This requires a different type of checker than ValueChecker<>
         PlanCheckerBuilder storageTableReadRequests(Checker checker);
         PlanCheckerBuilder storageTableReadExecutionTime(Checker checker);
+        PlanCheckerBuilder storageTableRowsScanned(ValueChecker<Long> checker);
         // Table Writes
         PlanCheckerBuilder storageTableWriteRequests(ValueChecker<Long> checker);
         // Index Reads
         PlanCheckerBuilder storageIndexReadRequests(ValueChecker<Long> checker);
         PlanCheckerBuilder storageIndexReadExecutionTime(ValueChecker<Double> checker);
+        PlanCheckerBuilder storageIndexRowsScanned(ValueChecker<Long> checker);
         // Index Writes
         PlanCheckerBuilder storageIndexWriteRequests(ValueChecker<Long> checker);
@@ Expand Down @@

java/yb-pgsql/src/test/java/org/yb/pgsql/TestPgExplainAnalyze.java

Large diffs are not rendered by default.

0 comments on commit `b5c632c`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `b5c632c`

Commit

There are no files selected for viewing

0 comments on commit b5c632c

0 comments on commit `b5c632c`