Description
Row counts for lookup joins can be severely underestimated when the lookup join uses key columns for the lookup (e.g., an equality condition) and the index contains heavy hitters. This can lead to sub-optimal query plans, and it is especially a problem for generic query plans, which rely heavily on lookup joins.
Here's an example:
CREATE TABLE t (
k INT PRIMARY KEY,
a INT,
s STRING,
INDEX (a)
);
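-- a takes the values 1..99 once each and 0 for the remaining 9,901 rows,
-- so a has 100 distinct values with 0 as a heavy hitter.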
INSERT INTO t
SELECT i, CASE WHEN i < 100 THEN i ELSE 0 END, repeat('a', 50)
FROM generate_series(1, 10000) AS g(i);
ANALYZE t;
SET plan_cache_mode = force_generic_plan;
PREPARE p AS SELECT * FROM t WHERE a = $1;
EXPLAIN ANALYZE EXECUTE p(0);
-- info
-- --------------------------------------------------------
-- planning time: 31µs
-- execution time: 44ms
-- distribution: local
-- vectorized: true
-- plan type: generic, reused
-- rows decoded from KV: 19,802 (918 KiB, 2 gRPC calls)
-- cumulative time spent in KV: 36ms
-- maximum memory usage: 5.2 MiB
-- network usage: 0 B (0 messages)
-- regions: us-east1
-- sql cpu time: 8ms
-- isolation level: serializable
-- priority: normal
-- quality of service: regular
--
-- • lookup join
-- │ sql nodes: n1
-- │ kv nodes: n1
-- │ regions: us-east1
-- │ actual row count: 9,901
-- │ KV time: 33ms
-- │ KV contention time: 0µs
-- │ KV rows decoded: 9,901
-- │ KV bytes read: 628 KiB
-- │ KV gRPC calls: 1
-- │ estimated max memory allocated: 3.4 MiB
-- │ sql cpu time: 6ms
-- │ estimated row count: 100
-- │ table: t@t_pkey
-- │ equality: (k) = (k)
-- │ equality cols are key
-- │
-- └── • lookup join
-- │ sql nodes: n1
-- │ kv nodes: n1
-- │ regions: us-east1
-- │ actual row count: 9,901
-- │ KV time: 3ms
-- │ KV contention time: 0µs
-- │ KV rows decoded: 9,901
-- │ KV bytes read: 290 KiB
-- │ KV gRPC calls: 1
-- │ estimated max memory allocated: 20 KiB
-- │ sql cpu time: 1ms
-- │ estimated row count: 100
-- │ table: t@t_a_idx
-- │ equality: ($1) = (a)
-- │
-- └── • values
-- sql nodes: n1
-- regions: us-east1
-- actual row count: 1
-- sql cpu time: 4µs
-- size: 1 column, 1 row
Notice how the actual row count is two orders of magnitude greater than the estimated row count.
This is caused by the way we estimate the selectivity of an equality expression between a column and a placeholder value (or another column). The formula is essentially: selectivity = 1 / distinct_count(column), so the estimated row count here is row_count / distinct_count(a) = 10,000 / 100 = 100, i.e. the average number of rows per distinct value of a.
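For reference, the skew that defeats this estimate is easy to see directly (these queries are only for illustration and are not part of the repro):
SELECT count(*) FROM t WHERE a = 0;    -- 9,901 rows behind the heavy hitter
SELECT count(DISTINCT a) FROM t;       -- 100 distinct values, so the average is 100 rows per value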
I think we could improve this by pessimistically assuming that the equality will match a heavy hitter. In the example above, if we assume $1 is 0, then our estimated row count represents a worst-case scenario.
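As a purely illustrative sketch of what that worst-case number would be here (the optimizer would presumably take it from the largest histogram bucket rather than scanning the table), the target estimate is the size of the largest group of a:
SELECT max(cnt) AS worst_case_estimate
FROM (SELECT count(*) AS cnt FROM t GROUP BY a) AS per_value;
-- worst_case_estimate = 9901, matching the actual row count of 9,901 in the plan above.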
Jira issue: CRDB-51754