Summary:
This change introduces testing for histogram_bounds in pg_stats. The error metric used to test the goodness of a histogram is based on the max relative error found in "Random sampling for histogram construction: how much is enough?" by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya. PG cites this paper to show its histogram error bounds so it was suitable to use the same paper for the error metric.
More specifically, the error delta of a histogram is defined as the maximum difference between its bucket sizes and the "perfect" bucket size of `n/k` where `n` is the total number of rows and `k` is the number of buckets. The relative error is defined as delta/(perfect bucket size) which PG expects to be `<= 0.5` on columns with only distinct values. PG overlooks non distinct columns when discussing this error bound for which an adjustment had to be made in this change. Instead of expecting the max error, delta, to be less than `0.5*(perfect bucket size)` we expect it to be less than `0.5*(perfect bucket size) + max(multiplicity(s_i) - 1)` where `s_i` denotes the boundaries of the histogram we are testing. This reduces to the original error bound when dealing with a column that only has distinct values.
Jira: DB-6566
Test Plan:
Jenkins: test regex: .*TestPgAnalyze.*
./yb_build.sh --java-test 'org.yb.pgsql.TestPgAnalyze'
The above test fails before the fix in 36198ae41818bf613550b53d3e13e7773f30f5f9
Reviewers: amartsinchyk, mtakahara
Reviewed By: mtakahara
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D25582