Mimic duckdb's post-filter cardinality estimates #7895
Force-pushed from ccb8877 to 9f91638
Polar Signals Profiling Results
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.995x ➖
datafusion / vortex-file-compressed (0.995x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.050x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.013x ➖, 0↑ 0↓)
datafusion / parquet (1.011x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.002x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.005x ➖, 0↑ 0↓)
duckdb / parquet (1.022x ➖, 0↑ 1↓)
Full attributed analysis
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.040x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.035x ➖, 0↑ 0↓)
datafusion / parquet (1.037x ➖, 1↑ 1↓)
datafusion / arrow (1.086x ➖, 0↑ 7↓)
duckdb / vortex-file-compressed (1.008x ➖, 3↑ 2↓)
duckdb / vortex-compact (0.979x ➖, 4↑ 3↓)
duckdb / parquet (1.004x ➖, 1↑ 1↓)
duckdb / duckdb (1.015x ➖, 1↑ 0↓)
Full attributed analysis
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.952x ➖, 14↑ 0↓)
datafusion / vortex-compact (0.958x ➖, 3↑ 2↓)
datafusion / parquet (0.988x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.914x ➖, 43↑ 13↓)
duckdb / vortex-compact (0.944x ➖, 34↑ 15↓)
duckdb / parquet (0.983x ➖, 2↑ 1↓)
duckdb / duckdb (0.989x ➖, 3↑ 2↓)
Full attributed analysis
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.877x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.909x ➖, 0↑ 0↓)
datafusion / parquet (0.977x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.925x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.911x ➖, 0↑ 0↓)
duckdb / parquet (0.998x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.956x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.963x ➖, 0↑ 0↓)
duckdb / parquet (0.956x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.972x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.980x ➖, 0↑ 0↓)
datafusion / parquet (0.983x ➖, 0↑ 0↓)
datafusion / arrow (0.936x ➖, 3↑ 0↓)
duckdb / vortex-file-compressed (1.042x ➖, 4↑ 4↓)
duckdb / vortex-compact (0.992x ➖, 5↑ 3↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
duckdb / duckdb (0.993x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.062x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.022x ➖, 0↑ 0↓)
datafusion / parquet (1.041x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.005x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.972x ➖, 0↑ 1↓)
duckdb / parquet (1.020x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.981x ➖, 0↑ 0↓)
datafusion / parquet (0.967x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.935x ➖, 11↑ 1↓)
duckdb / parquet (0.987x ➖, 0↑ 0↓)
duckdb / duckdb (0.992x ➖, 1↑ 1↓)
Full attributed analysis
File Sizes: Clickbench on NVME
File Size Changes (1 file changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.930x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.910x ➖, 0↑ 1↓)
datafusion / parquet (0.966x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.926x ➖, 1↑ 1↓)
duckdb / vortex-compact (0.894x ➖, 1↑ 0↓)
duckdb / parquet (0.931x ➖, 0↑ 0↓)
Full attributed analysis
Force-pushed from 9f91638 to 448754a
```rust
row_range: Option<Range<u64>>,
file_selection: Selection,
file_range: Option<Range<u64>>,
has_non_optional_filter: bool,
```
What does duckdb do? What about `or` and other filters?
This is what duckdb does. The only distinction it makes is "at least one non-optional filter".
```rust
let report_pushed = !expr
    .as_opt::<Binary>()
    .map(|op| *op == Operator::Eq)
    .unwrap_or(false);
```
What about (> ∧ =) or (> ∨ =)? What should we do?
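One conservative answer to the question about compound filters, sketched below. The `Expr`/`Op` types are illustrative stand-ins, not vortex's actual expression API: keep a filter in duckdb's table filter list (i.e. do not report it as pushed) whenever any leaf comparison is an equality, recursing through conjunctions and disjunctions.

```rust
// Hypothetical sketch: `Expr` and `Op` are simplified stand-ins for the
// real expression tree, used only to illustrate the recursion.
#[derive(Debug)]
enum Expr {
    Cmp(Op),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

#[derive(Debug, PartialEq)]
enum Op {
    Eq,
    Gt,
}

/// Report a filter as fully pushed down only if no equality leaf
/// appears anywhere in the expression tree.
fn report_pushed(expr: &Expr) -> bool {
    match expr {
        Expr::Cmp(op) => *op != Op::Eq,
        Expr::And(l, r) | Expr::Or(l, r) => report_pushed(l) && report_pushed(r),
    }
}

fn main() {
    // `x > 5` alone can be reported as pushed.
    assert!(report_pushed(&Expr::Cmp(Op::Gt)));
    // `x > 5 AND x = 7` contains an equality, so keep it in the filter list.
    let e = Expr::And(Box::new(Expr::Cmp(Op::Gt)), Box::new(Expr::Cmp(Op::Eq)));
    assert!(!report_pushed(&e));
}
```

Treating And and Or identically errs on the side of leaving filters with duckdb, which only costs re-evaluation, whereas dropping a filter duckdb still needs would be incorrect.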
```cpp
idx_t integer_distinct(LogicalTypeId id, const Value &min, const Value &max) {
    switch (id) {
    case LogicalTypeId::BOOLEAN:
        return 1 + max.GetValueUnsafe<bool>() - min.GetValueUnsafe<bool>();
    case LogicalTypeId::UTINYINT:
        return 1 + max.GetValueUnsafe<uint8_t>() - min.GetValueUnsafe<uint8_t>();
    case LogicalTypeId::USMALLINT:
        return 1 + max.GetValueUnsafe<uint16_t>() - min.GetValueUnsafe<uint16_t>();
    case LogicalTypeId::UINTEGER:
        return 1 + max.GetValueUnsafe<uint32_t>() - min.GetValueUnsafe<uint32_t>();
    case LogicalTypeId::UBIGINT:
        return 1 + max.GetValueUnsafe<uint64_t>() - min.GetValueUnsafe<uint64_t>();
    case LogicalTypeId::TINYINT:
        return 1 + abs(max.GetValueUnsafe<int8_t>() - min.GetValueUnsafe<int8_t>());
    case LogicalTypeId::SMALLINT:
        return 1 + abs(max.GetValueUnsafe<int16_t>() - min.GetValueUnsafe<int16_t>());
    case LogicalTypeId::INTEGER:
        return 1 + labs(max.GetValueUnsafe<int32_t>() - min.GetValueUnsafe<int32_t>());
    case LogicalTypeId::BIGINT:
        return 1 + llabs(max.GetValueUnsafe<int64_t>() - min.GetValueUnsafe<int64_t>());
    // Don't estimate distinct for huge ints since result may not fit in u64.
    default:
        return 0;
    }
}
```
```cpp
unique_ptr<BaseStatistics> numeric_stats(duckdb_column_statistics &stats, LogicalType type) {
    BaseStatistics out = StringStats::CreateUnknown(type);
    if (stats.min && stats.max) {
        const Value &min = UnwrapValue(stats.min);
        NumericStats::SetMin(out, min);

        const Value &max = UnwrapValue(stats.max);
        NumericStats::SetMax(out, max);

        if (const idx_t distinct = integer_distinct(type.id(), min, max); distinct > 0) {
            out.SetDistinctCount(distinct);
        }
    }

    duckdb_destroy_value(&stats.min);
    duckdb_destroy_value(&stats.max);
```
Should this be in the C++ wrapper? I would like this to be in Rust.
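For reference, a minimal sketch of what a Rust port of `integer_distinct` could look like, assuming min/max have already been widened to `i64` (function name and signature are assumptions, not part of this PR; the real port would match on the logical type as the C++ does, and return 0 for huge ints to signal "no estimate"):

```rust
/// Estimate the distinct count of an integer column from its min/max
/// statistics as 1 + |max - min|. `abs_diff` avoids the signed-overflow
/// hazard of the C++ `abs(max - min)` pattern.
fn integer_distinct(min: i64, max: i64) -> u64 {
    1 + max.abs_diff(min)
}

fn main() {
    // A column ranging over [-3, 3] has at most 7 distinct values.
    assert_eq!(integer_distinct(-3, 3), 7);
    // A constant column (min == max) has exactly 1 distinct value.
    assert_eq!(integer_distinct(5, 5), 1);
}
```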
Benchmarks: Random Access
Vortex (geomean): 0.902x ➖
unknown / unknown (0.935x ➖, 12↑ 0↓)

Benchmarks: Compression
Vortex (geomean): 1.012x ➖
unknown / unknown (1.007x ➖, 1↑ 5↓)
Use a 20% selectivity for cardinality estimates in duckdb when there is at least one non-optional filter. This allows removing pushed filters from duckdb's table filter set.
Set the distinct count to 1 for constant string columns (min = max).
Derive the distinct count from max - min for integer columns.
Remove all but equality comparisons from duckdb's table filter list (removing equality comparisons produces invalid cardinality estimates, which lead to regressions).
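The 20% heuristic described above can be sketched as follows (the constant name, rounding, and clamp are illustrative assumptions, not the exact duckdb or vortex code): when at least one non-optional filter is pushed down, report the post-filter cardinality as 20% of the pre-filter row count, clamped to at least one row.

```rust
// Assumed selectivity applied when at least one non-optional filter
// is pushed down; duckdb's own constant may differ.
const FILTER_SELECTIVITY: f64 = 0.2;

/// Post-filter cardinality estimate for a scan of `row_count` rows.
fn estimated_cardinality(row_count: u64, has_non_optional_filter: bool) -> u64 {
    if has_non_optional_filter {
        // Clamp to 1 so the optimizer never sees an empty scan estimate.
        ((row_count as f64 * FILTER_SELECTIVITY) as u64).max(1)
    } else {
        row_count
    }
}

fn main() {
    assert_eq!(estimated_cardinality(1000, true), 200);
    assert_eq!(estimated_cardinality(1000, false), 1000);
    assert_eq!(estimated_cardinality(1, true), 1);
}
```

The point of a fixed selectivity is that once filters are evaluated inside the scan, duckdb no longer sees them, so without this estimate it would plan joins as if the scan returned every row.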