Use built-in DF cache for file metadata and fix metrics wiring #6164
Conversation
Benchmarks: TPC-H SF=1 on NVMe (summary and detailed results table collapsed)
Benchmarks: FineWeb NVMe (summary and detailed results table collapsed)
CodSpeed Performance Report: merging this PR will improve performance by 19.73%.

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | u8_FoR[10M] | 7.5 µs | 6.2 µs | +19.73% |
Footnotes

1. 1323 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Benchmarks: TPC-H SF=1 on S3 (summary and detailed results table collapsed)
Benchmarks: TPC-DS SF=1 on NVMe (summary and detailed results table collapsed)
Benchmarks: TPC-H SF=10 on NVMe (summary and detailed results table collapsed)
Benchmarks: FineWeb S3 (summary and detailed results table collapsed)
Benchmarks: Statistical and Population Genetics (summary and detailed results table collapsed)
Benchmarks: TPC-H SF=10 on S3 (summary and detailed results table collapsed)
Benchmarks: Clickbench on NVMe (summary and detailed results table collapsed)
Force-pushed from 49bfc43 to ecd35ae.
Force-pushed from ecd35ae to 272f664.
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Force-pushed from d2bc422 to 550803a.
```diff
  let task_ctx = Arc::new(df.task_ctx());
  let plan = df.create_physical_plan().await?;
  let result = collect(plan.clone(), task_ctx).await?;
- Ok((result, physical_plan))
+ Ok((result, plan))
```
Because metrics aren't global anymore, this is the way to get the actual metrics generated during the query.
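The pattern this comment describes, pulling metrics off the executed plan rather than from a global registry, amounts to a post-execution walk of the retained plan tree. A minimal self-contained sketch, with a toy `PlanNode` standing in for DataFusion's `ExecutionPlan` and a single counter standing in for a node's `MetricsSet` (all names here are illustrative assumptions, not the real API):

```rust
// Toy stand-in for a physical plan tree: each operator node records its
// own metrics, and there is no global registry to read from.
struct PlanNode {
    rows_produced: usize, // stand-in for one value in a node's MetricsSet
    children: Vec<PlanNode>,
}

/// After execution, walk the retained plan and aggregate a metric across
/// every operator. This is why the diff above keeps a clone of `plan`
/// alive through `collect` and returns it to the caller.
fn total_rows(plan: &PlanNode) -> usize {
    plan.rows_produced + plan.children.iter().map(total_rows).sum::<usize>()
}

fn main() {
    let plan = PlanNode {
        rows_produced: 10,
        children: vec![
            PlanNode { rows_produced: 5, children: vec![] },
            PlanNode { rows_produced: 7, children: vec![] },
        ],
    };
    assert_eq!(total_rows(&plan), 22);
    println!("total rows across operators: {}", total_rows(&plan));
}
```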
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
1df87dd to
93d54cc
Compare
```diff
  let root_layout = footer.layout();
- let layout_size = size_of::<DType>()
+ let layout_size = size_of_val(footer.dtype())
      + root_layout.metadata().len()
      + root_layout.segment_ids().len() * size_of::<SegmentId>();
```
I think this might be an under-estimate, since it doesn't account for the size of the child layouts? I think the old code was always wrong in this way too.
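The fix the reviewer is hinting at would be a recursive estimate. A hedged sketch with a toy `Layout` type (not the real vortex-layout API; `SEGMENT_ID_SIZE` is an assumed constant standing in for `size_of::<SegmentId>()`):

```rust
// Illustrative only: `Layout` is a toy stand-in for the real layout type.
// The point is that a correct in-memory size estimate has to recurse into
// child layouts instead of measuring only the root.

const SEGMENT_ID_SIZE: usize = 8; // assumption: a SegmentId is 8 bytes here

struct Layout {
    metadata_len: usize,
    segment_ids_len: usize,
    children: Vec<Layout>,
}

/// Root-only estimate, analogous to the code under review.
fn shallow_size(layout: &Layout) -> usize {
    layout.metadata_len + layout.segment_ids_len * SEGMENT_ID_SIZE
}

/// Estimate that also accounts for the child layouts.
fn deep_size(layout: &Layout) -> usize {
    shallow_size(layout) + layout.children.iter().map(deep_size).sum::<usize>()
}

fn main() {
    let root = Layout {
        metadata_len: 10,
        segment_ids_len: 2,
        children: vec![Layout { metadata_len: 4, segment_ids_len: 1, children: vec![] }],
    };
    assert_eq!(shallow_size(&root), 26); // 10 + 2 * 8, misses the child
    assert_eq!(deep_size(&root), 38); // 26 + (4 + 1 * 8)
    println!("shallow = {}, deep = {}", shallow_size(&root), deep_size(&root));
}
```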
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Instead of using our own custom file cache, we delegate to DataFusion's file metadata cache to stash footers we've already read, allowing users to reuse other existing DataFusion code and extensions while reducing the amount of code we have to care about.

It also makes wiring metrics simpler: we no longer cache the layout readers created during `infer_schema` or `infer_stats`. This is desirable because `FileFormat` is long-lived while `FileSource` isn't, so we don't want `FileFormat` to hold a metrics instance.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Co-authored-by: Andrew Duffy <andrew@a10y.dev>
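The file metadata cache the description delegates to is, at its core, a shared get-or-insert map keyed by file path. A minimal sketch of that pattern with toy types (`Footer` and `FooterCache` are illustrative assumptions, not the DataFusion or Vortex API):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Toy stand-in for a parsed file footer.
struct Footer {
    size: u64,
}

// Minimal get-or-insert cache keyed by file path. The PR delegates this
// job to DataFusion's metadata cache instead of owning the map itself.
struct FooterCache {
    inner: Mutex<HashMap<String, Arc<Footer>>>,
}

impl FooterCache {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Return the cached footer for `path`, reading it with `read` on a miss.
    fn get_or_read(&self, path: &str, read: impl FnOnce() -> Footer) -> Arc<Footer> {
        let mut map = self.inner.lock().unwrap();
        map.entry(path.to_string())
            .or_insert_with(|| Arc::new(read()))
            .clone()
    }
}

fn main() {
    let cache = FooterCache::new();
    let first = cache.get_or_read("data.vortex", || Footer { size: 128 });
    // Second lookup hits the cache; the closure is never called.
    let second = cache.get_or_read("data.vortex", || Footer { size: 999 });
    assert_eq!(first.size, 128);
    assert_eq!(second.size, 128);
    println!("cached footer size: {}", second.size);
}
```

Sharing one such cache across queries is what lets repeated scans of the same file skip the footer read entirely.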