Benchmarks v3 migration to duckdb #7646
Merged
connortsui20 merged 5 commits into ct/benchmarks-v3 on Apr 27, 2026
Conversation
Reads v2's data.json.gz/commits.json/file-sizes from S3, ports v2's getGroup classifier bug-for-bug, and writes a fully populated v3 DuckDB. Includes a verify subcommand that diffs group/chart structure against the live v2 /api/metadata endpoint. Binary and classifier are throwaway: deleted post-cutover. Signed-off-by: Claude <noreply@anthropic.com>
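For context, here is a hedged sketch of what a structural diff like the `verify` subcommand performs could look like. The endpoint URL, table name (`benchmarks`), and column name (`group_name`) are illustrative placeholders, not the PR's actual code:

```rust
// Conceptual sketch: pull group names from the live v2 /api/metadata endpoint
// and diff them against the groups present in the migrated v3 DuckDB file.
// URL, JSON shape, table, and column names below are assumptions.
use std::collections::BTreeSet;

fn verify(db_path: &str) -> anyhow::Result<()> {
    // Groups according to the live v2 API (JSON shape assumed).
    let meta: serde_json::Value =
        reqwest::blocking::get("https://example.com/api/metadata")?.json()?;
    let v2_groups: BTreeSet<String> = meta["groups"]
        .as_array()
        .unwrap_or(&vec![])
        .iter()
        .filter_map(|g| g["name"].as_str().map(str::to_owned))
        .collect();

    // Groups according to the migrated DuckDB file.
    let conn = duckdb::Connection::open(db_path)?;
    let mut stmt = conn.prepare("SELECT DISTINCT group_name FROM benchmarks")?;
    let v3_groups: BTreeSet<String> = stmt
        .query_map([], |row| row.get::<_, String>(0))?
        .collect::<Result<_, _>>()?;

    // Report the symmetric difference.
    for missing in v2_groups.difference(&v3_groups) {
        println!("missing in v3: {missing}");
    }
    for extra in v3_groups.difference(&v2_groups) {
        println!("extra in v3: {extra}");
    }
    Ok(())
}
```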
…tements

Two narrow fixes:

1. The classifier wrote v2's display-renamed engine and format strings (e.g. "vortex" instead of "vortex-file-compressed") into v3's columns. v3's live emitter writes canonical Format::name() strings, so historical and live records would split into separate chart series at cutover. Pull engine and format from the raw record name instead; the rename was a v2 read-time UI concern only.
2. The per-row tx.execute(sql, params) hot path re-parsed the SQL on every record. Hoist tx.prepare(sql) outside the row loop and reuse the prepared statement. Local migration time: ~15 minutes -> ~2-3 minutes. (The DuckDB Appender API would be ~10x faster still, but its append_row is unimplemented for BIGINT[] columns in duckdb-rs 1.10502, and Arrow record batches are out of scope for this fix.)

Signed-off-by: Claude <noreply@anthropic.com>
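A minimal sketch of the prepared-statement hoist, assuming a simplified two-column table (the real schema is wider); only the prepare-once/execute-many shape matters:

```rust
// Prepare the INSERT once per transaction, then reuse the statement per row,
// instead of calling tx.execute(sql, params) and re-parsing SQL on every record.
use duckdb::{params, Connection, Result};

fn insert_records(conn: &mut Connection, records: &[(String, i64)]) -> Result<()> {
    let tx = conn.transaction()?;
    {
        // Parsed and planned exactly once for the whole batch.
        let mut stmt = tx.prepare("INSERT INTO results (name, value) VALUES (?, ?)")?;
        for (name, value) in records {
            stmt.execute(params![name, value])?;
        }
    } // statement borrow ends here, before commit
    tx.commit()
}
```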
Adds tracing-based phase announcements and periodic progress lines (every 5 seconds) so users know the binary isn't hung during multi-minute migrations. Also fixes an inaccurate doc comment about vortex-compact's ext label and skips empty trailing transactions in both streaming loops.

No functional change: only log output, comment-only edits, and the elision of a no-op transaction.

Signed-off-by: Claude <noreply@anthropic.com>
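A minimal sketch of the periodic progress reporting, assuming a `tracing` subscriber is already installed; the function and constant names are illustrative:

```rust
use std::time::{Duration, Instant};

const PROGRESS_INTERVAL: Duration = Duration::from_secs(5);

/// Streams items through `write`, announcing the phase up front and emitting
/// a progress line every 5 seconds so long migrations visibly make progress.
fn stream_with_progress<T>(items: impl Iterator<Item = T>, mut write: impl FnMut(T)) {
    tracing::info!("phase: streaming records into DuckDB");
    let mut last_report = Instant::now();
    let mut processed: u64 = 0;
    for item in items {
        write(item);
        processed += 1;
        if last_report.elapsed() >= PROGRESS_INTERVAL {
            tracing::info!(processed, "migration still running");
            last_report = Instant::now();
        }
    }
    tracing::info!(processed, "phase complete");
}
```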
Force-pushed from 96b6809 to 3d431b3
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Force-pushed from 3d431b3 to 42ad6a1
Contributor
Author

There are some group regressions; I will eventually figure out whether we need them at all:
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Force-pushed from f764ea4 to b028148
This is a one-shot migration binary that takes all of the data from `data.json.gz` and brings it into a DuckDB database. It simply gathers and aggregates everything into memory, then writes the data in chunks with Arrow arrays. Inserting row-by-row took way too long, and the appender API in duckdb does not support `BIGINT[]` for some reason...
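A minimal sketch of the chunked Arrow write path, assuming duckdb-rs's `Appender::append_record_batch` and a toy `(id BIGINT, samples BIGINT[])` table; the real schema is wider, and the table/column names here are placeholders:

```rust
use std::sync::Arc;

use duckdb::arrow::array::{ArrayRef, Int64Array, Int64Builder, ListBuilder};
use duckdb::arrow::record_batch::RecordBatch;
use duckdb::Connection;

/// Writes one in-memory chunk as a single Arrow record batch, sidestepping both
/// per-row INSERTs and the appender's unimplemented append_row for BIGINT[].
fn write_chunk(conn: &Connection, ids: Vec<i64>, samples: &[Vec<i64>]) -> anyhow::Result<()> {
    // Build the BIGINT[] column as an Arrow list-of-int64 array.
    let mut list = ListBuilder::new(Int64Builder::new());
    for row in samples {
        list.values().append_slice(row);
        list.append(true); // close one list element (one row)
    }

    let batch = RecordBatch::try_from_iter([
        ("id", Arc::new(Int64Array::from(ids)) as ArrayRef),
        ("samples", Arc::new(list.finish()) as ArrayRef),
    ])?;

    // One appender call per chunk instead of one statement per row.
    let mut appender = conn.appender("results")?;
    appender.append_record_batch(batch)?;
    Ok(())
}
```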