Benchmarks v3 migration to duckdb #7646

Merged
connortsui20 merged 5 commits into ct/benchmarks-v3 from ct/benchmarks-v3-migration
Apr 27, 2026
Conversation

@connortsui20
Contributor

This is a one-shot migration binary that takes all of the data from data.json.gz and loads it into a DuckDB database.

It simply gathers and aggregates everything in memory, then writes the data in chunks as Arrow arrays. Inserting row by row took way too long, and the appender API in duckdb does not support BIGINT[] for some reason...
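
For readers unfamiliar with the chunked, column-oriented write described above, here is a minimal sketch of the batching step only. It uses plain `Vec`s rather than real Arrow arrays, and `Row`, `ColumnChunk`, and `to_column_chunks` are hypothetical names that do not come from this PR; the actual schema and chunk size are not shown here.

```rust
/// Hypothetical row shape; the real migration's schema is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    commit_id: String,
    /// Stands in for a BIGINT[] column.
    values: Vec<i64>,
}

/// One column-oriented chunk: each field holds a whole column for the chunk.
#[derive(Debug, Default, PartialEq)]
struct ColumnChunk {
    commit_ids: Vec<String>,
    values: Vec<Vec<i64>>,
}

/// Split in-memory rows into column-oriented chunks of at most `chunk_size` rows.
/// Each chunk would then be handed to the database as one batch.
fn to_column_chunks(rows: &[Row], chunk_size: usize) -> Vec<ColumnChunk> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    rows.chunks(chunk_size)
        .map(|chunk| {
            let mut out = ColumnChunk::default();
            for row in chunk {
                out.commit_ids.push(row.commit_id.clone());
                out.values.push(row.values.clone());
            }
            out
        })
        .collect()
}
```

In a real writer each `ColumnChunk` would presumably be converted into Arrow arrays before insertion; that conversion is deliberately left out because it depends on the crate versions in use.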

@connortsui20 connortsui20 added the changelog/skip Do not list PR in the changelog label Apr 26, 2026
claude added 3 commits April 26, 2026 19:36
Reads v2's data.json.gz/commits.json/file-sizes from S3, ports v2's
getGroup classifier bug-for-bug, and writes a fully populated v3
DuckDB. Includes a verify subcommand that diffs group/chart
structure against the live v2 /api/metadata endpoint. Binary and
classifier are throwaway: deleted post-cutover.

Signed-off-by: Claude <noreply@anthropic.com>
…tements

Two narrow fixes:

1. Classifier wrote v2's display-renamed engine and format strings
   (e.g. "vortex" instead of "vortex-file-compressed") into v3's
   columns. v3's live emitter writes canonical Format::name() strings,
   so historical and live records would split into separate chart
   series at cutover. Pull engine and format from the raw record
   name; the rename was a v2 read-time UI concern only.

2. The per-row tx.execute(sql, params) hot path re-parsed SQL on
   every record. Hoist tx.prepare(sql) outside the row loop and
   reuse the prepared statement. Local migration time: ~15 minutes
   -> ~2-3 minutes.

(The DuckDB Appender API would be ~10x faster still, but its
append_row is unimplemented for BIGINT[] columns in duckdb-rs
1.10502, and Arrow record batches are out of scope for this fix.)

Signed-off-by: Claude <noreply@anthropic.com>
Adds tracing-based phase announcements and periodic progress lines
(every 5 seconds) so users know the binary isn't hung during
multi-minute migrations. Also fixes an inaccurate doc comment about
vortex-compact's ext label and skips empty trailing transactions in
both streaming loops.

No behavior change - all log output, comment-only edits, and a
no-op-transaction elision.

Signed-off-by: Claude <noreply@anthropic.com>
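
The periodic progress lines described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `ProgressTicker` is a hypothetical name, the message text is made up, and the sketch returns a string where the real binary uses tracing. The current time is passed in explicitly so the throttling logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Emits a progress line at most once per `interval` (hypothetical helper).
struct ProgressTicker {
    interval: Duration,
    last_emit: Instant,
    processed: u64,
}

impl ProgressTicker {
    fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last_emit: now, processed: 0 }
    }

    /// Record one processed item. Returns a progress line only when at least
    /// `interval` has elapsed since the previous line (or since construction).
    fn tick(&mut self, now: Instant) -> Option<String> {
        self.processed += 1;
        if now.duration_since(self.last_emit) >= self.interval {
            self.last_emit = now;
            Some(format!("processed {} records", self.processed))
        } else {
            None
        }
    }
}
```

Inside a row loop one would call `ticker.tick(Instant::now())` per record and log any returned line, so a multi-minute run produces output every few seconds without logging every row.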
@connortsui20 connortsui20 force-pushed the ct/benchmarks-v3-migration branch 3 times, most recently from 96b6809 to 3d431b3 Compare April 27, 2026 00:21
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20 connortsui20 force-pushed the ct/benchmarks-v3-migration branch from 3d431b3 to 42ad6a1 Compare April 27, 2026 00:37
@connortsui20 connortsui20 changed the title [claude] Benchmarks v3 migration to duckdb Benchmarks v3 migration to duckdb Apr 27, 2026
@connortsui20
Contributor Author

There are some group regressions. I will eventually figure out whether we need those groups at all:

Groups in both v2 and v3:
  + Clickbench
  + Compression
  + Compression Size
  + PolarSignals Profiling
  + Random Access
  + Statistical and Population Genetics
  + TPC-DS (NVMe) (SF=1)
  + TPC-H (NVMe) (SF=1)
  + TPC-H (NVMe) (SF=10)
  + TPC-H (NVMe) (SF=100)
  + TPC-H (S3) (SF=1)
  + TPC-H (S3) (SF=10)
  + TPC-H (S3) (SF=100)
Groups only in v2 (regression candidates):
  - TPC-DS (NVMe) (SF=10)
  - TPC-H (NVMe) (SF=1000)
  - TPC-H (S3) (SF=1000)
Groups only in v3:
  + Fineweb
Chart count diffs:
  Compression : v2=10 v3=6 (delta=-4)
  Compression Size : v2=6 v3=4 (delta=-2)
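
A report like the one above boils down to set operations over two group-to-chart-count maps. The following is a minimal sketch of that comparison, assuming each side has already been reduced to a `BTreeMap` from group name to chart count; `GroupDiff` and `diff_groups` are hypothetical names, and the PR's actual verify subcommand may be structured differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Result of comparing two group -> chart-count maps (hypothetical type).
#[derive(Debug, PartialEq)]
struct GroupDiff {
    in_both: Vec<String>,
    only_old: Vec<String>,
    only_new: Vec<String>,
    /// (group, old count, new count) for shared groups whose counts differ.
    count_diffs: Vec<(String, usize, usize)>,
}

fn diff_groups(old: &BTreeMap<String, usize>, new: &BTreeMap<String, usize>) -> GroupDiff {
    let old_keys: BTreeSet<&String> = old.keys().collect();
    let new_keys: BTreeSet<&String> = new.keys().collect();

    // BTreeSet iteration is ordered, so every list comes out sorted by name.
    let in_both: Vec<String> = old_keys
        .intersection(&new_keys)
        .map(|k| (**k).clone())
        .collect();
    let only_old = old_keys.difference(&new_keys).map(|k| (**k).clone()).collect();
    let only_new = new_keys.difference(&old_keys).map(|k| (**k).clone()).collect();

    // Only shared groups can have a meaningful chart-count delta.
    let count_diffs = in_both
        .iter()
        .filter(|g| old[*g] != new[*g])
        .map(|g| (g.clone(), old[g], new[g]))
        .collect();

    GroupDiff { in_both, only_old, only_new, count_diffs }
}
```

With the numbers reported above, `Compression` would appear in `count_diffs` as `("Compression", 10, 6)`, while `Fineweb` would land in `only_new`.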

@connortsui20 connortsui20 marked this pull request as ready for review April 27, 2026 01:21
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20 connortsui20 force-pushed the ct/benchmarks-v3-migration branch from f764ea4 to b028148 Compare April 27, 2026 01:24
@connortsui20 connortsui20 merged commit ae9ad92 into ct/benchmarks-v3 Apr 27, 2026
56 of 57 checks passed
@connortsui20 connortsui20 deleted the ct/benchmarks-v3-migration branch April 27, 2026 01:24