Benchmarks v3 migration to duckdb by connortsui20 · Pull Request #7646 · vortex-data/vortex

connortsui20 · 2026-04-26T23:08:32Z

This is a one-shot migration binary that loads all of the data from data.json.gz into a DuckDB database.

It simply gathers and aggregates everything in memory, then writes the data in chunks as Arrow arrays. Inserting row by row took far too long, and the Appender API in duckdb does not support BIGINT[] for some reason...
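A minimal sketch of the chunking step described above, using only the Rust standard library. The `Record` and `ColumnBatch` types, their field names, and the chunk size are hypothetical and not taken from this PR. The real binary builds Arrow arrays and hands them to DuckDB; that part is not shown here.

```rust
/// Hypothetical in-memory record; the real schema is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    name: String,
    /// Would correspond to a BIGINT[] column in DuckDB.
    runtimes_ns: Vec<i64>,
}

/// Column-oriented view of one chunk, standing in for an Arrow record batch.
#[derive(Debug, PartialEq)]
struct ColumnBatch {
    names: Vec<String>,
    runtimes_ns: Vec<Vec<i64>>,
}

/// Split `records` into column batches of at most `chunk_size` rows each.
/// An empty input yields no batches, so nothing is written for it.
fn to_batches(records: &[Record], chunk_size: usize) -> Vec<ColumnBatch> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    records
        .chunks(chunk_size)
        .map(|chunk| ColumnBatch {
            names: chunk.iter().map(|r| r.name.clone()).collect(),
            runtimes_ns: chunk.iter().map(|r| r.runtimes_ns.clone()).collect(),
        })
        .collect()
}
```

Each `ColumnBatch` would then be converted to Arrow arrays and written in one call, rather than issuing one insert per row.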

Reads v2's data.json.gz/commits.json/file-sizes from S3, ports v2's getGroup classifier bug-for-bug, and writes a fully populated v3 DuckDB. Includes a verify subcommand that diffs group/chart structure against the live v2 /api/metadata endpoint. Binary and classifier are throwaway: deleted post-cutover.

Signed-off-by: Claude <noreply@anthropic.com>

…tements

Two narrow fixes:

1. Classifier wrote v2's display-renamed engine and format strings (e.g. "vortex" instead of "vortex-file-compressed") into v3's columns. v3's live emitter writes canonical Format::name() strings, so historical and live records would split into separate chart series at cutover. Pull engine and format from the raw record name; the rename was a v2 read-time UI concern only.

2. The per-row tx.execute(sql, params) hot path re-parsed SQL on every record. Hoist tx.prepare(sql) outside the row loop and reuse the prepared statement. Local migration time: ~15 minutes -> ~2-3 minutes. (The DuckDB Appender API would be ~10x faster still, but its append_row is unimplemented for BIGINT[] columns in duckdb-rs 1.10502, and Arrow record batches are out of scope for this fix.)

Signed-off-by: Claude <noreply@anthropic.com>
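To illustrate the first fix, here is a small sketch of why storing the display name splits a chart series. The function names `display_format` and `series_count` are hypothetical; only the "vortex-file-compressed" to "vortex" rename comes from the commit message, and the real classifier is not shown in this PR.

```rust
use std::collections::BTreeSet;

/// Read-time display rename. Only the "vortex-file-compressed" -> "vortex"
/// pair is taken from the commit message; other names pass through unchanged.
fn display_format(canonical: &str) -> &str {
    match canonical {
        "vortex-file-compressed" => "vortex",
        other => other,
    }
}

/// Number of distinct chart series implied by the format strings stored
/// on each record (one series per distinct stored string).
fn series_count(stored_formats: &[&str]) -> usize {
    stored_formats.iter().copied().collect::<BTreeSet<_>>().len()
}
```

If a historical row stores `display_format("vortex-file-compressed")` while a live row stores the canonical string, `series_count` sees two series. Storing the canonical string for both and applying `display_format` only when rendering keeps them in one.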

Adds tracing-based phase announcements and periodic progress lines (every 5 seconds) so users know the binary isn't hung during multi-minute migrations. Also fixes an inaccurate doc comment about vortex-compact's ext label and skips empty trailing transactions in both streaming loops. No behavior change - all log output, comment-only edits, and a no-op-transaction elision.

Signed-off-by: Claude <noreply@anthropic.com>
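The periodic progress lines can be sketched as a small time-based throttle. This is an illustration only: the `Progress` type and the message text are hypothetical, and the actual commit logs through the tracing crate rather than returning strings. The current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Emits at most one progress line per `interval`.
struct Progress {
    interval: Duration,
    last: Instant,
}

impl Progress {
    fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last: now }
    }

    /// Returns a progress line if at least `interval` has elapsed since the
    /// last emitted line (or since construction), otherwise `None`.
    fn tick(&mut self, now: Instant, done: usize, total: usize) -> Option<String> {
        if now.duration_since(self.last) < self.interval {
            return None;
        }
        self.last = now;
        Some(format!("migrated {done}/{total} records"))
    }
}
```

In a row loop this would be called once per record with `Instant::now()`, logging only when `tick` returns `Some`, so the log volume stays constant regardless of how many records are processed.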

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 · 2026-04-27T01:17:14Z

There are some group regressions. I will eventually figure out whether we need these groups at all:

Groups in both v2 and v3:
  + Clickbench
  + Compression
  + Compression Size
  + PolarSignals Profiling
  + Random Access
  + Statistical and Population Genetics
  + TPC-DS (NVMe) (SF=1)
  + TPC-H (NVMe) (SF=1)
  + TPC-H (NVMe) (SF=10)
  + TPC-H (NVMe) (SF=100)
  + TPC-H (S3) (SF=1)
  + TPC-H (S3) (SF=10)
  + TPC-H (S3) (SF=100)
Groups only in v2 (regression candidates):
  - TPC-DS (NVMe) (SF=10)
  - TPC-H (NVMe) (SF=1000)
  - TPC-H (S3) (SF=1000)
Groups only in v3:
  + Fineweb
Chart count diffs:
  Compression : v2=10 v3=6 (delta=-4)
  Compression Size : v2=6 v3=4 (delta=-2)

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
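A report like the one above can be produced with plain set operations. The sketch below is not the PR's verify subcommand; the `Metadata` alias, `GroupDiff` struct, and `diff_groups` function are hypothetical, and it assumes each side has already been reduced to a map from group name to chart count.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Group name -> number of charts in that group (assumed input shape).
type Metadata = BTreeMap<String, usize>;

#[derive(Debug, PartialEq)]
struct GroupDiff {
    in_both: Vec<String>,
    only_v2: Vec<String>,
    only_v3: Vec<String>,
    /// (group, v2 count, v3 count) for shared groups whose counts differ.
    chart_count_diffs: Vec<(String, usize, usize)>,
}

/// Compare group names and per-group chart counts between v2 and v3.
/// All output lists are sorted by group name.
fn diff_groups(v2: &Metadata, v3: &Metadata) -> GroupDiff {
    let v2_names: BTreeSet<&String> = v2.keys().collect();
    let v3_names: BTreeSet<&String> = v3.keys().collect();
    GroupDiff {
        in_both: v2_names.intersection(&v3_names).map(|g| g.to_string()).collect(),
        only_v2: v2_names.difference(&v3_names).map(|g| g.to_string()).collect(),
        only_v3: v3_names.difference(&v2_names).map(|g| g.to_string()).collect(),
        chart_count_diffs: v2
            .iter()
            .filter_map(|(group, &a)| {
                // Groups missing from v3 are reported in `only_v2` instead.
                let &b = v3.get(group)?;
                (a != b).then(|| (group.clone(), a, b))
            })
            .collect(),
    }
}
```

With the counts reported above, `chart_count_diffs` would contain `("Compression", 10, 6)` and `("Compression Size", 6, 4)`, and the printed delta is the v3 count minus the v2 count.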

connortsui20 added the changelog/skip (Do not list PR in the changelog) label Apr 26, 2026

claude added 3 commits April 26, 2026 19:36

connortsui20 force-pushed the ct/benchmarks-v3-migration branch 3 times, most recently from 96b6809 to 3d431b3 April 27, 2026 00:21

fix perf and insert bugs

42ad6a1

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 force-pushed the ct/benchmarks-v3-migration branch from 3d431b3 to 42ad6a1 April 27, 2026 00:37

connortsui20 changed the title ~~[claude] Benchmarks v3 migration to duckdb~~ Benchmarks v3 migration to duckdb Apr 27, 2026

connortsui20 marked this pull request as ready for review April 27, 2026 01:21

clean up and fix bugs

b028148

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 force-pushed the ct/benchmarks-v3-migration branch from f764ea4 to b028148 April 27, 2026 01:24

connortsui20 merged commit ae9ad92 into ct/benchmarks-v3 Apr 27, 2026
56 of 57 checks passed

connortsui20 deleted the ct/benchmarks-v3-migration branch April 27, 2026 01:24

connortsui20 merged 5 commits into ct/benchmarks-v3 from ct/benchmarks-v3-migration