feat: compute deterministic timeseries_id column at ingest#6286
Conversation
Force-pushed 1fba6db to 9ac8674
Force-pushed 9ac8674 to 60d859c
Force-pushed 60d859c to 9522326
Force-pushed 9e5c6ef to cc4492e
Force-pushed cc4492e to 946c229
Force-pushed 9522326 to b0344ba
alanfgates left a comment: One question on metadata changes, other than that lgtm.
WHERE
    index_uid = $1
    AND split_id = ANY($2)
FOR UPDATE
Adding FOR UPDATE means you're locking these rows. Is that what you want? I'm not clear why this change requires the addition of this lock.
FWIW, Claude likes this addition, saying:
The addition of FOR UPDATE in the CTE subquery correctly prevents a TOCTOU race where concurrent mark_metrics_splits_for_deletion could change split state between the CTE read and the DELETE. This matches the established pattern in the non-metrics delete_splits at line 1076-1083 of the same file. Good fix.
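The shape under discussion can be sketched schematically; the table and column names here are abbreviated placeholders, not the PR's exact query:

```sql
-- Schematic only: lock the candidate rows in the CTE so their state cannot
-- change between the read and the DELETE (the TOCTOU window described above).
WITH input_splits AS (
    SELECT split_id, split_state
    FROM splits
    WHERE index_uid = $1
      AND split_id = ANY($2)
    FOR UPDATE
),
deleted_splits AS (
    DELETE FROM splits
    WHERE index_uid = $1
      AND split_id IN (SELECT split_id FROM input_splits)
    RETURNING split_id
)
SELECT split_id, split_state FROM input_splits;
```

Without the FOR UPDATE, a concurrent transaction can change a row's state after the CTE reads it but before the DELETE runs.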
//! implementation for `str` writes `bytes ++ [0xFF]` and for `u8` writes
//! `[byte]`; this has been stable since Rust 1.0. A pinned stability test
//! (`test_hash_stability_pinned`) will catch any regression.
Claude comments:
The stability contract relies on Rust's Hash trait implementation for str (writes bytes ++ [0xFF]) and u8 (writes [byte]). While this has been stable since Rust 1.0 and the pinned tests would catch a regression, this is not a guaranteed API contract — it's an implementation detail of the standard library. A Rust toolchain update could theoretically change it.
Recommendation: Consider hashing raw bytes directly (e.g., hasher.write(metric_name.as_bytes()); hasher.write(&[0xFF]);) rather than going through the Hash trait. This would make the stability explicit in the code rather than relying on an undocumented trait implementation. The pinned tests are a good safety net but catch the problem only after a build — explicit byte feeding prevents it.
It's not clear to me if the added safety here is worth the effort, especially considering a change to the hash implementation for str would probably break a lot of things.
I'm going to take Claude's comment here.
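As a sketch of what the recommendation would look like (`DefaultHasher` stands in for the PR's fixed-key SipHash-2-4, and the function name is illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Feed raw bytes plus an explicit 0xFF separator, rather than relying on the
// standard library's Hash impl for `str` to append the separator for us.
fn hash_fields_explicit(fields: &[&str]) -> u64 {
    // DefaultHasher is a deterministic stand-in for fixed-key SipHash-2-4.
    let mut hasher = DefaultHasher::new();
    for field in fields {
        hasher.write(field.as_bytes());
        hasher.write(&[0xFF]); // separator prevents boundary ambiguity
    }
    hasher.finish()
}

fn main() {
    // Without a separator, ("ab", "c") and ("a", "bc") would feed identical
    // byte streams; the 0xFF byte keeps them distinct.
    assert_ne!(
        hash_fields_explicit(&["ab", "c"]),
        hash_fields_explicit(&["a", "bc"])
    );
}
```

This makes the stability contract explicit in the code instead of depending on the `str` Hash impl, at the cost of a few extra lines.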
metric_type_builder.append_value(dp.metric_type as u8);
timestamp_secs_builder.append_value(dp.timestamp_secs);
value_builder.append_value(dp.value);
timeseries_id_builder.append_value(compute_timeseries_id(
Comment from Claude:
compute_timeseries_id is called per-row and sorts the tags each time. For batches where most rows share the same tag key set, this is redundant work. For typical batch sizes this is unlikely to be a bottleneck, but worth being aware of for large batches.
This seems valid, especially since we're storing one value per row; as compaction proceeds we'll be getting identical tag sets next to each other. Maybe it's too early to optimize this, but leaving a TODO noting we could do it later seems good.
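A sketch of the kind of optimization the TODO could point at: reuse the computed id while consecutive rows carry an identical tag set. All names here are illustrative, not the PR's actual API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

// Stand-in for compute_timeseries_id: a BTreeMap iterates in sorted key
// order, so no per-call sort is needed in this sketch.
fn compute_id(tags: &BTreeMap<String, String>) -> i64 {
    let mut hasher = DefaultHasher::new();
    for (key, value) in tags {
        hasher.write(key.as_bytes());
        hasher.write(&[0xFF]);
        hasher.write(value.as_bytes());
        hasher.write(&[0xFF]);
    }
    hasher.finish() as i64
}

// Recompute only when the tag set changes; runs of identical tag sets (as
// compaction places a series' rows next to each other) hit a cheap equality
// check instead of the hash.
fn ids_with_run_cache(rows: &[BTreeMap<String, String>]) -> Vec<i64> {
    let mut out = Vec::with_capacity(rows.len());
    let mut last: Option<(&BTreeMap<String, String>, i64)> = None;
    for tags in rows {
        let id = match last {
            Some((prev, id)) if prev == tags => id,
            _ => compute_id(tags),
        };
        last = Some((tags, id));
        out.push(id);
    }
    out
}

fn main() {
    let base: BTreeMap<String, String> =
        [("host".to_string(), "a".to_string())].into_iter().collect();
    let ids = ids_with_run_cache(&[base.clone(), base.clone()]);
    assert_eq!(ids[0], ids[1]);
}
```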
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0344baed2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
fields.push(Field::new("timeseries_id", DataType::Int64, false));

for &tag_key in &sorted_tag_keys {
    fields.push(Field::new(
Filter reserved `timeseries_id` tag before schema expansion

Adding the fixed timeseries_id field here without reserving that key in tag ingestion allows an input attribute named timeseries_id to be added again in the dynamic tag loop, producing duplicate column names in one RecordBatch. This path is reachable because create_number_data_point only strips REQUIRED_FIELDS (metric_name, metric_type, timestamp_secs, value), not timeseries_id. With duplicate names, downstream schema lookups resolve only one match, so one of the two timeseries_id columns is effectively shadowed/ambiguous and query behavior becomes incorrect for those inputs.
It's not impossible that a user would want to submit a timeseries_id tag; we'll probably want to deal with this at some point. I also don't think we want to explicitly reserve the timeseries_id tag. Adding a TODO.
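If the TODO were resolved by filtering rather than reserving, it could look something like this sketch, which drops a colliding user tag during the dynamic-field loop (names are illustrative, not the PR's code):

```rust
// Hypothetical guard: skip user tags whose name collides with a fixed column,
// so the dynamic tag loop cannot emit a duplicate field name.
const RESERVED_COLUMN_NAMES: &[&str] = &["timeseries_id"];

fn dynamic_tag_fields(sorted_tag_keys: &[&str]) -> Vec<String> {
    sorted_tag_keys
        .iter()
        .filter(|key| !RESERVED_COLUMN_NAMES.contains(*key))
        .map(|key| key.to_string())
        .collect()
}

fn main() {
    let fields = dynamic_tag_fields(&["host", "timeseries_id", "zone"]);
    assert_eq!(fields, vec!["host".to_string(), "zone".to_string()]);
}
```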
Add a timeseries_id column (Int64) to the metrics Arrow batch, computed as a SipHash-2-4 of the series identity columns (metric_name, metric_type, and all tags excluding temporal/value columns). The hash uses fixed keys for cross-process determinism. The column is already declared in the metrics default sort schema (between host and timestamp_secs), so the parquet writer now automatically sorts by it and places it in the correct physical position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The timeseries_id hash is persisted to Parquet files — any change
silently corrupts compaction and queries. Add:
- 3 pinned stability tests with hardcoded expected hash values
- 3 proptest properties (order independence, excluded tag immunity,
extra-tag discrimination) each running 256 random cases
- Boundary ambiguity test ({"ab":"c"} vs {"a":"bc"})
- Same-series-different-timestamp invariant test
- All-excluded-tags coverage (every EXCLUDED_TAGS entry verified)
- Edge cases: empty strings, unicode, 100-tag cardinality
- Module-level doc explaining the stability contract
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirror the CTE + FOR UPDATE pattern from delete_splits to prevent stale-state races. Without row locking, a concurrent mark_metrics_splits_for_deletion can commit between the state read and the DELETE, causing spurious FailedPrecondition errors and retry churn. The new query locks the target rows before reading their state, reports not-deletable (Staged/Published) and not-found splits separately, and only deletes when all requested splits are in MarkedForDeletion state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed b0344ba to f846e6c
Commit history was messed up; I cherry-picked the 4 relevant commits on top of main and force pushed. Just in case, I pushed the original history to the https://github.com/quickwit-oss/quickwit/tree/matthew.kim/gtt/sorted-series-column remote branch.
Force-pushed 43f35da to 0a0adb5
Force-pushed 0a0adb5 to 720f498
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 720f498323
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8c0c100c6
Summary
- Adds a `timeseries_id` column (Int64) to the metrics Arrow batch, computed as a deterministic SipHash-2-4 of the series identity columns: `metric_name`, `metric_type`, and all tags — excluding temporal columns (`timestamp_secs`, `start_timestamp_secs`, `timestamp`) and value columns (`value`, plus the DDSketch components from "[metrics] Support DDSketch in the parquet pipeline" #6257: `count`, `sum`, `min`, `max`, `flags`, `keys`, `counts`)
- The column sits in the default sort schema (`metric_name | service | env | datacenter | region | host | timeseries_id | timestamp_secs` in V2), so the writer automatically sorts by it and places it in the correct physical position
- Adds a `TimeseriesId` variant to the `ParquetField` enum and updates `SORT_ORDER`

Design reference
Sorted Series Column for QW Parquet Pipeline — this PR implements the Timeseries ID component; the full Sorted Series composite key is a follow-up.
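A minimal sketch of the series-identity hash described above, assuming only the exclusion list from this summary; `DefaultHasher` stands in for the fixed-key SipHash-2-4, and the function name is illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Temporal and value columns listed in the summary; they never contribute
// to the series identity.
const EXCLUDED_TAGS: &[&str] = &[
    "timestamp_secs", "start_timestamp_secs", "timestamp",            // temporal
    "value", "count", "sum", "min", "max", "flags", "keys", "counts", // value / DDSketch
];

fn timeseries_id_sketch(metric_name: &str, metric_type: u8, tags: &[(&str, &str)]) -> i64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(metric_name.as_bytes());
    hasher.write(&[0xFF]); // separator avoids boundary ambiguity
    hasher.write(&[metric_type]);
    // Sort the identity tags so ingestion order cannot change the id.
    let mut identity: Vec<&(&str, &str)> = tags
        .iter()
        .filter(|(key, _)| !EXCLUDED_TAGS.contains(key))
        .collect();
    identity.sort();
    for (key, value) in identity {
        hasher.write(key.as_bytes());
        hasher.write(&[0xFF]);
        hasher.write(value.as_bytes());
        hasher.write(&[0xFF]);
    }
    hasher.finish() as i64
}

fn main() {
    // Order independence and temporal-column exclusion.
    let a = timeseries_id_sketch("cpu", 1, &[("host", "h1"), ("env", "prod")]);
    let b = timeseries_id_sketch("cpu", 1, &[("env", "prod"), ("host", "h1")]);
    let c = timeseries_id_sketch("cpu", 1, &[("host", "h1"), ("env", "prod"), ("timestamp", "9")]);
    assert_eq!(a, b);
    assert_eq!(a, c);
}
```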
Test plan
- Unit tests for `compute_timeseries_id` (determinism, exclusions, order independence, key/value non-interchangeability)
- `quickwit-parquet-engine` and `quickwit-opentelemetry` pass with updated column counts
- `reorder_columns` (not introduced by this PR)

🤖 Generated with Claude Code