diff --git a/documentation/architecture/time-series-optimizations.md b/documentation/architecture/time-series-optimizations.md index e6f5cae75..9cd908361 100644 --- a/documentation/architecture/time-series-optimizations.md +++ b/documentation/architecture/time-series-optimizations.md @@ -32,6 +32,7 @@ sequential reads, materialized views, and in-memory processing. - **Out-of-order data:** When data arrives out of order, QuestDB [rearranges it](/docs/concepts/partitions/#partition-splitting-and-squashing) to maintain timestamp order. The engine splits partitions to minimize [write amplification](/docs/getting-started/capacity-planning/#write-amplification) and compacts them in the background. + See [Out-of-order data](/docs/concepts/out-of-order-data/) for behavior per ingestion method and tuning guidance. ### Data partitioning and sequential reads diff --git a/documentation/concepts/designated-timestamp.md b/documentation/concepts/designated-timestamp.md index e8c3c3d1c..1f2f58965 100644 --- a/documentation/concepts/designated-timestamp.md +++ b/documentation/concepts/designated-timestamp.md @@ -448,15 +448,12 @@ WHERE received_ts > dateadd('h', -1, now()); ### Out-of-order data -QuestDB handles out-of-order data automatically—no special configuration -needed. Data arriving out of order is merged into the correct position. - -However, excessive out-of-order data increases write amplification. If most -of your data arrives significantly out of order: -- Consider using ingestion time as designated timestamp -- Store event time as a separate indexed column -- Use appropriate partition sizing (smaller partitions = less rewrite per - out-of-order event) +QuestDB accepts out-of-order data automatically. The cost is write +amplification, which scales with partition size and how far behind the +data arrives. + +See [Out-of-order data](/docs/concepts/out-of-order-data/) for behavior +per ingestion method, tuning, and common scenarios. ### Partition size alignment diff --git a/documentation/concepts/out-of-order-data.md b/documentation/concepts/out-of-order-data.md new file mode 100644 index 000000000..da800cfac --- /dev/null +++ b/documentation/concepts/out-of-order-data.md @@ -0,0 +1,234 @@ +--- +title: Out-of-order data +sidebar_label: Out-of-order data +description: + How QuestDB handles out-of-order and late-arriving data. Behavior per + ingestion method, engine mechanics, write amplification, and tuning. +--- + +QuestDB accepts data in any timestamp order. Rows whose timestamps fall behind +already-committed data are merged into their correct chronological position +automatically, with no special configuration or pre-sorting required. + +This page covers what counts as out-of-order, how each ingestion method +handles it, what it costs in write amplification, and how to tune for +workloads where it is common. + +## What QuestDB considers out of order + +A row is out of order when its +[designated timestamp](/docs/concepts/designated-timestamp/) is earlier than +the most recent timestamp already committed to the table. The engine has to +merge the row into its correct chronological position rather than appending +to the end of the data, even if the row's target partition is not the +active (latest) one. + +QuestDB's internal shorthand for out-of-order is **O3**. You will see it in +configuration property names such as `cairo.o3.column.memory.size`. + +Common situations that produce out-of-order data: + +- Late-arriving messages from a queue after a consumer-lag spike +- Replaying a Kafka topic from an earlier offset +- Backfilling historical data while live ingestion continues +- Multiple producers feeding the same table from clocks that are not in sync +- Sensors with intermittent connectivity that buffer readings and flush them + on reconnect + +## Behavior per ingestion method + +| Method | Out-of-order behavior | +|--------|-----------------------| +| ILP (HTTP and TCP) | Accepted. Rows are merged into the correct position. | +| `INSERT` (SQL) | Accepted. Rows are merged into the correct position. | +| `COPY` into a partitioned table | Accepted. Sorted as part of the import. | +| `COPY` into a non-partitioned table | **Rejected.** Non-partitioned `COPY` is serial and requires pre-sorted input. | + +The recommendation is always to ingest into a partitioned, WAL-enabled table. +Non-partitioned `COPY` is the only path that explicitly errors on out-of-order +rows. + +## How the engine handles it + +When out-of-order data arrives, QuestDB does not rewrite the whole partition. +It splits the partition so that only the affected portion is rewritten, then +squashes the splits back together in the background once they are no longer +in the hot write path. + +For the mechanics, including how splits are named and when squashing runs, +see +[Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing). + +## Write amplification + +The most visible cost of out-of-order ingestion is **write amplification**: +how many physical rows are written to disk per logical row committed. +Append-only workloads stay close to 1.0. Out-of-order workloads push the +ratio higher because the engine rewrites portions of existing partitions +to merge new rows into their correct position. + +You can monitor write amplification at two levels: + +- **Per table**, via the `table_write_amp_*` columns in + [`tables()`](/docs/query/functions/meta/#table-metrics-table_-prefix). + These expose p50, p90, p99, and max for recent activity, which is more + diagnostic than a single cumulative ratio. +- **Cluster-wide**, via + [Prometheus metrics](/docs/integrations/other/prometheus/#scraping-prometheus-metrics-from-questdb): + + ``` + write_amplification = questdb_physically_written_rows_total / questdb_committed_rows_total + ``` + + These counters are cumulative for the process lifetime. Compare deltas + over a time window (for example 5 minutes) to see the current rate + rather than the lifetime average. + +### What counts as "high" + +There is no universal threshold. Write amplification is a sensitivity +indicator, not a pass/fail metric. The same ratio can be fine on one +system and crippling on another: + +- **Absolute volume matters more than the ratio.** A ratio of 10,000 that + rewrites one extra row per commit is irrelevant. A ratio of 5 that + rewrites 100 million rows per commit will saturate disk. +- **Disk throughput sets the ceiling.** Fast local NVMe can mask high + ratios that would suspend ingestion on slower network-attached storage. + A workload running fine on SSD can fail when moved to EBS. +- **It behaves like a cliff.** Either the storage subsystem keeps up or it + does not. There is little smooth degradation in between. + +A ratio of 1.0 is the ideal. Beyond that, treat the metric as a *relative* +signal: a sudden jump from your normal baseline indicates a change in +ingestion pattern worth investigating, even if the absolute number is low. +Real production deployments routinely run with write amplification in the +double digits without problems when disk throughput accommodates it. + +### Other sources of write amplification + +Write amplification is not exclusively caused by out-of-order writes. +Other operations that rewrite data show up in the same metric: + +- Incremental [materialized view](/docs/concepts/materialized-views/) + refreshes. Each refresh issues a replace-range commit for the affected + time buckets, so any bucket that is recomputed (because the base table + is still streaming into it, or because late base data lands in an + already-refreshed range) rewrites the prior result on the view. +- `UPDATE` statements, which rewrite the affected column files + (copy-on-write). + +If write amplification is high but your designated timestamps arrive +mostly in order, investigate these other sources before changing partition +size or other O3 tuning. + +## Tuning for out-of-order workloads + +### Fix the source first + +Before tuning partition size, check whether the out-of-order writes are +accidental: a misconfigured client, clock skew between producers, an +unnecessary sort step removed from the pipeline, a Kafka consumer that +got rewound. Fixing the source is almost always more effective than +tuning the storage layout. + +If the out-of-order pattern is genuinely required by your workload +(intermittent connectivity, scheduled backfills, exchange corrections), +proceed to the tuning options below. + +### Partition size + +Smaller partitions reduce the amount of data rewritten per out-of-order +event. If write amplification is causing storage throughput problems and +the out-of-order pattern is unavoidable, step down one partition interval: + +- `PARTITION BY MONTH` to `PARTITION BY DAY` +- `PARTITION BY DAY` to `PARTITION BY HOUR` + +Smaller partitions also mean more partitions overall, which increases +filesystem overhead. The target remains 30 to 80 million rows per +partition. See +[Choosing a partition interval](/docs/concepts/partitions/#choosing-a-partition-interval). + +### O3 memory page size + +The `cairo.o3.column.memory.size` configuration controls how much memory the +writer reserves per column when receiving out-of-order writes. The default +is 8MB, so the writer holds 16MB of RAM per column (2× the page size) during +O3 commits. + +For tables with many columns and frequent out-of-order writes, this can +become a noticeable memory cost. Reducing the value (range 128KB to 8MB) +trades per-write efficiency for lower memory use and the ability to keep +more columns in flight. + +### Deduplication + +If your out-of-order writes may include rows that overwrite existing rows +with the same key (for example, exchange corrections or third-party data +re-published with revisions), enable +[deduplication](/docs/concepts/deduplication/) on the relevant key columns. +The full-row equality check lets QuestDB skip writes for unchanged rows +entirely, which reduces write amplification when reloading large datasets +where only a small portion has changed. + +### Choice of designated timestamp + +When most of your data arrives significantly behind its event time, consider +using ingestion time as the designated timestamp and keeping event time as a +regular column. This converts the workload from out-of-order into +append-only, at the cost of losing event-time interval scans (queries that +filter by event time will read more partitions than strictly necessary). + +This trade-off is rarely worth it for moderate out-of-order workloads but +can help in extreme cases such as bulk replays or sensor networks with +multi-day backfill windows. + +## Common scenarios + +### Exchange time vs gateway time + +Market data tables often have multiple plausible timestamp columns. Exchange +time is when the venue published the event; gateway time is when your +infrastructure received it. Exchange time is the correct event time but can +arrive out of order due to network jitter or batched feeds. Gateway time is +monotonic per consumer but loses temporal accuracy. + +Use exchange time as the designated timestamp for most workloads. The +out-of-order overhead is usually small compared to the analytical value of +querying by event time. Switch to gateway time only if jitter is large +enough that storage throughput cannot keep up after tuning partition size. + +### Kafka replay + +Consuming a Kafka topic from an earlier offset replays old data into +QuestDB. Each batch counts as out-of-order against the live data already +committed. + +Enable deduplication on the row key so that re-consumed rows are detected +as identical to the existing rows and skipped without rewriting. This makes +replay safe to repeat without inflating write amplification. + +### Sensor backfill + +A field device loses connectivity, buffers several hours of readings, and +flushes them when reconnected. The buffered rows have timestamps older than +the live data already in QuestDB. + +This works as expected over ILP. The buffered batch is merged into the +relevant partitions. The cost is proportional to the size of the batch and +how far back in time it reaches. + +## Parquet partitions + +Out-of-order writes targeting a partition that has been converted to Parquet +are governed by storage-tier rules rather than the standard merge path. See +[Storage policy](/docs/concepts/storage-policy/) for the full behavior. + +## See also + +- [Designated timestamp](/docs/concepts/designated-timestamp/) +- [Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing) +- [Deduplication](/docs/concepts/deduplication/) +- [Write-ahead log](/docs/concepts/write-ahead-log/) +- [Write amplification](/docs/getting-started/capacity-planning/#write-amplification) diff --git a/documentation/concepts/partitions.md b/documentation/concepts/partitions.md index 6aed44d68..fe1997098 100644 --- a/documentation/concepts/partitions.md +++ b/documentation/concepts/partitions.md @@ -155,7 +155,9 @@ db/trades/ When out-of-order data arrives for an existing partition, QuestDB may split that partition to avoid rewriting all its data. This is an optimization for write -performance. +performance. For the broader story on out-of-order ingestion, including +write amplification and tuning, see +[Out-of-order data](/docs/concepts/out-of-order-data/). A split occurs when: - The existing partition prefix is larger than the new data plus suffix diff --git a/documentation/concepts/write-ahead-log.md b/documentation/concepts/write-ahead-log.md index a99a236ce..a8454f044 100644 --- a/documentation/concepts/write-ahead-log.md +++ b/documentation/concepts/write-ahead-log.md @@ -20,7 +20,7 @@ This enables concurrent writes, crash recovery, and replication. | **Concurrent writes** | Multiple clients can write simultaneously without blocking | | **Crash recovery** | Committed data is never lost — replay from log after restart | | **Replication** | WAL enables high availability and disaster recovery | -| **Out-of-order handling** | Late-arriving data is merged efficiently | +| **Out-of-order handling** | Late-arriving data is merged efficiently. See [Out-of-order data](/docs/concepts/out-of-order-data/). | | **Deduplication** | Enables [DEDUP UPSERT KEYS](/docs/concepts/deduplication/) | In QuestDB Enterprise, WAL segments are sent to object storage immediately diff --git a/documentation/getting-started/capacity-planning.md b/documentation/getting-started/capacity-planning.md index 0257762a5..91a162c3c 100644 --- a/documentation/getting-started/capacity-planning.md +++ b/documentation/getting-started/capacity-planning.md @@ -101,9 +101,16 @@ For instructions on how to do so, see the ### Write amplification -Write amplification measures how many times data is rewritten during ingestion. -A value of 1.0 means each row is written once (ideal). Higher values indicate -rewrites due to out-of-order data merging into existing partitions. +Write amplification measures how many physical rows are written to disk per +logical row committed. A value of 1.0 means each row is written once (ideal). +Higher values come from any operation that rewrites existing data: out-of-order +ingestion merging into existing partitions, incremental +[materialized view](/docs/concepts/materialized-views/) refreshes replacing +already-refreshed time buckets, and `UPDATE` statements rewriting column files. + +For the full picture on the out-of-order side (per-ingestion-method behavior, +tuning, and common scenarios), see +[Out-of-order data](/docs/concepts/out-of-order-data/). Calculate it using [Prometheus metrics](/docs/integrations/other/prometheus/#scraping-prometheus-metrics-from-questdb): diff --git a/documentation/ingestion/ilp/overview.md b/documentation/ingestion/ilp/overview.md index a0db5c1dd..23acadd8e 100644 --- a/documentation/ingestion/ilp/overview.md +++ b/documentation/ingestion/ilp/overview.md @@ -443,6 +443,14 @@ CREATE TABLE IF NOT EXISTS 'trades' ( You can use the `CREATE TABLE IF NOT EXISTS` construct to make sure the table is created, but without raising an error if the table already exists. +## Data ordering + +ILP performs best when rows arrive in chronological order by their designated +timestamp. QuestDB also accepts rows in any order: late or out-of-order rows +are merged into the correct position automatically. See +[Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing) +for the engine mechanics. + ## HTTP transaction semantics The TCP endpoint does not support transactions. The HTTP ILP endpoint treats diff --git a/documentation/ingestion/import-csv.md b/documentation/ingestion/import-csv.md index fad81f382..2423f7791 100644 --- a/documentation/ingestion/import-csv.md +++ b/documentation/ingestion/import-csv.md @@ -116,8 +116,10 @@ csvstack *.csv > singleFile.csv - `COPY` is parallel when target table is partitioned. -- `COPY` is _serial_ when target table is non-partitioned. Out-of-order - timestamps are rejected. +- `COPY` is _serial_ when target table is non-partitioned, and out-of-order + timestamps are rejected. Partitioned target tables accept out-of-order rows + and sort them during import. See + [Out-of-order data](/docs/concepts/out-of-order-data/) for details. - `COPY` cannot import data into non-empty table. diff --git a/documentation/query/functions/date-time.md b/documentation/query/functions/date-time.md index 54c3d98c9..e27f8bdda 100644 --- a/documentation/query/functions/date-time.md +++ b/documentation/query/functions/date-time.md @@ -51,6 +51,20 @@ reference for Python, Go, Java, JavaScript, C/C++, Rust, and C#/.NET. --- +:::tip Looking for `timestamp()` to elect a designated timestamp? + +The `timestamp(column)` clause assigns a +[designated timestamp](/docs/concepts/designated-timestamp/) to a query result +or to a new table. It is a structural clause rather than a value +transformation, so it is documented on its own page: +[Timestamp function](/docs/query/functions/timestamp/#during-a-select-operation). + +Use it when a table or CTE does not have a designated timestamp, or to change +the designated timestamp for the scope of a query. +::: + +--- + ## Function categories ### Current time diff --git a/documentation/query/functions/timestamp.md b/documentation/query/functions/timestamp.md index 3723b4441..51de87948 100644 --- a/documentation/query/functions/timestamp.md +++ b/documentation/query/functions/timestamp.md @@ -41,14 +41,27 @@ CREATE TABLE tableName (...) timestamp(columnName); Creates a [designated timestamp](/docs/concepts/designated-timestamp/) column in the result of a query. Assigning a timestamp in a `SELECT` statement -(`dynamic timestamp`) allows for time series operations such as `LATEST BY`, -`SAMPLE BY` or `LATEST BY` on tables which do not have a `designated timestamp` +(`dynamic timestamp`) allows for time series operations such as `LATEST ON`, +`SAMPLE BY` or `ASOF JOIN` on tables which do not have a `designated timestamp` assigned. ```questdb-sql SELECT ... FROM tableName timestamp(columnName); ``` +:::warning Data must be sorted by the elected column + +`timestamp(columnName)` does not sort the data. QuestDB assumes the result is +already ordered by `columnName` and uses that assumption for time-series +operations such as `SAMPLE BY`, `LATEST ON` and `ASOF JOIN`. If the data is +not actually in order, results will be silently incorrect and no error is +raised. + +Always add `ORDER BY columnName` before applying `timestamp()` on data that +is not guaranteed sorted (subqueries, `UNION`, `read_parquet()`, casts that +round-trip the timestamp, and so on). +::: + ## Optimization with WHERE clauses When filtering on a designated timestamp column in WHERE clauses, QuestDB automatically optimizes the query by applying time-based partition filtering. This optimization also works with subqueries that return timestamp values. diff --git a/documentation/schema-design-essentials.md b/documentation/schema-design-essentials.md index b88a6c929..8b69ec2f6 100644 --- a/documentation/schema-design-essentials.md +++ b/documentation/schema-design-essentials.md @@ -75,6 +75,18 @@ TIMESTAMP(ts) PARTITION BY MONTH; See [Partitions](/docs/concepts/partitions/) for details. +### Out-of-order writes + +QuestDB accepts data in any timestamp order. Rows that arrive behind already- +committed data are merged into the correct position automatically. The cost +is write amplification, which scales with partition size and how far behind +the data arrives. + +If late or replayed data is part of your workload (exchange corrections, +Kafka replay, sensor backfill), see +[Out-of-order data](/docs/concepts/out-of-order-data/) for behavior per +ingestion method and tuning guidance. + ## Indexing Index your primary filter columns to speed up `WHERE` clause queries. QuestDB diff --git a/documentation/sidebars.js b/documentation/sidebars.js index daa8ce054..988af1630 100644 --- a/documentation/sidebars.js +++ b/documentation/sidebars.js @@ -519,6 +519,7 @@ module.exports = { collapsed: false, items: [ "concepts/designated-timestamp", + "concepts/out-of-order-data", "concepts/timestamps-timezones", "concepts/partitions", "concepts/symbol",