Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions documentation/architecture/time-series-optimizations.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ sequential reads, materialized views, and in-memory processing.
- **Out-of-order data:**
When data arrives out of order, QuestDB [rearranges it](/docs/concepts/partitions/#partition-splitting-and-squashing) to maintain timestamp order. The
engine splits partitions to minimize [write amplification](/docs/getting-started/capacity-planning/#write-amplification) and compacts them in the background.
See [Out-of-order data](/docs/concepts/out-of-order-data/) for behavior per ingestion method and tuning guidance.


### Data partitioning and sequential reads
Expand Down
15 changes: 6 additions & 9 deletions documentation/concepts/designated-timestamp.md
Original file line number Diff line number Diff line change
Expand Up @@ -448,15 +448,12 @@ WHERE received_ts > dateadd('h', -1, now());

### Out-of-order data

QuestDB handles out-of-order data automatically—no special configuration
needed. Data arriving out of order is merged into the correct position.

However, excessive out-of-order data increases write amplification. If most
of your data arrives significantly out of order:
- Consider using ingestion time as designated timestamp
- Store event time as a separate indexed column
- Use appropriate partition sizing (smaller partitions = less rewrite per
out-of-order event)
QuestDB accepts out-of-order data automatically. The cost is write
amplification, which scales with partition size and how far behind the
data arrives.

See [Out-of-order data](/docs/concepts/out-of-order-data/) for behavior
per ingestion method, tuning, and common scenarios.

### Partition size alignment

Expand Down
234 changes: 234 additions & 0 deletions documentation/concepts/out-of-order-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,234 @@
---
title: Out-of-order data
sidebar_label: Out-of-order data
description:
How QuestDB handles out-of-order and late-arriving data. Behavior per
ingestion method, engine mechanics, write amplification, and tuning.
---

QuestDB accepts data in any timestamp order. Rows whose timestamps fall behind
already-committed data are merged into their correct chronological position
automatically, with no special configuration or pre-sorting required.

This page covers what counts as out-of-order, how each ingestion method
handles it, what it costs in write amplification, and how to tune for
workloads where it is common.

## What QuestDB considers out of order

A row is out of order when its
[designated timestamp](/docs/concepts/designated-timestamp/) is earlier than
the most recent timestamp already committed to the table. The engine has to
merge the row into its correct chronological position rather than appending
to the end of the data, even if the row's target partition is not the
active (latest) one.

QuestDB's internal shorthand for out-of-order is **O3**. You will see it in
configuration property names such as `cairo.o3.column.memory.size`.

Common situations that produce out-of-order data:

- Late-arriving messages from a queue after a consumer-lag spike
- Replaying a Kafka topic from an earlier offset
- Backfilling historical data while live ingestion continues
- Multiple producers feeding the same table from clocks that are not in sync
- Sensors with intermittent connectivity that buffer readings and flush them
on reconnect

## Behavior per ingestion method

| Method | Out-of-order behavior |
|--------|-----------------------|
| ILP (HTTP and TCP) | Accepted. Rows are merged into the correct position. |
| `INSERT` (SQL) | Accepted. Rows are merged into the correct position. |
| `COPY` into a partitioned table | Accepted. Sorted as part of the import. |
| `COPY` into a non-partitioned table | **Rejected.** Non-partitioned `COPY` is serial and requires pre-sorted input. |

The recommendation is always to ingest into a partitioned, WAL-enabled table.
Non-partitioned `COPY` is the only path that explicitly errors on out-of-order
rows.

## How the engine handles it

When out-of-order data arrives, QuestDB does not rewrite the whole partition.
It splits the partition so that only the affected portion is rewritten, then
squashes the splits back together in the background once they are no longer
in the hot write path.

For the mechanics, including how splits are named and when squashing runs,
see
[Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing).

## Write amplification

The most visible cost of out-of-order ingestion is **write amplification**:
how many physical rows are written to disk per logical row committed.
Append-only workloads stay close to 1.0. Out-of-order workloads push the
ratio higher because the engine rewrites portions of existing partitions
to merge new rows into their correct position.

You can monitor write amplification at two levels:

- **Per table**, via the `table_write_amp_*` columns in
[`tables()`](/docs/query/functions/meta/#table-metrics-table_-prefix).
These expose p50, p90, p99, and max for recent activity, which is more
diagnostic than a single cumulative ratio.
- **Cluster-wide**, via
[Prometheus metrics](/docs/integrations/other/prometheus/#scraping-prometheus-metrics-from-questdb):

```
write_amplification = questdb_physically_written_rows_total / questdb_committed_rows_total
```

These counters are cumulative for the process lifetime. Compare deltas
over a time window (for example 5 minutes) to see the current rate
rather than the lifetime average.

### What counts as "high"

There is no universal threshold. Write amplification is a sensitivity
indicator, not a pass/fail metric. The same ratio can be fine on one
system and crippling on another:

- **Absolute volume matters more than the ratio.** A ratio of 10,000 that
rewrites one extra row per commit is irrelevant. A ratio of 5 that
rewrites 100 million rows per commit will saturate disk.
- **Disk throughput sets the ceiling.** Fast local NVMe can mask high
ratios that would suspend ingestion on slower network-attached storage.
A workload running fine on SSD can fail when moved to EBS.
- **It behaves like a cliff.** Either the storage subsystem keeps up or it
does not. There is little smooth degradation in between.

A ratio of 1.0 is the ideal. Beyond that, treat the metric as a *relative*
signal: a sudden jump from your normal baseline indicates a change in
ingestion pattern worth investigating, even if the absolute number is low.
Real production deployments routinely run with write amplification in the
double digits without problems when disk throughput accommodates it.

### Other sources of write amplification

Write amplification is not exclusively caused by out-of-order writes.
Other operations that rewrite data show up in the same metric:

- Incremental [materialized view](/docs/concepts/materialized-views/)
refreshes. Each refresh issues a replace-range commit for the affected
time buckets, so any bucket that is recomputed (because the base table
is still streaming into it, or because late base data lands in an
already-refreshed range) rewrites the prior result on the view.
- `UPDATE` statements, which rewrite the affected column files
(copy-on-write).

If write amplification is high but your designated timestamps arrive
mostly in order, investigate these other sources before changing partition
size or other O3 tuning.

## Tuning for out-of-order workloads

### Fix the source first

Before tuning partition size, check whether the out-of-order writes are
accidental: a misconfigured client, clock skew between producers, an
unnecessary sort step removed from the pipeline, a Kafka consumer that
got rewound. Fixing the source is almost always more effective than
tuning the storage layout.

If the out-of-order pattern is genuinely required by your workload
(intermittent connectivity, scheduled backfills, exchange corrections),
proceed to the tuning options below.

### Partition size

Smaller partitions reduce the amount of data rewritten per out-of-order
event. If write amplification is causing storage throughput problems and
the out-of-order pattern is unavoidable, step down one partition interval:

- `PARTITION BY MONTH` to `PARTITION BY DAY`
- `PARTITION BY DAY` to `PARTITION BY HOUR`

Smaller partitions also mean more partitions overall, which increases
filesystem overhead. The target remains 30 to 80 million rows per
partition. See
[Choosing a partition interval](/docs/concepts/partitions/#choosing-a-partition-interval).

### O3 memory page size

The `cairo.o3.column.memory.size` configuration controls how much memory the
writer reserves per column when receiving out-of-order writes. The default
is 8MB, so the writer holds 16MB of RAM per column (2× the page size) during
O3 commits.

For tables with many columns and frequent out-of-order writes, this can
become a noticeable memory cost. Reducing the value (range 128KB to 8MB)
trades per-write efficiency for lower memory use and the ability to keep
more columns in flight.

### Deduplication

If your out-of-order writes may include rows that overwrite existing rows
with the same key (for example, exchange corrections or third-party data
re-published with revisions), enable
[deduplication](/docs/concepts/deduplication/) on the relevant key columns.
The full-row equality check lets QuestDB skip writes for unchanged rows
entirely, which reduces write amplification when reloading large datasets
where only a small portion has changed.

### Choice of designated timestamp

When most of your data arrives significantly behind its event time, consider
using ingestion time as the designated timestamp and keeping event time as a
regular column. This converts the workload from out-of-order into
append-only, at the cost of losing event-time interval scans (queries that
filter by event time will read more partitions than strictly necessary).

This trade-off is rarely worth it for moderate out-of-order workloads but
can help in extreme cases such as bulk replays or sensor networks with
multi-day backfill windows.

## Common scenarios

### Exchange time vs gateway time

Market data tables often have multiple plausible timestamp columns. Exchange
time is when the venue published the event; gateway time is when your
infrastructure received it. Exchange time is the correct event time but can
arrive out of order due to network jitter or batched feeds. Gateway time is
monotonic per consumer but loses temporal accuracy.

Use exchange time as the designated timestamp for most workloads. The
out-of-order overhead is usually small compared to the analytical value of
querying by event time. Switch to gateway time only if jitter is large
enough that storage throughput cannot keep up after tuning partition size.

### Kafka replay

Consuming a Kafka topic from an earlier offset replays old data into
QuestDB. Each batch counts as out-of-order against the live data already
committed.

Enable deduplication on the row key so that re-consumed rows are detected
as identical to the existing rows and skipped without rewriting. This makes
replay safe to repeat without inflating write amplification.

### Sensor backfill

A field device loses connectivity, buffers several hours of readings, and
flushes them when reconnected. The buffered rows have timestamps older than
the live data already in QuestDB.

This works as expected over ILP. The buffered batch is merged into the
relevant partitions. The cost is proportional to the size of the batch and
how far back in time it reaches.

## Parquet partitions

Out-of-order writes targeting a partition that has been converted to Parquet
are governed by storage-tier rules rather than the standard merge path. See
[Storage policy](/docs/concepts/storage-policy/) for the full behavior.

## See also

- [Designated timestamp](/docs/concepts/designated-timestamp/)
- [Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing)
- [Deduplication](/docs/concepts/deduplication/)
- [Write-ahead log](/docs/concepts/write-ahead-log/)
- [Write amplification](/docs/getting-started/capacity-planning/#write-amplification)
4 changes: 3 additions & 1 deletion documentation/concepts/partitions.md
Original file line number Diff line number Diff line change
Expand Up @@ -155,7 +155,9 @@ db/trades/

When out-of-order data arrives for an existing partition, QuestDB may split that
partition to avoid rewriting all its data. This is an optimization for write
performance.
performance. For the broader story on out-of-order ingestion, including
write amplification and tuning, see
[Out-of-order data](/docs/concepts/out-of-order-data/).

A split occurs when:
- The existing partition prefix is larger than the new data plus suffix
Expand Down
2 changes: 1 addition & 1 deletion documentation/concepts/write-ahead-log.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ This enables concurrent writes, crash recovery, and replication.
| **Concurrent writes** | Multiple clients can write simultaneously without blocking |
| **Crash recovery** | Committed data is never lost — replay from log after restart |
| **Replication** | WAL enables high availability and disaster recovery |
| **Out-of-order handling** | Late-arriving data is merged efficiently |
| **Out-of-order handling** | Late-arriving data is merged efficiently. See [Out-of-order data](/docs/concepts/out-of-order-data/). |
| **Deduplication** | Enables [DEDUP UPSERT KEYS](/docs/concepts/deduplication/) |

In QuestDB Enterprise, WAL segments are sent to object storage immediately
Expand Down
13 changes: 10 additions & 3 deletions documentation/getting-started/capacity-planning.md
Original file line number Diff line number Diff line change
Expand Up @@ -101,9 +101,16 @@ For instructions on how to do so, see the

### Write amplification

Write amplification measures how many times data is rewritten during ingestion.
A value of 1.0 means each row is written once (ideal). Higher values indicate
rewrites due to out-of-order data merging into existing partitions.
Write amplification measures how many physical rows are written to disk per
logical row committed. A value of 1.0 means each row is written once (ideal).
Higher values come from any operation that rewrites existing data: out-of-order
ingestion merging into existing partitions, incremental
[materialized view](/docs/concepts/materialized-views/) refreshes replacing
already-refreshed time buckets, and `UPDATE` statements rewriting column files.

For the full picture on the out-of-order side (per-ingestion-method behavior,
tuning, and common scenarios), see
[Out-of-order data](/docs/concepts/out-of-order-data/).

Calculate it using [Prometheus metrics](/docs/integrations/other/prometheus/#scraping-prometheus-metrics-from-questdb):

Expand Down
8 changes: 8 additions & 0 deletions documentation/ingestion/ilp/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -443,6 +443,14 @@ CREATE TABLE IF NOT EXISTS 'trades' (
You can use the `CREATE TABLE IF NOT EXISTS` construct to make sure the table is
created, but without raising an error if the table already exists.

## Data ordering

ILP performs best when rows arrive in chronological order by their designated
timestamp. QuestDB also accepts rows in any order: late or out-of-order rows
are merged into the correct position automatically. See
[Partition splitting and squashing](/docs/concepts/partitions/#partition-splitting-and-squashing)
for the engine mechanics.

## HTTP transaction semantics

The TCP endpoint does not support transactions. The HTTP ILP endpoint treats
Expand Down
6 changes: 4 additions & 2 deletions documentation/ingestion/import-csv.md
Original file line number Diff line number Diff line change
Expand Up @@ -116,8 +116,10 @@ csvstack *.csv > singleFile.csv

- `COPY` is parallel when target table is partitioned.

- `COPY` is _serial_ when target table is non-partitioned. Out-of-order
timestamps are rejected.
- `COPY` is _serial_ when target table is non-partitioned, and out-of-order
timestamps are rejected. Partitioned target tables accept out-of-order rows
and sort them during import. See
[Out-of-order data](/docs/concepts/out-of-order-data/) for details.

- `COPY` cannot import data into non-empty table.

Expand Down
14 changes: 14 additions & 0 deletions documentation/query/functions/date-time.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,20 @@ reference for Python, Go, Java, JavaScript, C/C++, Rust, and C#/.NET.

---

:::tip Looking for `timestamp()` to elect a designated timestamp?

The `timestamp(column)` clause assigns a
[designated timestamp](/docs/concepts/designated-timestamp/) to a query result
or to a new table. It is a structural clause rather than a value
transformation, so it is documented on its own page:
[Timestamp function](/docs/query/functions/timestamp/#during-a-select-operation).

Use it when a table or CTE does not have a designated timestamp, or to change
the designated timestamp for the scope of a query.
:::

---

## Function categories

### Current time
Expand Down
17 changes: 15 additions & 2 deletions documentation/query/functions/timestamp.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,14 +41,27 @@ CREATE TABLE tableName (...) timestamp(columnName);

Creates a [designated timestamp](/docs/concepts/designated-timestamp/) column in
the result of a query. Assigning a timestamp in a `SELECT` statement
(`dynamic timestamp`) allows for time series operations such as `LATEST BY`,
`SAMPLE BY` or `LATEST BY` on tables which do not have a `designated timestamp`
(`dynamic timestamp`) allows for time series operations such as `LATEST ON`,
`SAMPLE BY` or `ASOF JOIN` on tables which do not have a `designated timestamp`
assigned.

```questdb-sql
SELECT ... FROM tableName timestamp(columnName);
```

:::warning Data must be sorted by the elected column

`timestamp(columnName)` does not sort the data. QuestDB assumes the result is
already ordered by `columnName` and uses that assumption for time-series
operations such as `SAMPLE BY`, `LATEST ON` and `ASOF JOIN`. If the data is
not actually in order, results will be silently incorrect and no error is
raised.

Always add `ORDER BY columnName` before applying `timestamp()` on data that
is not guaranteed sorted (subqueries, `UNION`, `read_parquet()`, casts that
round-trip the timestamp, and so on).
:::

## Optimization with WHERE clauses

When filtering on a designated timestamp column in WHERE clauses, QuestDB automatically optimizes the query by applying time-based partition filtering. This optimization also works with subqueries that return timestamp values.
Expand Down
Loading
Loading