
Runs list: ClickHouse timeout on tag filter due to FINAL + hasAny(tags) scan #3426

@eni9889

Description


Summary

Filtering the Runs list by tag in the webapp causes ClickHouse to time out (Timeout error, 30s default request_timeout from @clickhouse/client). For projects with non-trivial run volume, the tag filter is effectively unusable — the UI appears to crash.

Environment

  • Self-hosted trigger.dev v4 (helm chart)
  • ClickHouse 25.7.2
  • task_runs_v2: ~12M rows, 22 active parts, 2.2 GiB on disk, partitioned by toYYYYMM(created_at) across 4 months
  • Reproduces consistently on any tag filter with a moderately old cursor

Failing query (actual log)

SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
WHERE organization_id = {organizationId: String}
  AND project_id = {projectId: String}
  AND environment_id = {environmentId: String}
  AND task_identifier IN {tasks: Array(String)}
  AND status IN {statuses: Array(String)}
  AND hasAny(tags, {tags: Array(String)})
  AND created_at >= fromUnixTimestamp64Milli({period: Int64})
  AND run_id < {runId: String}
ORDER BY created_at DESC, run_id DESC
LIMIT 2

Source: apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts (paginator "has-next" probe, LIMIT 2).

Root cause

Three compounding factors make this query pathologically slow:

  1. FINAL on ReplacingMergeTree — forces merge-at-query-time across every part matching the (organization_id, project_id, environment_id) prefix of the ORDER BY. The existing idx_tags tokenbf_v1 skip index helps at part level but FINAL still reads whole parts to resolve _version/_is_deleted.

  2. hasAny(tags, ...) applied post-FINAL — low selectivity under FINAL means many granules are read and merged before the tag predicate eliminates them. Bloom filter benefit is largely lost.

  3. Cursor pagination LIMIT 2 with no upper bound on created_at — when the cursor (run_id < {runId}) sits far back in time and tag hit density is low, ClickHouse scans backward through many granules looking for 2 matching rows, repeating the FINAL + tag work on every page.
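One way to confirm how much pruning idx_tags actually achieves under FINAL is `EXPLAIN indexes = 1` (available on ClickHouse 25.7); the literal filter values below are placeholders, not real IDs:

```sql
-- Diagnostic sketch with placeholder literals: the output lists, per index
-- (MinMax/partition key, primary key, idx_tags skip index), how many parts
-- and granules survive pruning for this query shape.
EXPLAIN indexes = 1
SELECT run_id
FROM trigger_dev.task_runs_v2 FINAL
WHERE organization_id = 'org_placeholder'
  AND project_id = 'proj_placeholder'
  AND environment_id = 'env_placeholder'
  AND hasAny(tags, ['some-tag'])
ORDER BY created_at DESC, run_id DESC
LIMIT 2
```

If the granule counts barely drop at the idx_tags step, that supports root cause (2) above.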

Observed in production: query exceeds the hard-coded 30s @clickhouse/client default request_timeout (no override in internal-packages/clickhouse/src/client/client.ts), so the webapp logs {"name":"ClickHouse","error":{"message":"Timeout error."}} and returns an error to the UI.

Reproduction

  1. Self-hosted instance with ≥10M rows in task_runs_v2 across 3+ monthly partitions.
  2. Open Runs list in the dashboard.
  3. Apply a tag filter where matches are sparse (e.g. 1 tag that appears in <1% of runs).
  4. Paginate past the first page (cursor advances into older data).
  5. Observe a 30s hang, then a UI error. Webapp logs show a ClickHouse Timeout error with the query above.

Proposed fixes (any of these would help)

  1. Push tag filter into PREWHERE — let ClickHouse apply the bloom-filter-friendly predicate before the FINAL merge work:

    SELECT run_id FROM trigger_dev.task_runs_v2 FINAL
    PREWHERE hasAny(tags, {tags: Array(String)})
    WHERE organization_id = ... AND ...
  2. Drop FINAL, dedupe via argMax(_version) aggregation pattern — avoids merge-at-query-time. Requires group-by on run_id and argMax(status, _version) etc., plus a _is_deleted = 0 filter. Heavier per-row but avoids the worst case.
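A rough sketch of what (2) could look like, assuming the table's _version / _is_deleted columns and moving version-dependent predicates (status) into HAVING — this is illustrative, not the exact query the repository would need:

```sql
-- Sketch only: dedupes per run_id without FINAL. Version-dependent columns
-- are resolved with argMax(..., _version); immutable filters stay in WHERE
-- so the skip index still applies before any merging.
SELECT run_id
FROM trigger_dev.task_runs_v2
WHERE organization_id = {organizationId: String}
  AND project_id = {projectId: String}
  AND environment_id = {environmentId: String}
  AND task_identifier IN {tasks: Array(String)}
  AND hasAny(tags, {tags: Array(String)})
  AND created_at >= fromUnixTimestamp64Milli({period: Int64})
  AND run_id < {runId: String}
GROUP BY run_id
HAVING argMax(_is_deleted, _version) = 0
   AND argMax(status, _version) IN {statuses: Array(String)}
ORDER BY max(created_at) DESC, run_id DESC
LIMIT 2
```

Caveat: if tags can change between versions of a row, the hasAny predicate would also need the argMax treatment, making this heavier still.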

  3. Expose request_timeout on ClickhouseConfig — internal-packages/clickhouse/src/client/client.ts:59 currently doesn't pass request_timeout to createClient. Allow self-hosters to bump it via env while a proper query fix lands.

  4. Bound created_at upper side on cursor queries — carry the cursor row's created_at alongside run_id and add created_at <= {cursorCreatedAt} to prune partitions.
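The bounded-cursor variant of the failing query might look like this, assuming a hypothetical {cursorCreatedAt} parameter carried alongside {runId}; the redundant scalar bound is what lets the toYYYYMM(created_at) partitions be pruned, since partition pruning doesn't see through the tuple comparison:

```sql
-- Sketch of fix (4): keyset pagination on (created_at, run_id) with an explicit
-- upper bound so ClickHouse can skip whole monthly partitions.
SELECT run_id
FROM trigger_dev.task_runs_v2 FINAL
WHERE organization_id = {organizationId: String}
  AND project_id = {projectId: String}
  AND environment_id = {environmentId: String}
  AND task_identifier IN {tasks: Array(String)}
  AND status IN {statuses: Array(String)}
  AND hasAny(tags, {tags: Array(String)})
  AND created_at >= fromUnixTimestamp64Milli({period: Int64})
  AND created_at <= fromUnixTimestamp64Milli({cursorCreatedAt: Int64})
  AND (created_at, run_id) < (fromUnixTimestamp64Milli({cursorCreatedAt: Int64}), {runId: String})
ORDER BY created_at DESC, run_id DESC
LIMIT 2
```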

(1) + (3) are the smallest surface-area changes and would unblock most users.

Workaround we're applying

Bumping ClickHouse pod resources (requests 4 CPU / 16 Gi, limits 8 CPU / 32 Gi) to reduce the chance of hitting the 30s client timeout. Doesn't fix the underlying quadratic-ish scan pattern — tag filter will still degrade as the table grows.

Happy to open a PR for (1) + (3) if there's interest.
