-
-
Notifications
You must be signed in to change notification settings - Fork 899
fix(clickhouse): ensure start_time is never older than X ms to prevent old partition merge issues #2721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
WalkthroughThis change adds a configurable clamping mechanism for ClickHouse event start times: a new environment variable Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts (1)
37-121: Critical: Unresolved merge conflict and missingstartTimeMaxAgeMsin v2 initializer.This file has two issues:
- Unresolved merge conflict markers that will prevent compilation
- Missing
startTimeMaxAgeMsin v2 initializer (lines 123-136) - the clamping feature won't work for v2 repositoriesResolve the conflict and add
startTimeMaxAgeMsto both initializers:function initializeClickhouseRepository() { // ... existing code ... const repository = new ClickhouseEventRepository({ clickhouse: clickhouse, batchSize: env.EVENTS_CLICKHOUSE_BATCH_SIZE, flushInterval: env.EVENTS_CLICKHOUSE_FLUSH_INTERVAL_MS, maximumTraceSummaryViewCount: env.EVENTS_CLICKHOUSE_MAX_TRACE_SUMMARY_VIEW_COUNT, maximumTraceDetailedSummaryViewCount: env.EVENTS_CLICKHOUSE_MAX_TRACE_DETAILED_SUMMARY_VIEW_COUNT, maximumLiveReloadingSetting: env.EVENTS_CLICKHOUSE_MAX_LIVE_RELOADING_SETTING, insertStrategy: env.EVENTS_CLICKHOUSE_INSERT_STRATEGY, waitForAsyncInsert: env.EVENTS_CLICKHOUSE_WAIT_FOR_ASYNC_INSERT === "1", asyncInsertMaxDataSize: env.EVENTS_CLICKHOUSE_ASYNC_INSERT_MAX_DATA_SIZE, asyncInsertBusyTimeoutMs: env.EVENTS_CLICKHOUSE_ASYNC_INSERT_BUSY_TIMEOUT_MS, + startTimeMaxAgeMs: env.EVENTS_CLICKHOUSE_START_TIME_MAX_AGE_MS, version: "v1", }); return repository; } function initializeClickhouseRepositoryV2() { // ... existing code ... const repository = new ClickhouseEventRepository({ clickhouse: clickhouse, batchSize: env.EVENTS_CLICKHOUSE_BATCH_SIZE, flushInterval: env.EVENTS_CLICKHOUSE_FLUSH_INTERVAL_MS, maximumTraceSummaryViewCount: env.EVENTS_CLICKHOUSE_MAX_TRACE_SUMMARY_VIEW_COUNT, maximumTraceDetailedSummaryViewCount: env.EVENTS_CLICKHOUSE_MAX_TRACE_DETAILED_SUMMARY_VIEW_COUNT, maximumLiveReloadingSetting: env.EVENTS_CLICKHOUSE_MAX_LIVE_RELOADING_SETTING, insertStrategy: env.EVENTS_CLICKHOUSE_INSERT_STRATEGY, waitForAsyncInsert: env.EVENTS_CLICKHOUSE_WAIT_FOR_ASYNC_INSERT === "1", asyncInsertMaxDataSize: env.EVENTS_CLICKHOUSE_ASYNC_INSERT_MAX_DATA_SIZE, asyncInsertBusyTimeoutMs: env.EVENTS_CLICKHOUSE_ASYNC_INSERT_BUSY_TIMEOUT_MS, + startTimeMaxAgeMs: env.EVENTS_CLICKHOUSE_START_TIME_MAX_AGE_MS, version: "v2", }); return repository; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
references/hello-world/src/trigger/example.tsis excluded by!references/**
📒 Files selected for processing (3)
apps/webapp/app/env.server.ts(1 hunks)apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts(30 hunks)apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts(2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
apps/webapp/app/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Access all environment variables through the
envexport ofenv.server.tsinstead of directly accessingprocess.envin the Trigger.dev webapp
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: When importing from@trigger.dev/corein the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
**/*.{js,ts,jsx,tsx,json,md,css,scss}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier
Files:
apps/webapp/app/env.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.tsapps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
🧠 Learnings (5)
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp
Applied to files:
apps/webapp/app/env.server.ts
📚 Learning: 2025-06-14T08:07:46.625Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 2175
File: apps/webapp/app/services/environmentMetricsRepository.server.ts:202-207
Timestamp: 2025-06-14T08:07:46.625Z
Learning: In apps/webapp/app/services/environmentMetricsRepository.server.ts, the ClickHouse methods (getTaskActivity, getCurrentRunningStats, getAverageDurations) intentionally do not filter by the `tasks` parameter at the ClickHouse level, even though the tasks parameter is accepted by the public methods. This is done on purpose as there is not much benefit from adding that filtering at the ClickHouse layer.
Applied to files:
apps/webapp/app/env.server.ts
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.
Applied to files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging
Applied to files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Limit task duration using the `maxDuration` property (in seconds)
Applied to files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
🧬 Code graph analysis (1)
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts (3)
apps/webapp/app/v3/eventRepository/common.server.ts (2)
getNowInNanoseconds(23-25)convertDateToNanoseconds(53-55)apps/webapp/app/v3/otlpExporter.server.ts (1)
events(125-131)internal-packages/clickhouse/src/taskEvents.ts (4)
TaskEventV1Input(5-22)TaskEventV1Input(24-24)TaskEventV2Input(139-158)TaskEventV2Input(160-160)
🪛 Biome (2.1.2)
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
[error] 82-90: Expected a statement but instead found '||||||| ancestor
/**
- The version of the ClickHouse task_events table to use.
-
- "v1": Uses task_events_v1 (partitioned by start_time)
-
- "v2": Uses task_events_v2 (partitioned by inserted_at to avoid "too many parts" errors)
*/
version?: "v1" | "v2"'.
- "v2": Uses task_events_v2 (partitioned by inserted_at to avoid "too many parts" errors)
Expected a statement here.
(parse)
[error] 185-187: Expected a statement but instead found '||||||| ancestor
async #flushBatch(flushId: string, events: TaskEventV1Input[])'.
Expected a statement here.
(parse)
[error] 187-189: Expected a statement but instead found '=======
async #flushBatch(flushId: string, events: (TaskEventV1Input | TaskEventV2Input)[])'.
Expected a statement here.
(parse)
apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
[error] 121-123: Expected a statement but instead found '>>>>>>> theirs'.
Expected a statement here.
(parse)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (5)
apps/webapp/app/env.server.ts (1)
1152-1159: LGTM!The new environment variable follows existing patterns and uses zod for validation as per coding guidelines. The 5-minute default is reasonable for preventing old partition merge issues.
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts (4)
141-183: Clamping implementation is correct.The three helper methods correctly implement the time clamping logic:
- Proper early return when
startTimeMaxAgeMsis not configured- Correct nanosecond conversion using
* 1_000_000n- Clamps to current time when the start time exceeds the maximum age threshold
264-264: Consistent application of clamping across event creation paths.The clamping is properly applied to all event creation methods including:
createEventToTaskEventV1Input- Span event conversions (exception, cancellation, attempt failed, other)
recordEventandtraceEvent- All
complete*RunEventmethodsThis comprehensive coverage ensures no old start times can cause partition merge issues.
1231-1243: Earliest start time tracking logic is sound.The merge methods correctly:
- Track the earliest start time across all non-override/non-event records
- Exclude
ANCESTOR_OVERRIDEandSPAN_EVENTfrom the calculation (as these are metadata, not the primary span)- Apply the earliest time to the final span summary
This ensures consistent
startTimerepresentation when multiple records exist for a span.
1053-1055: Reasonable buffer for endCreatedAt.The 1-minute buffer on
endCreatedAtprovides tolerance for clock skew and events that may have shifted timestamps due to clamping. This aligns well with the clamping feature.
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
Outdated
Show resolved
Hide resolved
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
Outdated
Show resolved
Hide resolved
f674fe4 to
b4cb3d7
Compare
apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
Outdated
Show resolved
Hide resolved
…t old partition merge issues
b4cb3d7 to
461afb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts (2)
76-81: startTimeMaxAgeMs option wiring looks good; 0ms currently behaves like “no clamping”The config shape and JSDoc match the intended behavior. Note that because the helpers test
if (!this._config.startTimeMaxAgeMs), an explicit0value is treated the same asundefined(i.e., clamping disabled). If you ever want0to mean “clamp everything to now”, you’d need a nullish check (== null) instead of a falsy check; otherwise this is fine and just worth being aware of.
1043-1045: End‑window buffering in getTraceSummary is reasonable; consider tying it to max‑ageAdding a 1s buffer before
startCreatedAtand a 60s buffer afterendCreatedAtshould help avoid missing slightly late or clamped events at the tail of the window.If
EVENTS_CLICKHOUSE_START_TIME_MAX_AGE_MSis tuned significantly above 60 000ms, you might consider deriving this buffer fromstartTimeMaxAgeMs(or at least documenting the relationship) so the query window is guaranteed to encompass any clamped insertions for that trace.Also applies to: 1058-1061
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
references/hello-world/src/trigger/example.tsis excluded by!references/**
📒 Files selected for processing (3)
apps/webapp/app/env.server.ts(1 hunks)apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts(30 hunks)apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- apps/webapp/app/v3/eventRepository/clickhouseEventRepositoryInstance.server.ts
- apps/webapp/app/env.server.ts
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
apps/webapp/app/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Access all environment variables through the
envexport ofenv.server.tsinstead of directly accessingprocess.envin the Trigger.dev webapp
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: When importing from@trigger.dev/corein the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
**/*.{js,ts,jsx,tsx,json,md,css,scss}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier
Files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
🧠 Learnings (1)
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Limit task duration using the `maxDuration` property (in seconds)
Applied to files:
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
🧬 Code graph analysis (1)
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts (1)
apps/webapp/app/v3/eventRepository/common.server.ts (2)
getNowInNanoseconds(23-25)convertDateToNanoseconds(53-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts (7)
132-178: Clamping helpers correctly implement “old → now” behavior in both ns and Date domainsThe three helpers are consistent with each other and the doc comment:
- They base “now” on epoch nanoseconds/Date via the existing utilities.
- They only adjust timestamps that are older than
startTimeMaxAgeMs, clamping them to the current time as documented.- They short‑circuit cleanly when
startTimeMaxAgeMsis unset/falsy.No correctness issues spotted here.
254-255: All CreateEventInput/span event write paths now clamp start_time to avoid very old partitionsRouting
event.startTimeand eachspanEvent.timethrough#clampAndFormatStartTimeensures no v1 span/log/SPAN_EVENT records can be written with arbitrarily oldstart_timevalues, which matches the PR’s ClickHouse goal.Given this relies on
CreateEventInput.startTimeandconvertDateToNanoseconds(spanEvent.time)being “epoch nanoseconds”, it’s worth double‑checking those types/callers remain consistent (i.e., no monotonichrtimevalues slipping in).Also applies to: 323-325, 359-361, 389-391, 423-425
591-592: recordEvent/traceEvent clamping is consistent with the new policy
recordEventand both the main span and error SPAN_EVENT intraceEventnow clamp theirstart_timevia#clampAndFormatStartTime, so ad‑hoc trace/log writes also respect the max‑age constraint.This will cap the apparent duration of any spans whose provided
options.startTimeis much older than “now”. That’s aligned with the partition‑stability goal, but it does trade precision in long‑lived/late‑reported spans, so it’s a good place to have tests that assert the new expected behavior for extremely oldstartTimeinputs.Also applies to: 692-693, 726-727
764-766: Run completion helpers now clamp createdAt/spanCreatedAt/endTime; semantics look intentionalFor all run completion flows you now:
- Clamp
run.createdAtorspanCreatedAt(orendTimefor the instantaneous attempt‑failed override) via#clampStartTimeDate.- Derive the stored
start_timefrom the clamped value while keeping expiry based on the originalrun.createdAt.That achieves “no start_time older than X ms” without changing retention. The special case in
createAttemptFailedRunEvent(usingendTime ?? new Date()as the effective “start”) matches the idea of that override being an instantaneous event at failure time rather than the run’s creation, but it’s worth confirming that’s the intended timeline in the UI/analytics.Also applies to: 815-817, 858-860, 907-909, 956-958, 1001-1003
1221-1222: SpanDetail now uses earliest non‑event startTime across records, avoiding overrides skewing startThe
#mergeRecordsIntoSpanDetailchanges:
- Track
earliestStartTimeacross all records exceptANCESTOR_OVERRIDEandSPAN_EVENT.- Initialize the span from the first record but then overwrite
span.startTimewith the earliest non‑event timestamp.- Keep individual event times (
SPAN_EVENTetc.) based on their own record timestamps.This aligns with “use the earliest real span/log record as the canonical start” while preventing override/event records from dragging the span earlier or later. Looks correct and matches the PR description.
Also applies to: 1224-1233, 1244-1245, 1271-1272, 1339-1342
1485-1486: SpanSummary aggregation also correctly normalizes to earliest non‑event startTimeThe summary‑level merge now mirrors the detail logic:
earliestStartTimeis computed from all non‑ANCESTOR_OVERRIDE/ non‑SPAN_EVENTrecords.span.data.startTimeis reset to that earliest value after the loop.- Ancestor overrides and span events still contribute per‑event timestamps via
recordStartTime.This should make the span list view consistent with the detailed view and avoid cases where a later non‑SPAN record defines the start purely because of ordering.
Also applies to: 1488-1497, 1513-1514, 1540-1541, 1571-1574
1741-1742: Detailed summary view now shares the same earliest‑startTime semantics as other aggregations
#mergeRecordsIntoSpanDetailedSummaryfollows the same pattern: track the earliest non‑override/non‑event start, initialize from the first record, then overwritedata.startTimewith the earliest seen value and keep per‑event times.That keeps all three representations (detail, summary, detailedSummary) aligned in how they define the canonical span start, which should simplify reasoning in the UI and override logic.
Also applies to: 1744-1753, 1768-1769, 1796-1797, 1823-1826
This is an alternate solution to the fix made in #2719 because task_events_v1 can cause issues even if hardly any events are being sent there.
Core problem with task_events_v1
start_timefrom 3 days ago20251028(e.g., 10 rows, 3KB)The old partitions will never stabilise because we keep inserting late-arriving data into them. Additionally, large merges like this can overwhelm a ch replica and prevent merges on other tables, causing a cascading failure.
This change basically attacks the root problem: insert task_events_v1 records with old start times. You can now set the
EVENTS_CLICKHOUSE_START_TIME_MAX_AGE_MSenv var (defaults to 5 minutes) to clamp the start_time to close to now. Along with this we've updated the application logic that merged many events into a single span to always use the earliest span start_time, instead of just the first.