feat: implement escalated notifications for monitors #199
realnoob007 wants to merge 1 commit into ramfam101:develop
Add escalation rules that allow users to define time-based alert escalation when incidents persist. Users can configure multiple escalation rules per monitor, each with a wait time and notification channels to alert after the specified duration of downtime.

Frontend:
- Add escalation rules UI section to monitor create/edit page
- Add EscalationRule type, form validation, and i18n translations
- Support dynamic add/remove of escalation rules with time and notification channel selection

Backend:
- Add EscalationRule type and escalationRules field to Monitor schema
- Add escalation checking in heartbeat job on every tick when monitor is down
- Add escalation message builder for escalation-specific notifications
- Track sent escalations per incident to prevent duplicate alerts
- Clear escalation tracking on incident resolution
- Map escalationRules in repository toEntity methods
- Support comma-separated CLIENT_HOST for multiple CORS origins
Pull request overview
Implements time-based escalation notifications for monitors so additional notification channels can be alerted if an incident remains unresolved past configured thresholds.
Changes:
- Adds `escalationRules` to monitor types/schemas and monitor create/edit validation.
- Triggers escalation checks on each heartbeat tick while a monitor is down/breached and clears escalation tracking on recovery.
- Adds escalation-specific notification message building and enables comma-separated CORS origins via `CLIENT_HOST`.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| server/src/validation/monitorValidation.ts | Validates escalationRules payload for create/edit monitor requests. |
| server/src/validation/envValidation.ts | Relaxes CLIENT_HOST validation to support non-URL / comma-separated values. |
| server/src/types/notificationMessage.ts | Adds new notification type "escalation". |
| server/src/types/monitor.ts | Introduces EscalationRule and Monitor.escalationRules. |
| server/src/service/infrastructure/SuperSimpleQueue/SuperSimpleQueueHelper.ts | Runs escalation checks during heartbeat processing and clears tracking on resolution. |
| server/src/service/infrastructure/notificationsService.ts | Implements escalation sending and deduplication tracking. |
| server/src/service/infrastructure/notificationMessageBuilder.ts | Builds escalation-specific notification messages. |
| server/src/repositories/monitors/MongoMonitorsRepository.ts | Maps escalationRules from DB documents into monitor entities. |
| server/src/db/models/Monitor.ts | Adds escalationRules to the Mongoose monitor schema. |
| server/src/app.ts | Splits CLIENT_HOST into multiple CORS origins when comma-separated. |
| client/src/Validation/monitor.ts | Adds client-side validation for escalationRules. |
| client/src/Types/Monitor.ts | Adds EscalationRule and Monitor.escalationRules on the client. |
| client/src/Pages/CreateMonitor/index.tsx | Adds escalation rules UI with add/remove and notification channel selection. |
| client/src/locales/en.json | Adds i18n strings for the escalation rules UI. |
| client/src/Hooks/useMonitorForm.ts | Adds escalation rules defaults for the monitor form. |
    const activeIncident = await this.incidentsRepository.findActiveByMonitorId(monitor.id, monitor.teamId);
    if (!activeIncident) {
        return;
    }

    const incidentStartTime = new Date(parseInt(activeIncident.startTime));
    await this.notificationsService.handleEscalations(monitor, incidentStartTime);

`activeIncident.startTime` is an ISO string (see the incidents repository's `toDateString`), so `parseInt(activeIncident.startTime)` parses only the leading year (e.g. `2026`). `new Date(2026)` then means 2026 ms after the Unix epoch, which produces an incorrect `incidentStartTime`, makes `incidentDuration` huge, and escalates immediately. Parse the value as a date string instead (e.g. `new Date(activeIncident.startTime)`) and handle invalid dates explicitly.
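A minimal sketch of the parsing suggested above, assuming `startTime` is an ISO 8601 string. `parseIncidentStart` is a hypothetical helper name and is not part of the PR:

```typescript
// Hypothetical helper: parse an ISO 8601 incident start time.
// Returns null instead of an Invalid Date so callers must handle bad input.
function parseIncidentStart(startTime: string): Date | null {
	const parsed = new Date(startTime);
	return Number.isNaN(parsed.getTime()) ? null : parsed;
}

// What the reviewed code effectively computes: parseInt stops at the first "-",
// so the Date is built from the number 2026 (milliseconds after the epoch).
const buggyStart = new Date(parseInt("2026-01-15T10:30:00.000Z", 10));

// Parsing the full string keeps the real timestamp.
const fixedStart = parseIncidentStart("2026-01-15T10:30:00.000Z");
```

Returning `null` rather than an `Invalid Date` is a design choice for this sketch; throwing or logging and skipping the escalation check would work equally well.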
    // Check if enough time has elapsed
    if (incidentDuration >= rule.waitTime) {
        this.escalationsSent.add(ruleKey);

        const notifications = await this.notificationsRepository.findNotificationsByIds(rule.notificationIds);
        if (notifications.length === 0) {
            continue;
        }

`handleEscalations` adds `ruleKey` to `escalationsSent` before verifying that there are any matching notifications and before knowing whether any send succeeded. If `notificationIds` is empty/invalid or all sends fail, the rule will be permanently suppressed for this incident. Only mark the rule as sent after a successful send (or at least after a non-empty notification set and an attempted send), depending on the desired retry semantics.
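One way to order these steps is sketched below, under the "mark only after a successful send" interpretation. `Channel`, `SendFn`, and `escalateOnce` are hypothetical names used for illustration; the PR's real notification types and sender are not shown here:

```typescript
// Hypothetical shapes standing in for the PR's notification types.
interface Channel {
	id: string;
}
type SendFn = (channel: Channel) => Promise<boolean>;

// Marks a rule as sent only if at least one channel accepted the message,
// so empty channel lists and total failures are retried on the next tick.
async function escalateOnce(
	ruleKey: string,
	channels: Channel[],
	send: SendFn,
	sent: Set<string>
): Promise<boolean> {
	if (sent.has(ruleKey) || channels.length === 0) {
		return false;
	}
	let anySucceeded = false;
	for (const channel of channels) {
		try {
			if (await send(channel)) {
				anySucceeded = true;
			}
		} catch {
			// A failing channel should not block the remaining channels.
		}
	}
	if (anySucceeded) {
		sent.add(ruleKey);
	}
	return anySucceeded;
}
```

If at-most-once delivery matters more than guaranteed delivery, the `sent.add` call could instead move to just before the send loop, after the empty-list check.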
    for (const rule of escalationRules) {
        const ruleKey = `${incidentKey}:${rule.waitTime}`;

        // Skip if already escalated for this rule in this incident
        if (this.escalationsSent.has(ruleKey)) {
            continue;
        }

The escalation deduplication key is built only from the incident key and `waitTime`. If a monitor has two escalation rules with the same `waitTime` but different `notificationIds`, only one will ever send because both rules collide on the same `ruleKey`. Either enforce unique `waitTime` values in validation/UI or include additional rule identity (e.g. the rule index or a stable hash of `notificationIds`) in the key.
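A minimal sketch of the second option, folding the rule's channels into the key. `buildRuleKey` is a hypothetical helper, and joining the sorted ids is just one possible "stable hash":

```typescript
interface EscalationRule {
	waitTime: number; // milliseconds
	notificationIds: string[];
}

// Hypothetical key builder: sorts a copy of the ids so the key does not
// depend on the order in which the channels were selected.
function buildRuleKey(incidentKey: string, rule: EscalationRule): string {
	const channels = [...rule.notificationIds].sort().join(",");
	return `${incidentKey}:${rule.waitTime}:${channels}`;
}

const pagerRule: EscalationRule = { waitTime: 300000, notificationIds: ["pager"] };
const emailRule: EscalationRule = { waitTime: 300000, notificationIds: ["email"] };

// Same waitTime, different channels: the keys no longer collide.
const keysDiffer = buildRuleKey("monitor1:1700000000000", pagerRule) !== buildRuleKey("monitor1:1700000000000", emailRule);
```

Note that two rules with identical `waitTime` and identical channels still share a key, which is arguably the desired behavior since they would send the same alert twice.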
    // Tracks which escalation rules have already been sent, keyed by "monitorId:incidentStartTime:waitTime"
    private escalationsSent: Set<string> = new Set();

Escalation tracking is stored only in an in-memory `Set`. In a multi-process / multi-instance deployment or after a restart, escalations can be re-sent for the same incident because the set is empty again. If the intention is to prevent duplicates per incident across restarts/replicas, persist this state (e.g. on the Incident record) or derive it from stored notification history.
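One way to prepare for persistence is to hide the tracking behind a small interface, so the in-memory `Set` can later be swapped for a database-backed implementation. Everything below (`EscalationStore`, `InMemoryEscalationStore`, the method names) is hypothetical and not taken from the PR:

```typescript
// Hypothetical abstraction over "which rules were already escalated".
interface EscalationStore {
	wasSent(incidentId: string, ruleKey: string): Promise<boolean>;
	markSent(incidentId: string, ruleKey: string): Promise<void>;
	clear(incidentId: string): Promise<void>;
}

// In-memory stand-in, equivalent in behavior to the current Set.
// A persistent implementation would read and write the same keys
// on the stored incident instead of a Map.
class InMemoryEscalationStore implements EscalationStore {
	private readonly sent = new Map<string, Set<string>>();

	async wasSent(incidentId: string, ruleKey: string): Promise<boolean> {
		return this.sent.get(incidentId)?.has(ruleKey) ?? false;
	}

	async markSent(incidentId: string, ruleKey: string): Promise<void> {
		const keys = this.sent.get(incidentId) ?? new Set<string>();
		keys.add(ruleKey);
		this.sent.set(incidentId, keys);
	}

	async clear(incidentId: string): Promise<void> {
		this.sent.delete(incidentId);
	}
}
```

Keying by incident rather than by monitor also makes clearing on resolution a single delete, without scanning for keys with a matching prefix.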
    const allowedOrigin = envSettings.clientHost.includes(",")
        ? envSettings.clientHost.split(",").map((s) => s.trim())
        : envSettings.clientHost;

`CLIENT_HOST` is now allowed to be comma-separated and is split for CORS, but the same `clientHost` is also used to build links in emails/notifications (e.g. password reset, invite links, incident URLs). If `CLIENT_HOST` contains commas, those links will be invalid. Consider introducing a separate env var for CORS origins (or parse the value and pick a single canonical base URL for link building).
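A sketch of the "parse and pick a canonical base URL" option. Both helper names are hypothetical, and treating the first listed origin as canonical is an assumption of this example, not something the PR defines:

```typescript
// Hypothetical helper: split CLIENT_HOST into trimmed, non-empty origins.
function parseClientHosts(clientHost: string): string[] {
	return clientHost
		.split(",")
		.map((s) => s.trim())
		.filter((s) => s.length > 0);
}

// Assumption: the first origin is the canonical base URL for generated links.
function canonicalClientHost(clientHost: string): string {
	const [first] = parseClientHosts(clientHost);
	if (first === undefined) {
		throw new Error("CLIENT_HOST must contain at least one origin");
	}
	return first;
}

const rawClientHost = "https://app.example.com, https://staging.example.com";
const corsOrigins = parseClientHosts(rawClientHost); // pass this list to the CORS setup
const linkBase = canonicalClientHost(rawClientHost); // use this for email/notification links
```

The same parser also handles the single-origin case, so the `includes(",")` branch in the reviewed snippet would no longer be needed.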
    const details = [
        `URL: ${monitor.url}`,
        `Status: Down`,
        `Type: ${monitor.type}`,
        `Downtime Duration: ${durationMinutes} minute(s)`,
        `Escalation Threshold: ${waitMinutes} minute(s)`,
    ];

Escalation message details hard-code `Status: Down`, but escalations are also triggered for `status === "breached"`. This will send incorrect status information for threshold incidents. Use `monitor.status` (or map it to a friendly label) in the message content.
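A minimal sketch of the label mapping. `STATUS_LABELS`, `buildEscalationDetails`, and the label wording are illustrative assumptions; only the two statuses that trigger escalations are modelled:

```typescript
// Only the statuses that can trigger an escalation are covered here.
type EscalatingStatus = "down" | "breached";

// Hypothetical friendly labels; the exact wording is illustrative.
const STATUS_LABELS: Record<EscalatingStatus, string> = {
	down: "Down",
	breached: "Threshold breached",
};

function buildEscalationDetails(
	monitor: { url: string; type: string; status: EscalatingStatus },
	durationMinutes: number,
	waitMinutes: number
): string[] {
	return [
		`URL: ${monitor.url}`,
		`Status: ${STATUS_LABELS[monitor.status]}`,
		`Type: ${monitor.type}`,
		`Downtime Duration: ${durationMinutes} minute(s)`,
		`Escalation Threshold: ${waitMinutes} minute(s)`,
	];
}
```

Using a `Record` keyed by the status union means the compiler flags a missing label if another escalating status is added later.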
    escalationRules: z.array(z.object({
        waitTime: z.number().min(1000, "Wait time must be at least 1 second"),
        notificationIds: z.array(z.string()).min(1, "At least one notification is required"),

Server-side validation allows `waitTime` as low as 1 second, but the client-side schema/UI enforces a minimum of 1 minute and only presents options of one minute or more. This mismatch can lead to confusing behavior (e.g. the API accepts values the UI can't set, and logs show `0 minute(s)`). Align server validation with the client/UI constraints (or update the client/UI to match the intended minimum).
    escalationRules: z.array(z.object({
        waitTime: z.number().min(1000, "Wait time must be at least 1 second"),
        notificationIds: z.array(z.string()).min(1, "At least one notification is required"),

Same as create validation: the server allows `waitTime` down to 1 second while the client enforces 1 minute. Please keep create/edit validation consistent with the UI and the escalation scheduler expectations.
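One way to keep the create and edit paths aligned is to derive both from a single shared minimum. The sketch below uses a plain validator so that it stays self-contained; in the PR's zod schemas the same constant would replace the literal `1000` passed to `.min(...)`. The constant name, the helper name, and the choice of one minute as the intended minimum are assumptions:

```typescript
// Assumption: one minute is the intended minimum, matching the client UI.
const MIN_ESCALATION_WAIT_MS = 60 * 1000;

interface EscalationRuleInput {
	waitTime: number; // milliseconds
	notificationIds: string[];
}

// Hypothetical validator shared by the create and edit paths.
// Returns a list of human-readable errors; an empty list means valid.
function validateEscalationRule(rule: EscalationRuleInput): string[] {
	const errors: string[] = [];
	if (!Number.isFinite(rule.waitTime) || rule.waitTime < MIN_ESCALATION_WAIT_MS) {
		errors.push("Wait time must be at least 1 minute");
	}
	if (rule.notificationIds.length === 0) {
		errors.push("At least one notification is required");
	}
	return errors;
}
```

With the minimum at one minute, a `waitTime` accepted by the server can no longer be rendered as `0 minute(s)` in the escalation message.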
    // Step 6.5. Handle escalation notifications (check on every heartbeat when monitor is down)
    if (statusChangeResult.monitor.status === "down" || statusChangeResult.monitor.status === "breached") {
        this.handleEscalationCheck(statusChangeResult.monitor).catch((error: unknown) => {
            this.logger.error({
                message: `Error handling escalations for monitor ${statusChangeResult.monitor.id}: ${error instanceof Error ? error.message : "Unknown error"}`,
                service: SERVICE_NAME,
                method: "getMonitorJob",
                stack: error instanceof Error ? error.stack : undefined,
            });
        });
    }

    // Step 6.6. Clear escalation tracking when monitor recovers
    if (decision.shouldResolveIncident) {
        this.notificationsService.clearEscalationTracking(statusChangeResult.monitor.id);
    }

New escalation handling in the heartbeat job isn't covered by existing tests in `server/test/superSimpleQueueHelper.test.ts`. Adding tests for (1) escalation checks being triggered only when status is `down`/`breached` and (2) `clearEscalationTracking` being called on incident resolution would help prevent regressions.
    <IconButton
        size="small"
        onClick={() => removeEscalation(index)}
        aria-label="Remove escalation rule"
        sx={{ mt: theme.spacing(SPACING.LG) }}

The `aria-label` for the remove button is hard-coded in English. Since the page uses i18n for visible strings, this label should also be translated to ensure consistent localization and accessibility.
Fixes #123