fix(timelock): don't advance cache when Telegram send fails by spalen0 · Pull Request #250 · yearn/monitoring

spalen0 · 2026-05-28T12:25:28Z

Summary

Skip TIMELOCK_LAST_TS update on any Telegram chunk failure so failed events get re-fetched and retried, instead of being silently lost.
Include Telegram's response body (the JSON description) in TelegramError so the next 400 is debuggable from the log alone.
Key TimelockController operation grouping by (chainId, operationId) so cross-chain identical payloads don't collide into a single alert.
Escape timelock_info.protocol and timelock_info.label in build_alert_message. The _ in YEARN_TIMELOCK was opening a Markdown V1 italic that never closed — Telegram's parser then failed on the first downstream code-span backtick with can't parse entities: Can't find end of the entity starting at byte offset 807. This is the root cause of the 09:39 UTC 400.
Scrub the bot token from every TelegramError message. requests.HTTPError.__str__() puts the full URL (including bot<TOKEN>) into the exception string; the previous code re-raised that into TelegramError and into crash alerts. GH Actions masks secrets in workflow logs, but local runs and the new error-body path do not.

Timeline of the outage

2026-05-27 18:07 UTC → 2026-05-28 09:02 UTC — Morpho's Market.uniqueKey → marketId API rename made morpho/markets.py exit non-zero. bash -eo pipefail aborted the hourly loop before timelock_alerts.py ran.
2026-05-28 09:39 UTC — first run after the Morpho fix landed. 6 backlogged TimelockEvent rows, 4 ops. The YEARN_TIMELOCK chunk was sent to Telegram and returned 400. The script logged the error, and still advanced TIMELOCK_LAST_TS=1779958199, dropping the failed alerts from the cache window forever.
2026-05-28 ~18:11 UTC — local repro (this PR's diagnostic): re-ran with --no-cache --since-seconds 172800 --protocol YEARN_TIMELOCK. Same 400. Captured response body: {"ok":false,"error_code":400,"description":"Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 807"}. Byte 807 was the closing backtick before the AI summary; the unclosed italic was the _ in YEARN_TIMELOCK at byte 63.
Recovery — sent the two missed alerts manually with the underscore escaped. Mainnet 0x5dac358a… (msg 1562) and Base 0xe8c04f74… (msg 1563) both landed (status 200) in the YEARN_TIMELOCK channel.

Why each change

Cache-on-failure. Even after this PR fixes the escape, the script will still drop events any time a Telegram send fails for any reason (rate limit, future label with a Markdown special character, content rejected by a new server-side rule). Tracking all_sent and skipping the cache write when anything failed makes the retry loop the system's recovery path, at the cost of duplicate alerts for successful protocols on the next run after a partial failure. Worth it.
Error body in TelegramError. Without this, the original 09:39 UTC log line was only 400 Client Error for url: … — useless for diagnosis. With this, the description field (can't parse entities: …) reaches the log.
(chainId, operationId) key. operationId is keccak(targets, values, datas, predecessor, salt) — no contract address. The Yearn timelock has the same address on every chain we monitor; a cross-chain identical payload would collide, the grouper would merge events from different chains into one bucket, and only the first chain's alert would fire. Hasn't bitten production yet but was waiting.
escape_markdown for protocol/label. Direct fix for the underlying 400. Both fields are config-supplied; future additions with any of _ * ` [ in them would silently break alerts the same way otherwise.
Token redaction. During this debugging cycle the bot token leaked into a Claude conversation context via an unredacted exception string. TELEGRAM_BOT_TOKEN_DEFAULT should be rotated independently of this PR; the redaction prevents future leaks.
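The cache-on-failure behaviour described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `process_chunks`, its parameters, and the `(protocol, message, max_ts)` tuple shape are hypothetical, and only the `all_sent` flag is taken from the description.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def process_chunks(
    chunks: Iterable[tuple[str, str, int]],
    send: Callable[[str, str], None],
    last_ts: int,
) -> int:
    """Return the timestamp to cache after trying to send every chunk.

    Each chunk is (protocol, message, max_event_timestamp). All names are
    hypothetical stand-ins for whatever process_events really uses.
    """
    all_sent = True
    newest = last_ts
    for protocol, message, max_ts in chunks:
        try:
            send(protocol, message)
        except Exception as exc:  # the real code would catch TelegramError
            logger.error("send failed for %s: %s", protocol, exc)
            all_sent = False
            continue  # keep going so the other protocols still alert
        newest = max(newest, max_ts)
    # Advance only if every chunk was delivered. Otherwise keep the old
    # timestamp so the next run re-fetches the window and retries, which
    # may duplicate the alerts that did succeed.
    return newest if all_sent else last_ts
```

Returning the old timestamp on any failure is the simplest form of the trade-off stated above: duplicates are tolerated, silent loss is not.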

Still TODO (not in this PR)

AI summary text. format_explanation_line's output is interpolated into the Markdown message unescaped. The LLM happens to not have emitted _/* in the texts I observed, but it could. Either escape on the way out or switch the AI lines to plain code blocks.
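The "escape on the way out" option might look like the sketch below. It assumes Telegram's legacy Markdown parse mode, where `_`, `*`, the backtick and `[` are escaped with a preceding backslash outside of entities. The function name `escape_markdown_v1` is hypothetical; the repository's own `escape_markdown` is not shown here and may behave differently.

```python
import re

# Characters that open an entity in Telegram's legacy Markdown mode.
_MD_V1_SPECIALS = re.compile(r"([_*`\[])")


def escape_markdown_v1(text: str) -> str:
    """Prefix each legacy-Markdown special character with a backslash."""
    return _MD_V1_SPECIALS.sub(r"\\\1", text)


print(escape_markdown_v1("YEARN_TIMELOCK"))
```

This only applies to text placed outside code spans. Text interpolated inside backticks follows different escaping rules, which is one reason the plain code block alternative may be simpler for LLM output.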

Test plan

uv run ruff check . clean
uv run ruff format . clean
uv run pytest tests/ — 407 passed, 4 skipped (pre-existing)
Local diagnostic: posted the failing Mainnet payload unescaped → 400 with can't parse entities body; posted the same payload with _ escaped → 200, message landed in the channel.
Both missed 2026-05-27 alerts (Mainnet + Base) delivered to YEARN_TIMELOCK manually after fix confirmed.
Next hourly CI run after merge: confirm new events alert cleanly end-to-end.

🤖 Generated with Claude Code

`process_events` caught Telegram send errors so the loop would continue, but then unconditionally advanced `TIMELOCK_LAST_TS` to the max event timestamp. The next run saw no new events past that timestamp and the failed alerts were dropped forever.

This bit us on 2026-05-28 09:39 UTC: after a 15h workflow outage caused by Morpho's API rename (which aborted the hourly bash loop before timelock_alerts could run), the recovery run picked up 6 backlogged TimelockEvent rows, built the alerts, hit a Telegram 400 Bad Request on the YEARN_TIMELOCK chunk, logged the failure, and still advanced the cache. The Yearn timelock alerts for 2026-05-27 19:55 (Mainnet) and 20:04 (Base) were lost; the 10:35 UTC run fetched 0 events.

Track per-protocol-chunk success. If any chunk fails, skip the cache update so the next run re-fetches and retries. Risk of duplicate alerts on retry is acceptable; missing alerts is not.

Also include Telegram's response body in `TelegramError`. The `requests` HTTPError text carries only the status line and request URL, but Telegram's JSON body is where the actual reason lives ("can't parse entities", invalid message_thread_id, etc.). Surfacing it removes a debug round-trip when the next 400 happens.

The 2026-05-27 Yearn alerts are still lost from cache. Recovering them needs a one-time cache rollback (set TIMELOCK_LAST_TS=1779830135) once the underlying Telegram 400 root cause is fixed; that is handled separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
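As an illustration of surfacing the response body, the sketch below builds an error message from a status code and raw body text. `describe_failure` is a hypothetical helper, and the `TelegramError` class is a stand-in for the one in the repository. The only assumption about Telegram is that error responses are JSON with a `description` field, as in the body captured in the timeline.

```python
import json


class TelegramError(Exception):
    """Stand-in for the repository's TelegramError."""


def describe_failure(status_code: int, body: str) -> str:
    """Build an error message that includes Telegram's own explanation."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        payload = None
    if isinstance(payload, dict) and "description" in payload:
        detail = str(payload["description"])
    else:
        detail = body[:200]  # non-JSON body: keep a bounded excerpt
    return f"Telegram API returned {status_code}: {detail}"


body = '{"ok":false,"error_code":400,"description":"Bad Request: can\'t parse entities"}'
print(describe_failure(400, body))
```

With `requests`, a caller could pass `response.status_code` and `response.text` and raise `TelegramError(describe_failure(...))`. Building the message from those two values rather than from `str(exc)` also keeps the request URL out of the text.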

`operationId` on a TimelockController is `keccak(targets, values, datas, predecessor, salt)`: purely content-derived, no contract address mixed in. The Yearn timelock at 0x88Ba…BF73 is deployed at the same address on every chain we monitor, so an identical payload scheduled on two of them (plausible for cross-chain governance) would produce the same operationId.

The grouper merged both into a single operations[op_id] bucket. The alert then reads `chain_id = op_events[0]["chainId"]`, so only the first chain gets reported and the second chain's alert is silently dropped.

Key by `f"{chainId}:{operationId}"` so each chain gets its own group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
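A minimal sketch of the composite key, assuming each event is a dict carrying `chainId` and `operationId` as the commit message implies. The function name `group_operations` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_operations(events: list[dict]) -> dict[str, list[dict]]:
    """Group events per chain and operation instead of per operation only."""
    operations: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        # The same operationId on two chains now lands in two buckets.
        key = f"{event['chainId']}:{event['operationId']}"
        operations[key].append(event)
    return dict(operations)


events = [
    {"chainId": 1, "operationId": "0xabc"},
    {"chainId": 8453, "operationId": "0xabc"},
]
print(sorted(group_operations(events)))
```

With the old key both events would share one bucket and only the first chain would be reported; with the composite key each chain produces its own alert.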

…ken from errors

Two related fixes uncovered by recovering the 2026-05-27 missed alerts.

1) Telegram 400 root cause: the `_` in "YEARN_TIMELOCK" was opening a Markdown V1 italic that never closed. The parser then failed on the first downstream code-span backtick with `Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 807`. This silently dropped every Yearn timelock alert until the protocol name happened to be re-escaped or the path was bypassed.

`build_alert_message` now runs both `timelock_info.protocol` and `timelock_info.label` through `escape_markdown` before interpolating them into the Markdown-rendered header. Confirmed: posting the captured Mainnet and Base alerts with the escape applied returned 200 from Telegram, where the unescaped versions returned 400.

2) `requests.HTTPError.__str__()` includes the full request URL, which for Telegram is `https://api.telegram.org/bot<TOKEN>/sendMessage`. The previous error path re-raised that string as `TelegramError(f"...: {e}")`, so the bot token landed in any log or downstream crash alert that surfaced the failure. GitHub Actions auto-masks secrets in workflow logs, but local runs (and the new error-body path) do not.

Add `_redact_bot_token` (a tiny regex sub) and apply it to every `TelegramError` message in `send_telegram_message`. The bot token used during this debugging cycle has already leaked into one Claude conversation and should be rotated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
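The redaction step could be as small as the sketch below. The actual pattern used by `_redact_bot_token` is not shown in this PR text, so the regex and the `<REDACTED>` placeholder here are assumptions; the URL shape is the one quoted in the commit message, and the token in the example is made up.

```python
import re

# Matches the token segment of a Bot API URL: .../bot<TOKEN>/method
_BOT_TOKEN_IN_URL = re.compile(r"(api\.telegram\.org/bot)[^/\s]+")


def redact_bot_token(text: str) -> str:
    """Replace any bot token embedded in a Telegram API URL."""
    return _BOT_TOKEN_IN_URL.sub(r"\1<REDACTED>", text)


message = (
    "400 Client Error: Bad Request for url: "
    "https://api.telegram.org/bot123456:ABC-def/sendMessage"
)
print(redact_bot_token(message))
```

Anchoring on the host keeps the pattern from touching unrelated paths that happen to contain `/bot`, at the cost of missing tokens sent to a self-hosted Bot API server.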

spalen0 marked this pull request as ready for review May 28, 2026 16:04

spalen0 and others added 2 commits May 28, 2026 18:05

spalen0 merged commit 126b974 into main May 28, 2026
2 checks passed

spalen0 deleted the fix/timelock-cache-on-send-failure branch May 28, 2026 16:21

spalen0 merged 3 commits into main from fix/timelock-cache-on-send-failure
