fix(timelock): don't advance cache when Telegram send fails #250

Merged: spalen0 merged 3 commits into main from fix/timelock-cache-on-send-failure on May 28, 2026

@spalen0 spalen0 commented May 28, 2026

Summary

  • Skip TIMELOCK_LAST_TS update on any Telegram chunk failure so failed events get re-fetched and retried, instead of being silently lost.
  • Include Telegram's response body (the JSON description) in TelegramError so the next 400 is debuggable from the log alone.
  • Key TimelockController operation grouping by (chainId, operationId) so cross-chain identical payloads don't collide into a single alert.
  • Escape timelock_info.protocol and timelock_info.label in build_alert_message. The _ in YEARN_TIMELOCK was opening a Markdown V1 italic that never closed — Telegram's parser then failed on the first downstream code-span backtick with can't parse entities: Can't find end of the entity starting at byte offset 807. This is the root cause of the 09:39 UTC 400.
  • Scrub the bot token from every TelegramError message. requests.HTTPError.__str__() puts the full URL (including bot<TOKEN>) into the exception string; the previous code re-raised that into TelegramError and into crash alerts. GH Actions masks secrets in workflow logs, but local runs and the new error-body path do not.

Timeline of the outage

  1. 2026-05-27 18:07 UTC → 2026-05-28 09:02 UTC: Morpho's API rename of Market.uniqueKey to marketId made morpho/markets.py exit non-zero. bash -eo pipefail aborted the hourly loop before timelock_alerts.py ran.
  2. 2026-05-28 09:39 UTC: first run after the Morpho fix landed. It picked up 6 backlogged TimelockEvent rows covering 4 ops. The YEARN_TIMELOCK chunk was sent to Telegram and returned 400. The script logged the error but still advanced TIMELOCK_LAST_TS=1779958199, dropping the failed alerts from the cache window forever.
  3. 2026-05-28 ~18:11 UTC: local repro (this PR's diagnostic). Re-ran with --no-cache --since-seconds 172800 --protocol YEARN_TIMELOCK and got the same 400. Captured response body: {"ok":false,"error_code":400,"description":"Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 807"}. Byte 807 was the closing backtick before the AI summary; the unclosed italic was the _ in YEARN_TIMELOCK at byte 63.
  4. Recovery: sent the two missed alerts manually with the underscore escaped. Mainnet 0x5dac358a… (msg 1562) and Base 0xe8c04f74… (msg 1563) both landed (status 200) in the YEARN_TIMELOCK channel.

Why each change

  • Cache-on-failure. Even with the escape fixed, the script would otherwise still drop events any time Telegram rejects a send for any reason (rate limit, a future label with a Markdown special character, content rejected by a new server-side rule). Tracking all_sent and skipping the cache write when anything failed makes the retry loop the system's recovery path, at the cost of duplicate alerts for successful protocols on the next run after a partial failure. Worth it.
  • Error body in TelegramError. Without this, the original 09:39 UTC log line was only 400 Client Error for url: … — useless for diagnosis. With this, the description field (can't parse entities: …) reaches the log.
  • (chainId, operationId) key. operationId is keccak(targets, values, datas, predecessor, salt), with no contract address mixed in. The Yearn timelock has the same address on every chain we monitor; a cross-chain identical payload would collide, the grouper would merge events from different chains into one bucket, and only the first chain's alert would fire. This hasn't bitten production yet, but it was waiting to.
  • escape_markdown for protocol/label. Direct fix for the underlying 400. Both fields are config-supplied; future additions containing any of _, *, `, or [ would otherwise silently break alerts the same way.
  • Token redaction. During this debugging cycle the bot token leaked into a Claude conversation context via an unredacted exception string. TELEGRAM_BOT_TOKEN_DEFAULT should be rotated independently of this PR; the redaction prevents future leaks.

Still TODO (not in this PR)

  1. AI summary text. format_explanation_line's output is interpolated into the Markdown message unescaped. The LLM happens to not have emitted _/* in the texts I observed, but it could. Either escape on the way out or switch the AI lines to plain code blocks.

Test plan

  • uv run ruff check . clean
  • uv run ruff format . clean
  • uv run pytest tests/ — 407 passed, 4 skipped (pre-existing)
  • Local diagnostic: posted the failing Mainnet payload unescaped → 400 with can't parse entities body; posted the same payload with _ escaped → 200, message landed in the channel.
  • Both missed 2026-05-27 alerts (Mainnet + Base) delivered to YEARN_TIMELOCK manually after fix confirmed.
  • Next hourly CI run after merge: confirm new events alert cleanly end-to-end.

🤖 Generated with Claude Code

`process_events` caught Telegram send errors so the loop would continue,
but then unconditionally advanced `TIMELOCK_LAST_TS` to the max event
timestamp. The next run saw no new events past that timestamp and the
failed alerts were dropped forever.

This bit us on 2026-05-28 09:39 UTC: after a 15h workflow outage caused
by Morpho's API rename (which aborted the hourly bash loop before
timelock_alerts could run), the recovery run picked up 6 backlogged
TimelockEvent rows, built the alerts, hit a Telegram 400 Bad Request on
the YEARN_TIMELOCK chunk, logged the failure — and still advanced the
cache. The Yearn timelock alerts for 2026-05-27 19:55 (Mainnet) and
20:04 (Base) were lost; the 10:35 UTC run fetched 0 events.

Track per-protocol-chunk success. If any chunk fails, skip the cache
update so the next run re-fetches and retries. Risk of duplicate alerts
on retry is acceptable; missing alerts is not.
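
A minimal sketch of the "skip the cache write on any failure" pattern described above. The function name `process_chunks`, its parameters, and the `send` callback are hypothetical stand-ins; the actual `process_events` code is not shown in this PR text and may be structured differently.

```python
from typing import Callable, Dict, Optional


def process_chunks(
    chunks: Dict[str, str],
    send: Callable[[str, str], None],
    max_event_ts: int,
    last_ts: Optional[int],
) -> Optional[int]:
    """Try every chunk, then return the timestamp that should be cached."""
    all_sent = True
    for protocol, message in chunks.items():
        try:
            send(protocol, message)
        except Exception as exc:  # assumption: the real code catches TelegramError
            print(f"send failed for {protocol}: {exc}")
            all_sent = False  # keep looping so other protocols still get alerts
    # Advance the cache only when every chunk was delivered; otherwise keep
    # the old timestamp so the next run re-fetches and retries everything.
    return max_event_ts if all_sent else last_ts
```

Returning the old timestamp on a partial failure is what produces the duplicate alerts for the chunks that did succeed, which is the trade-off accepted above.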

Also include Telegram's response body in `TelegramError`. `requests`'
HTTPError text only carries the status line, but Telegram's JSON body is
where the actual reason lives ("can't parse entities", invalid
message_thread_id, etc.). Surfacing it removes a debug round-trip when
the next 400 happens.
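
As an illustration of surfacing the response body, the sketch below builds an error message from a status code and the raw body text. `describe_failure` is a hypothetical helper and `TelegramError` here is a local stand-in class; the real code presumably reads the body from the `requests` response object inside `send_telegram_message`.

```python
import json


class TelegramError(Exception):
    """Local stand-in for the project's exception type."""


def describe_failure(status_code: int, body_text: str) -> str:
    """Build an error message that carries Telegram's own explanation."""
    try:
        # Telegram error bodies are JSON objects with a "description" field.
        description = json.loads(body_text).get("description", body_text)
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: fall back to the raw text.
        description = body_text
    return f"Telegram API returned {status_code}: {description}"
```

A caller could then `raise TelegramError(describe_failure(status, body))`, so the log line contains the parse error rather than only the status line.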

The 2026-05-27 Yearn alerts are still lost from cache. Recovering them
needs a one-time cache rollback (set TIMELOCK_LAST_TS=1779830135) once
the underlying Telegram 400 root cause is fixed — handled separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@spalen0 spalen0 marked this pull request as ready for review May 28, 2026 16:04
spalen0 and others added 2 commits May 28, 2026 18:05
`operationId` on a TimelockController is `keccak(targets, values, datas,
predecessor, salt)` — purely content-derived, no contract address mixed
in. The Yearn timelock at 0x88Ba…BF73 is deployed at the same address on
every chain we monitor, so an identical payload scheduled on two of them
(plausible for cross-chain governance) would produce the same operationId.

The grouper merged both into a single operations[op_id] bucket. The alert
then reads `chain_id = op_events[0]["chainId"]` — only the first chain
gets reported, the second chain's alert is silently dropped.

Key by `f"{chainId}:{operationId}"` so each chain gets its own group.
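
The composite key can be sketched as follows. `group_operations` and the sample event dicts are hypothetical; only the `chainId` and `operationId` field names and the `f"{chainId}:{operationId}"` key format come from the commit message.

```python
from collections import defaultdict
from typing import Dict, List


def group_operations(events: List[dict]) -> Dict[str, List[dict]]:
    """Group timelock events per chain and per operation."""
    operations: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        # operationId alone is content-derived, so the same payload on two
        # chains would share a bucket; prefixing chainId keeps them apart.
        key = f"{event['chainId']}:{event['operationId']}"
        operations[key].append(event)
    return dict(operations)
```

With this key, an identical payload scheduled on chain 1 and chain 8453 yields two groups, so each chain gets its own alert.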

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ken from errors

Two related fixes uncovered by recovering the 2026-05-27 missed alerts.

1) Telegram 400 root cause: the `_` in "YEARN_TIMELOCK" was opening a
Markdown V1 italic that never closed. The parser then failed on the
first downstream code-span backtick with `Bad Request: can't parse
entities: Can't find end of the entity starting at byte offset 807`.
This silently dropped every Yearn timelock alert until the protocol
name happened to be re-escaped or the path was bypassed.

`build_alert_message` now runs both `timelock_info.protocol` and
`timelock_info.label` through `escape_markdown` before interpolating
them into the Markdown-rendered header. Confirmed: posting the captured
Mainnet and Base alerts with the escape applied returned 200 from
Telegram, where the unescaped versions returned 400.
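
For reference, a sketch of what escaping for Telegram's legacy Markdown mode can look like. The repository's `escape_markdown` body is not shown here, so `escape_markdown_v1` below is a hypothetical equivalent; the character set (`_`, `*`, `` ` ``, `[`) is the one Telegram's legacy Markdown style asks callers to backslash-escape outside of entities.

```python
import re

# Characters that open an entity in Telegram's legacy Markdown mode.
_MD_V1_SPECIALS = re.compile(r"([_*`\[])")


def escape_markdown_v1(text: str) -> str:
    """Prefix each legacy-Markdown special character with a backslash."""
    return _MD_V1_SPECIALS.sub(r"\\\1", text)
```

Applied to the protocol name from this incident, `escape_markdown_v1("YEARN_TIMELOCK")` gives `YEARN\_TIMELOCK`, so the underscore no longer opens an italic span.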

2) `requests.HTTPError.__str__()` includes the full request URL, which
for Telegram is `https://api.telegram.org/bot<TOKEN>/sendMessage`. The
previous error path re-raised that string as `TelegramError(f"...: {e}")`,
so the bot token landed in any log or downstream crash alert that
surfaced the failure. GitHub Actions auto-masks secrets in workflow
logs, but local runs (and the new error-body path) do not.

Add `_redact_bot_token` (a tiny regex sub) and apply it to every
`TelegramError` message in `send_telegram_message`. The bot token used
during this debugging cycle has already leaked into one Claude
conversation and should be rotated.
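
One way such a redaction helper might look, as a sketch only. The actual pattern used by `_redact_bot_token` is not shown in the commit, so the regex and the `<REDACTED>` placeholder below are assumptions; the sample token in the usage note is made up.

```python
import re

# Assumption: the token appears as the path segment right after "/bot",
# as in https://api.telegram.org/bot<TOKEN>/sendMessage.
_TOKEN_IN_URL = re.compile(r"(/bot)[^/\s]+")


def redact_bot_token(message: str) -> str:
    """Replace the bot token inside any Telegram API URL in the message."""
    return _TOKEN_IN_URL.sub(r"\1<REDACTED>", message)
```

For example, a message ending in `https://api.telegram.org/bot123456:ABC-def/sendMessage` comes back ending in `https://api.telegram.org/bot<REDACTED>/sendMessage`, while messages without a bot URL pass through unchanged.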

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@spalen0 spalen0 merged commit 126b974 into main May 28, 2026
2 checks passed
@spalen0 spalen0 deleted the fix/timelock-cache-on-send-failure branch May 28, 2026 16:21