fix(timelock): don't advance cache when Telegram send fails #250
Merged
`process_events` caught Telegram send errors so the loop would continue,
but then unconditionally advanced `TIMELOCK_LAST_TS` to the max event
timestamp. The next run saw no new events past that timestamp and the
failed alerts were dropped forever.
This bit us on 2026-05-28 09:39 UTC: after a 15h workflow outage caused
by Morpho's API rename (which aborted the hourly bash loop before
timelock_alerts could run), the recovery run picked up 6 backlogged
TimelockEvent rows, built the alerts, hit a Telegram 400 Bad Request on
the YEARN_TIMELOCK chunk, logged the failure — and still advanced the
cache. The Yearn timelock alerts for 2026-05-27 19:55 (Mainnet) and
20:04 (Base) were lost; the 10:35 UTC run fetched 0 events.
Track per-protocol-chunk success. If any chunk fails, skip the cache
update so the next run re-fetches and retries. Risk of duplicate alerts
on retry is acceptable; missing alerts is not.
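
A minimal sketch of the per-chunk tracking described above, not the actual `process_events` code. `SendError`, `next_cache_timestamp`, the `send` callback and the `timestamp` field are hypothetical names chosen for illustration.

```python
from typing import Callable


class SendError(Exception):
    """Hypothetical stand-in for a failed Telegram send."""


def next_cache_timestamp(
    chunks: dict[str, list[dict]],
    send: Callable[[str, list[dict]], None],
    last_ts: int,
) -> int:
    """Try every chunk, then decide whether the cache may advance."""
    all_sent = True
    max_ts = last_ts
    for protocol, events in chunks.items():
        max_ts = max([max_ts, *(e["timestamp"] for e in events)])
        try:
            send(protocol, events)
        except SendError as exc:
            # Keep going so the other protocols still get their alerts.
            all_sent = False
            print(f"send failed for {protocol}: {exc}")
    # Any failure leaves the cache untouched, so the next run re-fetches.
    return max_ts if all_sent else last_ts
```

Returning the old timestamp on partial failure is what trades duplicate alerts for no lost alerts: chunks that succeeded are re-sent on the next run along with the ones that failed.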
Also include Telegram's response body in `TelegramError`. `requests`'
HTTPError text carries only the status line and request URL, but
Telegram's JSON body is where the actual reason lives ("can't parse
entities", invalid
message_thread_id, etc.). Surfacing it removes a debug round-trip when
the next 400 happens.
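
One way the response body could be folded into the error text, sketched with the standard library only. `describe_telegram_failure` is a hypothetical helper, not the function in this PR; the real code would presumably pass its result to `TelegramError`.

```python
import json


def describe_telegram_failure(status_code: int, body_text: str) -> str:
    """Build an error message that includes Telegram's own explanation."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None  # body was not JSON (e.g. a proxy error page)
    if isinstance(payload, dict) and payload.get("description"):
        detail = payload["description"]
    else:
        # Fall back to a bounded slice of the raw body.
        detail = body_text.strip()[:200] or "<empty body>"
    return f"Telegram API returned {status_code}: {detail}"
```

With the response body captured during this incident, the message ends in `Bad Request: can't parse entities: ...` instead of stopping at the status line.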
The 2026-05-27 Yearn alerts are still lost from cache. Recovering them
needs a one-time cache rollback (set TIMELOCK_LAST_TS=1779830135) once
the underlying Telegram 400 root cause is fixed — handled separately.
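
As a sanity check on the rollback value, the snippet below converts it to UTC and compares it with the first lost alert. It assumes the 19:55 alert time quoted above is UTC, which the message does not state explicitly.

```python
from datetime import datetime, timezone

ROLLBACK_TS = 1779830135  # value quoted in the commit message

rollback_at = datetime.fromtimestamp(ROLLBACK_TS, tz=timezone.utc)
# Assumption: the 2026-05-27 19:55 Mainnet alert time is in UTC.
first_lost_alert = datetime(2026, 5, 27, 19, 55, tzinfo=timezone.utc)

print(rollback_at.isoformat())         # 2026-05-26T21:15:35+00:00
print(rollback_at < first_lost_alert)  # True
```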
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`operationId` on a TimelockController is `keccak(targets, values, datas,
predecessor, salt)` — purely content-derived, no contract address mixed
in. The Yearn timelock at 0x88Ba…BF73 is deployed at the same address on
every chain we monitor, so an identical payload scheduled on two of them
(plausible for cross-chain governance) would produce the same operationId.
The grouper merged both into a single operations[op_id] bucket. The alert
then reads `chain_id = op_events[0]["chainId"]` — only the first chain
gets reported, the second chain's alert is silently dropped.
Key by `f"{chainId}:{operationId}"` so each chain gets its own group.
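
A small sketch of the composite key, assuming each event is a dict with `chainId` and `operationId` fields. `group_operations` is a hypothetical name, not the grouper in this repository.

```python
from collections import defaultdict


def group_operations(events: list[dict]) -> dict[str, list[dict]]:
    """Group events per chain and operation instead of per operation only."""
    operations: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        # Prefixing with the chain id keeps identical payloads on
        # different chains in separate buckets.
        key = f"{event['chainId']}:{event['operationId']}"
        operations[key].append(event)
    return dict(operations)
```

With this key, the same payload scheduled on chain 1 and chain 8453 yields two groups, so each chain gets its own alert.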
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ken from errors

Two related fixes uncovered by recovering the 2026-05-27 missed alerts.

1) Telegram 400 root cause: the `_` in "YEARN_TIMELOCK" was opening a
Markdown V1 italic that never closed. The parser then failed on the
first downstream code-span backtick with `Bad Request: can't parse
entities: Can't find end of the entity starting at byte offset 807`.
This silently dropped every Yearn timelock alert until the protocol
name happened to be re-escaped or the path was bypassed.

`build_alert_message` now runs both `timelock_info.protocol` and
`timelock_info.label` through `escape_markdown` before interpolating
them into the Markdown-rendered header. Confirmed: posting the captured
Mainnet and Base alerts with the escape applied returned 200 from
Telegram, where the unescaped versions returned 400.

2) `requests.HTTPError.__str__()` includes the full request URL, which
for Telegram is `https://api.telegram.org/bot<TOKEN>/sendMessage`. The
previous error path re-raised that string as
`TelegramError(f"...: {e}")`, so the bot token landed in any log or
downstream crash alert that surfaced the failure. GitHub Actions
auto-masks secrets in workflow logs, but local runs (and the new
error-body path) do not.

Add `_redact_bot_token` (a tiny regex sub) and apply it to every
`TelegramError` message in `send_telegram_message`.

The bot token used during this debugging cycle has already leaked into
one Claude conversation and should be rotated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
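
Both fixes come down to a small regex substitution. The sketch below shows one plausible shape for each. `escape_markdown_v1` and `redact_bot_token` are illustrative names, and the patterns are assumptions rather than the ones used by the repository's `escape_markdown` and `_redact_bot_token`.

```python
import re

# Characters Telegram's legacy Markdown mode treats as entity markers.
_MARKDOWN_V1_SPECIAL = re.compile(r"([_*`\[])")
# Matches the token segment of a Telegram Bot API URL.
_BOT_TOKEN_IN_URL = re.compile(r"(api\.telegram\.org/bot)[^/\s]+")


def escape_markdown_v1(text: str) -> str:
    """Backslash-escape characters that would open a Markdown entity."""
    return _MARKDOWN_V1_SPECIAL.sub(r"\\\1", text)


def redact_bot_token(message: str) -> str:
    """Replace the bot token in any Telegram API URL inside message."""
    return _BOT_TOKEN_IN_URL.sub(r"\1<REDACTED>", message)
```

Applied to `YEARN_TIMELOCK`, the escape inserts a backslash before the underscore, so the parser no longer sees an italic opener. The redaction only touches URLs on `api.telegram.org`, so unrelated text containing `bot` is left alone.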
Summary
- Skip the `TIMELOCK_LAST_TS` update on any Telegram chunk failure so failed events get re-fetched and retried, instead of being silently lost.
- Include Telegram's response body (`description`) in `TelegramError` so the next 400 is debuggable from the log alone.
- Group events by `(chainId, operationId)` so cross-chain identical payloads don't collide into a single alert.
- Escape `timelock_info.protocol` and `timelock_info.label` in `build_alert_message`. The `_` in `YEARN_TIMELOCK` was opening a Markdown V1 italic that never closed; Telegram's parser then failed on the first downstream code-span backtick with `can't parse entities: Can't find end of the entity starting at byte offset 807`. This is the root cause of the 09:39 UTC 400.
- Redact the bot token from every `TelegramError` message. `requests.HTTPError.__str__()` puts the full URL (including `bot<TOKEN>`) into the exception string; the previous code re-raised that into `TelegramError` and into crash alerts. GH Actions masks secrets in workflow logs, but local runs and the new error-body path do not.

Timeline of the outage
- Morpho's `Market.uniqueKey` → `marketId` API rename made `morpho/markets.py` exit non-zero. `bash -eo pipefail` aborted the hourly loop before `timelock_alerts.py` ran.
- The recovery run hit the Telegram 400 and still advanced the cache to `TIMELOCK_LAST_TS=1779958199`, dropping the failed alerts from the cache window forever.
- Re-ran with `--no-cache --since-seconds 172800 --protocol YEARN_TIMELOCK`. Same 400. Captured response body: `{"ok":false,"error_code":400,"description":"Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 807"}`. Byte 807 was the closing backtick before the AI summary; the unclosed italic was the `_` in `YEARN_TIMELOCK` at byte 63.
- With the escape applied, Mainnet `0x5dac358a…` (msg 1562) and Base `0xe8c04f74…` (msg 1563) both landed (status 200) in the YEARN_TIMELOCK channel.

Why each change
- Tracking `all_sent` and skipping the cache write when anything failed makes the retry loop the system's recovery, at the cost of duplicate alerts for successful protocols on the next run after a partial failure. Worth it.
- Response body in `TelegramError`. Without this, the original 09:39 UTC log line was only `400 Client Error for url: …`, which is useless for diagnosis. With this, the `description` field (`can't parse entities: …`) reaches the log.
- `(chainId, operationId)` key. `operationId` is `keccak(targets, values, datas, predecessor, salt)`, with no contract address mixed in. The Yearn timelock has the same address on every chain we monitor; a cross-chain identical payload would collide, the grouper would merge events from different chains into one bucket, and only the first chain's alert would fire. This hasn't bitten production yet, but it was waiting to.
- `escape_markdown` for protocol/label. Direct fix for the underlying 400. Both fields are config-supplied; future additions containing any of `_`, `*`, `[`, or a backtick would otherwise silently break alerts the same way.
- Bot token redaction. `TELEGRAM_BOT_TOKEN_DEFAULT` should be rotated independently of this PR; the redaction prevents future leaks.

Still TODO (not in this PR)
- `format_explanation_line`'s output is interpolated into the Markdown message unescaped. The LLM has not emitted `_` or `*` in the texts I observed, but it could. Either escape on the way out or switch the AI lines to plain code blocks.

Test plan
- `uv run ruff check .` clean
- `uv run ruff format .` clean
- `uv run pytest tests/`: 407 passed, 4 skipped (pre-existing)
- Unescaped payload returned 400 with the `can't parse entities` body; posted the same payload with `_` escaped → 200, message landed in the channel.

🤖 Generated with Claude Code