Skip to content

Retry after route rediscovery on NWK_NO_ROUTE in send_packet#273

Merged
puddly merged 6 commits into
zigpy:devfrom
TheJulianJES:tjj/fix-route-rediscovery-recovery
May 31, 2026
Merged

Retry after route rediscovery on NWK_NO_ROUTE in send_packet#273
puddly merged 6 commits into
zigpy:devfrom
TheJulianJES:tjj/fix-route-rediscovery-recovery

Conversation

@TheJulianJES
Copy link
Copy Markdown
Contributor

@TheJulianJES TheJulianJES commented May 30, 2026

When a request fails with NWK_NO_ROUTE, send_packet rediscovered the route but then re-raised, relying on zigpy's application-level retry loop to actually re-send. zigpy 1.5.0 (zigpy/zigpy#1824) moved retries out of ControllerApplication.request, so a directly-issued request was no longer retried and route rediscovery never took effect.

Retry the send once within send_packet's existing recovery loop after rediscovering the route, matching the MAC_TRANSACTION_EXPIRED recovery path which already does this. This makes send_packet self-recovering regardless of caller.

The new NWK_NO_ROUTE retry path could combine with the existing MAC_TRANSACTION_EXPIRED recovery to consume both send attempts via continue without ever raising: a first-attempt NWK_NO_ROUTE followed by a first-time MAC_TRANSACTION_EXPIRED on the final attempt left the loop with succeeded=False and no exception, so an undelivered packet was reported as a successful send. The second commit adds a for/else safety net that re-raises the last error whenever the recovery loop completes without a successful send, along with a regression test for that fall-through (thanks to the Copilot review for catching this).

Requires #272 and #274 for tests to pass.

Behavioral analysis: how the two recovery paths interact across versions

send_packet's recovery loop is for retry_attempt in range(2) — a 2-attempt budget. The real change is who consumes that budget:

MAC_TRANSACTION_EXPIRED path NWK_NO_ROUTE path
Before this PR _discover_route + continue (consumes an attempt) _discover_route + raise (consumes 0 internal attempts; relied on zigpy's app-level retry loop to re-run send_packet)
Now same continue _discover_route + continue (now also consumes an attempt), guarded to retry_attempt < 1

Before the PR, only MAC recovery used the internal budget; NWK recovery bailed out and leaned on zigpy's external ControllerApplication.request retry loop (default 3 attempts). zigpy 1.5.0 deleted that external loop, so both recovery paths now share the single 2-attempt internal budget.

Empirical results (post-PR), for the two cross-order cases:

MAC → NWK : raises DeliveryError(NWK_NO_ROUTE)            route_disc=1  assoc_remove=1  assoc_add=1
NWK → MAC : raises DeliveryError(MAC_TRANSACTION_EXPIRED) route_disc=2  assoc_remove=1  assoc_add=1
  • MAC then NWK — attempt 0 runs full MAC recovery (assoc remove + route discovery + continue); attempt 1 hits NWK_NO_ROUTE but is the final attempt, so the retry_attempt < 1 guard suppresses route rediscovery and it raises. Correct DeliveryError; association re-added.
  • NWK then MAC — attempt 0 rediscovers route + continue; attempt 1 runs full MAC recovery + continue; loop exhausts → for/else raises. Correct DeliveryError; association re-added. (This is the case that silently succeeded before the safety net.)

Both orderings terminate correctly and the association table stays balanced (remove == add).

Genuine differences vs. before the PR (all benign):

  1. NWK_NO_ROUTE on the final internal attempt no longer triggers route rediscovery (the retry_attempt < 1 guard). Negligible — route discovery only helps a subsequent send, and the next Device.request-level retry re-enters send_packet at attempt 0 and rediscovers anyway.
  2. MAC and NWK recovery now compete for the same 2 attempts; a mixed sequence exhausts the budget after two errors and raises.
  3. NWK → MAC now does an assoc-remove/re-add inside a call that can't retry afterward (slightly wasteful churn). Harmless — the re-add happens in finally, so the table stays consistent.

Real traffic is not worse off: normal unicast goes through Device.request, which in zigpy 1.5.x still retries (3 attempts), each invoking send_packet's 2-attempt loop — so total send attempts are higher, not lower, and a pure NWK_NO_ROUTE now self-recovers within a single send_packet call. Pure MAC_TRANSACTION_EXPIRED recovery is unchanged. The only callers without the external retry are direct app.request users (broadcasts, some internal flows), which generally don't carry a single target device/association and so don't hit this recovery.

When a request fails with NWK_NO_ROUTE, send_packet rediscovered the
route but then re-raised, relying on zigpy's application-level retry
loop to actually re-send. zigpy 1.5.0 (#1824) moved retries out of
ControllerApplication.request, so a directly-issued request was no
longer retried and route rediscovery never took effect.

Retry the send once within send_packet's existing recovery loop after
rediscovering the route, matching the MAC_TRANSACTION_EXPIRED recovery
path which already does this. This makes send_packet self-recovering
regardless of caller.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates send_packet recovery so NWK_NO_ROUTE failures trigger route rediscovery and an immediate resend inside zigpy-znp, rather than relying on zigpy application-level retries.

Changes:

  • Renames the retry loop variable so the attempt number can be used.
  • Updates comments to include route rediscovery as a reason for the second send attempt.
  • Adds a NWK_NO_ROUTE recovery path that rediscovers the route and retries once.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread zigpy_znp/zigbee/application.py
The new NWK_NO_ROUTE retry path could combine with the existing
MAC_TRANSACTION_EXPIRED recovery to consume both send attempts via
`continue` without ever raising: a first-attempt NWK_NO_ROUTE followed
by a first-time MAC_TRANSACTION_EXPIRED on the final attempt left the
loop with succeeded=False and no exception, so an undelivered packet
was reported as a successful send.

Add a for/else safety net that re-raises the last error whenever the
loop completes without a successful break. Add a regression test for
the NWK_NO_ROUTE -> MAC_TRANSACTION_EXPIRED fall-through.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

The scenario is a single deterministic config, so exact counts are
stronger than `>= 1`: they pin route discovery to twice (NWK path plus
MAC recovery) and verify the association is removed exactly once and
re-added exactly once, catching an unbalanced remove/add that `>= 1`
would miss.
Keep the comment focused on what the code does, not on zigpy's
historical retry behavior.
@puddly
Copy link
Copy Markdown
Collaborator

puddly commented May 31, 2026

I wonder if attaching through something like retry_attempt to the ZigbeePacket object itself could help make this somewhat less stateful?

@codecov
Copy link
Copy Markdown

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.49%. Comparing base (faa06d7) to head (5b184f2).
⚠️ Report is 2 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #273   +/-   ##
=======================================
  Coverage   98.49%   98.49%           
=======================================
  Files          43       43           
  Lines        3579     3583    +4     
=======================================
+ Hits         3525     3529    +4     
  Misses         54       54           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@puddly
Copy link
Copy Markdown
Collaborator

puddly commented May 31, 2026

We could also think about intentionally splitting up DeliveryFailure into SendFailure (for early routing and congestion errors) and DeliveryFailure (for late routing errors, no ACK, etc.). Maybe zigpy could absorb some of the library-specific retry logic for packet sending?

@TheJulianJES
Copy link
Copy Markdown
Contributor Author

Yeah, I agree. This solution isn't great. Though I'd consider temporarily releasing a version with this, so we can unbreak HA before a b1 releases with the ZHA bump.

Zigpy does pass force_route_discovery=True if attempt > 0 already, which causes ControllerApplication.request to just change the tx_options for the ZigbeePacket: https://github.com/zigpy/zigpy/blob/e891a9f53fd00918a8eda3879be4dd0414ea2755/zigpy/device.py#L685 + https://github.com/zigpy/zigpy/blob/e891a9f53fd00918a8eda3879be4dd0414ea2755/zigpy/application.py#L1087-L1088. That part is already in zigpy itself, but only used by bellows. But yeah, I think it would be good to move this logic to zigpy if that's possible to do in a nice way.

@puddly puddly merged commit 2ca6864 into zigpy:dev May 31, 2026
12 checks passed
@TheJulianJES
Copy link
Copy Markdown
Contributor Author

Should we also bump the zigpy requirement to 1.5.0 or 1.5.1 in another PR before releasing a new version? Technically, the fixes all still work with older versions, only the adjusted tests require it.

@puddly
Copy link
Copy Markdown
Collaborator

puddly commented May 31, 2026

Sure, it won't hurt.

@TheJulianJES TheJulianJES mentioned this pull request May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants