Conversation

@mcoulombe (Contributor)

Problem

Every change to a Tailscale network causes the netmap - a map of which devices can connect to or know about each other - to be recalculated and distributed to all nodes on the tailnet. When a node on the tailnet tries to connect to a peer, both devices must be allowed to connect according to their netmap. Computing and distributing the updated netmap is an eventually consistent process, which means that when a node joins a tailnet there is a small delay before it can connect to its desired targets. If nodeA knows nodeB and tries to connect, but nodeB doesn't know nodeA yet, nodeA experiences what looks to it like a generic network error.

We have good reason to believe this delay and generic error response are the cause of various issues and PRs reporting instability of the GitHub Action. Under normal circumstances, the propagation delay is around ~3 seconds, which is short enough for it to complete before the workflow connects to targets in subsequent steps. However, we observed propagation spikes of up to ~33 seconds during periods of particularly high or bursty traffic. The delay could be even more significant for particularly large or dynamic tailnets. The variability of the delay and its relation to overall system pressure explain why the failures are sporadic and concentrated around the same periods.

We also observed rare but consistent flakiness when trying to download the Tailscale client:
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)

Issues and PRs that are possibly related:

Solution

When dealing with eventually consistent systems, the simplest solution is to check before acting. We added an optional targets input argument so the Tailscale GitHub Action can absorb the delay by verifying, for up to 3 minutes, that the targets are reachable before continuing.
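
For example, a consuming workflow could declare the hosts it needs so the action blocks until they respond. A minimal sketch - the targets input is the one added by this PR, while the action version, the other input names, the separator, and the hostnames are illustrative assumptions rather than values copied from the README:

      - name: Connect to tailnet
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci
          # Hosts (assumed comma-separated here) that later steps need to reach;
          # the action waits up to ~3 minutes for them to answer before continuing.
          targets: db.example.ts.net,100.101.102.103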

If this solution is not acceptable due to the potential execution delay, or because the targets are not known at the time the Tailscale action runs, we recommend using tailscale ping or other methods to verify connectivity and buffer for the variable propagation delay.
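
A minimal sketch of that manual approach in a later workflow step - TARGET is a placeholder for the host you need to reach, and the flag values simply mirror the roughly 3-minute budget used by the action:

    # Wait for the netmap to propagate: up to 36 pings with a 5s timeout each (~180s).
    # tailscale ping returns as soon as a direct connection to the peer is established.
    if ! tailscale ping --timeout=5s --c=36 "$TARGET" >/dev/null 2>&1; then
      echo "$TARGET is still unreachable after ~3 minutes" >&2
      exit 1
    fi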

A retry was also added to the curl command that downloads the Tailscale client to mitigate the premature stream closure problem.
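
For reference, curl's built-in retry options cover this; a sketch where the retry counts, version variable, and URL are illustrative rather than the exact values the action ships:

    # Retry transient download failures such as the HTTP/2 INTERNAL_ERROR above.
    curl --fail --silent --show-error --location \
         --retry 5 --retry-delay 2 --retry-all-errors \
         --output tailscale.tgz \
         "https://pkgs.tailscale.com/stable/tailscale_${TS_VERSION}_amd64.tgz"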

Methodology

We created a monitoring rig to measure the success rate from a consuming workflow's perspective. The test workflow uses the Tailscale GitHub Action to connect to a tailnet and performs a series of commands that were reported as flaky in the various GitHub issues.

Without the changes from this PR, the success rate was around 97%; with the targets input argument it is now well above 99%.

Note that a number of issues mention that connections via DNS would fail while connections via direct IP were more reliable, but in our tests using the Tailscale IP or the MagicDNS machine name made no difference.

Sample result:

📊 Tailscale Subdomain
----------------------------------------
   Total runs: 816
   Successful: 812
   Success rate: 99.5%
   Duration p50: 3450ms
   Duration p90: 310193ms
   Duration p95: 319526ms
   Duration p99: 323769ms
   Min duration: 3146ms
   Max duration: 330009ms

The 4 failures were due to connectivity only being achieved through a DERP server, which for the sake of monitoring we count as unsuccessful.

Misc

I've also fixed the README indentation, which supersedes #150.

@mcoulombe requested a review from mpminardi, September 23, 2025 16:12
action.yml Outdated
for target in "${TARGET_ARRAY[@]}"; do
  target=$(echo "$target" | xargs) # trim whitespace
  if [ -n "$target" ]; then
    if ! ${MAYBE_SUDO} tailscale ping --verbose --timeout=5s --c=36 --until-direct=false $target >/dev/null 2>&1; then
Member

Can drop --verbose if we are redirecting to /dev/null, and optionally the timeout since 5s is the default. Is there a reason we chose 36 specifically here for the number of pings, and why we are specifying false for --until-direct?

@mcoulombe (Contributor Author), Sep 23, 2025

36 is just to make the overall timeout 3 min, which feels like a reasonable upper bound based on the testing and expected restart times during deployments (5s timeout per try with 36 tries = 180s).

--until-direct=false is because connectivity through a DERP server, although maybe not ideal wrt resource usage and bandwidth, is still a successful connection from the perspective of the GH workflow. If we consider DERP connectivity unsuccessful, from my tests it'll cause ~0.2%-0.5% of executions to arbitrarily fail.

Member

Hmm, I think the impact --until-direct=false has here is that we will always do all 36 tries even if we had a successful direct connection on, say, the 5th iteration, whereas --until-direct=true would stop pinging as soon as we had a direct connection and return success.

If we're seeing additional arbitrary failures with --until-direct=true, it feels like there may be another layer of issues here beyond being able to actually ping / reach the node, one that we are coincidentally helping with the extra delay the additional pings add (e.g., in the example above the extra 31 pings are essentially acting as a sleep).

Contributor Author

I thought --until-direct=false meant that ping succeeds as soon as the target is reachable, even if only via DERP servers, whereas --until-direct=true doesn't consider a response like pong from lax-pve (100.99.0.2) via 47.149.77.162:41641 in 19ms successful.

Can you explain why the behaviour is to exhaust all tries?

Member

Here we return nil if the endpoint is non-empty and until-direct is set, otherwise we continue the loop until the retries are exhausted.

--until-direct=true actually early-returns for a response of that form (e.g., pong from lax-pve (100.99.0.2) via 47.149.77.162:41641 in 19ms), which is when we have a direct connection via UDP at that IP:port.

Pongs over DERP are of the form pong from hello (100.73.174.8) via DERP(fra) in 493ms.

Locally I get the following with --until-direct=true:

tailscale ping --verbose --timeout=5s --c=10 --until-direct=true hello

2025/09/23 13:18:03 lookup "hello" => "100.73.174.8"
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 153ms

and with --until-direct=false:

tailscale ping --verbose --timeout=5s --c=10 --until-direct=false hello

2025/09/23 13:18:08 lookup "hello" => "100.73.174.8"
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 154ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 154ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 151ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 156ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 153ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 157ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 157ms

@mcoulombe (Contributor Author), Sep 23, 2025

Kk, I removed the --until-direct=false flag. I misunderstood the doc.

Is there something we can or should use to tell tailscale ping "it is ok if the connection is over DERP"? For example, in this run, I think that if we bust the timeout for direct connectivity but the peer was reachable via DERP, the connectivity step should still succeed.

Member

Ah rats, yeah I think we would need --until-direct=false to get that behaviour with the current CLI options / behaviour.

Taking a step back: I'm curious if tailscale ping is giving us the signal we want here, or if the reduction in failures is more a side effect of the ~40 seconds of sleep after tailscale up that this adds with --c=36 and --until-direct=false.

E.g., it might be good to get stats (if we don't already) on how many failures we have where tailscale ping with a shorter amount for --c (maybe something like 5) and with --until-direct=false causes the ping to fail with no reply, or if we typically have success with tailscale ping and other failures down the line in this scenario.

Contributor Author

"I'm curious if tailscale ping is giving us the signal we want here"

--until-direct=false was only added yesterday. The bulk of the monitoring ran without the flag, and the stats I remember from Monday afternoon were ~600 runs with 1 failure due to a Tailscale client download error (mitigated by the curl retry) and 3 failures because the targets were only reachable via DERP (which I thought --until-direct=false would mitigate). So checking connectivity with ping did improve the success rate.

"on how many failures we have where tailscale ping with a shorter amount for --c (maybe something like 5) and with --until-direct=false causes the ping to fail with no reply"

I don't think we should use --until-direct=false now that I understand better how it behaves. It becomes no better than a hardcoded sleep; the connectivity check should return as soon as possible so it does not artificially slow down workflows any more than it needs to.

I'll let the monitoring run without --until-direct=false, and if it regularly causes runs to fail we should investigate why connectivity can sporadically only be established via DERP. If direct connectivity cannot be established, I think there should be an option to let ping succeed when the pong comes via DERP.

Member

Cool, that all sounds good to me, thank you for the detail!

@mcoulombe force-pushed the max/test-target-connectivity-check branch 2 times, most recently from 59d314f to 0ff8f4a, September 23, 2025 18:55
@mcoulombe force-pushed the max/test-target-connectivity-check branch from 0ff8f4a to cab75fd, September 23, 2025 19:41
@mcoulombe force-pushed the max/test-target-connectivity-check branch from cab75fd to c59af0e, September 25, 2025 13:33
Update action.yml

Co-authored-by: Mario Minardi <mario@tailscale.com>
@mcoulombe force-pushed the max/test-target-connectivity-check branch from a0d3585 to 3eef1cc, September 25, 2025 14:25
Co-authored-by: Mario Minardi <mario@tailscale.com>
@mcoulombe merged commit 6cae46e into main, Sep 25, 2025
9 checks passed
@mcoulombe deleted the max/test-target-connectivity-check branch, September 25, 2025 15:43
@mcoulombe mentioned this pull request, Sep 25, 2025