Conversation

@mcoulombe (Contributor)

Problem

Every change to a Tailscale network causes the netmap - a map of which devices can connect to or know about each other - to be recalculated and distributed to all nodes on the tailnet. When a node on the tailnet tries to connect to a peer, both devices must be allowed to connect according to their netmap. Computing and distributing the updated netmap is an eventually consistent process, which means that when a node joins a tailnet there is a small delay before it can connect to its desired targets. If nodeA knows nodeB and tries to connect, but nodeB doesn't know nodeA yet, nodeA experiences what looks to it like a generic network error.

We have good reason to believe this delay and generic error response are the cause of various issues and PRs reporting instability of the GitHub Action. Under normal circumstances, the propagation delay is around ~3 seconds, which is short enough for it to complete before the workflow connects to targets in subsequent steps. However, we observed propagation spikes of up to ~33 seconds during periods of particularly high or bursty traffic. The delay could be even more significant for particularly large or dynamic tailnets. The variability of the delay and its relation to overall system pressure explain why the failures are sporadic and concentrated around the same periods.

We also observed rare but consistent flakiness when trying to download the Tailscale client:
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)

Issues and PRs that are possibly related:

Solution

When dealing with eventually consistent systems, the simplest solution is to check before acting. We added an optional targets input argument so the Tailscale GitHub Action can absorb the delay by verifying, for up to 3 minutes, that the targets are reachable before continuing.
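
For example, a consuming workflow could declare the hosts it needs so the action blocks until they respond. A minimal sketch - the targets input is the one added by this PR, while the action version, the other input names, the separator, and the hostnames are illustrative assumptions rather than values copied from the README:

      - name: Connect to tailnet
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci
          # Hosts (assumed comma-separated here) that later steps need to reach;
          # the action waits up to ~3 minutes for them to answer before continuing.
          targets: db.example.ts.net,100.101.102.103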

If this solution is not acceptable due to the potential execution delay, or because the targets are not known at the time the Tailscale action runs, we recommend using tailscale ping or other methods to verify connectivity and buffer for the variable propagation delay.
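
A minimal sketch of that manual approach in a later workflow step - TARGET is a placeholder for the host you need to reach, and the flag values simply mirror the roughly 3-minute budget used by the action:

    # Wait for the netmap to propagate: up to 36 pings with a 5s timeout each (~180s).
    # tailscale ping returns as soon as a direct connection to the peer is established.
    if ! tailscale ping --timeout=5s --c=36 "$TARGET" >/dev/null 2>&1; then
      echo "$TARGET is still unreachable after ~3 minutes" >&2
      exit 1
    fi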

A retry was also added to the curl command that downloads the Tailscale client to mitigate the premature stream closure problem.
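
For reference, curl's built-in retry options cover this; a sketch where the retry counts, version variable, and URL are illustrative rather than the exact values the action ships:

    # Retry transient download failures such as the HTTP/2 INTERNAL_ERROR above.
    curl --fail --silent --show-error --location \
         --retry 5 --retry-delay 2 --retry-all-errors \
         --output tailscale.tgz \
         "https://pkgs.tailscale.com/stable/tailscale_${TS_VERSION}_amd64.tgz"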

Methodology

We created a monitoring rig to measure the success rate from a consuming workflow's perspective. The test workflow uses the Tailscale GitHub Action to connect to a tailnet and performs a series of commands that were reported as flaky in the various GitHub issues.

Without the changes from this PR, the success rate was around 97%; with the targets input argument it is now well above 99%.

Note that a number of issues mention that connections via DNS would fail while connections via direct IP were more reliable, but in our tests using the Tailscale IP or the MagicDNS machine name made no difference.

Sample result:

📊 Tailscale Subdomain
----------------------------------------
   Total runs: 816
   Successful: 812
   Success rate: 99.5%
   Duration p50: 3450ms
   Duration p90: 310193ms
   Duration p95: 319526ms
   Duration p99: 323769ms
   Min duration: 3146ms
   Max duration: 330009ms

The 4 failures were due to connectivity only being achieved through a DERP server, which for the sake of monitoring we count as unsuccessful.

Misc

I've also fixed the README indentation, which supersedes #150.

@mcoulombe requested a review from mpminardi, September 23, 2025 16:12
action.yml Outdated
for target in "${TARGET_ARRAY[@]}"; do
  target=$(echo "$target" | xargs) # trim whitespace
  if [ -n "$target" ]; then
    if ! ${MAYBE_SUDO} tailscale ping --verbose --timeout=5s --c=36 --until-direct=false $target >/dev/null 2>&1; then
Member

Can drop --verbose if we are redirecting to /dev/null, and optionally the timeout since 5s is the default. Is there a reason we chose 36 specifically here for the number of pings, and why we are specifying false for --until-direct?

@mcoulombe (Contributor Author), Sep 23, 2025

36 is just to make the overall timeout 3 min, which feels like a reasonable upper bound based on the testing and expected restart times during deployments (5s timeout per try with 36 tries = 180s).

--until-direct=false is because connectivity through a DERP server, although maybe not ideal wrt resource usage and bandwidth, is still a successful connection from the perspective of the GH workflow. If we consider DERP connectivity unsuccessful, from my tests it'll cause ~0.2%-0.5% of executions to arbitrarily fail.

Member

Hmm, I think the impact --until-direct=false has here is that we will always do all 36 tries even if we had a successful direct connection on, say, the 5th iteration, whereas --until-direct=true would stop pinging as soon as we had a direct connection and return success.

If we're seeing additional arbitrary failures with --until-direct=true, it feels like there may be another layer of issues here beyond being able to actually ping / reach the node, one that we are coincidentally helping with the extra delay the additional pings add (e.g., in the example above the extra 31 pings are essentially acting as a sleep).

Contributor Author

I thought --until-direct=false meant that ping succeeds as soon as the target is reachable, even if only via DERP servers, whereas --until-direct=true doesn't consider a response like pong from lax-pve (100.99.0.2) via 47.149.77.162:41641 in 19ms successful.

Can you explain why the behaviour is to exhaust all tries?

Member

Here we return nil if the endpoint is non-empty and until-direct is set, otherwise we continue the loop until the retries are exhausted.

--until-direct=true actually early-returns for a response of that form (e.g., pong from lax-pve (100.99.0.2) via 47.149.77.162:41641 in 19ms), which is when we have a direct connection via UDP at that IP:port.

Pongs over DERP are of the form pong from hello (100.73.174.8) via DERP(fra) in 493ms.

Locally I get the following with --until-direct=true:

tailscale ping --verbose --timeout=5s --c=10 --until-direct=true hello

2025/09/23 13:18:03 lookup "hello" => "100.73.174.8"
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 153ms

and with --until-direct=false:

tailscale ping --verbose --timeout=5s --c=10 --until-direct=false hello

2025/09/23 13:18:08 lookup "hello" => "100.73.174.8"
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 154ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 154ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 151ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 156ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 153ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 157ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 152ms
pong from hello (100.73.174.8) via 3.126.85.103:52294 in 157ms

@mcoulombe (Contributor Author), Sep 23, 2025

Kk, I removed the --until-direct=false flag. I misunderstood the doc.

Is there something we can or should use to tell tailscale ping "it is ok if the connection is over DERP"? For example, in this run, I think that if we bust the timeout for direct connectivity but the peer was reachable via DERP, the connectivity step should still succeed.

Member

Ah rats, yeah I think we would need --until-direct=false to get that behaviour with the current CLI options / behaviour.

Taking a step back: I'm curious if tailscale ping is giving us the signal we want here, or if the reduction in failures is more a side effect of the ~40 seconds of sleep after tailscale up that this adds with --c=36 and --until-direct=false.

E.g., it might be good to get stats (if we don't already) on how many failures we have where tailscale ping with a shorter amount for --c (maybe something like 5) and with --until-direct=false causes the ping to fail with no reply, or if we typically have success with tailscale ping and other failures down the line in this scenario.

Contributor Author

"I'm curious if tailscale ping is giving us the signal we want here"

--until-direct=false was only added yesterday. The bulk of the monitoring ran without the flag, and the stats I remember from Monday afternoon were ~600 runs with 1 failure due to a Tailscale client download error (mitigated by the curl retry) and 3 failures because the targets were only reachable via DERP (which I thought --until-direct=false would mitigate). So checking connectivity with ping did improve the success rate.

"on how many failures we have where tailscale ping with a shorter amount for --c (maybe something like 5) and with --until-direct=false causes the ping to fail with no reply"

I don't think we should use --until-direct=false now that I understand better how it behaves. It becomes no better than a hardcoded sleep; the connectivity check should return as soon as possible so it does not artificially slow down workflows any more than it needs to.

I'll let the monitoring run without --until-direct=false, and if it regularly causes runs to fail we should investigate why connectivity can sporadically only be established via DERP. If direct connectivity cannot be established, I think there should be an option to let ping succeed when the pong comes via DERP.

Member

Cool, that all sounds good to me, thank you for the detail!

@mcoulombe force-pushed the max/test-target-connectivity-check branch 2 times, most recently from 59d314f to 0ff8f4a, September 23, 2025 18:55
@mcoulombe force-pushed the max/test-target-connectivity-check branch from 0ff8f4a to cab75fd, September 23, 2025 19:41
@mcoulombe force-pushed the max/test-target-connectivity-check branch from cab75fd to c59af0e, September 25, 2025 13:33
Update action.yml

Co-authored-by: Mario Minardi <mario@tailscale.com>
@mcoulombe force-pushed the max/test-target-connectivity-check branch from a0d3585 to 3eef1cc, September 25, 2025 14:25
Co-authored-by: Mario Minardi <mario@tailscale.com>
@mcoulombe merged commit 6cae46e into main, Sep 25, 2025
9 checks passed
@mcoulombe deleted the max/test-target-connectivity-check branch, September 25, 2025 15:43
@mcoulombe mentioned this pull request, Sep 25, 2025