Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

magicsock: if STUN failed to send before, rebind before STUNning again. #3014

Merged
merged 1 commit into from
Oct 7, 2021

Conversation

apenwarr
Copy link
Member

@apenwarr apenwarr commented Oct 7, 2021

On iOS (and possibly other platforms), sometimes our UDP socket would
get stuck in a state where it was bound to an invalid interface (or no
interface) after a network reconfiguration. We can detect this by
actually checking the error codes from sending our STUN packets.

If we completely fail to send any STUN packets, we know something is
very broken. So on the next STUN attempt, let's rebind the UDP socket
to try to correct any problems.

This fixes a problem where iOS would sometimes get stuck using DERP
instead of direct connections until the backend was restarted.

Fixes #2994

Signed-off-by: Avery Pennarun apenwarr@tailscale.com

@apenwarr apenwarr force-pushed the apenwarr/stunbind branch 3 times, most recently from cfd952f to d530e6d Compare October 7, 2021 01:19
Copy link
Member

@crawshaw crawshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right, worth having @danderson look.

wgengine/magicsock/magicsock.go Outdated Show resolved Hide resolved
IPv6 bool // IPv6 works (send & receive)
IPv6CanSend bool // IPv6 works (send)
IPv4 bool // IPv4 works (send & receive)
IPv4CanSend bool // IPv4 works (send)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarify whether these new CanSend fields are any vs all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically all four of them are "any" and not "all".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated for more verbosity.

On iOS (and possibly other platforms), sometimes our UDP socket would
get stuck in a state where it was bound to an invalid interface (or no
interface) after a network reconfiguration. We can detect this by
actually checking the error codes from sending our STUN packets.

If we completely fail to send any STUN packets, we know something is
very broken. So on the next STUN attempt, let's rebind the UDP socket
to try to correct any problems.

This fixes a problem where iOS would sometimes get stuck using DERP
instead of direct connections until the backend was restarted.

Fixes #2994

Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
@apenwarr apenwarr merged commit 0d4a0bf into main Oct 7, 2021
@apenwarr apenwarr deleted the apenwarr/stunbind branch October 7, 2021 17:17
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
call in
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
at the end of the test, waiting for a STUN response which
will never arrive.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test, waiting for a STUN
response which will never arrive.

This causes a test failure due to the resource leak. For
whatever reason, it mostly fails with Windows.

The WG scaffolding to call bind.Close() is not present in this
these tests, so ensure it gets called explicitly.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test waiting for a STUN
response which will never arrive.

This causes a test flake due to the resource leak in those
cases where the Conn decided to rebind. For whatever reason,
it mostly flakes with Windows.

If the Conn is closed, don't Rebind after a send error.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 16, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test waiting for a STUN
response which will never arrive.

This causes a test flake due to the resource leak in those
cases where the Conn decided to rebind. For whatever reason,
it mostly flakes with Windows.

If the Conn is closed, don't Rebind after a send error.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
DentonGentry added a commit that referenced this pull request Oct 17, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test waiting for a STUN
response which will never arrive.

This causes a test flake due to the resource leak in those
cases where the Conn decided to rebind. For whatever reason,
it mostly flakes with Windows.

If the Conn is closed, don't Rebind after a send error.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
bradfitz pushed a commit that referenced this pull request Oct 19, 2021
#3014 added a
rebind on STUN failure, which means there can now be a
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).ReadFromNetaddr
in progress at the end of the test waiting for a STUN
response which will never arrive.

This causes a test flake due to the resource leak in those
cases where the Conn decided to rebind. For whatever reason,
it mostly flakes with Windows.

If the Conn is closed, don't Rebind after a send error.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
(cherry picked from commit def650b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

iOS: gets stuck using DERP instead of direct connect in some cases
3 participants