stcpr transport cannot be established after being down #806

Closed
i-hate-nicknames opened this issue Jun 9, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@i-hate-nicknames
Contributor

Describe the bug
This happened when running the visor and vpn-server on a separate machine from the vpn-client.

Environment information:
client:
Linux 4.19.0-13-amd64 x86_64

server:
Linux 5.12.4-arch1-2 x86_64

Steps to Reproduce

  1. Start the vpn client. The client repeatedly logs an error that no transport was found:
[2021-06-09T16:49:24+03:00] INFO [proc:vpn-client:ac1ffcfbcf5d45c7ade1404b70761093]: Request processed. _elapsed="636.214461ms" _method="Dial" _received="4:49PM" error="route finder: transport not found" input=029081ab2f1a9087f7e59b3b58ca2959530335f861725d31aac1b5a3aff150f832:44 output=&{ConnID:0 LocalPort:0}
  2. Start the vpn server and add the transport manually (there was no entry before that). The vpn works.

  3. Stop the client manually and start it again. The same error as in step 1 appears: no transport found (ls-tp shows the transport, but it is not in the up state).

  4. Calling add-tp returns the same transport entry (same ID too) with an error that the transport is established but not up.

[2021-06-09T16:48:36+03:00] FATAL [skywire-cli]: Established stcpr transport to 029081ab2f1a9087f7e59b3b58ca2959530335f861725d31aac1b5a3aff150f832 with ID 7f6c3a10-855a-051f-885d-6eeeb007b0e9, but it isn't up

Calling rm-tp and then add-tp shows the same error. Restarting the visor with the vpn client doesn't help, with or without removing the transport.

  5. However, the following sequence works:
  • Remove transport
  • Stop vpn server
  • Start vpn server
  • Add transport

Actual behavior

  1. The transport is not up upon visor start.
  2. Removing the transport and establishing it again doesn't help, so the client has no way to fix the situation from its side.

Expected behavior

  1. The transport should become Up when both ends of it are online.
  2. The transport should reach the Up state when it is removed and added again.
@i-hate-nicknames i-hate-nicknames added the bug Something isn't working label Jun 9, 2021
@i-hate-nicknames
Contributor Author

I think I found the cause of the bug.

First, a bit of background on how transports behave.

When a transport is established, a transport object is created and kept in memory.
This object wraps the underlying connection and continuously reads
from it. Among other things, it runs what is called a redial loop.

The redial loop tries to dial the other end of the transport and updates the local
transport connection if the dial succeeds.
The redial loop has two prerequisites to trigger:

  • there has been no activity on the transport for the last 3 seconds
  • the PK of the current visor is smaller than the PK of the other end of the transport.

Stcpr transports have a few features worth noting:

  1. There is no address/port configuration for stcpr. Upon visor start, the OS assigns a random port
    to the stcpr listener.
  2. When the stcpr listener starts, it also advertises itself in the address resolver (AR), sending the port
    assigned by the OS. Other visors should then use this port plus the visor's IP address to connect to it.
    There is no check for a public/private IP address as far as I can tell. Any visor that starts
    an stcpr listener will register itself in AR, even if it has no public IP.
  3. A dial of stcpr type involves resolving the PK via AR. The result of this resolution is
    the visor's IP address and the port it registered in AR.

Now, when the server PK is smaller than the client PK, the server initiates redialing using
the client port from AR. When the client has no public address, this fails and the server switches to
an exponential backoff redial loop.

It should be possible to fix this issue by forcing the client to redial. I couldn't reproduce
this issue with the same configs and machines but with the keypairs swapped, so that the client
got the smaller PK.

However, the question of why the server cannot accept a connection while in the redial loop still remains.
When manually removing and adding a transport on the client side, the server shows some signs of accepting:

[2021-06-11T12:38:18Z] INFO [stcpr]: Accepted connection from 93.72.85.100:42240
[2021-06-11T12:38:18Z] INFO [stcpr]: Performing handshake with 93.72.85.100:42240
[2021-06-11T12:38:18Z] INFO [stcpr]: Sent handshake to 93.72.85.100:42240, local addr 02e10a98ccda924728029fa849c61b33752ca7a8595bf0502091274e7c538b6134:45, remote addr 0364a0b0beec34f06d482c5be268810d22d21393a33026d578cd1f1db7d67bdd37:49153
[2021-06-11T12:38:18Z] INFO [stcpr]: Connection with 0364a0b0beec34f06d482c5be268810d22d21393a33026d578cd1f1db7d67bdd37:49153@93.72.85.100:42240(0364a0b0beec34f06d482c5be268810d22d21393a33026d578cd1f1db7d67bdd37) is encrypted
[2021-06-11T12:38:18Z] INFO [transport_manager]: recv transport connection request: type(stcpr) remote(0364a0b0beec34f06d482c5be268810d22d21393a33026d578cd1f1db7d67bdd37)
[2021-06-11T12:38:18Z] DEBUG [transport_manager]: TP found, accepting...

but the transport doesn't reach the Up state.

The suggestions for now are:

  1. Investigate and discuss why the visor with the smaller PK is picked to initiate redial. Perhaps it's
    required for other transport types like dmsg. If it's not required for stcpr, then use initiator-redials
    logic.
  2. Investigate why the server cannot accept a transport while in redial mode. This might not be important for
    client-server interaction, but it may be relevant in other scenarios,
    for example if visor v1 tries to set up a transport manually to v2 while v2 is trying to redial v1.
  3. Discuss whether unreachable records should be stored in AR.
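
As a starting point for the discussion in suggestion 1, the two rules can be compared side by side. This is a hypothetical sketch: `redialPolicy`, `smallerPKRedials`, `initiatorRedials` and `mayRedial` are names invented here, and the byte-wise PK comparison is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// redialPolicy selects which end of a transport is allowed to redial.
type redialPolicy int

const (
	// smallerPKRedials is the rule described in this issue.
	smallerPKRedials redialPolicy = iota
	// initiatorRedials is the proposed alternative for stcpr.
	initiatorRedials
)

// mayRedial reports whether the local visor may redial under the given policy.
func mayRedial(p redialPolicy, localPK, remotePK []byte, localIsInitiator bool) bool {
	switch p {
	case smallerPKRedials:
		return bytes.Compare(localPK, remotePK) < 0
	case initiatorRedials:
		return localIsInitiator
	default:
		return false
	}
}

func main() {
	// Shortened, illustrative key prefixes: the server PK is the smaller one,
	// and the client is the side that dialed (the initiator).
	serverPK := []byte{0x02, 0xe1}
	clientPK := []byte{0x03, 0x64}

	fmt.Println("smaller-PK rule, server redials:", mayRedial(smallerPKRedials, serverPK, clientPK, false))
	fmt.Println("smaller-PK rule, client redials:", mayRedial(smallerPKRedials, clientPK, serverPK, true))
	fmt.Println("initiator rule, server redials:", mayRedial(initiatorRedials, serverPK, clientPK, false))
	fmt.Println("initiator rule, client redials:", mayRedial(initiatorRedials, clientPK, serverPK, true))
}
```

Under the smaller-PK rule the redialing side is fixed by the keys, regardless of which side is reachable. Under the initiator rule the side that dialed successfully once is the one that retries, which would cover the client-behind-NAT case from this report.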
