Switch to standby path more tricky than expected #293
Comments
A lot of this is a tradeoff between simplicity and performance. For performance reasons, we want the traffic to start using the standby path if the available path is "dubious". In my implementation, that's on first PTO. Of course, the PTO may or may not be due to a link failure, but hopefully it is infrequent enough that using the standby path briefly in case of PTO does not break the "spirit" of putting a path in standby. Sending an "abandon path" immediately would also force traffic onto the standby path, but it is more drastic. If the packet loss situation was temporary, it causes the system to stop using the "available" path forever. In contrast, promoting the standby path to "available" is easily reversible. If after a PTO or a couple of RTOs the "available" path is restored, the client can decide to put the standby path back in standby mode.
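The promote-then-restore policy described above can be sketched roughly as follows. This is a minimal illustration and not picoquic's code: `PathStatus`, `Path`, `Scheduler`, `on_pto`, `on_ack` and `usable_paths` are hypothetical names. The sketch also assumes that the endpoint restores standby mode automatically as soon as the original path is acknowledged again, whereas the comment only says the client *can* decide to do so.

```python
from dataclasses import dataclass
from enum import Enum


class PathStatus(Enum):
    AVAILABLE = "available"
    STANDBY = "standby"


@dataclass
class Path:
    path_id: int
    status: PathStatus
    pto_count: int = 0      # consecutive PTOs since the last ACK on this path
    promoted: bool = False  # True while a standby path is temporarily promoted


class Scheduler:
    """Hypothetical scheduler: promote standby on first PTO, demote on recovery."""

    def __init__(self, paths):
        self.paths = {p.path_id: p for p in paths}

    def on_pto(self, path_id):
        path = self.paths[path_id]
        path.pto_count += 1
        # First PTO on an available path: the path is "dubious", so promote
        # every standby path instead of abandoning the dubious one.
        if path.status is PathStatus.AVAILABLE and path.pto_count == 1:
            for other in self.paths.values():
                if other.status is PathStatus.STANDBY:
                    other.status = PathStatus.AVAILABLE
                    other.promoted = True

    def on_ack(self, path_id):
        path = self.paths[path_id]
        was_dubious = path.pto_count > 0
        path.pto_count = 0
        # The original path recovered: undo the promotion (assumed policy).
        if was_dubious and not path.promoted:
            for other in self.paths.values():
                if other.promoted:
                    other.status = PathStatus.STANDBY
                    other.promoted = False

    def usable_paths(self):
        # Paths the sender would schedule data on right now.
        return sorted(p.path_id for p in self.paths.values()
                      if p.status is PathStatus.AVAILABLE and p.pto_count == 0)
```

The point of the sketch is that nothing is torn down: the dubious path keeps its context and its status, so reverting is a local state change rather than a new path setup.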
I actually think it is clearer to explicitly close a path if you detect it's broken. If the path comes back (whatever that means), you can simply try to open it again, or actually open a new path in this case. However, we can enforce this behaviour either way, therefore I think we should discuss the issue and maybe explain different solutions but not make any strict recommendation.
You don't know when the path comes back. If you urgently need it, then you need to send probes regularly. If the path context is still up, you can do that by sending a ping, or repeating a path challenge at short intervals. If the path is gone, you need to create a new path, send a challenge, etc. If the challenge fails, you should also send an abandon, because the peer may have received the challenge even though the response did not make it. So you consume the "number of paths" resource, and also the "number of CIDs".
If you send pings you are supposed to close the path after a timeout. Also if the path comes "back", you really don't know if that is still the same path. I think it would be much safer to send a path challenge. This is what we currently say about recent addresses:
This issue reminds me of the experience I had when trying to optimise the latency of applications on smartphones with MPTCP and older non-standard MPQUIC implementations, where WiFi is considered cheap and cellular expensive. When smartphone users are initially connected to WiFi and move away from their access point, they may eventually be out of WiFi reach, leading to packet losses. There is often some delay between applications experiencing packet losses and the system declaring the WiFi as lost. During that interval, the WiFi path acts as a blackhole. The path priority (available/standby) is a scheduling concern, and each endpoint runs its own algorithm. It is up to each endpoint to determine when to start using "standby" paths. The "path health status" is also local information, and depending on your traffic (fully upload or fully download), only one endpoint may be aware of a lossy path. Scheduling decisions at both sides impact the performance of the multipath transfer. I see two ways of handling this:
@qdeconinck "from time to time" is the issue. There is bound to be a delay between the time the sender notices packets are not getting acked and the time a pure receiver notices that the occasional PING is not acked. The pure receiver will also tend to use longer estimates for the RTO. So we get the sequence:
The issue that I find is that if the gap between (2) and (3) is too long, the sender will also notice a PTO on the "standby" path, because the ACKs are sent by the receiver through the "available" path, and lost. My "solution" is:
@mirjak proposes to just rely on "ABANDON_PATH". That's doable, but it results in:
That will work, but that means the traffic resumes after "too many RTOs", which is a pain for several applications.
I did a big batch of debugging on the "unique path-id" code in picoquic, porting all the tests that were designed for the previous multipath version, and I found an interesting issue in the "standup" test. The test starts by setting up a client-to-server connection with two paths, one available and one in standby. It runs for a while, then simulates cutting the "available" path off. The expectation is that the connection will continue with the "standby" path. The test was initially failing.
The simulated traffic is from server to client. The server quickly detects that the available path is down, and starts sending data packets on the "standby" path. But the client only sends ACKs, and does not react quickly if ACK packets are not acknowledged. So the client keeps sending ACKs on the "available" path. Of course, since the path is cut, they are dropped. The server does not see ACKs for the packets sent on the "standby" path, so it quickly concludes that this path is down. Pretty soon, the connection breaks.
There are two potential fixes. One would be to somehow force ACKs on the standby path, if packets are received on that path. The other, which I feel is more robust, is for the server to mark the standby path as "available" if the available path is "broken". I did that, "promoting" the standby path to available, and the "standup" test now succeeds.
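The failure mode and the first of the two fixes can be reproduced in a toy model. This is only a sketch under simplifying assumptions, not the picoquic test: the function name `connection_survives`, the round-based timing, and the `max_pto` threshold are all hypothetical, and the server is assumed to have already moved its data to the standby path.

```python
def connection_survives(ack_on_receiving_path, rounds=10, max_pto=3):
    """Toy model: path 0 ("available") is cut, path 1 ("standby") still works.

    The server sends data on path 1; the client only sends ACKs.
    Returns True if the server keeps seeing ACKs, False if it declares
    path 1 down after max_pto consecutive unacknowledged rounds.
    """
    path_up = {0: False, 1: True}
    unacked_rounds = 0
    for _ in range(rounds):
        data_path = 1  # the server already switched to the standby path
        # The client either ACKs on the path the data arrived on (first fix)
        # or keeps using the path it still believes is "available".
        ack_path = data_path if ack_on_receiving_path else 0
        if path_up[data_path] and path_up[ack_path]:
            unacked_rounds = 0
        else:
            unacked_rounds += 1
            if unacked_rounds >= max_pto:
                return False  # server concludes the standby path is down too
    return True
```

With `ack_on_receiving_path=False` the ACKs are dropped on the cut path and the server gives up on the working path as well; with `True` the connection survives. The second fix (promotion) is not modelled, because how a promotion changes the client's choice of ACK path depends on the implementation.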
This issue might deserve some discussion in the multipath draft.