Windows reports "Connection failed: the semaphore timeout has expired" #3898

Closed
GregoryLundberg opened this issue Jan 29, 2019 · 9 comments
Labels
Network (Issues in the networking API), Windows (OS-specific issues that apply to Microsoft Windows)

Comments

@GregoryLundberg
Contributor

When faced with complete packet loss for a period of time, the Windows client (wesnoth) reports that the connection to the Multi-player server (wesnothd) has failed.

This error can be reproduced by starting a server on one local computer and using another local Windows computer to create a private game on that server. While waiting for others to join, disconnect the network cable or raise a firewall to block all traffic to the server, then cancel the game at the client. After a few moments, a pop-up error should appear with the message.

When seen in the wild, "pinging" the server often works around the issue, and Ravana has created a robot that automates this. Depending upon the exact nature of the packet loss, it might still fail in some cases.

My suggestions are:

  1. Modify the Windows version to catch and ignore this error. This should allow the Windows client to fail more gracefully, like the Linux client, producing a more meaningful error for the user.

  2. Modify the connection on Windows to lengthen the timeout causing this error to something more reasonable, like 15 minutes, instead of just a few seconds.

  3. As always, users experiencing complete packet loss should first investigate hardware and networking issues since nothing we can do will work in all cases of complete packet loss.

This error message is well-known to Windows administrators. A web search for the error message leads to a number of conversations about it. Most reports involve backups or SQL queries, and are resolved by replacing hardware or networking components to increase speed / reduce latency.

This Issue memorializes a conversation on the forum.

@soliton-
Member

  1. How is it helpful to ignore a connection error? Do you expect receive and send functions to just continue to work despite the error?

  2. This sounds like you're now talking about some automatic disconnect if the connection is idle? I would be surprised if wesnoth enables any such functionality or if it was standard even on windows. I would assume some network device like some router doing NAT could cause this though.

What we could do is show a more descriptive error if it's not clear that this is a disconnect and possibly hiding the cryptic semaphore message.
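As a rough illustration of that idea, here is a minimal Python sketch that swaps the raw OS text for a clearer message. The function name, the message table, and the wording of the friendly message are all hypothetical and not taken from the Wesnoth code base. The only outside fact assumed is that Windows system error 121 (ERROR_SEM_TIMEOUT) is the one whose text reads "The semaphore timeout period has expired."

```python
# Hypothetical helper: none of these names come from the Wesnoth code base.
ERROR_SEM_TIMEOUT = 121  # Windows system error behind the "semaphore timeout" text

# Assumed wording; a real client would use its own translated strings.
FRIENDLY_MESSAGES = {
    ERROR_SEM_TIMEOUT: "Lost connection to the server: the network stopped responding.",
}


def describe_connection_error(exc):
    """Return a user-facing message, keeping the system code only as a detail."""
    # On Windows, OSError carries the native code in `winerror`;
    # on other platforms the attribute does not exist.
    code = getattr(exc, "winerror", None)
    friendly = FRIENDLY_MESSAGES.get(code)
    if friendly is None:
        # Unknown error: fall back to whatever text the OS supplied.
        return f"Connection failed: {exc.strerror or exc}"
    return f"{friendly} (system error {code})"
```

The connection is still treated as failed; only the text shown to the user changes.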

@GregoryLundberg
Contributor Author

GregoryLundberg commented Jan 29, 2019

An automatic disconnection on an idle connection would, most likely, follow TCP and close the connection or, at least, reset it. There are some who claim that certain NAT routers have a short (1-5 minute) timeout and simply drop the connection without so much as a reset: no packets at all.

I'm not suggesting we ignore connection failures. I'm suggesting we ignore the internal Windows semaphore/timer which, after just a second or two without a packet on the TCP connection, issues an obscure error message about the semaphore timing out. If we can set that to something more reasonable, like 15 or 30 minutes, then fine. But on Linux it appears that we simply don't do anything (I didn't wait a few hours to see whether Linux, too, times out): the user is allowed to close the game and go back to the lobby, and the error is reported only when they try to start a new game. I'm suggesting we make Windows behave like that as well.

If we simply let the error appear after a couple seconds of waiting, I would suggest we display the message "You're using Windows on a shitty network and something went wrong." /s

@GregoryLundberg
Contributor Author

From what I've been able to glean, this error is Microsoft deciding that the TCP timeouts are far too long for modern networks, and issuing the error in seconds (if not milliseconds) when an SQL server or network-attached storage device cannot get a packet back. That may be fine for their large corporate customers on their in-house high-speed networks. But, in the Real World, packets can take a surprisingly long time to get there and come back, and far too many never complete the circuit. This is not something new. There is discussion about the issue going back several years. And the common thread seems to be to upgrade the storage devices or network so SQL queries and network backups run as quickly as possible.

The only players talking about this issue are (apparently .. you have to take IP address geo-location with a BIG grain of salt) in central Italy. One appears to be on Telecom Italia and the other on Telecom Italia Mobile. Personally, I think, since this is a recent issue, and fairly localized, that the better course would be to begin with a more local issue. Certainly, if I had this issue on my network, if I could prove it's not my personal hardware, I'd be all over my ISP to fix it.

The counter-argument was that the issue is more likely in areas with poor networks, unlikely to improve, and few English speakers. And that they only see the issue with Wesnoth. To be honest, they're probably correct. Unlike most games, which have a heavy flurry of TCP and/or UDP traffic, Wesnoth is very lightweight. And, sure, the issue is probably under-reported in some areas.

The fact that Ravana's quick-and-dirty hack seems to improve things does tend to argue in this direction. As I look back, I forgot to mention that. In my experience, if your network is prone to packet loss, no amount of pinging will totally solve the issue. But it can make it statistically unlikely. Basically, what Ravana is doing is making Wesnoth behave more like other network-heavy online games. Given a short enough time-out, we can make it so the users see the failure across all their games. Currently, Ravana is using a 1-minute "ping" and that is reported to not be enough to push the problem into the weeds, so Ravana's ping-bot should be shortened to something like a 1- or 10-second repeat. And we should consider adding it as a feature to the client and server software, probably with an option to enable it if the player's network exhibits high packet loss due to low traffic on the connection.

@loonycyborg
Member

I'm not sure that simply ignoring an error is a good idea. Will the socket even be in a valid state after this? I think a better course of action would be to reduce the keepalive interval on the server to 30s. Unfortunately asio doesn't provide cross-platform helpers for that, so #ifdefs would be needed.
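To show why per-platform branching is needed, here is a minimal Python sketch of enabling TCP keepalive with a 30-second interval. It is an illustration only, not the wesnothd change (which is C++ and uses asio). The function name and its defaults are hypothetical, and the `hasattr` checks stand in for the #ifdefs mentioned above. The sketch assumes Linux-style option names on POSIX and the `SIO_KEEPALIVE_VALS` ioctl on Windows, both as exposed by Python's `socket` module.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=30, probes=10):
    """Turn on TCP keepalive; return the names of the options that were applied."""
    # SO_KEEPALIVE itself is portable; only the timing knobs differ per platform.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        # The probe count is not set through this call.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))
        applied.append("SIO_KEEPALIVE_VALS")
    else:
        # POSIX: each knob is a separate option, and not every platform has all three.
        for name, value in (
            ("TCP_KEEPIDLE", idle_s),
            ("TCP_KEEPINTVL", interval_s),
            ("TCP_KEEPCNT", probes),
        ):
            if hasattr(socket, name):
                sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
                applied.append(name)
    return applied
```

A server would call something like this on each accepted connection, e.g. `enable_keepalive(conn)` right after `accept()` returns.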

@loonycyborg self-assigned this Jan 29, 2019
@GregoryLundberg
Contributor Author

I agree, it might not be possible to make Windows work as well as Linux when it comes to this sort of failure. But I note it because I think it would be best to try.

@ProditorMagnus
Contributor

I chose a 1-minute minimum interval since I use a very inefficient implementation for checking whether someone should be pinged. I expect that if a dozen users tried to use it at once, it would crash.
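For comparison, one common way to make the "who is due for a ping" check cheap is a min-heap keyed on the next due time, so each check only looks at the earliest entry. The sketch below is hypothetical and is not the bot's actual code; the class and method names are made up for illustration. It assumes an interval of 0 cancels a user's pings, as the bot's `!ping 0` command does.

```python
import heapq


class PingScheduler:
    """Tracks when each user is next due for a ping, earliest first."""

    def __init__(self):
        self._heap = []   # (due_time, user) entries; some may be stale
        self._next = {}   # user -> (due_time, interval_s) that is currently valid

    def set_interval(self, user, interval_s, now):
        """Register or update a user; an interval of 0 cancels their pings."""
        if interval_s <= 0:
            self._next.pop(user, None)
            return
        due_time = now + interval_s
        self._next[user] = (due_time, interval_s)
        heapq.heappush(self._heap, (due_time, user))

    def due(self, now):
        """Return users whose ping time has arrived and reschedule each of them."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            due_time, user = heapq.heappop(self._heap)
            current = self._next.get(user)
            if current is None or current[0] != due_time:
                continue  # cancelled or superseded entry; drop it
            ready.append(user)
            interval_s = current[1]
            next_due = now + interval_s
            self._next[user] = (next_due, interval_s)
            heapq.heappush(self._heap, (next_due, user))
        return ready
```

With this structure a check costs time proportional to the number of users actually due, not to the total number registered, which is what would allow intervals shorter than one minute.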

@Wedge009 added the Windows and Network labels Jan 30, 2019
loonycyborg added a commit that referenced this issue Jan 30, 2019
Hopefully this will prevent NAT from forgetting idling clients
For now this is POSIX only and not configurable.
loonycyborg added a commit that referenced this issue Jan 30, 2019
Hopefully this will prevent NAT from forgetting idling clients
For now this is POSIX only and not configurable.
@GregoryLundberg
Contributor Author

@ProditorMagnus good to know.

For casual readers:

The SO_KEEPALIVE changes against non-Windows are for the server (wesnothd).

This means, when the server updates (for 1.14.6), the official Multi-Player server will tell all player/client connections to use a TCP KeepAlive packet every 30 seconds, expecting a response from the client within 30 seconds, and allowing up to 10 failures before the server declares your client unreachable and closes its side of your connection.

TCP KeepAlive is a feature intended precisely for the situation where an intermediate node (router) might drop the connection due to inactivity. Packet loss is a normal fact of life on the Internet, so multiple failures are needed before the connection is declared failed.

Note that TCP KeepAlive is not always effective:

  • some nodes may use shorter time-outs, and

  • random packet loss might exceed the five minutes allowed here, missing the chance to recover
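The five-minute figure follows from the settings described above: 10 unanswered probes at 30-second intervals. A small Python sketch of the arithmetic, with a hypothetical function name, and assuming the usual behavior where the first probe is sent only after the connection has been idle for the idle time:

```python
def keepalive_budget(idle_s=30, interval_s=30, probes=10):
    """Return (seconds spent probing, seconds from last traffic to giving up)."""
    probing = interval_s * probes   # window covered by unanswered probes
    total = idle_s + probing        # idle wait before the first probe, plus probing
    return probing, total


probing, total = keepalive_budget()
print(probing / 60, total / 60)  # 5.0 5.5
```

So an outage has to last roughly five minutes before the server gives up, while a router that forgets idle connections in under 30 seconds would still defeat the scheme.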

nemaara pushed a commit to nemaara/wesnoth that referenced this issue Jan 30, 2019
Hopefully this will prevent NAT from forgetting idling clients
For now this is POSIX only and not configurable.
@ProditorMagnus
Contributor

I will keep the command around for a few days, but I added a message about this: kwargs["reply"]("Server has been updated, using bot for pings should not be needed anymore. Consider cancelling your ping (!ping 0). Post to the forum thread if issue still happens")

@GregoryLundberg
Contributor Author

Marking this closed since the users on the Forums report the SO_KEEPALIVE option has solved it for them.
