Improvement to random port allocation for testing. #3682
Comments
Thanks for pointing this out! 👍 One concern I have with the retry/exponential backoff approach is that it might introduce additional non-determinism into the tests. Would allocating all of the required ports for a test in one go not help address this? Or do your tests get executed in parallel within a single container in your CI? (In which case one could still end up with a port collision.) e.g.

```go
import "net"

// GetFreePort in this case makes the closing of the listener the
// responsibility of the caller, to allow for a guarantee that multiple
// random port allocations don't collide.
func GetFreePort() (int, *net.TCPListener, error) {
	addr, err := net.ResolveTCPAddr("tcp", "localhost:0")
	if err != nil {
		return 0, nil, err
	}
	l, err := net.ListenTCP("tcp", addr)
	if err != nil {
		return 0, nil, err
	}
	return l.Addr().(*net.TCPAddr).Port, l, nil
}

// GetFreePorts allocates a batch of n TCP ports in one go to avoid
// collisions. The deferred closes only run once all n ports have been
// allocated, so no two ports in the batch can be the same.
func GetFreePorts(n int) ([]int, error) {
	ports := make([]int, 0, n)
	for i := 0; i < n; i++ {
		port, listener, err := GetFreePort()
		if err != nil {
			return nil, err
		}
		defer listener.Close()
		ports = append(ports, port)
	}
	return ports, nil
}
```

This assumes Go's internal port allocation mechanism guarantees no collisions, of course (I haven't had time to look into this possibility though).
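For illustration, a test could reserve all of its ports up front with the helper above. This is a minimal sketch; `TestNodeStartup` and the three-port split are assumptions for the example, not actual tests from the suite:

```go
import "testing"

func TestNodeStartup(t *testing.T) {
	// Reserve three distinct ports before starting the node, e.g. for its
	// RPC, P2P and proxy-app listeners.
	ports, err := GetFreePorts(3)
	if err != nil {
		t.Fatal(err)
	}
	t.Logf("rpc=%d p2p=%d proxy=%d", ports[0], ports[1], ports[2])
}
```

Because the listeners held by `GetFreePorts` are only released when it returns, the three ports are guaranteed distinct at the moment of allocation.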
@thanethomson Actually we're kind of spinning up and tearing down for each test, but somehow a collision still happened. Instead of exponential backoff we could have an optional param for the number of retries and a set timeout for each retry (like 100 ms), or make it an optional struct with these two params.
Preallocation of ports won't work because these are different test suites, and it is also a less flexible approach. I think extending the current approach with some optional args is a better idea in terms of reliability, and it also does not break any tests that are using the defaults.
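A minimal sketch of that optional-args idea; `RetryOpts` and `GetFreePortWithRetry` are hypothetical names, not part of the real libs/common API:

```go
import (
	"net"
	"time"
)

// RetryOpts carries the two proposed knobs: how many extra attempts to
// make, and how long to wait between them (e.g. 100 ms).
type RetryOpts struct {
	Retries int
	Wait    time.Duration
}

// GetFreePortWithRetry binds to port 0 to obtain a free ephemeral port,
// retrying per opts if the bind itself fails.
func GetFreePortWithRetry(opts RetryOpts) (int, error) {
	var lastErr error
	for i := 0; i <= opts.Retries; i++ {
		l, err := net.Listen("tcp", "localhost:0")
		if err == nil {
			port := l.Addr().(*net.TCPAddr).Port
			l.Close()
			return port, nil
		}
		lastErr = err
		time.Sleep(opts.Wait)
	}
	return 0, lastErr
}
```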
What would cause such a collision? Looking at it now, the only place we're using `GetFreePort` is when configuring the test node's listen addresses. Have you perhaps traced the issue specifically to the `GetFreePort` call? Could there be something in the code elsewhere that is hard-coded to use a port in the ephemeral port range that sometimes happens to collide with one allocated by `GetFreePort`?
Try this gist, for example. I've got 1000 concurrent goroutines attempting to allocate free ports simultaneously, and I haven't been able to produce a collision yet on my local machine.
I have got proof of this happening today in the CI logs. I'm pretty sure it happens close to never, and it might have to do with a combination of factors, one of which is a weak CI machine with a bunch of concurrent jobs that it might not be able to handle properly, and even then it happens rarely. What this change is needed for is to be able to account for this and prevent tests from being flaky, which they currently sometimes are. I raised this issue specifically to confirm that the change is going to be accepted before implementing and submitting the PR. It's a much easier fix for anyone who would ever want to run Tendermint tests on CI than any other option that involves wrapping application start calls.
It's great that you did! 👍 That's our preferred process: we do like to talk through these details in the GitHub issues before we go and make changes. I understand your issue and I'll see if I can help out.

I've taken a look at the other comment thread and your Travis CI logs, and all I can see is the fact that the port was in use when this test attempts to grab port 33358. There could be many explanations for this, including the possibility that there's some kind of ephemeral service in the Travis CI container that happens to come up between the time that you've called `GetFreePort` and the time the test actually binds to the returned port.

Looking at this again now, I still don't see how changing `GetFreePort` itself would help here.

Your retry solution probably makes more sense to implement in a loop over here, where if the Tendermint node creation fails because of a bind error, reconfigure the ports and retry the node creation.
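A rough sketch of that shape of fix; `retryOnBind` is a hypothetical helper, and the real loop would wrap whatever the test harness actually uses to boot the node:

```go
import (
	"errors"
	"fmt"
	"syscall"
)

// retryOnBind runs start up to attempts times. start should reconfigure
// fresh ports internally and then try to boot the node; we only retry when
// the failure looks like an address-in-use error.
func retryOnBind(attempts int, start func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = start(); err == nil {
			return nil
		}
		if !errors.Is(err, syscall.EADDRINUSE) {
			return err // not a port collision; don't mask other failures
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}
```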
@thanethomson So here is the thing: when you bind on port 0 it gives you the next free port, AFAIK "sequentially". What happens here is that the port is somehow already busy when it is requested, so a retry would give us a different one!
My bad, I was looking at another set of tests here as these were more interesting, since Tendermint there is getting its own free port. I will fix it on our side, and thanks for pointing that out: I got confused by the different approaches to setting up Tendermint across our suites. In any case it was a good thing to discuss, and I will try to implement some test scenarios where that could possibly happen.
But yeah, I think the ability to get n ports at the same time that are guaranteed to be different is a good fix for this too.
I have taken a look at this again and tried fixing this on our side, but that means duplicating a lot of functionality from the rpctest lib. I now know what would help, something like this:

1. Exposing a batch allocation helper along the lines of `GetFreePorts` above and using it in rpctest, so that all of a test node's ports are reserved in one go.
2. A way to programmatically configure which ports a Tendermint node binds to, without relying on rpctest internals.

Does that make sense? The second point is actually important whether or not we implement the first one.
Makes sense - definitely. On your second point, and after taking a look at the code in Weave yesterday, I realise that one thing we can definitely look at doing better is to perhaps provide a simpler interface to programmatically configure and bootstrap a Tendermint node. I don't know if it's such a good idea to rely on the rpctest internals for this.

Let me create an issue regarding this specifically so we can separate out the two issues 👍
Let me know how to proceed then. If you want the 1st point to be contributed, I'd be glad to do so, as the code is already implemented; it just needs to be committed and used in rpctest.
After thinking about this, I think this issue needs to be superseded by #3690. By providing that API, we provide a more guaranteed way of ensuring that nobody grabs those ports while you're starting up a node, which avoids the need for you to worry about internals like `GetFreePort`.

In the meantime, I'd suggest perhaps implementing your own version of `StartTendermint` on your side.

As such, can we close this issue in favour of #3690?
Yes, I would love to build this API as it’s going to simplify things a lot for us; building it on top of what we have now didn’t really feel great. Let me talk this over with the guys. Yep, I think this is a fair deal, let’s just keep one open.
@ethanfrey ^^ |
Below is a starting API proposal. GetFreePorts is a public function because it just seems like a nice helper to have, but it could be made private if needed. It could even be a private method of ConfigWrapper to avoid having an extra func in the namespace. How does this sound:
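A hypothetical sketch of what such an API might look like; the `ConfigWrapper` method and the specific config fields below are assumptions for illustration, not the author's actual proposal:

```go
import (
	"fmt"

	"github.com/tendermint/tendermint/config"
)

// ConfigWrapper wraps a Tendermint config so that its listen addresses can
// be rewritten from a freshly allocated batch of ports.
type ConfigWrapper struct {
	Config *config.Config
}

// AssignFreePorts reserves three distinct ports via GetFreePorts (sketched
// earlier in this thread) and points the RPC, P2P and proxy-app listeners
// at them.
func (cw *ConfigWrapper) AssignFreePorts() error {
	ports, err := GetFreePorts(3)
	if err != nil {
		return err
	}
	cw.Config.RPC.ListenAddress = fmt.Sprintf("tcp://127.0.0.1:%d", ports[0])
	cw.Config.P2P.ListenAddress = fmt.Sprintf("tcp://127.0.0.1:%d", ports[1])
	cw.Config.ProxyApp = fmt.Sprintf("tcp://127.0.0.1:%d", ports[2])
	return nil
}
```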
Thanks for this, but what I'd recommend is rather closing this issue and addressing this problem entirely in #3690. One of the big reasons is that even with `GetFreePorts`, there is still a window between releasing the allocated ports and the node actually binding them, in which something else could grab one of them. Hence my reasoning as to why this should be superseded by an API that handles both the port allocation and the starting of the Tendermint node.
While running tests with the github.com/tendermint/tendermint/rpc/test package, I've noticed that it is sometimes possible to get a collision while trying to bind on port 0 to get a random port allocated. While we are running some tests on the CI, there are definitely not enough of them to exhaust all the free ports, especially not in a fresh CI container.
This issue describes the problem pretty well:
iov-one/weave#685
And also this comment: iov-one/weave#685 (comment)
It does not happen often, but what I would propose is to allow an optional retry parameter here (https://github.com/tendermint/tendermint/blob/master/libs/common/net.go#L28) and pass it all the way down from rpctest.StartTendermint(). I'm pretty sure that even with one retry the probability of a collision becomes almost negligible, not to mention with more than one retry. I would also propose an exponential backoff starting with some really low value in milliseconds.
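A tiny sketch of the kind of backoff schedule meant here; the base and cap values are illustrative, not agreed numbers:

```go
import "time"

// backoffDelay grows exponentially from 1ms (1ms, 2ms, 4ms, ...) and is
// capped at 100ms so retries never stall a test for long.
func backoffDelay(attempt int) time.Duration {
	if attempt > 6 { // 1ms << 7 already exceeds the cap
		return 100 * time.Millisecond
	}
	d := time.Millisecond << uint(attempt)
	if d > 100*time.Millisecond {
		d = 100 * time.Millisecond
	}
	return d
}
```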
If all of that makes sense - I'm happy to contribute this.