Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement to random port allocation for testing. #3682

Closed
ruseinov opened this issue May 28, 2019 · 19 comments
Closed

Improvement to random port allocation for testing. #3682

ruseinov opened this issue May 28, 2019 · 19 comments
Labels
T:test Type: Tests that need love

Comments

@ruseinov
Copy link
Contributor

While running tests with github.com/tendermint/tendermint/rpc/test package I've noticed that it is sometimes possible to get a collision while trying to bind on port 0 and get a random port allocated.

While we are running some tests on the CI there's definitely not enough of them to allocate all the free ports, especially not in a fresh CI container.

This issue here describes the problem pretty good:
iov-one/weave#685
And also this comment: iov-one/weave#685 (comment)

It does not happen often, but what I would propose is to allow an optional retry parameter here https://github.com/tendermint/tendermint/blob/master/libs/common/net.go#L28 and pass it all the way from rpctest.StartTendermint(). I'm pretty sure that even with one retry the probability of collision will reduce to almost impossible, not to mention more than one retry.

I would also propose an exponential backoff starting with some really low value in milliseconds.

If all of that makes sense - I'm happy to contribute this.

@thanethomson
Copy link
Contributor

Thanks for pointing this out! 👍

One concern I have with the retry/exponential backoff approach is that it might introduce additional non-determinism into the tests.

Would allocating the required ports for a test all in one go not possibly assist with addressing this? Or do your tests get executed in parallel within a single container in your CI? (In which case one could still end up with a port collision).

e.g.

// GetFreePort in this case makes the closing of the listener the responsibility
// of the caller to allow for a guarantee that multiple random port allocations
// don't collide.
func GetFreePort() (int, *net.TCPListener, error) {
	addr, err := net.ResolveTCPAddr("tcp", "localhost:0")
	if err != nil {
		return 0, nil, err
	}

	l, err := net.ListenTCP("tcp", addr)
	if err != nil {
		return 0, nil, err
	}
	return l.Addr().(*net.TCPAddr).Port, l, nil
}

// GetFreePorts allocates a batch of n TCP ports in one go to avoid collisions.
func GetFreePorts(n int) ([]int, error) {
	ports := make([]int, 0)
	for i := 0; i < n; i++ {
		port, listener, err := GetFreePort()
		if err != nil {
			return nil, err
		}
		defer listener.Close()
		ports = append(ports, port)
	}
	return ports, nil
}

This assumes Go's internal port allocation mechanism guarantees no collisions, of course (I haven't had time to look into this possibility though).

@thanethomson thanethomson added the T:test Type: Tests that need love label May 28, 2019
@ruseinov
Copy link
Contributor Author

@thanethomson Actually we're kind of spinning up and tearing down for each test, but somehow a collision still happened. Instead of exponential backoff we could have an optional param for number of retries and a set timeout for each retry (like 100 ms) or make it an optional struct with these two params.

@ruseinov
Copy link
Contributor Author

Preallocation of ports won't work because these are different test suites, also it is a less flexible approach.

I think extending the current approach with some optional args is a better idea in terms of reliability and also does not break any tests that are using the defaults.

@thanethomson
Copy link
Contributor

Actually we're kind of spinning up and tearing down for each test, but somehow a collision still happened.

What would cause such a collision?

Looking at it now, the only place we're using the GetFreePort method in the RPC tests is effectively https://github.com/tendermint/tendermint/blob/master/rpc/test/helpers.go#L74

The first GetFreePort call during a test will most certainly return a free port. Each subsequent call will have a very small, but non-zero, probability of returning the same port as one of the previously allocated ones because the listener is freed by the time each call returns (thus, from the OS perspective, that port is now free again, when we actually intend on using it).

Have you perhaps traced the issue specifically to the makeAddrs method in the Tendermint RPC code, or is it triggered elsewhere?

Could there be something in the code elsewhere that is hard-coded to use an IP address in the ephemeral port range that sometimes happens to collide with one allocated by GetFreePort?

@thanethomson
Copy link
Contributor

Try this gist for example. I've got 1000 concurrent Goroutines attempting to allocate free ports simultaneously, and I haven't been able to produce a collision yet on my local machine.

@ruseinov
Copy link
Contributor Author

I have got proof of this happening today in the ci and logs. I'm pretty sure it happens close to never and it might have to do with a combination of factors, one of which is weak ci machine with a bunch of running concurrent jobs that might not be able to handle this properly, and even then ut happens rarely.

What this change is needed for is to be able to account for this and prevent tests from being flakey, which they currently sometimes are.

I raised that issue specifically to confirm that this is going to be accepted before implementing and submitting the PR. It's a much easier fix for anyone who would ever want run tendermint tests on ci than any other option that involves wrapping application start calls.

@thanethomson
Copy link
Contributor

I raised that issue specifically to confirm that this is going to be accepted before implementing and submitting the PR.

It's great that you did! 👍 That's our preferred process. We do like to talk through these details in the GitHub issues before we go and make changes.

I understand your issue and I'll see if I can help out. I've taken a look at the other comment thread and your Travis CI logs, and all I can see is the fact that the port was in use when this test attempts to grab port 33358. There could be many explanations for this (including the possibility that there's some kind of ephemeral service in the Travis CI container that that happens to come up between the time that you've called GetFreePort and the time that you want to use that port to start the Tendermint node, and there's a collision with the port you're trying to reserve).

Looking at this again now, I still don't see how changing the GetFreePort() method in the Tendermint code's going to solve that problem. The only way the GetFreePort() method could be causing the problem is if somehow two of the listen addresses here collide, in which case, why not just put a loop in that buildConfig method to validate that the ports aren't the same?

Your retry solution probably makes more sense to implement in a loop over here, where if the Tendermint node creation fails because of a bind error, reconfigure the ports and retry the node creation.

@ruseinov
Copy link
Contributor Author

ruseinov commented May 29, 2019

@thanethomson So here is the thing: when you are binding on port 0 it gives you the next free port, AFAIK "sequentially". What happens here is that the port is somehow busy already when requested, so a retry would give us a different one!

@thanethomson
Copy link
Contributor

GetFreePort doesn't work that way. If it worked that way, you'd be getting this error which is called from here. Instead, you're getting this one.

@ruseinov
Copy link
Contributor Author

My bad, I was looking at another set of tests here as these were more interesting, since tendermint there is getting it's own free port.

About GetFreePort - it kind of does, but I now see why this is happening. It's because we are getting three ports in a row and also after testing when the port is free it gets released so the app could listen to it, so that's when it's possible to grab and reuse it. And I suppose it happens sometimes.

I will fix it on our side and thanks for pointing that out - I got confused by different approaches in setting up tendermint across our suites.

In any case it was a good thing to discuss and I will try to implement some test scenarios when that could possible happen.

@ruseinov
Copy link
Contributor Author

But yeah, I think the ability to get n ports at the same time that are guaranteed to be different is a good fix for this too.

@ruseinov
Copy link
Contributor Author

I have taken a look at this again and tried fixing this on our side, but that means duplicating a lot of functionality from rpctest lib.

I now know what would help:

  1. for rpctest.GetConfig() to use a method that binds to ports while generating all of them, so that there are no collisions between these ports.

Something like that:

func GetFreePorts(n int) ([]int, error) {
	ports := make([]int, n)

	for k, _ := range ports {
		addr, err := net.ResolveTCPAddr("tcp", "localhost:0")
		if err != nil {
			return ports, err
		}

		l, err := net.ListenTCP("tcp", addr)
		if err != nil {
			return ports, err
		}
		// This is done on purpose - we want to keep ports
		// busy to avoid collisions when getting the next one
		defer func() { _ = l.Close() }()
		ports[k] = l.Addr().(*net.TCPAddr).Port
	}

	return ports, nil
}
  1. The ability to use rpctest.GetConfig() functionality without it being cached, so maybe a duplicate of this method that is not cached, like `GetConfigNoCache() which would also be used for the method that does the caching. Otherwise we have to duplicate some of the functionality to properly test our apps.

Does that make sense?

The second point is actually important whether or not we implement the first one.

@thanethomson
Copy link
Contributor

Makes sense - definitely. On your second point, and after taking a look at the code in Weave yesterday, I realise that one thing we can definitely look at doing better is to perhaps provide a simpler interface to programmatically configure and bootstrap a Tendermint node.

I don't know if it's such a good idea to rely on the rpctest library, since it's mostly used for internal testing and the API might change more frequently than other APIs.

Let me create an issue regarding this specifically so we can separate out the two issues 👍

@ruseinov
Copy link
Contributor Author

Let me know how to proceed then, if you want the 1st point to be contributed - I'd be glad to do so, as the code is already implemented - just needs to be committed and used in rpctest.
Actually rpctest was working pretty good for us except for caching and the possible port collisions.

@thanethomson
Copy link
Contributor

After thinking about this, I think this issue needs to be superseded by #3690. By providing that API, we provide a more guaranteed way of ensuring that nobody grabs those ports while you're starting up a node, which avoids the need for you to worry about internals like GetFreePort altogether.

In the meantime, I'd suggest perhaps implementing your own GetFreePort equivalent that meets your needs (something like this) until such time that we can schedule in the work on that API in #3690. Unless of course you'd like to build the API 😁

As such, can we close this issue in favour of #3690?

@ruseinov
Copy link
Contributor Author

Yes, I would love to build this api as it’s going to simplify things a lot for us, building it on top of what we have now didn’t really feel great.

Let me talk this over with the guys.

Yep, I think this is a fair deal, let’s just keep one open.

@ruseinov
Copy link
Contributor Author

@ethanfrey ^^

@ruseinov
Copy link
Contributor Author

ruseinov commented Jun 2, 2019

Below is a start api proposal. GetFreePorts is a public function, because it just seems like a nice helper to have, but could be made private if needed. Could even be a private method of ConfigWrapper to avoid having extra func in the namespace.

How does this sound:

type ConfigWrapper struct {
	c *cfg.Config
	rootDir string
}

func(cw *ConfigWrapper) ChainID() string {
	return cw.c.ChainID()
}

func NewConfigWrapper(testName, chainID string) (*ConfigWrapper, error) {
	rootDir := makePathname(testName)
	c := cfg.ResetTestRootWithChainID(rootDir, chainID)

	addrs, err := GetFreePorts(3)
	if err != nil {
		return nil, err
	}
	cw := &ConfigWrapper{c: c, rootDir: rootDir}

	c.P2P.ListenAddress = cw.addr(addrs[0])
	c.RPC.ListenAddress = cw.addr(addrs[1])
	c.RPC.CORSAllowedOrigins = []string{"https://tendermint.com/"}
	c.RPC.GRPCListenAddress = cw.addr(addrs[2])
	c.TxIndex.IndexTags = "app.creator,tx.height" // see kvstore application

	return cw, nil

}

func(_ *ConfigWrapper) addr(port int) string {
	return fmt.Sprintf("tcp://0.0.0.0:%d", port)
}

// GetFreePorts gets n free ports while listening to all of them before they are returned
// to avoid collisions
func GetFreePorts(n int) ([]int, error) {
	ports := make([]int, n)

	for k, _ := range ports {
		addr, err := net.ResolveTCPAddr("tcp", "localhost:0")
		if err != nil {
			return ports, err
		}

		l, err := net.ListenTCP("tcp", addr)
		if err != nil {
			return ports, err
		}
		// This is done on purpose - we want to keep ports
		// busy to avoid collisions when getting the next one
		defer func() { _ = l.Close() }()
		ports[k] = l.Addr().(*net.TCPAddr).Port
	}

	return ports, nil
}


// returns underlying Config, which could be directly modified
func(cw *ConfigWrapper) Config() *cfg.Config {
	return cw.c
}

// Cleanup makes sure nothing is left on the disk after tests have been run
func(cw *ConfigWrapper) Cleanup() {
	os.RemoveAll(cw.rootDir)
	// ...
}

@thanethomson
Copy link
Contributor

Thanks for this, but what I'd recommend is rather closing this issue and addressing this problem entirely in #3690. One of the big reasons is that even with GetFreePorts guaranteeing no collisions during its own allocation of multiple ports, we can't guarantee that one of the allocated ports won't be grabbed by an external process between the time we call GetFreePorts and the time we start the Tendermint node.

Hence my reasoning as to why this should be superseded by an API that handles both the port allocation and the starting of the Tendermint node.

@ruseinov ruseinov closed this as completed Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T:test Type: Tests that need love
Projects
None yet
Development

No branches or pull requests

2 participants