I think I ran across this problem earlier today in our Cassandra cluster.
The new onConnect method waits until all the callbacks in the pool have completed before returning the keyspace.
However, if you have one node which is down, its callback will generally be the last one to fire, so the final callback comes back without a keyspace.
Ideally, I'd like to have the first good connection call back immediately, but as @hpainter notes in dcf4a09, that complicates the use call. I'm not sure of a way around it right at this moment, but I'll think on it for a couple of days. I submitted a fix which should work and doesn't break the tests. Let me know if you guys have thoughts.
Why can't we have it call back once the first valid connection is up? Maybe the connection just won't get added to the pool if use hasn't been called yet. If someone has a rather large cluster, waiting on every node may take a while, esp. if several of the nodes are down. Also, down the road we are going to implement cluster auto-detection, which is something I've been meaning to do.
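To make the idea concrete, here is a rough sketch of calling back on the first valid connection. This is not the library's actual code: connectFirstGood and the connectFns array of node-style connect functions are hypothetical names made up for illustration, and the use step is left out entirely.

```javascript
// Hypothetical helper, NOT the library's API. Each entry in `connectFns`
// is assumed to be a function taking a node-style callback (err, connection).
function connectFirstGood(connectFns, callback) {
  var pending = connectFns.length; // connections we are still waiting on
  var called = false;              // guards against calling back twice
  var good = [];                   // connections that came up successfully
  var lastError = null;

  if (pending === 0) {
    callback(new Error('No hosts to connect to'));
    return good;
  }

  connectFns.forEach(function (connect) {
    connect(function (err, connection) {
      pending -= 1;

      if (err) {
        lastError = err;
      } else {
        good.push(connection);
        // The first good connection wins; later ones just join the list.
        if (!called) {
          called = true;
          callback(null, connection);
        }
      }

      // Report an error only once every host has failed.
      if (pending === 0 && !called) {
        called = true;
        callback(lastError);
      }
    });
  });

  return good;
}
```

With this shape a down node only delays the caller when every node is down, since an error is reported only after the last pending connection has failed.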
Yeah, I definitely agree that it should use the first valid connection. When I implemented it using the first connection, the tests would time out on the thriftHostPool test. I'm still not exactly sure why, so I'll continue to look into it this week. Hopefully I'll have something soon which passes all the tests and passes back a good connection more quickly.
Okay, I've updated the pull request to return the first good connection, with the tests all passing with no timeouts. I believe the earlier timeouts were caused by not closing new connections which open after the pool has been created and has already started closing.
Overall, the closing connection state on the pool makes me a little nervous because there are a lot of places to keep track of it, and I'm not sure if we've caught them all yet.
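One way to shrink the number of places that have to know about the closing state is to route every newly opened connection through a single method that checks it. The following is only a sketch of that idea, not what the pull request does: SketchPool, add, and close are hypothetical names, and connections are assumed to expose a close() method.

```javascript
// Hypothetical pool sketch; names are illustrative only.
function SketchPool() {
  this.clients = [];
  this.closing = false;
}

// Called whenever a connection finishes opening, which may happen
// after close() has already been called on the pool.
SketchPool.prototype.add = function (connection) {
  if (this.closing) {
    // A late arrival: close it right away instead of leaking it.
    connection.close();
    return false;
  }
  this.clients.push(connection);
  return true;
};

// Marks the pool as closing and closes everything currently held.
SketchPool.prototype.close = function () {
  this.closing = true;
  this.clients.forEach(function (connection) {
    connection.close();
  });
  this.clients = [];
};
```

Because add is the only entry point into the pool, the closing check lives in one spot rather than in each connection handler.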
I played around a little with extending the timeout between when the connection is made and when the tests run, and it passed them all.
I'm not sure, but my local test environment times out on the CQL3 tests. It could be due to my setup, but we may want to look into it. I'm currently running DSC 1.1.0 in a single-node setup locally.
Hmm, I'll check it out. I'm going to spin up a 3-node cluster today using ccm to try and fix how our app handles failures. I didn't see any timeouts using the Apache 1.1.5 version on a single node, but I'll keep poking around.
Closed, should be running latest stable