Having a bad node will callback with a bad keyspace #77

Closed
calvinfo opened this Issue Oct 1, 2012 · 7 comments

Contributor

calvinfo commented Oct 1, 2012

I think I ran across this problem earlier today in our Cassandra cluster.

The new onConnect method waits until all the callbacks in the pool have completed before returning the keyspace.

However, if one node is down, it will generally be the last to respond, and its callback then fires without a keyspace.

Ideally, I'd like to have the first good connection call back immediately, but as @hpainter notes in dcf4a09, that complicates the `use` call. I'm not sure of a way around it at the moment, but I'll think on it for a couple of days. I submitted a fix which should work and doesn't break the tests. Let me know if you guys have thoughts.
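To make the failure mode concrete, here is a minimal sketch, assuming a hypothetical `openConnection(host, cb)` helper and a callback-counting pool; it is not the library's actual code, just the shape of the bug described above:

```js
// Hypothetical sketch of the bug: the pool only calls back once
// every host has reported, so the slowest host decides the result.
function connect(hosts, callback) {
  var remaining = hosts.length;

  hosts.forEach(function (host) {
    openConnection(host, function (err, connection) {
      remaining -= 1;
      if (remaining === 0) {
        // A down host usually times out and therefore reports last,
        // so the caller receives that host's error and no keyspace,
        // even though earlier hosts connected fine.
        callback(err, connection);
      }
    });
  });
}
```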

calvinfo referenced this issue Oct 1, 2012

Merged

Connect fix #78

Owner

devdazed commented Oct 1, 2012

Why can't we have it call back once the first valid connection is up? Maybe the connection just won't get added to the pool until `use` has been called on it. If someone has a rather large cluster, connecting may take a while, especially if several of the nodes are down. Also, down the road we're going to implement cluster auto-detection, something I've been meaning to do.
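A hedged sketch of that proposal, again using assumed names (`ConnectionPool`, `openConnection`, `connection.use`) rather than the library's actual API: call back as soon as the first connection has a keyspace set, and let stragglers join the pool afterwards.

```js
// Call back on the first usable connection; later connections run
// `use` themselves and then join the pool, instead of blocking the
// initial callback. Error handling for "all hosts down" is omitted.
ConnectionPool.prototype.connect = function (keyspace, callback) {
  var self = this;
  var calledBack = false;

  this.hosts.forEach(function (host) {
    openConnection(host, function (err, connection) {
      if (err) { return; }  // skip hosts that are down

      // Set the keyspace before exposing the connection, so the
      // caller never sees a connection without one.
      connection.use(keyspace, function (useErr) {
        if (useErr) { return; }
        self.connections.push(connection);
        if (!calledBack) {
          calledBack = true;
          callback(null, connection); // first valid connection wins
        }
      });
    });
  });
};
```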

Contributor

calvinfo commented Oct 1, 2012

Yeah, I definitely agree that it should use the first valid connection. When I implemented it that way, the tests would time out on the thriftHostPool test. I'm still not exactly sure why, so I'll keep looking into it this week. Hopefully I'll have something soon that passes all the tests and hands back a good connection more quickly.

Owner

devdazed commented Oct 1, 2012

awesome

Contributor

calvinfo commented Oct 2, 2012

Okay, I've updated the pull request to return the first good connection, and the tests all pass with no timeouts. I believe the earlier timeouts were caused by not closing new connections that open after the pool has been created and has started closing.

Overall, the closing-connection state on the pool makes me a little nervous; there are a lot of places where it has to be tracked, and I'm not sure we've caught them all yet.

I also played around with extending the timeout between when the connection is made and when the tests run, and everything still passed.
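For reference, a sketch of the closing-state handling being described, with assumed names (`ConnectionPool`, `closing`, `onConnectionOpen`); the actual fix lives in #78:

```js
// Track a `closing` flag on the pool; any connection that finishes
// opening after close() was requested is shut down immediately
// instead of joining the pool. A connection missed here is the kind
// of straggler that can keep the process alive and time tests out.
ConnectionPool.prototype.close = function () {
  this.closing = true;
  this.connections.forEach(function (connection) {
    connection.close();
  });
};

ConnectionPool.prototype.onConnectionOpen = function (connection) {
  if (this.closing) {
    connection.close(); // pool began closing while this was in flight
    return;
  }
  this.connections.push(connection);
};
```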

Owner

devdazed commented Oct 2, 2012

I'm not sure, but my local test environment times out on the CQL3 tests. It could be due to my setup, but we may want to look into it. I'm currently running DSC 1.1.0 in a single-node setup locally.

Contributor

calvinfo commented Oct 2, 2012

Hmm, I'll check it out. I'm going to spin up a three-node cluster today using ccm to try to fix how our app handles failures. I didn't see any timeouts using Apache Cassandra 1.1.5 on a single node, but I'll keep poking around.

Owner

devdazed commented Oct 18, 2012

Closed; you should be running the latest stable release.

devdazed closed this Oct 18, 2012
