auto-388: connect to other nodes if one node fails #17

Merged
merged 10 commits into master on Aug 1, 2013

Conversation

manishtomar (Contributor)

When a node fails to connect, it tries to connect to all nodes in the cluster and fails only after trying the whole cluster x number of times.

- Tests not yet completed
- Need advice on the code marked REVIEW below:
if client_i >= num_clients:
    if tries >= self.MAX_TRIES:
        return failure
# REVIEW: would like to take reactor as arg but that will change signature of this
Review comment (Contributor) on the REVIEW line above: make the reactor an instance variable passed into __init__; also take max_tries and interval as arguments.
@manishtomar (Contributor Author)

retest this please

Manish Tomar added 2 commits July 5, 2013 00:54
The recent commit 706bab about round robin broke the tests; adjusted them.
Also added docstrings in other places.
dreid (Contributor) commented Jul 10, 2013

  1. I would not conflate retrying to connect to the whole cluster with trying to connect to each seed node in the cluster. I would delegate retrying to a separate wrapper that can be used around the RoundRobinCassandraCluster object or around a regular CQLClient.

  2. This will not connect to each seed; rather, it will make len(seeds) connection attempts, and those attempts will not necessarily cover each seed when multiple requests are in flight.

Here is a sequence diagram for the simple case of a single actor (named bob) talking to the cluster: [Single Actor diagram]

Here is a diagram for the case of two actors (named bob and alice): [Two Actors diagram]

In the second example, bob got cass0 out of the pool, and alice got cass1 while bob was still trying to connect to cass0. When cass0 failed, bob skipped to cass1, which was the only node in this scenario that was serving requests.

This happens because the 'pool' is just a simple counter. Initially, counter=0.

  1. bob got 0 and did (0 + 1) % 3; counter=1
  2. alice got 1 and did (1 + 1) % 3; counter=2
  3. bob got 2 and did (2 + 1) % 3; counter=0
  4. bob got 0 and did (0 + 1) % 3; counter=1

Bob has now tried to connect 3 times, but only to two unique servers.

Maybe in practice, under load with many requests in flight, this comes out in the wash. I don't know, but this algorithm doesn't do exactly what you'd think it would.
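
To make the walkthrough above concrete, here is a minimal, hypothetical sketch of that counter behaviour (an illustration only, not the silverberg code):

class NaiveRoundRobinPool(object):
    # The 'pool' is just a shared counter over the seed indexes.
    def __init__(self, num_seeds):
        self._num_seeds = num_seeds
        self._counter = 0

    def next_index(self):
        index = self._counter
        self._counter = (index + 1) % self._num_seeds
        return index


pool = NaiveRoundRobinPool(3)
bob_1 = pool.next_index()    # bob gets 0; counter=1
alice_1 = pool.next_index()  # alice gets 1; counter=2
bob_2 = pool.next_index()    # bob retries and gets 2; counter=0
bob_3 = pool.next_index()    # bob retries again and gets 0 -- a repeat
assert [bob_1, bob_2, bob_3] == [0, 2, 0]  # three tries, only two unique seeds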

@manishtomar (Contributor Author)

OMG, brilliant catch @dreid. Thanks for catching a bug that would've been super irritating if it had gone through. I haven't gone through the whole comment yet; will read it and let you know. Thanks again.

Manish Tomar added 2 commits July 15, 2013 11:29
Instead, internally increments the client index so as not to accidentally allow some other caller to use the index
while it is hopping between cass nodes. While moving between cass nodes, it sets the index.
dreid (Contributor) commented Jul 15, 2013

OK, so this does two things: it retries requests (if a request can't be fulfilled on clusterA, it retries on clusterA) and it redistributes requests (if a request can't be fulfilled on nodeA, it tries it on nodeB).

I would like to see the request-retrying functionality broken out into a separate wrapper class that can be wrapped around either a CQLClient or a RoundRobinCassandraCluster. That should also make the functionality easier to test. I mentioned this as #1 in my last comment.

In addition, you should add a test case that actually exercises the previous race condition. Ideally you would do this by writing a test that fails with the old implementation at 9c4513c and succeeds with the implementation added in 8d37996.

@manishtomar (Contributor Author)

Would it help to have connecting to the cluster as a separate function that gets retried when the whole-cluster connection fails?

dreid (Contributor) commented Jul 18, 2013

On Composition

from twisted.internet import task
from twisted.internet.error import ConnectError


class IntervalRetryingCQLClient(object):
    def __init__(self, reactor, client, interval, max_retries):
        self._reactor = reactor
        self._client = client
        self._interval = interval
        self._max_retries = max_retries

    def execute(self, query, params, consistency):
        def _maybe_retry(failure, retries):
            # Only retry connection failures; anything else propagates.
            failure.trap(ConnectError)
            retries += 1
            d2 = task.deferLater(self._reactor, self._interval,
                                 self._client.execute, query, params, consistency)
            if retries <= self._max_retries:
                d2.addErrback(_maybe_retry, retries)
            return d2

        d = self._client.execute(query, params, consistency)
        d.addErrback(_maybe_retry, 0)
        return d

What I'm advocating is that we break the functionality of retrying a CQL query out into a separate class that has only that responsibility, something like the above.

This has a few advantages:

  1. It works with CQLClient and RoundRobinCassandraCluster: IntervalRetryingCQLClient(reactor, CQLClient(…), …) and IntervalRetryingCQLClient(reactor, RoundRobinCassandraCluster(…), …)
  2. It's very clear how to disable retrying. (Simply do not instantiate an IntervalRetryingCQLClient.)
  3. It's possible to selectively use retrying for only certain queries. (Simply wrap a long-lived object that has an execute method in an IntervalRetryingCQLClient at the point where you know you want to retry.)
  4. It allows other implementations of retry strategies, such as exponential backoff, to work with the existing classes (see the sketch below).

It's no accident that CQLClient and RoundRobinCassandraCluster have the exact same interface (primarily an execute method.) This was an intentional choice to make it easy to add features (like logging/timing, and retrying) via composition.
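
As an illustration of point 4, a hypothetical exponential-backoff wrapper (the class name and details here are illustrative only, not something in this PR) could expose the same execute method and drop in anywhere IntervalRetryingCQLClient does:

from twisted.internet import task
from twisted.internet.error import ConnectError


class ExponentialBackoffRetryingCQLClient(object):
    def __init__(self, reactor, client, initial_interval, max_retries):
        self._reactor = reactor
        self._client = client
        self._initial_interval = initial_interval
        self._max_retries = max_retries

    def execute(self, query, params, consistency):
        def _maybe_retry(failure, attempt):
            # Only retry connection failures; give up after max_retries.
            failure.trap(ConnectError)
            if attempt > self._max_retries:
                return failure
            delay = self._initial_interval * (2 ** (attempt - 1))  # 1x, 2x, 4x, ...
            d2 = task.deferLater(self._reactor, delay,
                                 self._client.execute, query, params, consistency)
            d2.addErrback(_maybe_retry, attempt + 1)
            return d2

        d = self._client.execute(query, params, consistency)
        d.addErrback(_maybe_retry, 1)
        return d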

There is a pretty good talk from pycon2013 about composition vs inheritance which you may find interesting: http://pyvideo.org/video/1684/the-end-of-object-inheritance-the-beginning-of

If you do not want to do this work in this PR feel free to simply remove the retry functionality you've added to RoundRobinCassandraCluster and ask for another review.

On trying the next server

There is a reasonable concern that simply calling execute on the next CQLClient could cause a non-idempotent query (such as an UPDATE that increments a counter or appends an entry to a list) to be executed multiple times.

A different approach to this problem rather than retrying would be to let the initial query fail and simply blacklist the bad client node for a period of time.

In this way the application would be able to decide if it should retry the query.
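
A rough, hypothetical sketch of that blacklisting approach (the class name and details are illustrative only, not part of this PR): the query that hit the bad node still fails, but that node is skipped for a while, and the application decides whether to retry.

from twisted.internet.error import ConnectError


class BlacklistingCassandraCluster(object):
    def __init__(self, reactor, clients, blacklist_period):
        self._reactor = reactor                # provides seconds()
        self._clients = clients                # list of CQLClient-like objects
        self._blacklist_period = blacklist_period
        self._blacklisted_until = {}           # client -> timestamp
        self._index = 0

    def _next_client(self):
        # Round-robin over clients, skipping any that are currently blacklisted.
        # Assumes at least one client.
        now = self._reactor.seconds()
        client = None
        for _ in range(len(self._clients)):
            client = self._clients[self._index]
            self._index = (self._index + 1) % len(self._clients)
            if self._blacklisted_until.get(client, 0) <= now:
                return client
        return client  # everything is blacklisted; fall back to the last one

    def execute(self, query, params, consistency):
        client = self._next_client()

        def _blacklist(failure):
            failure.trap(ConnectError)
            self._blacklisted_until[client] = (
                self._reactor.seconds() + self._blacklist_period)
            return failure  # the query still fails; the caller may retry it

        d = client.execute(query, params, consistency)
        d.addErrback(_blacklist)
        return d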

Manish Tomar added 4 commits July 31, 2013 17:23
- removed clock and other retry args.
- updated tests accordingly

Yet to add a test for the race condition @dreid mentioned.
Conflicts:
	silverberg/cluster.py
	silverberg/test/test_cluster.py
dreid (Contributor) commented Aug 1, 2013

+1

manishtomar added a commit that referenced this pull request Aug 1, 2013
AUTO-388: connect to other nodes if one node fails
manishtomar merged commit 94a63c0 into master Aug 1, 2013