auto-388: connect to other nodes if one node fails #17
Conversation
- Tests not yet completed
- Need advice on the REVIEW comments in the code
```python
if client_i >= num_clients:
    if tries >= self.MAX_TRIES:
        return failure
# REVIEW: would like to take reactor as arg but that will change signature of this
```
Made the reactor an instance variable passed into `__init__`, and am also taking `max_tries` and `interval` as arguments.
retest this please
Recent commit 706bab about round robin broke the tests. Adjusted them. Also added docstrings in other places
Here is a sequence diagram for the simple case of a single actor (named bob) talking to the cluster: [Single Actor] And here is a diagram for the case of two actors (named bob and alice): [Two Actors]

In the second example, bob got cass0 out of the pool, and alice got cass1 while bob was still trying to connect to cass0. Then cass0 failed and bob skipped to cass1, which was the only node in this scenario that was serving requests. This happens because the 'pool' is just a simple counter.

Bob has now tried to connect 3 times, but only to two unique servers. Maybe in practice, under load with all the requests in flight, this comes out in the wash. I don't know, but this algorithm doesn't do exactly what you'd think it would.
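The shared-counter behaviour described above can be sketched as follows (all names here are illustrative, not the actual silverberg code):

```python
# Hypothetical sketch of the "pool is just a simple counter" behaviour:
# every caller advances the same counter, so one actor's retries do not
# necessarily walk unique servers.
class CounterPool(object):
    def __init__(self, nodes):
        self._nodes = nodes
        self._counter = 0

    def next_node(self):
        node = self._nodes[self._counter % len(self._nodes)]
        self._counter += 1
        return node

pool = CounterPool(['cass0', 'cass1'])
bob_first = pool.next_node()    # bob takes cass0
alice_first = pool.next_node()  # alice takes cass1 while bob is still connecting
bob_retry = pool.next_node()    # cass0 failed; bob's "next" node is cass0 again
```

Because alice advanced the counter in between, bob's retry wraps back around to a node he has already tried instead of stepping to a fresh one.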
OMG, brilliant catch @dreid. Thanks for catching a bug that would've been super irritating if it had gone through. I haven't gone through the whole comment yet; will read and let you know. Thanks again.
Instead, it now increments the client index internally, so that some other caller can't accidentally use the index while it is hopping between cass nodes. Only once it has moved to a node does it set the index.
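A minimal sketch of that idea (hypothetical names, not the real silverberg implementation): hop over nodes using a local copy of the index, and only write it back to the shared counter after a working node is found, so concurrent callers never observe the in-between values.

```python
class RoundRobinPool(object):
    """Hypothetical sketch: keep the hop index local while probing nodes,
    and commit it to the shared counter only after success."""

    def __init__(self, nodes):
        self._nodes = nodes
        self._index = 0

    def pick_working(self, is_up):
        local = self._index
        for _ in range(len(self._nodes)):
            node = self._nodes[local % len(self._nodes)]
            local += 1
            if is_up(node):
                self._index = local  # commit only once a node has been found
                return node
        raise RuntimeError('no working node in the cluster')

pool = RoundRobinPool(['cass0', 'cass1', 'cass2'])
first = pool.pick_working(lambda n: n != 'cass0')   # skips the dead cass0
second = pool.pick_working(lambda n: n != 'cass0')  # round robin continues
```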
Ok, so this does two things: it retries requests (if a request can't be fulfilled on clusterA, it retries on clusterA) and it redistributes requests (if a request can't be fulfilled on nodeA, it tries it on nodeB). I would like to see the request-retrying functionality broken out into a separate wrapper class that can be wrapped around either a `CQLClient` or a `RoundRobinCassandraCluster`. In addition, you should add a test case that actually exercises the previous race condition. Ideally you would do this by writing a test that fails with the old implementation at 9c4513c and succeeds with the implementation added in 8d37996.
Would it help to have connecting to the cluster as a separate function that gets retried when the whole cluster connection fails?
**On Composition**

```python
from twisted.internet import task
from twisted.internet.error import ConnectError


class IntervalRetryingCQLClient(object):
    """Wraps any client with an ``execute`` method and retries a failed
    query after ``interval`` seconds, up to ``max_retries`` times."""

    def __init__(self, reactor, client, interval, max_retries):
        self._reactor = reactor
        self._client = client
        self._interval = interval
        self._max_retries = max_retries

    def execute(self, query, params, consistency):
        def _maybe_retry(failure, retries):
            failure.trap(ConnectError)
            if retries >= self._max_retries:
                return failure
            d2 = task.deferLater(self._reactor, self._interval,
                                 self._client.execute,
                                 query, params, consistency)
            d2.addErrback(_maybe_retry, retries + 1)
            return d2

        d = self._client.execute(query, params, consistency)
        d.addErrback(_maybe_retry, 0)
        return d
```

What I'm advocating for is that we break the functionality of retrying a CQL query out into a separate class that has only that responsibility, something like the above. This has a few advantages.
It's no accident that `CQLClient` and `RoundRobinCassandraCluster` have the exact same interface (primarily an `execute` method). This was an intentional choice to make it easy to add features (like logging/timing, and retrying) via composition. There is a pretty good talk from PyCon 2013 about composition vs inheritance which you may find interesting: http://pyvideo.org/video/1684/the-end-of-object-inheritance-the-beginning-of

If you do not want to do this work in this PR, feel free to simply remove the retry functionality you've added.

**On trying the next server**

There is a reasonable concern that simply calling `execute` on the next `CQLClient` could cause a non-idempotent query (such as an UPDATE that increments a counter or appends an entry to a list) to be executed multiple times. A different approach to this problem, rather than retrying, would be to let the initial query fail and simply blacklist the bad client node for a period of time. In this way the application would be able to decide if it should retry the query.
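The blacklisting alternative could be sketched like this, in plain Python with an injected clock standing in for the Twisted reactor (all names here are hypothetical, not an existing silverberg API):

```python
class NodeBlacklist(object):
    """Hypothetical sketch: record when each failed node becomes usable
    again, and filter it out of the candidate list until then."""

    def __init__(self, clock, blacklist_time):
        self._clock = clock                  # callable returning seconds
        self._blacklist_time = blacklist_time
        self._until = {}                     # node -> time it becomes usable

    def mark_bad(self, node):
        self._until[node] = self._clock() + self._blacklist_time

    def live(self, nodes):
        now = self._clock()
        return [n for n in nodes if self._until.get(n, 0) <= now]

# usage with a fake clock
now = [0.0]
bl = NodeBlacklist(lambda: now[0], blacklist_time=30)
bl.mark_bad('cass0')
during = bl.live(['cass0', 'cass1'])   # cass0 is blacklisted for 30 seconds
now[0] = 31.0
after = bl.live(['cass0', 'cass1'])    # the blacklist window has passed
```

The failing query still errors out to the application, which then decides whether to retry; the wrapper only keeps subsequent queries away from the bad node for a while.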
- Removed clock and other retry args
- Updated tests accordingly

Yet to add a test for the race condition @dreid mentioned.
Conflicts:
    silverberg/cluster.py
    silverberg/test/test_cluster.py
+1
AUTO-388: connect to other nodes if one node fails
When a node fails to connect, the client tries to connect to the other nodes in the cluster, and only fails after trying the whole cluster x number of times.
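That behaviour amounts to something like the following synchronous sketch (made-up names, not the Deferred-based code in this PR):

```python
def connect_with_failover(nodes, try_connect, max_passes):
    """Hypothetical sketch: step through every node in the cluster,
    giving up only after max_passes full passes over the whole cluster."""
    attempts = 0
    while attempts < max_passes * len(nodes):
        node = nodes[attempts % len(nodes)]
        if try_connect(node):
            return node
        attempts += 1
    raise RuntimeError('all %d nodes failed after %d attempts'
                       % (len(nodes), attempts))

# cass2 is the only node accepting connections in this example
node = connect_with_failover(['cass0', 'cass1', 'cass2'],
                             lambda n: n == 'cass2', max_passes=2)
```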