Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix connections with no messages sent during node outage #940

Conversation

rukai
Copy link
Member

@rukai rukai commented Nov 25, 2022

Changes:

  • prereq PR - cassandra_int_tests: Refactor CassandraConnection execute methods #943
  • cassandra errors that are generated in response to internal shotover errors are now Overloaded errors instead of Server errors. This is needed because the cassandra_cpp drivers will terminate the connection on Server errors but will allow the user to retry a failed message when Overloaded occurs.
  • The cassandra connection abstraction is refactored to allow for error handling:
    • Internal errors are now returned as a ResponseError instead of silently being converted to a cassandra error message.
    • The ResponseError contains the connections destination allowing the transform to recreated dead connections
    • The stream_id is only required in the case of errors allowing us to replace the Response struct with a simple type alias of Result<Message, ResponseError>
  • test_node_going_down is completely refactored to make adding unique test cases easy
    • The test logic is now split up across multiple functions that can be optionally called for each case.
    • cluster_single_rack_node_lost is removed because test_node_going_down now tests every case within a single call.
    • The refactor enabled me to add 2 new test cases to reproduce the problem this PR fixes

I'm opening this up for review despite the unmerged prereq, the changes in helpers/cassandra.rs can be just ignored as they are a subset of those in #943

@rukai rukai force-pushed the fix_connections_with_no_messages_sent_during_node_outage branch 3 times, most recently from 16c23a3 to a42fec1 Compare November 29, 2022 00:55
@rukai rukai marked this pull request as ready for review November 29, 2022 04:41
@rukai rukai requested a review from conorbros November 29, 2022 04:41
@conorbros
Copy link
Member

cassandra errors that are generated in response to internal shotover errors are now Overloaded errors instead of Server errors. This is needed because the cassandra_cpp drivers will terminate the connection on Server errors but will allow the user to retry a failed message when Overloaded occurs.

Why do we want the user to retry a failed message when the error is internal to shotover?

@rukai
Copy link
Member Author

rukai commented Nov 29, 2022

Why do we want the user to retry a failed message when the error is internal to shotover?

An internal shotover error is really just the destination DB did something unexpected, in this case its that the connection was lost. The change in this PR recreates the connection when its lost but we don't retry internally and expect the user to do that for us.
The reason I didn't attempt internal retries: Some queries are non idempotent so we need extra complexity to determine which errors can be retried or not.

But I think you could be onto something.
I'm a little concerned that returning Overloaded could lead to retrying of non idempotent queries, I will change the implementation back to Server errors and skip this part of the test for cassandra_cpp driver.
I will need to investigate if the java driver has the same undesirable Server error handling, but I think its safe to land this PR regardless of what the java driver does and then possibly follow up with more tweaks.

@benbromhead
Copy link
Member

I think returning overloaded is not quite right in this situation. Overloaded indicates to the client that the query never got accepted by a coordinator. Whereas a SERVER error indicates an unknown state, which even with an internal shotover error, could still occur (e.g. we flush the frame to the tcp buffer upstream, but then something busts internally).

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

@rukai
Copy link
Member Author

rukai commented Dec 7, 2022

@benbromhead that was the conclusion I had come to as well.
The code already takes that approach, just needs your approval.

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

Oooooh maybe we can configure the cassandra_cpp driver to retry on Server error then.

@benbromhead
Copy link
Member

@benbromhead that was the conclusion I had come to as well. The code already takes that approach, just needs your approval.

me read good

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

Oooooh maybe we can configure the cassandra_cpp driver to retry on Server error then.

Yup we should be able to if it follows the patterns that all the other datastax based drivers do

@rukai rukai force-pushed the fix_connections_with_no_messages_sent_during_node_outage branch from 3f2ef35 to b104da0 Compare December 8, 2022 04:04
@rukai rukai enabled auto-merge (squash) December 8, 2022 04:05
@rukai rukai merged commit 8e24912 into shotover:main Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants