Fix connections with no messages sent during node outage #940

rukai · 2022-11-25T07:13:53Z

Changes:

prereq PR - cassandra_int_tests: Refactor CassandraConnection execute methods #943
cassandra errors that are generated in response to internal shotover errors are now Overloaded errors instead of Server errors. This is needed because the cassandra_cpp drivers will terminate the connection on Server errors but will allow the user to retry a failed message when Overloaded occurs.
The cassandra connection abstraction is refactored to allow for error handling:
- Internal errors are now returned as a ResponseError instead of silently being converted to a cassandra error message.
- The ResponseError contains the connections destination allowing the transform to recreated dead connections
- The stream_id is only required in the case of errors allowing us to replace the Response struct with a simple type alias of Result<Message, ResponseError>
test_node_going_down is completely refactored to make adding unique test cases easy
- The test logic is now split up across multiple functions that can be optionally called for each case.
- cluster_single_rack_node_lost is removed because test_node_going_down now tests every case within a single call.
- The refactor enabled me to add 2 new test cases to reproduce the problem this PR fixes

I'm opening this up for review despite the unmerged prereq, the changes in helpers/cassandra.rs can be just ignored as they are a subset of those in #943

conorbros · 2022-11-29T10:47:07Z

cassandra errors that are generated in response to internal shotover errors are now Overloaded errors instead of Server errors. This is needed because the cassandra_cpp drivers will terminate the connection on Server errors but will allow the user to retry a failed message when Overloaded occurs.

Why do we want the user to retry a failed message when the error is internal to shotover?

rukai · 2022-11-29T22:17:25Z

Why do we want the user to retry a failed message when the error is internal to shotover?

An internal shotover error is really just the destination DB did something unexpected, in this case its that the connection was lost. The change in this PR recreates the connection when its lost but we don't retry internally and expect the user to do that for us.
The reason I didn't attempt internal retries: Some queries are non idempotent so we need extra complexity to determine which errors can be retried or not.

But I think you could be onto something.
I'm a little concerned that returning Overloaded could lead to retrying of non idempotent queries, I will change the implementation back to Server errors and skip this part of the test for cassandra_cpp driver.
I will need to investigate if the java driver has the same undesirable Server error handling, but I think its safe to land this PR regardless of what the java driver does and then possibly follow up with more tweaks.

benbromhead · 2022-12-07T21:58:35Z

I think returning overloaded is not quite right in this situation. Overloaded indicates to the client that the query never got accepted by a coordinator. Whereas a SERVER error indicates an unknown state, which even with an internal shotover error, could still occur (e.g. we flush the frame to the tcp buffer upstream, but then something busts internally).

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

rukai · 2022-12-07T23:46:19Z

@benbromhead that was the conclusion I had come to as well.
The code already takes that approach, just needs your approval.

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

Oooooh maybe we can configure the cassandra_cpp driver to retry on Server error then.

benbromhead · 2022-12-08T03:22:53Z

@benbromhead that was the conclusion I had come to as well. The code already takes that approach, just needs your approval.

me read good

The Cassandra drivers all have user configurable retry policies, which also allow for retrying on server errors.

Oooooh maybe we can configure the cassandra_cpp driver to retry on Server error then.

Yup we should be able to if it follows the patterns that all the other datastax based drivers do

…_node_outage

rukai force-pushed the fix_connections_with_no_messages_sent_during_node_outage branch 3 times, most recently from 16c23a3 to a42fec1 Compare November 29, 2022 00:55

rukai marked this pull request as ready for review November 29, 2022 04:41

rukai requested a review from conorbros November 29, 2022 04:41

conorbros approved these changes Nov 30, 2022

View reviewed changes

benbromhead approved these changes Dec 8, 2022

View reviewed changes

Fix connections with no messages sent during node outage

b104da0

rukai force-pushed the fix_connections_with_no_messages_sent_during_node_outage branch from 3f2ef35 to b104da0 Compare December 8, 2022 04:04

rukai enabled auto-merge (squash) December 8, 2022 04:05

Merge branch 'main' into fix_connections_with_no_messages_sent_during…

58c99f3

…_node_outage

rukai merged commit 8e24912 into shotover:main Dec 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix connections with no messages sent during node outage #940

Fix connections with no messages sent during node outage #940

rukai commented Nov 25, 2022 •

edited

conorbros commented Nov 29, 2022

rukai commented Nov 29, 2022 •

edited

benbromhead commented Dec 7, 2022

rukai commented Dec 7, 2022

benbromhead commented Dec 8, 2022

Fix connections with no messages sent during node outage #940

Fix connections with no messages sent during node outage #940

Conversation

rukai commented Nov 25, 2022 • edited

conorbros commented Nov 29, 2022

rukai commented Nov 29, 2022 • edited

benbromhead commented Dec 7, 2022

rukai commented Dec 7, 2022

benbromhead commented Dec 8, 2022

rukai commented Nov 25, 2022 •

edited

rukai commented Nov 29, 2022 •

edited