
Topology change resilience for incremental repair #1235

Merged
merged 22 commits into master on Nov 16, 2022

Conversation

Miles-Garnsey
Contributor

@Miles-Garnsey Miles-Garnsey commented Oct 27, 2022

Ensure that, in an incremental repair, the replica list is updated on every repair run. This addresses cases where the node IPs change.

Fixes #1213

@Miles-Garnsey Miles-Garnsey marked this pull request as draft October 27, 2022 01:32
@Miles-Garnsey
Contributor Author

This is still a work in progress. I have written a test (which is passing), but it isn't clear to me that this should actually be working.

  1. On line 470 I make a call to clusterFacade.getRangeToEndpointMap without actually having called .connect.
  2. When extracting the token range from the segment, I only take the first and last tokens. I need to confirm that this still works when vnodes are present (given my limited understanding of what a segment is, I think it may be OK). A rough sketch of the idea follows below.
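For illustration only, here is a rough sketch of refreshing the replicas for a segment from a range-to-endpoint map. The `ClusterFacade` and `RingRange` shapes and the string-keyed map signature are assumptions made for the sketch, not Reaper's actual API; it also shows why the vnode question in point 2 matters, since a naive containment check only handles a segment that falls inside a single range.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Map;

// Rough sketch only: the ClusterFacade and RingRange shapes below are assumptions
// made for illustration, not Reaper's actual API.
public class ReplicaRefreshSketch {

  // Hypothetical stand-in for a repair segment's token range.
  static final class RingRange {
    final BigInteger start;
    final BigInteger end;

    RingRange(BigInteger start, BigInteger end) {
      this.start = start;
      this.end = end;
    }
  }

  // Hypothetical facade: token ranges (as [start, end] string pairs) mapped to
  // replica endpoints, analogous to the getRangeToEndpointMap call mentioned above.
  interface ClusterFacade {
    Map<List<String>, List<String>> getRangeToEndpointMap(String keyspace);
  }

  // Re-derive the replicas for a segment from its first and last tokens.
  static List<String> replicasForSegment(ClusterFacade facade, String keyspace, RingRange segment) {
    for (Map.Entry<List<String>, List<String>> entry
        : facade.getRangeToEndpointMap(keyspace).entrySet()) {
      BigInteger rangeStart = new BigInteger(entry.getKey().get(0));
      BigInteger rangeEnd = new BigInteger(entry.getKey().get(1));
      // Naive containment check: only correct when the segment sits inside a single
      // range, which is exactly the vnode question raised in point 2 above.
      if (rangeStart.compareTo(segment.start) <= 0 && rangeEnd.compareTo(segment.end) >= 0) {
        return entry.getValue();
      }
    }
    return List.of();
  }
}
```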

@Miles-Garnsey
Contributor Author

Miles-Garnsey commented Oct 28, 2022

Manual testing of this change suggests there is still room for improvement, but it does address the specific requirements in the ticket.

First, I tried starting the repair, pausing it, then doing a rolling restart on the cluster. I've included a log below showing the results. The problem there was that no coordinators could be found. test-dc1-reaper-94769775c-npd4n.log

But that problem is subtly different from the one we are trying to address here. The matter at hand is not a full rolling restart (where every IP changes) but the case where a single node's IP changes. I recommend creating another ticket to handle the more extreme case by redoing the DNS lookup, so that the list of potential coordinators is refreshed when an FQDN is used as the contact point (a minimal sketch of that idea follows below).
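As a minimal sketch of that follow-up idea (not Reaper code, and the service FQDN below is made up), refreshing the potential coordinators could come down to re-resolving the contact point with the JDK's resolver, keeping in mind that JVM DNS caching may serve stale results unless tuned.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of re-resolving an FQDN contact point so the list of
// potential coordinators can be rebuilt after a full rolling restart.
public class ContactPointRefresh {

  static List<String> resolveContactPoints(String fqdn) throws UnknownHostException {
    List<String> ips = new ArrayList<>();
    // getAllByName returns every address record behind the name; freshness is
    // subject to the JVM's DNS cache settings (networkaddress.cache.ttl).
    for (InetAddress address : InetAddress.getAllByName(fqdn)) {
      ips.add(address.getHostAddress());
    }
    return ips;
  }

  public static void main(String[] args) throws UnknownHostException {
    // Hypothetical service name; substitute the real contact point.
    System.out.println(resolveContactPoints("test-dc1-service.test.svc.cluster.local"));
  }
}
```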

So to make this test more specific I created a new repair, paused it, took note of the host which the current segment was running against, and restarted just that node (leaving the other two in place).

I do still get the below error, but the repair appears to reschedule correctly:

java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
	at java.base/java.util.Objects.checkIndex(Objects.java:372)
	at java.base/java.util.ArrayList.get(ArrayList.java:459)
	at io.cassandrareaper.service.RepairRunner.isAllowedToRun(RepairRunner.java:267)
	at io.cassandrareaper.service.RepairRunner.run(RepairRunner.java:234)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
	at java.base/java.lang.Thread.run(Thread.java:829)

You'll note from the screenshot below that it does switch to a different coordinator for the second two segments, which I think is the behaviour we want.

[Screenshot: Screen Shot 2022-10-28 at 2.59.15 pm, showing the segments switching coordinator]

@Miles-Garnsey Miles-Garnsey marked this pull request as ready for review October 28, 2022 04:11
@Miles-Garnsey
Contributor Author

Miles-Garnsey commented Nov 8, 2022

We took this back to the drawing board because our JMX calls were not returning any way to identify the primary replica when doing range -> endpoint queries.

This new approach instead tracks the hostID associated with a given segment and looks up the current IP associated with it when starting each run. It then uses that IP as the coordinator to ensure we are hitting the same endpoint even if the IP has changed.
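As a rough sketch of that flow (names and signatures here are illustrative, not the actual Reaper classes): each segment keeps the coordinator's host ID, and the current endpoint is resolved from it just before the segment runs.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Illustrative sketch of the host ID -> current IP lookup described above.
// The ClusterFacade interface below is an assumption for the sketch; Cassandra
// itself exposes endpoint/host ID mappings over JMX.
public class HostIdCoordinatorSketch {

  interface ClusterFacade {
    // endpoint (IP) -> host ID, as reported by the cluster at run time
    Map<String, UUID> getEndpointToHostId();
  }

  // Resolve the segment's stored host ID back to whichever IP currently owns it.
  static Optional<String> currentEndpointForHostId(ClusterFacade facade, UUID segmentHostId) {
    return facade.getEndpointToHostId().entrySet().stream()
        .filter(entry -> segmentHostId.equals(entry.getValue()))
        .map(Map.Entry::getKey)
        .findFirst();
  }

  // The resolved IP is then used as the coordinator for the run, so the repair
  // keeps targeting the same node even after its IP has changed.
  static List<String> potentialCoordinators(ClusterFacade facade, UUID segmentHostId) {
    return currentEndpointForHostId(facade, segmentHostId)
        .map(List::of)
        .orElse(List.of());
  }
}
```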

…e correctly propagated into the potentialCoordinators so that repairs correctly pick them up.
Contributor

@adejanovski adejanovski left a comment


Awesome sauce @Miles-Garnsey! Approved ✅

@adejanovski adejanovski merged commit 7cc571b into master Nov 16, 2022
adejanovski pushed a commit that referenced this pull request Nov 16, 2022
Allows incremental repair to survive nodes changing IP address during the repair.
The hostID is now stored in each segment and the IP address is recomputed from it when the segment runs.

Successfully merging this pull request may close these issues.

Incremental Repair Stalls on Cassandra 4