Fix issues when listing repair runs using state priorities #1307

Merged
merged 6 commits into master from fix-repairs-display-multicluster on Jun 13, 2023

Conversation

@adejanovski adejanovski (Contributor) commented Jun 8, 2023

Fixes #1272

Several changes have been made to RepairRunResource.java, including updating the
getRepairRunsForCluster method to use clusterName instead of cluster_name as a path
parameter, and updating the listRepairRuns method to use clusterName instead of cluster_name
as a query parameter. These changes fix bugs that seem to have existed for a while, but were hidden by the fact that repairs were previously listed in purely chronological order, with client-side filters applied in the UI.
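
As a rough sketch of what the renamed parameters look like in JAX-RS terms (the class skeleton, resource path and method bodies below are illustrative only, not the actual Reaper source):

  import javax.ws.rs.GET;
  import javax.ws.rs.Path;
  import javax.ws.rs.PathParam;
  import javax.ws.rs.QueryParam;
  import javax.ws.rs.core.Response;

  @Path("/repair_run")
  public class RepairRunResourceSketch {

    // Path parameter renamed from {cluster_name} to {clusterName}.
    @GET
    @Path("/cluster/{clusterName}")
    public Response getRepairRunsForCluster(@PathParam("clusterName") String clusterName) {
      // ... fetch all repair runs for the given cluster ...
      return Response.ok().build();
    }

    // Query parameter renamed from cluster_name to clusterName.
    @GET
    public Response listRepairRuns(@QueryParam("clusterName") String clusterName) {
      // ... filter repair runs by cluster name when the parameter is provided ...
      return Response.ok().build();
    }
  }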

In the RepairSegmentDao.java file, the getSegmentAmountForRepairRun method has
been updated to use getRepairSegmentCountByRunIdPrepStmt instead of
getRepairSegmentCountByRunIdAndStatePrepStmt: the latter expects a segment state to be bound in addition to the run id, so calling it with only the run id was a bug.

A bug was also fixed in the RepairRunDao.getRepairRunsForClusterPrioritiseRunning() method, which was not actually applying a limit to the flattenedUuids list: subList() returns a new list rather than modifying the list it is called on, so its result has to be used instead of being discarded.
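
A minimal standalone illustration of that subList() pitfall (the variable name is borrowed from the description above; everything else is hypothetical):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.UUID;

  public class SubListPitfall {
    public static void main(String[] args) {
      List<UUID> flattenedUuids = new ArrayList<>();
      for (int i = 0; i < 10; i++) {
        flattenedUuids.add(UUID.randomUUID());
      }
      int limit = 3;

      // Buggy pattern: subList() returns a new view over the list and leaves
      // the original variable untouched, so this call changes nothing.
      flattenedUuids.subList(0, limit);
      System.out.println(flattenedUuids.size()); // still 10

      // Fix: actually use the returned list (or reassign the variable).
      List<UUID> limited = flattenedUuids.subList(0, Math.min(limit, flattenedUuids.size()));
      System.out.println(limited.size()); // 3
    }
  }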

Additionally, this pull request updates the acceptance tests to add fake
clusters and verify that ongoing repairs are prioritized over finished ones when listing them.
This checks that repair runs across multiple clusters are prioritized based on their state and that, depending on the filters applied, the expected runs are displayed.

@adejanovski adejanovski marked this pull request as ready for review June 8, 2023 17:08
@Miles-Garnsey Miles-Garnsey (Contributor) left a comment

I have a few questions and issues/suggestions here. I haven't manually tested yet, pending what you'd like to do about those.

I'm also confused how the manual testing we did on the last PR failed to catch this issue.

The manual testing I did basically mirrored what was in the previous acceptance test - do you know why that failed to catch the problems? I don't want to miss another issue this time around...

@@ -604,9 +604,9 @@ public Response abortRepairRunSegment(@PathParam("id") UUID repairRunId, @PathPa
   * @return all know repair runs for a cluster.
   */
  @GET
- @Path("/cluster/{cluster_name}")
+ @Path("/cluster/{clusterName}")
Miles-Garnsey (Contributor):

Suggestion: I think this might be the wrong way around. Generally shouldn't URI query strings use snake case? Would it be better to change this in the UI?

Issue: Also, as this is a breaking change to the API, I'm not sure what our approach should be here. Does the API have a versioning policy?

adejanovski (Contributor, Author):

This one is a pathParam, so it doesn't matter that much. The query param from listRepairRuns on the other hand could indeed be a breaking change.
Let me see if I can turn this around.
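
Purely as an illustration of one way such a query-parameter rename could be kept non-breaking (not necessarily what this PR ends up doing), a resource method could accept both spellings and prefer the new one; this is a hypothetical sketch, not the Reaper implementation:

  import javax.ws.rs.GET;
  import javax.ws.rs.Path;
  import javax.ws.rs.QueryParam;
  import javax.ws.rs.core.Response;

  @Path("/repair_run")
  public class ListRepairRunsCompatSketch {

    // Accept both the new and the legacy query parameter name, falling back
    // to the legacy spelling when the new one is not supplied.
    @GET
    public Response listRepairRuns(
        @QueryParam("clusterName") String clusterName,
        @QueryParam("cluster_name") String legacyClusterName) {
      String effectiveCluster = clusterName != null ? clusterName : legacyClusterName;
      // ... filter repair runs by effectiveCluster when it is non-null ...
      return Response.ok().build();
    }
  }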

@@ -175,7 +175,7 @@ private void prepareStatements() {

  public int getSegmentAmountForRepairRun(UUID runId) {
    return (int) session
-       .execute(getRepairSegmentCountByRunIdAndStatePrepStmt.bind(runId))
+       .execute(getRepairSegmentCountByRunIdPrepStmt.bind(runId))
Miles-Garnsey (Contributor):

Question: It looks like this didn't actually have enough parameters bound, how was it ever running?

adejanovski (Contributor, Author):

it was failing :)

Miles-Garnsey (Contributor):

I don't recall ever seeing it in the logs, and I know manual testing worked. Maybe that was on the other endpoint - weird.

@@ -415,16 +415,26 @@ Feature: Using Reaper
Then reaper has no longer the last added cluster in storage
${cucumber.upgrade-versions}

@sidecar
@sidecar
@current_test
Miles-Garnsey (Contributor):

I think you probably want to remove this before merging.

adejanovski (Contributor, Author):

I totally do 👍

Miles-Garnsey previously approved these changes Jun 13, 2023

@Miles-Garnsey Miles-Garnsey (Contributor) left a comment

I've had a tough time testing this, partly because the current behaviour is so dependent on the order in which particular repair runs were added, as well as on whether all clusters or a specific cluster is filtered on.

When testing I:

  1. Built reaper from a clean state on this branch.
  2. Ran the new test.
  3. Switched to main, rebuilt and ran Reaper.

Looking at the assertions here to compare my results, I found that:

  1. Without any filtering on cluster, I got the expected number of paused repairs showing up (I didn't check that the number of ABORTED repairs was correct).
  2. When filtering on the fake1 cluster, I got the expected display of repairs, both paused and aborted.
  3. When filtering on the test cluster, I got the expected paused repairs, but no aborted repairs, which is indeed wrong.
  4. When filtering on fake2, I got one paused and only 3 aborted repairs, where it should have been 1 and 9 respectively.

Given this behaviour on master, I am not convinced that the new acceptance test necessarily replicates the issues described in the ticket, since the issue was reported with no filtering on the clusters.

However, I am satisfied that we are diagnosing a bug, since we are missing aborted repairs that could be shown under the given limit settings. I have confirmed that this behaviour is resolved by the new branch, which (in my manual testing) leads to the expected results being displayed for all three clusters.

Once the @current_test tag is removed from the acceptance tests, this is ready to merge.

@adejanovski adejanovski merged commit 2bfef00 into master Jun 13, 2023
22 checks passed
@adejanovski adejanovski deleted the fix-repairs-display-multicluster branch June 13, 2023 17:06
Linked issue that may be closed by this pull request: Reaper web not displaying all running repairs