
Improve status reporting in UI #1262

Merged · 32 commits · Feb 20, 2023

Conversation

Miles-Garnsey
Contributor

The status page doesn't always show repairs which are actually running. We should sort the returned results according to which statuses are most "interesting" and show the most interesting rows first.

Fixes #1217

@Miles-Garnsey linked an issue Jan 11, 2023 that may be closed by this pull request
…r and state and then query the main `repair_runs` table using the found UUIDs.
@adejanovski (Contributor) left a comment

Here's the Cucumber scenario to verify the feature is correctly implemented:

@sidecar
  Scenario Outline: Verify that ongoing repairs are prioritized over finished ones when listing the runs
    Given that reaper <version> is running
    And reaper has no cluster in storage
    When an add-cluster request is made to reaper with authentication
    Then reaper has the last added cluster in storage
    And a new repair is added for "test" and keyspace "test_keyspace"
    And I add and abort 10 repairs for "test" and keyspace "test_keyspace2"
    Then when I list the last 10 repairs, I can see 1 repairs at "NOT_STARTED" state
    And when I list the last 10 repairs, I can see 9 repairs at "ABORTED" state
    When the last added cluster is deleted
    Then reaper has no longer the last added cluster in storage
  ${cucumber.upgrade-versions}

The 2 new steps to implement are:

And I add and abort ?? repairs for "???" and keyspace "???"
Then when I list the last ?? repairs, I can see ? repairs at "????" state
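
For illustration, the matching step definitions might be skeletons along these lines (a sketch only: the method names, regex captures, and bodies are assumptions rather than the final implementation, and the Cucumber annotation import depends on the Cucumber version in use):

  @And("^I add and abort (\\d+) repairs for \"([^\"]*)\" and keyspace \"([^\"]*)\"$")
  public void addAndAbortRepairs(int count, String clusterName, String keyspace) throws Throwable {
    // create `count` repair runs for the given cluster and keyspace, then abort them
  }

  @Then("^when I list the last (\\d+) repairs, I can see (\\d+) repairs at \"([^\"]*)\" state$")
  public void listRepairsAndCountState(int limit, int expected, String state) throws Throwable {
    // list the last `limit` repair runs and assert that `expected` of them are in `state`
  }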

…iority statuses.

Fix issue in MemoryStorage and add test for it.
Comment on lines 2899 to 2900
@And("^I add 11 and abort the most recent 10 repairs for cluster \"([^\"]*)\" and keyspace \"([^\"]*)\"$")
public void addAndAbortRepairs(String clusterName, String keyspace) throws Throwable {
Contributor

11 and 10 should be variables, so that we can reuse this step if needed with different numbers.

Suggested change
@And("^I add 11 and abort the most recent 10 repairs for cluster \"([^\"]*)\" and keyspace \"([^\"]*)\"$")
public void addAndAbortRepairs(String clusterName, String keyspace) throws Throwable {
@And("^I add (\\d+) and abort the most recent (\\d+) repairs for cluster \"([^\"]*)\" and keyspace \"([^\"]*)\"$")
public void addAndAbortRepairs(int nbRepairRuns, int abortedRepairRuns, String clusterName, String keyspace) throws Throwable {

Contributor Author

I've made the requested changes. Thinking on this, should this not be several steps? So perhaps the spec should be:

  • When I add 11 repairs
  • When I set the state on the most recent 10 repairs to ABORTED

That would better promote reusability I think.


RUNNERS.parallelStream().forEach(runner -> {
  Integer iter = 1;
  while (iter <= 11) {
Contributor

You seem to be creating just 10 repairs here, not 11.
Also I don't think you should use a parallel stream across runners, because each runner will create 10 runs here.
Just pick the first runner by using RUNNERS.get(0).callReaper(...) at line 2913, and remove the surrounding stream.
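
A rough sketch of that shape, for reference (addRepairRun here is a hypothetical helper standing in for the existing logic that goes through RUNNERS.get(0).callReaper(...); the real arguments live in the test harness):

  // Sequential loop over a single runner instead of RUNNERS.parallelStream().
  for (int i = 1; i <= 11; i++) {
    addRepairRun(RUNNERS.get(0), clusterName, keyspace); // hypothetical helper wrapping callReaper(...)
  }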

Contributor Author

I was wondering about that. The runners all run against the same cluster? Changed as suggested.

Contributor Author

RE creating too few repairs, with this logic I create 1,2,3,4,5,6,7,8,9,10,11 - which is 11 I'm pretty sure?

I used my fingers to count, so someone needs to rescind my math postgrad if I'm wrong about this.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/better-status-reporting branch 2 times, most recently from d911355 to b2f6ac3 Compare January 19, 2023 02:14
@adejanovski (Contributor) left a comment

Almost there, I still have a few requests so that we don't create confusion amongst users.

final Collection<RepairRun> repairRuns = context.storage.getRepairRunsForCluster(clusterName, limit);
final Collection<RepairRun> repairRuns = context
    .storage
    .getRepairRunsForClusterPrioritiseRunning(clusterName, limit);
Contributor

You need to do the same on line 665. It's the method that lists repair runs across all registered clusters.
Currently it's still using the old getRepairRunsForCluster() method, which isn't optimized to prioritize running repairs.

Contributor Author

The problem is that on line 665, it is adding the runs from ALL clusters into a list. I'll need to re-sort the runs so that the statuses are in the right order irrespective of cluster. Once you let me know the precise way you want the sorting done, I might just encapsulate it in a function so that we aren't duplicating that logic in multiple places.

Contributor

Right, you'll need to merge lists from different clusters and re-sort to apply the limits.
FYI, repair ids are timeuuids, which makes it possible to apply the sorting on them directly using their time component.
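
For reference, a minimal sketch of that merge-and-sort, assuming getId() returns the run's timeuuid and that cluster1Runs, cluster2Runs and limit are placeholders supplied by the caller:

  import java.util.Comparator;
  import java.util.List;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  // Merge the per-cluster lists, order newest-first by the timestamp embedded in the
  // version-1 UUID (timeuuid), then apply the overall limit.
  List<RepairRun> merged = Stream.concat(cluster1Runs.stream(), cluster2Runs.stream())
      .sorted(Comparator.comparingLong((RepairRun r) -> r.getId().timestamp()).reversed())
      .limit(limit)
      .collect(Collectors.toList());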

Contributor Author

This appears to be done and working now, although the cucumber test doesn't test against multiple clusters. I'm not sure if I should add a multi-cluster test to confirm that this endpoint does indeed work?

Contributor

I think that would be fairly hard to achieve with the limited resources we have at our disposal.
A mocked test wouldn't help much I guess, so we're left with manual testing for multi cluster 🤷

Miles-Garnsey and others added 3 commits February 14, 2023 14:49
Co-authored-by: Alexander Dejanovski <alex@thelastpickle.com>
@Miles-Garnsey
Contributor Author

I've created a new method on RepairRun, SortByRunState, which implements the logic you want using the existing function isTerminated(). I've then subbed that method in for MemoryStorage and inside the RepairRunResource calls (CassandraStorage uses different logic which already partitions the terminated/unterminated statuses, so I think that is OK, but please let me know if I'm wrong).

My only concern with this is that I don't like having methods on RepairRun, since it doesn't have many right now and is almost a model object (having few methods beyond a builder); is there a better place for this, do you think? Maybe even in RepairRunResource, but then the storage layer would need to call the presentation layer, which seems questionable too?

Maybe we shouldn't be doing the sorting in the storage layer at all since this type of sorting is purely a presentation layer concern (so perhaps implement in 'RepairRunResource' and leave responses from storage layers unsorted)? I'm open to feedback on the design here, as you know the patterns in this codebase better than me.
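
For illustration only, a helper along those lines might look roughly like this (a sketch that assumes RunState.isTerminated() as used elsewhere in this thread, not the exact method that was committed):

  import java.util.Comparator;
  import java.util.List;

  // Sort in place so that non-terminated runs (running/paused/not started) come before
  // terminated ones; Boolean ordering puts false (not terminated) first.
  public static void sortByRunState(List<RepairRun> runs) {
    runs.sort(Comparator.comparing(run -> run.getRunState().isTerminated()));
  }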

@Miles-Garnsey Miles-Garnsey force-pushed the feature/better-status-reporting branch 2 times, most recently from 2e09f7a to 093633c Compare February 16, 2023 05:13
@Miles-Garnsey
Contributor Author

Manual testing suggests that this isn't working again for some reason. I've tried flipping the ordering in these conditions, but neither ordering appears to give us the result we want in the UI:

  if (!o1.getRunState().isTerminated() && o2.getRunState().isTerminated()) {
    return 1;  // o2 appears first.
  } else if (o1.getRunState().isTerminated() && !o2.getRunState().isTerminated()) {
    return -1; // o1 appears first.

More concerning, I haven't seen the cucumber tests failing either way, so they don't appear to be detecting errors.

It may be the case that there is some ordering being applied in the UI, since I note that the repair runs are always ordered by start time (not creation time, which the UUID should give us, I think). I'll investigate further.

@Miles-Garnsey
Contributor Author

Manual testing confirms this PR works as intended. The outstanding questions are:

  1. Should we restructure the cucumber test functions so that starting and aborting repairs are separate functions?
  2. Should we make this a multicluster test to better cover the repair_runs/ endpoint (which queries all clusters and isn't currently fully covered by the e2e tests)?

adejanovski previously approved these changes Feb 17, 2023

@adejanovski (Contributor) left a comment

I have a suggestion to simplify the code a little bit.
No need to try testing multi-cluster stuff in cucumber; we won't have enough resources in GHA to do so.
Also, it's fine to keep create/abort in the same cucumber step for now. If we need it later, we can refactor it into two separate steps.

Comment on lines 971 to 978
for (RunState state :
    Arrays
        .stream(RunState.values())
        .filter(v ->
            Arrays.asList("RUNNING", "PAUSED", "NOT_STARTED")
                .contains(v.toString()))
        .collect(Collectors.toList())
) {
Contributor

suggestion (non-blocking): Do you think this could be simplified to the following?

for (String state : Arrays.asList("RUNNING", "PAUSED", "NOT_STARTED")) {

I'm not sure why we need the stream().filter().collect() here, but I could be missing something.

Contributor Author

We could, it would be simpler. I'll make this change.
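
As an aside, iterating the enum constants directly would avoid the string comparison entirely (a sketch, assuming the loop body only needs RunState values):

  import java.util.Arrays;

  // Only the "interesting" states, as enum constants rather than strings.
  for (RunState state : Arrays.asList(RunState.RUNNING, RunState.PAUSED, RunState.NOT_STARTED)) {
    // existing loop body, unchanged
  }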

@Miles-Garnsey Miles-Garnsey merged commit ed9c4da into master Feb 20, 2023
Successfully merging this pull request may close these issues:

Force running repairs to always show up in Running on Repairs page