
Ignore repair_run_by_cluster_v2 rows with no corresponding repair #1478

Conversation

SesquipedalianDefenestrator
Contributor

When bad things (like Reaper filling the drive with logs) happen, it's possible to end up with repair_run_by_cluster_v2 rows with no corresponding repair row, which breaks Reaper. So, just skip them.
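For illustration, a minimal sketch of the failure mode, with plain Maps standing in for the two tables; all names here are hypothetical, not Reaper's actual storage code:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Minimal model of the inconsistency: the lookup table still references a run
// id whose backing row is gone, and naively dereferencing the result blows up.
class OrphanedRowSketch {

  public static void main(String[] args) {
    UUID ok = UUID.randomUUID();
    UUID orphan = UUID.randomUUID();

    // repair_run_by_cluster_v2: cluster name -> run ids (the lookup table)
    Map<String, List<UUID>> runIdsByCluster = Map.of("prod", List.of(ok, orphan));
    // repair_run: run id -> run state; the orphaned id has no row here
    Map<UUID, String> repairRuns = Map.of(ok, "DONE");

    for (UUID id : runIdsByCluster.get("prod")) {
      String state = repairRuns.get(id);   // null for the orphaned id
      System.out.println(state.length());  // NullPointerException on the orphan
    }
  }
}
```

The fix is to skip such ids rather than let them reach code that assumes a repair run exists.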



@SesquipedalianDefenestrator
Contributor Author

I'm not actually sure anything will clean up the orphaned repair_run_by_cluster_v2 rows, but I'm not sure that's a problem. The mismatch between repair_run and repair_run_by_cluster_v2 happens often enough to be a pain (basically, some fraction of the time whenever anything goes wrong with the cluster and Reaper logs until the drive fills), but not often enough for the leftover rows to generate measurable load.

Per comment, rewrite to stick a filter between instead of shoving a ternary into the map. Cleaner and easier to read.
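As a hedged sketch of the two shapes being compared (hypothetical names and signatures, not the actual Reaper code):

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.UUID;
import java.util.stream.Collectors;

// Both variants drop orphaned ids; the second reads more cleanly.
class FilterVersusTernarySketch {

  interface Storage {
    Optional<String> getRepairRun(UUID id);  // empty when the repair_run row is missing
  }

  // Ternary shoved into the map, then a null filter.
  static List<String> ternaryStyle(Storage storage, List<UUID> runIds) {
    return runIds.stream()
        .map(id -> storage.getRepairRun(id).isPresent() ? storage.getRepairRun(id).get() : null)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }

  // A filter stuck between the maps: orphaned ids are skipped up front.
  static List<String> filterStyle(Storage storage, List<UUID> runIds) {
    return runIds.stream()
        .map(storage::getRepairRun)
        .filter(Optional::isPresent)
        .map(Optional::get)
        .collect(Collectors.toList());
  }
}
```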
@SesquipedalianDefenestrator
Contributor Author

Looks like there's an issue for this already as well: #1463

@adejanovski adejanovski linked an issue Mar 14, 2024 that may be closed by this pull request
@adejanovski adejanovski requested review from Miles-Garnsey and removed request for FieteO March 14, 2024 12:35
@Miles-Garnsey
Contributor

I'm looking at this, but I'm having some issues with my local setup as far as testing goes, so bear with me while I try to work around them.

@Miles-Garnsey left a comment

In terms of testing, I've followed the procedure below:

Try to repro the issue from main by:

  1. First commenting out testSSLHotReload (which is a problem to run locally), then building the repo into a Docker image, loading it into a kind cluster via kind load, and spinning up a K8ssandraCluster with the new Reaper image.
  2. Creating a new repair via the UI.
  3. Confirming the repair exists in both repair_run_by_cluster_v2 and repair_run
  4. Deleting the run from repair_run via `DELETE FROM repair_run WHERE id = d03c24a0-e59e-11ee-9a67-5dd395ae716d`

When reloading the UI I then see:

[screenshot: the error as reproduced in the Reaper UI]

Which I think is what we want based on the error repro'd here.

I then rebuild the image and re-deploy everything using the branch from this PR. Following the same procedure, I no longer see the NPE in Reaper's logs. Instead, the UI simply returns no results for repair runs in the DONE or RUNNING state.

I'm going to chat to @adejanovski about whether we need some more work to ensure that repair runs are purged from repair_run_by_cluster_v2, and whether this data inconsistency should be logged as a warning. But I'll approve this for now on the basis that it eliminates the NPE.
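If the warning idea were pursued, one hypothetical shape for it (illustrative names only, and not what this PR merged) would be to log at the same filter step:

```java
import java.util.List;
import java.util.Optional;
import java.util.UUID;
import java.util.stream.Collectors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of a possible follow-up: surface the inconsistency in the logs while
// still skipping the orphaned lookup rows.
class WarnOnOrphanSketch {

  private static final Logger LOG = LoggerFactory.getLogger(WarnOnOrphanSketch.class);

  interface Storage {
    Optional<String> getRepairRun(UUID id);
  }

  static List<String> repairRunsForCluster(Storage storage, List<UUID> runIds) {
    return runIds.stream()
        .map(id -> {
          Optional<String> run = storage.getRepairRun(id);
          if (run.isEmpty()) {
            LOG.warn("repair_run_by_cluster_v2 references run {} with no repair_run row; skipping", id);
          }
          return run;
        })
        .filter(Optional::isPresent)
        .map(Optional::get)
        .collect(Collectors.toList());
  }
}
```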

@Miles-Garnsey Miles-Garnsey merged commit 88a9981 into thelastpickle:master Mar 19, 2024
20 checks passed
Successfully merging this pull request may close these issues.

NPE when listing repairs