Degradation of PVCs #19820
Unanswered
srstrickland asked this question in Q&A
Replies: 2 comments 7 replies
-
This has happened again. Here is the current pod situation:
And a smoking gun... on every one of the highlighted pods (and none of the others!), there are multiple
Every single one of the others looks something like this:
It seems possibly related to #19155
5 replies
-
@jszwedko this seems pretty similar to the issues we're facing as well. Is there any plan to make
2 replies
-
Hi!
We have been using PVC-backed data buffers, mostly because we were unsure about taking on the operational burden of keeping Kafka happy. We have since decided the operational flexibility is worth the effort, so we will be redesigning our system accordingly, but recently I've run into a more pressing problem with the PVC solution: it appears to become unstable over time.
Unfortunately, I don't have a ton of information here... there is nothing notable in the logs, and all I really see is CPU utilization going through the roof and discarded events being reported. If all I do is delete the pod, it comes back (with the same PVC) and continues its high CPU utilization and discarded-event errors. We're using HPA, so all it takes is one pod throwing off the average to trigger an HPA scale-out. Over time, it happens to more and more pods. My only recourse is to delete the pod and its corresponding PVC.
Recently, this happened to 6 of 20 pods (we only had 20 pods, up from the usual 8, because multiple pods started exhibiting this behavior). Here's what I saw:
I cleaned up the 6 indicated via this script:
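(The script itself didn't survive the paste. A minimal sketch of what such a cleanup could look like follows; the `vector-` pod names and the `data-<pod>` PVC naming convention are assumptions for illustration, not taken from the original post.)

```shell
#!/usr/bin/env bash
# Hypothetical cleanup: delete an affected pod together with its PVC so
# the pod comes back with a fresh volume. All names are illustrative.
set -eu

pvc_for_pod() {
  # Assumed StatefulSet convention: PVC name is "<claim-template>-<pod>".
  echo "data-$1"
}

cleanup_pod() {
  local pod=$1
  kubectl delete pod "$pod" --wait=false
  kubectl delete pvc "$(pvc_for_pod "$pod")"
}

# Usage: ./cleanup.sh vector-3 vector-7 ...
for pod in "$@"; do
  cleanup_pod "$pod"
done
```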
The chaos stopped, and HPA scaled back down with 8 pods hovering around 800m CPU.
The PVC data has to be the culprit here... I just don't know how to diagnose this issue. I've saved the
/vector-data-dir
from one of the instances, but I'm not sure how to go about troubleshooting it. Before we get to the Kafka-based solution (which we're working on), I'm wondering whether it's worth applying some workaround:
rm -rf /vector-data-dir/*
on "clean" shutdown, and possibly doing regular rolling restarts. Any guidance here would be great!
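(For the shutdown-wipe idea, a minimal sketch of where that could be wired in, assuming a standard Vector container spec; the container name and path are assumptions. Note this discards any events still sitting in the buffer, so it only makes sense if at-least-once delivery from the buffer isn't required:)

```yaml
# Hypothetical preStop hook: wipe the buffer directory as the pod shuts
# down, so a restarted pod starts with a clean data dir.
containers:
  - name: vector
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "rm -rf /vector-data-dir/*"]
```

One caveat: preStop hooks run before the container receives SIGTERM, so Vector would still be running while the directory is wiped. An init container that clears the directory on startup may be a safer place for the same workaround.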