Intermittent health check failure causes unreachable downstream clusters #34819
Comments
Having the same issue. Upgrading monitoring to .22 on all downstream clusters and projects seemed to help, but it's still happening.
I also encounter this problem intermittently and don't know how to solve it!
This is linked to issue #30959.
I believe I have successfully reproduced this when running Rancher and the downstream k3s cluster on the same box. Please see the details and Results section below.

Reproduction Environment:
Rancher version: v2.6.3
Downstream cluster type: k3s

Reproduction steps:
Note: These steps are derived from Darren's comment here; a sketch of the download loop is included below.
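The script itself isn't reproduced in this thread, so the following is only a minimal sketch of what such a loop might look like, assuming the test repeatedly pulls a ~100MB file out of the downstream cluster through the Rancher-proxied kubeconfig. The kubeconfig path, pod name, file path, and iteration count are placeholders, not taken from Darren's original script.

```bash
#!/usr/bin/env bash
# Hypothetical reproduction loop: stream a ~100MB file out of a pod in the
# downstream k3s cluster via a kubeconfig that routes through Rancher.
# Truncated output files or "unexpected EOF" errors suggest the tunnel
# dropped the transfer mid-stream.
KUBECONFIG=./downstream-via-rancher.yaml   # placeholder path

for i in $(seq 1 10); do
  kubectl --kubeconfig "$KUBECONFIG" exec test-pod -- \
    cat /data/bigfile-100mb > "out-${i}.bin" \
    || echo "iteration ${i}: transfer failed"
done

# Compare sizes on disk; anything under 100MB was cut short mid-transfer.
ls -l out-*.bin
```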
Results: When I run the script, I see that the shell output starts with the following (which I did not expect):
However, I then see unexpected EOF errors for
Finally, when I check file sizes on disk, some of the files are not the full 100MB:
Additional Info: While running the script I see free memory on the DO host decline, but it doesn't appear to be detrimental; there is typically at least 400MB free/cached, so it's likely not an issue.

Rancher server logs snippet:
k3s server logs snippet:
My validation checks: Failed

Reproduction Steps: I already reproduced this recently in the prior comment above. When I reproduced, I tested several times and had the same result each time (encountering EOF errors).

Validation Environment:
Rancher version: v2.5-head (effectively equivalent to v2.5.12-rc3; rc3 was broken so I had to use v2.5-head, but it should make no difference as both have the change) commitId
Downstream cluster type: k3s

Validation steps:
Note: These steps are derived from Darren's comment here.
Results: I still encounter this unexpected output, but then the script seems to work as expected, and it shows fewer EOF errors than previously when testing with v2.6.3.
I then see some EOF errors. These are encountered on loop iterations 4, 5, 7, and 10.
When I check sizes on disk, I observe the following, which is expected given the EOF errors:
Additional Info: While running the script I again see free memory on the DO host decline, but it doesn't appear to be detrimental; there is typically at least 400MB free/cached, so it's likely not an issue.

Rancher server logs snippet below. Notice this is similar to my earlier snippet from when I reproduced; that is probably because the earlier snippet wasn't related to the actions being performed by the script. We now see remotedialer buffer exceeded errors. There are a ton of these errors; this is just a snippet of them.
k3s server logs snippet below. I don't think any of this is relevant, except maybe the last two lines.
Additionally, note that I tested four times: the first time there were four EOF errors, the second time two, and the third and fourth times just one EOF error each. Compare this to before, when I'd get EOF errors on 8 or 9 out of 10 iterations.
/forwardport v2.6.3-patch1
PR: #36206
Based on offline discussion with @davidnuzik, @ibuildthecloud, and @deniseschannon, closing this issue for the 2.5.12 release.
SURE-3343, SURE-3344
Rancher Cluster:
Rancher version: 2.5.9
Number of nodes: 1500
Node OS version: RancherOS v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: 1.18.6
Downstream Cluster:
Number of Downstream clusters: 150
Node OS: RancherOS-v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: v1.18.20
CNI:
Other:
Underlying Infrastructure: Azure
Any 3rd party software installed on the nodes: NA
Customer’s main time zone: UTC +2
Describe the bug
System fails intermittently with message:
Cluster health check failed: Failed to communicate with API server: Get "https://[api-server]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
The clusters recover after a few seconds and the errors occur at random intervals.
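The request that fails in the error above can be approximated from the command line with curl, which may help separate a slow API server from a slow tunnel. This is only a sketch: $TOKEN, $RANCHER_TOKEN, <api-server>, <rancher-server>, and <cluster-id> are all placeholders for your environment, not values from this report.

```bash
# Direct request to the downstream API server (bypasses Rancher entirely):
curl -sk --max-time 45 -H "Authorization: Bearer $TOKEN" \
  "https://<api-server>/api/v1/namespaces/kube-system?timeout=45s"

# Same request through Rancher's cluster proxy, which traverses the
# remotedialer tunnel that the health check depends on:
curl -sk --max-time 45 -H "Authorization: Bearer $RANCHER_TOKEN" \
  "https://<rancher-server>/k8s/clusters/<cluster-id>/api/v1/namespaces/kube-system?timeout=45s"
```

If only the proxied request stalls while the direct one returns promptly, that would point at the Rancher tunnel rather than the downstream cluster itself.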
To Reproduce
We currently don't have repro steps, but the issue might be happening while applying a big ConfigMap; a sketch of one way to probe this follows.
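If the big-ConfigMap theory holds, one hedged way to test it is to apply a ConfigMap near Kubernetes' ~1MiB object size limit while watching for the health check errors. The names and sizes below are illustrative only, not from the customer's environment.

```bash
# Generate ~700KB of random data, base64-encode it (result stays under the
# ~1MiB object limit), and create a ConfigMap from it as a large write.
head -c 700000 /dev/urandom | base64 > big-value.txt
kubectl create configmap big-test --from-file=data=big-value.txt
```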
Result
Random downstream clusters are intermittently unreachable for 5-10 min, auto-recovering after 5-10 min.

Expected Result
The cluster should not enter the unreachable state.