Intermittent health check failure causes unreachable downstream clusters #34819
Comments
Having the same issue. Upgrading monitoring to .22 on all downstream clusters and projects seemed to help, but it's still happening.
I also encounter this problem intermittently and don't know how to solve it!
This is linked to issue #30959.
I believe I have successfully reproduced this when running Rancher and the downstream k3s cluster on the same box. Please see the details and Results section below.

Reproduction Environment:
Rancher version: v2.6.3
Downstream cluster type: k3s

Reproduction steps:
Note: These steps are derived from Darren's comment here; a sketch of the download loop is included below.
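The script itself isn't reproduced in this thread, so the following is only a minimal sketch of what such a loop might look like, assuming the test repeatedly pulls a ~100MB file out of the downstream cluster through the Rancher-proxied kubeconfig. The kubeconfig path, pod name, file path, and iteration count are placeholders, not taken from Darren's original script.

```bash
#!/usr/bin/env bash
# Hypothetical reproduction loop: stream a ~100MB file out of a pod in the
# downstream k3s cluster via a kubeconfig that routes through Rancher.
# Truncated output files or "unexpected EOF" errors suggest the tunnel
# dropped the transfer mid-stream.
KUBECONFIG=./downstream-via-rancher.yaml   # placeholder path

for i in $(seq 1 10); do
  kubectl --kubeconfig "$KUBECONFIG" exec test-pod -- \
    cat /data/bigfile-100mb > "out-${i}.bin" \
    || echo "iteration ${i}: transfer failed"
done

# Compare sizes on disk; anything under 100MB was cut short mid-transfer.
ls -l out-*.bin
```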
Results: When I run the script, I see that the shell output starts with the following (which I did not expect):
However, I then see unexpected EOF errors for
Finally, when I check file sizes on disk, some of the files are not the full 100MB:
Additional Info: While running the script I see free memory on the DO host decline, but it doesn't appear to be detrimental; there is typically at least 400MB free/cached, so it's likely not an issue.

Rancher server logs snippet:
k3s server logs snippet:
My validation checks: Failed

Reproduction Steps: I already reproduced this recently in the prior comment above. When I reproduced, I tested several times and had the same result each time (encountering EOF errors).

Validation Environment:
Rancher version: v2.5-head (effectively equivalent to v2.5.12-rc3; rc3 was broken so I had to use v2.5-head, but it should make no difference as both have the change) commitId
Downstream cluster type: k3s

Validation steps:
Note: These steps are derived from Darren's comment here.
Results: I still encounter this unexpected output, but then the script seems to work as expected, and it shows fewer EOF errors than previously when testing with v2.6.3.
I then see some EOF errors. These are encountered on loop iterations 4, 5, 7, and 10.
When I check sizes on disk, I observe the following, which is expected given the EOF errors:
Additional Info: While running the script I again see free memory on the DO host decline, but it doesn't appear to be detrimental; there is typically at least 400MB free/cached, so it's likely not an issue.

Rancher server logs snippet below. Notice this is similar to my earlier snippet from when I reproduced; that is probably because the earlier snippet wasn't related to the actions being performed by the script. We now see remotedialer buffer exceeded errors. There are a ton of these errors; this is just a snippet of them.
k3s server logs snippet below. I don't think any of this is relevant, except maybe the last two lines.
Additionally, note that I tested four times: the first time there were four EOF errors, the second time two, and the third and fourth times just one EOF error each. Compare this to before, when I'd get EOF errors on 8 or 9 out of 10 iterations.
/forwardport v2.6.3-patch1
PR: #36206
Based on offline discussion with @davidnuzik, @ibuildthecloud, and @deniseschannon, closing this issue for the 2.5.12 release.
SURE-3343, SURE-3344
Rancher Cluster:
Rancher version: 2.5.9
Number of nodes: 1500
Node OS version: RancherOS v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: 1.18.6
Downstream Cluster:
Number of Downstream clusters: 150
Node OS: RancherOS-v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: v1.18.20
CNI:
Other:
Underlying Infrastructure: Azure
Any 3rd party software installed on the nodes: NA
Customer’s main time zone: UTC +2
Describe the bug
System fails intermittently with message:
Cluster health check failed: Failed to communicate with API server: Get "https://[api-server]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
The clusters recover after a few seconds and the errors occur at random intervals.
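The request that fails in the error above can be approximated from the command line with curl, which may help separate a slow API server from a slow tunnel. This is only a sketch: $TOKEN, $RANCHER_TOKEN, <api-server>, <rancher-server>, and <cluster-id> are all placeholders for your environment, not values from this report.

```bash
# Direct request to the downstream API server (bypasses Rancher entirely):
curl -sk --max-time 45 -H "Authorization: Bearer $TOKEN" \
  "https://<api-server>/api/v1/namespaces/kube-system?timeout=45s"

# Same request through Rancher's cluster proxy, which traverses the
# remotedialer tunnel that the health check depends on:
curl -sk --max-time 45 -H "Authorization: Bearer $RANCHER_TOKEN" \
  "https://<rancher-server>/k8s/clusters/<cluster-id>/api/v1/namespaces/kube-system?timeout=45s"
```

If only the proxied request stalls while the direct one returns promptly, that would point at the Rancher tunnel rather than the downstream cluster itself.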
To Reproduce
We currently don't have repro steps, but the issue might be happening while applying a big ConfigMap; a sketch of one way to probe this follows.
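If the big-ConfigMap theory holds, one hedged way to test it is to apply a ConfigMap near Kubernetes' ~1MiB object size limit while watching for the health check errors. The names and sizes below are illustrative only, not from the customer's environment.

```bash
# Generate ~700KB of random data, base64-encode it (result stays under the
# ~1MiB object limit), and create a ConfigMap from it as a large write.
head -c 700000 /dev/urandom | base64 > big-value.txt
kubectl create configmap big-test --from-file=data=big-value.txt
```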
Result
Random downstream clusters are intermittently unreachable for 5-10 min, auto-recovering after 5-10 min.

Expected Result
The cluster should not enter the unreachable state.