
Intermittent health check failure causes unreachable downstream clusters #34819

Closed
SheilaghM opened this issue Sep 20, 2021 · 13 comments
Labels: internal, QA/need-info, release-note, team/hostbusters

SheilaghM commented Sep 20, 2021

SURE-3343, SURE-3344

Rancher Cluster:
Rancher version: 2.5.9
Number of nodes: 1500
Node OS version: RancherOS v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: 1.18.6

Downstream Cluster:
Number of Downstream clusters: 150
Node OS: RancherOS-v1.5.6
RKE/RKE2/K3S version: RKE
Kubernetes version: v1.18.20
CNI:

Other:
Underlying Infrastructure: Azure
Any 3rd party software installed on the nodes: NA
Customer’s main time zone: UTC +2

Describe the bug
System fails intermittently with message:

Cluster health check failed: Failed to communicate with API server: Get "https://[api-server]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded

The clusters recover after a few seconds and the errors occur at random intervals.
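For anyone triaging this, a minimal way to hit the same endpoint the health check queries, with the same 45s timeout (a sketch: it assumes you have a kubeconfig for the affected downstream cluster, and <api-server> and $TOKEN are placeholders, not values from this report):

# Probe the namespace the health check reads, with kubectl's own 45s timeout
kubectl get namespace kube-system --request-timeout=45s -v=6

# Or hit the URL from the error message directly
curl -sk -m 45 -H "Authorization: Bearer $TOKEN" \
  "https://<api-server>/api/v1/namespaces/kube-system?timeout=45s"

If the direct probe succeeds while access through Rancher times out, that points at the Rancher/agent tunnel rather than the downstream API server itself.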

To Reproduce
We currently don't have reproduction steps, but the issue might be happening while applying a large ConfigMap.
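If the large-ConfigMap theory holds, a rough way to exercise it (a sketch; the ConfigMap name big-test and the payload size are arbitrary, and base64 -w0 assumes GNU coreutils):

# Build a payload near (but under) the ~1 MiB ConfigMap size limit
head -c 700000 /dev/urandom | base64 -w0 > big.txt
kubectl create configmap big-test --from-file=big.txt

# Re-apply it repeatedly while watching the cluster's state in Rancher
kubectl create configmap big-test --from-file=big.txt --dry-run=client -o yaml | kubectl apply -f -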

Result
Random downstream clusters intermittently become unreachable for 5–10 minutes and then recover automatically.

Expected Result
The cluster should not enter the unreachable state.

@junkiebev

Having the same issue. Upgrading monitoring to .22 on all downstream clusters and projects seemed to help, but it's still happening.

@hyj-github

I also encounter this problem intermittently and don't know how to solve it!

@AlessioCasco

This is linked with issue #30959

@Jono-SUSE-Rancher Jono-SUSE-Rancher modified the milestones: v2.5.11, v2.5.12 Oct 26, 2021
@jmcsagdc jmcsagdc self-assigned this Nov 15, 2021
@SheilaghM SheilaghM modified the milestones: v2.5.12, v2.5.13 Nov 16, 2021
@deniseschannon deniseschannon added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Dec 1, 2021
@deniseschannon deniseschannon modified the milestones: v2.5.13, v2.5.12 Dec 5, 2021
@MKlimuszka MKlimuszka assigned aiyengar2 and unassigned StrongMonkey Dec 6, 2021
@deniseschannon deniseschannon assigned thedadams and unassigned aiyengar2 Jan 4, 2022
@sowmyav27 sowmyav27 assigned slickwarren and unassigned jmcsagdc Jan 4, 2022

davidnuzik commented Jan 12, 2022

I believe I have successfully reproduced this when running Rancher and the downstream k3s cluster on the same box. Please see the details and the Results section below.

Reproduction Environment:

Rancher version: v2.6.3
Rancher cluster type: single-node docker install
Docker version: 20.10

Downstream cluster type: k3s
Downstream K8s version: v1.22.5+k3s1 - running on the same box as the single-node rancher server

Reproduction steps:

Note: These steps are derived from Darren's comment here.

  1. Create a DigitalOcean VM. I did this with the ubuntu-20-04-x64 image in the sfo3 region with size s-2vcpu-4gb
    doctl compute droplet create dave-34819-4gb --tag-names dave-daily --image ubuntu-20-04-x64 --region sfo3 --size s-2vcpu-4gb --ssh-keys FINGERPRINT-HERE
  2. SSH into the box, install docker 20.10 and deploy single-node rancher 2.6.3 with docker.
    docker run -v ~/state:/var/lib/rancher -d -p 80:80 -p 443:443 --name two-six-three --restart=no --privileged rancher/rancher:v2.6.3 (note -v flag and --name are optional here; this is just how I operate...)
  3. Install K3s (e.g. curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig-mode 0644 --node-external-ip EXTERNAL-IP-HERE" sh -) (setting node-external-ip is probably not needed for this issue, just a habit)
    Note: at this time the DigitalOcean VM has almost 1 GB free/cached memory, which should be fine. (Swappiness is disabled OOTB.)
  4. Delete the traefik service (so that we can still access rancher frontend without any port conflicts) kubectl delete svc traefik -n kube-system (I might have been able to tell k3s at install-time to not deploy traefik; not sure but this works)
  5. Login to the rancher frontend. Import this k3s cluster.
  6. Create a deployment in the downstream k3s cluster. In my case I just deployed ranchertest/mytestcontainer:latest.
  7. Exec a shell into the container. Create a 100 MB file on disk inside the container. I just did this at /
    fallocate -l $((100*1024*1024)) test and then confirmed the file is as I expect it to be: ls -lh test and confirmed it's indeed 100MB on disk:
    -rw-r--r-- 1 root root 100M Jan 12 19:17 test
  8. Now, on a client outside the LAN (my home computer, 1 Gbps down WAN connection, Ethernet), get the downstream k3s cluster kubeconfig file and set it as the kubeconfig for kubectl. Query for the name of the pod (kubectl get pods -A) and export that name into a var named POD, e.g. export POD=eoftest-6bbd465565-lcxjx. Then create a script on the local machine as follows and execute it. This runs kubectl cp roughly in parallel (only roughly, since the loop starts them one after another, but each download takes long enough that they overlap); in theory this should reproduce the issue, especially if the connection is slow enough. It seems I could reproduce it even with my fast 1 Gbps download speed and low latency/ping.
#!/bin/bash

for i in $(seq 10); do
    (
        time kubectl cp ${POD}:/test test.$i
        ls -l test.$i
        echo done $i
    ) 2>&1 | tee test.$i.log &
done
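A quick follow-up check after the loop finishes (a sketch; assumes GNU stat and the 100 MB file from step 7) to flag which copies came back truncated:

# Wait for the backgrounded kubectl cp jobs, then flag truncated copies
wait
for f in test.{1..10}; do
    size=$(stat -c%s "$f" 2>/dev/null || echo 0)
    [ "$size" -ne $((100*1024*1024)) ] && echo "$f truncated: $size bytes"
done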

Results:
So, I believe I have reproduced? @ibuildthecloud

When I run the script I see that shell output starts with the following (which I did not expect):

tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names

However, I then see unexpected EOF errors for almost all instances (so I reproduced it, then?):

david@neo:~/kc-cp-test> error: unexpected EOF

real    0m24.732s
user    0m0.453s
sys     0m0.319s
-rw-r--r-- 1 david users 97965568 Jan 12 12:52 test.6
done 6
error: unexpected EOF

real    0m26.117s
user    0m0.350s
sys     0m0.396s
-rw-r--r-- 1 david users 99104256 Jan 12 12:52 test.2
done 2
error: unexpected EOF

real    0m28.817s
user    0m0.396s
sys     0m0.366s
-rw-r--r-- 1 david users 95718912 Jan 12 12:52 test.5
done 5

real    0m31.193s
user    0m0.417s
sys     0m0.378s
-rw-r--r-- 1 david users 104857600 Jan 12 12:52 test.3
done 3

real    0m31.742s
user    0m0.465s
sys     0m0.363s
-rw-r--r-- 1 david users 104857600 Jan 12 12:52 test.9
done 9
error: unexpected EOF

real    0m32.589s
user    0m0.429s
sys     0m0.336s
-rw-r--r-- 1 david users 101301760 Jan 12 12:52 test.7
done 7
error: unexpected EOF

real    0m32.737s
user    0m0.417s
sys     0m0.335s
-rw-r--r-- 1 david users 97205760 Jan 12 12:52 test.4
done 4
error: unexpected EOF

real    0m34.521s
user    0m0.405s
sys     0m0.396s
-rw-r--r-- 1 david users 103837184 Jan 12 12:52 test.10
done 10
error: unexpected EOF

real    0m34.549s
user    0m0.365s
sys     0m0.419s
-rw-r--r-- 1 david users 104461824 Jan 12 12:52 test.8
done 8
error: unexpected EOF

real    0m35.539s
user    0m0.453s
sys     0m0.357s
-rw-r--r-- 1 david users 104173056 Jan 12 12:52 test.1
done 1

And finally when I check file sizes on disk, some of the files are not the full 100MB size:

david@neo:~/kc-cp-test> ls -laht | grep -iv "log"
total 967M
-rw-r--r-- 1 david users 100M Jan 12 12:52 test.1
-rw-r--r-- 1 david users 100M Jan 12 12:52 test.8
-rw-r--r-- 1 david users 100M Jan 12 12:52 test.10
-rw-r--r-- 1 david users  93M Jan 12 12:52 test.4
-rw-r--r-- 1 david users  97M Jan 12 12:52 test.7
-rw-r--r-- 1 david users 100M Jan 12 12:52 test.9
-rw-r--r-- 1 david users 100M Jan 12 12:52 test.3
-rw-r--r-- 1 david users  92M Jan 12 12:52 test.5
-rw-r--r-- 1 david users  95M Jan 12 12:52 test.2
-rw-r--r-- 1 david users  94M Jan 12 12:52 test.6
drwxr-xr-x 1 david users  354 Jan 12 12:51 .
-rwxr-xr-x 1 david users  217 Jan 12 12:51 kc-cp-script.sh
drwxr-xr-x 1 david users 1.3K Jan 12 12:39 ..

Additional Info:

While running the script I see free memory on the DO host decline, but it doesn't appear to be detrimental; there is typically at least 400 MB free/cached. Likely not an issue.
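A small helper to keep an eye on that while the copies run (a sketch; two-six-three is just the container name from step 2 above):

# Sample host memory and the rancher container's usage every 5 seconds
while true; do
    free -m | awk 'NR==2{printf "free: %s MB, available: %s MB\n", $4, $7}'
    docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' two-six-three
    sleep 5
done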

Rancher server logs snippet:

W0112 20:05:58.443670      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.421644      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.447449      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.460455      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.461586      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.461375      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.462110      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.462350      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.462551      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.462692      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.462871      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.463184      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.481316      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.523158      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.535121      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.555078      33 transport.go:288] Unable to cancel request for *client.addQuery
W0112 20:05:59.606692      33 transport.go:288] Unable to cancel request for *client.addQuery

k3s server logs snippet:

Jan 12 19:48:11 dave-34819-4gb k3s[43989]: E0112 19:48:11.274844   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49500->127.0.0.1:10010: write tcp 127.0.0.1:49500->127.0.0.1:10010: write: broken pipe
Jan 12 19:48:14 dave-34819-4gb k3s[43989]: E0112 19:48:14.724824   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:48:16 dave-34819-4gb k3s[43989]: E0112 19:48:16.905848   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49538->127.0.0.1:10010: write tcp 127.0.0.1:49538->127.0.0.1:10010: write: broken pipe
Jan 12 19:48:20 dave-34819-4gb k3s[43989]: E0112 19:48:20.552629   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:48:21 dave-34819-4gb k3s[43989]: E0112 19:48:21.650725   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49522->127.0.0.1:10010: write tcp 127.0.0.1:49522->127.0.0.1:10010: write: connection reset by peer
Jan 12 19:48:22 dave-34819-4gb k3s[43989]: E0112 19:48:22.217899   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49506->127.0.0.1:10010: write tcp 127.0.0.1:49506->127.0.0.1:10010: write: connection reset by peer
Jan 12 19:48:22 dave-34819-4gb k3s[43989]: E0112 19:48:22.681905   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49536->127.0.0.1:10010: write tcp 127.0.0.1:49536->127.0.0.1:10010: write: connection reset by peer
Jan 12 19:48:24 dave-34819-4gb k3s[43989]: E0112 19:48:24.225714   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:49542->127.0.0.1:10010: write tcp 127.0.0.1:49542->127.0.0.1:10010: write: broken pipe
Jan 12 19:48:24 dave-34819-4gb k3s[43989]: E0112 19:48:24.262229   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:48:24 dave-34819-4gb k3s[43989]: E0112 19:48:24.499522   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:48:24 dave-34819-4gb k3s[43989]: E0112 19:48:24.715775   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:48:25 dave-34819-4gb k3s[43989]: E0112 19:48:25.395349   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:52:20 dave-34819-4gb k3s[43989]: E0112 19:52:20.132403   43989 upgradeaware.go:401] Error proxying data from backend to client: readfrom tcp 127.0.0.1:42568->127.0.0.1:10250: write tcp 127.0.0.1:42568->127.0.0.1:10250: write: br
Jan 12 19:52:24 dave-34819-4gb k3s[43989]: E0112 19:52:24.430293   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:51722->127.0.0.1:10010: write tcp 127.0.0.1:51722->127.0.0.1:10010: write: connection reset by peer
Jan 12 19:52:26 dave-34819-4gb k3s[43989]: E0112 19:52:26.838437   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
Jan 12 19:52:27 dave-34819-4gb k3s[43989]: E0112 19:52:27.053228   43989 upgradeaware.go:387] Error proxying data from client to backend: readfrom tcp 127.0.0.1:51734->127.0.0.1:10010: write tcp 127.0.0.1:51734->127.0.0.1:10010: write: broken pipe
Jan 12 19:52:28 dave-34819-4gb k3s[43989]: E0112 19:52:28.120048   43989 upgradeaware.go:401] Error proxying data from backend to client: unexpected EOF
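For reference, the proxy errors above can be pulled straight out of the k3s journal (assuming the get.k3s.io install, which runs k3s as a systemd unit; adjust the time window to the test run):

journalctl -u k3s --since "2022-01-12 19:45" --until "2022-01-12 19:55" | grep -E "upgradeaware|Error proxying"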


davidnuzik commented Jan 14, 2022

My validation checks failed.

Reproduction Steps:

I already reproduced this recently in the prior comment above. Also, when I reproduced I tested several times and had the same result each time (encountering EOF errors).


Validation Environment:

Rancher version: v2.5-head (effectively equivalent to v2.5.12-rc3; rc3 was broken so I had to use v2.5-head, but it should make no difference as both have the change), commit ID 923f59c (Docker image downloaded on 1/14/2022)
Rancher cluster type: single-node docker install
Docker version: 20.10

Downstream cluster type: k3s
Downstream K8s version: v1.20.14+k3s1 (Rancher 2.5 should work with up to 1.20.x Kubernetes so this version was used for validation)

Validation steps:

Note: These steps are derived from Darren's comment here.

  1. Create a DigitalOcean VM. I did this with the ubuntu-20-04-x64 image in the sfo3 region with size s-2vcpu-4gb
    doctl compute droplet create dave-34819-4gb --tag-names dave-daily --image ubuntu-20-04-x64 --region sfo3 --size s-2vcpu-4gb --ssh-keys FINGERPRINT-HERE
  2. SSH into the box, install Docker 20.10 and deploy single-node Rancher v2.5-head with Docker.
    docker run -v ~/state:/var/lib/rancher -d -p 80:80 -p 443:443 --name two-five-head --restart=no --privileged rancher/rancher:v2.5-head (note -v flag and --name are optional here; this is just how I operate...)
  3. Install K3s (e.g. curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=1.20.14+k3s1 INSTALL_K3S_EXEC="--write-kubeconfig-mode 0644 --node-external-ip EXTERNAL-IP-HERE" sh -) (setting node-external-ip is probably not needed for this issue, just a habit)
    Note: at this time the DigitalOcean VM has almost 1 GB free/cached memory, which should be fine. (Swappiness is disabled OOTB.)
  4. Delete the traefik service (so that we can still access rancher frontend without any port conflicts) kubectl delete svc traefik -n kube-system (I might have been able to tell k3s at install-time to not deploy traefik; not sure but this works)
  5. Login to the rancher frontend. Import this k3s cluster.
  6. Create a deployment in the downstream k3s cluster. In my case I just deployed ranchertest/mytestcontainer:latest.
  7. Exec a shell into the container. Create a 100 MB file on disk inside the container. I just did this at /
    fallocate -l $((100*1024*1024)) test and then confirmed the file is as I expect it to be: ls -lh test and confirmed it's indeed 100MB on disk:
    -rw-r--r-- 1 root root 100M Jan 14 17:55 test
  8. Now, on a client outside the LAN (my home computer, 1 Gbps down WAN connection, Ethernet), get the downstream k3s cluster kubeconfig file and set it as the kubeconfig for kubectl. Query for the name of the pod (kubectl get pods -A) and export that name into a var named POD, e.g. export POD=eoftest-6bbd465565-lcxjx. Then create a script on the local machine as follows and execute it. This runs kubectl cp roughly in parallel (only roughly, since the loop starts them one after another, but each download takes long enough that they overlap); in theory this should reproduce the issue, especially if the connection is slow enough. When I tested I saw fewer EOF errors, but still encountered them.
#!/bin/bash

for i in $(seq 10); do
    (
        time kubectl cp ${POD}:/test test.$i
        ls -l test.$i
        echo done $i
    ) 2>&1 | tee test.$i.log &
done
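Since the script tees each iteration to test.$i.log, a quick way to see which iterations hit the EOF error (a sketch):

# List the iterations whose logs contain the unexpected EOF error
grep -l "unexpected EOF" test.*.log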

Results:
I still encounter the EOF errors, but I seem to encounter fewer of them. This is either due to the changes partially resolving the issue, or the way I am testing could be the culprit. However, I am testing in exactly the same way as when I initially reproduced the issue in my prior comment.

I still encounter this unexpected output, but then the script seems to work as expected, and it shows fewer EOF errors than previously when testing with v2.6.3.

tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names

I then see some EOF errors. These are encountered on loop iterations 4, 5, 7, and 10.

real    0m40.880s
user    0m0.449s
sys     0m0.384s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.8
done 8
error: unexpected EOF

real    0m42.496s
user    0m0.413s
sys     0m0.403s
-rw-r--r-- 1 david users 103628288 Jan 14 11:03 test.5
done 5
error: unexpected EOF

real    0m42.640s
user    0m0.452s
sys     0m0.309s
-rw-r--r-- 1 david users 97392128 Jan 14 11:03 test.4
done 4
error: unexpected EOF

real    0m43.139s
user    0m0.384s
sys     0m0.380s
-rw-r--r-- 1 david users 97558016 Jan 14 11:03 test.7
done 7
error: unexpected EOF

real    0m43.199s
user    0m0.358s
sys     0m0.430s
-rw-r--r-- 1 david users 97320448 Jan 14 11:03 test.10
done 10

real    0m43.245s
user    0m0.463s
sys     0m0.376s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.1
done 1

real    0m43.676s
user    0m0.477s
sys     0m0.373s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.6
done 6

real    0m44.347s
user    0m0.425s
sys     0m0.409s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.3
done 3

real    0m44.402s
user    0m0.502s
sys     0m0.344s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.9
done 9

real    0m45.016s
user    0m0.428s
sys     0m0.398s
-rw-r--r-- 1 david users 104857600 Jan 14 11:03 test.2
done 2

And when I check sizes on disk I observe the following, as expected, due to the EOF errors:

david@neo:~/kc-cp-test> ls -laht | grep -iv "log"
total 978M
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.2
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.9
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.3
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.6
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.1
-rw-r--r-- 1 david users  93M Jan 14 11:03 test.10
-rw-r--r-- 1 david users  94M Jan 14 11:03 test.7
-rw-r--r-- 1 david users  99M Jan 14 11:03 test.5
-rw-r--r-- 1 david users  93M Jan 14 11:03 test.4
-rw-r--r-- 1 david users 100M Jan 14 11:03 test.8
drwxr-xr-x 1 david users  392 Jan 14 11:03 .
drwxr-xr-x 1 david users  324 Jan 14 10:58 v2.6.3-test-results
drwxr-xr-x 1 david users 1.3K Jan 13 09:07 ..
-rwxr-xr-x 1 david users  217 Jan 12 12:51 kc-cp-script.sh

Additional Info:

While running the script I see free memory on the DO host decline again, but it doesn't appear to be detrimental. Typically always at least 400MB free/cached. Likely not an issue.

Rancher server logs snippet below. Notice this differs from the snippet in my earlier reproduction comment; that earlier output probably wasn't related to the actions being performed by the script. We now see remotedialer buffer exceeded errors. There are a ton of these errors; this is just a snippet of them.

2022/01/14 18:03:47 [ERROR] remotedialer buffer exceeded, length: 4215507
2022/01/14 18:03:47 [ERROR] remotedialer buffer exceeded, length: 4248275
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4206503
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4239271
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4272039
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4304807
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4337575
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4370343
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4403111
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4435879
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4468647
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4501415
2022/01/14 18:03:56 [ERROR] remotedialer buffer exceeded, length: 4509027
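To quantify these, they can be counted straight from the Rancher container logs (a sketch; two-five-head is the container name from step 2 above):

docker logs two-five-head 2>&1 | grep -c "remotedialer buffer exceeded"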

k3s server logs snippet below. I don't think any of this is relevant except maybe the last two lines.

Jan 14 18:03:17 dave-34819-4gb k3s[788250]: I0114 18:03:17.698615  788250 trace.go:205] Trace[1605878027]: "GuaranteedUpdate etcd3" type:*core.ConfigMap (14-Jan-2022 18:03:17.060) (total time: 638ms):
Jan 14 18:03:17 dave-34819-4gb k3s[788250]: Trace[1605878027]: ---"Transaction committed" 637ms (18:03:00.698)
Jan 14 18:03:17 dave-34819-4gb k3s[788250]: Trace[1605878027]: [638.370347ms] [638.370347ms] END
Jan 14 18:03:17 dave-34819-4gb k3s[788250]: I0114 18:03:17.698914  788250 trace.go:205] Trace[1447201942]: "Update" url:/api/v1/namespaces/fleet-system/configmaps/fleet-agent-lock,user-agent:fleetagent/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.42.0.11 (14-Jan-2022 18:03:17.059) (total time: 638ms):
Jan 14 18:03:17 dave-34819-4gb k3s[788250]: Trace[1447201942]: ---"Object stored in database" 638ms (18:03:00.698)
Jan 14 18:03:17 dave-34819-4gb k3s[788250]: Trace[1447201942]: [638.932235ms] [638.932235ms] END
Jan 14 18:03:56 dave-34819-4gb k3s[788250]: E0114 18:03:56.170954  788250 upgradeaware.go:373] Error proxying data from client to backend: EOF
Jan 14 18:03:56 dave-34819-4gb k3s[788250]: E0114 18:03:56.175627  788250 upgradeaware.go:387] Error proxying data from backend to client: tls: use of closed connection

Additionally, note that I tested four times: the first time four EOF errors, the second time two, and the third and fourth times just one EOF error each, compared to prior testing where I'd get EOF errors on 8/10 or 9/10 of the copies.

@deniseschannon

/forwardport v2.6.3-patch1

@sowmyav27

PR: #36206

@sowmyav27

Based on offline discussion with @davidnuzik @ibuildthecloud @deniseschannon, closing this issue for the 2.5.12 release.
Note: We have not been able to fully repro/validate this issue. We will have the user test it using 2.5.12 and 2.6.3-patch1, and if any further fixes are needed, they will be addressed separately.
