The container name "/etcd-fix-perm" is already in use by container #2632
The other problem here is that it happens every time.
So this is to work around an API error? How does this reproduce every time? Is there an instance or configuration that triggers it consistently? It doesn't reproduce when I test things out, but if we have a solid reproducer, that would help (also for validating the fix).
This issue is easy to reproduce on hosts with poor disk performance. Once the first container-creation API response is incorrect for any reason, the subsequent retries are invalid because the container already exists. So my idea is to make container creation idempotent again.
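For illustration only, here is a minimal Go sketch of what idempotent creation could look like against the Docker Go SDK; `ensureContainer` and its parameters are hypothetical and this is not RKE's actual code:

```go
package docker

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// ensureContainer is a hypothetical helper: if a container with the given
// name already exists (for example because an earlier create call succeeded
// on the daemon but its response was lost), reuse it instead of failing.
func ensureContainer(ctx context.Context, cli *client.Client, name string,
	cfg *container.Config, hostCfg *container.HostConfig) (string, error) {
	// Inspect by name first; a missing container comes back as "not found".
	if existing, err := cli.ContainerInspect(ctx, name); err == nil {
		return existing.ID, nil // already created by an earlier attempt
	} else if !client.IsErrNotFound(err) {
		return "", err
	}

	// Note: older versions of the Docker Go SDK take one fewer argument here
	// (no platform parameter).
	created, err := cli.ContainerCreate(ctx, cfg, hostCfg, nil, nil, name)
	if err != nil {
		return "", err
	}
	return created.ID, nil
}
```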
Over the last month I have performed about 200 Rancher redeployments (with automation) on two separate clusters. I have several conclusions:
All it would take, IMHO, is to unify the timeout for certain operations and expose it as a single variable in the cluster file. "RKE is simply giving up too fast", but it doesn't have to. Please add some old boxes to your tests, for the sake of all of us; given enough patience I would always observe clusters going up. Someone, somewhere will eventually build a utility that measures how far-reaching the hardware bias really is: RAM speed, disk speed, CPU flags. It would make for an interesting experiment to keep track of this data and generate a crash report that can be pushed to Rancher for analysis. I wouldn't hesitate to send that back to Rancher if it meant someone is looking at it to improve the product. Peace
Hey there guys!
Hello @barnettZQG and @styk-tv. OK, here is the setup for 30+ VMs.
@superseb, this mostly answers your question about how often it is reproduced. Initially everything starts clear and bright; it starts normally.
It even managed to successfully start nginx-proxy on a few hosts: srv55, srv56, srv57, srv63
But then something happens on the other nodes. They complain about the Docker daemon:
And later the same hosts suddenly complain that they already have that container:
Sometimes one node can start successfully. All right, we've dismantled the cluster and tried again. The cluster has started, though none of the nodes are Ready. We hit exactly the same issue as described there.
@HectorB-2020 Thanks for the data, this will require some investigation. As this seems to be mostly I/O bound, what type of disk and how many IOPS do you have on the VMs? If there was a 100% reproduction rate on this issue, it would have a lot more attention, so we need to figure out how you can reproduce it so often. We probably also want Docker daemon debug logs from the time it reports the error, to see what it actually receives and responds with, and then decide how we can make RKE more resilient to this. Just continuing if the name is already in use is too easy, I think; we can't be sure that the existing container matches what we wanted. We probably want to delete it and let the loop try again (possibly with a sleep to give the machine some time to recover). Any data is appreciated.
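A rough Go sketch of that remove-and-retry idea, using the Docker Go SDK; the helper name, retry count, and sleep interval are made up for illustration and are not what RKE ships:

```go
package docker

import (
	"context"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/docker/docker/errdefs"
)

// createWithRetry sketches the "remove the leftover and retry with a pause"
// idea: if creation fails because the name is already taken, force-remove the
// half-created container, give the host a moment to recover, and try again.
func createWithRetry(ctx context.Context, cli *client.Client, name string,
	cfg *container.Config, hostCfg *container.HostConfig) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		created, err := cli.ContainerCreate(ctx, cfg, hostCfg, nil, nil, name)
		if err == nil {
			return created.ID, nil
		}
		lastErr = err
		if errdefs.IsConflict(err) {
			// A previous attempt got far enough for the daemon to register
			// the name; remove the leftover so the next attempt starts clean.
			_ = cli.ContainerRemove(ctx, name, types.ContainerRemoveOptions{Force: true})
		}
		// Pause so a slow or I/O-starved host has time to recover.
		time.Sleep(5 * time.Second)
	}
	return "", lastErr
}
```

This trades speed for safety: instead of trusting whatever container is occupying the name, it always rebuilds it from the desired configuration.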
@superseb, I tried to examine the Docker logs but didn't find anything suspicious there. At least I didn't find evidence of a stopped Docker daemon, if I interpret this error correctly. Surely you know better than I do, but in my view it's unlikely that disk I/O plays any significant role here. We have quite powerful VMs on a VMware enterprise hypervisor, and these VMs don't have anything else to run; they are purely empty, waiting for k8s to settle on them. To me it looks more like a bottleneck in the network or in our private registry running on Docker Registry v2. Another hunch is that the network topology could be a culprit. The fact that roughly the same nodes usually pull images successfully while the other set of nodes usually fails lets me surmise that the successful nodes share the same rack and ToR switch in that data center, while the less successful nodes reside in other racks. Unfortunately I'm not aware of the exact network topology.
One thing has been a mystery for me since the very beginning: why is it always ...
The reason I point out disk I/O is that it's the only way I have been able to reproduce the issue. If I restrict reads and writes on the disk that holds the Docker root directory, I can reproduce this fairly reliably. This is the log, with some added logging and sleeps, on a limited-I/O host:
This needs some more investigation, but we should be able to implement a solution.
RKE version: v1.3.0-rc1.0.20210503155726-c25848db1e86
Docker version: (`docker version`, `docker info` preferred) 19.03.5
Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred) CentOS Linux 7 (Core)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VirtualBox
cluster.yml file: Irrelevant to the issue.
Steps to Reproduce:
Results:
We can repeat this issue many times, each time with an `etcd-fix-perm` container. According to the log, there will be three retries to create the container. The first time, the API will report an error, but the container has already been created. Therefore, the second and third attempts will report an error.
From the perspective of the code implementation, could container creation tolerate this error? For example:
https://github.com/rancher/rke/blob/master/docker/docker.go#L442
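For reference, here is a hedged sketch (not the code behind the linked line) of what tolerating the conflict could look like: if the create call reports that the name is taken, adopt the container that the earlier, seemingly failed attempt already registered. `createOrAdopt` is a hypothetical name.

```go
package docker

import (
	"context"
	"strings"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// createOrAdopt is a hypothetical variant of the create call that treats the
// "name is already in use" conflict as success, on the theory that the first
// (seemingly failed) API call did create the container.
func createOrAdopt(ctx context.Context, cli *client.Client, name string,
	cfg *container.Config, hostCfg *container.HostConfig) (string, error) {
	created, err := cli.ContainerCreate(ctx, cfg, hostCfg, nil, nil, name)
	if err == nil {
		return created.ID, nil
	}
	// The daemon answers a duplicate name with a conflict error; matching on
	// the message keeps the check independent of how the SDK wraps it.
	if !strings.Contains(err.Error(), "is already in use by container") {
		return "", err
	}
	existing, inspectErr := cli.ContainerInspect(ctx, name)
	if inspectErr != nil {
		return "", err // report the original create error
	}
	return existing.ID, nil
}
```

Whether adopting the leftover container is safe is exactly the concern raised above: it might not match the desired configuration, which is why removing it and retrying (possibly with a pause) is the more conservative fix.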