This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

File descriptors increasing - requires docker daemon restart and sometimes a node restart #2073

Closed
mkboudreau opened this issue Feb 22, 2018 · 11 comments

@mkboudreau

Environment Details:

Docker UCP with vSphere Storage for Docker volume driver.

  • We are updated to 0.21.1 on the volume driver and UCP 2.2.5

Steps to Reproduce:
Intermittent... Still don't know how to reproduce it. :(

Expected Result:

Services to relocate from node to node as designed without bringing down the node

Actual Result:

Every 2-10 days, when one of our containers is rebuilt and its service updated, file descriptors start increasing at a consistent rate of around 200 per hour.

Most of the time this problem does not occur; I've even tried to trigger it, without success. The underlying trigger is not known yet, but we strongly suspect the bug is in the vmware vsphere volume driver.

Triage:

Here is what we have observed after some underlying issue occurs:

  • The issue always begins with a docker service update or docker stack deploy which causes a new container to be brought up to replace an out-of-date container.
  • The issue has only occurred with containers using vsphere volumes
  • The file descriptors are all owned by the docker daemon
  • Initially, all containers that use vsphere volumes on the affected node start failing and stop responding to requests. All other (non-vsphere) containers usually remain operational for a time, until the docker daemon itself becomes non-responsive.
  • Sometimes a restart of the docker daemon works.
  • Sometimes a restart of the docker daemon on all worker nodes is required.
  • Sometimes the vmware vsphere process is still around after shutting down docker and it needs to be killed before bringing docker back up.
  • We have had a ticket open with Docker for some time. Getting nowhere! We very much suspect this is related to vsphere.
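
If it helps with triage, one rough way to watch that growth rate is to sample the daemon's descriptor count periodically (a minimal sketch; it assumes the daemon process is named dockerd and that pidof is available):

# Sample the Docker daemon's open-descriptor count every 5 minutes
while true; do
    echo "$(date -Is)  $(ls /proc/$(pidof dockerd)/fd | wc -l)"
    sleep 300
done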

@bteichner has been the point of contact for these issues with vmware

@govint
Contributor

govint commented Feb 22, 2018

The file descriptors being owned by the Docker daemon would make this a Docker-side issue. The vSphere volume driver is a separate process altogether, and if it were leaking descriptors they would show up against its own process; AFAIK the plugin only opens the VMCI socket to make calls into ESX, and that path has been in use all along.

Can we try an lsof -p <pid> or ls -l /proc/<pid>/fd?
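
For example, something like this against the daemon's PID (a sketch; assumes the daemon process is dockerd):

# Count and peek at the descriptors held by the Docker daemon
DOCKER_PID=$(pidof dockerd)
lsof -p "$DOCKER_PID" | wc -l
ls -l /proc/"$DOCKER_PID"/fd | head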

For logs, please set "debug" in the plugin config file and restart the plugin.

@shuklanirdesh82
Contributor

shuklanirdesh82 commented Feb 22, 2018

Hey @govint

#2073 (comment)
For logs, please set "debug" in the plugin config file and restart the plugin.

Is there any regression that you know of? A plugin restart is not able to parse the plugin config file.

@mkboudreau
Author

mkboudreau commented Feb 22, 2018

Regarding logs... we've been struggling to get it working from the config file, even though we've been told to do it that way. The config.json-template file seems to set the VDVS_LOG_LEVEL env var and it appears that if that var is set, then the log level from the config file would never be considered (see config.go). Am I understanding this correctly?

config.go:
https://github.com/vmware/vsphere-storage-for-docker/blob/master/client_plugin/utils/config/config.go

config.json-template:
https://github.com/vmware/vsphere-storage-for-docker/blob/master/plugin_dockerbuild/config.json-template
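
One quick way to check which setting is actually in effect (a sketch; assumes the managed plugin is installed under the alias vsphere):

# If VDVS_LOG_LEVEL shows up in the plugin's settings, the log level in the
# config file would presumably be ignored, per the reading of config.go above.
docker plugin inspect vsphere --format '{{json .Settings.Env}}'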

@bteichner

Enabled debug logging with the following commands:

docker plugin disable -f vsphere
docker plugin set vsphere VDVS_LOG_LEVEL="debug"
docker plugin enable vsphere
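
A possible sanity check afterwards (assuming the plugin alias is vsphere and that the plugin log lands at the usual host path, which may differ per install):

# Confirm the plugin came back up enabled
docker plugin ls
# Watch for debug-level entries (log path is an assumption; adjust for your install)
tail -f /var/log/vsphere-storage-for-docker.log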

@govint
Contributor

govint commented Mar 8, 2018

@mkboudreau, @bteichner, can you confirm what the issue is with the docker daemon, given the observed behavior? Do you wish to keep this issue open here, as we presently don't see any bug in the volume plugin to fix?

@mkboudreau
Author

Since we turned debug logging on, we have not encountered the issue. It has always taken between 2 and 15 days between incidents. Please keep the issue open a little longer. I would really like to see this happen while we have debug logging turned on.

@govint
Contributor

govint commented Mar 8, 2018

@mkboudreau, sure no problem.

@govint
Contributor

govint commented Mar 27, 2018

@mkboudreau, can you update on this issue? Can we close it if there aren't any more updates?

@mkboudreau
Author

Thank you for following up. Go ahead and close and I can always reopen it if needed.

@govint
Contributor

govint commented Mar 27, 2018

Closing.

@govint govint closed this as completed Mar 27, 2018
@mkboudreau
Author

Hi @govint

Today we started having issues where the vsphere volume driver was timing out on every docker volume operation. We did not see file descriptors pile up in the docker daemon like we had in the past, only timeouts. I'm guessing this might be something that was recently fixed in Docker EE, as mentioned in the latest release notes.

We took a look at the processes using the most file descriptors on the worker nodes and, sure enough, the vsphere and ucp-agent processes were holding a lot of file descriptors.

[root@ourhost ~]# for d in `ls -d /proc/[0-9]*`; do   echo "`ls $d/fd | wc -l`          $d"; done | sort -n | tail -10
ls: cannot access /proc/29810/fd: No such file or directory
35          /proc/24989
36          /proc/1
36          /proc/16187
38          /proc/25383
54          /proc/1624
90          /proc/1479
114          /proc/24192
269          /proc/747
22963          /proc/16204
30319          /proc/24232
[root@ourhost ~]# ps -ef | grep 24232
root     24232 24215  0 Apr02 ?        00:29:04 /usr/bin/vsphere-storage-for-docker --config /etc/vsphere-storage-for-docker.conf
root     31275 29249  0 15:06 pts/0    00:00:00 grep --color=auto 24232
[root@ourhost ~]# ps -ef | grep 16204
root     16204 16187  0 Apr02 ?        03:16:50 /bin/ucp-agent proxy --disk-usage-interval 2h --metrics-scrape-interval 1m
root     31408 29249  0 15:06 pts/0    00:00:00 grep --color=auto 16204
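
If it's useful for triage, those descriptors can be grouped by what they point at (a sketch; 24232 is the vsphere plugin PID from the listing above):

# Classify the plugin's open descriptors by target type (socket, pipe, file, ...)
ls -l /proc/24232/fd | tail -n +2 | awk '{print $NF}' | cut -d: -f1 | sort | uniq -c | sort -rn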
