Docker provider stops working when dockerd is restarted, requires traefik restart to fix #5833

c2h5oh · 2019-11-14T19:33:20Z

What did you do?

Updated Docker OS package, which caused Docker daemon (dockerd) to be restarted.
Simply restarting dockerd service has the same effect.

What did you expect to see?

Traefik Docker provider keeps working as it was prior to dockerd restart

What did you see instead?

Traefik Docker provider stops picking up changes until traefik is restarted.
~~This is likely related to #5589 - dockerd socket is bind-mounted into traefik container and when dockerd is restarted that socket is recreated.~~ Edit: it looks like it isn't - see #5833 (comment)

Output of `traefik version`: (What version of Traefik are you using?)

Version:      2.0.2
Codename:     montdor
Go version:   go1.13.1
Built:        2019-10-09T19:26:05Z
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

Traefik running in a docker container, using docker provider

api:
  dashboard: false
  insecure: false

entryPoints:
  http:
    address: ":80"

  https:
    address: ":443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false

Our dockerd config contains `"live-restore": true`, so restarting dockerd does not restart the running containers.


dsseng · 2019-12-08T11:51:22Z

I tried to reproduce this, and it looks like it's not caused by bind mounts. My Traefik is running on the host, connecting via a UNIX socket, and it fails when I restart the Docker service.

ginkel · 2020-10-12T14:49:05Z

Is there a chance to at least detect this error condition via a failing health ping and/or Prometheus metric? ATM Traefik starts logging the following error, but no metrics hint at something being fundamentally broken:

{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 9.991645219s","providerName":"docker","time":"2020-10-12T13:54:14Z"}
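One way to make this state visible from the outside, sketched under assumptions: a small probe that sends `GET /_ping` (an endpoint of the Docker Engine API) over the Unix socket and reports whether it got an HTTP 200 back. The function name `docker_socket_alive` and its defaults are illustrative only and are not something Traefik provides. To be meaningful, the probe would have to run where it sees the same socket mount as Traefik.

```python
import socket


def docker_socket_alive(path="/var/run/docker.sock", timeout=2.0):
    """Return True if GET /_ping over the Unix socket at `path` answers HTTP 200.

    Hypothetical helper for illustration; not part of Traefik or Docker.
    """
    request = b"GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n"
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(path)  # fails with "Connection refused" on a stale socket file
            sock.sendall(request)
            status_line = sock.recv(1024).split(b"\r\n", 1)[0]
    except OSError:
        # Covers a missing file, a refused connection, and a timeout.
        return False
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == b"200"
```

A non-zero exit code from a wrapper around this function could feed a container health check or an external monitor, which is separate from Traefik's own ping endpoint.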

goshander · 2021-06-08T18:36:40Z

Same issue
traefik 2.3.7
docker 20.10.5

"live-restore": true

dsseng · 2021-06-14T07:49:32Z

Could not reproduce. Traefik at v2.4 branch, Docker 20.10.2, no live restore.
Steps:

1. `docker run -l "traefik.http.routers.nginx.rule=Path(\"/\")" --name nginx -d nginx:alpine`
2. It works.
3. `systemctl restart docker`
4. Error in Traefik log, not working (container stopped).
5. `docker start nginx`
6. Traefik picked up new data and page access works.

api:
  dashboard: true
  insecure: true

log:
  level: DEBUG

entryPoints:
  http:
    address: ":3080"

  https:
    address: ":3443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"

c2h5oh · 2021-06-14T13:17:43Z

> no live restore

This is the key difference. With live restore off, restarting dockerd also restarts every container, including traefik. Enable live restore to reproduce.

dsseng · 2021-06-14T13:31:43Z

Yes, but when I explored this back in 2019 it didn't work this way either, so it is partially fixed (and the issue probably needs renaming).

dsseng · 2021-06-14T13:41:28Z

Weird, that works as well. Maybe I have some difference in setup?

1. `docker run -l "traefik.http.routers.nginx.rule=Path(\"/\")" --name nginx -d nginx:alpine`
2. Service is available.
3. `systemctl restart docker` (it does live-reload, There are old running containers, the network config will not take affect)
4. Error, then Docker config picked up, then service becomes available.

ERRO[2021-06-14T16:34:04+03:00] Provider connection error unexpected EOF, retrying in 574.147282ms  providerName=docker
DEBU[2021-06-14T16:34:04+03:00] Provider connection established with docker 20.10.2 (API 1.41)  providerName=docker
DEBU[2021-06-14T16:34:04+03:00] Configuration received from provider docker: {"http":{"routers":{"nginx":{"service":"nginx","rule":"Path(\"/\")"}},"services":{"nginx":{"loadBalancer":{"servers":[{"url":"http://172.17.0.2:80"}],"passHostHeader":true}}}},"tcp":{},"udp":{}}  providerName=docker
INFO[2021-06-14T16:34:04+03:00] Skipping same configuration                   providerName=docker

jhowe-uw · 2021-10-26T20:53:03Z

We see this issue in production with the following setup. Hopefully, this can help you replicate this issue.

Basically, we run Docker in live-restore mode so that we can update the Docker daemon without having to restart the running containers (especially the more complex Java-based containers, which take several minutes to start).

We have Debian-based VMs running Docker, with Traefik running in a container to route traffic to the other containers on the server. Debian is configured to run unattended-upgrades between 12 AM and 1 AM, and unattended-upgrades is configured to auto-update the Docker components.

When unattended-upgrades runs and updates docker, we see the following errors in /var/log/traefik/traefik.log.json:

{"level":"error","msg":"Provider connection error unexpected EOF, retrying in 618.698409ms","providerName":"docker","time":"2021-10-26T07:39:57Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:39:58Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 751.724919ms","providerName":"docker","time":"2021-10-26T07:39:58Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:39:59Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 1.559652463s","providerName":"docker","time":"2021-10-26T07:39:59Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:00Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 903.644192ms","providerName":"docker","time":"2021-10-26T07:40:00Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:01Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 2.130022298s","providerName":"docker","time":"2021-10-26T07:40:01Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:03Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 5.36563643s","providerName":"docker","time":"2021-10-26T07:40:03Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:09Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 4.481232746s","providerName":"docker","time":"2021-10-26T07:40:09Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:13Z"}

Since this runs in the middle of the night, our Elastic stack picks up these errors via Filebeat and floods our e-mail with alerts.
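As a rough illustration of how the flood could be collapsed into a single alert, here is a minimal sketch that reads JSON log lines in the format shown above and summarizes the Docker provider errors. The function name `summarize_provider_errors` is hypothetical, and only the `level`, `providerName`, and `time` fields visible in the log excerpt are assumed.

```python
import json


def summarize_provider_errors(lines, provider="docker"):
    """Collapse a run of provider error lines into one summary dict, or None."""
    errors = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "error" and entry.get("providerName") == provider:
            errors.append(entry)
    if not errors:
        return None
    return {
        "count": len(errors),
        "first": errors[0].get("time"),
        "last": errors[-1].get("time"),
    }
```

Fed with the lines from the excerpt above, this would yield one record with the error count and the first and last timestamps instead of one alert per line.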

This post-docker upgrade state requires a restart of the traefik container.

I think a graceful resolution would be to try to reconnect to the Docker socket (unless this is a limitation of Docker and/or docker-compose).

Or would it be better practice to connect to /var/run/docker.sock via a proxy service? If so, would the socket proxy handle this edge case of Docker restarts with live-restore and a containerized Traefik instance?
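The log lines above suggest Traefik already retries with a growing, jittered delay. For reference, a minimal sketch of that kind of reconnect loop follows; it opens the socket path afresh on every attempt. The name `connect_with_backoff` and its parameters are assumptions for illustration and do not reflect Traefik's actual implementation. Note that retrying can only help if the path inside the container resolves to the recreated socket.

```python
import random
import socket
import time


def connect_with_backoff(path, attempts=8, base=0.5, cap=10.0,
                         connect=None, sleep=time.sleep):
    """Connect to a Unix socket, retrying with exponential backoff and jitter.

    `connect` and `sleep` are injectable so the loop can be exercised
    without a real daemon. Hypothetical helper, not Traefik code.
    """
    def default_connect(p):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(p)  # the path is looked up again on every attempt
        except OSError:
            sock.close()
            raise
        return sock

    connect = connect or default_connect
    delay = base
    for attempt in range(1, attempts + 1):
        try:
            return connect(path)
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay * random.uniform(0.5, 1.5))  # jittered wait
            delay = min(cap, delay * 2)
```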

We have the following environment:

libvirt/qemu running Debian 11.1 as the virtual machine.

Debian 11.1

ii  docker-ce                         5:20.10.10~3-0~debian-bullseye amd64        Docker: the open-source application container engine
ii  docker-ce-cli                     5:20.10.10~3-0~debian-bullseye amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras         5:20.10.10~3-0~debian-bullseye amd64        Rootless support for Docker.
ii  docker-scan-plugin                0.9.0~debian-bullseye          amd64        Docker scan cli plugin.

docker-compose version 1.29.2, build 5becea4c

# cat /etc/apt/sources.list.d/docker.list

# This file is managed by Puppet. DO NOT EDIT.
# docker
deb [arch=amd64] https://download.docker.com/linux/debian bullseye stable

# cat /etc/apt/apt.conf.d/50unattended-upgrades

// This file is managed by Puppet. DO NOT EDIT.
// Automatically upgrade packages from these (origin:archive) pairs
//
Unattended-Upgrade::Origins-Pattern {
	"origin=Debian,codename=${distro_codename}";
	"origin=Debian,codename=${distro_codename}-security";
	"origin=Debian,codename=${distro_codename}-updates";
	"origin=elastic,codename=stable";
	"origin=Docker,suite=${distro_codename}";
	"origin=Puppetlabs,codename=${distro_codename}";
};
...

# cat /etc/default/docker

# This file is managed by Puppet and local changes
# may be overwritten

OPTIONS=" -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --log-driver json-file --log-opt max-size=100m --log-opt max-file=5 --log-opt compress=true -G docker --experimental=true --metrics-addr=0.0.0.0:9323 --live-restore=true --no-new-privileges=true --exec-opt native.cgroupdriver=systemd --oom-score-adjust 0 --ip=127.0.0.1 --ipv6=false"

# This is also a handy place to tweak where Docker's temporary files go.
TMPDIR="/tmp/"

# cat /opt/docker-compose/traefik/docker-compose.yml

# PUPPET-MANAGED: All local edits will be over-written! Use gitlab for changes.
#
version: "3.8"

volumes:
  letsencrypt: {}

networks:
  default:
    external:
      name: external_web

services:
  traefik:
    image: traefik:v2.5.3
    restart: unless-stopped
    container_name: traefik

    ports:
      - "80:80"
      - "443:443"

    volumes:
      # Absolute
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /etc/apache2/ssl/:/etc/certs/:ro
      - /var/log/traefik:/var/log/traefik
      # Relative
      - ./dynamic-configs:/dynamic-configs:ro
      - ./users_credentials:/users_credentials:ro
      # Abstracted
      - letsencrypt:/letsencrypt

    command:
      # Enable Dashboard
      - --api.dashboard=true

      # Docker Provider
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
      - --providers.docker.network=external_web

      # Enable loading of dynamics configs
      - --providers.file.directory=/dynamic-configs
      - --providers.file.watch=true

      # Configure entrypoints
      - --entrypoints.web.address=:80
      - --entryPoints.websecure.address=:443
      # Redirect HTTP to HTTPS
      - --entrypoints.web.http.redirections.entryPoint.to=websecure
      - --entrypoints.web.http.redirections.entryPoint.scheme=https
      - --entrypoints.web.http.redirections.entrypoint.permanent=true
      # Bind Common Security Headers to HTTPS
      #   see env below:
      #     labels: traefik.http.middlewares.secure-headers.headers.*
      - --entrypoints.websecure.http.middlewares=secure-headers

      # LetsEncrypt Certificate settings
      #   Challenge types
      #     HTTP-01 challenge
      - --certificatesResolvers.letsencrypt.acme.httpChallenge=true
      - --certificatesResolvers.letsencrypt.acme.httpChallenge.entryPoint=web
      - --certificatesresolvers.letsencrypt.acme.email=support@*FQDN*
      - --certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json

      # Traefik Logging
      - --log.level=INFO
      - --log.filePath=/var/log/traefik/traefik.log.json
      - --log.format=json

      # Access Logs
      - --accesslog=true
      - --accesslog.filepath=/var/log/traefik/access.log.json
      - --accesslog.format=json
      - --accesslog.fields.defaultmode=keep
      - --accesslog.fields.headers.defaultmode=keep

      # Enable Prometheus Metrics
      - --metrics.prometheus=true

    labels:
      - "traefik.enable=true"
      #
      # MiddleWares
      #   Define Basic Auth for API/Dashboard
      - "traefik.http.middlewares.*ORG*-authenticated-users.basicauth.usersfile=/users_credentials"
      #
      #   Define Network CIDR Restrictions ( localhost, *ORG* CIDR )
      - "traefik.http.middlewares.limit-access-to-*ORG*-cidr.ipwhitelist.sourcerange=127.0.0.1/32,*N.N.N.N*/24"
      #
      #   Define Secure-Headers applied to all hosted compositions
      #     Set Strict-Transport-Security ( HSTS ) Header, expire 6 months
      - "traefik.http.middlewares.secure-headers.headers.stsSeconds=15768000"
      - "traefik.http.middlewares.secure-headers.headers.stsIncludeSubdomains=true"
      #     X-Content-Type-Options: nosniff
      - "traefik.http.middlewares.secure-headers.headers.contentTypeNosniff=true"
      #     X-XSS-Protection: 1; mode=block
      - "traefik.http.middlewares.secure-headers.headers.browserXssFilter=true"
      #     X-Frame-Options: SAMEORIGIN
      - "traefik.http.middlewares.secure-headers.headers.customFrameOptionsValue=SAMEORIGIN"
      #     Referrer-Policy: no-referrer-when-downgrade
      - "traefik.http.middlewares.secure-headers.headers.referrerPolicy=no-referrer-when-downgrade"
      #     Permissions-Policy:
      #       Replaces Feature-Policy
      - "traefik.http.middlewares.secure-headers.headers.customResponseHeaders.Permissions-Policy=payment=()"
      #     Secure Cookies must be set per container
      #       # Set-Cookie ^(.*)$ "$1; HttpOnly; Secure"
      #
      # Dashboard config
      - "traefik.http.routers.traefik.rule=Host(`traefik-dashboard.*FQDN*`)"
      - "traefik.http.routers.traefik.entrypoints=websecure"
      - "traefik.http.routers.traefik.tls=true"
      - "traefik.http.routers.traefik.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik.service=api@internal"
      - "traefik.http.routers.traefik.middlewares=*ORG*-authenticated-users"
      #
      # Prometheus/Metrics config
      - "traefik.http.routers.traefik-metrics.rule=Host(`metrics.*FQDN*`)"
      - "traefik.http.routers.traefik-metrics.entrypoints=websecure"
      - "traefik.http.routers.traefik-metrics.tls=true"
      - "traefik.http.routers.traefik-metrics.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-metrics.service=prometheus@internal"
      - "traefik.http.routers.traefik-metrics.middlewares=limit-access-to-*ORG*-cidr"

lifeofguenter · 2023-02-19T00:29:20Z

Is this maybe fixable by connecting to the dockerd via IP instead of a socket file?

c2h5oh · 2023-02-19T19:05:00Z

> Is this maybe fixable by connecting to the dockerd via IP instead of a socket file?

Since the Docker daemon has no built-in authentication, doing that essentially gives root access to any process that can connect to the localhost IP on that host.

lifeofguenter · 2023-02-19T19:54:10Z

> Since docker daemon has not built-in authentication by doing that you are essentially giving root access to any process that can connect to localhost IP on that host.

That's not true; you can authenticate via server-client TLS.

However, there is definitely less authorization than with an ro mount. Maybe it is solvable a different way. I don't think this is a Traefik issue, but rather one with dockerd and how mounts work.

michaelkebe · 2023-09-08T09:27:37Z

We are experiencing the same issue with a setup like @jhowe-uw's (VMs, Docker with live-restore enabled).

We are updating docker via ansible and restarting the traefik container along with it. This is just a workaround and not very satisfactory.

We tried to use the ping healthcheck endpoint, but when it happens, the healthcheck is successful. So we are not able to detect the problem with the healthcheck.

I also think the problem is how docker handles the volume mounts.

michaelkebe · 2023-09-08T11:32:41Z

I investigated a little bit further. It looks like docker handles file mounts and directory mounts differently.

I think I found a solution.
Don't mount docker.sock directly with:

/var/run/docker.sock:/var/run/docker.sock:ro

Instead, mount the /var/run directory that contains docker.sock:

/var/run:/var/run:ro

I cannot say much about the consequences of mounting the whole /var/run directory.
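A plausible explanation for the difference, offered as an analogy rather than a confirmed diagnosis: a file bind mount stays tied to the file that existed when the container started, much like an open file descriptor, while a directory mount lets the name be looked up again. The sketch below imitates this with a regular file standing in for the socket; the function name `demo_pinned_vs_path` is made up for illustration and no Docker is involved.

```python
import os
import tempfile


def demo_pinned_vs_path(directory):
    """Show that a handle opened before a delete-and-recreate sees the old file."""
    path = os.path.join(directory, "docker.sock")  # regular file standing in for the socket

    with open(path, "w") as f:
        f.write("old daemon")

    pinned = open(path)  # stands in for the file bind mount: tied to the current file

    # Simulate the daemon restart: the path is removed and created again.
    os.unlink(path)
    with open(path, "w") as f:
        f.write("new daemon")

    via_pinned = pinned.read()  # still the old, now unlinked, file
    pinned.close()

    with open(path) as f:  # stands in for the directory mount: fresh lookup by name
        via_path = f.read()

    return via_pinned, via_path


with tempfile.TemporaryDirectory() as d:
    print(demo_pinned_vs_path(d))  # ('old daemon', 'new daemon')
```

The old handle keeps pointing at the deleted file, which matches the "Connection refused" symptom seen with the direct docker.sock mount, while a lookup by name through the directory finds the recreated file.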

My test setup looks like this: https://gist.github.com/michaelkebe/a1fd64c5d31aaca5b092aa2b7409bf6d
The watcher.sh script simulates Traefik trying to connect to docker.sock.

Clone the test setup:
$ git clone https://gist.github.com/a1fd64c5d31aaca5b092aa2b7409bf6d.git
Start the setup:
$ cd a1fd64c5d31aaca5b092aa2b7409bf6d/
$ docker compose up -d
Attach to the container to watch the output:
$ docker attach a1fd64c5d31aaca5b092aa2b7409bf6d-dockersockwatcher-1
In another terminal, upgrade, downgrade, or restart the Docker daemon as you wish while watching the output of watcher.sh:
$ sudo apt install docker-ce=5:23.0.6-1~ubuntu.20.04~focal
$ sudo apt install docker-ce=5:24.0.6-1~ubuntu.20.04~focal

If you want to try the different mount options, edit the docker-compose.yml and bring the test setup up again.

With the non-working option (mounting the docker.sock file directly), watcher.sh outputs:

nc: unix connect failed: Connection refused
nc: /var/run/docker.sock: Connection refused

michaelkebe · 2023-09-08T11:40:10Z

Here is a discussion about exactly this problem:

moby/moby#22789

traefiker added the status/0-needs-triage label Nov 14, 2019

dduportal added the area/provider/docker, kind/enhancement, and priority/P3 labels and removed the status/0-needs-triage label Nov 15, 2019

dduportal added this to issues in v2 via automation Nov 15, 2019
