Docker provider stops working when dockerd is restarted, requires traefik restart to fix #5833

Open
c2h5oh opened this issue Nov 14, 2019 · 14 comments

@c2h5oh

c2h5oh commented Nov 14, 2019

What did you do?

Updated Docker OS package, which caused Docker daemon (dockerd) to be restarted.
Simply restarting dockerd service has the same effect.

What did you expect to see?

Traefik Docker provider keeps working as it was prior to dockerd restart

What did you see instead?

Traefik Docker provider stops picking up changes until traefik is restarted.
This is likely related to #5589 - dockerd socket is bind-mounted into traefik container and when dockerd is restarted that socket is recreated. Edit: it looks like it isn't - see #5833 (comment)

Output of traefik version: (What version of Traefik are you using?)

Version:      2.0.2
Codename:     montdor
Go version:   go1.13.1
Built:        2019-10-09T19:26:05Z
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

Traefik running in a docker container, using docker provider

api:
  dashboard: false
  insecure: false

entryPoints:
  http:
    address: ":80"

  https:
    address: ":443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false

Our dockerd config contains "live-restore": true which results in dockerd restarts not causing container restarts.

@dduportal dduportal added this to issues in v2 via automation Nov 15, 2019
@dsseng
Contributor

dsseng commented Dec 8, 2019

I tried to reproduce this, and it looks like it's not caused by bind mounts. My Traefik is running on the host, connecting via the UNIX socket, and it fails when I restart the Docker service.

@ginkel

ginkel commented Oct 12, 2020

Is there a chance to at least detect this error condition via a failing health ping and/or Prometheus metric? ATM Traefik starts logging the following error, but no metrics hint at something being fundamentally broken:

{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 9.991645219s","providerName":"docker","time":"2020-10-12T13:54:14Z"}
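
Until Traefik reports this state itself, one option is to probe the socket from outside. A minimal sketch in Python, assuming the probe runs with the same view of the socket path as Traefik (for example a sidecar with the same mount); the function name `docker_socket_reachable` and the 2-second timeout are illustrative choices, not part of Traefik or Docker:

```python
import socket


def docker_socket_reachable(path: str = "/var/run/docker.sock",
                            timeout: float = 2.0) -> bool:
    """Return True if something accepts connections on the Unix socket at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing path, "Connection refused" on a stale socket file,
        # and timeouts.
        return False
    finally:
        sock.close()
```

A wrapper could exit non-zero when this returns False so that a monitoring agent or container health check flags the condition. It only confirms that something accepts connections on the socket, not that the Docker API behind it is healthy.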

@goshander

goshander commented Jun 8, 2021

Same issue
traefik 2.3.7
docker 20.10.5

"live-restore": true

@dsseng
Contributor

dsseng commented Jun 14, 2021

Could not reproduce. Traefik at v2.4 branch, Docker 20.10.2, no live restore.
Steps:

  1. docker run -l "traefik.http.routers.nginx.rule=Path(\"/\")" --name nginx -d nginx:alpine
  2. It works
  3. systemctl restart docker
  4. Error in Traefik log, not working (container stopped)
  5. docker start nginx
  6. Traefik picked up new data and page access works

Configuration:
api:
  dashboard: true
  insecure: true

log:
  level: DEBUG

entryPoints:
  http:
    address: ":3080"

  https:
    address: ":3443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"

@c2h5oh
Author

c2h5oh commented Jun 14, 2021

> no live restore

This is the key difference. With live restore off, restarting dockerd also restarts every container, including Traefik. Enable live restore to reproduce.

@dsseng
Contributor

dsseng commented Jun 14, 2021

Yes, but when I explored this back in 2019 it failed in this setup as well, so it seems to be partially fixed (and the issue probably needs renaming).

@dsseng
Contributor

dsseng commented Jun 14, 2021

Weird, that works as well. Maybe I have some difference in setup?

  1. docker run -l "traefik.http.routers.nginx.rule=Path(\"/\")" --name nginx -d nginx:alpine
  2. Service is available
  3. systemctl restart docker (it does a live restore: "There are old running containers, the network config will not take affect")
  4. Error, then the Docker config is picked up, then the service becomes available again

Log output:
ERRO[2021-06-14T16:34:04+03:00] Provider connection error unexpected EOF, retrying in 574.147282ms  providerName=docker
DEBU[2021-06-14T16:34:04+03:00] Provider connection established with docker 20.10.2 (API 1.41)  providerName=docker
DEBU[2021-06-14T16:34:04+03:00] Configuration received from provider docker: {"http":{"routers":{"nginx":{"service":"nginx","rule":"Path(\"/\")"}},"services":{"nginx":{"loadBalancer":{"servers":[{"url":"http://172.17.0.2:80"}],"passHostHeader":true}}}},"tcp":{},"udp":{}}  providerName=docker
INFO[2021-06-14T16:34:04+03:00] Skipping same configuration                   providerName=docker

@jhowe-uw

We see this issue in production with the following setup. Hopefully, this can help you replicate this issue.

Basically, we run Docker in live-restore mode so that we can update the Docker daemon without having to restart the running containers (especially the more complex Java-based containers, which have start-up penalties of several minutes).

We have Debian-based VMs running Docker, with Traefik running in a container to route traffic to the other containers on the server. Debian is configured to run unattended-upgrades between 12 AM and 1 AM, and unattended-upgrades is configured to auto-update the Docker components.

When unattended-upgrades runs and updates docker, we see the following errors in /var/log/traefik/traefik.log.json:

{"level":"error","msg":"Provider connection error unexpected EOF, retrying in 618.698409ms","providerName":"docker","time":"2021-10-26T07:39:57Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:39:58Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 751.724919ms","providerName":"docker","time":"2021-10-26T07:39:58Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:39:59Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 1.559652463s","providerName":"docker","time":"2021-10-26T07:39:59Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:00Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 903.644192ms","providerName":"docker","time":"2021-10-26T07:40:00Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:01Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 2.130022298s","providerName":"docker","time":"2021-10-26T07:40:01Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:03Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 5.36563643s","providerName":"docker","time":"2021-10-26T07:40:03Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:09Z"}
{"level":"error","msg":"Provider connection error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?, retrying in 4.481232746s","providerName":"docker","time":"2021-10-26T07:40:09Z"}
{"level":"error","msg":"Failed to retrieve information of the docker client and server host: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","providerName":"docker","time":"2021-10-26T07:40:13Z"}

Since this runs in the middle of the night, our Elastic stack picks up these errors via Filebeat and floods our inboxes with alert e-mails.
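
To keep a burst like the one above from producing one alert per line, consecutive errors can be collapsed before alerting. A rough sketch, assuming the JSON log format shown above (`level`, `providerName` and `time` fields); `provider_outages` is a hypothetical helper, not part of Traefik, Filebeat or the Elastic stack:

```python
import json
from typing import Iterable, Iterator


def provider_outages(lines: Iterable[str], provider: str = "docker") -> Iterator[dict]:
    """Yield one summary per run of consecutive error entries for `provider`."""
    run = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # lines that are not JSON are ignored and do not end a run
        is_error = (
            isinstance(entry, dict)
            and entry.get("level") == "error"
            and entry.get("providerName") == provider
        )
        if is_error:
            if run is None:
                run = {"first": entry.get("time"), "last": None, "count": 0}
            run["last"] = entry.get("time")
            run["count"] += 1
        elif run is not None:
            # Any other valid log entry closes the current run.
            yield run
            run = None
    if run is not None:
        yield run
```

Each yielded dictionary holds the first and last timestamp and the number of error lines in the run, so an alert rule can fire once per outage instead of once per retry.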

This post-docker upgrade state requires a restart of the traefik container.

I think a graceful resolution would be to try to reconnect to the Docker socket (unless this is a limitation of Docker and/or docker-compose).
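
The "retrying in ..." messages in the log above suggest that Traefik already retries the connection, so retrying by itself may not be the missing piece. As an illustration of what a reconnect loop can and cannot do, here is a sketch with capped exponential backoff that resolves the socket path again on every attempt; `connect_with_backoff` and its parameters are hypothetical and do not reflect Traefik's actual implementation:

```python
import random
import socket
import time
from typing import Callable


def connect_with_backoff(path: str,
                         attempts: int = 8,
                         base_delay: float = 0.5,
                         max_delay: float = 10.0,
                         sleep: Callable[[float], None] = time.sleep) -> socket.socket:
    """Dial the Unix socket at `path`, retrying with capped exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        # A fresh socket and a fresh path lookup on every attempt.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as exc:
            sock.close()
            last_error = exc
            if attempt < attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Jitter keeps several clients from retrying in lockstep.
                sleep(delay * random.uniform(0.5, 1.0))
    raise ConnectionError(
        f"could not connect to {path} after {attempts} attempts"
    ) from last_error
```

Re-dialing only helps if the path visible to the process leads to the socket the restarted daemon is listening on. If the path still resolves to a stale socket file, every attempt fails with "Connection refused" no matter how long the loop runs.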

Or would it be better practice to connect to /var/run/docker.sock via a proxy service? If so, would the socket proxy handle this edge case of Docker restarts with live-restore and a containerized Traefik instance?

We have the following environment:

libvirt/qemu running Debian 11.1 as the virtual machine.

Debian 11.1

ii  docker-ce                         5:20.10.10~3-0~debian-bullseye amd64        Docker: the open-source application container engine
ii  docker-ce-cli                     5:20.10.10~3-0~debian-bullseye amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras         5:20.10.10~3-0~debian-bullseye amd64        Rootless support for Docker.
ii  docker-scan-plugin                0.9.0~debian-bullseye          amd64        Docker scan cli plugin.

docker-compose version 1.29.2, build 5becea4c

# cat /etc/apt/sources.list.d/docker.list

# This file is managed by Puppet. DO NOT EDIT.
# docker
deb [arch=amd64] https://download.docker.com/linux/debian bullseye stable

# cat /etc/apt/apt.conf.d/50unattended-upgrades

// This file is managed by Puppet. DO NOT EDIT.
// Automatically upgrade packages from these (origin:archive) pairs
//
Unattended-Upgrade::Origins-Pattern {
	"origin=Debian,codename=${distro_codename}";
	"origin=Debian,codename=${distro_codename}-security";
	"origin=Debian,codename=${distro_codename}-updates";
	"origin=elastic,codename=stable";
	"origin=Docker,suite=${distro_codename}";
	"origin=Puppetlabs,codename=${distro_codename}";
};
...

# cat /etc/default/docker

# This file is managed by Puppet and local changes
# may be overwritten

OPTIONS=" -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --log-driver json-file --log-opt max-size=100m --log-opt max-file=5 --log-opt compress=true -G docker --experimental=true --metrics-addr=0.0.0.0:9323 --live-restore=true --no-new-privileges=true --exec-opt native.cgroupdriver=systemd --oom-score-adjust 0 --ip=127.0.0.1 --ipv6=false"

# This is also a handy place to tweak where Docker's temporary files go.
TMPDIR="/tmp/"

# cat /opt/docker-compose/traefik/docker-compose.yml

# PUPPET-MANAGED: All local edits will be over-written! Use gitlab for changes.
#
version: "3.8"

volumes:
  letsencrypt: {}

networks:
  default:
    external:
      name: external_web

services:
  traefik:
    image: traefik:v2.5.3
    restart: unless-stopped
    container_name: traefik

    ports:
      - "80:80"
      - "443:443"

    volumes:
      # Absolute
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /etc/apache2/ssl/:/etc/certs/:ro
      - /var/log/traefik:/var/log/traefik
      # Relative
      - ./dynamic-configs:/dynamic-configs:ro
      - ./users_credentials:/users_credentials:ro
      # Abstracted
      - letsencrypt:/letsencrypt

    command:
      # Enable Dashboard
      - --api.dashboard=true

      # Docker Provider
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
      - --providers.docker.network=external_web

      # Enable loading of dynamics configs
      - --providers.file.directory=/dynamic-configs
      - --providers.file.watch=true

      # Configure entrypoints
      - --entrypoints.web.address=:80
      - --entryPoints.websecure.address=:443
      # Redirect HTTP to HTTPS
      - --entrypoints.web.http.redirections.entryPoint.to=websecure
      - --entrypoints.web.http.redirections.entryPoint.scheme=https
      - --entrypoints.web.http.redirections.entrypoint.permanent=true
      # Bind Common Security Headers to HTTPS
      #   see env below:
      #     labels: traefik.http.middlewares.secure-headers.headers.*
      - --entrypoints.websecure.http.middlewares=secure-headers

      # LetsEncrypt Certificate settings
      #   Challenge types
      #     HTTP-01 challenge
      - --certificatesResolvers.letsencrypt.acme.httpChallenge=true
      - --certificatesResolvers.letsencrypt.acme.httpChallenge.entryPoint=web
      - --certificatesresolvers.letsencrypt.acme.email=support@*FQDN*
      - --certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json

      # Traefik Logging
      - --log.level=INFO
      - --log.filePath=/var/log/traefik/traefik.log.json
      - --log.format=json

      # Access Logs
      - --accesslog=true
      - --accesslog.filepath=/var/log/traefik/access.log.json
      - --accesslog.format=json
      - --accesslog.fields.defaultmode=keep
      - --accesslog.fields.headers.defaultmode=keep

      # Enable Prometheus Metrics
      - --metrics.prometheus=true

    labels:
      - "traefik.enable=true"
      #
      # MiddleWares
      #   Define Basic Auth for API/Dashboard
      - "traefik.http.middlewares.*ORG*-authenticated-users.basicauth.usersfile=/users_credentials"
      #
      #   Define Network CIDR Restrictions ( localhost, *ORG* CIDR )
      - "traefik.http.middlewares.limit-access-to-*ORG*-cidr.ipwhitelist.sourcerange=127.0.0.1/32,*N.N.N.N*/24"
      #
      #   Define Secure-Headers applied to all hosted compositions
      #     Set Strict-Transport-Security ( HSTS ) Header, expire 6 months
      - "traefik.http.middlewares.secure-headers.headers.stsSeconds=15768000"
      - "traefik.http.middlewares.secure-headers.headers.stsIncludeSubdomains=true"
      #     X-Content-Type-Options: nosniff
      - "traefik.http.middlewares.secure-headers.headers.contentTypeNosniff=true"
      #     X-XSS-Protection: 1; mode=block
      - "traefik.http.middlewares.secure-headers.headers.browserXssFilter=true"
      #     X-Frame-Options: SAMEORIGIN
      - "traefik.http.middlewares.secure-headers.headers.customFrameOptionsValue=SAMEORIGIN"
      #     Referrer-Policy: no-referrer-when-downgrade
      - "traefik.http.middlewares.secure-headers.headers.referrerPolicy=no-referrer-when-downgrade"
      #     Permissions-Policy:
      #       Replaces Feature-Policy
      - "traefik.http.middlewares.secure-headers.headers.customResponseHeaders.Permissions-Policy=payment=()"
      #     Secure Cookies must be set per container
      #       # Set-Cookie ^(.*)$ "$1; HttpOnly; Secure"
      #
      # Dashboard config
      - "traefik.http.routers.traefik.rule=Host(`traefik-dashboard.*FQDN*`)"
      - "traefik.http.routers.traefik.entrypoints=websecure"
      - "traefik.http.routers.traefik.tls=true"
      - "traefik.http.routers.traefik.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik.service=api@internal"
      - "traefik.http.routers.traefik.middlewares=*ORG*-authenticated-users"
      #
      # Prometheus/Metrics config
      - "traefik.http.routers.traefik-metrics.rule=Host(`metrics.*FQDN*`)"
      - "traefik.http.routers.traefik-metrics.entrypoints=websecure"
      - "traefik.http.routers.traefik-metrics.tls=true"
      - "traefik.http.routers.traefik-metrics.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-metrics.service=prometheus@internal"
      - "traefik.http.routers.traefik-metrics.middlewares=limit-access-to-*ORG*-cidr"

@lifeofguenter

Is this maybe fixable by connecting to the dockerd via IP instead of a socket file?

@c2h5oh
Author

c2h5oh commented Feb 19, 2023

> Is this maybe fixable by connecting to the dockerd via IP instead of a socket file?

Since the Docker daemon has no built-in authentication, by doing that you are essentially giving root access to any process that can connect to the localhost IP on that host.

@lifeofguenter

lifeofguenter commented Feb 19, 2023

> Since the Docker daemon has no built-in authentication, by doing that you are essentially giving root access to any process that can connect to the localhost IP on that host.

That's not true, you can authenticate via client/server TLS.

However, access is definitely less restricted than with a read-only mount. Maybe this is solvable in a different way. I don't think this is a Traefik issue, but rather one with dockerd and how mounts work.

@michaelkebe

We are experiencing the same issue with a setup like @jhowe-uw's (VMs, Docker with live-restore enabled).

We are updating docker via ansible and restarting the traefik container along with it. This is just a workaround and not very satisfactory.

We tried to use the ping healthcheck endpoint, but when it happens, the healthcheck is successful. So we are not able to detect the problem with the healthcheck.

I also think the problem is how docker handles the volume mounts.

@michaelkebe

I investigated a little bit further. It looks like docker handles file mounts and directory mounts differently.

I think I found a solution.
Don't mount docker.sock directly with:

/var/run/docker.sock:/var/run/docker.sock:ro

Instead, mount the /var/run directory that contains docker.sock:

/var/run:/var/run:ro

I cannot say much about the consequences of mounting the whole /var/run directory.

My test setup is here: https://gist.github.com/michaelkebe/a1fd64c5d31aaca5b092aa2b7409bf6d
watcher.sh simulates Traefik trying to connect to docker.sock.

  1. Clone the test setup:
    $ git clone https://gist.github.com/a1fd64c5d31aaca5b092aa2b7409bf6d.git
  2. Start the setup:
    $ cd a1fd64c5d31aaca5b092aa2b7409bf6d/
    $ docker compose up -d
  3. Attach to the container to watch the output
    $ docker attach a1fd64c5d31aaca5b092aa2b7409bf6d-dockersockwatcher-1
  4. In another terminal, upgrade, downgrade, or restart the Docker daemon as you wish while watching the output of watcher.sh.
    $ sudo apt install docker-ce=5:23.0.6-1~ubuntu.20.04~focal
    $ sudo apt install docker-ce=5:24.0.6-1~ubuntu.20.04~focal

If you want to try the different mount options, edit docker-compose.yml and bring the test setup up again.

With the non-working option (mounting the docker.sock file directly), watcher.sh outputs:

nc: unix connect failed: Connection refused
nc: /var/run/docker.sock: Connection refused
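
The difference between the two mount styles can be imitated without Docker. In this sketch a hard link stands in for the file bind mount, on the assumption that both keep pointing at the original socket inode after the daemon removes and recreates the socket. It is an analogy for illustration (Linux, both paths on the same filesystem), not a description of Docker's internals, and all names are made up:

```python
import os
import socket
import tempfile


def listen(path):
    """Start a listener on a Unix socket at `path` (the stand-in daemon)."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    return srv


def can_connect(path):
    """Return True if a client can connect to the Unix socket at `path`."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return True
    except OSError:
        return False
    finally:
        client.close()


def simulate_daemon_restart():
    """Compare a pinned socket file with a by-name lookup across a restart."""
    root = tempfile.mkdtemp()
    host_dir = os.path.join(root, "host")
    pinned_dir = os.path.join(root, "container")
    os.makedirs(host_dir)
    os.makedirs(pinned_dir)
    host_sock = os.path.join(host_dir, "docker.sock")
    pinned = os.path.join(pinned_dir, "docker.sock")

    daemon = listen(host_sock)
    # Stand-in for the file mount: a second name for the same inode.
    os.link(host_sock, pinned)
    before = {"file": can_connect(pinned), "directory": can_connect(host_sock)}

    # "Restart": the old listener goes away and the socket file is recreated.
    daemon.close()
    os.unlink(host_sock)
    daemon = listen(host_sock)
    after = {"file": can_connect(pinned), "directory": can_connect(host_sock)}
    daemon.close()
    return before, after
```

`simulate_daemon_restart()` should report both paths as reachable before the restart and only the `directory` path afterwards. The pinned file fails with "Connection refused", which is consistent with the watcher.sh output above.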

@michaelkebe

Here is a discussion about exactly this problem:

moby/moby#22789
