
hyperctl run fails consistently with certain images on random k8s nodes #1081

Open
Zhongtang opened this Issue May 23, 2018 · 7 comments

@Zhongtang
Contributor

Zhongtang commented May 23, 2018

What happened:

On some of the k8s nodes, we are observing hyperctl run failing consistently with the following error message when running with a docker image our user built on top of the docker.io/gradle:jdk8 image:

hyperctl ERROR: Error from daemon's response: finished with errors: map[XXXXXX:Create new container failed: Error: ]

This is causing the workers to fail to launch builds using this image.

What you expected to happen:

hyperctl should be able to launch a hyperd container from any normal Docker image

How to reproduce it:

  • When a node is in the bad state, it's consistently reproducible (assuming one already knows the gradle docker image name) with: sudo hyperctl run -t <IMAGE> date
  • But how a node gets into the state is unknown to us, and we don't have a way to reproduce this part.
Zhongtang commented May 23, 2018

What we know so far

  • The same image works on some of the other nodes
  • Other images work on the same node
  • The issue has nothing to do with the command we run inside the container, e.g., swapping date for pwd (or anything else) produces the same error
  • The issue is not Docker-related: docker run with the same image on the same node works fine
  • Even after the image is rmi'ed and pulled to the node again, the same error occurs
  • Historically, restarting hyperd would fix the problem, but the problem would come back after the restart (in our case the node has been up for 7 days)

@Zhongtang Zhongtang closed this May 23, 2018

@Zhongtang Zhongtang reopened this May 23, 2018

@Zhongtang Zhongtang added the bug label May 23, 2018

Zhongtang commented May 23, 2018

Update:

We are seeing the following log every time we hyperctl run with the gradle image on the broken nodes:

May 22 19:00:01 <HOSTNAME> kvm: 10 guests now active
May 22 19:00:01 <HOSTNAME> systemd: Started Session 12911 of user root.
May 22 19:00:01 <HOSTNAME> systemd: Starting Session 12911 of user root.
May 22 19:00:01 <HOSTNAME> systemd: Started Session 12912 of user root.
May 22 19:00:01 <HOSTNAME> systemd: Starting Session 12912 of user root.
May 22 19:00:03 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x140
May 22 19:00:03 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x1c9
May 22 19:00:03 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x1a6
May 22 19:00:03 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x1a7
May 22 19:00:03 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x3f6
May 22 19:00:04 <HOSTNAME> kernel: kvm [19358]: vcpu0 unhandled rdmsr: 0x34
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Mounting V5 Filesystem
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Ending clean mount
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Unmounting Filesystem
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Mounting V5 Filesystem
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Ending clean mount
May 22 19:00:04 <HOSTNAME> kernel: XFS (dm-30): Unmounting Filesystem
May 22 19:00:05 <HOSTNAME> kernel: buffer_io_error: 33878 callbacks suppressed
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 0, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 1, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 2, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 3, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 4, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 5, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 16, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 32, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 48, lost async page write
May 22 19:00:05 <HOSTNAME> kernel: Buffer I/O error on dev dm-30, logical block 64, lost async page write
May 22 19:00:06 <HOSTNAME> kernel: VFS: Dirty inode writeback failed for block device dm-30 (err=-5).
May 22 19:00:06 <HOSTNAME> kernel: XFS (dm-31): Mounting V5 Filesystem
May 22 19:00:06 <HOSTNAME> kernel: XFS (dm-31): Ending clean mount
May 22 19:00:06 <HOSTNAME> kernel: XFS (dm-31): Unmounting Filesystem
May 22 19:00:06 <HOSTNAME> kernel: hyper0: port 10(vm-cHFFYYwNP0) entered blocking state
May 22 19:00:06 <HOSTNAME> kernel: hyper0: port 10(vm-cHFFYYwNP0) entered disabled state
May 22 19:00:06 <HOSTNAME> kernel: device vm-cHFFYYwNP0 entered promiscuous mode
May 22 19:00:06 <HOSTNAME> kernel: hyper0: port 10(vm-cHFFYYwNP0) entered blocking state
May 22 19:00:06 <HOSTNAME> kernel: hyper0: port 10(vm-cHFFYYwNP0) entered forwarding state
May 22 19:00:07 <HOSTNAME> hyperd: E0522 19:00:07.021639    1258 container.go:251] Pod[gradle-latest-6110172325] Con[1667b9497a53(gradle-latest-6110172325)] failed to start container: Create new container failed: Error:
May 22 19:00:07 <HOSTNAME> hyperd: E0522 19:00:07.021688    1258 provision.go:501] Pod[gradle-latest-6110172325] error during start all containers: finished with errors: map[1667b9497a53d41ab3f526735f5793505e1554467529e8501ec95dbdb07bd853:Create new container failed: Error: ]
May 22 19:00:07 <HOSTNAME> hyperd: E0522 19:00:07.021728    1258 server.go:170] Handler for POST /v1.0.0/pod/start returned error: finished with errors: map[1667b9497a53d41ab3f526735f5793505e1554467529e8501ec95dbdb07bd853:Create new container failed: Error: ]

In comparison, the following log appears with a working hyperctl run on a healthy node:

May 22 20:54:01 <HOSTNAME> kvm: 3 guests now active
May 22 20:54:01 <HOSTNAME> systemd: Started Session 12781 of user root.
May 22 20:54:01 <HOSTNAME> systemd: Starting Session 12781 of user root.
May 22 20:54:03 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x140
May 22 20:54:03 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x1c9
May 22 20:54:03 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x1a6
May 22 20:54:03 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x1a7
May 22 20:54:04 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x3f6
May 22 20:54:05 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x34
May 22 20:54:05 <HOSTNAME> kernel: kvm [43450]: vcpu0 unhandled rdmsr: 0x606
May 22 20:54:05 <HOSTNAME> kernel: XFS (dm-15): Mounting V5 Filesystem
May 22 20:54:05 <HOSTNAME> kernel: XFS (dm-15): Ending clean mount
May 22 20:54:06 <HOSTNAME> kernel: XFS (dm-15): Unmounting Filesystem
May 22 20:54:06 <HOSTNAME> kernel: XFS (dm-15): Mounting V5 Filesystem
May 22 20:54:06 <HOSTNAME> kernel: XFS (dm-15): Ending clean mount
May 22 20:54:06 <HOSTNAME> kernel: XFS (dm-15): Unmounting Filesystem
May 22 20:54:07 <HOSTNAME> kernel: XFS (dm-16): Mounting V5 Filesystem
May 22 20:54:07 <HOSTNAME> kernel: XFS (dm-16): Ending clean mount
May 22 20:54:08 <HOSTNAME> kernel: XFS (dm-16): Unmounting Filesystem
May 22 20:54:08 <HOSTNAME> kernel: hyper0: port 3(vm-HnkfROjBG0) entered blocking state
May 22 20:54:08 <HOSTNAME> kernel: hyper0: port 3(vm-HnkfROjBG0) entered disabled state
May 22 20:54:08 <HOSTNAME> kernel: device vm-HnkfROjBG0 entered promiscuous mode
May 22 20:54:09 <HOSTNAME> kernel: hyper0: port 3(vm-HnkfROjBG0) entered blocking state
May 22 20:54:09 <HOSTNAME> kernel: hyper0: port 3(vm-HnkfROjBG0) entered forwarding state
Zhongtang commented May 23, 2018

Update:

It seems the parent image, gradle:jdk8 (docker.io/gradle@sha256:5e201ab90d57682c1d382727e82b9f1a8ccf407643a62b7dae621dc8fea378d0), fails with the same error.

However, the grandparent, openjdk:8-jdk (docker.io/openjdk@sha256:49d76c26366af08b15dda9e5c1288b2bdf02c7bc1699a623cfd0d17b1907c2e2), works:

(I'm assuming this is the right Dockerfile for gradle:jdk8 https://github.com/keeganwitt/docker-gradle/blob/64a348e79cbe0bc8acb9da9062f75aca02bf3023/jdk8/Dockerfile)

$ sudo hyperctl run -t --rm gradle:jdk8 date
hyperctl ERROR: Error from daemon's response: finished with errors: map[854decbd917b220a39567a08c190ea4c5f0d5f939a4b114dd8f54c82c00fce0e:Create new container failed: Error: ]
$ 
$ sudo hyperctl run -t --rm openjdk:8-jdk date
Wed May 23 18:34:48 UTC 2018
Zhongtang commented May 23, 2018

Tried to rebuild an image on top of the working grandparent openjdk:8-jdk, copying the entire Dockerfile https://github.com/keeganwitt/docker-gradle/blob/64a348e79cbe0bc8acb9da9062f75aca02bf3023/jdk8/Dockerfile#L1-L38. The new image had the same issue, so it appears something defined in those lines is causing the problem.

Will try to binary search and figure out exactly which line is causing this issue.
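For reference, the bisection step can be sketched as a plain prefix search. This is a hypothetical helper, not anything in hyperd: `image_fails` stands in for "docker build the first n instruction lines, then try `hyperctl run` on the result".

```python
# Sketch of the planned bisection. Assumes failure is monotone: once a
# prefix of the Dockerfile fails, every longer prefix fails too, and a
# trivially short prefix (just the FROM line) passes.

def first_failing_line(lines, image_fails):
    """Index of the first Dockerfile line whose prefix fails, else None.

    `image_fails` is a hypothetical predicate: build an image from the
    given instruction lines and return True if `hyperctl run` errors out.
    """
    if not image_fails(lines):
        return None                     # full Dockerfile works
    lo, hi = 0, len(lines)              # lines[:lo] passes, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if image_fails(lines[:mid]):
            hi = mid                    # culprit is within the first mid lines
        else:
            lo = mid                    # first mid lines are clean
    return hi - 1                       # index of the culprit instruction
```

With a ~38-line Dockerfile this needs about six builds instead of one per line.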

Zhongtang commented May 23, 2018

In the meantime, we checked disk space with df -h and sudo hyperctl info. Everything seems normal on the disk space side (except that the Data Space Available known to hyperd seems to be larger than the actual available disk space reported by df):

$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/sys-root          424G  334G   69G  83% /
devtmpfs                       24G     0   24G   0% /dev
tmpfs                          24G   24K   24G   1% /dev/shm
tmpfs                          24G   12M   24G   1% /run
tmpfs                          24G     0   24G   0% /sys/fs/cgroup
/dev/sda2                     488M  385M   68M  85% /boot
/dev/mapper/sys-home          9.8G   45M  9.2G   1% /home
tmpfs                         4.8G     0  4.8G   0% /run/user/0
tmpfs                         4.8G     0  4.8G   0% /run/user/79329
gradle-jdk8-0352124494-hosts  1.0M  4.0K 1020K   1% /var/lib/hyper/hosts/gradle-jdk8-0352124494

$ sudo hyperctl info
Images: 20
Containers: 1
PODs: 18
Storage Driver: devicemapper
  Pool Name: docker-253:0-8390668-pool
  Pool Blocksize: 65.54 kB
  Base Device Size: 21.47 GB
  Backing Filesystem: xfs
  Data file: /dev/loop0
  Metadata file: /dev/loop1
  Data Space Used: 104.7 GB
  Data Space Total: 455.3 GB
  Data Space Available: 96.46 GB
  Metadata Space Used: 101 MB
  Metadata Space Total: 2.147 GB
  Metadata Space Available: 2.047 GB
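One possible explanation for the mismatch (not confirmed): with Data file: /dev/loop0 and Metadata file: /dev/loop1, the devicemapper pool is running in loopback mode, where the pool is backed by sparse files, so its advertised capacity (455.3 GB total, 96.46 GB available) can exceed what the backing filesystem can actually deliver (69G free on /). When the backing filesystem can't supply blocks, writes to the thin devices fail with I/O errors, which would match the buffer_io_error messages above. A minimal Python sketch normalizing the two figures for comparison (the unit tables are assumptions: hyperctl info appears to print decimal units, df -h binary ones):

```python
# Normalize "available space" figures from the two tools so they can be
# compared directly. hyperctl info prints decimal units ("96.46 GB"),
# while df -h prints single-letter binary suffixes ("69G").

UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
DF_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def hyperctl_bytes(s):
    """Parse a `hyperctl info` size like '96.46 GB' into bytes."""
    value, unit = s.split()
    return float(value) * UNITS[unit]

def df_bytes(s):
    """Parse a `df -h` size like '69G' into bytes."""
    return float(s[:-1]) * DF_UNITS[s[-1]]

pool_avail = hyperctl_bytes("96.46 GB")   # from `hyperctl info`
fs_avail = df_bytes("69G")                # from `df -h` on /
# The pool claims more free space than the backing filesystem has left:
assert pool_avail > fs_avail
```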
Zhongtang commented May 23, 2018

Seems like the VOLUME line is what's causing the hyperctl run to fail. Any image built without it works just fine, and an image built from a Dockerfile as simple as the following two lines will have this issue:

FROM openjdk:8-jdk
VOLUME "/home/gradle/.gradle"

Will keep digging into exactly why this line causes the fs corruption.

UPDATE:

It seems this issue has nothing to do with the base image or the path of the volume. Even an image created from a Dockerfile like the one below hits the same error, as long as the host is in the bad state.

FROM scratch
VOLUME "/tmp"
Zhongtang commented May 25, 2018

Update:

Seems like the issue is with the default volume location hyperd uses on the host OS:

# Works with -v host:guest
$ sudo hyperctl run -t --rm -v /tmp:/tmp openjdk:8-jdk date
Fri May 25 00:59:47 UTC 2018

# Fails with -v guest only
$ sudo hyperctl run -t --rm -v /tmp openjdk:8-jdk date
hyperctl ERROR: Error from daemon's response: finished with errors: map[cc8c1283fa9ba2395ea926712d78ec4be4da807281847f3a4445934027ff72ca:Create new container failed: Error: ]

It would be nice to know where hyperd mounts /tmp on the host OS by default, so we can check whether there is a disk space issue there.

The disk space usage does look healthy on the host OS, though:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/sys-root  424G  334G   69G  84% /
devtmpfs               24G     0   24G   0% /dev
tmpfs                  24G   24K   24G   1% /dev/shm
tmpfs                  24G   12M   24G   1% /run
tmpfs                  24G     0   24G   0% /sys/fs/cgroup
/dev/sda2             488M  385M   68M  85% /boot
/dev/mapper/sys-home  9.8G   45M  9.2G   1% /home
tmpfs                 4.8G     0  4.8G   0% /run/user/0
tmpfs                 4.8G     0  4.8G   0% /run/user/79329
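Since the pool sits on loopback devices, another thing worth checking is how much the backing sparse file has actually allocated versus its apparent size (losetup -l would show which file backs /dev/loop0; the path used in the demo below is just a stand-in, not hyperd's real data file location):

```python
# Compare a sparse file's apparent size with the blocks actually
# committed on disk. For a loop-backed thin pool, "apparent" is what the
# pool thinks it has; "allocated" is what the host fs has really given it.
import os

def sparse_usage(path):
    st = os.stat(path)
    apparent = st.st_size            # logical size of the file
    allocated = st.st_blocks * 512   # blocks the filesystem has committed
    return apparent, allocated

# Demo with a freshly truncated sparse file (hypothetical path):
with open("/tmp/sparse-demo", "wb") as f:
    f.truncate(10 * 2**20)           # 10 MiB apparent, ~nothing allocated
apparent, allocated = sparse_usage("/tmp/sparse-demo")
```

If hyperd's data file shows allocation close to the free space on /, the buffer I/O errors on the thin volumes would be explained by the backing filesystem running dry.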