ROS boot hang on kernels (maybe>=4.14.36) #2473

Open
niusmallnan opened this Issue Sep 9, 2018 · 16 comments

@niusmallnan
Member

niusmallnan commented Sep 9, 2018

RancherOS Version: (ros os version)
v1.4.1-rc2
kernel 4.14.67

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
docker-machine + VirtualBox

The hang only appears on the first boot, so it is probably related to ros-bootstrap.

dmesg output:

[   28.271974] random: fast init done
[   28.504239] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input4
[   28.642856] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 08:00:27:c0:a2:3e
[   28.643546] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection
[   29.043658] e1000 0000:00:08.0 eth1: (PCI:33MHz:32-bit) 08:00:27:99:89:e0
[   29.044367] e1000 0000:00:08.0 eth1: Intel(R) PRO/1000 Network Connection
[  134.643987] random: crng init done
[  134.644523] random: 4 urandom warning(s) missed due to ratelimiting
[  134.889131] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
[  134.968344] udevd[51]: starting version 3.2.5
[  134.971086] udevd[52]: starting eudev-3.2.5
[  136.375081] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)

It seems that linuxkit also hit this problem.
linuxkit/linuxkit#3032

@rootwuj

rootwuj commented Sep 18, 2018

RancherOS Version 1.4.1-rc3 9/17
Verified fixed

@rootwuj rootwuj closed this Sep 18, 2018

@niusmallnan niusmallnan added BACKPORTED and removed BACKPORT labels Sep 19, 2018

@niusmallnan

Member

niusmallnan commented Sep 19, 2018

When I create a ros instance using docker-machine + VirtualBox, ros-bootstrap runs mkfs to format the disk on the first boot. mkfs relies on random number generation. A commit in kernel 4.14.36 makes random numbers unavailable in early boot, so mkfs gets stuck for a long time. The cause is an empty entropy pool.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.14.36&id=6e513bc20ca63f594632eca4e1968791240b8f18
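
One way to check whether a guest is in the state described above is to ask the kernel, without blocking, whether its random number generator has been initialized. The sketch below is only an illustration and is not part of ros-bootstrap. The function name `crng_ready` is hypothetical, and it assumes Linux 3.17 or later and Python 3.6 or later, where `os.getrandom` is available.

```python
import os

# GRND_NONBLOCK is 0x0001 in the Linux UAPI headers; fall back to that value
# if this Python build does not expose the constant.
GRND_NONBLOCK = getattr(os, "GRND_NONBLOCK", 0x0001)


def crng_ready(getrandom=None):
    """Return True if getrandom() can deliver bytes without blocking.

    `getrandom` is injectable so the logic can be exercised without a
    freshly booted, entropy-starved machine.
    """
    if getrandom is None:
        getrandom = os.getrandom  # Linux only
    try:
        getrandom(1, GRND_NONBLOCK)
        return True
    except BlockingIOError:
        # The kernel reports EAGAIN while its generator is uninitialized.
        return False
```

Calling `crng_ready()` early in boot on an affected VM would be expected to return False until the kernel logs `random: crng init done`, as in the dmesg output above.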

Adding an entropy daemon to fill the entropy pool solves this issue, so I run rngd in ros-bootstrap.

There are two reasons why I don't run rngd all the time:

  1. Apart from mkfs, no other effects were found. Once the system is fully booted, mkfs has no problems even without rngd.
  2. In recent builds of CoreOS and Ubuntu, I didn't see a built-in rngd or haveged.

So my plan is to run rngd only once, in ros-bootstrap, to make sure mkfs has no problems during bootstrap.
rngd is already built into os-base, but it won't run as a daemon. I will follow feedback from the community and other distributions.

@niusmallnan

Member

niusmallnan commented Sep 19, 2018

Backported to v1.4.1.

@espigle

espigle commented Oct 5, 2018

I am still experiencing this issue using the latest 1.4.1 version as available from here:

https://releases.rancher.com/os/latest/rancheros-vmware.iso

@kinolaev

kinolaev commented Oct 10, 2018

Same problem on vsphere 6.5 and rancheros 1.4.1.
Version 1.4.0 boots successfully, but 1.4.1 hangs until I open the console and press some keys.

@niusmallnan

Member

niusmallnan commented Oct 11, 2018

I don't know how to reproduce it. I can confirm that we have fixed this issue in our setup.
The reporter of the issue below said that vSphere 6.5 worked fine with the latest rancheros-vmware ISO.
#2505

@cristianocasella

cristianocasella commented Oct 12, 2018

Hi, same problem here. The issue seems related to the device used to generate entropy during the installation; I think you are using /dev/urandom instead of /dev/random. In this case the server has no entropy and /dev/urandom can't generate random data. When you move the mouse or press keys in the console, you generate enough entropy. Switching to /dev/random should fix the issue.
This happened yesterday with the latest tag (https://releases.rancher.com/os/latest/rancheros-vmware.iso) v1.4.1 - Docker 18.03.1-ce - Linux 4.14.67

@niusmallnan

Member

niusmallnan commented Oct 12, 2018

@cristianocasella Not sure if I got it wrong. When the entropy pool is empty, reads from /dev/random (not /dev/urandom) will block until additional environmental noise is gathered.

In this case, mkfs relies on random number generation. For mkfs, how do I switch to /dev/urandom?
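
To see how full the entropy pool is on an affected machine, the kernel's own estimate can be read from procfs. The following is a minimal sketch, assuming a kernel of the 4.14 era where `/proc/sys/kernel/random/entropy_avail` reports a changing value in bits. The helper names and the 64-bit threshold are hypothetical and chosen only for illustration.

```python
def read_entropy_avail(path="/proc/sys/kernel/random/entropy_avail"):
    """Return the kernel's current entropy estimate, in bits."""
    with open(path) as f:
        return int(f.read().strip())


def looks_starved(bits, threshold=64):
    """Heuristic only: treat anything below `threshold` bits as starved.

    The threshold is an illustrative assumption, not a kernel constant.
    """
    return bits < threshold
```

Running `looks_starved(read_entropy_avail())` before and after pressing keys in the VM console should show the estimate rising, which matches the workaround reported earlier in this thread.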

======== To anyone ========
If you want to report this issue, please include your setup information, including hardware, because both VMware and VirtualBox have been shown to be fixed in our setup.

@cristianocasella

cristianocasella commented Oct 12, 2018

@niusmallnan sorry, I inverted the devices (https://linux.die.net/man/4/urandom).
You can use an entropy generator (https://vanheusden.com/Linux/#security) or remove the wrong device and make a symbolic link to the secure one. I like the first choice.

@kinolaev

kinolaev commented Oct 12, 2018

Hello, @niusmallnan,
we use Intel(R) Xeon(R) CPU E5-2603 and Intel(R) Xeon(R) CPU E5-2620

@niusmallnan

Member

niusmallnan commented Oct 23, 2018

After various attempts, I finally found a way to reproduce the problem.
If I run ./scripts/run on GCP with nested kvm, I can see that this behavior still exists.
So I should reopen this issue.

It seems that this is not easy to fix completely; I need to find more scenarios.

@kinolaev

kinolaev commented Oct 23, 2018

Probably we need to find a way to pass the host's random number generator to guests in vSphere, something like -device virtio-rng-pci (doc) in KVM.
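
Whether a guest actually sees a hardware RNG (for example a virtio-rng device) can be checked from inside the guest through sysfs. This is a sketch under the assumption that the kernel's hw_random framework is enabled and exposes `/sys/class/misc/hw_random/rng_current`; the function name `current_hwrng` is hypothetical.

```python
def current_hwrng(path="/sys/class/misc/hw_random/rng_current"):
    """Return the name of the active hardware RNG, or None if there is none."""
    try:
        with open(path) as f:
            name = f.read().strip()
    except OSError:
        # hw_random is not available, or the file is not readable.
        return None
    # The kernel is assumed to report "none" when no device is selected.
    return None if name in ("", "none") else name
```

If this returns None in a vSphere guest, an rngd that relies on the default `/dev/hwrng` source has nothing to read from, which is consistent with the behavior discussed in this thread.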

@niusmallnan

Member

niusmallnan commented Oct 25, 2018

Can someone try this ISO? http://releases.rancher.com/os/test/rancheros-test.iso

I tried rngd -r /dev/urandom to fill the entropy pool.
Of course, this is a really bad idea, since I am simply filling the kernel entropy pool with entropy coming from the kernel itself!
But this way there is no need to consider the user's hardware RNG drivers.
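
For readers unfamiliar with how a daemon credits entropy to the kernel, the mechanism is the `RNDADDENTROPY` ioctl on `/dev/random`. The sketch below only illustrates that mechanism and is not rngd's actual code. The helper names are hypothetical, the ioctl number assumes the generic encoding used on x86 and ARM, and the call needs root. Feeding it bytes read from `/dev/urandom` would reproduce the circular setup described above.

```python
import fcntl
import os
import struct

# _IOW('R', 0x03, int[2]) with the generic ioctl encoding (x86, ARM).
RNDADDENTROPY = (1 << 30) | (8 << 16) | (ord("R") << 8) | 0x03


def pack_rand_pool_info(data, entropy_bits):
    """Build a struct rand_pool_info: entropy_count, buf_size, then the bytes."""
    return struct.pack("ii", entropy_bits, len(data)) + data


def add_entropy(data, entropy_bits):
    """Mix `data` into the pool and credit `entropy_bits` bits for it.

    Requires CAP_SYS_ADMIN. Keep `data` small: fcntl.ioctl limits a bytes
    argument to 1024 bytes including the 8-byte header.
    """
    fd = os.open("/dev/random", os.O_WRONLY)
    try:
        fcntl.ioctl(fd, RNDADDENTROPY, pack_rand_pool_info(data, entropy_bits))
    finally:
        os.close(fd)
```

A real entropy source should only claim as many bits as it can justify; crediting kernel output at full value is what makes the `-r /dev/urandom` approach questionable.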

@cristianocasella

cristianocasella commented Oct 25, 2018

Hi @niusmallnan ,
it works!

@espigle

espigle commented Oct 25, 2018

@niusmallnan - worked for me.

@kinolaev

kinolaev commented Oct 26, 2018

@niusmallnan, worked for me too.
