
fix(nvmf): set netroot=nbft #10

Merged: 1 commit merged into timberland_final on May 26, 2023

Conversation

@mwilck (Collaborator) commented May 4, 2023

The logic added in 9b9dd99 ("35network-legacy: only skip waiting for interfaces if netroot is set") will cause all NBFT interfaces to be waited for unless the "netroot" shell variable is set. Avoid this by setting "netroot=nbft": this way the boot proceeds even if NBFT interfaces are missing, as long as the initrd root file system has been found.

This requires installing a netroot handler /sbin/nbftroot, which will be called by the networking scripts via /sbin/netroot when the interface has been brought up. Create a simple nbftroot script that just calls nvmf-autoconnect.sh. With this installed, we can skip calling nvmf-autoconnect.sh from the "online" initqueue hook.
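
A minimal sketch of what this could look like (the actual script and its calling convention are in the PR diff and may differ; this is only an illustration of the idea):

```sh
#!/bin/sh
# /sbin/nbftroot (sketch) - invoked via /sbin/netroot once a network
# interface has been brought up, because the cmdline contains netroot=nbft.
# All it needs to do is trigger the NVMe-oF connection attempts; the usual
# root-device detection in the initqueue does the rest.
/sbin/nvmf-autoconnect.sh
```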

Fixes #9, but only for the network-legacy networking backend.

I think that with the network-manager backend, the issue doesn't exist in the first place.

@mwilck (Collaborator, Author) commented May 10, 2023

As discussed in the last Timberland meeting, I double-checked the network-manager backend too, and updated the PR description.

Elaborating some more: NM doesn't use "finished" initqueue scripts for individual interfaces at all. Rather, it uses nm-wait-online-initrd.service, which calls nm-online -s -q -t 3600. I don't understand the semantics of this tool exactly, but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires". Reporting of an "active connection" depends on the autoconnect, ipv4.may-fail and ipv6.may-fail settings (and perhaps more; again, I don't fully understand it) of the configured connections [1], but unless I am mistaken, NM will signal an "active connection" (and thus nm-online will return success) as soon as one network connection becomes active [2].
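
For illustration only (the profile name "nbft0" is an assumption, not something from this PR), the settings in question can be inspected per generated connection:

```sh
# Print the properties that influence whether NM reports the
# connection as "active".
nmcli -g connection.autoconnect,ipv4.may-fail,ipv6.may-fail connection show nbft0
```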

Therefore I think the "problem" that inactive interfaces will be waited for in the "NVMe/TCP multipath" case does not exist with NM. The second problem described in #9 (second interface not up after boot) might very well exist, too.

@johnmeneghini, @tbzatek: could you discuss this with NM experts for confirmation?

The netroot parameter is used by NM, and thus I think this PR won't cause a regression.

Footnotes

  1. The connections and their settings are generated by the nm-initrd-generator tool.

  2. Fixme: does routing play a role here? Would NM look for a route to the public internet, like the infamous "connectivity check" known from the desktop?

@thom311 commented May 24, 2023

I am not familiar with this topic, so I cannot give a qualified review.

Only a comment about NetworkManager...

but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires".

This quote from man nm-online is mainly about how the tool behaves when called without --wait-for-startup, which isn't relevant here. The nm-online tool is of little use on its own (the manual even says that). The relevant part is that it is called as an implementation detail by NetworkManager-wait-online.service (in the real root) and nm-wait-online-initrd.service (in the initrd).

man NetworkManager-wait-online.service explains better how this is supposed to work.

NM will signal an "active connection" (and thus, nm-online will return success) as soon as one network connection becomes active

NetworkManager-wait-online.service (and nm-wait-online-initrd.service and nm-online -s) will wait until NetworkManager indicates that it is done configuring the network. You can affect that via various means (listed in the manual page), but among others, it will wait until all interfaces that are supposed to be configured, are configured. That is, as long as you see devices in "activating"/"connecting" state in nmcli device, NetworkManager is not yet done configuring the network and the tools still wait for online.
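
As a concrete illustration (device names are just examples, not output from this setup):

```sh
# While any device is still "connecting", nm-online -s (and thus the
# wait-online services) keeps waiting.
nmcli -g DEVICE,STATE device
# nbft0:connected
# nbft1:connecting (getting IP configuration)
```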

@bengal commented May 24, 2023

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service. In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.
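
That ordering can be verified directly from the unit, e.g. in the initrd emergency shell (output abridged):

```sh
systemctl show -p Before nm-wait-online-initrd.service
# Before=dracut-initqueue.service
```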

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

@mwilck (Collaborator, Author) commented May 24, 2023

@thom311, @bengal, thanks for your comments.

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.

For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?
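
For illustration, such a generated argument would look roughly like this (addresses taken from the NBFT example further down in this thread; the exact form produced by the module may differ):

```
ip=192.168.101.30::0.0.0.0:255.255.255.0::nbft0:none
```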

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.
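
(Purely illustrative, with an assumed profile name; not part of this PR:)

```sh
# Inspect or change how long NM waits for the device to appear
# (value in milliseconds).
nmcli -g connection.wait-device-timeout connection show nbft0
nmcli connection modify nbft0 connection.wait-device-timeout 10000
```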

I guess someone needs to just test this. @johnmeneghini, can you do this with the rh-poc?

The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS [1] are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.

I have no idea if, and how, that could be achieved with the dracut networkmanager module.

May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

Footnotes

  1. and other mandatory file systems

@mwilck (Collaborator, Author) commented May 24, 2023

Side note to @thom311: NM will also need support for NBFT-configured interfaces at run time (in the real root FS):

  • it should understand that these interfaces should not be reconfigured or shut down, as they are necessary to access the root FS,
  • however, it must take care of some things, such as DHCP lease renewal,
  • if some interface hasn't been brought up during initrd processing, it should take care of the interface bringup and configuration according to the parameters in the NBFT,
  • it may need to run nvme connect-all --nbft after configuring an NBFT interface (see Fix nbft multipath linux-nvme/nvme-cli#1954)

So far we have implemented this "feature set" in the SUSE tool "wicked". For wicked, I've written a shell-script plugin which reads the JSON-formatted HFI information from the NBFT and transforms it into XML that wicked understands. I suppose a similar approach would be possible for NM. NM has been on my todo list, but I haven't had time to actually work on it. I've also repeatedly mentioned in Timberland meetings that this is a necessary puzzle piece to make NVMe boot production-ready for NM-based systems. Some hints, or even better, someone else looking into this with my support, would be much appreciated.
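
To illustrate the kind of transformation meant here (this is not the actual wicked plugin; the nvme invocation and the JSON field names are assumptions):

```sh
#!/bin/sh
# Sketch: dump the HFI records from the NBFT as JSON and emit a minimal
# per-interface XML fragment of the kind a network daemon could consume.
nvme nbft show --output-format=json /sys/firmware/acpi/tables/NBFT |
    jq -r '.hfi[] | [.mac, .ipaddr, .subnet_mask_prefix] | @tsv' |
    while IFS="$(printf '\t')" read -r mac ip prefix; do
        printf '<interface>\n  <mac>%s</mac>\n  <address>%s/%s</address>\n</interface>\n' \
            "$mac" "$ip" "$prefix"
    done
```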

@bengal commented May 24, 2023

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

Right.

For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?

That's correct.

There is a dracut PR (dracutdevs#2173) to change this a bit, and run the hooks as soon as each interface is activated; but that doesn't change the fact that the initqueue runs after all interfaces are activated.

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.

I'm not sure if by "unplugged" you mean with the cable unplugged (i.e. without carrier), or that the device is physically unplugged from the system (i.e. not present at all). In the first case there is a carrier-timeout of 10 seconds, in the second case the timeout for the device to appear is 60 seconds (only when neednet=1 or when the device is the bootdev). After the timeout expires, the initqueue proceeds.

The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS [1] are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.

I have no idea if, and how, that could be achieved with the dracut networkmanager module.

I guess that would require:

  • the PR mentioned above, to start hooks immediately when interfaces go up;
  • to provide a way to make nm-wait-online-initrd finish earlier from an online hook, so that the initqueue can be started as soon as the rootfs is mounted. At the moment I don't know how to do that, but there is a way probably.

May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

I am not aware of any previous discussion about this or similar issues.

@mwilck (Collaborator, Author) commented May 24, 2023

I'm not sure if by "unplugged" you mean with the cable unplugged

I meant "no carrier", or "down" for whatever other reason (e.g. no IP address obtained from DHCP). No hardware hot-plug discussion here :-)

@mwilck (Collaborator, Author) commented May 24, 2023

At the moment I don't know how to do that, but there is a way probably.

Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.
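
Just to sketch what I mean (illustrative only; the hook path and the nm-online timeout semantics are assumptions, nothing like this is part of the PR):

```sh
# A "finished" initqueue hook is evaluated repeatedly by the initqueue
# main loop, which can stop waiting once all such hooks succeed. NM could
# run in parallel and this hook would report whether it is done yet.
cat > /lib/dracut/hooks/initqueue/finished/nm-online.sh <<'EOF'
nm-online -s -q -t 0
EOF
```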

I am not aware of any previous discussion about this or similar issues.

Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

@thom311 commented May 24, 2023

it should understand that these interfaces should not be reconfigured or shut down, as they are necessary to access the root FS,
however, it must take care of some things, such as DHCP lease renewal,

That is not different from other networking which is set up by NM in the initrd (iBFT). Interestingly, NetworkManager to this day doesn't support something like systemd-networkd's KeepConfiguration= setting (it seems the demand is not high enough for anybody to work on it?). In any case, while useful/necessary, it would be orthogonal to an NBFT feature.

@mwilck (Collaborator, Author) commented May 24, 2023

That is not different from other networking which is set up by NM in the initrd (iBFT)

Right, it is not. But I guess someone needs to code the plugin :-) I'll have a look at NM's iBFT code and see to what extent it can be reused for NBFT support.

@thom311 commented May 24, 2023

The iBFT code for NetworkManager is here.

@mwilck (Collaborator, Author) commented May 25, 2023

@bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

@bengal commented May 25, 2023

Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.

There might have been other reasons that I don't remember, but I think the main one was to leave the hook invocation in the initqueue, and to use only unit dependencies as a synchronization mechanism to ensure hooks are invoked only after the network is configured. In this way there is no need for custom scripts and everything works similarly to the real root, using the network-online target. This can be revisited if there are issues that can't be solved with the current approach.

I am not aware of any previous discussion about this or similar issues.

Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

One problem in dracut is that there is no documentation or knowledge about supported use cases, and this makes it difficult to introduce new features or make changes. It would be great if every use case were covered by the test suite (see the test/ directory in the dracut tree). NetworkManager also tests different dracut scenarios in its integration test suite and tries to cover most of the known use cases.

@bengal commented May 25, 2023

@bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

This makes sense to me.

@johnmeneghini (Collaborator) left a comment

I've tested these changes with Fedora and everything works.

As observed in your review comments, NetworkManager doesn't appear to rely upon these changes, and I am able to boot with multiple paths using multiple NBFT attempt files without a problem. Error insertion tests are also passing, showing that dracut will use any of the available paths and continue to boot from the NBFT correctly. When both paths are working correctly, the system even boots from the NBFT and enables multipathing.

[root@host-vm ~]# nvme nbft show
/sys/firmware/acpi/tables/NBFT:

NBFT Subsystems:

Idx|NQN                                                                 |Trsp|Address       |SvcId|HFIs
---+--------------------------------------------------------------------+----+--------------+-----+----
1  |nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db|tcp |192.168.101.20|4420 |1   
2  |nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db|tcp |192.168.110.20|4420 |1   

NBFT HFIs:

Idx|Trsp|PCI Addr  |MAC Addr         |DHCP|IP Addr       |Mask|Gateway |DNS     
---+----+----------+-----------------+----+--------------+----+--------+--------
1  |tcp |0:0:4.0   |ea:eb:d3:58:89:58|no  |192.168.101.30|24  |0.0.0.0 |0.0.0.0 
2  |tcp |0:0:5.0   |ea:eb:d3:59:89:59|no  |192.168.110.30|24  |0.0.0.0 |0.0.0.0 
[root@host-vm ~]# nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db
\
 +- nvme0 tcp traddr=192.168.101.20,trsvcid=4420,host_traddr=192.168.101.30,src_addr=192.168.101.30 live 
 +- nvme1 tcp traddr=192.168.110.20,trsvcid=4420,host_traddr=192.168.101.30,src_addr=192.168.101.30 live 
[root@host-vm ~]# ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128 
enp0s3           UP             192.168.0.216/24 2601:195:4000:62f:3467:102d:df16:84e7/64 fe80::875:9c79:c479:e6e4/64 
nbft0            UP             192.168.101.30/24 
nbft1            UP             192.168.110.30/24 

@johnmeneghini (Collaborator) commented:

The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

This is a policy decision. We can't wait forever. This looks like a hung system. It is better to fail to boot and let the user intervene. The NBFT has a timeout, which the user can use to set the timeout policy. If the user wants to wait forever during boot, they can use this timeout to set that policy.

@johnmeneghini (Collaborator) commented:

I think we are ready to move forward with the upstream dracut pull request. Please go ahead and merge this change and then move forward with the upstream pull request.

@mwilck (Collaborator, Author) commented May 26, 2023

This is a policy decision. We can't wait forever. This looks like a hung system

dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

mwilck merged commit c7731fc into timberland_final on May 26, 2023
mwilck deleted the multipath-fix-netroot branch on May 26, 2023 at 14:45
@mwilck (Collaborator, Author) commented May 26, 2023

Note: I squashed the changes from this PR into the top commit of the timberland_final branch. I also updated the commit message to reflect the changes made by this PR.

Hash before squash: ac66c00, after squash: f58e1d5

@johnmeneghini (Collaborator) commented:

This is a policy decision. We can't wait forever. This looks like a hung system

dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

I've been testing this and I see what you mean. I test things by toggling one or both of my nvme/tcp target port networks up and down on the target machine and then watching how the host reacts. When booting for the first time I see that UEFI will use the programmed timeout from NBFT. After timing out it returns to the Boot Menu. However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for. Then it connects and boots. From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path. I can bring the second path up and down and the host never sees it. It hangs trying to boot from the first path... forever.

The firmware appears to do the same thing. So it looks like we still have some path ordering issues in EDK2, and in dracut.

@mwilck (Collaborator, Author) commented May 26, 2023

When booting for the first time I see that UEFI will use the programmed timeout from NBFT.

I think you mean the ConnectTimeout from the UEFI input file, but AFAIU that's only effective for the firmware; there is no corresponding field in the NBFT.

After timing out it returns to the Boot Menu.

So this was with both interfaces down?

However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for.

Hm, I can't quite follow. Are you talking about a host reset from the BIOS menu? If yes, do you see the grub menu / the kernel booting? I would assume that a host reset goes through the BIOS, and would behave just like the first time boot.
Again, is this with one or two devices down?

From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path

If it's hanging in dracut with one interface up and one down, you're observing Problem 1 from #9.
Which would indicate that there's indeed work to do for NM to make multipath boot work.
