
fix(nvmf): set netroot=nbft #10

Merged: 1 commit merged into timberland_final on May 26, 2023

Conversation

@mwilck (Collaborator) commented May 4, 2023

The logic added in 9b9dd99 ("35network-legacy: only skip waiting for interfaces if netroot is set") will cause all NBFT interfaces to be waited for unless the "netroot" shell variable is set. Avoid this by setting "netroot=nbft": this way the boot proceeds even if NBFT interfaces are missing, as long as the initrd root file system has been found.

This requires installing a netroot handler /sbin/nbftroot, which will be called by the networking scripts via /sbin/netroot when the interface has been brought up. Create a simple nbftroot script that just calls nvmf-autoconnect.sh. With this installed, we can skip calling nvmf-autoconnect.sh from the "online" initqueue hook.
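
A minimal sketch of what this could look like (the actual script and its calling convention are in the PR diff and may differ; this is only an illustration of the idea):

```sh
#!/bin/sh
# /sbin/nbftroot (sketch) - invoked via /sbin/netroot once a network
# interface has been brought up, because the cmdline contains netroot=nbft.
# All it needs to do is trigger the NVMe-oF connection attempts; the usual
# root-device detection in the initqueue does the rest.
/sbin/nvmf-autoconnect.sh
```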

Fixes #9, but only for the network-legacy networking backend.

I think that with the network-manager backend, the issue doesn't exist in the first place.

@mwilck (Collaborator, Author) commented May 10, 2023

As discussed in the last Timberland meeting, I double-checked the network-manager backend too, and updated the PR description.

Elaborating some more: NM doesn't use "finished" initqueue scripts for individual interfaces at all. Rather, it uses nm-wait-online-initrd.service, which calls nm-online -s -q -t 3600. I don't understand the semantics of this tool exactly, but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires". Reporting of an "active connection" depends on the autoconnect, ipv4.may-fail and ipv6.may-fail settings (and perhaps more; again, I don't fully understand it) of the configured connections [1], but unless I am mistaken, NM will signal an "active connection" (and thus nm-online will return success) as soon as one network connection becomes active [2].
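
For illustration only (the profile name "nbft0" is an assumption, not something from this PR), the settings in question can be inspected per generated connection:

```sh
# Print the properties that influence whether NM reports the
# connection as "active".
nmcli -g connection.autoconnect,ipv4.may-fail,ipv6.may-fail connection show nbft0
```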

Therefore I think the "problem" that inactive interfaces will be waited for in the "NVMe/TCP multipath" case does not exist with NM. The second problem described in #9 (second interface not up after boot) might very well exist, too.

@johnmeneghini, @tbzatek: could you discuss this with NM experts for confirmation?

The netroot parameter is used by NM, and thus I think this PR won't cause a regression.

Footnotes

  1. The connections and their settings are generated by the nm-initrd-generator tool.

  2. Fixme: does routing play a role here? Would NM look for a route to the public internet, like the infamous "connectivity check" known from the desktop?

@thom311 commented May 24, 2023

I am not familiar with this topic, so I cannot give a qualified review.

Only a comment about NetworkManager...

but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires".

This quote from man nm-online is mainly about how the tool behaves when called without --wait-for-startup, which isn't relevant here. The nm-online tool is of little use on its own (the manual even says that). The relevant part is that it is called as an implementation detail by NetworkManager-wait-online.service (in the real root) and nm-wait-online-initrd.service (in the initrd).

man NetworkManager-wait-online.service explains better how this is supposed to work.

NM will signal an "active connection" (and thus, nm-online will return success) as soon as one network connection becomes active

NetworkManager-wait-online.service (and nm-wait-online-initrd.service and nm-online -s) will wait until NetworkManager indicates that it is done configuring the network. You can affect that via various means (listed in the manual page), but among others, it will wait until all interfaces that are supposed to be configured, are configured. That is, as long as you see devices in "activating"/"connecting" state in nmcli device, NetworkManager is not yet done configuring the network and the tools still wait for online.
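
As a concrete illustration (device names are just examples, not output from this setup):

```sh
# While any device is still "connecting", nm-online -s (and thus the
# wait-online services) keeps waiting.
nmcli -g DEVICE,STATE device
# nbft0:connected
# nbft1:connecting (getting IP configuration)
```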

@bengal commented May 24, 2023

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service. In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.
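
That ordering can be verified directly from the unit, e.g. in the initrd emergency shell (output abridged):

```sh
systemctl show -p Before nm-wait-online-initrd.service
# Before=dracut-initqueue.service
```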

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

@mwilck (Collaborator, Author) commented May 24, 2023

@thom311, @bengal, thanks for your comments.

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.

For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?
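
For illustration, such a generated argument would look roughly like this (addresses taken from the NBFT example further down in this thread; the exact form produced by the module may differ):

```
ip=192.168.101.30::0.0.0.0:255.255.255.0::nbft0:none
```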

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.
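
(Purely illustrative, with an assumed profile name; not part of this PR:)

```sh
# Inspect or change how long NM waits for the device to appear
# (value in milliseconds).
nmcli -g connection.wait-device-timeout connection show nbft0
nmcli connection modify nbft0 connection.wait-device-timeout 10000
```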

I guess someone needs to just test this. @johnmeneghini, can you do this with the rh-poc?

The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS [1] are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.

I have no idea if, and how, that could be achieved with the dracut networkmanager module.

May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

Footnotes

  1. and other mandatory file systems

@mwilck (Collaborator, Author) commented May 24, 2023

Side note to @thom311: NM will also need support for NBFT-configured interfaces at run time (in the real root FS):

  • it should understand that these interfaces should not be reconfigured or shut down, as they are necessary to access the root FS,
  • however, it must take care of some things, such as DHCP lease renewal,
  • if some interface hasn't been brought up during initrd processing, it should take care of the interface bringup and configuration according to the parameters in the NBFT,
  • it may need to run nvme connect-all --nbft after configuring an NBFT interface (see Fix nbft multipath linux-nvme/nvme-cli#1954)

So far we have implemented this "feature set" in the SUSE tool "wicked". For wicked, I've written a shell-script plugin which reads the JSON-formatted HFI information from the NBFT and transforms it into XML that wicked understands. I suppose a similar approach would be possible for NM. NM has been on my todo list, but I haven't had time to actually work on it. I've also repeatedly mentioned in Timberland meetings that this is a necessary puzzle piece to make NVMe boot production-ready for NM-based systems. Some hints, or even better, someone else looking into this with my support, would be much appreciated.
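
To illustrate the kind of transformation meant here (this is not the actual wicked plugin; the nvme invocation and the JSON field names are assumptions):

```sh
#!/bin/sh
# Sketch: dump the HFI records from the NBFT as JSON and emit a minimal
# per-interface XML fragment of the kind a network daemon could consume.
nvme nbft show --output-format=json /sys/firmware/acpi/tables/NBFT |
    jq -r '.hfi[] | [.mac, .ipaddr, .subnet_mask_prefix] | @tsv' |
    while IFS="$(printf '\t')" read -r mac ip prefix; do
        printf '<interface>\n  <mac>%s</mac>\n  <address>%s/%s</address>\n</interface>\n' \
            "$mac" "$ip" "$prefix"
    done
```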

@bengal commented May 24, 2023

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

Right.

For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?

That's correct.

There is a dracut PR (dracutdevs#2173) to change this a bit, and run the hooks as soon as each interface is activated; but that doesn't change the fact that the initqueue runs after all interfaces are activated.

I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.

I'm not sure if by "unplugged" you mean with the cable unplugged (i.e. without carrier), or that the device is physically unplugged from the system (i.e. not present at all). In the first case there is a carrier-timeout of 10 seconds, in the second case the timeout for the device to appear is 60 seconds (only when neednet=1 or when the device is the bootdev). After the timeout expires, the initqueue proceeds.

The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS [1] are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.

I have no idea if, and how, that could be achieved with the dracut networkmanager module.

I guess that would require:

  • the PR mentioned above, to start hooks immediately when interfaces go up;
  • to provide a way to make nm-wait-online-initrd finish earlier from an online hook, so that the initqueue can be started as soon as the rootfs is mounted. At the moment I don't know how to do that, but there is a way probably.

May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

I am not aware of any previous discussion about this or similar issues.

@mwilck (Collaborator, Author) commented May 24, 2023

I'm not sure if by "unplugged" you mean with the cable unplugged

I meant "no carrier", or "down" for whatever other reason (e.g. no IP address obtained from DHCP). No hardware hot-plug discussion here :-)

@mwilck (Collaborator, Author) commented May 24, 2023

At the moment I don't know how to do that, but there is a way probably.

Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.
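
Just to sketch what I mean (illustrative only; the hook path and the nm-online timeout semantics are assumptions, nothing like this is part of the PR):

```sh
# A "finished" initqueue hook is evaluated repeatedly by the initqueue
# main loop, which can stop waiting once all such hooks succeed. NM could
# run in parallel and this hook would report whether it is done yet.
cat > /lib/dracut/hooks/initqueue/finished/nm-online.sh <<'EOF'
nm-online -s -q -t 0
EOF
```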

I am not aware of any previous discussion about this or similar issues.

Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

@thom311 commented May 24, 2023

it should understand that these interfaces should not be reconfigured or shut down, as they are necessary to access the root FS,
however, it must take care of some things, such as DHCP lease renewal,

That is not different from other networking which is set up by NM in the initrd (iBFT). Interestingly, NetworkManager to this day doesn't support something like systemd-networkd's KeepConfiguration= setting (it seems the demand is not high enough for anybody to work on it?). In any case, while useful/necessary, it would be orthogonal to an NBFT feature.

@mwilck (Collaborator, Author) commented May 24, 2023

That is not different from other networking which is set up by NM in the initrd (iBFT)

Right, it is not. But I guess someone needs to code the plugin :-) I'll have a look at NM's iBFT code and see to what extent it can be reused for NBFT support.

@thom311 commented May 24, 2023

The iBFT code for NetworkManager is here.

@mwilck (Collaborator, Author) commented May 25, 2023

@bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

@bengal commented May 25, 2023

Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.

There might have been other reasons that I don't remember, but I think the main one was to leave the hook invocation in the initqueue, and to use only unit dependencies as a synchronization mechanism to ensure hooks are invoked only after the network is configured. In this way there is no need for custom scripts and everything works similarly to the real root, using the network-online target. This can be revisited if there are issues that can't be solved with the current approach.

I am not aware of any previous discussion about this or similar issues.

Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

One problem in dracut is that there is no documentation or knowledge about supported use cases, and this makes it difficult to introduce new features or make changes. It would be great if every use case were covered by the test suite (see the test/ directory in the dracut tree). NetworkManager also tests different dracut scenarios in its integration test suite and tries to cover most of the known use cases.

@bengal commented May 25, 2023

@bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

This makes sense to me.

@johnmeneghini (Collaborator) left a comment

I've tested these changes with Fedora and everything works.

As observed in your review comments, NetworkManager doesn't appear to rely upon these changes, and I am able to boot with multiple paths using multiple NBFT attempt files without a problem. Error insertion tests are also passing, showing that dracut will use any of the available paths and continue to boot from the NBFT correctly. When both paths are working correctly, the system even boots from the NBFT and enables multipathing.

[root@host-vm ~]# nvme nbft show
/sys/firmware/acpi/tables/NBFT:

NBFT Subsystems:

Idx|NQN                                                                 |Trsp|Address       |SvcId|HFIs
---+--------------------------------------------------------------------+----+--------------+-----+----
1  |nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db|tcp |192.168.101.20|4420 |1   
2  |nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db|tcp |192.168.110.20|4420 |1   

NBFT HFIs:

Idx|Trsp|PCI Addr  |MAC Addr         |DHCP|IP Addr       |Mask|Gateway |DNS     
---+----+----------+-----------------+----+--------------+----+--------+--------
1  |tcp |0:0:4.0   |ea:eb:d3:58:89:58|no  |192.168.101.30|24  |0.0.0.0 |0.0.0.0 
2  |tcp |0:0:5.0   |ea:eb:d3:59:89:59|no  |192.168.110.30|24  |0.0.0.0 |0.0.0.0 
[root@host-vm ~]# nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db
\
 +- nvme0 tcp traddr=192.168.101.20,trsvcid=4420,host_traddr=192.168.101.30,src_addr=192.168.101.30 live 
 +- nvme1 tcp traddr=192.168.110.20,trsvcid=4420,host_traddr=192.168.101.30,src_addr=192.168.101.30 live 
[root@host-vm ~]# ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128 
enp0s3           UP             192.168.0.216/24 2601:195:4000:62f:3467:102d:df16:84e7/64 fe80::875:9c79:c479:e6e4/64 
nbft0            UP             192.168.101.30/24 
nbft1            UP             192.168.110.30/24 

@johnmeneghini (Collaborator) commented:

The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

This is a policy decision. We can't wait forever. This looks like a hung system. It is better to fail to boot and let the user intervene. The NBFT has a timeout, which the user can use to set the timeout policy. If the user wants to wait forever during boot, they can use this timeout to set that policy.

@johnmeneghini (Collaborator) commented:

I think we are ready to move forward with the upstream dracut pull request. Please go ahead and merge this change and then move forward with the upstream pull request.

@mwilck (Collaborator, Author) commented May 26, 2023

This is a policy decision. We can't wait forever. This looks like a hung system

dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

mwilck merged commit c7731fc into timberland_final on May 26, 2023
mwilck deleted the multipath-fix-netroot branch on May 26, 2023 at 14:45
@mwilck (Collaborator, Author) commented May 26, 2023

Note: I squashed the changes from this PR into the top commit of the timberland_final branch. I also updated the commit message to reflect the changes made by this PR.

Hash before squash: ac66c00, after squash: f58e1d5

@johnmeneghini (Collaborator) commented:

This is a policy decision. We can't wait forever. This looks like a hung system

dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

I've been testing this and I see what you mean. I test things by toggling one or both of my nvme/tcp target port networks up and down on the target machine and then watching how the host reacts. When booting for the first time I see that UEFI will use the programmed timeout from NBFT. After timing out it returns to the Boot Menu. However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for. Then it connects and boots. From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path. I can bring the second path up and down and the host never sees it. It hangs trying to boot from the first path... forever.

The firmware appears to do the same thing. So it looks like we still have some path ordering issues in EDK2, and in dracut.

@mwilck (Collaborator, Author) commented May 26, 2023

When booting for the first time I see that UEFI will use the programmed timeout from NBFT.

I think you mean the ConnectTimeout from the UEFI input file, but AFAIU that's only effective for the firmware; there is no corresponding field in the NBFT.

After timing out it returns to the Boot Menu.

So this was with both interfaces down?

However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for.

Hm, I can't quite follow. Are you talking about a host reset from the BIOS menu? If yes, do you see the grub menu / the kernel booting? I would assume that a host reset goes through the BIOS, and would behave just like the first time boot.
Again, is this with one or two devices down?

From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path

If it's hanging in dracut with one interface up and one down, you're observing Problem 1 from #9.
Which would indicate that there's indeed work to do for NM to make multipath boot work.
