Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for usable network connectivity before running discovery-* #18

Closed
wants to merge 1 commit into from
Closed

Conversation

schiermi
Copy link
Contributor

In systems with many (network & block) devices or in networks with slow DHCP servers the discovery-* services start without a proper resolv.conf & network connectivity.

This leads to:
-failing TFTP downloads of extension zips (server not reachable or name resolution error in case address supplied via fdi.zipserver is a hostname).
-discovery-registers is not able to register against katello/foreman because /etc/resolv.conf is not usable when the ruby script starts / tries name resolution for the first time (resolv.conf ist not re-read by Ruby / the script)

The directive "After=…network-online.target…" in the current discovery-*service definitions is not sufficient. This target (at least in Centos 7) is reached as soon NetworkManager is started.

This diff adds another service (and correspondign Requires to discovery-*.services) which returns as soon NetworkManager thinks the system is connected to a network.

@lzap
Copy link
Member

lzap commented Jul 21, 2015

Hello, thanks for your contribution!

How does this differ from a service that ships with RHEL7/CentOS7 and which is already enabled on our image?

cat /usr/lib/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Requisite=NetworkManager.service
After=NetworkManager.service
Wants=network.target
Before=network.target network-online.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nm-online -s -q --timeout=30

[Install]
WantedBy=multi-user.target

Also discovery-register can start without network, it will just fail with the initial request, but it re-sends every minute.

Since smart-proxy binds to all interfaces it should be also fine with starting before network was initialized.

The only issue is the TFTP, we have a different PR that tries to solve it, but does not yet work: #14

@lzap
Copy link
Member

lzap commented Jul 21, 2015

I mean your patch is still valid, but can you delete unnecessary network.target and network-online.target and replace that with NetworkManager-wait-online.service instead?

More reading on this topic is at: http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

@lzap
Copy link
Member

lzap commented Jul 21, 2015

The bug might be in the fact that we have the service enabled, but we need to order our services after that online service.

@schiermi
Copy link
Contributor Author

/usr/lib/systemd/system/NetworkManager-wait-online.service calls nm-online with parameter -s:

man 1 nm-online:

       -s, --wait-for-startup
              Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically.  Startup
              is  considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connec‐
              tion which is available given the current network state. (This is generally only  useful  at  boot  time;  after
              startup has completed, nm-online -s will just return immediately, regardless of the current network state.)

"Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically." is not sufficient. nm-connected.service calls nm-online without -s (and a higher timeout) which is much more meaningful before running discovery-*.

discovery-register tries every 30 seconds, and fails because of "getaddrinfo: Name or service not known" when trying to resolv our Katello/Foreman hostname. The retry works when using IP addresses. When using DNS the first lookup fails (because /etc/resolv.conf is missing / incomplete) and subsequent requests fail also because ruby does not re-initialize (res_init()) the resolver, which still uses the broken resolv.conf information (https://sourceware.org/bugzilla/show_bug.cgi?id=984).

nm-connected.service is just a drop-in to ensure all discovery-* services have proper network connectivity available the time they are started.

My PR also solves the TFTP issue without depending on file "semaphores".

@lzap
Copy link
Member

lzap commented Jul 22, 2015

Ok that looks good. Can you please squash into one commit with msg:

Fixes #10650 - wait for nm IPv4 DHCP initialization

Also couple of other comments will follow.

@@ -15,6 +15,9 @@ systemctl enable NetworkManager-wait-online.service
echo " * enabling nm-prepare service"
systemctl enable nm-prepare.service

echo " * enabling nm-connected service"
systemctl enable nm-connected.service

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove systemctl enable NetworkManager-wait-online.service block please as this replaces it completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without NetworkManager-wait-online.service we effectively lose the synchronization point in network-online.target. Or do I miss something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I thought that since nm-connected have Before=network-online.target we don't need that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.
/usr/lib/systemd/system/network-online.target.wants/NetworkManager-wait-online.service comes from the NetworkManager package and is NOT provided by NetworkManager-wait-online.service [Install]-section. I remove systemctl enable nm-connected.service and give it a try.

@lzap
Copy link
Member

lzap commented Jul 22, 2015

Unfortunately we cannot remove the dance around TFTP downloading. We need to put /etc/default/discovery into both foreman-proxy and discovery-register and activated nm-online does not necessary mean that all dispatch scripts has been successfuly executed. Therefore we need to do it this way.

But there is one bug still, can you replace PathExists with PathChanged in discovery-setup.service please? This ensures the unit gets activated after the file is closed and not opened.

If you find a different way to make sure we have the environment variables ready for both services above, send your proposal as a different commit please (with the same issue number).

@lzap
Copy link
Member

lzap commented Jul 22, 2015

Ok it works fine, if you can do the PathChanged change, nit picks and rebase, I am happy to merge for the next release.

@lzap
Copy link
Member

lzap commented Jul 22, 2015

Or we can leave PathChanged out of this patch perhaps, but I was under impression this can also cause race condition.

@schiermi
Copy link
Contributor Author

Even with PathChanged a race condition exists: discovery-setup.service is pulled by multi-user.target (after all units up to nm-connected.service triggered). The unit does not complain about an missing EnvironmentFile (there is a - in EnvironmentFile=-/etc/default/discovery-zip-server) which may have not been created at this point in time. In case fdi.zipserver is not defined, we do not get the zips inside /usr/bin/discovery-setup. Let me think a moment about that…

@lzap
Copy link
Member

lzap commented Jul 22, 2015

Yeah /etc/default/discovery-zip-server file is treated basically as a semaphore. We could simply ship it as it is unchanged by the script. But we must be sure that discover-setup finishes before discovery-register and foreman-proxy are started. I don't know how to achieve this with systemd.

@schiermi
Copy link
Contributor Author

I suggest to move the kernel command line parsing out of discovery-setup and put it into get-zip-server_config.
get-zip-server_config should check if the DHCP response is for the interface the system was booted from (BOOTIF kernel command line). Afterwards this script should create the /etc/default/discovery-zip-server either from the kernel parameter or from DHCP's next-server. /etc/default/discovery-zip-server is monitored by systemd to start the subsequent discovery-* units.

I'm going prepare a proposed patchset within one day.

@schiermi
Copy link
Contributor Author

This way round PathExists does not need to be modifed.

@lzap
Copy link
Member

lzap commented Jul 23, 2015

I'm going prepare a proposed patchset within one day.

Ok, also I was wondering why we use dhclient script and not
/etc/NetworkManager/dispatcher.d standard NM script. This way we could
rid of the weird naming _config.

Later,
Lukas #lzap Zapletal

@lzap
Copy link
Member

lzap commented Jul 23, 2015

This way round PathExists does not need to be modifed.

I'd still rather see PathChanged tho:

       PathExists=, PathExistsGlob=, PathChanged=, PathModified=, DirectoryNotEmpty=
           Defines paths to monitor for certain changes: PathExists= may be used to watch the mere existence of a file or
           directory. If the file specified exists, the configured unit is activated.  PathExistsGlob= works similar, but
           checks for the existence of at least one file matching the globbing pattern specified.  PathChanged= may be used
           to watch a file or directory and activate the configured unit whenever it changes. It is not activated on every
           write to the watched file but it is activated if the file which was open for writing gets closed.  PathModified=
           is similar, but additionally it is activated also on simple writes to the watched file.  DirectoryNotEmpty= may
           be used to watch a directory and activate the configured unit whenever it contains at least one file.

Later,
Lukas #lzap Zapletal

@lzap
Copy link
Member

lzap commented Jul 23, 2015

I'd also removed - characters in Env statements, we should fail out early. If it does not exist, that's a problem. Maybe to statically ship this file and use just some empty file semaphore instead?

@schiermi
Copy link
Contributor Author

Maybe the dhclient script was intended with some compatibility in mind? On the other hand get-zip-server_config is not "pure-dhclient-ready" because it uses DHCP4_… variables instead of dhclients new_dhcp4_…. Im going to rearrange the functionality to be placed into /etc/NetworkManager/dispatcher.d.

@schiermi
Copy link
Contributor Author

I've reorganized the startup of the discovery-* scripts. Their units are not longer "WantedBy" multi-user.target. Instead the chain starting at discovery-setup.service is triggered by "PathExists=/etc/default/discovery-zip-server" in discovery-setup.path. discovery-setup.service wants discovery-autostart.service which wants discovery-register.service. Other dependencies are dropped (including my nm-connected.service).

/etc/default/discovery-zip-server is created by /etc/NetworkManager/dispatcher.d/15-get-zip-server; a "pure" NM dispatcher.d script. This script only processes "up" events for the "BOOTIF" interface.
The kernel cmdline parameter fdi.zipserver takes precedence over the DHCP supplied next server. Using a temporary file only a non-empty file is written (to match systemd PathExists).

nm-prepare was also prone to race conditions. I replaced the systemd unit with a more appropriate udev rule and modified the script.

@lzap
Copy link
Member

lzap commented Jul 29, 2015

Re: dhclient compatibility - no backward compatibility needed, our image only supports CentOS7/RHEL7+.

Taking a look, btw can you rebase it looks like there is a small conflict. Thanks.

defroute=no
fi
script=/etc/sysconfig/network-scripts/ifcfg-$dev
cat > $script <<EONS
DEVICE=$dev
TYPE=Ethernet
BOOTPROTO=dhcp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, this whole file has been completely rewritten in d4cb4ea and it also triggers via udev now. You can basically drop your changes as it works the same way (accepts a parameter = device name). It now does not create rh-cfg plugin configurations but directly NetworkManager ini-style profiles instead. I also use systemd-cat utility to redirect output to journal.

@lzap
Copy link
Member

lzap commented Jul 29, 2015

Ok this looks good. Please note the new fdi.initnet option after you rebase. We now initialize only primary interface by default. Test this with fdi.initnet=all option as well to see if it prevents from ordering bug.

@lzap
Copy link
Member

lzap commented Jul 29, 2015

During rebase, please reword the commit message to Fixes #11241 - ...

http://projects.theforeman.org/issues/11241

@lzap
Copy link
Member

lzap commented Jul 29, 2015

Oh one note, nm-prepare is being called from udev, but to configure the primary interface it is still being called as systemd service. A special parameter primary is provided in that case and MAC address is taken from the kernel command line (appended there by PXELinux).

@schiermi
Copy link
Contributor Author

Sounds good.
I will rename…
-discovery-setup to discovery-fetch-extensions
-discovery-autostart to discovery-start-extensions

Introduce /usr/share/fdi/commonfunc.sh with functions exportKCL and normalizeHwAddr.
I'm going to replace all occurrences of /proc/cmdline parsing with commonfunc.sh / exportKCL.

@schiermi
Copy link
Contributor Author

KCL_BOOTIF is the variable based on the kernel cmd line (including ARP type 01).
BOOTMAC is derived from this with the ARP type stripped.

@lzap
Copy link
Member

lzap commented Jul 30, 2015

KCL_BOOTIF is the variable based on the kernel cmd line (including ARP type 01).
BOOTMAC is derived from this with the ARP type stripped.

Sounds great.

Later,
Lukas #lzap Zapletal

@schiermi
Copy link
Contributor Author

Image based in Centos 7 tested successfully with and without fdi.initnet=all.

@schiermi schiermi closed this Jul 30, 2015
@schiermi schiermi reopened this Jul 30, 2015
fi

# enable ssh
if [[ "$(cat /proc/cmdline | grep -io 'fdi.ssh=\S*')" == "fdi.ssh=1" ]]; then
if [ "$KCL_FDI_SSH" == "1" ]]; then
systemctl start sshd
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work for me, sshd is not started.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like commonfunc.sh does work, but rc.local is not enabled. This needs to be enabled explicitly on systemd systems AFAIK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually make sure it's executable and then systemctl enable rc-local.service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait its a syntax error actually: Jul 31 10:51:55 fdi rc.local[1107]: /etc/rc.d/rc.local: line 12: [: missing]'`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to fix this. Sorry for not testing SSH access.

@lzap
Copy link
Member

lzap commented Jul 31, 2015

Ok except the ssh, it works fine. Please fix and amend into a single commit and I will retest multiple times to see if race conditions are really gone (so far so good). Looks great!

@lzap
Copy link
Member

lzap commented Aug 3, 2015

Almost there, I see a parsing error during fetch script startup:

[root@fdi ~]# systemctl status discovery-fetch-extensions.service -l
discovery-fetch-extensions.service - Download zip extensions to this image
   Loaded: loaded (/etc/systemd/system/discovery-fetch-extensions.service; static)
   Active: active (exited) since Mon 2015-08-03 14:11:59 UTC; 3min 39s ago
  Process: 738 ExecStart=/usr/bin/discovery-fetch-extensions (code=exited, status=0/SUCCESS)
 Main PID: 738 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/discovery-fetch-extensions.service

Aug 03 14:11:59 fdi systemd[1]: Starting Download zip extensions to this image...
Aug 03 14:11:59 fdi discovery-fetch-extensions[738]: /usr/bin/discovery-fetch-extensions: line 8: local: can only be used in a function
Aug 03 14:11:59 fdi systemd[1]: Started Download zip extensions to this image.

But network seems to be fine.

# cmdline first and fall back to DHCP next-server passed in through
# ENV

local extdir="/opt/extension"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry; transparent proxy gave me an outdated fdi-image.tar during download to my test system.

@schiermi
Copy link
Contributor Author

schiermi commented Aug 3, 2015

Please wait; I re-test the whole functionality.

@schiermi
Copy link
Contributor Author

schiermi commented Aug 3, 2015

Tested on KVM & HW.

systemctl enable discovery-setup.path
systemctl enable discovery-setup.service
systemctl enable discovery-autostart.service
systemctl enable discovery-register.service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good but discovery-register does not actually start on my instances:

[root@fdi ~]# systemctl status discovery-register
discovery-register.service - Register this host in Foreman
   Loaded: loaded (/etc/systemd/system/discovery-register.service; disabled)
   Active: inactive (dead)

I will merge this because I am currently working on TUI support which actually starts this service manually after TUI splash screen appears.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm setup and autostart are gone/renamed, but I still see usage here:

root/etc/systemd/system/discovery-start-extensions.service
4:After=basic.target network.target network-online.target nss-lookup.target discovery-setup.service

root/etc/systemd/system/discovery-register.service
4:After=basic.target network.target network-online.target nss-lookup.target discovery-autostart.service foreman-proxy.service

Can you delete these and also make sure register service starts?

@lzap
Copy link
Member

lzap commented Aug 4, 2015

Merged as 7618547, thank you!

@lzap lzap closed this Aug 4, 2015
@lzap
Copy link
Member

lzap commented Aug 6, 2015

FYI found one additional issue, not sure why I see it now but TFTP download via CURL does not work because firewall blocks client to send ack packets. This enables it:

echo -e "ip_conntrack_tftp\nnf_conntrack_netbios_ns" > /etc/modules-load.d/tftp-firewall.conf

@schiermi
Copy link
Contributor Author

schiermi commented Aug 7, 2015

This is strange. I rebuilt the FDI (fdi-centos7.ks) from a fresh clone. The TFTP download succeeded without any hickups. ip_conntrack_tftp is not loaded in the kernel of the running FDI. iptables -L shows no active rules.

Where is this firewall activated at your side?

I switched to curl to reduce the needed packages and to be able to implement HTTP downloads easily in the future for higher throughput.

@lzap
Copy link
Member

lzap commented Aug 10, 2015

Where is this firewall activated at your side?

Sorry, I realized that I actually activated firewalld in my patch. So it
works fine...

Later,
Lukas #lzap Zapletal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants