Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many sd-boot/bootctl fixes, and a new "boot counting" concept, for automatic fallback to older kernels on persistent failures #9437

Merged
merged 15 commits into from Oct 19, 2018

Conversation

@poettering
Copy link
Member

@poettering poettering commented Jun 26, 2018

Yupp, a lot of patches in a single PR, but most are rather simple and the stuff at the end is mostly docs...

In order to grok the concepts introduced here it might make sense to read through the last commit first which contains docs explaining how this all fits together.

For unattended systems such as servers, IoT/embedded devices or appliances an
automatic logic of reverting back to previous versions of the OS or kernel in
case system boots consistently fail is important. systemd provides support for
this in various of its components. When used together they provide a complete
Copy link
Member

@vcaputo vcaputo Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/various/a variety/

Loading

* The
[`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)
boot loader optionally maintains a per-boot-loader-entry counter that is
decreased by one on each attempt to boot the entry, priorizing entries that
Copy link
Member

@vcaputo vcaputo Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/priorizing/prioritizing/

Loading


* The
[`systemd-boot-check.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-boot-check.service.html)
service is a simple health check tool that determines whether the boot
Copy link
Member

@vcaputo vcaputo Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/whether/wether/

Loading


## Walkthrough

Here's an example walkthrough how this all fits together.
Copy link
Member

@vcaputo vcaputo Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/how/of how/

Loading

@vcaputo
Copy link
Member

@vcaputo vcaputo commented Jun 26, 2018

@poettering I can see the appeal of overloading the entry filenames to contain the attempts/remaining counters, but I think it would probably be better to leave the entries alone. What if someone wanted to have the directory or fs containing the entries be non-writable outside of privileged update windows when they are manipulated?

Maybe have a separate directory containing the per-entry counters, and use symlinks named after the entry files. Make the link target contain the counters rather than actually pointing anywhere meaningful in the filesystem.

This also would eliminate the issue of entries unintentionally containing counters in their name, if that was a possibility (I didn't examine if the naming of those entries was entirely user-defined).

Loading

@evverx
Copy link
Member

@evverx evverx commented Jun 26, 2018

This pull request introduces 1 alert when merging 1d8536b into ad4bc33 - view on LGTM.com

new alerts:

  • 1 for Comparison result is always the same

Comment posted by LGTM.com

Loading

service automatically marks a boot loader entry for which boot counting as
mentioned above is enabled, as "good" when a boot has been determined to be
successful, thus turning off boot counting for it again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comma is missing: "marks a boot loader entry, for which boot counting as mentioned above is enabled, as "good" when a boot has been determined to be successful".

Loading

transaction.

8. The `boot-complete.target` unit is ordered after and pulls in various units
that shall are required to succeed for the boot process to be considered

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/shall//

Loading

Copy link
Contributor

@medhefgo medhefgo left a comment

That was an exhausting read :D

I generally like this but I really disagree on using rename to take track of tries. We already distrust EFI firmwares (for good reason), so using EFI variables for this is out of the question. But then we also cannot rely on them implementing renames in an atomic fashion. What if we are really lucky and lose power while the EFI is touching the fs? We could end up losing the boot config file or it could end up with garbage. And if it happens to be the only one the user has (for whatever reason) we wouldn't even have anything to boot anymore.

@vcaputo We cannot rely on fancy things like links on ESP because vfat is such an awesome filesystem.

Loading

<variablelist>
<varlistentry>
<term><varname>LoaderBootCountPath</varname></term>
<listitem><para>If boot counting is enabled contains the path the of the file the counter is encoded in the
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence does not parse for me.

Loading


<varlistentry>
<term><varname>LoaderConfigTimeout</varname></term>
<listitem><para>The menu time-out. Read by the boot loader.</para></listitem>
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, sd-boot can also change that (t/T).

Loading

<term><varname>LoaderEntryDefault</varname></term>
<term><varname>LoaderEntryOneShot</varname></term>

<listitem><para>The identifier of the default boot loader entry. Set by the OS and read by the boot
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, LoaderEntryDefault can also be set by the bootloader (d/D).

Loading

newest boot entry is now tried, i.e. the system automatically reverted back
to an earlier version.

The above describes the a walkthrough when the selected boot entry continously
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superfulous "a" here.

Loading

combined with `Requires=` dependencies from the target, so that the target
cannot be reached when any of the units fail. You may add any number of
units like this, and only if they all succeed the boot entry is marked as
good. Note that the target unit shal pull in these boot checking units, not
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/shal/shall

Loading

wrap them in a unit and order them after `boot-complete.target`, pulling it
in.

# FAQ
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add why this is not stored in UEFI variables too.

Loading

@@ -41,6 +41,11 @@ typedef struct {
EFI_STATUS (*call)(VOID);
BOOLEAN no_autoselect;
BOOLEAN non_unique;
UINTN tries_done;
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like using INTN would make this much nicer without all the (UINTN)-1 casts all over the place.

Loading

Copy link
Member Author

@poettering poettering Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, but that would suggest the counter could actually go negative, but it can't... We use (UINTN) -1 as a special way to indicate one particular exceptional case (and we do this similar elsewhere in the tree), hence I don't think we should really change the type altogether....

Loading

arg_path, good,
arg_path, bad);

log_debug("Tries left: %" PRIu64"\n"
Copy link
Contributor

@medhefgo medhefgo Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can unify these 2 log_debug() calls-

Loading

@medhefgo
Copy link
Contributor

@medhefgo medhefgo commented Jun 27, 2018

Also, I feel like turning boot counting off after the first successful boot is the wrong way to go. The whole point of this is to be failsafe, no? And booting could go wrong later again...

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

@poettering I can see the appeal of overloading the entry filenames to contain the attempts/remaining counters, but I think it would probably be better to leave the entries alone. What if someone wanted to have the directory or fs containing the entries be non-writable outside of privileged update windows when they are manipulated?

Maybe have a separate directory containing the per-entry counters, and use symlinks named after the entry files. Make the link target contain the counters rather than actually pointing anywhere meaningful in the filesystem.

sd-boot operates on the ESP, i.e. FAT. FAT doesn't have symlinks. And even if we loosen the requirement for FAT/ESP one day we should still stick to operations that are implementable on FAT.

We need to be able to manipulate the counters from two sides: the boot loader, before we start a boot process. And the OS, after we completed a boot process. (And of course, also during update, to install the image and set an initial count on it.) These time windows are all very brief and well defined I'd say, on their own.

This also would eliminate the issue of entries unintentionally containing counters in their name, if that was a possibility (I didn't examine if the naming of those entries was entirely user-defined).

I thought about that, and if it indeed turns out to be a problem I figure we should simply make the whole logic something one can turn off, so that the "+" stuff looses its magic effect. After all the whole design is put together so that a boot loader that implements the boot loader spec but doesn't do boot counting will just work, but ignore the counters.

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

I generally like this but I really disagree on using rename to take track of tries. We already distrust EFI firmwares (for good reason), so using EFI variables for this is out of the question.

The distrust in EFI variables is probably more caused by the quality of memory chips backing them, i.e. it's doubt about the hardware more than the software.

But then we also cannot rely on them implementing renames in an atomic fashion. What if we are really lucky and lose power while the EFI is touching the fs? We could end up losing the boot config file or it could end up with garbage. And if it happens to be the only one the user has (for whatever reason) we wouldn't even have anything to boot anymore.

Well, that leaves very little wiggle room... We only really have the ESP and the vars as storage space.

Loading

@medhefgo
Copy link
Contributor

@medhefgo medhefgo commented Jun 27, 2018

It's fine putting them in ESP, just use a separate file instead of the loader entries. You could put them in /loader/tries/* for example.

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

It's fine putting them in ESP, just use a separate file instead of the loader entries. You could put them in /loader/tries/* for example.

I think that's more risky actually, as writing files means actually allocating/releasing new blocks usually, and that is hard to do atomically (and also involves allocation algorithms, and there's a good chance firmware will fuck that up). Renaming otoh is very likely implementable atomically even on crappy, simpler file systems... I mean, we have no golden bullet, we can just pick the least bad option from a pool of crappy ones. I am pretty sure renaming is safer than writing file contents...

What I like about the rename approach is firstly that the operation it implements has a chance of being atomic and reliable, but also that this means the counting info is directly associated with the entry, and the lifecycle of this data always bound together. You cannot remove the entry without also removing the associated counting data, and that's a really good thing. If you split the counting data into a separate file you need to be careful to always explicitly lifecycle that counting data along with the entry as they are not directly and implicitly attached to each other.

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

Also, I feel like turning boot counting off after the first successful boot is the wrong way to go. The whole point of this is to be failsafe, no? And booting could go wrong later again...

Hmm, so downgrade attacks are a major problem in computer security, in particular in cryptography. After we verified that a kernel works sufficiently well for us to mark the entry good, we should really stick to it I think. There are two opposing concepts here I figure: "don't make yourself vulnerable to downgrade attacks" and "downgrade if upgrade is fucked". I think we can only marry these two if we say we never downgrade once an upgrade worked at least once.

Loading

@vcaputo
Copy link
Member

@vcaputo vcaputo commented Jun 27, 2018

Oops, I keep forgetting about UEFI/ESP stuff, still living in the 90s with my trusty X61s.

With regards to the rename atomicity argument, the last time I checked actual on-disk atomicity of rename was not something guaranteed by most if not all filesystems. POSIX only speaks to runtime atomicity with regards to visibility of the replaced name to other processes. It's a common misconception to believe this equates durably atomic renames across unclean mounts of the filesystem, but that's not really the case IIRC.

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

With regards to the rename atomicity argument, the last time I checked actual on-disk atomicity of rename was not something guaranteed by most if not all filesystems. POSIX only speaks to runtime atomicity with regards to visibility of the replaced name to other processes. It's a common misconception to believe this equates durably atomic renames across unclean mounts of the filesystem, but that's not really the case IIRC.

Yeah, that's why I say there's a "chance" it's atomic. But you are right, there's no guarantee, and shitty code will always make this hard. But it's taking the least bad from a pile of crappy options.

Loading

@medhefgo
Copy link
Contributor

@medhefgo medhefgo commented Jun 27, 2018

I can see the point on downgrading attacks.

Another thing: I feel like counting tries is overkill if you just want to make sure upgrades work. If the new kernel+initrd don't work, retrying would be futile, no? So making this into a boolean signaling if this booted successful once should be enough.

Loading

@medhefgo
Copy link
Contributor

@medhefgo medhefgo commented Jun 27, 2018

I just had a crazy idea: in theory you could store this information in the file mtime... :D

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

Another thing: I feel like counting tries is overkill if you just want to make sure upgrades work. If the new kernel+initrd don't work, retrying would be futile, no? So making this into a boolean signaling if this booted successful once should be enough.

Well, you never know what people do: maybe they turn off the machine during boot (think: some random appliance handled by random people, or even a desktop PC with an encrypted harddisk and a password prompt in the initrd). There might be many reasons why a kernel/os image doesn't boot properly and only a subset of them have anything to do with the image itself. Hence I think there should be some room for configuration here. If deployments are in particular volatile environments it might make sense to initialize the counter to something high to allow a larger number of tries before giving up and reverting back. If deployments are in very well defined, very deterministic and supervised environments then it might make sense to initialize it to something low, maybe even 1, to only offer one attempt before reverting back.

Loading


r = make_good(prefix, suffix, &good);
if (r < 0)
return log_oom(),
Copy link
Member

@evverx evverx Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the alert was a false positive until I noticed this comma. I guess a semicolon should be here instead.

Loading

Copy link
Member Author

@poettering poettering Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch! I kept scratching my head about this warning, and couldn't find what was wrong! Thanks!

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 27, 2018

I force pushed a new version with all reported typos fixed!

Loading

@boucman
Copy link
Contributor

@boucman boucman commented Jun 27, 2018

I know non-EFI system are not well handled by systemd, but having a systemd-provided point for "system is good" would be very interesting for embedded systems, and embedded/IoT systems don't have EFI...

Would it be possible to split the various units into a generic bootloader-independant part and a EFI-specific part, the same way the wait-online in network-manager-independant ?

This way we could andd u-boot or barebox specific implementations easily...

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jun 28, 2018

I know non-EFI system are not well handled by systemd, but having a systemd-provided point for "system is good" would be very interesting for embedded systems, and embedded/IoT systems don't have EFI...

Would it be possible to split the various units into a generic bootloader-independant part and a EFI-specific part, the same way the wait-online in network-manager-independant ?

That's actually excactly what this is already doing, and exactly this is even explicitly mentioned in the markdown document explaining how this all fits together, at the second to last commit of this PR. Please have a look.

This way we could andd u-boot or barebox specific implementations easily...

Yes, absolutely, the "boot-complete.target" is supposed to be generic, and usable with other boot loaders and other "boot ok" schemes.

Loading

@boucman
Copy link
Contributor

@boucman boucman commented Jun 28, 2018

right, my bad...

I'll reread the whole PR in more details. I was kinda rushed when I first read it.

sorry

Loading

completed successfuly. When enabled it becomes an indirect dependency of
`systemd-boot-complete.service` (by means of `boot-complete.target`, see
below), ensuring that the boot will not be considered successful if there are
any failed services.
Copy link
Contributor

@boucman boucman Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is where I got confused... systemd-boot-check is actually independent of using systemd-boot. maybe rename it boot-check.service ?

Loading

Copy link
Member Author

@poettering poettering Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally prefix stuff units that run code provided by us with "systemd-", and this service does invoke C code we hacked up, hence it carries that prefix.

Loading

off.

12. On the following boot (and all subsequent boots after that) the entry is
now seen with boot counting turned off, no further renaming takes place.
Copy link
Contributor

@boucman boucman Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider reseting the counter instead of disabling boot-counting entirely.

I understand the idea of "if it booted once properly, the image is correct and it won't magically stop working" but I have actually seen that happen in real life.

Suppose you have two completely separate (i.e different rootfs) boot entries. One is the normal operation one, the other is a recovery system that can do only one thing : reformat and reinstall the normal partition.

The normal partition might become bad for various reasons on a completely autonomous system. Bad blocks, bad handling of disk full, bad recovery on incorrect shutdown... Those kind of things are very rare, but on autonomous systems without any sysadmin they can happen and, statistically, they do.

At this point, having the boot counter still active provides a simple "parachute" for a complete reinstall, and that's very welcome.

Loading

Copy link
Member Author

@poettering poettering Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@medhefgo also suggested this, see above. and also see my comments on that. Besides the downgrade-attack thing I also think there's value in putting boundaries on the time where we keep changing the $BOOT file system. With the implemented approach, if an update cycle is clean we'll update $BOOT three times: once to install the new kernel, once on the next boot to count down, and once after it completed to mark it OK. After that $BOOT isn't touched anymore, regardless how many cycles follow, until the next update. I think that's much better behaviour than continuing to count forerver.

I mean, the current scheme makes its changes to the ESP, which is VFAT, and hence has less than ideal integrity behaviour. If we had something more reliable we'd use it, but it's what we have to work with, but we should still optimize the write cycles, and I think O(1) instead of O(n) write cycles per update is highly preferable.

But dunno. we can certainly look into making other schemes workable if there's really the need, once the initial support is in.

Loading

Copy link
Contributor

@boucman boucman Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's no big deal, I'll have to reimplement that part of the logic for u-boot anyway...

that was more a case of "i've been there before" but there are many use cases and only one can be the default...

Loading

units/boot-complete.target Show resolved Hide resolved
Loading
Copy link
Member

@keszybz keszybz left a comment

My major gripe is with the positioning of of the boot counter update service. IIUC, it will be called very early, when it actually needs to be called very late.

I also think the names of many things would benefit from being changed. If those somewhat misleading names are established early on, people will have a harder time using this correctly.

But overall, this looks nice. The counting scheme is a bit complex with the two counters, but not too much, and I see the value in having the two counters.

Loading

(meson-mode . ((meson-indent-basic . 8)))
(nil . ((indent-tabs-mode . nil)
(tab-width . 8)
(fill-column . 79))))
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we establish in the other PR that this doesn't actually work?

Loading

* Names we grok, and the series they result in:
*
* foobar+3.efi → foobar+2-1.efi → foobar+1-2.efi → foobar+0-3.efi → STOP!
* foobar+4.0.efi → foobar+3-1.efi → foobar+2-2.efi → foobar+1-3.efi → foobar+0-4.efi → STOP!
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here "4.0".

Loading

*
* foobar+3.efi → foobar+2-1.efi → foobar+1-2.efi → foobar+0-3.efi → STOP!
* foobar+4.0.efi → foobar+3-1.efi → foobar+2-2.efi → foobar+1-3.efi → foobar+0-4.efi → STOP!
*
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious line?

Loading

UINTN new_left, digit;
digit = file[i] - '0';
if (digit > (((UINTN) -1) / factor)) /* overflow check */
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those extra parens around the division operation are not necessary…

Loading

<listitem><para>Show a help screen</para></listitem>
</varlistentry>

<varlistentry>
<term>Ctrl + l</term>
<term><keycombo><keycap>Ctrl</keycap><keycap>l</keycap></keycombo></term>
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, never occurred to me to check if docbook has something like this.

Loading

#include "util.h"
#include "virt.h"

/* This generator pulls in systemd-boot-complete.service if the "LoaderBootCountPath" EFI variable is set, i.e. the
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pulls ... into what ?

Loading


/* We pull this in from basic.target so that it ends up in all "regular" boot ups, but not in rescue.target or
* even emergency.target. */
p = strjoina(arg_dest, "/" SPECIAL_BASIC_TARGET ".wants/systemd-boot-complete.service");
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is cannot be. This needs to be hooked up late into boot, after all the checks and whatnot have happened. Marking a boot as successful early in the boot is pointless.

Loading

Copy link
Member Author

@poettering poettering Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(misunderstanding, as discussed below)

Loading

mkosi.build Outdated
# Manually update the boot loader from the one we just built
mkdir -p "$DESTDIR"/boot/efi/EFI/systemd "$DESTDIR"/boot/efi/EFI/BOOT
cp "$DESTDIR"/usr/lib/systemd/boot/efi/systemd-bootx64.efi "$DESTDIR"/boot/efi/EFI/systemd/systemd-bootx64.efi
cp "$DESTDIR"/usr/lib/systemd/boot/efi/systemd-bootx64.efi "$DESTDIR"/boot/efi/EFI/BOOT/bootx64.efi
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, are those paths (with /boot/efi) always correct?

Loading

mkosi.build Outdated
# Manually update the boot loader from the one we just built
mkdir -p "$DESTDIR"/boot/efi/EFI/systemd "$DESTDIR"/boot/efi/EFI/BOOT
cp "$DESTDIR"/usr/lib/systemd/boot/efi/systemd-bootx64.efi "$DESTDIR"/boot/efi/EFI/systemd/systemd-bootx64.efi
cp "$DESTDIR"/usr/lib/systemd/boot/efi/systemd-bootx64.efi "$DESTDIR"/boot/efi/EFI/BOOT/bootx64.efi
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, are those paths (with /boot/efi) always correct?

Loading

Copy link
Member Author

@poettering poettering Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the mkosi image we build they are...

Loading

@@ -28,7 +28,7 @@ enum loader_type {
};
typedef struct {
CHAR16 *file;
CHAR16 *id; /* The identifier for this entry (not that this id is not necessarily unique though!) */
Copy link
Member

@keszybz keszybz Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many "not"s.

Loading

@poettering
Copy link
Member Author

@poettering poettering commented Jul 4, 2018

My major gripe is with the positioning of of the boot counter update service. IIUC, it will be called very early, when it actually needs to be called very late.

The boot counter update service systemd-boot-complete.service is ordered after sysinit.target by default, as well as boot-complete.target. The latter is supposed to be the extension point which people can use to configure precisely how early or late they want the boot counter update to happen. If they don't order anything else before it, yes it will run pretty early on. But they can order anything they like before it, for example:

  1. let's say httpd.service and mysql.service is extremely imporant to them. In that case they can add Requires= pointing from boot-complete.target towards them (through RequiredBy=), and the marking will be delayed until httpd and mysql are known to have started up successfuly.

  2. They can also enable our own "systemd-boot-check.service", which is mostly an example but can be useful too. It is ordered by default after multi-user.target and graphical.target, i.e very late and it checks for any failed services. They can also hack up their own services like this, possible with simple scripts, and order them properly to move the marking as early or late into the boot process as they wish.

So, by default we'll mark things as good early on, but people have total freedom there, and can move it somewhere in the middle of the boot process (which option 1 above would do) or very later into it (which option 2 would do).

Loading

keszybz added a commit that referenced this issue Oct 8, 2018
the bootctl changes from PR #9437 (the boot counting PR)
keszybz added a commit that referenced this issue Oct 18, 2018
the EFI changes from PR #9437 (the boot counting PR)
@keszybz
Copy link
Member

@keszybz keszybz commented Oct 18, 2018

I'll force push this PR (with no changes) so github sees that some of the commits are already merged.

Loading

Copy link
Member

@keszybz keszybz left a comment

Looks nice, just some typos.

Loading

support is built into various of its components. When used together these
components provide a complete solution on UEFI systems, built as add-on to the
[Boot Loader
Specification](https://github.com/systemd/systemd/blob/master/docs/BOOT_LOADER_SPECIFICATION.md). However,
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be updated like in a084849.

Loading


* The
[`systemd-boot-check-no-failures.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-boot-check-no-failures.service.html)
service is a simple health check tool that determines wether the boot
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether

Loading

completed successfuly. When enabled it becomes an indirect dependency of
`systemd-bless-boot.service` (by means of `boot-complete.target`, see
below), ensuring that the boot will not be considered successful if there are
any failed services.
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my Fedora laptop, failing services are always a thing. After upgrading to F29 I have three, and so far I didn't bother to investigate. This would prevent me from ever blessing a new kernel. But the man page makes it clear that this is not supposed to be used for real, so it's all OK.

Loading

<literal>+</literal> followed by one or two numbers (if two they need to be separated by a <literal>-</literal>),
before the <filename>.conf</filename> or <filename>.efi</filename> suffix is subject to boot counting: the first of
the two numbers ('tries left') is decreased by one on every boot attempt, the second of the two numbers ('done
tries') is increased by one (if 'tries done' is absent it is considered equivalent to 0). Depending on the current
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'tries done'

Loading

<para>Internally, the service operates based on the <varname>LoaderBootCountPath</varname> EFI variable (of the
vendor UUID <constant>4a67b082-0a4c-41cf-b6c7-440b29bb8c4</constant>), which is passed from the boot loader to the
OS. It contains a file system path (relative to the EFI system partition) of the <ulink
url="https://github.com/systemd/systemd/blob/master/docs/BOOT_LOADER_SPECIFICATION.md">Boot Loader
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

systemd.io...

Loading


/* This generator pulls systemd-bless-boot.service into the initial transaction if the "LoaderBootCountPath" EFI
* variable is set, i.e. the system boots up with boot counting in effect, which means we should mark the boot as
* "good" if we manage to boot up this far. */
Copy link
Member

@keszybz keszybz Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this far" is unclear. The generator runs very early, but the blessing happens very late.

Loading

Copy link
Member Author

@poettering poettering Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i changed this to "far enough" now. It's supposed to be vague, since after all it's supposed to open what is meant by "successful boot". But of course, the "this" is confusing indeed, since it's not just vague but potentially referencing something in the same paragraph and that's something we should fix which I hope "far enough" does. Hope that makes sense.

Loading

poettering added 15 commits Oct 19, 2018
This is the counterpiece to the boot counting implemented in
systemd-boot: if a boot is detected as successful we mark drop the
counter again from the booted snippet or kernel image.
This generator automatically pulls in "systemd-bless-boot.service" if
a boot with boot counting is detected.
…ries with "tries"

This makes two changes:

1. When called for "remove" any drop-ins with "+" suffix are removed
   too, so that the logic works for entries with boot counting enabled
   too and we don't lose track of configuration snippets created that
   way.

2. When called for "add" we optionally generate a "+" suffix, based on
   the data in /etc/kernel/tries if it exists.

   This basically means after "echo 5 > /etc/kernel/tries" any installed
   kernels will automatically set up for 5 boot tries before older
   kernels will be tried.
Many general updates, but most importantly, document the
/etc/kernel/tries logic briefly.
This is might be useful in some cases, but it's primarily an example for
a boot check service that can be plugged before boot-complete.target.

It's disabled by default.

All it does is check whether the failed unit count is zero
@poettering
Copy link
Member Author

@poettering poettering commented Oct 19, 2018

OK, I rebased it on top of current git and pushed it, so that it applies again. I made all suggested changes. I am taking the liberty to put the green label on it, given that the new changes are only about typo fixes. Thanks for the review!

Loading

@poettering poettering merged commit a2689fa into systemd:master Oct 19, 2018
2 of 6 checks passed
Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Linked issues

Successfully merging this pull request may close these issues.

None yet

8 participants