New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
boot: cmd/snap-bootstrap: handle a candidate recovery system v2 #9940
boot: cmd/snap-bootstrap: handle a candidate recovery system v2 #9940
Conversation
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…om initramfs Add a helpers for checking whether the current recovery system is being tried and marking its state. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…rdingly When we 'try' a new recovery system, make sure it boots up to a point where snap-bootstrap has set up the system for further execution in after-pivot world. This ensures that the try system is functional until that point, but does not execute any services from it in case those would alter the system state. We also verify that the ubuntu-data is mounted and accessible, and as such the new recovery system should be functional if fully booted. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…being removed Auto labeler seems to be removing the labels unexpectedly, see: - snapcore#9940 (comment) label was added manually when opening a PR and later removed by bot - snapcore#9943 (comment) modified files match the label, but labeler removed 'Run nested' label Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…t' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2-wip
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…t' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2
…em and back Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…overy-mgmt-sb-try-handling-v2
…est' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments, I have not finished reviewing the tests yet
// do we even have ubuntu-data? | ||
haveData := m.degradedState.partition("ubuntu-data").FindState == partitionFound | ||
|
||
if m.noFallback && (!haveData || assumeEncrypted) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic is a bit confusing to me what is the purpose of the || assumeEncrypted)
?
Also I think this may warrant at least an explanation in the README about the behavior if not an update to the state machine diagram (though I don't think there are any new transitions, just an additional exit path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we have a && part here at all? which scenario are we considering? I'm not understanding the comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are 2 possible scenarios where we want to exit early:
- the system is encrypted (
assumeEncrypted == true
), the call to using fallback key is unwarranted for whatever reason (unlocking the data or save with run key must have failed, or data is not mounted) - we could not find ubuntu-data (
!haveData == true
), thus we don't know if the system is encrypted or not and we try using save fallback key, but we don't want that either
I have left out the case for system being unencrypted and data failing to mount, as I assume this is part of the health check, which basically tells whether the relevant data is there (in this case it's state.json). Though to make it consistent in all cases, the check could verify that data is actually mounted and block this path early too. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in a quick chat, I'll try to prepare a little cleanup of the state functions around this part so that it is clear whether we are following the encrypted or unencrypted device code paths.
} | ||
|
||
if outcome == boot.TryRecoverySystemOutcomeSuccess { | ||
if err := healthCheck(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this err is not assigning to the one in the function as intended, should this be
if err := healthCheck(); err != nil { | |
if err = healthCheck(); err != nil { |
that way it can be included in the first error message in the defer'd function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's the original intention here, but this change in particular doesn't help surfacing the error because we will still end with "return nil" and that nil will be assigned to err before we process the defers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a log here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a pass, to be honest right now I'm a bit confused by some of the code here
} | ||
|
||
if outcome == boot.TryRecoverySystemOutcomeSuccess { | ||
if err := healthCheck(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's the original intention here, but this change in particular doesn't help surfacing the error because we will still end with "return nil" and that nil will be assigned to err before we process the defers
if finalizeErr != nil { | ||
err = finalizeErr | ||
} | ||
panic(fmt.Errorf("%v tried recovery system %q: %v", status, mst.recoverySystem, err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
successful/failed tried reads a bit oddly ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I tried to make it too fancy. Dropped this now. Also made sure that relevant errors are actually logged, so that at least have some trace of what failed.
// do we even have ubuntu-data? | ||
haveData := m.degradedState.partition("ubuntu-data").FindState == partitionFound | ||
|
||
if m.noFallback && (!haveData || assumeEncrypted) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we have a && part here at all? which scenario are we considering? I'm not understanding the comment below
// return back to run mode | ||
finalizeErr := finalizeTryRecoverySystemAndReboot(boot.TryRecoverySystemOutcomeInconsistent) | ||
// not reached, unless in tests | ||
panic(fmt.Errorf("inconsistent tried recovery system bootenv: %v", finalizeErr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/tried/try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my 2cts is that it would be nicer if finalizeTryRecoverySystemAndReboot would not return an error but either do its job or panic directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair point, i've refactored the code to end in finalizeTry..
…overy-mgmt-sb-try-handling-v2
…errors, extend tests Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…oo/uc20-recovery-mgmt-sb-try-handling-v2-wip
…overy-mgmt-sb-try-handling-v2-wip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the changes, couple of comments
if err == nil && !machine.degraded() { | ||
outcome = boot.TryRecoverySystemOutcomeSuccess | ||
} | ||
if outcome == boot.TryRecoverySystemOutcomeFailure { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I'm missing something but wouldn't an else be clearer than setting outcome and then checking it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -948,6 +956,11 @@ func (m *recoverModeStateMachine) openUnencryptedSave() (stateFunc, error) { | |||
func (m *recoverModeStateMachine) unlockEncryptedSaveFallbackKey() (stateFunc, error) { | |||
// try to unlock save with the fallback key on ubuntu-seed, which must have | |||
// been mounted at this point | |||
|
|||
if m.noFallback { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could also short cut unlockMaybeEncryptedAloneSaveFallbackKey? it still works as is because over that path we mark things as degraded but is a bit of pointless work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd leave the check here. It's probably a bit cleaner this way. Also, we identify the system as failed if we either get an error or are in the degraded mode. In case of disabled fallback, we'd get an error if we reach the fallback path.
…overy-mgmt-sb-try-handling-v2
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall, lgtm some test nitpicks
|
||
tryingCurrentSystem, err := boot.InitramfsIsTryingRecoverySystem(mst.recoverySystem) | ||
if err != nil { | ||
if boot.IsInconsystemRecoverySystemState(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it just now occurred to me that this is a typo, I assume that the original name intention of the function is:
if boot.IsInconsystemRecoverySystemState(err) { | |
if boot.IsInconsistentSystemRecoverySystemState(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, good catch.
logger.Noticef("try recovery system state is inconsistent: %v", err) | ||
finalizeTryRecoverySystemAndReboot(boot.TryRecoverySystemOutcomeInconsistent) | ||
} | ||
// this could be an inconsistency in the state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you clarify what kind of situations could lead here, using the phrase inconsistency
is confusing, since above the function we check for is IsInconsistent...
, so just saying that this could be an inconsistency in the state is a bit weird, since one would assume inconsistencies would have been true for the above function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this comment is no longer correct and should be dropped
@@ -0,0 +1,76 @@ | |||
summary: verify early boot handling of a try recovery system on UC20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice spread test!
@@ -4949,3 +4949,496 @@ func (s *initramfsMountsSuite) TestInitramfsMountsInstallModeUnsetMeasure(c *C) | |||
func (s *initramfsMountsSuite) TestInitramfsMountsRecoverModeMeasure(c *C) { | |||
s.testInitramfsMountsInstallRecoverModeMeasure(c, "recover") | |||
} | |||
|
|||
func (s *initramfsMountsSuite) runInitramfsMountsUnencryptedTryRecovery(c *C, triedSystem bool) (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might seriously need to think about not adding new tests to this file and instead creating new files here, it's a bit cough cough large
defer bootloader.Force(nil) | ||
|
||
hostUbuntuData := filepath.Join(boot.InitramfsRunMntDir, "host/ubuntu-data/") | ||
mockedState := filepath.Join(hostUbuntuData, "system-data/var/lib/snapd/state.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmmh it would be nice if we actually wrote this file only after the systemd-mount for ubuntu-data was called/performed in the callback, but this is probably okay as-is
// system | ||
"recovery_system_status": "try", | ||
"try_recovery_system": "1234", | ||
// system is set up to go into run more if rebooted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// system is set up to go into run more if rebooted | |
// system is set up to go into run mode if rebooted |
switch unlockVolumeWithSealedKeyCalls { | ||
|
||
case 1: | ||
// ubuntu data can't be unlocked with run key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// ubuntu data can't be unlocked with run key |
c.Assert(name, Equals, "ubuntu-data") | ||
c.Assert(sealedEncryptionKeyFile, Equals, filepath.Join(s.tmpDir, "run/mnt/ubuntu-boot/device/fde/ubuntu-data.sealed-key")) | ||
if unlockDataFails { | ||
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object") | |
// ubuntu-data can't be unlocked with the run key | |
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object") |
hostUbuntuData := filepath.Join(boot.InitramfsRunMntDir, "host/ubuntu-data/") | ||
mockedState := filepath.Join(hostUbuntuData, "system-data/var/lib/snapd/state.json") | ||
c.Assert(os.MkdirAll(filepath.Dir(mockedState), 0750), IsNil) | ||
c.Assert(ioutil.WriteFile(mockedState, []byte(mockStateContent), 0640), IsNil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why create these files if we are mocking a failure for the recovery health check anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit artificial scenario where setting up try recovery has been successful so far (thus copying of state files from ubuntu-data was successful too), and only the health check fails. I've added a comment in the code to cover that.
…overy-mgmt-sb-try-handling-v2
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
The relevant test is green and the failures are unrelated. @pedronis I think we can land it. |
Supersedes #9928