Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boot: cmd/snap-bootstrap: handle a candidate recovery system v2 #9940

Merged

Conversation

bboozzoo
Copy link
Collaborator

Supersedes #9928

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…om initramfs

Add a helpers for checking whether the current recovery system is being tried
and marking its state.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…rdingly

When we 'try' a new recovery system, make sure it boots up to a point where
snap-bootstrap has set up the system for further execution in after-pivot world.
This ensures that the try system is functional until that point, but does not
execute any services from it in case those would alter the system state.

We also verify that the ubuntu-data is mounted and accessible, and as such the
new recovery system should be functional if fully booted.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo bboozzoo added the Run nested The PR also runs tests inluded in nested suite label Feb 18, 2021
@github-actions github-actions bot removed the Run nested The PR also runs tests inluded in nested suite label Feb 18, 2021
@bboozzoo bboozzoo added the Run nested The PR also runs tests inluded in nested suite label Feb 18, 2021
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@github-actions github-actions bot removed the Run nested The PR also runs tests inluded in nested suite label Feb 18, 2021
@bboozzoo bboozzoo added the Run nested The PR also runs tests inluded in nested suite label Feb 18, 2021
bboozzoo added a commit to bboozzoo/snapd that referenced this pull request Feb 18, 2021
…being removed

Auto labeler seems to be removing the labels unexpectedly, see:

- snapcore#9940 (comment) label was added
  manually when opening a PR and later removed by bot

- snapcore#9943 (comment) modified files
  match the label, but labeler removed 'Run nested' label

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…t' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2-wip
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…t' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2
…em and back

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…est' into bboozzoo/uc20-recovery-mgmt-sb-try-handling-v2
Copy link
Member

@anonymouse64 anonymouse64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments, I have not finished reviewing the tests yet

// do we even have ubuntu-data?
haveData := m.degradedState.partition("ubuntu-data").FindState == partitionFound

if m.noFallback && (!haveData || assumeEncrypted) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is a bit confusing to me what is the purpose of the || assumeEncrypted) ?

Also I think this may warrant at least an explanation in the README about the behavior if not an update to the state machine diagram (though I don't think there are any new transitions, just an additional exit path)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we have a && part here at all? which scenario are we considering? I'm not understanding the comment below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are 2 possible scenarios where we want to exit early:

  • the system is encrypted (assumeEncrypted == true), the call to using fallback key is unwarranted for whatever reason (unlocking the data or save with run key must have failed, or data is not mounted)
  • we could not find ubuntu-data (!haveData == true), thus we don't know if the system is encrypted or not and we try using save fallback key, but we don't want that either

I have left out the case for system being unencrypted and data failing to mount, as I assume this is part of the health check, which basically tells whether the relevant data is there (in this case it's state.json). Though to make it consistent in all cases, the check could verify that data is actually mounted and block this path early too. Does that make sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in a quick chat, I'll try to prepare a little cleanup of the state functions around this part so that it is clear whether we are following the encrypted or unencrypted device code paths.

}

if outcome == boot.TryRecoverySystemOutcomeSuccess {
if err := healthCheck(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this err is not assigning to the one in the function as intended, should this be

Suggested change
if err := healthCheck(); err != nil {
if err = healthCheck(); err != nil {

that way it can be included in the first error message in the defer'd function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's the original intention here, but this change in particular doesn't help surfacing the error because we will still end with "return nil" and that nil will be assigned to err before we process the defers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a log here instead.

@pedronis pedronis added the UC20 label Mar 4, 2021
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a pass, to be honest right now I'm a bit confused by some of the code here

}

if outcome == boot.TryRecoverySystemOutcomeSuccess {
if err := healthCheck(); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's the original intention here, but this change in particular doesn't help surfacing the error because we will still end with "return nil" and that nil will be assigned to err before we process the defers

if finalizeErr != nil {
err = finalizeErr
}
panic(fmt.Errorf("%v tried recovery system %q: %v", status, mst.recoverySystem, err))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

successful/failed tried reads a bit oddly ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried to make it too fancy. Dropped this now. Also made sure that relevant errors are actually logged, so that at least have some trace of what failed.

// do we even have ubuntu-data?
haveData := m.degradedState.partition("ubuntu-data").FindState == partitionFound

if m.noFallback && (!haveData || assumeEncrypted) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we have a && part here at all? which scenario are we considering? I'm not understanding the comment below

// return back to run mode
finalizeErr := finalizeTryRecoverySystemAndReboot(boot.TryRecoverySystemOutcomeInconsistent)
// not reached, unless in tests
panic(fmt.Errorf("inconsistent tried recovery system bootenv: %v", finalizeErr))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/tried/try?

Copy link
Collaborator

@pedronis pedronis Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2cts is that it would be nicer if finalizeTryRecoverySystemAndReboot would not return an error but either do its job or panic directly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point, i've refactored the code to end in finalizeTry..

…errors, extend tests

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
…oo/uc20-recovery-mgmt-sb-try-handling-v2-wip
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the changes, couple of comments

if err == nil && !machine.degraded() {
outcome = boot.TryRecoverySystemOutcomeSuccess
}
if outcome == boot.TryRecoverySystemOutcomeFailure {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm missing something but wouldn't an else be clearer than setting outcome and then checking it again?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -948,6 +956,11 @@ func (m *recoverModeStateMachine) openUnencryptedSave() (stateFunc, error) {
func (m *recoverModeStateMachine) unlockEncryptedSaveFallbackKey() (stateFunc, error) {
// try to unlock save with the fallback key on ubuntu-seed, which must have
// been mounted at this point

if m.noFallback {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could also short cut unlockMaybeEncryptedAloneSaveFallbackKey? it still works as is because over that path we mark things as degraded but is a bit of pointless work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd leave the check here. It's probably a bit cleaner this way. Also, we identify the system as failed if we either get an error or are in the degraded mode. In case of disabled fallback, we'd get an error if we reach the fallback path.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Copy link
Member

@anonymouse64 anonymouse64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, lgtm some test nitpicks


tryingCurrentSystem, err := boot.InitramfsIsTryingRecoverySystem(mst.recoverySystem)
if err != nil {
if boot.IsInconsystemRecoverySystemState(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just now occurred to me that this is a typo, I assume that the original name intention of the function is:

Suggested change
if boot.IsInconsystemRecoverySystemState(err) {
if boot.IsInconsistentSystemRecoverySystemState(err) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, good catch.

logger.Noticef("try recovery system state is inconsistent: %v", err)
finalizeTryRecoverySystemAndReboot(boot.TryRecoverySystemOutcomeInconsistent)
}
// this could be an inconsistency in the state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you clarify what kind of situations could lead here, using the phrase inconsistency is confusing, since above the function we check for is IsInconsistent..., so just saying that this could be an inconsistency in the state is a bit weird, since one would assume inconsistencies would have been true for the above function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this comment is no longer correct and should be dropped

@@ -0,0 +1,76 @@
summary: verify early boot handling of a try recovery system on UC20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice spread test!

@@ -4949,3 +4949,496 @@ func (s *initramfsMountsSuite) TestInitramfsMountsInstallModeUnsetMeasure(c *C)
func (s *initramfsMountsSuite) TestInitramfsMountsRecoverModeMeasure(c *C) {
s.testInitramfsMountsInstallRecoverModeMeasure(c, "recover")
}

func (s *initramfsMountsSuite) runInitramfsMountsUnencryptedTryRecovery(c *C, triedSystem bool) (err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might seriously need to think about not adding new tests to this file and instead creating new files here, it's a bit cough cough large

defer bootloader.Force(nil)

hostUbuntuData := filepath.Join(boot.InitramfsRunMntDir, "host/ubuntu-data/")
mockedState := filepath.Join(hostUbuntuData, "system-data/var/lib/snapd/state.json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmh it would be nice if we actually wrote this file only after the systemd-mount for ubuntu-data was called/performed in the callback, but this is probably okay as-is

// system
"recovery_system_status": "try",
"try_recovery_system": "1234",
// system is set up to go into run more if rebooted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// system is set up to go into run more if rebooted
// system is set up to go into run mode if rebooted

switch unlockVolumeWithSealedKeyCalls {

case 1:
// ubuntu data can't be unlocked with run key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ubuntu data can't be unlocked with run key

c.Assert(name, Equals, "ubuntu-data")
c.Assert(sealedEncryptionKeyFile, Equals, filepath.Join(s.tmpDir, "run/mnt/ubuntu-boot/device/fde/ubuntu-data.sealed-key"))
if unlockDataFails {
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object")
// ubuntu-data can't be unlocked with the run key
return foundEncrypted("ubuntu-data"), fmt.Errorf("failed to unlock ubuntu-data with run object")

Comment on lines +5416 to +5419
hostUbuntuData := filepath.Join(boot.InitramfsRunMntDir, "host/ubuntu-data/")
mockedState := filepath.Join(hostUbuntuData, "system-data/var/lib/snapd/state.json")
c.Assert(os.MkdirAll(filepath.Dir(mockedState), 0750), IsNil)
c.Assert(ioutil.WriteFile(mockedState, []byte(mockStateContent), 0640), IsNil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why create these files if we are mocking a failure for the recovery health check anyways?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit artificial scenario where setting up try recovery has been successful so far (thus copying of state files from ubuntu-data was successful too), and only the health check fails. I've added a comment in the code to cover that.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
@bboozzoo
Copy link
Collaborator Author

The relevant test is green and the failures are unrelated. @pedronis I think we can land it.

@pedronis pedronis merged commit f525ab4 into snapcore:master Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Run nested The PR also runs tests inluded in nested suite
Projects
None yet
3 participants