Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] overlord: show what manager throws a state ensure error #8687

Closed
wants to merge 1 commit into from

Conversation

mvo5
Copy link
Contributor

@mvo5 mvo5 commented May 18, 2020

When we install UC20 we get a lot of state.ErrNoState errors. In
general the state engine error reporting can be unhelpful if the
errors generic. To improve the situation a bit this commit adds
information about what manager is throwing the error. Eventually
we should ensure we don't return "naked" ErrNoState or similar
generic errors all the way up without decorating them with more
information.

I tested this on a UC20 install and the output is (slightly) more useful
this way. More to come.

When we install UC20 we get a lot of state.ErrNoState errors. In
general the state engine error reporting can be unhelpful if the
errors generic. To improve the situation a bit this commit adds
information about what manager is throwing the error. Eventually
we should ensure we don't return "naked" ErrNoState or similar
generic errors all the way up without decorating them with more
information.
@mvo5 mvo5 added the Needs Samuele review Needs a review from Samuele before it can land label May 18, 2020
@mvo5 mvo5 added this to the 2.45 milestone May 18, 2020
@pedronis pedronis self-requested a review May 18, 2020 11:01
@@ -147,7 +147,7 @@ func (se *StateEngine) Ensure() error {
for _, m := range se.managers {
err := m.Ensure()
if err != nil {
logger.Noticef("state ensure error: %v", err)
logger.Noticef("state ensure error from %T: %v", m, err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

devicemgr has it's own ensureError which adds devicemgr: .. prefix to the error string. Maybe we could do something like this:

diff --git a/overlord/stateengine.go b/overlord/stateengine.go
index 80b2755c4d..f4da008bdc 100644
--- a/overlord/stateengine.go
+++ b/overlord/stateengine.go
@@ -121,6 +121,14 @@ func (se *StateEngine) StartUp() error {
 type ensureError struct {
        errs []error
 }
+type managerEnsureError struct {
+       manager string
+       err     error
+}
+
+func (m *managerEnsureError) Error() string {
+       return fmt.Sprintf("%s: %v", m.manager, m.err)
+}
 
 func (e *ensureError) Error() string {
        return fmt.Sprintf("state ensure errors: %v", e.errs)
@@ -147,8 +155,12 @@ func (se *StateEngine) Ensure() error {
        for _, m := range se.managers {
                err := m.Ensure()
                if err != nil {
-                       logger.Noticef("state ensure error: %v", err)
-                       errs = append(errs, err)
+                       merr := &managerEnsureError{
+                               manager: fmt.Sprintf("%T", m),
+                               err:     err,
+                       }
+                       logger.Noticef("state ensure error: %v", merr)
+                       errs = append(errs, merr)
                }
        }
        if len(errs) != 0 {

Note that the error strings are not too nice:

state ensure errors: [*overlord_test.fakeManager: boom1 *overlord_test.fakeManager: boom2]

@pedronis
Copy link
Collaborator

the ongoing assumption was that ensure errors would rare or not very interesting that's why we never spent a lot of time on their logging, it seems the assumption is wrong.

@mvo5
Copy link
Contributor Author

mvo5 commented May 18, 2020

I think they are kind of rare, it's not totally wrong. After PR #8690 got merged it even rarer. But for the cases when it does happen every bit of context helps

@zyga
Copy link
Collaborator

zyga commented May 19, 2020

@mvo5 spread failures look related to the changes here.

@mvo5
Copy link
Contributor Author

mvo5 commented May 19, 2020

Closing for now, it's not important enough if it requires this level of discussion.

@mvo5 mvo5 closed this May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Samuele review Needs a review from Samuele before it can land
Projects
None yet
4 participants