
machined.run function blocks forever if c.V1Alpha2().Run fails without ever running controllers. #8263

Open
DmitriyMV opened this issue Feb 5, 2024 · 4 comments

@DmitriyMV
Member

The main culprit is here:

go func() {
	if e := c.V1Alpha2().Run(ctx, drainer); e != nil {
		errCh <- fmt.Errorf("fatal controller runtime error: %s", e)
	}

	log.Printf("controller runtime finished")
}()

If it exits too soon, the

if err = c.Run(ctx, runtime.SequenceInitialize, nil); err != nil {
	if errors.Is(err, context.Canceled) {
		initializeCanceled = true
	} else {
		return err
	}
}

will block forever, because EnforceKSPPRequirements (specifically runtime.KernelParamsSetCondition) depends on the controller runtime running. But even if we somehow exit from this function, we will block forever here:

// Schedule service shutdown on any return.
defer system.Services(c.Runtime()).Shutdown(ctx)

What is interesting here is that the passed ctx will essentially act as context.Background(), because the deferred cancelation happens AFTER we finish with system.Singleton.Shutdown(...). Thus, if the runtime is not running, Shutdown will block forever too.

I tried to fix this in #8256, but I don't think we should pass a canceled context to Shutdown(...).

Instead, we create a new context with a timeout of 60 seconds (stopServices tries to wait for about 30 seconds, so we make the overall timeout twice as long). But I'm unsure whether this is the proper fix.

@DmitriyMV DmitriyMV self-assigned this Feb 5, 2024
@DmitriyMV DmitriyMV added this to the v1.7 milestone Feb 5, 2024
@smira
Member

smira commented Feb 5, 2024

This code is ugly, as it combines the older sequence-based initialization with the newer controller-based approach.

But if we want to fix it (which I believe we don't really need to), we just need to wait with some timeout and fail the sequencer. This works well.

If we need this for debugging purposes, we can add a log line.

@DmitriyMV
Member Author

The thing is, under the current implementation you will never see the fatal controller runtime error, because receiving from the channel happens after running the sequencer, which, as it turns out, never finishes. We can add a log line about the controller runtime failing, but the machine will be stuck anyway.

we just need to wait with some timeout and fail the sequencer. This works well.

We would also need to add a timeout for Shutdown, for the reasons I outlined above.

which I believe we don't need to for real

c.V1Alpha2().Run doesn't do much: it only registers controllers and runs them. But if we choose to ignore this, we should at least document that if controller registration fails, the machine will be stuck forever (unlike a reboot, which is probably what most users expect).

DmitriyMV added a commit to DmitriyMV/talos that referenced this issue Feb 5, 2024
While we decide what to do with siderolabs#8263 and siderolabs#8256 this quickfix at least allows us to
see what went wrong

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
@DmitriyMV
Member Author

Created #8264 as a quick fix.

@smira
Member

smira commented Feb 5, 2024

c.V1Alpha2().Run doesn't do much: it only registers controllers and runs them. But if we choose to ignore this, we should at least document that if controller registration fails, the machine will be stuck forever (unlike a reboot, which is probably what most users expect).

That's my point: it doesn't happen unless it's a developer error, so this is a non-issue for our users, but rather an improvement for developers. It wouldn't be bad to fix, but I would rather find the least intrusive way to do so, and in the end this code will most probably go away anyway.

@smira smira modified the milestones: v1.7, v1.8 Apr 4, 2024