Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support hardware watchdog timers #8313

Merged
merged 1 commit into from
Mar 25, 2024
Merged

Conversation

dsseng
Copy link
Member

@dsseng dsseng commented Feb 13, 2024

Pull Request

What? (description)

Add support for Linux hardware watchdog API and corresponding configuration values. Watchdog timer is armed and fed by a controller (is it reliable to be sure a ticker will tick fine as long as no hang occurs?).

Watchdog timer can help automatically restoring a bare metal/supported VM node after a hang caused by a software or hardware error.

Why? (reasoning)

Fixes #8284, approved by developers in the comments

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (not applicable due to small footprint and mainly being ioctl-centered)
  • you ran conformance (make conformance - not done due to certainly being unrelated)
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

See make help for a description of the available targets.

}

logger.Info("Opened hardware watchdog", zap.String("filename", dev))
// TODO: support reboot/shutdown watchdogs
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently disable watchdog when quitting. systemd however allows for shutdown, reboot and kexec watchdogs to be set up to ensure power operations don't hang the system in inaccessible state. I could also do so if we clarify requirements for this and how are controllers being shut down during a restart or poweroff to be sure what timeout would be sane

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controllers are terminated right before the reboot/shutdown (from the controller point of view, context will be cancelled).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then I might add a prop for that after the first review. However shutdown/reboot watchdog won't be turned on by default as I'm not sure what are the safe values and they depend on target config.

@dsseng
Copy link
Member Author

dsseng commented Feb 21, 2024

Why are the CI checks failing? Does conform/commit/gpg-identity require me to add a GPG key and re-sign commits using it?

@frezbo
Copy link
Member

frezbo commented Feb 21, 2024

Why are the CI checks failing? Does conform/commit/gpg-identity require me to add a GPG key and re-sign commits using it?

You can ignore the gpg check, it needs to be signed by one of our devs

@dsseng
Copy link
Member Author

dsseng commented Feb 21, 2024

Ok, would be happy to hear and address reviews then

@smira
Copy link
Member

smira commented Feb 21, 2024

@dsseng your PR looks good to me, I would like to refactor some machine configuration stuff and probably see if we can do an integration test. I will find some time to look into that, but it might not be quick.

@dsseng
Copy link
Member Author

dsseng commented Feb 22, 2024

No problem, I'll do needed improvements as soon as you request something. Thanks for reviewing!

@dsseng
Copy link
Member Author

dsseng commented Feb 23, 2024

Note: kernel changes are required to load watchdog drivers, I haven't committed those yet

@dsseng dsseng force-pushed the wdt branch 2 times, most recently from d135c0e to 88ca123 Compare March 7, 2024 19:27
smira pushed a commit to dsseng/talos-pkgs that referenced this pull request Mar 13, 2024
See siderolabs/talos#8313

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
}

// 3 pings per timeout should suffice in any case
ticker = time.NewTicker(time.Duration(timeout/3) * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't stop old ticker here... do we even need a ticker?

we could do something like https://github.com/cosi-project/runtime/blob/main/pkg/controller/runtime/internal/qruntime/internal/timer/resettable.go, but for Tickers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't a ticker stop when GC'ed? Actually we can make use of resettable timer and reset it at each trigger, or develop an in-tree wrapper (or one contributed to the Cosi runtime) for ticker yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ticker.Stop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to have a resettable Ticker here copying the approach from the above (to some extent).

The reason is that we don't want to have a ticker if there's no watchdog.

Or it could be implemented easier (there are plenty of examples in the codebase) by using a channel variable and setting it to nil when ticket is inactive.

@dsseng
Copy link
Member Author

dsseng commented Mar 19, 2024

Will update the kernel once more and rebase/squash stuff

@smira smira mentioned this pull request Mar 21, 2024
@smira smira force-pushed the wdt branch 2 times, most recently from fb19f4a to 60f47ab Compare March 21, 2024 19:08
@smira smira added this to the v1.7 milestone Mar 21, 2024
@smira
Copy link
Member

smira commented Mar 21, 2024

@dsseng refactored controller a bit, other changes are config-related:

  • moved configuration to a separate machine config document (we're slowly moving towards multi-doc)
  • validation for the config, use proper time.Duration type
  • introduced intermediate resource/controller to simplify the Watchdog controller

I will ask you to do integration test as a separate PR.

@smira
Copy link
Member

smira commented Mar 21, 2024

/ok-to-test

@smira
Copy link
Member

smira commented Mar 21, 2024

@dsseng added WatchdogTimerStatus, we can even try to read this:

All watchdog drivers are required return more information about the system,
some do temperature, fan and power level monitoring, some can tell you
the reason for the last reboot of the system.  The GETSUPPORT ioctl is
available to ask what the device can do:

	struct watchdog_info ident;
	ioctl(fd, WDIOC_GETSUPPORT, &ident);

the fields returned in the ident struct are:

        identity		a string identifying the watchdog driver
	firmware_version	the firmware version of the card if available
	options			a flags describing what the device supports

"""

[notes.watchdog]
title = "Hardware Watchdog Timers"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess we also mention this is only available for amd64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is available on aarch64. We actually had all drivers built-in previously, so I didn't have to modify aarch64 config except for sysfs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, were they built in? I guess we need to keep them also as modules if not, just to be in parity with amd64 🤔 . I'm not so sure, @smira wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/siderolabs/pkgs/blob/6364d991db023e5f614cbbc450b4afd471e16203/kernel/build/config-arm64#L4441

Maybe some of those mustn't be modules due to the way they need to be probed, so that should be checked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it supports PCI id database https://cateee.net/lkddb/web-lkddb/WATCHDOG.html, but we might hit other issue like we try to read before udev picks them up, I was more talking about these:

❯ grep IPMI_WA
kernel/build/config-arm64
3586:# CONFIG_IPMI_WATCHDOG is not set

kernel/build/config-amd64
3157:CONFIG_IPMI_WATCHDOG=m

❯ grep MFD_CORE
kernel/build/config-arm64
4514:CONFIG_MFD_CORE=y

kernel/build/config-amd64
3630:CONFIG_MFD_CORE=m

❯ grep SP5100
kernel/build/config-amd64
3581:CONFIG_SP5100_TCO=m

which are not consistent across amd64 and arm64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can fix it up, but it shouldn't be related to to this PR specifically.

I think arm64 in general needs a module cleanup the way I did for amd64 - try to make everything which can be a module to be a module.

arm64 kernel is not compressed, so moving to initramfs.xz is a good thing imho

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh, arm64 needs some cleanup

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SP5100 is AMD WDT, MFD is for Intel one, but IPMI might work for ARM. Regarding PCI ID DB: aarch64 WDTs are typically part of the SoC

@@ -93,6 +93,11 @@ func (s *EventSinkV1Alpha1) KmsgLogURLs() []*url.URL {
return nil
}

// WatchdogTimer implements config.RuntimeConfig interface.
func (s *EventSinkV1Alpha1) WatchdogTimer() config.WatchdogTimerConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why does eventsink need watchdogtimer method implemented?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's the way I implemented Runtime() config interface - all documents implement all methods, but return nil if it's not their thing. it might be good or bad idea, idk, but it is this way right now ;)

@dsseng
Copy link
Member Author

dsseng commented Mar 22, 2024

@dsseng added WatchdogTimerStatus, we can even try to read this:

All watchdog drivers are required return more information about the system,
some do temperature, fan and power level monitoring, some can tell you
the reason for the last reboot of the system.  The GETSUPPORT ioctl is
available to ask what the device can do:

	struct watchdog_info ident;
	ioctl(fd, WDIOC_GETSUPPORT, &ident);

the fields returned in the ident struct are:

        identity		a string identifying the watchdog driver
	firmware_version	the firmware version of the card if available
	options			a flags describing what the device supports

We should probably report that using sysfs to also know timeout and timeleft

@smira
Copy link
Member

smira commented Mar 22, 2024

We should probably report that using sysfs to also know timeout and timeleft

Timeout we already report (as we set it), while timeleft makes sense only in the moment, and the controller wakes up on feedInterval, so basically from controller point of view it's always timeleft == timeout, as we ping the watchdog.

Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
Signed-off-by: Dmitry Sharshakov <d3dx12.xx@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@dsseng
Copy link
Member Author

dsseng commented Mar 25, 2024

/m

@talos-bot talos-bot merged commit 9456489 into siderolabs:main Mar 25, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watchdog timer (WDT) support
4 participants