-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add path configuration for hardware watchdog device #7352
Conversation
could you please document the new option ? |
Hmpf, I am not sure I like this approach. Device indexes are essentially non-deterministic, they depend on the order of driver initialization, and in a world of USB and whatnot that order is very much undefined. There has been a long-term TODO list item to rework the watchdog logic to consult udev for watchdog devices, add support for pinging multiple devices in parallel, all controlled by looking for devices tagged with some special udev tag. Of course this would mean that we wouldn't ping watchdog devices before udev probed the devices, but that's OK I think. Altogether I think doing the udev thing would make this a lot more powerful, and people could choose explicitly which watchdogs they want systemd to ping and which ones not, simply by writing the appropriate udev rule that attaches the tag to right set of devices. By default we'd place the tag on all devices however. Of course, doing such a rework is more work, but if we touch this code, we might really want to do this properly right-away instead of adding new configuration interfaces now that are likely going away again if we do the proper rework. Any chance you could rework this based on udev device subscriptions? |
I don't think so. In a perfect world you could rely on udev, but in perfect world you wouldn't need any watchdog in the first place. This particular device is responsible for last-resort dealing with "any remaining bugs". Consider scenarios:
Watchdog is one of the very first things to be handled for better reliability. Putting a dependency on anything here is bad. As for the changing device numbers - this in turn can be handled by system administrator well enough. Consider:
Anyway, watchdogs do not usually fall into "USB world" class with asynchronous initialization. Moreover, even switching watchdogs after udev settle might be hard/impossible (nowayout? magic close?) so for this case there should be an exception allowing to bypass udev (module might be loaded manually or from modules.d, if not built-in); this device is much closer to the bare metal and deserves to be treated differently (just like boot loaders, no udev available yet). And BTW, is there any way to query systemd about current watchdog status? As wdctl cannot open the device, there should be another interface provided. |
Hmm, OK, I figure early watchdog is relevant for stability testing... Hmm, I figure it would be OK to say that — as soon as the udev-based watchdog pinging is implemented – the set of watchdogs to ping is the combination of whatever is specified on the kernel cmdline and what carries the right udev tag. Hence, I figure a patch your like yours can go in, however, semantically I think we should not base any of this on "device indexes", but at least device names. i.e. make the full path of the device node configurable, not the index only. That way the specific drivers can avoid the "unstable probing order problem", by picking more descriptive device names. i.e. the soft watchdog could use "/dev/softdog" or something like that, and things wouldn't be as undeterministic. Could you please rework the patch accordingly? |
src/shared/watchdog.c
Outdated
if (asprintf(&watchdog_file, "/dev/watchdog%d", wd_number) < 0) | ||
return -errno; | ||
|
||
watchdog_fd = open(watchdog_file, O_WRONLY|O_CLOEXEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, we currently maintain the watchdog_fd as a static variable. I figure the watchdog path should be maintained the same way, and be settable by a new call watchdog_set_path() or so, which is fed from the kernel cmdline and the config file.
also, this needs a rebase, currently does not apply to git |
Thanks for the suggestions, I have rebased and updated my commits. Much cleaner now. |
src/core/main.c
Outdated
@@ -461,6 +462,15 @@ static int parse_proc_cmdline_item(const char *key, const char *value, void *dat | |||
if (arg_default_timeout_start_usec <= 0) | |||
arg_default_timeout_start_usec = USEC_INFINITY; | |||
|
|||
} else if (proc_cmdline_key_streq(key, "systemd.watchdog_dev_path")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no unnecessary abbreviations. I think "systemd.watchdog_device" is a better name for this
src/core/main.c
Outdated
if (proc_cmdline_value_missing(key, value)) | ||
return 0; | ||
|
||
r = parse_path(value, &arg_watchdog_device_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse_path() does not exist?
Please test your stuff before pushing things. This can't possibly have compiled for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake here. I did test this, and fixed this in latest commits. My problem is I write/test with systemd 2.23 so I can test on hardware. I then have to rebase on master, I think I got the wrong branch with this bad code.
src/core/main.c
Outdated
@@ -751,6 +761,7 @@ static int parse_config_file(void) { | |||
{ "Manager", "JoinControllers", config_parse_join_controllers, 0, &arg_join_controllers }, | |||
{ "Manager", "RuntimeWatchdogSec", config_parse_sec, 0, &arg_runtime_watchdog }, | |||
{ "Manager", "ShutdownWatchdogSec", config_parse_sec, 0, &arg_shutdown_watchdog }, | |||
{ "Manager", "WatchdogDevicePath", config_parse_path, 0, &arg_watchdog_device_path }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just "WatchdogDevice" should be fine
src/core/main.c
Outdated
@@ -1519,6 +1530,11 @@ static int become_shutdown( | |||
/* Tell the binary how often to ping, ignore failure */ | |||
if (asprintf(&e, "WATCHDOG_USEC="USEC_FMT, arg_shutdown_watchdog) > 0) | |||
(void) strv_push(&env_block, e); | |||
|
|||
if (arg_watchdog_device_path) { | |||
if (asprintf(&e, "WATCHDOG_DEV_PATH=%s", arg_watchdog_device_path) > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WATCHDOG_DEVICE
src/core/main.c
Outdated
@@ -120,6 +120,7 @@ static usec_t arg_default_start_limit_interval = DEFAULT_START_LIMIT_INTERVAL; | |||
static unsigned arg_default_start_limit_burst = DEFAULT_START_LIMIT_BURST; | |||
static usec_t arg_runtime_watchdog = 0; | |||
static usec_t arg_shutdown_watchdog = 10 * USEC_PER_MINUTE; | |||
static const char *arg_watchdog_device_path = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory allocation is weird here... strings we allocate/free cannot be "const".
Please make this a normal "char*", and make sure we always allocate strings properly when assigning this variables. Also, please free this when shutting down. Not that freeing would strictly be necessary, but it's very helpful for valgrinding stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, latest commits I changed it to char *. I will add a call to free, thanks.
src/shared/watchdog.c
Outdated
@@ -31,6 +31,7 @@ | |||
#include "watchdog.h" | |||
|
|||
static int watchdog_fd = -1; | |||
static const char *watchdog_dev = "/dev/watchdog"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, i really don't like the memory management for this. This should either always point to static memory, or always to dynamic memory. But right now it initially points to static memory, except when the user overrides it, in which case it points to externally owned dynamic memory, which might go away any time.
Please let's clean up memory management here, let's stick one kind of memory: dynamic. Specifically, change "const char*" to just "char*", and initialize this to NULL. When open_watchdog() sees this set to NULL it should use "/dev/watchdog" instead, i.e. the fallbcak should happen at the moment we read the variable, not already at initialization.
Then, make sure that watchdog_set_device() allocates a new string, and make sure there's watchdog_free_device() or so, that frees the string, and that is called at PID 1 exit, so that we are valgrind clean. Yes, this means we'll keep the string in memory twice (once in main.c's arg_watchdog_device, and once in this copy), but that's fine. I very much prefer clean memory owenership rules about minimizing copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will change, thanks.
src/shared/watchdog.c
Outdated
@@ -96,6 +97,10 @@ static int open_watchdog(void) { | |||
return update_timeout(); | |||
} | |||
|
|||
void watchdog_set_path(const char *path) { | |||
watchdog_dev = path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed above, this should do strdup(). And hence also be able to return an error...
And I figure this should be called watchdog_set_device() or so
src/core/main.c
Outdated
@@ -120,6 +120,7 @@ static usec_t arg_default_start_limit_interval = DEFAULT_START_LIMIT_INTERVAL; | |||
static unsigned arg_default_start_limit_burst = DEFAULT_START_LIMIT_BURST; | |||
static usec_t arg_runtime_watchdog = 0; | |||
static usec_t arg_shutdown_watchdog = 10 * USEC_PER_MINUTE; | |||
static char *arg_watchdog_device_path = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call this arg_watchdog_device
so that the internal variable name and the externally visible config/kernel cmdline options match
src/core/main.c
Outdated
|
||
r = parse_path_argument_and_warn(value, false, &arg_watchdog_device_path); | ||
if (r < 0) | ||
log_warning_errno(r, "Failed to parse watchdog device path: %s, ignoring.", value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse_path_argument_and_warn() logs on its own (that's what the _and_warn
suffix is supposed to say), no need to log here a second time, that'd just be confusing for the user
@@ -382,6 +382,15 @@ | |||
</listitem> | |||
</varlistentry> | |||
|
|||
<varlistentry> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by is a kernel thing btw, we don't use that in systemd, please drop
src/shared/watchdog.c
Outdated
@@ -84,7 +85,8 @@ static int open_watchdog(void) { | |||
if (watchdog_fd >= 0) | |||
return 0; | |||
|
|||
watchdog_fd = open("/dev/watchdog", O_WRONLY|O_CLOEXEC); | |||
watchdog_fd = open(watchdog_dev ? watchdog_dev : "/dev/watchdog", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use watchdog_dev ?: "/dev/watchdog"
.
src/shared/watchdog.c
Outdated
@@ -96,6 +98,16 @@ static int open_watchdog(void) { | |||
return update_timeout(); | |||
} | |||
|
|||
int watchdog_set_device(char *path) { | |||
watchdog_dev = strdup(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return free_and_strdup(&watchdog_dev, path);
.
src/shared/watchdog.c
Outdated
} | ||
|
||
void watchdog_free_device(void) { | ||
free(watchdog_dev); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
watchdog_dev = mfree(watchdog_dev)
.
Or even better, just remove this function and use watchdog_set_device(NULL)
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make watchdog_free_device() an inline static wrapper in the header file, that internally just calls watchdog_set_device(NULL)
src/core/main.c
Outdated
|
||
if (arg_watchdog_device) { | ||
if (asprintf(&e, "WATCHDOG_DEVICE=%s", arg_watchdog_device) > 0) | ||
(void) strv_push(&env_block, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was already buggy, so not really your fault, but please fix it if you're touching it. This should be
if (asprintf(&e, ...) > 0)
(void) strv_consume(&env_block, e);
if (arg_watchdog_device &&
asprintf(&e, ...) > 0)
(void) strv_consume(&env_block, e);
src/core/main.c
Outdated
@@ -1944,6 +1958,11 @@ int main(int argc, char *argv[]) { | |||
test_usr(); | |||
} | |||
|
|||
if (arg_system && arg_watchdog_device) { | |||
if (watchdog_set_device(arg_watchdog_device) < 0) | |||
log_error("Failed to set watchdog device %s.", arg_watchdog_device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"set watchdog device to %s"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use:
if (…) {
r = watchdog_set_device(…);
if (r < 0)
log_warning_errno(r, "Failed to set watchdog device to %s, ignoring: %m", …);
}
src/core/shutdown.c
Outdated
watchdog_device = getenv("WATCHDOG_DEVICE"); | ||
if (watchdog_device) { | ||
if (watchdog_set_device(watchdog_device) < 0) | ||
log_error("Failed to set watchdog device %s.", watchdog_device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to"?
man/kernel-command-line.xml
Outdated
<term><varname>systemd.watchdog_device=</varname></term> | ||
|
||
<listitem> | ||
<para>Overwrites the watchdog device path <varname>WatchdogDevice=</varname> at boot. For details, see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd drop "at boot" here. We have various options which apply only at boot, and this one applies even after boot, and in particular during shutdown.
man/systemd-system.conf.xml
Outdated
@@ -161,8 +161,9 @@ | |||
<literal>d</literal>, <literal>w</literal>). If | |||
<varname>RuntimeWatchdogSec=</varname> is set to a non-zero | |||
value, the watchdog hardware | |||
(<filename>/dev/watchdog</filename>) will be programmed to | |||
automatically reboot the system if it is not contacted within | |||
(<filename>/dev/watchdog</filename> or the path to which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"<filename>/dev/watchdog</filename> or the path specified with <varname>WatchdogDevice=</varname> or the kernel option <varname>systemd.watchdog-device=</varname>"
src/shared/watchdog.c
Outdated
@@ -31,6 +31,7 @@ | |||
#include "watchdog.h" | |||
|
|||
static int watchdog_fd = -1; | |||
static char *watchdog_dev = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd not abbreviate this either, just call it watchdog_device
, so that the same name is used everywhere: externally, internally, in functions, in global variables, and so on.
src/shared/watchdog.c
Outdated
} | ||
|
||
void watchdog_free_device(void) { | ||
free(watchdog_dev); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make watchdog_free_device() an inline static wrapper in the header file, that internally just calls watchdog_set_device(NULL)
src/core/main.c
Outdated
@@ -1944,6 +1958,11 @@ int main(int argc, char *argv[]) { | |||
test_usr(); | |||
} | |||
|
|||
if (arg_system && arg_watchdog_device) { | |||
if (watchdog_set_device(arg_watchdog_device) < 0) | |||
log_error("Failed to set watchdog device %s.", arg_watchdog_device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use:
if (…) {
r = watchdog_set_device(…);
if (r < 0)
log_warning_errno(r, "Failed to set watchdog device to %s, ignoring: %m", …);
}
Any more updates requested on this one? |
ah, we don't get notified in some cases if you push a new version. Please always write a comment as well, otherwise we might not notice, like in this case. |
patch looks excellent now, but needs to be rebased, main.c has been rearranged quite a bit... |
Currently systemd hardcodes the use of /dev/watchdog. This is a legacy chardev that points to watchdog0 in the system. Modify the watchdog API to allow a different device path to be passed and stored. Opening the watchdog defaults to /dev/watchdog, maintaining existing behavior.
This option allows a device path to be specified for the systemd watchdog (both runtime and shutdown). If a system requires a watchdog other than /dev/watchdog (pointing to /dev/watchdog0) to be used to reboot the system, this setting should be changed to the relevant watchdog device path (e.g. /dev/watchdog1).
Document the command line parameter and the system configuration file setting.
Great, thanks, I just rebased. |
lgtm |
upstream: systemd/systemd#7352 Change-Id: I62df65eeec786890c0a7b923fd0455f7869f38b9 Signed-off-by: Edward A. James <eajames@us.ibm.com>
upstream: systemd/systemd#7352 Change-Id: I62df65eeec786890c0a7b923fd0455f7869f38b9 Signed-off-by: Edward A. James <eajames@us.ibm.com>
The hardware watchdog is currently hard-coded to /dev/watchdog0. This can create problems for systems with multiple watchdogs tied to different system behaviours.
This patch set modifies the hardware watchdog code to allow an index to be specified when opening, pinging, and setting the timeout of a hardware watchdog device. This index for the device number is configurable with the "WatchdogDeviceNum" systemd configuration option. The default device number is 0, so there should be no change of behaviour unless that option is specified. This takes effect for the runtime and shutdown watchdogs.
I tested both runtime and shutdown watchdogs on an AST2500 (small ARM core with three watchdogs, used as a baseboard management controller).