RSDK-11248 RSDK-11266 RSDK-11900 RSDK-11901 Add restart checking logic #153
…hen viamserver does not
…edsRestart, fix checkRestartProperty, and drive-by logging fixes
connMu sync.RWMutex
conn rpc.ClientConn
client pb.AgentDeviceServiceClient
IMO this was a pointless field to store on the manager; creating a gRPC client on top of conn above through which to call DeviceAgentConfig requires no actual, blocking work. It was confusing to store this variable on the struct and check its existence to see if we had dialed already.
if needRestart || needRestartConfigChange || m.viamServerNeedsRestart || m.viamAgentNeedsRestart {
if m.viamServer.(viamserver.RestartCheck).SafeToRestart(ctx) {
if m.viamServer.Property(ctx, viamserver.RestartPropertyRestartAllowed) {
I removed the RestartCheck interface and instead added a Property method to the Subsystem interface. I think it's a bit easier to read. I also moved logging about the result of querying restart_allowed to this file (line below this).
// The minimal (and default) interval for checking for config updates via DeviceAgentConfig.
minimalDeviceAgentConfigCheckInterval = time.Second * 5
// The minimal (and default) interval for checking whether agent needs to be restarted.
minimalNeedsRestartCheckInterval = time.Second * 1
We now have two background check goroutines: one checks for a new config every 5s (existing), and another checks for a restart every 1s (new). Each check can receive a new, different interval from the app call, so they need to be running at different cadences in different goroutines. You'll also notice that I renamed some interval variable names in this file to be more specific as to which "interval" they were associated with.
slowWatcher, slowWatcherCancel := goutils.SlowGoroutineWatcher(
stopAllTimeout, "Agent is taking a while to shut down,", m.logger)
stopAllTimeout,
fmt.Sprintf("Viam agent subsystems and/or background workers failed to shut down within %v", stopAllTimeout),
[drive-by] This log was getting output after agent shutdown timed out, so the message was slightly inaccurate.
}
// As with the device agent config check interval, randomly fuzz the interval by
// +/- 5%.
timer.Reset(utils.FuzzTime(needsRestartCheckInterval, 0.05))
We were doing this for the config check interval, too... I'm not sure why; anyone know?
HealthCheck(ctx context.Context) error
// Property gets an arbitrary property about the running subsystem.
Property(ctx context.Context, property string) bool
Here's the new method I added. I realize the interface here is meant to reveal a limited API from manager -> subsystems, but I found that the manager truly needed to know a couple of "properties" of the running viamserver subsystem (whether restart was currently allowed and whether viamserver was already handling restart checking logic), so I thought this was worth adding despite it opening a pretty generic API to subsystems.
I moved most of the code that attempted to find whether restart was allowed here, and I extended it to also be able to check whether viamserver handled restart checking logic. Some of the concurrent code could use an extra pair of eyes since it's slightly different than before; my manual testing seems to imply it works fine.
RestartAllowed bool `json:"restart_allowed"`
// DoesNotHandleNeedsRestart represents whether this instance of the viamserver does
// not check for the need to restart against app itself and, thus, needs agent to do so.
// Newer versions of viamserver (>= v0.9x.0) will report true for this value, while
In case you're curious, I don't think we can check the semantic version string of viamserver to know this information. That string is not always a semantic version (could be stable, customURL..., etc.), so I felt that the best way to know was to directly query something on the running viam-server.
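As a sketch of why querying the running server works across versions: the struct below reuses the two JSON keys shown in the diff, while parseRestartStatus and the sample payloads are hypothetical. In Go, a payload that omits does_not_handle_needs_restart decodes to false, so a server that never heard of the field is treated as one that still does its own checking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restartStatus mirrors the two JSON keys discussed in this PR.
type restartStatus struct {
	RestartAllowed            bool `json:"restart_allowed"`
	DoesNotHandleNeedsRestart bool `json:"does_not_handle_needs_restart"`
}

// parseRestartStatus decodes a restart_status response body.
// Hypothetical helper; the HTTP request itself is omitted.
func parseRestartStatus(body []byte) (restartStatus, error) {
	var rs restartStatus
	err := json.Unmarshal(body, &rs)
	return rs, err
}

func main() {
	// Sample payloads are made up for illustration.
	older, _ := parseRestartStatus([]byte(`{"restart_allowed": true}`))
	newer, _ := parseRestartStatus([]byte(`{"restart_allowed": false, "does_not_handle_needs_restart": true}`))
	fmt.Printf("older: %+v\nnewer: %+v\n", older, newer)
}
```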
goodBytes = bytes.Equal(shasum, verData.UnpackedSHA)
} else {
c.logger.Warn(err)
} else if verData.UnpackedPath != "" { // custom file:// URLs will have an empty unpacked path; no need to warn
[drive-by] Logging fix related to comment.
}
// Only log errors from Wait() or the exit code of the process state if subsystem
// exited unexpectedly (was not stopped by agent and is therefore still marked as
// shouldRun).
We were getting unnecessary error logs here when restarting. A non-zero exit code when we sent SIGQUIT to viam-server is expected.
fallthrough
case syscall.SIGTERM:
globalLogger.Info("exiting")
globalLogger.Infof("Signal %s was received. %s will now exit to be restarted by service manager",
This change, and the addition and usage of reason for the Exit method in manager.go, are for RSDK-11266. Hopefully these messages will be useful for debugging.
// set before starting viam-server to indicate that agent is a new enough version to
// have its own background loop that runs NeedsRestart against app.viam.com to determine
// if the system needs a restart. MUST be kept in line with the equivalent value in the
// rdk repo.
We now continue to handle checking NeedsRestart in a background goroutine in the server if VIAM_AGENT_HANDLES_NEEDS_RESTART_CHECKING is unset (newer versions of agent will set it unconditionally when launching viam-server). It was determined offline that we do want to support old-agent/new-server setups for a while, and eventually completely remove needs restart checking functionality from the rdk with RSDK-12057.
Thanks to @jmatth for the env var idea 🎉.
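The handshake above can be sketched roughly as follows. This is an illustration only: serverChecksNeedsRestart is a hypothetical function, the variable name is taken from this comment, and treating "set to any value, including empty" as set is an assumption rather than confirmed behavior of the rdk.

```go
package main

import (
	"fmt"
	"os"
)

// Name as written in the comment above.
const agentHandlesEnvVar = "VIAM_AGENT_HANDLES_NEEDS_RESTART_CHECKING"

// serverChecksNeedsRestart reports whether the server should keep running its
// own NeedsRestart loop. lookup has the same shape as os.LookupEnv so it can
// be swapped out in tests. Assumption: any set value means agent handles it.
func serverChecksNeedsRestart(lookup func(string) (string, bool)) bool {
	_, set := lookup(agentHandlesEnvVar)
	return !set
}

func main() {
	fmt.Println("server runs its own check:", serverChecksNeedsRestart(os.LookupEnv))
}
```

Passing the lookup function in, instead of calling os.LookupEnv directly, keeps the old-agent and new-agent cases easy to exercise without mutating the process environment.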
I did some manual testing to ensure I didn't break "restart allowed" logic (turns out it wasn't even working on main for Windows) and that Windows behaved as expected for the feature (across old/new agent/server versions).
I'm also going to add some unit tests to this PR.
s.checkURL = matches[1]
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "localhost", 1)
s.logger.Infof("viam-server restart allowed check URLs: %s %s", s.checkURL, s.checkURLAlt)
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "127.0.0.1", 1)
I changed this to 127.0.0.1 so checks to checkURLAlt would work on Windows (localhost does not exist on Windows) as well as Linux. Restart allowed checks could already work with the checkURL (this was something like https://vader-main.ydbecmp2sz.local.viam.cloud:8080).
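The rewrite itself is a single strings.Replace call. Here is a standalone sketch with a hypothetical helper name (loopbackURL) and a made-up input URL:

```go
package main

import (
	"fmt"
	"strings"
)

// loopbackURL rewrites the first occurrence of the wildcard bind address
// 0.0.0.0 to the IPv4 loopback address. Hypothetical helper name.
func loopbackURL(listenURL string) string {
	return strings.Replace(listenURL, "0.0.0.0", "127.0.0.1", 1)
}

func main() {
	// The input URL is an example, not taken from real viam-server output.
	fmt.Println(loopbackURL("http://0.0.0.0:8080"))
}
```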
switch property {
case RestartPropertyRestartAllowed:
if !s.running {
// Assume agent can restart viamserver if the subsystem is not running.
We used to also return true unconditionally on Windows, meaning a maintenance sensor returning false or an active reconfiguration would not stop viam-agent from restarting viam-server for an update on Windows. That's no longer the case, and with these changes both of those things will prohibit viam-agent from restarting viam-server (the WARN log "viam-server has NOT allowed a restart; will NOT restart" will be output).
LGTM pending tests
RSDK-11248:
Adds restart checking logic similar to the logic disabled in viamrobotics/rdk#5324; the logic now restarts viam-agent and all its other subsystems along with viam-server. Refactors hits of restart_status to be more generic (can query multiple restart "properties"). Adds a Property method to the Subsystem interface.
RSDK-11266:
Logs the reason that viam-agent is exiting (either a received signal or a requested restart from app).
RSDK-11900:
Stops logging a stack trace from viam-server on part restart. A stack trace will only be logged from viam-server when shutdown is hanging.
RSDK-11901:
Stops logging any ERROR-level logs for restarts (so no error logs appear in machine settings control card).
Manual testing
Note that I'm leveraging a does_not_handle_needs_restart JSON property exposed on the restart_status endpoint introduced in the associated RDK PR along with a VIAM_AGENT_HANDLES_NEEDS_RESTART environment variable set by agent. The expected behavior right now is:
I've tested the above matrix on both Linux and Windows, and the behavior is as expected. I've also ensured that I haven't broken the existing "restart allowed" logic (and gotten it to work on Windows where it wasn't before).