Skip to content

Conversation

benjirewis
Copy link
Member

@benjirewis benjirewis commented Sep 26, 2025

RSDK-11248:
Adds similar restart checking logic to the logic disabled in viamrobotics/rdk#5324; the logic now restarts viam-agent and all its other subsystems along with viam-server. Refactors hits of restart_status to be more generic (can query multiple restart "properties"). Adds a Property method to the Subsystem interface.

RSDK-11266:
Logs the reason that viam-agent is exiting (either a received signal or a requested restart from app).

RSDK-11900:
Stops logging a stack trace from viam-server on part restart. A stack trace will only be logged from viam-server when shutdown is hanging.

RSDK-11901:
Stops logging any ERROR-level logs for restarts (so no error logs appear in machine settings control card).

Manual testing

Note that I'm leveraging a does_not_handle_needs_restart JSON property exposed on the restart_status endpoint introduced in the associated RDK PR along with a VIAM_AGENT_HANDLES_NEEDS_RESTART environment variable set by agent. The expected behavior right now is:

Old viam-agent New viam-agent
Old viam-server viam-server should handle restart checking viam-server should handle restart checking
New viam-server viam-server should handle restart checking viam-agent should handle restart checking

I've tested the above matrix on both Linux and Windows, and the behavior is as-expected. I've also ensured that I haven't broken the existing "restart allowed" logic (and gotten it to work on Windows where it wasn't before).


connMu sync.RWMutex
conn rpc.ClientConn
client pb.AgentDeviceServiceClient
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this was a pointless field to store on the manager; creating a gRPC client on top of conn above through which to call DeviceAgentConfig requires no actual, blocking work. It was confusing to store this variable on the struct and check its existence to see if we had dialed already.


if needRestart || needRestartConfigChange || m.viamServerNeedsRestart || m.viamAgentNeedsRestart {
if m.viamServer.(viamserver.RestartCheck).SafeToRestart(ctx) {
if m.viamServer.Property(ctx, viamserver.RestartPropertyRestartAllowed) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the RestartCheck interface, and instead added a Property method to the Subsystem interface. I think it's a bit easier to read. I also moved logging about the result of querying restart_allowed to this file (line below this).

// The minimal (and default) interval for checking for config updates via DeviceAgentConfig.
minimalDeviceAgentConfigCheckInterval = time.Second * 5
// The minimal (and default) interval for checking whether agent needs to be restarted.
minimalNeedsRestartCheckInterval = time.Second * 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have two background check goroutines: one checks for a new config every 5s (existing), and another checks for a restart every 1s (new). Each check can receive a new, different interval from the app call, so they need to be running at different cadences in different goroutines. You'll also notice that I renamed some interval variable names in this file to be more specific as to which "interval" they were associated with.

slowWatcher, slowWatcherCancel := goutils.SlowGoroutineWatcher(
stopAllTimeout, "Agent is taking a while to shut down,", m.logger)
stopAllTimeout,
fmt.Sprintf("Viam agent subsystems and/or background workers failed to shut down within %v", stopAllTimeout),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[drive-by] This log was getting output after agent shutdown timed out, so the message was slightly inaccurate.

}
// As with the device agent config check interval, randomly fuzz the interval by
// +/- 5%.
timer.Reset(utils.FuzzTime(needsRestartCheckInterval, 0.05))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were doing this for the config check interval, too... I'm not sure why; anyone know?

HealthCheck(ctx context.Context) error

// Property gets an arbitrary property about the running subystem.
Property(ctx context.Context, property string) bool
Copy link
Member Author

@benjirewis benjirewis Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the new method I added. I realize the interface here is meant to reveal a limited API from manager -> subsystems, but I found that the manager truly needed to know a couple "properties" of the running viamserver subsystem (whether restart was currently allowed and whether viamserver was already handling restart checking logic), so I thought this was worth adding despite it opening a pretty generic API to subsystems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved most the code that attempted to find whether restart was allowed here, and I extended it to also be able to check whether viamserver handled restart checking logic. Some of the concurrent code could use an extra pair of eyes since it's slightly different than before; my manual testing seems to imply it works fine.

RestartAllowed bool `json:"restart_allowed"`
// DoesNotHandleNeedsRestart represents whether this instance of the viamserver does
// not check for the need to restart against app itself and, thus, needs agent to do so.
// Newer versions of viamserver (>= v0.9x.0) will report true for this value, while
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you're curious, I don't think we can check the semantic version string of viamserver to know this information. That string is not always a semantic version (could be stable, customURL..., etc.), so I felt that the best way to know was to directly query something on the running viam-server.

goodBytes = bytes.Equal(shasum, verData.UnpackedSHA)
} else {
c.logger.Warn(err)
} else if verData.UnpackedPath != "" { // custom file:// URLs with have an empty unpacked path; no need to warn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[drive-by] Logging fix related to comment.

}
// Only log errors from Wait() or the exit code of the process state if subsystem
// exited unexpectly (was not stopped by agent and is therfore still marked as
// shouldRun).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were getting unnecessary error logs here when restarting. A non-zero exit code when we sent SIGQUIT to viam-server is expected.

fallthrough
case syscall.SIGTERM:
globalLogger.Info("exiting")
globalLogger.Infof("Signal %s was received. %s will now exit to be restarted by service manager",
Copy link
Member Author

@benjirewis benjirewis Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change, and the addition and usage of reason for the Exit method in manager.go are for RSDK-11266. Hopefully these messages will be useful for debugging.

@benjirewis benjirewis changed the title RSDK-11248 Add restart checking logic RSDK-11248 RSDK-11266 Add restart checking logic Sep 29, 2025
@benjirewis benjirewis requested review from jmatth and cheukt September 29, 2025 15:59
@benjirewis benjirewis marked this pull request as ready for review September 29, 2025 15:59
// set before starting viam-server to indicate that agent is a new enough version to
// have its own background loop that runs NeedsRestart against app.viam.com to determine
// if the system needs a restart. MUST be kept in line with the equivalent value in the
// rdk repo.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now continue to handle checking NeedsRestart in a background goroutine in the server if VIAM_AGENT_HANDLES_NEEDS_RESTART_CHECKING is unset (newer versions of agent will set it unconditionally when launching viam-server). It was determined offline that we do want to support old-agent/new-server setups for a while, and eventually completely remove needs restart checking functionality from the rdk with RSDK-12057.

Thanks to @jmatth for the env var idea 🎉 .

Copy link
Member Author

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some manual testing to ensure I didn't break "restart allowed" logic (turns out it wasn't even working on main for Windows) and that Windows behaved as expected for the feature (across old/new agent/server versions).

I'm also going to add some unit tests to this PR.

s.checkURL = matches[1]
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "localhost", 1)
s.logger.Infof("viam-server restart allowed check URLs: %s %s", s.checkURL, s.checkURLAlt)
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "127.0.0.1", 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to 127.0.0.1 so checks to checkURLAlt would work on Windows (localhost does not exist on Windows) as well as Linux. Restart allowed checks could already work with the checkURL (this was something like https://vader-main.ydbecmp2sz.local.viam.cloud:8080).

switch property {
case RestartPropertyRestartAllowed:
if !s.running {
// Assume agent can restart viamserver if the subsystem is not running.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to also return true unconditionally on Windows, meaning a maintenance sensor returning false or an active reconfiguration would not stop viam-agent from restarting viam-server for an update on Windows. That's no longer the case, and both of those things will prohibit viam-agent with these changes from restarting viam-server (the WARN log viam-server has NOT allowed a restart; will NOT restart will be output).

@benjirewis benjirewis changed the title RSDK-11248 RSDK-11266 Add restart checking logic RSDK-11248 RSDK-11266 RSDK-11901 Add restart checking logic Oct 2, 2025
@benjirewis benjirewis changed the title RSDK-11248 RSDK-11266 RSDK-11901 Add restart checking logic RSDK-11248 RSDK-11266 RSDK-11900 RSDK-11901 Add restart checking logic Oct 2, 2025
Copy link
Member

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants