Processes should be able to self-report as "degraded" #357

Closed
mhasself opened this issue Oct 5, 2023 · 2 comments

mhasself commented Oct 5, 2023

Currently we monitor agent instance health through each process's "OpCode", which mostly boils down to whether the process is running or not. (See enum.) This value is included in the agent heartbeat, and the registry monitors it and streams it to a feed.

In the model where processes are long-lived and deal with hardware problems by simply attempting reconnection, knowing whether the process is running is not sufficiently informative. Such a process could record its status in session.data somehow, but for generic propagation of that information we should standardize how it is recorded and how it is read back out.

Continuing from the discussion on dev call yesterday, a proposal is:

  • Add "DEGRADED" to the OpCode enum.
  • Standardize on session.data["degraded_at"] = unix_timestamp to mark running sessions as degraded.
  • In OpSession, add function "set_degraded(degraded [bool])" to mark / clear the degraded state (which just updates session.data).
  • In the OpSession.op_code property, if status == "running" and data["degraded_at"] > 0, return the value "DEGRADED".
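
A minimal sketch of how the four points above could fit together. The `OpCode` members and the `SketchSession` class are simplified, hypothetical stand-ins and not the real OpCode enum or OpSession class. The sketch assumes the `degraded_at` key from the second bullet, and it keeps the first timestamp when `set_degraded(True)` is called repeatedly, which is a design choice the proposal does not specify.

```python
import time
from enum import Enum


class OpCode(Enum):
    # Stand-in enum: member names and values are illustrative only.
    STARTING = 1
    RUNNING = 2
    DEGRADED = 3  # the proposed addition
    SUCCEEDED = 4


class SketchSession:
    """Hypothetical stand-in for OpSession, reduced to status and data."""

    def __init__(self):
        self.status = "starting"
        self.data = {}

    def set_degraded(self, degraded):
        """Mark (True) or clear (False) the degraded state via session.data."""
        if degraded:
            # Assumption: keep the time of the first degradation report.
            self.data.setdefault("degraded_at", time.time())
        else:
            self.data.pop("degraded_at", None)

    @property
    def op_code(self):
        # A running session with a degraded_at timestamp reports DEGRADED.
        if self.status == "running" and self.data.get("degraded_at", 0) > 0:
            return OpCode.DEGRADED
        return {
            "starting": OpCode.STARTING,
            "running": OpCode.RUNNING,
            "done": OpCode.SUCCEEDED,
        }[self.status]
```

A process that retries a hardware connection would call `session.set_degraded(True)` when the connection drops and `session.set_degraded(False)` once it reconnects. Anything reading `op_code` would see DEGRADED in between.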

Individual agents and processes will need to implement the use of degraded themselves, where it applies to how they deal with errors. Alarms configured for such agents will need to be updated to treat the "degraded" state as being as bad as "not running".

mhasself commented Oct 5, 2023

While we're in there, it might be nice to have processes automagically transition out of the "starting" state when the process code is run, rather than relying on agent code to call set_status('running') manually.
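
One way this could look, as a rough sketch only: a launcher flips the status just before invoking the process function. `MiniSession`, `launch_process`, and `acquire` are hypothetical names for illustration, not existing framework code.

```python
class MiniSession:
    """Hypothetical stand-in for a session object with a settable status."""

    def __init__(self):
        self.status = "starting"

    def set_status(self, status):
        self.status = status


def launch_process(func, session, params):
    """Hypothetical launcher: leave 'starting' before the process code runs."""
    if session.status == "starting":
        session.set_status("running")
    return func(session, params)


def acquire(session, params):
    # The process body no longer needs to call set_status('running') itself.
    return True, f"status on entry: {session.status}"
```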

@mhasself

Addressed in #371.
