Operation Monitoring #188
This is great, I'm sure this functionality will see a lot of use. A couple of small suggestions inline, mostly to suggest documentation. The screenshot of the panel might even be good to include.
What'd you mean by "... instead of having to monitor docker updates with telegraf." Are you referring to the logs?
ocs/ocs_agent.py
SESSION_STATUS_CODES = [None, 'starting', 'running', 'stopping', 'done']


class OpCode(Enum):
This class will be autodoc'd and show up in the API section of the docs. It'd be good to link to that from the Registry page, as well as to update the registry documentation to describe the new op state tracking functionality. That'll make it easy for people to create a value mapping in Grafana.
Cool!
This approval is contingent on Brian being happy with the documentation.
Alright, I added a docs section and implemented all suggestions. I also added field name validation to the registry fields before adding them to the data block because I realized long or improper agent instance-ids would result in improper field names in the Let me know if there's anything else I should add before merging!
That's a good point about the field naming potentially becoming invalid. I commented on this inline below. I think moving this check closer to the heartbeat could provide the Agent user more feedback.
Fixed Brian's last comments. Added
Changes look good, thanks for those! Maybe longer term that method marked as hidden on the Provider should be moved somewhere more general or made not hidden, but this works for now.
This PR adds the agent_operations feed for the registry agent, which allows monitoring of individual agent operations and makes it possible to set alerts for failed processes in Grafana.
Description
The agent heartbeat now publishes a dict of operation codes for the agent which can be used to determine the state of each operation that the agent can run. The operation codes are the following:
NONE: If an operation has never been run
STARTING: If an operation currently has the 'starting' status
RUNNING: If the operation has the 'running' status
STOPPING: If the operation has the 'stopping' status
SUCCEEDED: If the operation has finished with success=True
FAILED: If the operation has finished with success=False
EXPIRED: If the agent who owns the operation is no longer running (set by the registry).
Motivation and Context
This lets you make cool Grafana panels that track the status of all operations:

It also makes it possible to set up Grafana alerts for individual operations instead of having to monitor docker updates with telegraf.
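As a rough illustration of how the operation codes in the description relate to a session's status, here is a minimal sketch. The integer values, the op_code_for helper, and the grafana_value_mapping helper are assumptions made for this example and are not taken from the PR; the real values live in the OpCode class in ocs/ocs_agent.py. EXPIRED is not derived here because it is set by the registry.

```python
from enum import Enum
from typing import Dict, Optional


class OpCode(Enum):
    # Integer values are assumed for illustration only; consult the
    # autodoc'd OpCode class for the values the agent actually publishes.
    NONE = 1
    STARTING = 2
    RUNNING = 3
    STOPPING = 4
    SUCCEEDED = 5
    FAILED = 6
    EXPIRED = 7  # set by the registry, never derived from a session


def op_code_for(status: Optional[str], success: Optional[bool]) -> OpCode:
    """Hypothetical helper: map a session status to an operation code."""
    if status is None:
        # The operation has never been run.
        return OpCode.NONE
    if status == 'done':
        # A finished operation is split on its success flag.
        return OpCode.SUCCEEDED if success else OpCode.FAILED
    # 'starting', 'running', and 'stopping' share names with their codes.
    return OpCode[status.upper()]


def grafana_value_mapping() -> Dict[int, str]:
    """Hypothetical helper: numeric code -> label, for a value mapping."""
    return {code.value: code.name for code in OpCode}
```

With a mapping like this, a panel can show readable labels for the published codes, and an alert can fire whenever an operation's code equals the FAILED value.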
How Has This Been Tested?
This has been tested by running agents locally.