Skip to content

Conversation

jlashner
Copy link
Contributor

This PR adds the agent_operations feed for the registry agent which allows for the monitoring of individual agent operations, and makes it possible to set alerts for failed processes in grafana.

Description

The agent heartbeat now publishes a dict of operation codes for the agent which can be used to determine the state of each operation that the agent can run. The operation codes are the following:

  • NONE: If an operation has never been run
  • STARTING: If an operation currently has the 'starting' status
  • RUNNING: If the operation has the 'running' status
  • STOPPING: If the operation has the 'stopping' status
  • SUCCEEDED: If the operation has finished with success=True
  • FAILED: If the operation has finished with success=False
  • EXPIRED: If the agent who owns the operation is no longer running (set by the registry).

Motivation and Context

This lets you make cool grafana panels that track the status of all operations:
Screen Shot 2020-11-09 at 8 39 32 PM

It also makes it possible to setup grafana alerts for individual operations instead of having to monitor docker updates with telegraf.

How Has This Been Tested?

This has been tested by running agents locally.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • Unless I am preparing a release, I have opened this PR onto the develop branch.

@jlashner jlashner requested review from mhasself and BrianJKoopman and removed request for mhasself November 10, 2020 04:51
Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, I'm sure this functionality will see a lot of use. A couple of small suggestions inline, mostly to suggest documentation. The screenshot of the panel might even be good to include.

What'd you mean by "... instead of having to monitor docker updates with telegraf." Are you referring to the logs?

ocs/ocs_agent.py Outdated
SESSION_STATUS_CODES = [None, 'starting', 'running', 'stopping', 'done']


class OpCode(Enum):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class will be autodoc'd and show up in the API section of the docs. It'd be good to link to that from the Registry page, as well as to update the registry documentation to describe the new op state tracking functionality. That'll make it easy for people to create a value mapping in Grafana.

Copy link
Member

@mhasself mhasself left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

This approval is contingent on Brian being happy with the documentation.

@jlashner
Copy link
Contributor Author

Alright, I added a docs section and implemented all suggestions. I also added field name validation to the registry fields before adding them to the data block because I realized long or improper agent instance-id's would result in improper field names in the agent_operations feed.

Let me know if there's anything else I should add before merging!

Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point about the field naming potentially becoming invalid. I commented on this inline below. I think moving this check closer to the heartbeat could provide the Agent user more feedback.

@jlashner
Copy link
Contributor Author

Fixed Brian's last comments. Added Provider._enforce_feed_name_rules as a final step to make sure fields are monitored instead of dropped.

@BrianJKoopman BrianJKoopman self-requested a review November 19, 2020 19:46
Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, thanks for those! Maybe longer term that method marked as hidden on the Provider should be moved somewhere more general or made not hidden, but this works for now.

@jlashner jlashner merged commit 700b889 into develop Nov 19, 2020
@BrianJKoopman BrianJKoopman mentioned this pull request May 12, 2021
@BrianJKoopman BrianJKoopman deleted the registry_ops branch October 15, 2021 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants