
Generic Health Checks #1

Open · rhuss opened this issue Jan 17, 2015 · 12 comments


rhuss commented Jan 17, 2015

Health Checks should be generic insofar as they can be declared externally in some form. That is, the following open points should be addressed:

  • What should the health check specification look like? Should it be done via JSON, or should a more expressive DSL, e.g. based on Groovy, be used?
  • How are the health checks stored on the agent side?
    • Looking them up in the filesystem (from a configurable path with a sane default like ~/.jolokia_healthchecks)
    • Baking them into the agent jar
    • Uploading them via an MBean operation (and then storing them in the filesystem as well)
  • What kind of metadata should be provided so that consoles like hawt.io can dynamically create their health check views?

The design will be summarized briefly in this wiki page.

rhuss commented Jan 17, 2015

Some links for discussions:


markus-meisterernst commented Jan 19, 2015

Hi,
I used Jolokia in a JBoss-based JEE project with various sub-applications. We used it there as
a replacement for the jmx-console, which has been obsolete since JBoss 7 (executing lifecycle operations and other admin stuff), and as a means to gather actual monitoring metrics.
The apps run as multiple instances on various hosts.
The thing about the health check is that it only represents the state of the current JVM instance at sample time.
If a single instance has, let's say, a bad peak, that basically means nothing if the others aren't equally affected at the same time.
So to that point one doesn't necessarily want to raise a warning or an alarm.
And that brings up the issue, from my perspective, that you actually want to a) know the status of all instances at regular intervals, b) build heuristics (some graphs to learn from), c) judge on mean values for each JVM instance or the whole group, and d) do so over a history (a moving timeframe for building mean values, to prevent oscillating alarms caused by monitoring deficiencies).
Apart from that, there is still the problem of actually invoking the heartbeat mechanism.
From the feedback I have received from various IT managers, I would summarize that they like the idea of a passive monitoring agent that is triggered from outside.
So ... I would prefer a Groovy DSL that is executed outside of the monitored JVM and that allows me to define what to monitor in, let's call it, a BulkSampleDefinition (= which MBean values to obtain and which operations to invoke), which is cached by the Jolokia agent.
After I have provided and sanity-checked the BulkSampleDefinition on the agent (which means that at sample time the query constructs are evaluated against the target environment), the agent would only receive a GET operation on the cached SampleDefinition, which actually performs the sample.
The DSL and scripts (possibly written in any language the JVM supports), as part of the Jolokia client plugin system, would then assist me in writing those health checks with a richer context (e.g. a history of the last n calls on the same sample) and a means to actually store the collected metrics (also pluggable).

My 5 cents ... :-)

Markus

P.s.: A BulkSampleDefinition would be made up of individual SampleDefinitions, some of them reusable (MW and OS checks) and some application-specific (some namespace support might be good then ...). A SampleJobDefinition would therefore define what to sample, for whom, and on which hosts.
A SampleJob would be the currently executing instance of that definition ...
A health check would then be some kind of post-processing of the SampleJob (if you don't need the history; otherwise there might be some independent means to judge the health that builds on a datastore providing historic data).


jstrachan commented Jan 20, 2015

@markus-meisterernst interesting stuff.

Monitoring and management of software is a large and complex space with lots of use cases and possible software components, so I don't think the health checks have to perform all possible checks and alerts. TBH the main use case I have is to check that things are working at all (not so much that they are going a bit too slow relative to other machines or to expected performance over statistical measurements), e.g. did the container start up, did the camel route start, or did the JMS consumer actually get some messages, etc. If not, kill it; something random and bad happened (could be a temporary network glitch with stale DNS or something odd, could be a software bug - whatever - restart!).

Also, I think the health check agent should ideally be pretty small, with minimal overhead.

Having said all that, I guess some other global-view system could figure out average usage windows and then update the health checks in individual agents (e.g. periodically updating them or something). E.g. if there were a JMX operation we could use to add/update checks at runtime, then some global monitoring system could decide what good ranges for certain metrics were and update them via JMX/REST - while keeping the health check agent code nice and simple.

BTW my personal biggest itch to scratch is to get builds of containers to generate meaningful health checks that verify that things actually start up correctly, e.g. that N camel routes should start, that messages should be consumed from some of these specific queues, etc. Without health checks, just looking at metric sizes and changes over time, you kinda miss the fact that a camel route couldn't connect to a database and so never even started, or that an ActiveMQ application has never properly connected to the message broker yet, while everything else looks fine. It's kinda easy to set alerts if a queue gets too big or something, but it's kinda hard to add alerts to metrics which are not there ;)


markus-meisterernst commented Jan 20, 2015

@jstrachan ok, I finally got the idea of health checks and some of your itches. Thanks for sharing.

Today I had a look at the health check code (@rhuss there is actually a bug in the GREATER_THAN_EQUAL definition of the Comparison enum).
The code looks clean and efficient, and it actually resembles your proposal, as far as I can judge. I was surprised to see that the result of a health check is actually NOT JSON, for some reason.
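
For illustration only (a hypothetical sketch of the kind of bug meant here; the actual Comparison enum in the proof-of-concept may look different), a GREATER_THAN_EQUAL constant that accidentally uses a strict comparison would look like this:

// Hypothetical sketch, not the actual Jolokia proof-of-concept code.
public enum Comparison {
    GREATER_THAN {
        public boolean matches(double value, double threshold) {
            return value > threshold;
        }
    },
    GREATER_THAN_EQUAL {
        public boolean matches(double value, double threshold) {
            // Bug of the kind mentioned above: '>' is used where '>=' is required,
            // so equal values incorrectly fail the check.
            return value > threshold; // should be: value >= threshold
        }
    };

    public abstract boolean matches(double value, double threshold);
}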

@rhuss mentioned a DSL and the capability to somehow pick up external definitions.
Leaving the DSL aside, the current mechanism involves extra effort for packaging the app-specific plugins.
In some environments I don't see the benefit compared to app-specific MBeans (e.g. exposed super easily with Spring) that can be used in a similar fashion, even though it is always good to adhere to some protocol (here the MBeanPlugin).

Personally I would prefer the suggestion made by @nevenr, even though JSON should not replace a DSL. But since Jolokia already uses JSON, we are accustomed to it, and one quality criterion I picked up is to avoid bloat, so I think JSON could actually be used to serialize these external health check definitions.

I found out, James, that you are actually a JEXL committer. As JEXL 2 afaik only depends on bsf-api and commons-logging, which sum up to ~300 KB, it looks to me like a sound means for describing and executing assertions, at least compared to a full-blown scripting language in terms of the number of dependencies and size.

If you give each MBean you query/invoke an alias, then you can put the results on the JEXL context using the alias as a variable that can be referred to in the JEXL script.
I guess I don't have to explain the details to you ... :-)
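
As a minimal sketch of that idea using the Apache Commons JEXL 2 API (the "heap" alias, its values and the 0.85 threshold are made-up illustrations, not anything defined by Jolokia):

import java.util.HashMap;
import java.util.Map;
import org.apache.commons.jexl2.Expression;
import org.apache.commons.jexl2.JexlContext;
import org.apache.commons.jexl2.JexlEngine;
import org.apache.commons.jexl2.MapContext;

public class JexlAliasSketch {
    public static void main(String[] args) {
        // Pretend these values were read from java.lang:type=Memory / HeapMemoryUsage.
        Map<String, Object> heap = new HashMap<String, Object>();
        heap.put("used", 900.0 * 1024 * 1024);
        heap.put("max", 1024.0 * 1024 * 1024);

        JexlEngine jexl = new JexlEngine();
        Expression check = jexl.createExpression("heap.used / heap.max > 0.85");

        // The alias "heap" becomes a variable on the JEXL context.
        JexlContext ctx = new MapContext();
        ctx.set("heap", heap);

        Boolean failed = (Boolean) check.evaluate(ctx);
        System.out.println("heapMemoryUsage check failed: " + failed);
    }
}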

So, I like the idea of combining JSON and JEXL. Building on @nevenr's example, the JSON could look like this:

{"dataCollection": {
    "collectionId": "com.mycompany.CheckJvmAndEnv",
    "meta": {
        "version": "1.0.0",
        "description": "a mixture of JVM and OS checks. Use at your own risk ...",
        "author": "It could be you",
        "supportLink": "http://my.support.wiki/kb/healthcheck/com.mycompany.CheckJvmAndEnv"        
    },
    "vars": {
        "checkSwapSpaceThreshold": 0.5,
        "checkSystemLoadThreshold": 1.0
    }, 
    "collectionItems": [
        {
            "var": "heap",
            "cmd": "read",
            "mbean": "java.lang:type=Memory",
            "attribute": "HeapMemoryUsage"
        },
        {
            "var": "os",
            "cmd": "read",
            "mbean": "java.lang:type=OperatingSystem",
            "attributes": [
                "FreeSwapSpaceSize",
                "TotalSwapSpaceSize",
                "SystemCpuLoad",
                "AvailableProcessors"
            ]
        },
        {
            "var": "someOperation",
            "cmd": "exec",
            "mbean": "com.mycompany:type=Something",
            "operation": "someOperation"
        }
    ],
    "assertions": [
        {
            "id": "checkSwapSpace",
            "check": "1 - (os.FreeSwapSpaceSize / os.TotalSwapSpaceSize) > checkSwapSpaceThreshold ?: 0.5",
            "desc": "collectionId + ' assertion ' + currentAssertion.id + ' failed as check: ' + currentAssertion.check + ' resolves to: ' + currentAssertion.result + ' (meaning that more than half of the swap space has been consumed already). See ' + meta.supportLink + '#checkSwapSpace for further infos or contact: ' + meta.author"
        },
        {
            "id": "checkSystemLoad",
            "check": "os.SystemCpuLoad / os.AvailableProcessors > checkSystemLoadThreshold ?: 1.0",
            "desc": "collectionId + ' assertion ' + currentAssertion.id + ' failed as check: ' + currentAssertion.check + ' resolves to: ' + currentAssertion.result + ' (meaning that the load on each processor is higher than 1.5). See ' + meta.supportLink + '#checkSystemLoad for further infos or contact: ' + meta.author"
        },
        {
            "id": "heapMemoryUsage",
            "check": "heap.used / heap.max > heapMemoryUsageThreshold ?: 0.85",
            "desc": "collectionId + ' assertion ' + currentAssertion.id + ' failed as check: ' + currentAssertion.check + ' resolves to: ' + currentAssertion.result + ' (meaning that more than 85% of the heap is used). See ' + meta.supportLink + '#heapMemoryUsage for further infos or contact: ' + meta.author"
        }
    ]
}}

Again, the JSON is only a serialization means to provide dataCollection definitions (~ SampleDefinitions from my last post) to the agent.
I would go with a new jolokia-healthcheck REST interface that sits next to the classic Jolokia API (either as a separate WAR or within the existing agent delivery).
The new REST interface would depend on a jolokia-healthcheck-core lib that exposes a HealthChecker MBean.
The job of the new jolokia-healthcheck REST interface would be to administer the above-mentioned dataCollections in the "Script" repository maintained by the HealthChecker.
So, in the end, the classic Jolokia API is actually used to query the HealthChecker MBean like this:

/jolokia/exec/<mbeanName for healthchecker>/exec/dataCollection1/dataCollection2/.../dataCollection3

It would be just another MBean operation, and as such the coupling between the existing API and the health checker would be reduced, not to speak of the flexibility gained through the external definitions.

The setup could be summarized as follows:

[Diagram: thoughts_on_jolokia_healthchecks]

cheers,

Markus

P.s.: Maybe it is also viable to have both: the plugin concept for somewhat closed environments and the more flexible REST-based API for the other environments.


rhuss commented Jan 22, 2015

@markus-meisterernst, thanks a lot for sharing your ideas.

The current code is really only a proof of concept and is meant to be thrown away ;-). And indeed, a final solution should emit JSON as a response (for both the good and the bad case).

I'm also fine with a JSON-based DSL for describing the health checks so that we keep the size of the agents small.

A plugin also gets access to the agent configuration (and of course also has access to system properties and environment variables), so there should be an easy way to define where the HC declarations are stored.

For uploading the HC declaration I don't think we need a new REST API, since we could easily provide an HcUploadMBean which takes the check definitions as an argument to a custom MBean operation. So everything (managing and executing HCs) can be covered by the existing protocol.
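
A minimal sketch of what such an MBean could look like (the name HcUploadMBean is taken from the comment above; the operation names and the String-based JSON payload are assumptions):

// Hypothetical sketch; operation names and signatures are assumptions, not an existing Jolokia API.
public interface HcUploadMBean {

    // Upload (or replace) a health check declaration, e.g. the JSON document proposed above.
    void uploadHealthCheck(String id, String jsonDefinition);

    // Remove a previously uploaded declaration.
    void removeHealthCheck(String id);

    // List the ids of all currently known declarations.
    String[] listHealthChecks();
}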

Having said all this, I like your idea of keeping the HCs completely out of the agent (instead of integrating them as plugins). To summarize, I see these two implementation strategies:

Health Checks as Jolokia Plugins

The plugins would register custom HC MBeans during startup of the agent. The advantage of this solution is that it is easy to install (only one Jolokia artifact). The disadvantage is an extra packaging (jolokia-extra), which also needs to be updated for every new agent version.

Health Checks attached externally

This is the solution you suggest: a dedicated, separate application contains all the HC stuff and registers some MBeans. As I said above, one wouldn't even need to expose a REST API, which avoids extra complexity (not so much for the WAR agent, but for the JVM and OSGi agents).

And indeed, there is already an example for this, although a bit hidden: the Jolokia integration tests also use custom MBeans for performing the tests. These MBeans are either registered during a servlet's init() (see jolokia-war), as an OSGi bundle (which registers the MBeans during startup of the bundle), or as a JVM agent (also registering the MBeans during startup). This startup code could easily be reused for health checks.
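
For the servlet case, the registration boils down to plain JMX (a minimal sketch; the HealthCheck class and the ObjectName used here are made up for illustration, they are not existing Jolokia classes):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

// Hypothetical servlet that registers a health check MBean on startup.
public class HealthCheckRegistrationServlet extends HttpServlet {

    private ObjectName objectName;

    @Override
    public void init() throws ServletException {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            objectName = new ObjectName("jolokia.extra:type=HealthCheck");
            server.registerMBean(new HealthCheck(), objectName); // HealthCheck is hypothetical
        } catch (Exception e) {
            throw new ServletException("Could not register health check MBean", e);
        }
    }

    @Override
    public void destroy() {
        try {
            ManagementFactory.getPlatformMBeanServer().unregisterMBean(objectName);
        } catch (Exception e) {
            // ignore during shutdown
        }
    }
}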

The beauty of this solution is that it is completely independent of the Jolokia agent (and hence can have completely different release cycles). The disadvantage for the user is that they have to install two artifacts: the Jolokia agent itself and the health check agent.


So, what's your favorite approach? Personally, I would prefer the second one because of its independence and easier maintenance. Markus, thanks a lot for the heads-up. Sometimes one doesn't see the wood for the trees.


nevenr commented Jan 22, 2015

Hi,

I already suggested something like this in Jolokia issue #162. We should pay attention to the proper scaling of the number of monitoring components.

[Diagram: jolokiahealthchecker_one_per_proc] One per proc (component).

[Diagram: jolokiahealthchecker_one_per_host] One per host.

[Diagram: jolokiahealthchecker_one_per_env] One per environment.


rhuss commented Jan 22, 2015

IMO we should take care about the focus of health checks as discussed here. A health check should be a companion to an agent, 1:1. Also important: a health check should not actively and periodically query the system state, but should be nothing more than a passive component. So my understanding is that health checks fit your first suggestion, but IMO diagrams 2 and 3 go over the top for this particular use case. This does not mean that they might not be useful, but I see the responsibility for scaling more in the monitoring application itself (like Nagios with mod_gearman, or Shinken, or whatever).

To summarize, I think the HC should query only local MBeans and provide a consolidated view of their attribute values.


rhuss commented Jan 22, 2015

What I forgot to add to the comparison between the plugin and the external setup is that the plugin has access to all MBeanServers as detected by the MBeanServerDetector, which are very specific to the different platforms. That would not be the case for an external solution, unless we added the server detection facility to this external jar as well.


markus-meisterernst commented Jan 23, 2015

@nevenr, @jstrachan and @rhuss
yes, the focus should be on option 1 of the picture above (one Health Checker per JVM).
As a collector (HealthChecker client in the picture) one might consider collectd or Nagios etc., as Roland suggested. James also argued in this direction, as the multitude of choices and use cases diverges from here on (i.e. from the point where a Jolokia client receives the raw data from the agent) ...

@rhuss considering the various agent variants to support (JVM, OSGi, WAR), you are quite right to abandon the idea of another REST interface. I was too concerned about my own needs in this respect ;-)

Multiple-MBeanServer support would be a good opportunity to slice the core into proper artifacts to allow the project to prosper in the future. As a first step, one could think of a Discovery module that is used by the "core".

Given that we opt for external definitions for the HC, let me bring some issues back to the table (which were my initial motivation to jump into this discussion):

  1. Definition of an HC beyond the assumed JSON format

From my perspective an HC is actually a multi-step operation (I don't know the actual strategy the Jolokia agent currently follows, so regard this as a possibly more general top-down approach; a sketch of these steps as an interface follows after the list):

S1) A describing model is provisioned (through the HcUploadMBean).
S2) The describing model (JSON, AST instance) is mapped onto the target environment (all queries are resolved to find the respective MBean candidates) => actual query model.
S3) The actual MBeans are queried according to the actual query model and the results are gathered.
S4) Assertions are applied and the results are gathered.
S5) The results are serialized (JSON).
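
As a compact sketch of that pipeline (all names are hypothetical; this is just S1-S5 written down as a Java interface):

// Hypothetical pipeline interface mirroring steps S1-S5 above; none of these types exist in Jolokia.
public interface HealthCheckPipeline {

    // S1: provision the describing model (e.g. the JSON document uploaded via an HcUploadMBean).
    void provision(String id, String describingModelJson);

    // S2: map the describing model onto the target environment, resolving queries to MBean candidates.
    Object resolve(String id);                       // returns the actual query model

    // S3: query the resolved MBeans and gather the raw results.
    Object collect(Object queryModel);               // returns the gathered metrics

    // S4: apply the assertions to the gathered results.
    Object applyAssertions(Object collectedMetrics); // returns the assertion results

    // S5: serialize the results to JSON.
    String serialize(Object assertionResults);
}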

So far so good.

My point here is that an HC is actually a special case of harvesting MBean metrics (S1-S3 and S5).
The only difference is that step S4 is inserted and S5 serializes some other content (even though the serialization code and means would be the same machinery, I presume).
Therefore I would suggest we stop to think (not stop thinking), take a broader look at it, and call these HC-defining templates Jolokia Scripts (or any better name you may come up with).
These Jolokia Scripts would allow you to do both: collect harvested metrics and apply the assertions as part of the HC. So the HC is a special case of a general-purpose "harvest script".

With this in mind you will find a better and more coherent solution, in my opinion.

  2. Provisioning and Caching of the HC

We would have to make up our minds about provisioning and caching. They are related if we assume that when a JVM starts up, the Jolokia Script cache is empty.
In this case a collector / HealthChecker (i.e. a client of the HC agent) would need to implement some sort of protocol (an execute -> receive a "script not provisioned" error -> provision script -> re-execute kind of chain, executed as part of the protocol).
If the cache were persistent, it would be more of a deployment action (either manually or by a build job) when things change (scripts updated and/or new software deployed).
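
A rough sketch of that client-side chain for the volatile-cache case (none of these types exist in Jolokia; they are made up purely to illustrate the execute/provision/re-execute protocol):

// Hypothetical client-side protocol for a volatile script cache.
public class HealthCheckClient {

    // Minimal hypothetical abstraction over the agent's HealthChecker MBean.
    public interface HcAgent {
        String executeScript(String scriptId) throws ScriptNotProvisionedException;
        void provisionScript(String scriptId, String scriptJson);
    }

    public static class ScriptNotProvisionedException extends Exception {
    }

    private final HcAgent agent;

    public HealthCheckClient(HcAgent agent) {
        this.agent = agent;
    }

    // execute -> "script not provisioned" -> provision -> re-execute
    public String execute(String scriptId, String scriptJson) {
        try {
            return agent.executeScript(scriptId);
        } catch (ScriptNotProvisionedException e) {
            // The volatile cache is empty (e.g. after a JVM restart): upload the script and retry.
            agent.provisionScript(scriptId, scriptJson);
            try {
                return agent.executeScript(scriptId);
            } catch (ScriptNotProvisionedException again) {
                throw new IllegalStateException("Script still not provisioned", again);
            }
        }
    }
}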

I personally would go with the volatile cache, as it simplifies things for the Jolokia HC implementation.
People could then implement whatever strategy best suits their use case, and if this feature is needed in the end ... it could be added any time later on ...

No matter which strategy is implemented for the Jolokia Script cache, I think it should cache only the Java representation of the describing model. That would mean transforming it into the target model on each invocation.
But that is weighing robustness and flexibility against performance, so this could also be configurable ...

Ok, I just wanted to share these thoughts with you guys.

Markus


markus-meisterernst commented Feb 14, 2015

@rhuss, @jstrachan, @nevenr
I had a look at the source of jolokia-core to get a clue about how to tackle the integration topics.

If we are to choose the "Health Checks attached externally" path, we would need a dependency on the core so we can make use of the JmxRequest descendant classes as the target models.
We would use those at the end of the day to actually carry out the MBean queries for the HC.

(I'm currently not sure if it's a good idea to slice the core into different artifacts. One has to weigh reusability against convenience ... confronting users with the question of which jars to include for their use case.)

I guess the HC code should utilize the RequestDispatcher (LocalRequestDispatcher) through the BackendManager so as to still apply the same facilities (e.g. history) and restrictions (through the Restrictor) while querying the MBeanServers.

BackendManager.java

public JSONObject handleRequest(JmxRequest pJmxReq) throws ...

If the BackendManager were an MBean, it could be looked up by the HC for calling handleRequest as part of the Jolokia Script execution. Making use of the BackendManager would also solve the problem with multiple-MBeanServer support from your last post, Roland.
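
Just to illustrate the intended call shape (a sketch only: the buildReadRequest() helper is hypothetical and the package names are assumptions; only the handleRequest(JmxRequest) signature is taken from the snippet quoted above):

import org.jolokia.backend.BackendManager;
import org.jolokia.request.JmxRequest;
import org.json.simple.JSONObject;

// Sketch of how a Jolokia Script step might delegate to the core machinery.
public class ScriptStepExecutor {

    private final BackendManager backendManager;

    public ScriptStepExecutor(BackendManager backendManager) {
        this.backendManager = backendManager;
    }

    public JSONObject readAttribute(String mbean, String attribute) throws Exception {
        // Hypothetical helper that would create a READ request for the given MBean/attribute.
        JmxRequest request = buildReadRequest(mbean, attribute);
        // Same facilities (history, restrictor, MBeanServer detection) as the normal agent path.
        return backendManager.handleRequest(request);
    }

    private JmxRequest buildReadRequest(String mbean, String attribute) {
        // Placeholder: in reality this would go through Jolokia's request factory.
        throw new UnsupportedOperationException("request construction omitted in this sketch");
    }
}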

The HC would then make use of a model that wraps around the JmxRequest subclasses in order to define the operations/queries.
They are needed to provide a naming scheme to be used for variable assignment and resolution as part of the HC expressions to be evaluated (possibly through JEXL).

Would that be a path to follow, or are you thinking in a different direction?

P.s.:
As both the agents and the HC depend on the core, it's arguable whether the HC shouldn't be integrated as part of the packaging for the target platform (WAR, OSGi ...). Logically, of course, the HC should be considered a separate module with its own lifecycle.

P.P.s.:
One could think of combining MW-specific HC Scripts (JBoss, WLS, OSGi, JVM, etc.) bundled with Jolokia with application-specific ones (for caches, routes, etc.) typically provided at runtime.


rhuss commented Mar 6, 2015

First of all, thanks a lot for your input, @markus-meisterernst. And apologies for the very long delay.

From my point of view, we shouldn't go for too tight an integration into the Jolokia backend. BackendManager and also JmxRequest are not really prepared for API-based access from the outside. This is much better in Jolokia 2.0. More on that later.

For Jolokia 1.x I really would like it to be as decoupled as possible, so my suggestion is still one of these two options:

  • Use it as an MBeanPlugin and package it with the Jolokia agent. This plugin would register an MBean in order to manage and query the health checks. It uses the MBeanPluginContext in order to query all MBeans locally according to some (to be defined) declaration (DSL, XML, JSON, whatever ...). The advantage is full and easy access to all MBeanServers and, more importantly, an easy upgrade path to Jolokia 2.0.
  • Use a completely external approach where some app registers the HC MBean, which internally queries the PlatformMBeanServer. The advantage is an even looser coupling; the disadvantage is that there is no easy access to all MBeanServers as detected by the Jolokia detectors. Also, it needs to be deployed in an extra step and we need different deployment variants (WAR, JVM, OSGi) for different scenarios (this is probably the biggest drawback).

I think the first approach is preferable because of its ease of use, and the upgrade path to Jolokia 2.0 is quite clear. Nevertheless, I think both approaches could even be achieved simultaneously if we use a slight abstraction above the MBeanPluginContext. But let's start with option 1.

What we need in the next step is to define how health checks are described, especially the syntax, which could be JSON-based (there have already been some suggestions). And then what an HcMBean could look like, with methods for deploying (open question: where to store them?), undeploying, and getting the health checks, and then one for executing a specific health check (with a specific id) or all stored health checks (I suppose the HcMBean should be able to manage multiple different health checks).
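
A rough sketch of such an interface could look like this (the name HcMBean is taken from the comment above; all method names, the String-based JSON format and the open storage question are assumptions):

// Hypothetical management interface; method names and signatures are assumptions.
public interface HcMBean {

    // Deploy or update a health check definition (e.g. as JSON); where it is stored is still open.
    void deploy(String id, String definition);

    // Remove a previously deployed health check.
    void undeploy(String id);

    // Return the ids of all currently deployed health checks.
    String[] list();

    // Return the definition of a single health check.
    String get(String id);

    // Execute one specific health check and return its (JSON) result.
    String execute(String id);

    // Execute all stored health checks and return a consolidated (JSON) result.
    String executeAll();
}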

Some final words on Jolokia 2.0: it is a complete refactoring which clearly separates all services. It's a bit modeled after OSGi and has a JolokiaContext as the central entry point into the core. The JolokiaContext itself holds a reference to the MBeanServerAccess, which is quite similar to the MBeanPluginContext. So the upgrade steps to 2.0 would be:

  • Convert the MBeanPlugin to a Jolokia Service
  • Switch access from the MBeanPluginContext to the MBeanServerAccess from the JolokiaContext
  • Package the service with the Jolokia agents (there will be a tool which allows pick-and-choose packaging of agents)

As for the timeframe: as you probably noticed, I'm currently quite busy with other stuff, so I can't talk about any concrete timeframe. Hopefully (and very likely ;-) I can spend much more time on open source stuff in May ...


markus-meisterernst commented Mar 15, 2015

Hi,

I understand, @rhuss, if you want to pursue the MBeanPluginContext strategy,
as it gives an official API and better control over how the project evolves in the future.

Personally, though, I find the approach a bit low-level, but this is obviously a matter of taste.

I have to admit that I enjoyed coding against the internal BackendManager and JmxRequest API,
which for me gave the right level of abstraction to adapt to.
The level of abstraction just felt right, as it naturally resembles the client API and the JSON protocol.

Health Check Scripting - POC

@rhuss, @nevenr, @jstrachan

I was not quite satisfied with my last post, as I felt that by just talking about things we would make no real progress in shaping the nature of Health Check Scripting.

Therefore I put some effort into preparing a proof of concept, a short tutorial, and a small reference guide on how a scripting implementation and the scripts might look.

Feel free to have a look at the code base and follow along with the tutorial:

https://github.com/markus-meisterernst/jolokia-extra/tree/hcs_feature#health-check-scripts---proof-of-concept

Even though I didn't follow your proposal, @rhuss, we might build upon the syntax and some of the insights the POC revealed.

:-)

Cheers,

Markus

P.s.: I was thinking about how to integrate plugins into the Jolokia runtime:

The health checks, be it the Java-coded plugins or the dynamic scripting feature, would ideally use the whiteboard pattern (http://www.osgi.org/wiki/uploads/Links/whiteboard.pdf) to register themselves with the Jolokia core runtime.

You could then consider using JMX Notifications as a vehicle to actually implement the plugin registration.
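
One way such a whiteboard could be wired up with plain JMX is to listen for MBean registrations on the MBeanServerDelegate (a sketch using only the standard javax.management API; the "jolokia.plugin" domain filter is an assumption, not an existing convention):

import javax.management.MBeanServer;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerNotification;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.ObjectName;

// Sketch: the Jolokia runtime listens for MBean registrations and treats every
// MBean in a (hypothetical) "jolokia.plugin" domain as a newly registered plugin.
public class PluginWhiteboard implements NotificationListener {

    public void start(MBeanServer server) throws Exception {
        // The MBeanServerDelegate emits a notification for every registered/unregistered MBean.
        server.addNotificationListener(MBeanServerDelegate.DELEGATE_NAME, this, null, null);
    }

    @Override
    public void handleNotification(Notification notification, Object handback) {
        if (!(notification instanceof MBeanServerNotification)) {
            return;
        }
        MBeanServerNotification mbsn = (MBeanServerNotification) notification;
        ObjectName name = mbsn.getMBeanName();
        if (!"jolokia.plugin".equals(name.getDomain())) {
            return; // not a plugin MBean
        }
        if (MBeanServerNotification.REGISTRATION_NOTIFICATION.equals(mbsn.getType())) {
            System.out.println("Plugin registered: " + name);
        } else if (MBeanServerNotification.UNREGISTRATION_NOTIFICATION.equals(mbsn.getType())) {
            System.out.println("Plugin unregistered: " + name);
        }
    }
}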

But where (i.e. in which MBeanServer instance) does the Jolokia runtime exist?

Ideally, Jolokia would spawn its own MBeanServer that proxies to all other MBeanServers detected within the JVM.
As such, the MBeanPluginContext would be bound to this Jolokia MBeanServer, inherently solving the multiple-MBeanServer issue.

The Jolokia MBeanServer would be registered with a known agentId (e.g. jolokia), so that all plugins know where they should actually register (i.e. to which MBeanServer to send their JMX notifications).
I'm not 100% sure whether this can be done; however, there is at least an MBeanServerFactory.findMBeanServer(String agentId) method.
See also http://marxsoftware.blogspot.de/2008/01/determining-your-mbeanservers.html about the agentId.
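
For reference, this is how the agentId lookup works with standard JMX (note that the MBeanServerId is normally generated by the JMX implementation, so whether it can simply be fixed to "jolokia" would indeed need to be verified, as noted above):

import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

public class MBeanServerLookup {

    public static void main(String[] args) throws Exception {
        // null returns all MBeanServers that were created via MBeanServerFactory.
        List<MBeanServer> servers = MBeanServerFactory.findMBeanServer(null);
        for (MBeanServer server : servers) {
            // The agentId is the MBeanServerId attribute of the server's delegate MBean.
            String agentId = (String) server.getAttribute(
                    new ObjectName("JMImplementation:type=MBeanServerDelegate"),
                    "MBeanServerId");
            System.out.println("Found MBeanServer with agentId: " + agentId);
        }

        // Or filter directly by a known agentId, as suggested above:
        List<MBeanServer> jolokiaServers = MBeanServerFactory.findMBeanServer("jolokia");
        System.out.println("Servers matching agentId 'jolokia': " + jolokiaServers.size());
    }
}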

The only other immediately apparent problem that comes to my mind is deployment order.
One has to make sure that Jolokia is deployed before any plugins, so that there is actually a consumer for the whiteboard registration notifications.

This could be solved at the container or application level (package a surveillance EAR where Jolokia and the plugins are bundled together; the application.xml would allow for an explicit order).
