Health Checks should be generic insofar as they can be declared externally in some form, i.e. the following open points should be addressed:
The design will be summarized briefly in this wiki page.
My 5 cents ... :-)
P.s.: A BulkSampleDefinition would be made up of individual SampleDefinitions, some of which are reusable (the MW and OS checking ones) and some of which are application specific (some namespace support might be good then ...). A SampleJobDefinition would therefore define what to sample, for whom, on which hosts.
@markus-meisterernst interesting stuff.
Monitoring and management of software is a large and complex space with lots of use cases and possible software components, so I don't think the health checks have to perform all possible checks and alerts. TBH the main use case I have is to check that things are working at all (not so much going a bit too slow relative to other machines or expected performance over statistical measurements). E.g. did the container start up, did the camel route start, did the JMS consumer actually get some messages, etc. If not, kill it; something random and bad happened (could be a temporary network glitch with stale DNS or something odd, could be a software bug - whatever - restart!).
Also I think the idea should be that the health check agent should be pretty small with minimal overhead ideally.
Having said all that; I guess some other global view system could figure out average usage windows and then update the health checks in individual agents (e.g. periodically updating them or something). e.g. if there was a JMX operation we could use to add/update checks at runtime; then some global monitoring system could decide what were good ranges of certain metrics and then update them via JMX/REST - while keeping the health check agent code nice and simple.
BTW my personal biggest itch to scratch is to get builds of containers to generate meaningful health checks to verify that things actually start up correctly. E.g. that N camel routes should start, that messages should be consumed from some specific queues, etc. Without health checks, just looking at metric sizes and changes over time, you kinda miss the fact that a camel route couldn't connect to a database so never even started, or that an ActiveMQ application has never properly connected to the message broker yet, but everything else looks fine. It's kinda easy to set alerts if a queue gets too big or something; but it's kinda hard to add alerts on metrics which are not there ;)
@jstrachan ok, I finally got the idea of healthchecks and some of your itches. thanks for sharing.
Today I had a look at the healthcheck code (@rhuss there is actually a bug in the GREATER_THAN_EQUAL definition of the Comparison enum).
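The actual Jolokia proof-of-concept code isn't quoted here, so the following is only a hypothetical reconstruction of what such a `Comparison` enum could look like once the reported bug is fixed; the typical mistake with a `GREATER_THAN_EQUAL` constant is delegating to `>` instead of `>=`:

```java
// Hypothetical sketch -- the real Jolokia enum may look different.
public enum Comparison {
    GREATER_THAN {
        public boolean matches(double actual, double threshold) { return actual > threshold; }
    },
    // The reported bug would be this constant using '>' instead of '>='.
    GREATER_THAN_EQUAL {
        public boolean matches(double actual, double threshold) { return actual >= threshold; }
    },
    LESS_THAN {
        public boolean matches(double actual, double threshold) { return actual < threshold; }
    },
    LESS_THAN_EQUAL {
        public boolean matches(double actual, double threshold) { return actual <= threshold; }
    };

    // Evaluate a sampled metric value against a configured threshold.
    public abstract boolean matches(double actual, double threshold);
}
```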
@rhuss mentioned a DSL and the capability to have external definitions picked up somehow.
I personally would prefer the suggestion made by @nevenr, even though JSON should not replace a DSL. But as Jolokia already uses JSON, we have grown accustomed to it, and one quality criterion I picked up is to prevent bloat; so I think JSON could actually be used to serialize these external healthcheck definitions.
I found out, James, that you are actually a Jexl committer. As Jexl 2 only depends on the bsf-api and commons-logging (afaik), which together amount to ~300 KB, it looks to me like a sound means for describing and executing assertions, at least compared to a full-blown scripting language in terms of the number of dependencies and size.
If you give each MBean you query/invoke an alias, you can put the results on the Jexl context, using the alias as a variable that can be referred to in the Jexl script.
So, I like the idea to combine JSON and Jexl. Building on @nevenr's example a JSON could look like:
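@nevenr's original example isn't quoted here, so the following is only an illustrative guess at the shape such a definition could take (all names and the structure are hypothetical): each sample aliases an MBean read, and a Jexl assertion refers to those aliases as variables:

```json
{
  "dataCollections": [
    {
      "name": "heapCheck",
      "samples": [
        {
          "alias": "mem",
          "mbean": "java.lang:type=Memory",
          "attribute": "HeapMemoryUsage"
        }
      ],
      "assertion": "mem.used < mem.max * 0.9"
    }
  ]
}
```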
Again, the JSON is only a serialization means to provide dataCollection definitions (~ SampleDefinitions from my last post) to the agent.
/jolokia/exec/<mbeanName for healthchecker>/exec/dataCollection1/dataCollection2/.../dataCollection3
It would be just another MBean operation, and as such the coupling between the existing API and the healthchecker would be reduced, not to speak of the flexibility gained through the external definitions.
The setup could be summarized as such:
P.s.: Maybe it is also viable to have both: the plugin concept for some kinds of closed environments and the more flexible REST-based API for the other environments.
@markus-meisterernst, thanks a lot for sharing your ideas.
The current code is really only a Proof-of-Concept and is meant to be thrown away ;-). And indeed, a final solution should emit JSON as a response (for both the good and the bad case).
I'm also fine with a JSON-based DSL for describing the health checks, so that we keep the size of the agents small.
A plugin also gets access to the agent configuration (and of course also to system properties and environment variables), so there should be an easy way to define where HC declarations are stored.
For uploading the HC declaration I don't think we need a new REST API, since we could easily provide an HcUploadMBean which takes the check definitions as an argument to a custom MBean operation. So everything (managing and executing HCs) can be covered by the existing protocol.
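A minimal sketch of what such an upload MBean could look like; the interface, operation, and ObjectName below are assumptions for illustration, not existing Jolokia API:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class HcUploadDemo {

    // Hypothetical MBean interface: a single operation that receives
    // the serialized (e.g. JSON) health-check definitions.
    public interface HcUploadMBean {
        void uploadChecks(String jsonDefinitions);
    }

    // Standard-MBean implementation; parsing/validation omitted in this sketch.
    public static class HcUpload implements HcUploadMBean {
        private volatile String current = "{}";

        @Override
        public void uploadChecks(String jsonDefinitions) {
            current = jsonDefinitions;
        }

        String currentChecks() { return current; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The ObjectName is an assumption, not an existing Jolokia name.
        ObjectName name = new ObjectName("jolokia:type=HcUpload");
        server.registerMBean(new HcUpload(), name);
        // Uploading goes through the standard JMX invoke path -- exactly
        // what the existing Jolokia exec protocol would end up calling.
        server.invoke(name, "uploadChecks",
                new Object[]{"{\"checks\":[]}"},
                new String[]{String.class.getName()});
        System.out.println(server.isRegistered(name)); // prints "true"
    }
}
```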
Having said all this, I like your idea to keep the HCs completely out of the agent (instead of integrating them as plugins). To summarize, I see these two implementation strategies:
Health Checks as Jolokia Plugins
The plugins would register custom HC MBeans during startup of the agent. The advantage of this solution is that it is easy to install (only one Jolokia artifact). The disadvantage is the extra packaging (jolokia-extra), which also needs to be updated for every new agent version.
Health Checks attached externally
This is the solution you suggest: A dedicated, separate application contains all the HC stuff and registers some MBeans. As I said above, one wouldn't even need to expose a REST API which avoids extra complexity (not so much for the WAR agent, but for the JVM and OSGi agents).
And indeed, there is already an example for this, though a bit hidden: the Jolokia integration tests also use two custom MBeans for performing the tests. These MBeans are either registered during a Servlet's
The beauty of this solution is that it is completely independent of the Jolokia agent (and hence can have completely different release cycles). The disadvantage for the user is that they have to install two artifacts: the Jolokia agent itself and the health check agent.
So, what's your favorite approach? Personally, I would prefer the second one because of its independence and easier maintenance. Markus, thanks a lot for the heads-up. Sometimes one doesn't see the wood for the trees.
IMO we should take care about the focus of health checks as discussed here. A health check should be a companion to an agent, 1:1. Also important: a health check should not actively and periodically query the system state; it is nothing more than a passive component. So my understanding is that health checks fit your first suggestion, but IMO diagrams 2 and 3 go over the top for this particular use case. This does not mean they might not be useful, but I see the responsibility for scaling more with the monitoring application itself (like Nagios with mod_gearman, or Shinken, or whatever).
To summarize, I think the HC should query only local MBeans and provide a consolidated view of their attribute values.
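As a rough illustration of "query only local MBeans and provide a consolidated view" (no Jolokia API involved, attribute selection is arbitrary), a sketch that reads a few platform MBean attributes and folds them into a single snapshot:

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConsolidatedView {

    // Read a handful of attributes from local platform MBeans and fold
    // them into one consolidated health snapshot.
    static Map<String, Object> snapshot() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> view = new LinkedHashMap<>();
        view.put("uptime", server.getAttribute(
                new ObjectName("java.lang:type=Runtime"), "Uptime"));
        view.put("threadCount", server.getAttribute(
                new ObjectName("java.lang:type=Threading"), "ThreadCount"));
        view.put("loadedClasses", server.getAttribute(
                new ObjectName("java.lang:type=ClassLoading"), "LoadedClassCount"));
        return view;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(snapshot());
    }
}
```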
What I forgot to add to the comparison between the plugin and the external setup is that the plugin has access to all MBeanServers as detected by the MBeanServerDetector, which are very specific to different platforms. That would not be the case for an external solution, unless we added the server detection facility to this external jar as well.
@nevenr, @jstrachan and @rhuss
@rhuss considering the various clients to support (JVM, OSGi, WAR), you are right to abandon the idea of another REST interface. I was too concerned about my own needs in this respect ;-)
The multiple-MBeanServer support would be a good opportunity to slice the core into proper artifacts, to allow the project to prosper in the future. One could think of a Discovery module used by the "core" as a first step.
Given that we would opt for external definitions for the HC, let me bring some issues back to the table (which were my initial motivation to jump into this discussion):
From my perspective a HC is actually a multi-step operation (I don't know the actual strategy the Jolokia agent currently follows, so regard this as a possibly more general top-down approach):
S1) a describing model is provisioned (through the HcUploadMBean)
So far so good.
My point here is that a HC is actually a special case of harvesting MBean metrics (S1-3 and S5).
With this in mind one might find a better and more coherent solution; that's my opinion.
We would have to make up our minds about provisioning and caching. They are related if we assume that the Jolokia script cache would be empty when a JVM starts up.
I personally would go with the volatile cache, as it simplifies things for the Jolokia HC implementation.
No matter what strategy is implemented for the Jolokia script cache, I think it should cache only the Java representation of the describing model. That would mean transforming to the target model on each invocation.
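A volatile in-memory cache holding only the parsed Java model of a definition could be as simple as the following sketch (class and method names are hypothetical, not Jolokia code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical script cache: keeps only the parsed Java representation
// of a check definition; being in-memory only, it is empty after every
// JVM restart (the "volatile cache" variant discussed above).
class ScriptCache<M> {
    private final Map<String, M> models = new ConcurrentHashMap<>();
    private final Function<String, M> parser;

    ScriptCache(Function<String, M> parser) {
        this.parser = parser;
    }

    // Parse on cache miss; the transformation to the target model
    // (e.g. a JmxRequest) would still happen on every invocation.
    M get(String name, String rawDefinition) {
        return models.computeIfAbsent(name, n -> parser.apply(rawDefinition));
    }

    void evict(String name) {
        models.remove(name);
    }
}
```

For example, `new ScriptCache<>(Integer::parseInt).get("a", "42")` parses once and serves the cached model on subsequent calls.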
Ok, I just wanted to share these thoughts with you guys.
If we are to choose the "Health Checks attached externally" path, we would need a dependency on the core so we can make use of the JmxRequest descendant classes as the target models.
(I'm currently not sure if it's a good idea to slice the core into different artifacts. One has to weigh reusability against convenience... facing users with the question of which jars to include for their use case.)
I guess the HC code should utilize the RequestDispatcher (LocalRequestDispatcher) through the BackendManager, so as to still apply the same facilities (e.g. history) and restrictions (through the Restrictor) while querying the MBeanServers.
If the BackendManager were an MBean, it could be looked up by the HC to call handleRequest as part of the Jolokia script execution. Making use of the BackendManager would also solve the problem with multiple-MBeanServer support from your last post, Roland.
The HC would then use a model that wraps the JmxRequest subclasses in order to define the operations/queries.
Would that be a path to follow, or are you thinking in a different direction?
First of all, thanks a lot for your input, @markus-meisterernst. And apologies for the very long delay.
From my point of view, we shouldn't go for a too tight integration into the Jolokia backend.
For Jolokia 1.x I really would like it to be as decoupled as possible, so my suggestion is still one of the two options:
I think the first approach is preferable because of its ease of use, and the upgrade path to Jolokia 2.0 is quite clear. Nevertheless, I think even both approaches could be achieved simultaneously if we use a slight abstraction above
What we now need as the next step is to decide how we define Health Checks, especially the syntax, which could be JSON based (there have already been some suggestions). And then next how an
Some final words on Jolokia 2.0: it is a complete refactoring which clearly separates all services. It's a bit modeled after OSGi and has a JolokiaContext as the central entry point into the core. The
As for the time frame: as you probably noticed, I'm currently quite busy with other stuff, so I can't talk about any concrete timeframe. Hopefully (and very likely ;-) in May I can spend much more time on Open Source stuff ....
I understand, @rhuss, if you want to pursue the MBeanPluginContext strategy.
Personally, I find the approach a bit low level, but this is obviously a matter of taste.
I have to admit that I enjoyed coding against the internal BackendManager and JmxRequest API.
Health Check Scripting - POC
I was not quite satisfied with my last post, as I felt that by just talking about things we make no real progress in shaping the nature of Health Check Scripting.
Therefore I put some effort into preparing a Proof of Concept, a short tutorial, and a reference guide on how a scripting implementation and the scripts might look.
Feel free to have a look at the code base and follow along the tutorial:
Even though I didn't follow your proposal, @rhuss, we might build upon the syntax and some of the insights the POC revealed.
P.s.: I was thinking about how to integrate plugins into the Jolokia runtime:
The HealthChecks, being either the Java coded plugins or the dynamic scripting feature, would ideally use the whiteboard pattern (http://www.osgi.org/wiki/uploads/Links/whiteboard.pdf) to register themselves with the Jolokia Core Runtime.
You could then consider using JMX Notifications as a vehicle to actually implement the plugin registration.
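To illustrate the whiteboard idea with plain JMX (a sketch, not Jolokia code; the `jolokia.plugins` domain is an assumption): plugins just register their own MBean, and the core discovers them by listening for registration notifications on the MBeanServerDelegate instead of being called directly.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.MBeanServer;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerNotification;
import javax.management.ObjectName;

public class WhiteboardDemo {

    // Trivial standard MBean standing in for a health-check plugin.
    public interface PluginMBean { String getName(); }
    public static class Plugin implements PluginMBean {
        public String getName() { return "demo-plugin"; }
    }

    // The "core" side of the whiteboard: discover plugins purely via
    // JMX registration notifications instead of direct calls.
    static List<ObjectName> run() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        List<ObjectName> discovered = new CopyOnWriteArrayList<>();

        server.addNotificationListener(
                MBeanServerDelegate.DELEGATE_NAME,
                (notification, handback) -> {
                    if (notification instanceof MBeanServerNotification) {
                        MBeanServerNotification n = (MBeanServerNotification) notification;
                        if (MBeanServerNotification.REGISTRATION_NOTIFICATION.equals(n.getType())
                                && "jolokia.plugins".equals(n.getMBeanName().getDomain())) {
                            discovered.add(n.getMBeanName());
                        }
                    }
                }, null, null);

        // The "plugin" side: simply register under an agreed domain --
        // no direct coupling to the core.
        server.registerMBean(new Plugin(),
                new ObjectName("jolokia.plugins:type=Plugin,name=demo"));
        return discovered;
    }

    public static void main(String[] args) throws Exception {
        // Delivery of registration notifications is synchronous in the
        // default platform MBeanServer, so the plugin is already visible.
        System.out.println(run().size()); // prints "1"
    }
}
```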
But where (i.e. in which MBeanServer instance) does the Jolokia runtime exist?
Ideally, Jolokia would spawn its own MBeanServer that proxies to all other MBeanServers detected within the JVM.
The Jolokia MBeanServer would be registered with a known agentId (e.g. jolokia), so that all Plugins know where they should actually register to (i.e. to which MBeanServer to send their JMX Notifications).
The only other immediately apparent problem that comes to mind is deployment order.
This could be solved at the container or application level (package a surveillance EAR where Jolokia and the plugins are bundled together; the application.xml would allow for explicit ordering).