Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strimzi Monitoring Refresh #877

Merged
merged 5 commits into from Sep 21, 2018

Conversation

Projects
None yet
5 participants
@seglo
Copy link
Contributor

commented Sep 19, 2018

Type of change

  • Refactoring
  • Documentation

Description

This PR refreshes the Monitoring appendix section in the documentation. The new bundled dashboards for Kafka and ZooKeeper should work out of the box to detect running Strimzi clusters and monitor key metrics for both services. Variables can be used to drill down the view by Namespace, Strimzi Cluster, Broker/Node, Topic (Kafka), Partition (Kafka). This should serve as a good starting point for people who want to build their own monitoring infrastructure for Strimzi clusters.

Important: Before the documentation for deploying Grafana can be used the Grafana docker image in ./metrics/examples/grafana/grafana-openshift must be pushed to the strimzilabs docker user on docker.io. The tag is expected to be strimzilabs/grafana-openshift:latest.

The following tasks were completed.

  • Update Prometheus server to latest available version (2.4.0).
  • Update Prometheus server config to scrape service endpoints, pods, kube-metrics, cAdvisor, etc.
  • Update Grafana server to latest available version (5.2.4).
  • Add Grafana source docker image to the project.
  • Update Kafka examples and oc templates to include a full set of Prometheus JMX Exporter rules for effective monitoring.
  • Replace existing sample Grafana dashboard with one new dashboard each for monitoring Kafka and ZooKeeper
  • Refresh documentation.

Example dashboards:

strimzi-kafka.json

image

strimzi-zookeeper.json

image

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 19, 2018

TODO: Revert examples and templates Kafka resources to not include additional Prometheus JMX exporter rules. Create a new example in the metrics/examples/ directory. Update docs to reference this example for exporter rules. /cc @scholzj

@tombentley
Copy link
Member

left a comment

This is a real improvement, thanks! Just a couple of minor points.


==== Grafana
To deploy all these resources you can run the following. Note that this file creates a `ClusterRoleBinding` in the `myproject` namespace. If you're not using this namespace then download the resource file locally and update it.

This comment has been minimized.

Copy link
@tombentley

tombentley Sep 20, 2018

Member

If this file will be included in what the user downloads why are we telling them to apply it from a URL (thus assuming the user has internet access)?

This comment has been minimized.

Copy link
@scholzj

scholzj Sep 20, 2018

Member

TBH, I do not think we should provide the Grafana and Prometheus YAMLs in releases. From my perspective these to be used for testing etc. Users should they own production grade setup. Or are these deployments production ready @seglo?

But we should maybe do it for the JSONs with Grafana dashboards.

This comment has been minimized.

Copy link
@seglo

seglo Sep 20, 2018

Author Contributor

@tombentley @scholzj At the moment the Strimzi download doesn't bundle the ./metrics/examples so the docs have the user downloading resource artifacts directly from the internet.

I think it might make sense to include it in the release bundle, again as an example. Since we serve other example resources in that bundle I think it would fit in. Then we could reference the resources there instead of the internet, and apply the same namespace sed instructions that are done in the cluster provisioning sections of the docs. Would you lke me to update the release management to include it?

This monitoring setup isn't everything you need in production, but I would say that the dashboard gets you about 75% of the way. The missing pieces IMO are setting up alerts and any custom tweaks required by the user to accommodate their particular infrastructure. Both these activities are subjective to the user's setup.

Another dimension this setup doesn't satisfy is application monitoring. The dashboards are a good starting point for infrastructure monitoring of the cluster itself, but monitoring Kafka application attributes such as consumer lag is a major component of a complete streaming monitoring setup. I have some ideas here which I'll follow up on in another thread.

So while this setup isn't everything you need in production, it's a good starting point. There's a lot of components and configuration required to make metrics available to the dashboards. I think it's important to include working Prometheus JMX Exporter configurations and a Prometheus server configuration so that users can get up and running out of the box.

This comment has been minimized.

Copy link
@tombentley

tombentley Sep 20, 2018

Member

The reason we don't include metrics currently is because it don't want users to mistakenly think we're able to support their prometheus/grafana deployment. We're not in a position to do that, so we don't distribute those things and merely link to the them in the docs.

We could stick to that approach, but I'm not sure how helpful it really is to the end user. An alternative would be to rearrange the current examples directory to better reflect things. Right now it contains install resources as well as more exemplar custom resources. If we has install (containing the current examples/install), examples and contrib top level directories, I think that might better reflect the status of these different things.

This comment has been minimized.

Copy link
@scholzj

scholzj Sep 20, 2018

Member

@seglo I do not think the dashboards are something what would not be production ready (although its obviously not everything). But for example the Prometheus deployment YAML is using emptyDir volume which is probably not the right way for your production Prometheus deployment. So IMHO we need to make everyone aware that these are only if you want to try the monitoring and that this is not in any way something you should take to production just like that.

This comment has been minimized.

Copy link
@seglo

seglo Sep 20, 2018

Author Contributor

@scholzj That's true. The opening paragraph for the Metrics appendix section mention that the metrics setup isn't "recommended configuration", but I'll update this to include some disclaimers about the things like the Prometheus config, etc. The main parts we should highlight as required for the Grafana dashboards to work is the Prometheus service discovery and relabeling config and the JMX exporter rules.

@tombentley I think your proposed structure is a good idea, but I'll keep it as-is for this PR. Do you agree?

This comment has been minimized.

Copy link
@tombentley

tombentley Sep 20, 2018

Member

@seglo fine with me. I'm interested in @scholzj's opinion of that idea.

This comment has been minimized.

Copy link
@scholzj

scholzj Sep 20, 2018

Member

Yeah, I think the new structure is good. But keeping it for separate PR makes sense.

The Prometheus data source, and the above dashboard, can be set up in Grafana by following these steps.
As an example, and in order to visualize the exported metrics in Grafana, two sample dashboards are provided https://github.com/strimzi/strimzi-kafka-operator/blob/{ProductVersion}/metrics/examples/grafana/strimzi-kafka.json[`strimzi-kafka.json`] and https://github.com/strimzi/strimzi-kafka-operator/blob/{ProductVersion}/metrics/examples/grafana/strimzi-zookeeper.json[`strimzi-zookeeper.json`].
These dashboards represent a good starting point for key metrics to monitor Kafka and ZooKeeper clusters.
Please note that they are not representative of all the metrics available.

This comment has been minimized.

Copy link
@tombentley

tombentley Sep 20, 2018

Member

Can we link to a reference for all the metrics which are available?

This comment has been minimized.

Copy link
@seglo

seglo Sep 20, 2018

Author Contributor

Ok. I'll add a few references.

Add metrics reference section in docs.
Updated disclaimer to make it clear this is an example only.
Removed exporter rules from examples.
Added metrics Kafka example with exporter rules.
@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2018

Updates based on PR feedback:

  • Added metrics reference links section in docs.
  • Updated disclaimer to make it clear this is an example only.
  • Removed exporter rules from existing examples.
  • Added new metrics Kafka resource example with exporter rules.

@scholzj scholzj added this to the 0.8.0 milestone Sep 20, 2018

@scholzj scholzj self-assigned this Sep 20, 2018

@scholzj

This comment has been minimized.

Copy link
Member

commented Sep 20, 2018

@seglo I wanted to build and push the image before I merge this. But there is nothing in the ./metrics/examples/grafana/grafana-openshift directory. Did you forget to add the Dockerfile? Or did I get it wrong?

@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2018

@scholzj My mistake. I added the directory.

Show resolved Hide resolved documentation/book/appendix_metrics.adoc Outdated
Show resolved Hide resolved metrics/examples/grafana/kubernetes.yaml Outdated
Fix strimzilab dockerhub user id typo.
Rephrased intro sentence in docs.
@scholzj

This comment has been minimized.

Copy link
Member

commented Sep 20, 2018

@seglo I played with it in my OpenShift. The dashboards are super dope. @ppatierno wanted to review this as well ... and since he did the original work I will give him some more time to do it ... hopefully we can merge it tomorrow. Thanks a lot for this great contribution.

@scholzj scholzj assigned ppatierno and unassigned scholzj Sep 20, 2018

@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2018

Thanks @scholzj ! Sure, there's no rush.

@ppatierno
Copy link
Member

left a comment

Thanks @seglo ! I pushed the Grafana image and tested it ... it's really a great addition on monitoring example. Thanks again! Going to merge it ;)

@ppatierno ppatierno merged commit 3e0f271 into strimzi:master Sep 21, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 21, 2018

Awesome. Thanks and you're welcome :)

@Tombar

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2018

I know I'm late to the game and that the PR is already merged, but I tried it out and have some comments I think would be useful overall.

  1. the zookeeper dashboard only works when running zookeper with 2+ replicas, if just running 1 replica zookeepr jmx export fails (due to the replica rules) + doesn't export the InMemoryDataTree or zookeeper_quorum metrics needed for the dashboard

  2. when trying to use the dashboards with the Prometheus helm chart (instead of the prometheus.yml) it didn't work out of the box, I needed to manually diff rules between the prometheus.yml in the example and the default one in the chart + update some of the dashboards to use pod or pod_name instead of kubernetes_pod_name

  3. last but not least :) great work! I will soon finish porting the kafka connect dashboard from the strimzi-training/lab-5 into a similar one.

@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 25, 2018

Thanks for the review @Tombar ! I'll follow up tomorrow.

@seglo

This comment has been minimized.

Copy link
Contributor Author

commented Sep 25, 2018

Hey @Tombar . My responses below:

the zookeeper dashboard only works when running zookeper with 2+ replicas, if just running 1 replica zookeepr jmx export fails (due to the replica rules) + doesn't export the InMemoryDataTree or zookeeper_quorum metrics needed for the dashboard

Interesting. I'll take a look at this. As you may be aware, some metrics aren't available when the ZK ensemble is only one node. I'll see if there's a way to support single node and cluster configurations without requiring separate dashboards.

when trying to use the dashboards with the Prometheus helm chart (instead of the prometheus.yml) it didn't work out of the box, I needed to manually diff rules between the prometheus.yml in the example and the default one in the chart + update some of the dashboards to use pod or pod_name instead of kubernetes_pod_name

Yes. This is an unfortunate problem in the metrics space in general. Inconsistent metric names, versioning, relabeling, exporter rules, etc. There are even core Prometheus exporters that are inconsistent with each other. Here's a prometheus server example config from the prometheus project
itself that uses the kubernetes_pod_name relabeling config and probably where we (Lightbend) got the configuration from in the first place. This is why it's important to provide the full pipeline in the Metrics example so that users can at least get it running in isolation. Users should be encouraged to make these examples their own and then apply whatever metrics conventions make sense for their use cases. Consistency is the best advice.

last but not least :) great work! I will soon finish porting the kafka connect dashboard from the strimzi-training/lab-5 into a similar one.

Excellent! I don't have much experience with connect, but I'm interested in learning more. I'm looking forward to seeing the dashboard you provide.

@seglo seglo referenced this pull request Sep 25, 2018

Merged

Support standalone ZooKeeper in Grafana dashboard #895

0 of 7 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.