[Enhancement] Provide metrics to monitor certificates expiration #3761

Closed
OuesFa opened this issue Oct 7, 2020 · 13 comments

Comments

@OuesFa
Contributor

OuesFa commented Oct 7, 2020

Is your feature request related to a problem? Please describe.
Lack of visibility regarding the validity period of certificates created by the cluster & user operators.

Describe the solution you'd like
Expose metrics to monitor the expiration of the certificates through visualisations and alerts.

@scholzj
Member

scholzj commented Apr 12, 2022

Triaged on 12.4.2022: This makes sense for the CAs:

  • Metric can be provided as days until expiration
  • Should be included in the Grafana dashboards and sample alerts

User certificates are not that easy because the User Operator doesn't know whether the certificate is actually used by the client or not. This would be better solved at the client side.

@OuesFa
Contributor Author

OuesFa commented Apr 12, 2022

Thanks for the update.
How "difficult" do you think this contribution would be, and how long would it take, compared to a contribution like #5413, for example?
I can work on this if you think it is reasonable for someone who is not that familiar with the operator's code.

@scholzj
Member

scholzj commented Apr 12, 2022

I think this is harder than #5413. There are two parts to this:

  1. Getting the actual information about the days till expiration. This might IMHO not be that hard.

  2. Exposing the metrics. I think this will be the hard part, because here you need to:

    • Have some shared metric which will aggregate this for multiple Kafka clusters (since the operator might manage more of them, each with their own CAs and metrics)
    • Make sure to add the metrics for new and existing clusters
    • Make sure to remove the metrics for deleted clusters

    And I think this might not be completely easy.
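
As a rough illustration of the three requirements listed above, here is a minimal sketch in plain Python rather than the operator's own code. The class name, method names, and the label set (namespace, cluster, CA type) are all hypothetical; the real operator would use its own metrics library and labels.

```python
import threading
from typing import Dict, Tuple

# Hypothetical label set: (namespace, cluster name, CA type such as "cluster-ca").
Key = Tuple[str, str, str]


class CertificateExpirationMetrics:
    """One shared holder for the expiration values of many Kafka clusters."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[Key, float] = {}

    def set(self, namespace: str, cluster: str, ca_type: str, value: float) -> None:
        # Called on every reconciliation, so it covers new and existing clusters.
        with self._lock:
            self._values[(namespace, cluster, ca_type)] = value

    def remove_cluster(self, namespace: str, cluster: str) -> int:
        # Drops every CA entry of one cluster; returns how many were removed.
        with self._lock:
            stale = [k for k in self._values if k[:2] == (namespace, cluster)]
            for key in stale:
                del self._values[key]
            return len(stale)

    def snapshot(self) -> Dict[Key, float]:
        # Copy of the current values, e.g. for a scrape endpoint to render.
        with self._lock:
            return dict(self._values)
```

Setting the value on each reconciliation handles the "new and existing clusters" case with one code path; the open question in this sketch is still who calls `remove_cluster` and when.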

Of course, if you wanna look into it, we will try our best to help you.

@steffen-karlsson
Contributor

@scholzj @maciej-tatarski and I would like to work on this. We have a solution in our company that does not use Strimzi directly but is built around it, and we have some suggestions for dashboards as well.

@maciej-tatarski
Contributor

One suggestion would be to expose the actual epoch of the expiration date instead of days to expiration as it is easier and more flexible to work with.

@scholzj
Member

scholzj commented Mar 14, 2024

Can you elaborate a bit more on it?

  • Why do you think the epoch of the actual expiration is more useful? It seems to me that the number of days makes it super easy to evaluate it and read. Although I guess at the end you can usually convert them quite easily if needed.
  • Also, how do you monitor the expirations if not directly using Strimzi? In the past, I thought that the best way to implement this might be through a separate tool called something like strimzi-state-metrics that would provide these additional metrics as some of them are hard to integrate directly into the operators and add a lot of complexity that way.

@steffen-karlsson
Contributor

steffen-karlsson commented Mar 14, 2024

@maciej-tatarski will elaborate on the epoch :)

@scholzj Regarding our existing solution, we don't plan to use that as part of this implementation, but rather to decommission it when this is done.

What we have done is implement dashboards and alerts on certs, based on a K8s CronJob that monitors our external secret store and emits metrics that way, because we were missing this.
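
For readers wondering what "emit metrics" could look like in such a job, here is a minimal sketch. It assumes the certificate's `notAfter` value is already available as an OpenSSL-style string and that the metric is published in the Prometheus text format; neither detail is stated in this thread. The metric name and the function name are made up for illustration.

```python
import ssl


def render_expiration_metric(not_after: str, labels: dict) -> str:
    """Render one gauge sample holding the expiration time as epoch seconds.

    `not_after` uses the OpenSSL text form, e.g. "Jan  5 09:34:43 2018 GMT".
    Label values are assumed not to need escaping in this sketch.
    """
    # Standard-library helper that converts a certificate time to epoch seconds.
    epoch = ssl.cert_time_to_seconds(not_after)
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    name = "certificate_not_after_seconds"  # hypothetical metric name
    return f"# TYPE {name} gauge\n{name}{{{label_text}}} {epoch}\n"


if __name__ == "__main__":
    print(render_expiration_metric("Jan  5 09:34:43 2018 GMT", {"cluster": "my-cluster"}))
```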

What we want to do in this solution is to emit the metric when a cluster is created or a secret is updated, and remove it when a cluster is deleted, i.e. in the operator-common and cluster-operator.

@maciej-tatarski
Contributor

I think epoch is better because in Grafana you can easily visualize it as a date or as days to expiry, since it fits the default Grafana time format. Additionally, it gives you more precise data, as it is in seconds.
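
To make the conversion concrete, here is a small sketch showing that both the date and the days to expiry can be derived from a single epoch value, while the reverse loses precision. The function name and the returned field names are illustrative only.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def describe_expiration(not_after_epoch: float, now_epoch: float) -> dict:
    """Derive the human-friendly views from one epoch-seconds value."""
    remaining = not_after_epoch - now_epoch
    return {
        # Absolute expiration date in UTC, ISO 8601.
        "expires_at": datetime.fromtimestamp(not_after_epoch, tz=timezone.utc).isoformat(),
        # Full precision, useful for alert thresholds.
        "seconds_remaining": remaining,
        # The "days until expiration" view, computed on demand.
        "days_remaining": remaining / SECONDS_PER_DAY,
    }
```

A dashboard or alert rule can do the same subtraction against the current time, so the exported value itself never changes between certificate renewals.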

@scholzj
Member

scholzj commented Mar 14, 2024

> I think epoch is better because in Grafana you can easily visualize it as a date or as days to expiry, since it fits the default Grafana time format. Additionally, it gives you more precise data, as it is in seconds.

Ok, I guess that makes sense. We would still need to figure out what would be the best way to expose these metrics. One of the main issues is how to cleanly remove them when the cluster is deleted.

@steffen-karlsson
Contributor

I see, there are no callbacks or anything on deletion in the cluster-operator currently that we can hook into?

@scholzj
Member

scholzj commented Mar 14, 2024

To be honest, I do not remember the details exactly. But in general, the deletion is done by Kubernetes and its garbage collection. It is not always simple to remove the metrics for the deleted resources. But if you don't do it, they usually stay set until the operator restarts.
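
One possible way around the missing deletion hook, sketched here only as an assumption and not as what the operator does, is to prune periodically: compare the metric entries against the set of clusters that currently exist and drop the rest. The key layout (namespace, cluster, CA type) and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str, str]  # hypothetical: (namespace, cluster, CA type)


def prune_stale(metrics: Dict[Key, float], live_clusters: Set[Tuple[str, str]]) -> List[Key]:
    """Delete entries whose (namespace, cluster) no longer exists.

    `live_clusters` would come from listing the current Kafka resources.
    Returns the removed keys so the caller can log them.
    """
    stale = [key for key in metrics if key[:2] not in live_clusters]
    for key in stale:
        del metrics[key]
    return stale


if __name__ == "__main__":
    metrics = {
        ("kafka", "a", "cluster-ca"): 1.0,
        ("kafka", "b", "cluster-ca"): 2.0,
    }
    # Cluster "a" was deleted by Kubernetes; only "b" is still listed.
    print(prune_stale(metrics, {("kafka", "b")}))
```

The trade-off is that a deleted cluster's metric lingers until the next pruning pass, which is still shorter than waiting for an operator restart.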

@steffen-karlsson
Contributor

Makes sense, @maciej-tatarski and I would gladly give it a go and see if we can come up with anything meaningful :)

@scholzj
Member

scholzj commented Mar 14, 2024

Ok, great. That sounds like a plan then.

scholzj added a commit that referenced this issue Apr 3, 2024
Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
Signed-off-by: Steffen Wirenfeldt Karlsson <steffen.karlsson@maersk.com>
Signed-off-by: maciej-tatarski <maciej.tatarski@maersk.com>
Co-authored-by: Jakub Scholz <www@scholzj.com>
Co-authored-by: maciej-tatarski <maciej.tatarski@maersk.com>
@scholzj scholzj closed this as completed Apr 3, 2024
4 participants