Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#3761 Provide metrics to monitor certificates expiration #9861

Merged

Conversation

steffen-karlsson
Copy link
Contributor

@steffen-karlsson steffen-karlsson commented Mar 21, 2024

Type of change

Description

Implement metric emitter for certificates expiration and dashboard to monitor expiration for certificates.

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards (added in comment)

@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch from 53cd989 to 5a2ab21 Compare March 21, 2024 14:15
@maciej-tatarski
Copy link
Contributor

maciej-tatarski commented Mar 21, 2024

Tested wit grafana 7.3.7 and 10.3.4. Visualised as a table, which allows to show multiple expire date based on cluster name:

image

For multiple kafka clusters:
image

On cert renewal, value of the metric is also changing:
image

@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch 8 times, most recently from 26843ea to efe5161 Compare March 22, 2024 20:50
@steffen-karlsson steffen-karlsson marked this pull request as ready for review March 22, 2024 21:50
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was at KubeCon so I got to the Pr only now. I left some comments.

@steffen-karlsson
Copy link
Contributor Author

Sorry, I was at KubeCon so I got to the Pr only now. I left some comments.

Sure, no worries at all :) Hope you had a good time! Hopefully will join next year - was at Kafka Summit this year.

@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch 4 times, most recently from 2112e7b to c48cf39 Compare March 24, 2024 18:12
@ppatierno
Copy link
Member

Struggling to understand what the latest graph represents.
The text says "On cert renewal, value of the metric is also changing:". What did exactly happen? A certificate was expiring and renewed to expire after 5 mins?

@maciej-tatarski
Copy link
Contributor

@ppatierno Cert was renewed after 5 mins and it was just to show that it is reflected in the metric. This graph is not included in any dashboard.

@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch from 64c8ec3 to 2634b28 Compare March 25, 2024 09:54
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is definitely useful. Thanks.

I tried changes on Grafana 7.3.7, including having multiple Kafka clusters, and deleting a cluster.


Issue: Expiry time set to epoch start

To reproduce the issue, create a Kafka cluster with metrics, wait for the new metric to appear in operator dashboard, then delete the whole cluster, finally recreate the cluster with the same name.

Certificate content:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            5f:25:69:2e:67:74:21:35:4e:e8:16:7a:1b:ad:43:01:7f:29:ba:7f
        Signature Algorithm: sha512WithRSAEncryption
        Issuer: O=io.strimzi, CN=cluster-ca v0
        Validity
            Not Before: Mar 25 17:51:58 2024 GMT
            Not After : Mar 25 17:51:58 2025 GMT
        Subject: O=io.strimzi, CN=my-cluster-kafka

Raw metric:

# HELP strimzi_certificate_expiration_ms Time in milliseconds when the certificate expires
# TYPE strimzi_certificate_expiration_ms gauge
strimzi_certificate_expiration_ms{cluster_name="my-cluster",kind="Kafka",namespace="test",selector="",} 0.0

Grafana dashboard:

Screenshot from 2024-03-25 17-44-37

@steffen-karlsson
Copy link
Contributor Author

steffen-karlsson commented Mar 25, 2024

Hello @fvaleri

Thanks for your comments will access them tomorrow.
Regarding your bug, @maciej-tatarski and I are aware of it, we have a solution for now, very ugly - thus not committed it, but will fix, when its decided by @scholzj, @ppatierno and yourselves probably, where KafkaEventHandler fits the best as its depending.

In short essence the problem is that MetricsHolder and its subsets are not singleton, so the new Map (reconciliationsTimerMap) is cleared pr instance, which is not shared across the controllers.
So StrimziPodSetController which resets it has one instance of MetricsHolder and KafkaReconciler who uses and populates it, has another instance. Temporary proven as a fix to make reconciliationsTimerMap static, but definitely not a final solution.

@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch from 84c13ff to fd34a04 Compare March 25, 2024 20:27
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The screenshot lists only a Cluster CA. Is the Clients CA planned for a follow-up PR?

@scholzj
Copy link
Member

scholzj commented Mar 25, 2024

when its decided by @scholzj, @ppatierno and yourselves probably, where KafkaEventHandler fits the best as its depending.

@steffen-karlsson I replied to the main open points. If you want to have a Zoom call or something to go through it together, feel free to ping me on Slack. Sometimes it's easier than through the GitHub messages.

@scholzj scholzj added this to the 0.41.0 milestone Mar 25, 2024
@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch from 37c54d6 to 42700ee Compare March 26, 2024 08:40
@maciej-tatarski maciej-tatarski force-pushed the 3761-metric-certificates-expiration branch from 42700ee to f3f5f39 Compare March 26, 2024 08:53
@scholzj
Copy link
Member

scholzj commented Apr 2, 2024

@steffen-karlsson ... I see 122 commits including stuff from other people/PRs, why? Can you fix this? Otherwise I see an history which doesn't make sense for this PR in github, or?

Well, we will squash it when merging it. So not sure it matters that much. But that is likely one of the things confusing the DCO signoff.

@ppatierno
Copy link
Member

Well, we will squash it when merging it. So not sure it matters that much. But that is likely one of the things confusing the DCO signoff.

Well couldn't be better to have them squashed now and providing a meaningful single commit message instead of using the squash and merge from GitHub which will produce a very long commit message including even all the others commit messages not related to this PR.
And yes, I agree that was the problem of the DCO imho.

@steffen-karlsson
Copy link
Contributor Author

steffen-karlsson commented Apr 2, 2024

Well, we will squash it when merging it. So not sure it matters that much. But that is likely one of the things confusing the DCO signoff.

Well couldn't be better to have them squashed now and providing a meaningful single commit message instead of using the squash and merge from GitHub which will produce a very long commit message including even all the others commit messages not related to this PR.
And yes, I agree that was the problem of the DCO imho.

I don't know what happened, the commits are actually all mine minus a few, had to alter the last half amount of mine as they were not including the Sign off message, therefore the high amount.

While altering the messages, it added the other ones, don't know why, have tried to fix it, but cannot seems to do it, branch is fully up to date with main branch and number of files changes makes sense in terms of the PR.

If you @ppatierno or @scholzj also does not have any idea, the only solution I see is to create a new branch and diff all changes and create a new PR.

I used the suggestions by the DCO action.

@scholzj
Copy link
Member

scholzj commented Apr 2, 2024

I don't know what happened ...

I guess Git happened ... I think that all of us had this kind of issue in the past.

If you @ppatierno or @scholzj also does not have any idea, the only solution I see is to create a new branch and diff all changes and create a new PR.

I think creating a new PR from another branch might be an option. I think that would be completely fine. Alternatively, I think if you simply squash all of the commits and fix the commit message, it might work as well (but make sure to have a backup copy before you do that :-o)

@ppatierno
Copy link
Member

I am fine with whatever is the best solution for you to pick from ... 1. new PR or 2. squashing and leaving one single commit message which makes sense for the overall PR

…board as well as alert example.

Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
@steffen-karlsson steffen-karlsson force-pushed the 3761-metric-certificates-expiration branch from ccb444b to fa4b46b Compare April 2, 2024 17:32
Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
@steffen-karlsson
Copy link
Contributor Author

steffen-karlsson commented Apr 2, 2024

Squashed it all into one commit @ppatierno, DCO is green and tests will be soon :)

Only missing last comment from @scholzj regarding KafkaAssemblyOperator.removeMetrics test, other comments are fixed and handled.

Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, two more things ... one is my fault and one isn't ...

steffen-karlsson and others added 2 commits April 3, 2024 06:56
…rator/assembly/KafkaAssemblyOperator.java

Co-authored-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Steffen Wirenfeldt Karlsson <steffen.karlsson@maersk.com>
Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
steffen-karlsson and others added 4 commits April 3, 2024 09:33
Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
Signed-off-by: Steffen Karlsson <steffen.karlsson@maersk.com>
Signed-off-by: maciej-tatarski <maciej.tatarski@maersk.com>
Signed-off-by: maciej-tatarski <maciej.tatarski@maersk.com>
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@scholzj
Copy link
Member

scholzj commented Apr 3, 2024

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@fvaleri
Copy link
Contributor

fvaleri commented Apr 3, 2024

Not sure if it is me, but on Grafana 7.3.7 the dashboard now looks like this. Data is correct, but "type" is hidden, and "prometheus" data source is shown.

Screenshot from 2024-04-03 12-33-28

Signed-off-by: maciej-tatarski <maciej.tatarski@maersk.com>
@maciej-tatarski
Copy link
Contributor

maciej-tatarski commented Apr 3, 2024

Not sure if it is me, but on Grafana 7.3.7 the dashboard now looks like this. Data is correct, but "type" is hidden, and "prometheus" data source is shown.

Screenshot from 2024-04-03 12-33-28

You are right, I removed aggregation and excluded some labels by transformation. I reverted it now as we don't really know which labels people will have so it is better to just aggregate by labels that are always there. @fvaleri please check now, thanks!

Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the PR!
I was waiting for the commits storm to finish :-)
I didn't run it but trust Fede and Jakub for this, but I had another pass on the code, and it makes sense to me.

@scholzj scholzj merged commit ddb588d into strimzi:main Apr 3, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants