Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS Metrics: Ensure that recorded expiries advance on credentials reload #17233

Merged
merged 2 commits into from
Mar 28, 2024

Conversation

oleiman
Copy link
Member

@oleiman oleiman commented Mar 21, 2024

What it says on the box.

Fixes #16840

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes a bug with TLS metrics where expiration timestamps would not advance on certificate reload

@oleiman oleiman requested a review from BenPope March 21, 2024 07:01
@oleiman oleiman self-assigned this Mar 21, 2024
Comment on lines 407 to 408
_cert_expiry_time = clock_type::time_point::max();
_ca_expiry_time = clock_type::time_point::max();
Copy link
Member

@BenPope BenPope Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should reset() really be setting these to max when !certs_info.has_value() || !ts_info.has_value()?

Should that check also ensure they are not empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably they should be cleared down to something nominally invalid in either case, right? Since we're monitoring a file, I assumed whatever state we previously had in the probe should be 100% junk, irrespective of whether the new creds are valid, empty, etc.

That said, it's quite possible/likely that I misunderstand the common case for administering TLS certificates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that min would be a safer number than max; I'd probably have a check that the cert doesn't expire in the next week or so, and max would satisfy that. I'm assuming here that when the callback is called we are expecting valid certificates. But maybe it means that TLS certificates have been turned off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a check that the cert doesn't expire in the next week or so, and max would satisfy that

Not sure I follow. I was thinking about something like

  • always reset probe states to time_point::min (effectively "invalid")
  • use a couple of temporary optionals to track the min expiry in the creds
  • set probe state to the new value at the very end

Does that cover our bases? Is there an edge case I'm missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took sort of a hard left turn - trying to make it harder to represent invalid states. Very much open to suggestions here.

Previously, on certificate load, expiry times would be overwritten
iff the expiry of the new cert was sooner than the old stored expiry.
This is a bug. Each time credentials are reloaded, the recorded expiry
should always reflect the soonest expiring cert in the _current_
credential set, irrespective of whatever the probe stored previously.

To that end, this commit refactors tls_certificate_probe such that:
- The default state is "no cert"/"no CA"
- The default expiry time for a cert/CA is time_point::min
- Certificate state is cleared immediately and unconditionally
  at load time.

With these changes, probe state when `!_cert_loaded` can be considered
"junk" but should produce correct semantics for all metrics.
@oleiman
Copy link
Member Author

oleiman commented Mar 21, 2024

force push heavier handed refactor in response to CR comments.

@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/46589#018e6333-1d8e-48f3-a40e-d30571821ee7:

"rptest.tests.control_character_flag_test.ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2.initial_version=.23.1.1"

@oleiman
Copy link
Member Author

oleiman commented Mar 22, 2024

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@oleiman oleiman requested a review from BenPope March 26, 2024 17:11
@michael-redpanda michael-redpanda merged commit 0221db6 into redpanda-data:dev Mar 28, 2024
18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

net/probes: Certificate expiry cannot advance on reload
4 participants