Metrics for transform logging #16566
Conversation
ducktape was retried in:
- https://buildkite.com/redpanda/redpanda/builds/44956#018d8ff9-26a8-437a-b7fd-c030d809c4b5
- https://buildkite.com/redpanda/redpanda/builds/45064#018daf99-bc28-43d6-a0f4-359f521e5ff9
- https://buildkite.com/redpanda/redpanda/builds/45256#018dd2c1-1634-4257-b095-998374e6710d
- https://buildkite.com/redpanda/redpanda/builds/45310#018dd791-0a37-47f6-a260-6fa6cbb5a7fc
- https://buildkite.com/redpanda/redpanda/builds/45660#018e0c2f-6d7d-41a7-9e1d-91cbce916bbb
new failures in https://buildkite.com/redpanda/redpanda/builds/44956#018d9005-98cf-4418-ad7f-2e328d86ee50:
new failures in https://buildkite.com/redpanda/redpanda/builds/45064#018daf88-57ac-4ba6-8598-bce02b094d94:
new failures in https://buildkite.com/redpanda/redpanda/builds/45302#018dd6af-b3df-49a3-bf18-a1d86ccef4d9:
Force-pushed from 3d22de7 to 182f8f7 (compare)
Force-pushed from 182f8f7 to 22dc987 (compare)
Metrics look good. I am a little worried about the brittleness of the tests.
```python
)

self.logger.debug(
    "Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"
```
We can still get LWM with retries in transforms right? Let's be extra careful.
Yeah, I hadn't really taken that into account. I'll do a bit of rework. Incidentally, I've run these on the order of hundreds of times without issue. I wonder whether there's a way to encourage or trigger retries in the test?
Run the stress utility while ducktape is running? CI is not as stable an environment as running locally, but yeah, I have had the same concerns. I think the perf team was looking into which CPUs our tests run on to try to track this down.
Is this comment/log still valid?
Sort of. We still perform the test, but the log is not very informative.
```cpp
auto group_name = prometheus_sanitize::metrics_name(
  "data_transforms_logger");

if (!config::shard_local_cfg().disable_metrics()) {
```
Do we really need all this song and dance? If so, we really should clean this up and force consistency... (nothing you need to do)
Yeah, I mean this is the "standard" pattern I think. Very tedious
Force-pushed from df02347 to 7e76110 (compare)
force push contents:
/ci-repeat 5
1 similar comment
/ci-repeat 5
/dt
/ci-repeat 1
src/v/transform/logging/probes.cc (outdated)

```cpp
if (!config::shard_local_cfg().disable_public_metrics()) {
    const auto aggregate_labels
      = config::shard_local_cfg().aggregate_metrics()
```
public metrics don't do aggregation in general.
To be clear, we still want public metrics aggregating on ss::metrics::shard_label
in the general case (i.e. irrespective of cluster config), right?
Right yes that's correct
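For intuition on what "aggregating on `ss::metrics::shard_label`" means, here is a toy Python sketch (purely illustrative, not Redpanda or Seastar code): per-shard samples of the same metric collapse into a single series, summed, once the shard label is dropped.

```python
from collections import defaultdict

# Hypothetical per-shard samples: (metric_name, labels, value).
samples = [
    ("events_total", {"shard": "0", "transform_name": "t1"}, 5),
    ("events_total", {"shard": "1", "transform_name": "t1"}, 7),
    ("events_total", {"shard": "0", "transform_name": "t2"}, 3),
]

def aggregate(samples, drop_labels):
    """Sum samples whose remaining labels match after dropping drop_labels."""
    out = defaultdict(int)
    for name, labels, value in samples:
        key = (name, tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels)))
        out[key] += value
    return dict(out)

agg = aggregate(samples, drop_labels={"shard"})
# t1's two per-shard samples collapse into one series with value 12.
```

In the general case discussed above, public metrics always drop the shard label this way, independent of the `aggregate_metrics` cluster config.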
```cpp
namespace sm = ss::metrics;

const auto name_label = sm::label("transform_name");
const std::vector<sm::label_instance> labels = {
```
Just to confirm some previous discussion, this is fairly limited right? Max 16 per cluster?
I'm not aware of a limit. Which previous discussion?
It was a discussion with @rockwotj about the max cardinality we can expect here, since we don't want unbounded cardinality in metrics (or anything that scales with large N).
Though you are aggregating the transform name label away anyway, so it's not much of an issue.
The default number of transforms you can have is 10, I expect the upper number of transforms in a cluster to be low hundreds in the extreme case.
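The cardinality concern above can be made concrete with a small sketch (hypothetical helper, not project code): without aggregation, series count scales with the number of transforms; with the name label aggregated away, it stays flat.

```python
def series_count(n_transforms: int, n_metrics: int,
                 aggregate_name_label: bool) -> int:
    # Each transform contributes its own labelled series per metric,
    # unless the transform_name label is aggregated away, in which case
    # cardinality is independent of the transform count.
    return n_metrics if aggregate_name_label else n_metrics * n_transforms

# Default cap of 10 transforms, two logger metrics:
series_count(10, 2, aggregate_name_label=False)   # 20 series
# Extreme case of a few hundred transforms, aggregated:
series_count(300, 2, aggregate_name_label=True)   # still 2 series
```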
Force-pushed from 7e76110 to e9977ba (compare)
force push contents:
/ci-repeat 5
Force-pushed from e9977ba to 44b5afd (compare)
force push contents:
Force-pushed from 44b5afd to 8a3e59d (compare)
force push contents: fix broken unit test
/cdt
logger_probe for tracking metrics specific to individual transform loggers: - data_transforms_logger_events_total - Total # of log events emitted by some transform. - data_transforms_logger_events_dropped_total - Total # of some transform's log events that were dropped due to buffer capacity constraint. - exported to BOTH /metrics and /public_metrics manager_probe for tracking metrics generic to the logging::manager: - data_transforms_log_manager_buffer_usage_ratio - Current occupancy of the logging::manager's queues as a fraction of total capacity. [0.0..1.0] - data_transforms_log_manager_write_errors_total - Total number of failures to produce log events to the transform logs topic. - exported ONLY to /metrics Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
One manager_probe per manager instance, initialized on manager::start.
One logger_probe per log source, initialized on first manager::enqueue_log.

Also moves some init/deinit logic to log_manager::{start,stop} to avoid duplicate metric registrations.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Force-pushed from 8a3e59d to 6cd34d9 (compare)
force push
Looks good. A couple of comments on the tests to double-check that they are robust.
```python
)

self.logger.debug(
    "Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"
```
Is this comment/log still valid?
```python
assert any(
    bu > 0.0 for bu in all_nodes_usage
), f"Expected some non-zero buffer usage, got {all_nodes_usage}"
```
Is this racy in that all the logs could be flushed before we query the metrics?
I don't think so. It is timing dependent, but the flush interval is configured to 1h at the top of the test. Or are you concerned that we're racing against the config binding watcher?
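The timing argument can be illustrated with a toy model (not the ducktape test itself): if entries only leave the buffer on flush, and the flush interval is configured far beyond the test's duration, then usage observed between produce and flush is reliably non-zero rather than racy.

```python
class ToyLogBuffer:
    """Toy model of a flush-on-interval log buffer. Illustrative only:
    entries accumulate until an explicit flush, so with a very long
    flush interval (e.g. 1h) a usage query after enqueue but before
    any flush always sees non-zero occupancy."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: list = []

    def enqueue(self, entry) -> None:
        # Drop on overflow, mirroring a bounded buffer.
        if len(self.entries) < self.capacity:
            self.entries.append(entry)

    def usage_ratio(self) -> float:
        return len(self.entries) / self.capacity

    def flush(self) -> list:
        flushed, self.entries = self.entries, []
        return flushed

buf = ToyLogBuffer(capacity=100)
for i in range(10):
    buf.enqueue(f"log-{i}")
# Queried before any flush fires, usage is deterministically non-zero.
assert buf.usage_ratio() > 0.0
```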
- Presence of metrics on both endpoints
- Metrics values in contrived scenarios

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Force-pushed from 6cd34d9 to b3a6637 (compare)
force push contents: import cleanup and produce even less for
/backport v23.3.x

Failed to create a backport PR to v23.3.x branch. I tried:
This PR introduces some metrics for transform logging:

logger_probe for tracking metrics specific to individual transform loggers:
- data_transforms_logger_events_total - Total # of log events emitted by some transform.
- data_transforms_logger_events_dropped_total - Total # of some transform's log events that were dropped due to buffer capacity constraints.

manager_probe for tracking metrics generic to the logging::manager:
- data_transforms_log_manager_buffer_usage_ratio - Current occupancy of the logging::manager's queues as a fraction of total capacity. [0.0..1.0]
- data_transforms_log_manager_write_errors_total - Total number of failures to produce log events to the transform logs topic.
Closes https://github.com/redpanda-data/core-internal/issues/1059
Backports Required
Release Notes
Features