Metrics for transform logging #16566

oleiman · 2024-02-09T20:32:31Z

This PR introduces some metrics for transform logging:

logger_probe for tracking metrics specific to individual transform
loggers:

data_transforms_logger_events_total
- Total # of log events emitted by some transform.
data_transforms_logger_events_dropped_total
- Total # of some transform's log events that were dropped due
  to buffer capacity constraint.
- exported to BOTH /metrics and /public_metrics

manager_probe for tracking metrics generic to the logging::manager:

data_transforms_log_manager_buffer_usage_ratio
- Current occupancy of the logging::manager's queues as a fraction
  of total capacity. [0.0..1.0]
data_transforms_log_manager_write_errors_total
- Total number of failures to produce log events to the transform
  logs topic.
- exported ONLY to /metrics
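
As a rough illustration of the semantics described above (not the actual C++ probes), the counters and the ratio could be modeled as follows. The class and function names are hypothetical, and it is an assumption that a dropped event still counts toward the emitted total.

```python
from dataclasses import dataclass


@dataclass
class LoggerProbe:
    """Per-transform counters (hypothetical stand-in for logger_probe)."""

    events_total: int = 0
    events_dropped_total: int = 0

    def log_event(self, dropped: bool = False) -> None:
        # Assumption: every emitted event bumps the total, and an event that
        # is dropped for lack of buffer space also bumps the dropped counter.
        self.events_total += 1
        if dropped:
            self.events_dropped_total += 1


def buffer_usage_ratio(used: int, capacity: int) -> float:
    """Occupancy as a fraction of total capacity, clamped to [0.0, 1.0]."""
    if capacity <= 0:
        return 0.0
    return min(1.0, max(0.0, used / capacity))
```

Both logger counters only ever increase, while the usage ratio is a gauge that moves up and down as the queues fill and drain.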

Closes https://github.com/redpanda-data/core-internal/issues/1059

Backports Required

Release Notes

Features

Add Prometheus metrics for data transforms logging

vbotbuildovich · 2024-02-09T23:47:38Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44956#018d8ff9-26a8-437a-b7fd-c030d809c4b5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45064#018daf99-bc28-43d6-a0f4-359f521e5ff9

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45256#018dd2c1-1634-4257-b095-998374e6710d

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45310#018dd791-0a37-47f6-a260-6fa6cbb5a7fc

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45660#018e0c2f-6d7d-41a7-9e1d-91cbce916bbb

vbotbuildovich · 2024-02-09T23:56:49Z

new failures in https://buildkite.com/redpanda/redpanda/builds/44956#018d9005-98cf-4418-ad7f-2e328d86ee50:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/45064#018daf88-57ac-4ba6-8598-bce02b094d94:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/45302#018dd6af-b3df-49a3-bf18-a1d86ccef4d9:

"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_values"

oleiman · 2024-02-10T00:11:39Z

CI Failures:

[v23.3.x] CI Failure (BadLogLines: unknown exception thrown: seastar::gate_closed_exception) in SIPartitionMovementTest.test_cross_shard #16540

rockwotj

Metrics look good. I am a little worried about the brittleness of the tests

src/v/transform/logging/log_manager.cc

src/v/transform/logging/log_manager.h

tests/rptest/tests/data_transforms_test.py

rockwotj · 2024-02-10T01:47:17Z

tests/rptest/tests/data_transforms_test.py

+        )
+
+        self.logger.debug(
+            "Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"


We can still get LWM with retries in transforms right? Let's be extra careful.

Yeah, I hadn't really taken that into account. Will do a bit of rework. Incidentally, I've run these on the order of 100s of times w/o issue. I wonder whether there's a way to encourage or trigger retries in the test?

Run the stress utility while ducktape is running? CI is not as stable an environment as a local machine, but yeah, I have had the same concerns. I think the perf team was looking into which CPUs our tests run on to try to track this.

Is this comment/log still valid?

Sort of. We still perform the test, but the log is not very informative.
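
One way to reason about the retry concern in this thread is to size the test input against a worst case. The sketch below is only an illustration: the function, its parameters, and the numbers are hypothetical, and it assumes that LWM means a buffer low-water mark expressed as a fraction of capacity and that a retried record re-emits its log events.

```python
def max_safe_records(
    buffer_capacity_bytes: int,
    lwm_ratio: float,
    bytes_per_event: int,
    max_attempts: int,
) -> int:
    """Largest record count whose worst-case log volume stays below the LWM."""
    # Bytes the buffer may hold before reaching the (assumed) low-water mark.
    budget = int(buffer_capacity_bytes * lwm_ratio)
    # Assumption: each attempt at a record emits one log event of fixed size.
    worst_case_per_record = bytes_per_event * max_attempts
    # Stay strictly below the budget, hence the "- 1".
    return max(0, (budget - 1) // worst_case_per_record)
```

With a 1000-byte buffer, an LWM at 50%, 10-byte events, and up to 5 attempts per record, this yields 9 records, whose worst case of 450 bytes stays under the 500-byte mark.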

tests/rptest/tests/data_transforms_test.py

rockwotj · 2024-02-10T01:51:44Z

src/v/transform/logging/probes.cc

+    auto group_name = prometheus_sanitize::metrics_name(
+      "data_transforms_logger");
+
+    if (!config::shard_local_cfg().disable_metrics()) {


Do we really need all this song and dance? If so, we really should clean this up and force consistency (nothing you need to do).

Yeah, I mean this is the "standard" pattern I think. Very tedious

oleiman · 2024-02-11T00:31:35Z

force push contents:

remove unnecessary check on btree_map::emplace result at probe construction
try to account for the possibility of transform retries in tests
typos, etc.

oleiman · 2024-02-11T00:33:01Z

/ci-repeat 5
skip-unit
dt-repeat=50
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman · 2024-02-11T00:37:34Z

/ci-repeat 5
skip-unit
dt-repeat=50
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman · 2024-02-11T00:48:11Z

/dt

oleiman · 2024-02-16T00:34:30Z

/ci-repeat 1

src/v/transform/logging/probes.h

src/v/transform/logging/probes.cc

StephanDollberg · 2024-02-22T11:35:03Z

src/v/transform/logging/probes.cc

+
+    if (!config::shard_local_cfg().disable_public_metrics()) {
+        const auto aggregate_labels
+          = config::shard_local_cfg().aggregate_metrics()


public metrics don't do aggregation in general.

To be clear, we still want public metrics aggregating on ss::metrics::shard_label in the general case (i.e. irrespective of cluster config), right?

Right yes that's correct
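
To make "aggregating on the shard label" concrete, here is a minimal sketch of what aggregating a label away means for a set of samples. It assumes aggregation is a sum over the dropped label; the helper name and the sample data are hypothetical, and `function_name` is the label name the PR later settled on.

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values over every label listed in drop_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels identify the aggregated series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[key] += value
    return dict(totals)


# Hypothetical per-shard counter samples for two transforms.
samples = [
    ({"function_name": "a", "shard": "0"}, 3),
    ({"function_name": "a", "shard": "1"}, 4),
    ({"function_name": "b", "shard": "0"}, 1),
]
per_function = aggregate_away(samples, {"shard"})
```

Here `per_function` holds one series per transform, with the per-shard values summed.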

src/v/transform/logging/probes.cc

StephanDollberg · 2024-02-22T11:37:11Z

src/v/transform/logging/probes.cc

+    namespace sm = ss::metrics;
+
+    const auto name_label = sm::label("transform_name");
+    const std::vector<sm::label_instance> labels = {


Just to confirm some previous discussion, this is fairly limited right? Max 16 per cluster?

I'm not aware of a limit. Which previous discussion?

It was a discussion with @rockwotj about the max cardinality we can expect here, as we don't want unbounded cardinality in metrics (or anything that scales with large N).

Though you are aggregating the transform name label away anyway so it's not much of an issue.

The default number of transforms you can have is 10, I expect the upper number of transforms in a cluster to be low hundreds in the extreme case.

oleiman · 2024-02-22T20:04:48Z

force push contents:

change 'transform_name' label to 'function_name' (to match other transform metrics as in transform/probe.cc)
Use convenient add_group interface for internal metrics and remove aggregation config checks
redundant/dead code

oleiman · 2024-02-23T15:55:50Z

/ci-repeat 5
skip-unit
skip-redpanda-build
dt-repeat=50
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman · 2024-02-23T17:25:59Z

force push contents:

Move probe deinit to log_manager::stop. As a consequence, move some other member init/deinit to start/stop. @BenPope - good call out in standup.

oleiman · 2024-02-23T19:13:33Z

force push contents: fix broken unit test (log_manager::stop should be idempotent)
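
The idempotency requirement can be sketched as below. This is a simplified Python model rather than the real log_manager: the class and attribute names are hypothetical, and the probe is just a placeholder object.

```python
class Manager:
    """Toy model of a component whose stop() may be called more than once."""

    def __init__(self) -> None:
        self._probe = None
        self.teardown_count = 0

    def start(self) -> None:
        # Set up the probe only if it does not exist yet.
        if self._probe is None:
            self._probe = object()

    def stop(self) -> None:
        # A second stop() (or stop() without start()) is a no-op.
        if self._probe is None:
            return
        self._probe = None
        self.teardown_count += 1
```

Calling `stop()` twice in a row tears the probe down exactly once.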

oleiman · 2024-02-26T18:59:56Z

/cdt
num_nodes=5
dt-repeat=10
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

src/v/transform/logging/probes.h

logger_probe for tracking metrics specific to individual transform loggers:

- data_transforms_logger_events_total
  - Total # of log events emitted by some transform.
- data_transforms_logger_events_dropped_total
  - Total # of some transform's log events that were dropped due to buffer capacity constraint.
- exported to BOTH /metrics and /public_metrics

manager_probe for tracking metrics generic to the logging::manager:

- data_transforms_log_manager_buffer_usage_ratio
  - Current occupancy of the logging::manager's queues as a fraction of total capacity. [0.0..1.0]
- data_transforms_log_manager_write_errors_total
  - Total number of failures to produce log events to the transform logs topic.
- exported ONLY to /metrics

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

One manager_probe per manager instance, initialized on manager::start
One logger_probe per log source, initialized on first manager::enqueue_log

Also moves some init/deinit logic to log_manager::{start,stop} to avoid duplicate metric registrations

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
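
The lifecycle in this commit message (one manager probe set up at start, one logger probe created lazily per log source, everything torn down at stop) might look roughly like this. It is a toy model with hypothetical names; the registry stands in for whatever rejects duplicate metric registrations in the real code.

```python
class MetricsRegistry:
    """Toy registry that rejects duplicate metric group registrations."""

    def __init__(self) -> None:
        self._groups = set()

    def register(self, group: str) -> None:
        if group in self._groups:
            raise ValueError(f"duplicate metric registration: {group}")
        self._groups.add(group)

    def unregister(self, group: str) -> None:
        self._groups.discard(group)


class LogManager:
    def __init__(self, registry: MetricsRegistry) -> None:
        self._registry = registry
        self._logger_probes = {}

    def start(self) -> None:
        # One manager-level probe per manager instance.
        self._registry.register("log_manager")

    def enqueue_log(self, source: str) -> None:
        # The logger probe for a source is created on its first log event.
        if source not in self._logger_probes:
            self._registry.register(f"logger:{source}")
            self._logger_probes[source] = {"events_total": 0}
        self._logger_probes[source]["events_total"] += 1

    def stop(self) -> None:
        # Deregister everything so a later start() does not collide.
        for source in self._logger_probes:
            self._registry.unregister(f"logger:{source}")
        self._logger_probes.clear()
        self._registry.unregister("log_manager")
```

Because teardown happens in `stop()`, a start/stop/start cycle on the same registry does not trip the duplicate check.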

oleiman · 2024-02-28T05:04:42Z

force push: uint32_t -> uint64_t for counters
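
For context on why counter width matters, the following sketch emulates fixed-width unsigned arithmetic in Python. The helper is hypothetical and only models wraparound; it says nothing about how the real counters are incremented.

```python
U32_MASK = (1 << 32) - 1
U64_MASK = (1 << 64) - 1


def wrapping_add(value: int, delta: int, mask: int) -> int:
    """Add delta to value with unsigned wraparound at the width given by mask."""
    return (value + delta) & mask


# A 32-bit counter at its maximum wraps to zero on the next event,
# while a 64-bit counter keeps counting.
wrapped_32 = wrapping_add(U32_MASK, 1, U32_MASK)
wrapped_64 = wrapping_add(U32_MASK, 1, U64_MASK)
```

A monotonically increasing total that wraps to zero would look like a counter reset to anything scraping it, which a 64-bit width makes far less likely.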

rockwotj

Looks good - a couple of comments on the tests to double check that they are robust.

tests/rptest/tests/data_transforms_test.py

rockwotj · 2024-03-04T16:08:08Z

tests/rptest/tests/data_transforms_test.py

+        )
+
+        self.logger.debug(
+            "Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"


Is this comment/log still valid?

rockwotj · 2024-03-04T16:09:18Z

tests/rptest/tests/data_transforms_test.py

+        assert any(
+            bu > 0.0 for bu in all_nodes_usage
+        ), f"Expected some non-zero buffer usage, got {all_nodes_usage}"


Is this racy in that all the logs could be flushed before we query the metrics?

I don't think so. It is timing dependent, but the flush interval is configured to 1h at the top of the test. Or are you concerned that we're racing against the config binding watcher?
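
For readers unfamiliar with how such a check reads the gauge, here is a minimal sketch that pulls values for one metric out of Prometheus text exposition output. The helper and the sample text are hypothetical, the parser assumes lines carry no trailing timestamp, and the metric name as actually exposed may carry an additional prefix.

```python
from typing import List


def read_metric(exposition: str, metric: str) -> List[float]:
    """Return every sample value for `metric` in Prometheus text format."""
    values = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value follows the last space.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric:
            values.append(float(value))
    return values


# Hypothetical scrape output from one node.
sample = """\
# TYPE data_transforms_log_manager_buffer_usage_ratio gauge
data_transforms_log_manager_buffer_usage_ratio{shard="0"} 0.0
data_transforms_log_manager_buffer_usage_ratio{shard="1"} 0.125
other_metric 5
"""
usage = read_metric(sample, "data_transforms_log_manager_buffer_usage_ratio")
```

An `any(v > 0.0 for v in usage)` check over such values only holds while the buffered logs have not been flushed, which is why the long flush interval matters here.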

oleiman · 2024-03-05T00:01:25Z

force push: import cleanup, and produce even less data in the _values test to be safe.

vbotbuildovich · 2024-03-05T05:52:44Z

/backport v23.3.x

vbotbuildovich · 2024-03-05T05:53:36Z

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16566-v23.3.x-647 remotes/upstream/v23.3.x
git cherry-pick -x c31ea7f3c9adcc395e7b025dbd3c71d0c916a491 6d59f927053f8117bc14bddf273ef4b15773c3c1 b3a663798f0ce6c14d20805251ca3aa20316e808

Workflow run logs.

oleiman self-assigned this Feb 9, 2024

github-actions bot added area/redpanda area/wasm WASM Data Transforms labels Feb 9, 2024

oleiman marked this pull request as ready for review February 9, 2024 21:33

oleiman requested a review from a team as a code owner February 9, 2024 21:33

oleiman requested review from rpdevmp and removed request for a team February 9, 2024 21:33

oleiman requested review from dotnwat, rockwotj and michael-redpanda and removed request for dotnwat, rockwotj and michael-redpanda February 10, 2024 00:12

oleiman force-pushed the xfm-logging/probe branch 2 times, most recently from 3d22de7 to 182f8f7 Compare February 10, 2024 01:07

oleiman marked this pull request as draft February 10, 2024 01:19

oleiman force-pushed the xfm-logging/probe branch from 182f8f7 to 22dc987 Compare February 10, 2024 01:20

rockwotj reviewed Feb 10, 2024

oleiman force-pushed the xfm-logging/probe branch 2 times, most recently from df02347 to 7e76110 Compare February 11, 2024 00:29

oleiman marked this pull request as ready for review February 11, 2024 00:57

oleiman marked this pull request as draft February 11, 2024 01:04

oleiman requested a review from graphcareful February 20, 2024 16:05

ivotron removed the request for review from rpdevmp February 20, 2024 18:17

StephanDollberg reviewed Feb 22, 2024

oleiman force-pushed the xfm-logging/probe branch from 7e76110 to e9977ba Compare February 22, 2024 20:02

StephanDollberg previously approved these changes Feb 23, 2024

oleiman dismissed StephanDollberg’s stale review via 44b5afd February 23, 2024 17:24

oleiman force-pushed the xfm-logging/probe branch from e9977ba to 44b5afd Compare February 23, 2024 17:24

oleiman force-pushed the xfm-logging/probe branch from 44b5afd to 8a3e59d Compare February 23, 2024 19:11

dotnwat reviewed Feb 28, 2024

oleiman added 2 commits February 27, 2024 21:03

oleiman force-pushed the xfm-logging/probe branch from 8a3e59d to 6cd34d9 Compare February 28, 2024 05:04

rockwotj previously approved these changes Mar 4, 2024

dt: Integration tests for transform::logging probes

b3a6637

- Presence of metrics on both endpoints
- Metrics values in contrived scenarios

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman dismissed rockwotj’s stale review via b3a6637 March 5, 2024 00:00

oleiman force-pushed the xfm-logging/probe branch from 6cd34d9 to b3a6637 Compare March 5, 2024 00:00

oleiman requested a review from rockwotj March 5, 2024 00:01

rockwotj approved these changes Mar 5, 2024

oleiman merged commit c282bd1 into redpanda-data:dev Mar 5, 2024
17 checks passed

vbotbuildovich mentioned this pull request Mar 5, 2024

[v23.3.x] Metrics for transform logging #16895

Closed

oleiman mentioned this pull request Mar 6, 2024

[v23.3.x] Metrics for transform logging #16913

Merged
