Metrics for transform logging #16566

Merged
merged 3 commits into from
Mar 5, 2024

@oleiman oleiman commented Feb 9, 2024

This PR introduces some metrics for transform logging:

logger_probe for tracking metrics specific to individual transform
loggers:

  • data_transforms_logger_events_total
    • Total # of log events emitted by some transform.
  • data_transforms_logger_events_dropped_total
    • Total # of some transform's log events that were dropped due
      to buffer capacity constraint.
  • exported to BOTH /metrics and /public_metrics

manager_probe for tracking metrics generic to the logging::manager:

  • data_transforms_log_manager_buffer_usage_ratio
    • Current occupancy of the logging::manager's queues as a fraction
      of total capacity. [0.0..1.0]
  • data_transforms_log_manager_write_errors_total
    • Total number of failures to produce log events to the transform
      logs topic.
  • exported ONLY to /metrics
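
The export split described above can be checked against scraped endpoint output. The following is a minimal sketch, assuming both endpoints return Prometheus text-format exposition. The helpers `parse_metric_names` and `exported_on`, the sample payloads, and the `prefix_` name prefix are hypothetical and are not taken from this PR.

```python
def parse_metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def exported_on(metric_suffix: str, exposition_text: str) -> bool:
    """True if some metric whose name ends with `metric_suffix` is present."""
    return any(
        name.endswith(metric_suffix)
        for name in parse_metric_names(exposition_text)
    )


# Hypothetical payloads standing in for /metrics and /public_metrics.
INTERNAL_SAMPLE = """\
# HELP prefix_data_transforms_logger_events_total hypothetical help text
prefix_data_transforms_logger_events_total{function_name="demo",shard="0"} 12
prefix_data_transforms_log_manager_write_errors_total{shard="0"} 0
"""
PUBLIC_SAMPLE = """\
prefix_data_transforms_logger_events_total{function_name="demo"} 12
"""
```

Matching on a suffix keeps the check independent of whatever prefix each endpoint applies to metric names.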

Closes https://github.com/redpanda-data/core-internal/issues/1059

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Features

  • Add Prometheus metrics for data transforms logging

@oleiman oleiman self-assigned this Feb 9, 2024
@github-actions github-actions bot added area/redpanda area/wasm WASM Data Transforms labels Feb 9, 2024
@oleiman oleiman marked this pull request as ready for review February 9, 2024 21:33
@oleiman oleiman requested a review from a team as a code owner February 9, 2024 21:33
@oleiman oleiman requested review from rpdevmp and removed request for a team February 9, 2024 21:33
@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 9, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/44956#018d9005-98cf-4418-ad7f-2e328d86ee50:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/45064#018daf88-57ac-4ba6-8598-bce02b094d94:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/45302#018dd6af-b3df-49a3-bf18-a1d86ccef4d9:

"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_values"

@oleiman oleiman requested review from dotnwat, rockwotj and michael-redpanda and removed request for dotnwat, rockwotj and michael-redpanda February 10, 2024 00:12
@oleiman oleiman force-pushed the xfm-logging/probe branch 2 times, most recently from 3d22de7 to 182f8f7 Compare February 10, 2024 01:07
@oleiman oleiman marked this pull request as draft February 10, 2024 01:19
Contributor

@rockwotj rockwotj left a comment

Metrics look good. I am a little worried about the brittleness of the tests

src/v/transform/logging/log_manager.cc (outdated, resolved)
src/v/transform/logging/log_manager.h (outdated, resolved)
tests/rptest/tests/data_transforms_test.py (outdated, resolved)
tests/rptest/tests/data_transforms_test.py (outdated, resolved)
tests/rptest/tests/data_transforms_test.py (outdated, resolved)
tests/rptest/tests/data_transforms_test.py (outdated, resolved)
)

self.logger.debug(
"Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"
Contributor

We can still get LWM with retries in transforms right? Let's be extra careful.

Member Author

Yeah, I hadn't really taken that into account. Will do a bit of rework. Incidentally, I've run these on the order of 100s of times w/o issue. I wonder whether there's a way to encourage or trigger retries in the test?

Contributor

Run the stress utility while ducktape is running? CI is not as stable an environment as a local machine, but yeah, I have had the same concerns. I think the perf team was looking into what CPUs our tests run on to try and track this.

Contributor

Is this comment/log still valid?

Member Author

Sort of. We still perform the test, but the log is not very informative.

tests/rptest/tests/data_transforms_test.py (outdated, resolved)
auto group_name = prometheus_sanitize::metrics_name(
"data_transforms_logger");

if (!config::shard_local_cfg().disable_metrics()) {
Contributor

Do we really need all this song and dance? If so, we really should clean this up and force consistency... (nothing you need to do)

Member Author

Yeah, I mean this is the "standard" pattern I think. Very tedious

@oleiman oleiman force-pushed the xfm-logging/probe branch 2 times, most recently from df02347 to 7e76110 Compare February 11, 2024 00:29
@oleiman
Member Author

oleiman commented Feb 11, 2024

force push contents:

  • remove unnecessary check on btree_map::emplace result at probe construction
  • try to account for the possibility of transform retries in tests
  • typos, etc.

@oleiman
Member Author

oleiman commented Feb 11, 2024

/ci-repeat 5
skip-unit
dt-repeat=50
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

@oleiman
Member Author

oleiman commented Feb 11, 2024

/dt

@oleiman oleiman marked this pull request as ready for review February 11, 2024 00:57
@oleiman oleiman marked this pull request as draft February 11, 2024 01:04
@oleiman
Member Author

oleiman commented Feb 16, 2024

/ci-repeat 1

@ivotron ivotron removed the request for review from rpdevmp February 20, 2024 18:17
src/v/transform/logging/probes.h (outdated, resolved)
src/v/transform/logging/probes.cc (outdated, resolved)
src/v/transform/logging/probes.cc (outdated, resolved)

if (!config::shard_local_cfg().disable_public_metrics()) {
const auto aggregate_labels
= config::shard_local_cfg().aggregate_metrics()
Member

public metrics don't do aggregation in general.

Member Author

To be clear, we still want public metrics aggregating on ss::metrics::shard_label in the general case (i.e. irrespective of cluster config), right?

Member

Right yes that's correct
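
The point agreed above, that public metrics aggregate on `ss::metrics::shard_label` irrespective of cluster config, amounts to summing per-shard samples into one series per remaining label set. The sketch below illustrates that in Python rather than in the C++ registration code under review; `aggregate_away` and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values after removing the labels named in `drop_labels`.

    `samples` is an iterable of (labels_dict, value) pairs. The result maps
    a sorted tuple of the remaining (label, value) pairs to the summed value.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(
            sorted((k, v) for k, v in labels.items() if k not in drop_labels)
        )
        totals[key] += value
    return dict(totals)


# Hypothetical per-shard counter samples for two transforms.
per_shard = [
    ({"function_name": "a", "shard": "0"}, 3.0),
    ({"function_name": "a", "shard": "1"}, 4.0),
    ({"function_name": "b", "shard": "0"}, 1.0),
]
public_view = aggregate_away(per_shard, {"shard"})
```

Here `public_view` holds one series per `function_name`, with the shard dimension summed out.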

src/v/transform/logging/probes.cc (outdated, resolved)
namespace sm = ss::metrics;

const auto name_label = sm::label("transform_name");
const std::vector<sm::label_instance> labels = {
Member

Just to confirm some previous discussion, this is fairly limited right? Max 16 per cluster?

Member Author

I'm not aware of a limit. Which previous discussion?

Member

There was a discussion with @rockwotj about the max cardinality we can expect here, as we don't want to have unbounded cardinality in metrics (or anything that scales with large N).

Though you are aggregating the transform name label away anyway so it's not much of an issue.

Contributor

The default number of transforms you can have is 10, I expect the upper number of transforms in a cluster to be low hundreds in the extreme case.
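
To put rough numbers on the cardinality concern, the series count per node can be estimated from the transform count mentioned above. This is a back-of-the-envelope sketch: the shard count of 8 and the figure of 300 transforms (standing in for "low hundreds") are assumptions, `series_count` is a hypothetical helper, and the two per-transform metrics are the logger counters listed in the PR description.

```python
def series_count(num_transforms, num_shards, metrics_per_transform,
                 aggregate_shards):
    """Estimate time series per node for per-transform metrics."""
    # Aggregating across shards collapses the shard dimension to one series.
    shard_factor = 1 if aggregate_shards else num_shards
    return num_transforms * shard_factor * metrics_per_transform


LOGGER_METRICS = 2  # events_total and events_dropped_total
ASSUMED_SHARDS = 8  # hypothetical core count

default_per_shard = series_count(10, ASSUMED_SHARDS, LOGGER_METRICS, False)   # 160
default_aggregated = series_count(10, ASSUMED_SHARDS, LOGGER_METRICS, True)   # 20
extreme_per_shard = series_count(300, ASSUMED_SHARDS, LOGGER_METRICS, False)  # 4800
```

Even the extreme case stays bounded, because it scales with the number of deployed transforms rather than with partitions or records.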

@oleiman
Member Author

oleiman commented Feb 22, 2024

force push contents:

  • change 'transform_name' label to 'function_name' (to match other transform metrics as in transform/probe.cc)
  • Use convenient add_group interface for internal metrics and remove aggregation config checks
  • redundant/dead code

@oleiman
Member Author

oleiman commented Feb 23, 2024

/ci-repeat 5
skip-unit
skip-redpanda-build
dt-repeat=50
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

@oleiman
Member Author

oleiman commented Feb 23, 2024

force push contents:

  • Move probe deinit to log_manager::stop. As a consequence, move some other member init/deinit to start/stop. @BenPope - good call out in standup.

@oleiman
Member Author

oleiman commented Feb 23, 2024

force push contents: fix broken unit test (log_manager::stop should be idempotent)
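
The requirement that `log_manager::stop` be idempotent, combined with registering the probe in `start`, can be pictured as follows. This is a Python sketch of the pattern only, not the C++ code in this PR; `MetricRegistry`, `LogManager`, and the group name are hypothetical stand-ins.

```python
class MetricRegistry:
    """Toy registry that rejects duplicate registrations."""

    def __init__(self):
        self._groups = set()

    def register(self, name):
        if name in self._groups:
            raise RuntimeError(f"duplicate metric registration: {name}")
        self._groups.add(name)

    def unregister(self, name):
        self._groups.discard(name)


class LogManager:
    """Registers its probe on start() and tears it down on stop()."""

    GROUP = "log_manager"  # hypothetical group name

    def __init__(self, registry):
        self._registry = registry
        self._started = False

    def start(self):
        self._registry.register(self.GROUP)
        self._started = True

    def stop(self):
        # Idempotent: a second stop() (or stop() before start()) is a no-op.
        if not self._started:
            return
        self._registry.unregister(self.GROUP)
        self._started = False
```

Tying registration to `start` and deregistration to `stop` means a restarted manager does not collide with its own earlier registration.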

@oleiman
Member Author

oleiman commented Feb 26, 2024

/cdt
num_nodes=5
dt-repeat=10
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

src/v/transform/logging/probes.h (outdated, resolved)
src/v/transform/logging/probes.h (outdated, resolved)
logger_probe for tracking metrics specific to individual transform
loggers:
  - data_transforms_logger_events_total
    - Total # of log events emitted by some transform.
  - data_transforms_logger_events_dropped_total
    - Total # of some transform's log events that were dropped due
      to buffer capacity constraint.
  - exported to BOTH /metrics and /public_metrics

manager_probe for tracking metrics generic to the logging::manager:
  - data_transforms_log_manager_buffer_usage_ratio
    - Current occupancy of the logging::manager's queues as a fraction
      of total capacity. [0.0..1.0]
  - data_transforms_log_manager_write_errors_total
    - Total number of failures to produce log events to the transform
      logs topic.
  - exported ONLY to /metrics

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
One manager_probe per manager instance, initialized on manager::start

One logger_probe per log source, initialized on first manager::enqueue_log

Also moves some init/deinit logic to log_manager::{start,stop} to avoid
duplicate metric registrations

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
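
The commit message above describes one `logger_probe` per log source, created on the first `manager::enqueue_log`. A minimal Python sketch of that lazy-initialization pattern follows. The class shapes, the drop policy, and the choice to count dropped events in `events_total` are assumptions made for illustration, not the behavior of the C++ implementation.

```python
class LoggerProbe:
    """Per-source counters, mirroring the two logger metrics."""

    def __init__(self, source_name):
        self.source_name = source_name
        self.events_total = 0
        self.events_dropped_total = 0


class Manager:
    def __init__(self, capacity):
        self._capacity = capacity
        self._queue = []
        self._probes = {}  # source name -> LoggerProbe, filled lazily

    def enqueue_log(self, source_name, message):
        # Create the probe the first time a given source logs.
        if source_name not in self._probes:
            self._probes[source_name] = LoggerProbe(source_name)
        probe = self._probes[source_name]
        probe.events_total += 1  # assumption: dropped events count too
        if len(self._queue) >= self._capacity:
            probe.events_dropped_total += 1
            return False
        self._queue.append((source_name, message))
        return True

    def buffer_usage_ratio(self):
        """Occupancy as a fraction of capacity, in [0.0, 1.0]."""
        return len(self._queue) / self._capacity

    def probe_for(self, source_name):
        return self._probes.get(source_name)
```
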
@oleiman
Member Author

oleiman commented Feb 28, 2024

force push uint32_t -> uint64_t for counters
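
For context on the counter width change: unsigned integer arithmetic in C++ wraps modulo 2^N, so a 32-bit counter returns to zero after 2^32 increments. The sketch below models that in Python; the helper names and the rate of 100,000 events per second are hypothetical.

```python
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


def add_wrapping(counter, increment, max_value):
    """Add with wraparound, as a fixed-width unsigned integer would."""
    return (counter + increment) % (max_value + 1)


def seconds_until_wrap(max_value, events_per_second):
    """Time for a counter starting at zero to wrap at a steady rate."""
    return (max_value + 1) / events_per_second


wrapped_32 = add_wrapping(UINT32_MAX, 1, UINT32_MAX)  # wraps to 0
wrapped_64 = add_wrapping(UINT32_MAX, 1, UINT64_MAX)  # 2**32, no wrap

ASSUMED_RATE = 100_000  # hypothetical events per second
hours_32 = seconds_until_wrap(UINT32_MAX, ASSUMED_RATE) / 3600  # about 11.9
```

At that assumed rate a 32-bit counter would wrap in under half a day, while a 64-bit counter would take millions of years.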

rockwotj
rockwotj previously approved these changes Mar 4, 2024
Contributor

@rockwotj rockwotj left a comment

Looks good - a couple of comments on the tests to double check that they are robust.

tests/rptest/tests/data_transforms_test.py (outdated, resolved)
)

self.logger.debug(
"Produce enough data to make a noticeable dent but not so much as to trip the buffer LWM"
Contributor

Is this comment/log still valid?

Comment on lines +893 to +891
assert any(
bu > 0.0 for bu in all_nodes_usage
), f"Expected some non-zero buffer usage, got {all_nodes_usage}"
Contributor

Is this racy in that all the logs could be flushed before we query the metrics?

Member Author

I don't think so. It is timing dependent, but the flush interval is configured to 1h at the top of the test. Or are you concerned that we're racing against the config binding watcher?
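
If timing did turn out to matter, for example if the flush-interval change took effect later than expected, one option would be to poll for the condition instead of asserting once. The following is a sketch under that assumption. `wait_until` here is a small local helper written for this example, and `read_usage_per_node` is a hypothetical callable returning one buffer-usage value per node.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1):
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


def any_buffer_in_use(read_usage_per_node):
    """True if at least one node reports non-zero buffer usage."""
    return any(bu > 0.0 for bu in read_usage_per_node())
```

Polling trades a slightly longer test for tolerance of delayed metric updates. It would not help if the buffers had already been flushed, which is the case the 1h flush interval is meant to rule out.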

- Presence of metrics on both endpoints
- Metrics values in contrived scenarios

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman
Member Author

oleiman commented Mar 5, 2024

force push import cleanup and produce even less for _values test to be safe.

@oleiman oleiman requested a review from rockwotj March 5, 2024 00:01
@oleiman oleiman merged commit c282bd1 into redpanda-data:dev Mar 5, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16566-v23.3.x-647 remotes/upstream/v23.3.x
git cherry-pick -x c31ea7f3c9adcc395e7b025dbd3c71d0c916a491 6d59f927053f8117bc14bddf273ef4b15773c3c1 b3a663798f0ce6c14d20805251ca3aa20316e808

Workflow run logs.

Labels
area/redpanda area/wasm WASM Data Transforms

5 participants