metrics: fix kafka_max_offset on read replicas #16263

Merged
merged 2 commits into from
Jan 24, 2024

@andrwng andrwng commented Jan 24, 2024

Read replicas previously computed `redpanda_kafka_max_offset` for the
metrics endpoint by translating their own local log offsets. This
differs from how read replicas calculate the high watermark (HWM) for the
Kafka endpoint, which reads it directly from cloud storage.

This change adds the same read replica check to the metrics path that the Kafka layer already has.
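
A minimal sketch of the idea behind that check, written in Python for illustration only. The `PartitionState` type, its fields, and `reported_max_offset` are hypothetical names invented here; they are not Redpanda's actual types or functions.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical stand-in for a partition; not a Redpanda type."""
    is_read_replica: bool
    local_max_offset: int  # offset translated from the replica's own local log
    cloud_max_offset: int  # offset derived from cloud storage


def reported_max_offset(p: PartitionState) -> int:
    """Pick the offset source the same way for metrics as for Kafka responses."""
    # Assumption for this sketch: a read replica should report the
    # cloud-derived offset, so the metric agrees with the Kafka endpoint.
    if p.is_read_replica:
        return p.cloud_max_offset
    return p.local_max_offset
```

With this shape, a read replica whose local log offset and cloud-derived offset disagree reports the cloud-derived value, while a regular partition keeps reporting its local value.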

Also fixes the metric_sum filtering in ducktape, as this was required for the included test.

Fixes #16259

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Fixes a bug that caused read replicas to report the wrong value for the `redpanda_kafka_max_offset` metric.

@andrwng andrwng self-assigned this Jan 24, 2024
@piyushredpanda piyushredpanda added this to the v23.2.24 milestone Jan 24, 2024
@andrwng andrwng force-pushed the cloud-storage-rrr-metric-hwm branch from 6fb81bb to 7a09ddd Compare January 24, 2024 01:01
vbotbuildovich commented Jan 24, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d392c-7c97-49bd-8552-274b3f1dc481:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_follower_fetching_with_maintenance_mode"

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d392c-7c9e-4bae-83be-9987280a72a2:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_basic_follower_fetching.read_from_object_store=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d392c-7c94-4fa9-b4b1-50954e766635:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_basic_follower_fetching.read_from_object_store=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d393d-6bea-43fc-91fa-3ddbced39eb6:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_follower_fetching_with_maintenance_mode"

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d393d-6bf5-4df2-a9ab-97276f8bd230:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_basic_follower_fetching.read_from_object_store=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/44191#018d393d-6bf2-4a15-83df-4575771b5098:

"rptest.tests.follower_fetching_test.FollowerFetchingTest.test_basic_follower_fetching.read_from_object_store=False"

@piyushredpanda

@andrwng Needs some test updates, it looks like.

Updates the filter used when collecting metrics samples.

In some cases, the samples take the form:

Sample(name='redpanda_kafka_max_offset', labels={'redpanda_namespace': 'kafka', 'redpanda_partition': '5', 'redpanda_topic': 'panda-topic'}, value=0.0, ...
@andrwng andrwng force-pushed the cloud-storage-rrr-metric-hwm branch from 7a09ddd to 7e33cc7 Compare January 24, 2024 05:50
Comment on lines +1069 to +1087
labels = sample.labels
if ns:
    if "redpanda_namespace" in labels:
        if labels["redpanda_namespace"] != ns:
            continue
    elif "namespace" in labels:
        if labels["namespace"] != ns:
            continue
    else:
        assert False, f"Missing namespace label: {sample}"
if topic:
    if "redpanda_topic" in labels:
        if labels["redpanda_topic"] != topic:
            continue
    elif "topic" in labels:
        if labels["topic"] != topic:
            continue
    else:
        assert False, f"Missing topic label: {sample}"

Contributor

nit: these checks are a little hard to read, it looks like they could be simplified a bit and collapsed?

Contributor Author

Agreed it's a bit ugly. I spent some time earlier trying to clean it up, but couldn't come to anything simpler. Open to suggestions on how to improve it

Contributor Author

Actually maybe I'll do this in a follow up, if it works:

if ns:
    assert "redpanda_namespace" in labels or "namespace" in labels, "Missing namespace"
    if labels.get("redpanda_namespace", labels.get("namespace")) != ns:
        continue
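
For reference, a self-contained sketch of how a collapsed check along these lines could look when applied to both the namespace and topic labels. This is an illustration under assumptions, not the code from this PR or its follow-up: `Sample` here is a local stand-in tuple, and `label_value`, `sum_samples`, and the `example_metric` name are hypothetical.

```python
from collections import namedtuple

# Local stand-in for a metrics sample; only the fields used in this sketch.
Sample = namedtuple("Sample", ["name", "labels", "value"])


def label_value(labels, key):
    """Return the value under 'redpanda_<key>' if present, else under '<key>'."""
    prefixed = f"redpanda_{key}"
    assert prefixed in labels or key in labels, f"Missing {key} label: {labels}"
    return labels.get(prefixed, labels.get(key))


def sum_samples(samples, ns=None, topic=None):
    """Sum sample values, optionally filtered by namespace and topic."""
    total = 0.0
    for sample in samples:
        labels = sample.labels
        if ns and label_value(labels, "namespace") != ns:
            continue
        if topic and label_value(labels, "topic") != topic:
            continue
        total += sample.value
    return total


samples = [
    # Prefixed label style, as in the sample shown in the commit message.
    Sample("redpanda_kafka_max_offset",
           {"redpanda_namespace": "kafka", "redpanda_partition": "5",
            "redpanda_topic": "panda-topic"}, 10.0),
    # Unprefixed label style (hypothetical metric name).
    Sample("example_metric",
           {"namespace": "kafka", "partition": "0", "topic": "panda-topic"}, 5.0),
    Sample("example_metric",
           {"namespace": "kafka", "partition": "0", "topic": "other-topic"}, 7.0),
]
```

Calling `sum_samples(samples, ns="kafka", topic="panda-topic")` counts the first two samples regardless of which label style they use, and a sample carrying neither label form fails the assertion instead of being silently skipped.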

Contributor Author

Done here #16277

@andrwng andrwng merged commit 86c0d1d into redpanda-data:dev Jan 24, 2024
18 checks passed
@vbotbuildovich

/backport v23.3.x

@vbotbuildovich

/backport v23.2.x

@vbotbuildovich

/backport v23.1.x

@vbotbuildovich

Failed to create a backport PR to v23.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16263-v23.1.x-69 remotes/upstream/v23.1.x
git cherry-pick -x e567d6387e1c6b37674328ef355d029c30511434 7e33cc70ff4d0bfd59ed16849177480b79968c4a

Workflow run logs.

@vbotbuildovich

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16263-v23.2.x-82 remotes/upstream/v23.2.x
git cherry-pick -x e567d6387e1c6b37674328ef355d029c30511434 7e33cc70ff4d0bfd59ed16849177480b79968c4a

Workflow run logs.

andrwng added a commit to andrwng/redpanda that referenced this pull request Jan 24, 2024
Follow-up to e567d63 (redpanda-data#16263) to simplify the metrics_sum filtering.
Development

Successfully merging this pull request may close these issues.

read replica partitions report incorrect kafka max offset