@WillemKauf
Contributor

@WillemKauf WillemKauf commented May 16, 2025

Adds a counter to improve observability of adjacent segment compaction. The counter represents the total number of segments that have been compacted away via adjacent segment compaction.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • Adds the storage_log_adjacent_segments_compacted metric for better observability into adjacent segment compaction.

segment_to_remove, "compact_adjacent_segments");
}

_probe->segments_adjacent_merged(segments.size());
Member

Just wondering whether this method of counting is the most intuitive. For example, if 4 segments are reduced to 1, then we'd record 2 + 2 + 2 = 6? I suppose that is literally correct, but it also feels like it overcounts. Not really sure...

Contributor Author

@WillemKauf WillemKauf May 16, 2025

I didn't have a strong opinion on this when adding the metric. You can either halve the metric to learn that 3 segments have been removed by adjacent compaction, or accept that 6 segments have, in total, taken part in adjacent compaction. If we extend this way of counting to an implementation of adjacent segment compaction that merges more than two segments at a time, the multiplication factor will always be < 2.

If you have a strong preference here, I'm happy to adjust the counting.
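
For reference, the factor mentioned above can be worked out as follows. This is a sketch under one assumption not stated in the thread: a single operation merges $k \ge 2$ adjacent segments into exactly one segment, so $k$ segments participate and $k - 1$ are removed.

```latex
\[
\frac{\text{segments participating}}{\text{segments removed}} = \frac{k}{k-1},
\qquad
\frac{k}{k-1} = 2 \ \text{for } k = 2,
\qquad
1 < \frac{k}{k-1} < 2 \ \text{for } k > 2.
\]
```

With the current pairwise behavior, reducing 4 segments to 1 takes three merges, so 6 segments participate and 3 are removed (factor 2). A hypothetical single 4:1 merge would count 4 participants and 3 removals (factor 4/3).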

Contributor

My 2 cents: I think a metric for the number of segments compacted away (half this metric) is more intuitive and communicates the same information, namely how many adjacent segment compaction operations have happened. Given the current implementation of adjacent segment compaction, any one of the following immediately tells you the others: the number of adjacent segment compactions, the number of segments that have participated in adjacent segment compaction, the number of segments compacted away by adjacent segment compaction, and the number of segments produced by adjacent segment compaction. All four are interesting, and they would no longer be directly related to each other if we generalized adjacent segment compaction to be m:n.

Also, is it worth having a bytes measurement too? That way we have compaction metrics reflecting "fixed costs" like making new files and shuffling metadata, and metrics reflecting "variable costs" like how many bytes we have to write, how many bytes we compacted away, how much reindexing has to happen, and how much savings there is to readers who read compacted data.
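
To make the relationship between the four quantities concrete, here is a minimal standalone sketch. The struct, its field names, and `merge_pairwise_down_to_one` are hypothetical and only for illustration; they are not Redpanda's probe API. It assumes each pairwise operation merges exactly two adjacent segments into one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tallies for illustration only; not Redpanda's probe API.
struct adjacent_compaction_tallies {
    std::uint64_t operations = 0;            // number of adjacent compaction runs
    std::uint64_t segments_participated = 0; // input segments summed over all runs
    std::uint64_t segments_removed = 0;      // inputs minus outputs, summed over all runs
    std::uint64_t segments_produced = 0;     // output segments summed over all runs

    // Record one run that merges `inputs` segments into `outputs` segments.
    void record(std::uint64_t inputs, std::uint64_t outputs) {
        assert(outputs >= 1 && inputs > outputs);
        operations += 1;
        segments_participated += inputs;
        segments_removed += inputs - outputs;
        segments_produced += outputs;
    }
};

// Repeatedly merge two adjacent segments into one until a single segment remains.
inline adjacent_compaction_tallies
merge_pairwise_down_to_one(std::uint64_t segment_count) {
    assert(segment_count >= 1);
    adjacent_compaction_tallies tallies;
    while (segment_count > 1) {
        tallies.record(2, 1); // assumed 2:1 merge
        segment_count -= 1;   // each merge removes exactly one segment
    }
    return tallies;
}
```

Calling `merge_pairwise_down_to_one(4)` records 3 operations, 6 participating segments, 3 removed segments, and 3 produced segments, so under 2:1 merging any one tally determines the rest. Recording a single hypothetical 4:1 run with `record(4, 1)` gives 1 operation, 4 participants, 3 removals, and 1 produced segment, which shows how the tallies decouple once merges are m:n.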

Contributor Author

Altered.

@WillemKauf WillemKauf requested a review from dotnwat May 16, 2025 18:49
@vbotbuildovich
Collaborator

vbotbuildovich commented May 16, 2025

CI test results

test results on build#66111
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AvailabilityTests | test_recovery_after_catastrophic_failure | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196da24-5411-433f-969c-6134bfc87984 | FLAKY | 19/21 | upstream reliability is '99.10313901345292'. current run reliability is '90.47619047619048'. drift is 8.62695 and the allowed drift is set to 50. The test should PASS |
| ClusterHealthOverviewTest | cluster_health_overview_baseline_test | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196da24-5413-41e8-8423-9e98d212018b | FLAKY | 20/21 | upstream reliability is '95.89442815249268'. current run reliability is '95.23809523809523'. drift is 0.65633 and the allowed drift is set to 50. The test should PASS |
| ConsumerOffsetsRecoveryTest | test_consumer_offsets_partition_recovery | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196da24-5411-433f-969c-6134bfc87984 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| DeleteRecordsKafkaTest | test_delete_records_non_empty_topic | {"truncate_point": "start_offset"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196da24-5412-470d-843b-f9eeede6941e | FLAKY | 20/21 | upstream reliability is '98.7551867219917'. current run reliability is '95.23809523809523'. drift is 3.51709 and the allowed drift is set to 50. The test should PASS |
| PartitionBalancerTest | test_unavailable_nodes | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196d9b5-ceb6-42d9-a319-970643fa90ea | FLAKY | 20/21 | upstream reliability is '99.32659932659934'. current run reliability is '95.23809523809523'. drift is 4.0885 and the allowed drift is set to 50. The test should PASS |
| RaftAvailabilityTest | test_one_node_down | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196d9b5-ceb6-42d9-a319-970643fa90ea | FLAKY | 19/21 | upstream reliability is '98.66071428571429'. current run reliability is '90.47619047619048'. drift is 8.18452 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196d9b5-ceb7-44e1-ab1d-7807ad589eee | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196d9b5-ceb5-4bff-9d2c-0ce34464a268 | FLAKY | 20/21 | upstream reliability is '98.96907216494846'. current run reliability is '95.23809523809523'. drift is 3.73098 and the allowed drift is set to 50. The test should PASS |
| VerifyConsumerOffsetsThruUpgrades | test_consumer_group_offsets | {"versions_to_upgrade": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66111#0196da24-5412-470d-843b-f9eeede6941e | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
test results on build#66205
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AuditLogTestOauth | test_kafka_oauth | {"authz_match": "acl"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66205#0196edf9-d3d0-4763-9999-fdb9d3f1c484 | FLAKY | 20/21 | upstream reliability is '97.7715877437326'. current run reliability is '95.23809523809523'. drift is 2.53349 and the allowed drift is set to 50. The test should PASS |
| NodesDecommissioningTest | test_decommissioning_working_node | {"delete_topic": false, "tick_interval": 3600000} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66205#0196edf9-d3d0-4763-9999-fdb9d3f1c484 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66205#0196edfb-a89f-4442-b991-601d4f2a3cfd | FAIL | 0/1 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66205#0196edfb-a89f-4442-b991-601d4f2a3cfd | FAIL | 0/1 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66205#0196edfb-a89f-4442-b991-601d4f2a3cfd | FAIL | 0/1 | |

@WillemKauf WillemKauf force-pushed the compaction_metrics branch from 0ec8e07 to 2b2d6ba Compare May 19, 2025 21:39
@WillemKauf WillemKauf changed the title from "storage: add _num_adjacent_segments_merged to probe" to "storage: add _num_adjacent_segments_compacted to probe" May 19, 2025
@WillemKauf WillemKauf requested a review from wdberkeley May 19, 2025 21:40
To improve observability of adjacent segment compaction.
@WillemKauf WillemKauf force-pushed the compaction_metrics branch from 2b2d6ba to 05f5ef7 Compare May 20, 2025 12:12
@WillemKauf WillemKauf enabled auto-merge May 20, 2025 16:47
@WillemKauf
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/consumer_group_test.py::ConsumerGroupOffsetResetTest.test_stress_consumer_group_commits

@WillemKauf WillemKauf merged commit a4d0b3a into redpanda-data:dev May 20, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26180-v24.3.x-431 remotes/upstream/v24.3.x
git cherry-pick -x 05f5ef7257
