archival: Start housekeeping jobs after STM sync #14599
9577fed to 4d0b3de
I think this is a good change, but isn't this problem more general? A housekeeping job may try to replicate a command at any time. It will sync with the log before replicating, but the decisions which led to the replication of the command may have been made on stale data.
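To make the concern above concrete, here is a minimal single-threaded sketch of a decision made on a stale view. Everything in it (`toy_stm`, `pick_oldest`, `decide_then_sync`, `sync_then_decide`) is hypothetical and invented for illustration; it is not the archival STM API. It assumes the replicated log holds "remove segment" commands and that the local manifest view may lag behind them.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a replicated state machine, not Redpanda code.
struct toy_stm {
    std::vector<int> committed_removals; // commands already in the log
    std::size_t applied = 0;             // prefix of the log applied locally
    std::vector<int> segments;           // local (possibly stale) manifest view

    // Apply every committed command that has not been applied yet.
    void sync() {
        for (; applied < committed_removals.size(); ++applied) {
            int id = committed_removals[applied];
            segments.erase(
              std::remove(segments.begin(), segments.end(), id),
              segments.end());
        }
    }

    bool is_stale() const { return applied < committed_removals.size(); }
};

// Pick the segment to remove by looking at the local view only.
std::optional<int> pick_oldest(const toy_stm& stm) {
    if (stm.segments.empty()) {
        return std::nullopt;
    }
    return stm.segments.front();
}

// Problematic order: the decision is taken before the sync, so it may be
// based on a view that misses committed commands.
std::optional<int> decide_then_sync(toy_stm& stm) {
    auto choice = pick_oldest(stm);
    stm.sync();
    return choice;
}

// Safer order in this toy model: sync first, then decide.
std::optional<int> sync_then_decide(toy_stm& stm) {
    stm.sync();
    return pick_oldest(stm);
}
```

With `segments = {1, 2, 3}` and one unapplied command removing segment 1, `decide_then_sync` still picks segment 1, while `sync_then_decide` picks segment 2. The toy has no concurrency, so it says nothing about commands that land between the sync and the replication; that gap is what locking would have to cover.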
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b3e-bd67-42ad-8405-8289aaf8f68b: "rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting_with_partition_moves" |
@VladLazar the adjacent segment merger uses locking to avoid this issue. There is a mutual exclusion between the ASM and |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b3e-bd6e-48c7-9ffd-d6c3f6cc9561: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b3e-bd72-40ea-abe7-7ebe3a45d7ba: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b3e-bd6b-409e-92d8-262658de307e: "rptest.tests.cluster_features_test.FeaturesSingleNodeUpgradeTest.test_upgrade" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b5d-cc35-4e6c-b731-d17d257657ad: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b5d-cc38-477d-b21b-e39d710ce410: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b5d-cc32-4789-bd7e-f4129ae759c9: "rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40238#018b8b5d-cc2f-4ca9-af1a-3a63618e703e: "rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting_with_partition_moves" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b905f-4dc4-435d-a32c-ed2e82705cb8: "rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting_with_partition_moves" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b905f-4dca-4b38-b77e-a779e8a0d24b: "rptest.tests.topic_creation_test.CreateSITopicsTest.topic_alter_config_test" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b905f-4dcd-4b7c-8577-4dca9261d475: "rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_reset_spillover.cloud_storage_type=CloudStorageType.ABS" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b905f-4dc7-4a6e-98e1-b1f4b21de4da: "rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b906e-41db-45ab-9370-4c7a9a156e51: "rptest.tests.topic_creation_test.CreateSITopicsTest.topic_alter_config_test" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b906e-41dd-47a2-a0ec-1c8338c4ad28: "rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b906e-41d8-4eab-ac5c-ebc985787234: "rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40312#018b906e-41d5-4b62-8396-7e6d4a0fb437: "rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting_with_partition_moves" |
When |
Not as far as I'm aware.
@VladLazar another possible race source then? As it starts the uploader/merger loop there. |
The jobs are enabled after the call to |
@VladLazar can you point to the file:lineno where |
// can only see up to date manifest.
auto sync_timeout = config::shard_local_cfg()
                      .cloud_storage_metadata_sync_timeout_ms.value();
co_await _parent.archival_meta_stm()->sync(sync_timeout);
@VladLazar assuming the upload loop is fine, notify_leadership also re-enables the merger, and it has no call to sync (this PR adds one only at archival start). I'm struggling a bit to reason about these code paths.
4d0b3de to 9a378e6
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e10-6018-4346-8a7b-1e750869e9f3:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e08-e9d5-44bc-86ae-1439494777fc:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e08-e9d9-4edd-8313-60370b594c2d:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e08-e9df-4ac5-9e13-0188c9faf589:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e08-e9dc-4785-979e-fbdd9b55cd80:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e10-6022-4ae3-b53d-fc4385e231ad:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e10-601f-4a6e-82ad-b824c0185aa2:
new failures in https://buildkite.com/redpanda/redpanda/builds/43254#018c8e10-601c-4558-99a2-70234aaa26bb:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda55-fc59-4bc6-baec-008a32b913d2:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda55-fc55-4d8b-850d-6f4d30402fc2:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda55-fc5c-42c5-bdff-c0afaa8e9230:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda55-fc5f-4c7f-b9cc-7c615cc0e3af:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda66-e74d-4e37-bf56-b49551177956:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda66-e750-4028-bfb7-74302c15324d:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda66-e749-42bc-ac08-1bb38688c567:
new failures in https://buildkite.com/redpanda/redpanda/builds/43484#018cda66-e753-42f1-9b32-f722f16b167c:
9a378e6 to 8c8551a
Previously, the housekeeping jobs were started in the constructor of the ntp_archiver, which allowed them to 'see' a potentially stale manifest. To avoid this, this commit forces an STM sync in 'ntp_archiver::start' and enables the housekeeping jobs only after that.
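The ordering described in this commit message can be sketched with a toy class. This is an illustration only: `toy_archiver`, its `start(bool)` signature, and the trace strings are hypothetical, and the choice to leave jobs disabled when the sync fails is an assumption of the sketch, not something stated in this PR.

```cpp
#include <string>
#include <vector>

// Hypothetical toy archiver; it only records the order of lifecycle steps.
class toy_archiver {
public:
    explicit toy_archiver(std::vector<std::string>& trace)
      : _trace(trace) {
        // Housekeeping is deliberately NOT enabled in the constructor,
        // because the manifest may not be up to date yet.
        _trace.push_back("constructed");
    }

    // Returns true if housekeeping was enabled.
    // `sync_succeeds` simulates the outcome of the STM sync.
    bool start(bool sync_succeeds) {
        _trace.push_back("sync");
        if (!sync_succeeds) {
            // Assumption of this sketch: keep jobs disabled on failure.
            return false;
        }
        _housekeeping_enabled = true;
        _trace.push_back("housekeeping enabled");
        return true;
    }

    bool housekeeping_enabled() const { return _housekeeping_enabled; }

private:
    std::vector<std::string>& _trace;
    bool _housekeeping_enabled = false;
};
```

After construction, `housekeeping_enabled()` is false; it becomes true only once `start(true)` has recorded the sync step, so the trace always shows "sync" before "housekeeping enabled".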
8c8551a to f4cbe6c
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43771#018d0d9f-6ae6-499e-8ef6-484dbf9aa574 |
/backport v23.3.x |
Fixes #14222
Backports Required
Release Notes