Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PESDL-942 concurrent pods restart #17232

Merged
merged 1 commit into from
Mar 27, 2024
Merged

Conversation

rpdevmp
Copy link
Contributor

@rpdevmp rpdevmp commented Mar 21, 2024

PESDL-942 Concurrent pods restart (hard restart)

Backports Required

  • [x ] none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

Improvements

  • Added ability to restart all of the pods at the same time, without waiting for each pod to come up (hard restart)
  • Moved some of the code into separate functions, so they could be reused and avoid code duplication
  • Updated rolling restart function to also avoid code duplication
  • Removed old function of nodes restart as we needed to move to pods, and created a separate function to verify basic producer_consumer that could be reused, moved it from HTT to redpanda service

NOTE: Need to perform more tests before merging

@@ -795,6 +795,7 @@ def stage_stop_wait_start(self, forced_stop: bool, downtime: int):
timeout_sec=restart_timeout,
backoff_sec=1)

@skip_if_cloud_type_is('CLOUD_TYPE_FMC')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this one, I will remove it

@@ -808,11 +809,6 @@ def test_disrupt_cloud_storage(self):
cloudv2 API to create/update buckets and its
properties/policies
"""
if self.redpanda.cloud_type == CLOUD_TYPE_FMC:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this one

@rpdevmp
Copy link
Contributor Author

rpdevmp commented Mar 21, 2024

I need to update this commit/PR and remove a couple of things that got added by accident. Those were already merged.

@rpdevmp rpdevmp force-pushed the PESDLC-942-cloud-pods-restart branch from f2a3c7f to 3bac8f7 Compare March 21, 2024 13:47
@travisdowns
Copy link
Member

@rpdevmp about the release notes it looks like maybe you filled out the "improvements" section but also put * none. The template might be a bit unclear but the improvements section is for the release notes, this will get copied into our public release nodes.

For these test fixes, I think that bullet list would go great at the top of a cover letter as a summary of what was changed.

@@ -1345,7 +1340,8 @@ def test_consume(self):
timeout=self.msg_timeout)

self.stage_lots_of_failed_consumers()
self.stage_hard_restart(producer)
RedpandaServiceCloud.concurrent_restart_pods(producer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a class method, why is it being called as a class method? Actually it's not clear how this works.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 21, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46559#018e6188-7861-47bd-934a-666ccb3f30b6:

"rptest.tests.read_replica_e2e_test.TestReadReplicaService.test_identical_lwms_after_delete_records.partition_count=5.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/46630#018e67b1-85a4-4b87-a670-b1985c9ce136:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

new failures in https://buildkite.com/redpanda/redpanda/builds/46630#018e67c2-d40d-4264-ba76-c4da984d30a3:

"rptest.tests.read_replica_e2e_test.TestReadReplicaService.test_identical_lwms_after_delete_records.partition_count=5.cloud_storage_type=CloudStorageType.S3"

Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 comments on business logic and 1 request for fix

]).decode()
return int(0 if not ret else ret)

def verify_basic_produce_consume(self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as line 1987, test elements better be held in test module.

tests/rptest/services/redpanda.py Show resolved Hide resolved
tests/rptest/redpanda_cloud_tests/high_throughput_test.py Outdated Show resolved Hide resolved
@rpdevmp rpdevmp force-pushed the PESDLC-942-cloud-pods-restart branch from 852ea2e to b8c2deb Compare March 26, 2024 03:19
@rpdevmp rpdevmp force-pushed the PESDLC-942-cloud-pods-restart branch 2 times, most recently from 1657b38 to f5e8ed1 Compare March 26, 2024 03:36
@rpdevmp rpdevmp force-pushed the PESDLC-942-cloud-pods-restart branch from f5e8ed1 to f0529e3 Compare March 26, 2024 03:45
@rpdevmp rpdevmp force-pushed the PESDLC-942-cloud-pods-restart branch from f0529e3 to bd48aa2 Compare March 26, 2024 03:49
@rpdevmp
Copy link
Contributor Author

rpdevmp commented Mar 26, 2024

  1. To avoid code duplication, moved some of the functions from HTT to RedpandaServiceCloud so they could be reused
  2. Added a concurrent restarts using threading. 1 Get a list of pods, use kubectl delete pod concurrently
  3. Verify replicas, verify basic produce and consume functions
  4. Tested multiple times: ducktape ... tests/rptest/redpanda_cloud_tests/high_throughput_test.py::HighThroughputTest.test_consume

Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm according to discussion on the sync meeting

@rpdevmp rpdevmp requested review from savex and removed request for andrewhsu March 27, 2024 19:14
Copy link
Contributor

@savex savex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@piyushredpanda
Copy link
Contributor

Only known failures, merging.

@piyushredpanda piyushredpanda merged commit b94628d into dev Mar 27, 2024
13 of 16 checks passed
@piyushredpanda piyushredpanda deleted the PESDLC-942-cloud-pods-restart branch March 27, 2024 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants