
Put nodes into maintenance mode in preStop hook #378

Merged: coro merged 2 commits into main from 320-drain-on-shutdown on Oct 2, 2020

Conversation
Conversation

@coro (Contributor) commented Oct 1, 2020

This commit causes nodes to go into maintenance mode before the
controller runtime terminates the rabbitmq-server process on a pod stop.
This allows for a more controlled shutdown of pods in a cluster during a
rolling restart or upgrade.

Using https://github.com/Vanlightly/RabbitTestTool, I was able to show that with
these changes, no data loss or serious periods of cluster unavailability
occurred during a rolling restart. However, the same was true without
these changes: I attempted to find a demonstrable case where this change
improves rolling restart reliability, but was not able to find one.

I still put this PR forward for review, as the core team recommends that
nodes are put into maintenance mode before being stopped, to allow a more
predictable transfer of responsibility to other nodes. The
`rabbitmq-upgrade drain` command can also be expanded in the future if
there are further improvements to be made.

This closes #320.
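For context, here is a minimal sketch of the kind of change described above, assuming the operator builds its StatefulSet pod template with k8s.io/api/core/v1. The bash wiring and the one-week timeout value are illustrative assumptions; only the `rabbitmq-upgrade drain` command is named in this PR.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// withDrainPreStop adds a preStop hook that puts the node into maintenance
// mode before the kubelet sends SIGTERM to the rabbitmq-server process.
func withDrainPreStop(container *corev1.Container) {
	container.Lifecycle = &corev1.Lifecycle{
		PreStop: &corev1.Handler{ // renamed LifecycleHandler in newer API versions
			Exec: &corev1.ExecAction{
				Command: []string{
					"/bin/bash", "-c",
					// -t is the CLI timeout in seconds; 604800 = one week.
					"rabbitmq-upgrade drain -t 604800",
				},
			},
		},
	}
}

func main() {
	c := corev1.Container{Name: "rabbitmq"}
	withDrainPreStop(&c)
	fmt.Println(c.Lifecycle.PreStop.Exec.Command)
}
```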

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Additional Context

All tests pass. Also ran some manual testing with `kubectl rollout restart statefulset <>`

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

This PR contains 2 commits:

* Put nodes into maintenance mode in preStop hook (full commit message as in the description above)
* Update minimum RMQ version to 3.8.8: maintenance mode is not available in previous versions of RabbitMQ
@mkuratczyk (Collaborator) left a comment

Great stuff. My only comment, and I'm not saying we need to do anything about it, is that we started with a week-long termination timeout but now we pass that value to three different commands so we basically have a 3 week timeout now. I don't think that matters really though - a week basically means infinity and 3*infinity is still infinity ;)
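To make the arithmetic concrete (a back-of-the-envelope sketch; the count of chained commands comes from the comment above, not from reading the diff):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	perCommand := 7 * 24 * time.Hour // the week-long termination timeout
	commands := 3                    // the "three different commands" the timeout is passed to
	// Worst case, the commands run back to back, each exhausting its full timeout.
	fmt.Println(time.Duration(commands) * perCommand) // 504h0m0s, i.e. three weeks
}
```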

@Zerpet (Collaborator) left a comment

Changes look good and tests are passing on my machine ✅

@coro coro merged commit e12a397 into main Oct 2, 2020
@coro coro deleted the 320-drain-on-shutdown branch October 2, 2020 09:49
@Zerpet (Collaborator) commented Oct 2, 2020

> Great stuff. My only comment, and I'm not saying we need to do anything about it, is that we started with a week-long termination timeout but now we pass that value to three different commands so we basically have a 3 week timeout now. I don't think that matters really though - a week basically means infinity and 3*infinity is still infinity ;)

The container will receive a SIGKILL if the grace period expires, even if it's executing the preStop hook; the longest we will wait will be one week (infinity).
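As a sketch of the bound described here, assuming the operator sets the pod's terminationGracePeriodSeconds to the same one-week value (the field is standard Kubernetes; the specific value here is an assumption): the kubelet sends SIGKILL when this countdown expires, even if the preStop hook is still mid-drain, so the hook's internal timeouts cannot extend the wait past it.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The kubelet starts this countdown at pod deletion; when it expires,
	// the container is SIGKILLed even if the preStop hook is still running.
	oneWeek := int64(7 * 24 * 60 * 60) // 604800 seconds
	podSpec := corev1.PodSpec{
		TerminationGracePeriodSeconds: &oneWeek,
	}
	fmt.Println(*podSpec.TerminationGracePeriodSeconds, "seconds")
}
```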

mkuratczyk pushed a commit that referenced this pull request Oct 5, 2020

* Put nodes into maintenance mode in preStop hook
* Update minimum RMQ version to 3.8.8
Labels: None yet
Projects: None yet
Development: successfully merging this pull request may close these issues: Use maintenance mode for termination
4 participants