
Backup Restore rabbit cluster managed by operator #1491

Closed
brandtwinchell opened this issue Nov 20, 2023 · 8 comments
Labels
closed-stale (Issue or PR closed due to long period of inactivity), stale (Issue or PR with long period of inactivity)

Comments

@brandtwinchell

Is your feature request related to a problem? Please describe.
The existing issue is the lack of a way to quiesce a cluster in order to perform backups. The second part is being able to perform a scale-down operation to replace the cluster during a restore.

Backup issue: (using Kasten v6.0.11 and a custom Kanister blueprint)

  • Have a 4-node RabbitMQ cluster managed by the operator
    • the statefulset created by the operator has a 'readiness' check that looks for a TCP socket on the AMQP port
    • the operator does not allow scale-down operations

Backup Idea:
This idea will cause the entire cluster to stop responding to client requests. That is acceptable at this point (as it is really the responsibility of the client to retry anyway).
I also want to accomplish this without the Operator trying to re-deploy the entire cluster for the changes (that would make the backup procedure too long).

  • Put all nodes of the cluster into maintenance mode (this is all controlled by Kanister blueprint)
    • On each node run rabbitmq-upgrade --timeout 10 drain (see the sketch after this list)
      • All nodes will eventually be pseudo-up (the pod will show as not Ready, but the 'readiness' check will not cause the pod to be restarted)
    • At this point, we have all nodes in a "quiesced" state (no client connections, no listeners working, messages are static/stable).
      • In theory, we can now snapshot the underlying storage with all Rabbit configs and messages
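
A minimal sketch of that drain step, assuming (hypothetically) a 4-node cluster named `rabbit` in namespace `rabbitmq-system`, with the operator's default pod naming (`<cluster>-server-<ordinal>`) and a container named `rabbitmq`:

```bash
# Put every node into maintenance mode; the Kanister blueprint would run the
# equivalent of this loop as its pre-backup hook.
for i in 0 1 2 3; do
  kubectl -n rabbitmq-system exec "rabbit-server-$i" -c rabbitmq -- \
    rabbitmq-upgrade --timeout 10 drain
done
```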

Idea Issue:

  • Kasten will not back up a workload (statefulset, pod, container) that is not in a ready state
    • Because we put the Rabbit nodes into maintenance mode, the pods and statefulset are not in a ready state
    • So I figured I would be smart and temporarily modify the 'readiness' config on the statefulset (roughly sketched after this list).
      • This does not work, as the Operator kicks in and reverts that setting. Even if I could override the 'readiness' via the operator, this would require the cluster to be redeployed (we do not want that)
      • Cannot modify the 'readiness' on the pod, as K8s does not allow that when the pod is part of a statefulset
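
For reference, the kind of patch that was attempted looks roughly like the following (the StatefulSet name `rabbit-server` is a hypothetical example); the Operator reconciles the StatefulSet back to its desired spec, so the change does not stick:

```bash
# Attempted (and reverted by the Operator): drop the readiness probe on the STS directly.
kubectl -n rabbitmq-system patch statefulset rabbit-server --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
```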

Restore Issue:
As mentioned, I currently use Kasten to back up my K8s workloads. Inherently, when Kasten performs a restore to an existing workload with a PV attached, it will scale down the workload to remove/replace the PV with the backed-up data.

  • Operator does not support scaling down
    • So Kasten cannot restore the Rabbit cluster, as it cannot remove the existing PVs (Kasten just loops and eventually times out/fails)
      • I can bypass this by essentially creating a Kanister execution hook (blueprint) that will "delete" the entire existing Rabbit cluster. Now Kasten can replace the cluster, as the objects no longer exist (see the sketch below)
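
A sketch of what that "delete the cluster" hook boils down to, again assuming a cluster named `rabbit` (hypothetical):

```bash
# Deleting the RabbitmqCluster custom resource removes the StatefulSet and pods it owns,
# after which Kasten can recreate the objects and restore the volumes.
kubectl -n rabbitmq-system delete rabbitmqcluster rabbit
# Note: PVCs created from the StatefulSet's volume claim templates typically persist after
# deletion; whether they also need removing depends on how Kasten replaces the volumes.
```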

Ideas??
So, any ideas on how this logic to quiesce a Rabbit cluster for backup might be accomplished?

@Zerpet
Collaborator

Zerpet commented Nov 21, 2023

I had a brief look at Kanister; bear in mind that I'm no expert on that technology 🐻 I have a long-shot idea that may solve the backup issue. You can use virtual host limits and set the maximum number of connections to 0 in all vhosts. This effectively rejects any connection to RabbitMQ. I don't recall whether applying this limit also closes current connections, but there's a rabbitmqctl command to close all connections; that command could be used in one of the backup steps, if needed.
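
A sketch of that suggestion with `rabbitmqctl` (run on any node; the vhost handling is kept deliberately simple):

```bash
# Reject new client connections by setting every vhost's connection limit to 0.
for v in $(rabbitmqctl -q list_vhosts); do
  rabbitmqctl set_vhost_limits -p "$v" '{"max-connections": 0}'
done

# If already-open connections also need to go, close them per vhost
# (the limit alone may not terminate existing connections).
for v in $(rabbitmqctl -q list_vhosts); do
  rabbitmqctl close_all_connections -p "$v" "closing connections for backup"
done
```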

You are right that the Operator does not allow scaling down, mainly because it's not safe to do so and data loss is highly likely if you scale down rabbit. However, that concern does not apply in the restore scenario because, well, you are restoring; you don't care about the current data. One idea to stop the Operator from meddling in your restore sequence is to pause the reconciliation and let Kanister scale down the StatefulSet directly. Luckily, the name of the STS is derived from the RabbitmqCluster name + -server, so it should be easy (?) to point Kanister to the STS.
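
A sketch of that restore-time sequence, assuming a RabbitmqCluster named `rabbit` (hypothetical) and that pausing is done with the operator's `rabbitmq.com/pauseReconciliation` label:

```bash
# Stop the Operator from reconciling this cluster while the restore runs.
kubectl -n rabbitmq-system label rabbitmqcluster rabbit rabbitmq.com/pauseReconciliation=true

# The Operator names the StatefulSet <cluster-name>-server, so scale it down directly.
kubectl -n rabbitmq-system scale statefulset rabbit-server --replicas=0

# ... let Kasten/Kanister replace the PVs, then scale back up and resume reconciliation ...
kubectl -n rabbitmq-system scale statefulset rabbit-server --replicas=4
kubectl -n rabbitmq-system label rabbitmqcluster rabbit rabbitmq.com/pauseReconciliation-
```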

Do you have any specific ask in terms of functionality that you would like to change in the Operator as part of this issue?

@brandtwinchell
Author

Thanks for the insight.
Logically, your proposal sounds reasonable and I will do some testing today. Will report back either way.
If this scenario/sequence works, then I would request that functionality be added into the operator to essentially "stun" the cluster.
Currently, until something more substantial is built into the functionality of Rabbit itself, I think this is the best way to back up a Rabbit cluster statefully.

@brandtwinchell
Author

So I was able to get the pre-hook configs working and tested.
I was not able to put in much error checking, as the command-line abilities are somewhat limited.
Now I have to get the restore logic working.

@brandtwinchell
Author

So I finally got something working.
It is by no means quick, as the steps using rabbitmqctl are not fast to respond. So if you have a lot of vhosts, this process can stun your cluster for an overly long time.
Essentially there are different phases for different scenarios.

TL;DR
stop the entire RabbitMQ cluster from receiving any messages
set vhost limits to '0'
reactivate the pods

quiesceRabbitCluster:

  • rabbitClusterStatus
    • get if any Rabbit nodes are in maintenance mode
  • drainRabbitNodes
    • perform a 'drain' like you are about to perform an upgrade
  • quiesceWaitRabbitNodesStop
    • wait for the pods to fully stop
  • blockVhostConnections
    • set all vhost 'max_connections' = 0
  • quiesceReviveRabbitNodes
    • bring Rabbit nodes out of maintenance mode
      Now perform backup as per normal
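
A rough shell rendering of the quiesce phases above, under the same hypothetical names as before (cluster `rabbit`, namespace `rabbitmq-system`, container `rabbitmq`):

```bash
NS=rabbitmq-system
PODS="rabbit-server-0 rabbit-server-1 rabbit-server-2 rabbit-server-3"

# drainRabbitNodes: put every node into maintenance mode.
for p in $PODS; do
  kubectl -n "$NS" exec "$p" -c rabbitmq -- rabbitmq-upgrade --timeout 10 drain
done

# quiesceWaitRabbitNodesStop: wait until the pods report not Ready
# (the TCP readiness probe fails once client listeners are suspended).
kubectl -n "$NS" wait pod $PODS --for=condition=Ready=false --timeout=5m

# blockVhostConnections: set max-connections = 0 on every vhost
# (vhost limits are cluster-wide, so running this against one node is enough).
for v in $(kubectl -n "$NS" exec rabbit-server-0 -c rabbitmq -- rabbitmqctl -q list_vhosts); do
  kubectl -n "$NS" exec rabbit-server-0 -c rabbitmq -- \
    rabbitmqctl set_vhost_limits -p "$v" '{"max-connections": 0}'
done

# quiesceReviveRabbitNodes: bring the nodes out of maintenance mode.
for p in $PODS; do
  kubectl -n "$NS" exec "$p" -c rabbitmq -- rabbitmq-upgrade revive
done
```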

activateRabbitCluster:

  • releaseVhostConnections
    • set all vhost 'max_connections' = -1 (removes any restriction; a future improvement is to set it back to the original value)

revertRabbitCluster: (in case of error)

  • revertVhostConnections
    • set all vhost 'max_connections' = -1
  • revertReviveRabbitNodes
    • bring Rabbit nodes out of maintenance mode
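
And the corresponding activate/revert phases, under the same assumptions as the quiesce sketch above:

```bash
# releaseVhostConnections / revertVhostConnections: lift the limits again
# (-1 removes the restriction; restoring the original per-vhost values would be nicer).
for v in $(kubectl -n "$NS" exec rabbit-server-0 -c rabbitmq -- rabbitmqctl -q list_vhosts); do
  kubectl -n "$NS" exec rabbit-server-0 -c rabbitmq -- \
    rabbitmqctl set_vhost_limits -p "$v" '{"max-connections": -1}'
done

# revertReviveRabbitNodes: make sure no node is left in maintenance mode.
for p in $PODS; do
  kubectl -n "$NS" exec "$p" -c rabbitmq -- rabbitmq-upgrade revive
done
```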

Even with pausing the Operator reconciliation, Kasten was not able to successfully remove all the components. I did find this note that if you delete the rabbitcluster object, the Operator will not interfere (by design). So in my case, I need to destroy the entire cluster before I can restore it.

@Zerpet
Collaborator

Zerpet commented Dec 5, 2023

Thank you for reporting back and for your effort figuring out the right sequence of steps. We can automate the steps to quiesce the rabbitmq cluster from the Operator, and the steps to "activate" the rabbitmq cluster.

I have one question about the quiesce procedure. For the step quiesceWaitRabbitNodesStop: wait for the pods to fully stop, what does "wait for the pods to fully stop" mean/entail?

My understanding was that the sequence would be something like:

  1. Put nodes in maintenance
  2. Set all vhosts max connections to 0
  3. Put nodes out of maintenance
  4. Perform backup

Did I misunderstand something?

@brandtwinchell
Author

@Zerpet

quiesceWaitRabbitNodesStop: wait for the pods to fully stop:

  • when I submit the command to drain the nodes, there is a period where the nodes are still in the process of flushing queues/messages. I wait for that process to stop before setting the vhost max_connections.
    • I could not find any definitive docs describing what happens if the vhost max_connections is set in the middle of a drain; would that sever current connections that are still trying to finish their actions (from when the drain command was issued)?
      • So I decided it was best to just wait and ensure the drain was completely finished before proceeding (see the sketch after this list)
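
One way to approximate that wait from Kubernetes, treating the operator's TCP readiness probe on the AMQP port going false as the signal that the drain has taken hold (names hypothetical; this alone does not prove the queue hand-off has finished):

```bash
# Poll until the pod stops reporting Ready; repeat per node.
until [ "$(kubectl -n rabbitmq-system get pod rabbit-server-0 \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "False" ]; do
  sleep 5
done
```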

Your summary of the steps is correct.

If something like that were built into the operator, that would be great.


github-actions bot commented Feb 4, 2024

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions bot added the stale label on Feb 4, 2024

github-actions bot commented Mar 6, 2024

Closing stale issue due to further inactivity.

github-actions bot added the closed-stale label on Mar 6, 2024
github-actions bot closed this as not planned on Mar 6, 2024