
Upscaling and downscaling rqlite #1247

Closed
AdamJSoftware opened this issue May 5, 2023 · 39 comments

@AdamJSoftware

I was playing around with the Kubernetes operator for rqlite. I got it to scale from 1 to 3 nodes, and then from 3 back to 1 after removing nodes 1 and 2 (node 0 being the leader node). However, I had issues bringing it back up to 3. It seems that the nodes do not want to reconnect and I am not sure why. Looking at nodes 1 and 2, I see that they have an error that says read protobuf length: EOF, and their logs show

 [WARN]  raft: heartbeat timeout reached, not part of a stable configuration or a non-voter, not triggering a leader election

Do I need to somehow re-enable these nodes?

@otoolep
Member

otoolep commented May 5, 2023

Just to be clear, what do you mean by the Kubernetes operator? Are you talking specifically about an Operator someone wrote? If so, can you link to it?

@otoolep
Member

otoolep commented May 5, 2023

Also, if you remove a node you must explicitly tell it to rejoin the cluster. Simply restarting it is not enough. What does the Operator do?

@otoolep
Member

otoolep commented May 5, 2023

Your scenario triggered some improvements in the code, which will improve the situation when the node restarts, but won't entirely solve it. Once you remove a node from the cluster, it must join again via the -join operation -- simply restarting it is not enough. Does that make sense? What does the operator do?
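For illustration, a rejoin from the command line looks roughly like this (hostnames, node IDs, and the data path are placeholders; check the flags against your rqlite version):

```
# A removed node does not come back by itself. Restart it with -join pointing
# at an existing cluster member so it re-registers with the Leader.
rqlited -node-id 2 \
        -http-addr rqlite-2:4001 \
        -raft-addr rqlite-2:4002 \
        -join http://rqlite-0:4001 \
        /rqlite/file/data
```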

@AdamJSoftware
Author

Sorry for the late reply @otoolep . I was referring to this https://rqlite.io/docs/guides/kubernetes/. I simply scaled by modifying the number of replicas.
I was actually wrong about the operator: it's controlled via statefulsets, not an operator; I must have been thinking of something else while writing the issue. Maybe writing an operator could solve the issue?

@otoolep
Member

otoolep commented May 5, 2023

OK, thanks. I should update those docs to say that growing works well, but growing after shrinking requires extra work.

I do not have a huge amount of experience with Kubernetes, so I would be very interested in someone submitting an Operator. I would add it to https://github.com/rqlite/kubernetes-configuration so other folks could benefit from it.

@otoolep
Member

otoolep commented May 5, 2023

@sgalsaleh -- you helped me refine the Kubernetes stateful set YAML. Did you and the Replicated team hit this? I know your product scales rqlite clusters up. Do you ever scale them back down in the field? If so, how do you handle it?

@sgalsaleh
Contributor

sgalsaleh commented May 5, 2023

> Do you ever scale them back down in the field?

No, the part of our product that uses rqlite doesn't yet support moving from HA to non-HA configuration, which is most probably why we haven't hit this issue yet.

@sgalsaleh
Contributor

> I would be very interested in someone submitting an Operator. I would add it to https://github.com/rqlite/kubernetes-configuration so other folks could benefit from it.

At Replicated, one of the reasons we decided to use rqlite is that it doesn't require a Kubernetes operator to provide HA capabilities. It would be great if that remains the case. Introducing an operator would make rqlite more heavyweight, and in many cases, it would require higher privileges in order to deploy a Kubernetes operator.

@AdamJSoftware
Author

> I would be very interested in someone submitting an Operator. I would add it to https://github.com/rqlite/kubernetes-configuration so other folks could benefit from it.
>
> At Replicated, one of the reasons we decided to use rqlite is that it doesn't require a Kubernetes operator to provide HA capabilities. It would be great if that remains the case. Introducing an operator would make rqlite more heavyweight, and in many cases, it would require higher privileges in order to deploy a Kubernetes operator.

You could always have both. The kubernetes operator would be relatively simple anyways. When it comes to scaling down, it would just run a few simple HTTP requests to ensure the nodes get removed from the leader. When it comes to scaling back up, I'm still confused about how to do this properly, but I don't see how it would be too complex. At the end of the day, I don't see the operator taking more than 10-20mb (I usually write mine in Rust). Obviously the higher privileges might be an issue for some, but at that point you can stick to the statefulsets.
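For example, the scale-down part could be little more than something like this (node IDs and the leader address are placeholders; rqlite's /remove endpoint handles the actual removal, so the exact payload should be checked against the docs):

```
# Ask the leader to forget each node that is about to be scaled away.
curl -XDELETE http://rqlite-0:4001/remove -d '{"id": "2"}'
curl -XDELETE http://rqlite-0:4001/remove -d '{"id": "3"}'
```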

@sgalsaleh
Contributor

> The kubernetes operator would be relatively simple anyways.

It might start that way, but I'm sure its functionality would grow over time. It would also be a maintenance burden to implement the same functionality twice just to cover use cases where deploying an operator isn't possible.

> At the end of the day, I don't see the operator taking more than 10-20mb (I usually write mine in Rust).

10-20mb of what? Memory? Disk space? I was referring to resources in general, e.g. CPU, Memory, Disk space, etc...

@sgalsaleh
Contributor

@otoolep is there a technical limitation preventing the nodes from rejoining the cluster automatically if, for example, the -bootstrap-expect field is set/updated accordingly along with the number of replicas? That's what we do today when scaling rqlite up initially, but I'm wondering if that would also work when scaling up after scaling down.
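For context, the DNS-based bootstrap I'm referring to is driven by launch flags roughly like these (the service name and data path are placeholders, following the rqlite Kubernetes guide):

```
# Each pod resolves the headless service to find its peers, and waits until
# -bootstrap-expect nodes are visible before forming a new cluster.
rqlited -disco-mode=dns \
        -disco-config='{"name":"rqlite-svc-internal"}' \
        -bootstrap-expect 3 \
        /rqlite/file/data
```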

@sgalsaleh
Contributor

I just realized that even scaling down isn't automatic when reducing the number of replicas and the -bootstrap-expect field, which is probably the expected behavior in that situation.

@AdamJSoftware
Author

I was referring to memory usage. CPU and disk usage should be minimal. All an operator does is listen for changes through the kubernetes API and then apply changes through the kubernetes API. At the end of the day, there is no reason why we can't keep using statefulsets. The operator can just be something that adds additional features. I'm assuming the operator will leverage statefulsets anyways; it just automates some tasks that are not feasible through statefulsets.

@otoolep
Member

otoolep commented May 8, 2023

> @otoolep is there a technical limitation preventing the nodes from rejoining the cluster automatically if, for example, the -bootstrap-expect field is set/updated accordingly along with the number of replicas?

@sgalsaleh -- I don't follow you 100%. Can you give me a step-by-step listing of the sequence you want to be addressed automatically? It may be possible.

@sgalsaleh
Contributor

@otoolep you mentioned that if you remove a node you must explicitly tell it to rejoin the cluster. From the description of the main issue here, it seems that 2 nodes were removed (manually?), and that simply scaling the number of replicas back up isn't enough to have the 2 nodes join the cluster again.

Now I'm wondering, in that scenario, why wouldn't scaling the replicas back up be sufficient to have the 2 nodes rejoin the cluster? If it should be sufficient, is it possible to implement, or is there a technical limitation that prevents that?

@otoolep
Member

otoolep commented May 9, 2023

"simply scaling the number of replicas back up" implies the following happened:

  • replica count scaled down to 1
  • k8s shuts down the nodes gracefully (SIGTERM)
  • the remaining leader is still looking for the missing nodes; not sure what the state of things is now.

Couple of things:

  • when you scale down, doesn't k8s destroy the nodes and their persistent disks? If so, they can't "rejoin" -- they will be new nodes coming back up. It might work though; I'm not sure, and would need to test it.
  • I actually just made a new change to master, which may help here, see: Support removing self from cluster on shutdown #1253. With this change in place, a node that is gracefully shut down (SIGTERM) will first remove itself from the cluster (you have to pass -raft-cluster-remove-shutdown=true to the node). I think this might work well in this situation, since you wouldn't need to do the manual remove operations. Scaling down would then leave you with a fully functional 1-node cluster, and scaling up may then work nicely again.

@jmordica requested -raft-cluster-remove-shutdown and it's working well for him. I think it looks like what we want for k8s setups generally? @sgalsaleh @AdamJSoftware ?
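A sketch of how the new flag might be added to the pod's launch command (the other flags, names, and paths here are placeholders):

```
# With this flag set, a node that receives SIGTERM (for example, a pod being
# scaled away by a StatefulSet) removes itself from the cluster before exiting.
rqlited -raft-cluster-remove-shutdown=true \
        -node-id 2 \
        -http-addr rqlite-2:4001 \
        -raft-addr rqlite-2:4002 \
        /rqlite/file/data
```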

@jmordica

jmordica commented May 9, 2023

Using this cmd flag is only recommended when using local disk and NOT PVCs.

You want the new node coming online to have no persisted data when joining the cluster if you're using this flag.

@otoolep
Member

otoolep commented May 9, 2023

OK, I have a question for you folks. Say I change the replica count in this file from 3->1->3, one step after another. Does k8s guarantee I get the same nodes and states again when I return to 3? Or are the PVCs deprovisioned automatically by k8s when replica count goes to 1?

I guess I could test this myself, but want to know what people think should happen.

@jmordica

jmordica commented May 9, 2023

When you scale down a StatefulSet, Kubernetes does not automatically remove the associated PVCs. The PVCs will still persist even after you've removed some replicas. This is by design to ensure that data is not accidentally lost when scaling down. If you want to remove the PVCs, you'll need to delete them manually.
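For example (the label selector and PVC names below are illustrative; they depend on how the StatefulSet and its volumeClaimTemplates are named):

```
# PVCs left behind by a scale-down. Delete them if the old state should not be
# reused when scaling back up.
kubectl get pvc -l app=rqlite
kubectl delete pvc rqlite-file-data-rqlite-1 rqlite-file-data-rqlite-2
```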

@AdamJSoftware
Author

I think this is yet again where an operator becomes useful. The operator would automatically delete these PVCs. One way around it is to have the replica automatically delete all of its data on a graceful shutdown, but I feel like that's dangerous.

@AdamJSoftware
Author

Or maybe have some flag that will reset the statefulset on startup. Not entirely sure what the best approach is.

@otoolep
Member

otoolep commented May 9, 2023

I'll do more testing later, but I think using the new flag could be useful. It may just come down to "do you want the leader node to continue trying to contact pods that have been gracefully shut down, or not? Do you want to have to explicitly remove nodes from the cluster after they are shut down?" The answer may vary from person to person.

As for adding functionality to rqlite itself to clean up Kubernetes-level resources, such as deleting all its data on graceful shutdown (if that's what you're suggesting) -- no way I'm adding that. :-) Things like that should only be done by the operator explicitly.

@otoolep
Member

otoolep commented May 9, 2023

For example -- and I'm not sure if anyone is actually suggesting this -- rqlite will never reach out to the Kubernetes API and delete PVCs on shutdown. That would be crossing a line, and is something that should only ever be done by a separate system or process.

@AdamJSoftware
Author

@otoolep yeah I agree with you, I think that is the right choice. If I have time I might experiment with an operator and I'll link it to this issue if any good progress is made.

@otoolep
Member

otoolep commented May 9, 2023

@AdamJSoftware -- if you create an Operator that is useful and you believe it would help others, I would always be happy to add it to https://github.com/rqlite/kubernetes-configuration for reference by others.

@otoolep
Member

otoolep commented May 11, 2023

BTW I'm not yet convinced there is an issue here. If you could list out, step by step, what you are doing, I can see if I can repro it. Show us the commands you issue at each stage, what you expect should happen, and what you actually see. That would help.

@AdamJSoftware
Author

AdamJSoftware commented May 11, 2023

Just to rephrase: I follow the steps listed on your documentation page (https://rqlite.io/docs/guides/kubernetes/). I then shell into the first container and tell it to remove nodes 2 and 3 (.remove ). I scale it down to one node by updating the statefulset to 1 replica, test and make sure everything works (which it does), and then try to scale it back up to 3 nodes by updating the statefulset. It seems that nodes 2 and 3 are unable to properly communicate with the leader (node 1).
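In command form, it's roughly the following (pod names and node IDs are illustrative, and this assumes the rqlite CLI is available in the container image):

```
# Remove nodes 2 and 3 via the rqlite shell on the first pod, scale down,
# then scale back up.
kubectl exec -it rqlite-0 -- rqlite
#   inside the shell: .remove <node 2 id>, then .remove <node 3 id>
kubectl scale statefulsets rqlite --replicas=1   # everything still works
kubectl scale statefulsets rqlite --replicas=3   # nodes 2 and 3 fail to rejoin
```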

It's a bit of a weird use case, but at a larger scale it makes sense. There are some points where more replicas are required to handle a larger load, and some points where fewer are required to handle a smaller load and save money. It's not a critical bug, but it might definitely be a pain point when someone eventually tries to scale down and then back up.

@otoolep
Member

otoolep commented May 11, 2023

OK, I think what you are trying to do should work, but let me test it more closely and get back to you.

> There are some points where more replicas are required to handle a larger load

You should be aware that rqlite does not scale horizontally, in the same way that Consul and etcd don't scale horizontally either.

Adding more nodes to rqlite will generally not increase its performance. It can increase read performance, but your clients must then use a read-consistency of none to avoid hitting the Leader. Adding more nodes gives you more copies of your data, so you have more redundancy, but you do not have more performance.
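For example, a read served by whichever node you query, without being forwarded to the Leader, looks something like this (host and table names are placeholders):

```
# level=none returns whatever the local node has, without consulting the Leader.
curl -G 'http://rqlite-2:4001/db/query?level=none' \
     --data-urlencode 'q=SELECT COUNT(*) FROM foo'
```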

@otoolep
Member

otoolep commented May 11, 2023

https://rqlite.io/docs/faq/#rqlite-is-distributed-does-that-mean-it-can-increase-sqlite-performance

@AdamJSoftware
Author

Yeah that makes sense. I knew write performance wouldn't increase, but I assumed read would (guess it's only in certain cases). Were you able to replicate the issue, though? I still believe that we should be able to scale up and down (which makes sense if you want increased read performance).

@otoolep
Member

otoolep commented May 19, 2023

Haven't tried it yet, was focused on getting 7.18.0 out -- which I just did. Should have time to test now.

@otoolep
Member

otoolep commented May 20, 2023

So this is what I see. I am testing with minikube, the configs here, and rqlite 7.18.1.

  1. A 1-node cluster comes up fine.
  2. Scale to 3 nodes via kubectl scale statefulsets rqlite --replicas=3 works fine.
  3. Scale back down to 1 node via kubectl scale statefulsets rqlite --replicas=1 shuts down the nodes that came up in step 2. The first node loses leadership, but can't get re-elected either (since the nodes are still in the cluster config).
  4. Scale back up via kubectl scale statefulsets rqlite --replicas=3 works fine, and you've got the 3-node cluster back again.

Then I did the steps again, except this time I added "-raft-cluster-remove-shutdown=true" as a launch option for rqlited. Everything goes fine until I try to scale up again, and the restarted nodes get:

2023-05-20T18:01:24.586Z [WARN] raft: heartbeat timeout reached, not part of a stable configuration or a non-voter, not triggering a leader election

which is what @AdamJSoftware saw.

So this is what I think:

  • In the first case the system is behaving exactly as one would expect. The nodes are shut down, but are not removed from the cluster. They are restarted when the cluster is scaled to 3 a second time, and everything works fine. From rqlite's point of view, this is like the nodes failed, the cluster lost quorum, and the single remaining node sat there stuck until a quorum came back.
  • In the second case, the cluster is scaled back down and the nodes then remove themselves from the cluster. It has always been the case that the reason to remove a node was that you were deprovisioning it, and it would never rejoin again. In other words, the reason you remove nodes is because they have failed and are never coming back. What we really want "scale to 3 a second time" to actually do is create 2 brand new nodes.

What happens with the two nodes that were removed, and the reason they log that message, is that the cluster config sitting in their data directory contains no reference to themselves. Does that make sense? That is because they were gracefully removed from the cluster (and updated the cluster config in their data directory) before they were terminated. So they come up (due to the cluster being scaled to 3 nodes a second time), read the config underneath themselves, and find it doesn't refer to them. So they are not part of a "stable configuration".

I need to make this clearer in the docs. But scaling down requires a) the scaled down nodes to be removed from the cluster, AND b) the data directory that sat under those nodes to be deleted.

Further feedback welcome.

@otoolep
Member

otoolep commented May 20, 2023

Well, that's interesting -- I just tried to repro the second set of steps (with -raft-cluster-remove-shutdown=true) outside of Kubernetes and it seems to work fine! So clearly there is something about the bootstrapping logic in Kubernetes that is different (I was using explicit joins with my non-Kubernetes test). So it seems like it should work after all.

I'll need to dig into the code to figure out why the behaviour on my machine, just running the nodes, is different than what I see on minikube. My original hypothesis may have been incorrect.

@otoolep
Member

otoolep commented May 20, 2023

This may also simply be a timing issue -- self-removal of a node gets to complete some activities (or not) when running on my local machine, but the timing is slightly different in minikube. I'll continue to look into this.

@otoolep
Member

otoolep commented May 20, 2023

OK, I think I see the issue -- and it's specific to DNS-based clustering, which Kubernetes uses. I need to think about this a bit more.

It's only a problem if you want to scale up using nodes that you previously explicitly removed from the cluster.

@otoolep
Member

otoolep commented May 22, 2023

I believe I have fixed this issue -- the logic that runs bootstrapping of a node was too conservative, refusing to reach out to the peers returned by DNS if the node had already been part of a cluster. This was preventing scaling up after previously scaling down. But there is no harm in reaching out to those nodes again, so I just removed the logic which prevented it.

I tested a 3-node cluster on minikube and could scale it up and down, and up again, and it all worked fine.

@otoolep
Member

otoolep commented May 22, 2023

Scaling up and down, and up again, should work well now on k8s.

@otoolep
Member

otoolep commented May 22, 2023

Fix will be in 7.18.2.

@otoolep otoolep closed this as completed May 22, 2023
@otoolep
Member

otoolep commented May 22, 2023

Updated the docs: https://rqlite.io/docs/guides/kubernetes/#scaling-the-cluster
