Upscaling and downscaling rqlite #1247
Just to be clear, what do you mean by the Kubernetes operator? Are you talking specifically about an Operator someone wrote? If so, can you link to it?
Also, if you remove a node you must explicitly tell it to rejoin the cluster. Simply restarting it is not enough. What does the Operator do?
Your scenario triggered some improvements in the code, which will improve the situation when the node restarts, but won't entirely solve it. Once you remove a node from the cluster, it must join again via the -
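To make the remove/rejoin distinction concrete, here is a sketch of the two operations using rqlite's HTTP remove endpoint and the `-join` startup flag. Hostnames, ports, node IDs, and the data path are illustrative placeholders, not taken from this thread:

```shell
# Ask the leader to remove node 2 from the cluster configuration
# (illustrative address and node ID; adjust for your deployment).
curl -XDELETE http://leader:4001/remove -d '{"id": "2"}'

# Simply restarting node 2 afterwards is not enough: it must be told
# to rejoin explicitly, e.g. by starting it with a join address.
rqlited -node-id 2 \
  -http-addr node2:4001 -raft-addr node2:4002 \
  -join http://leader:4001 \
  /path/to/data
```

These commands assume a running cluster, so they are a sketch of the flow rather than something runnable standalone.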
Sorry for the late reply @otoolep . I was referring to this https://rqlite.io/docs/guides/kubernetes/. I simply scaled by modifying the number of replicas.
OK, thanks. I should update those docs to say that growing works well, but growing after shrinking requires extra work. I don't have a huge amount of experience with Kubernetes, so I would be very interested in someone submitting an Operator. I would add it to https://github.com/rqlite/kubernetes-configuration so other folks could benefit from it.
@sgalsaleh -- you helped me refine the Kubernetes stateful set YAML. Did you and the Replicated team hit this? I know your product scales rqlite clusters up. Do you ever scale them back down in the field? If so, how do you handle it? |
No, the part of our product that uses rqlite doesn't yet support moving from HA to non-HA configuration, which is most probably why we haven't hit this issue yet. |
At Replicated, one of the reasons we decided to use rqlite is that it doesn't require a Kubernetes operator to provide HA capabilities. It would be great if that remains the case. Introducing an operator would make rqlite more heavyweight, and in many cases, it would require higher privileges in order to deploy a Kubernetes operator. |
You could always have both. The Kubernetes operator would be relatively simple anyway. When it comes to scaling down, it would just run a few simple HTTP requests to ensure the nodes get removed from the leader. When it comes to scaling back up, I'm still confused on how to do this properly, but I don't see how it would be too complex. At the end of the day, I don't see the operator taking more than 10-20mb (I usually write mine in Rust). Obviously the higher privileges might be an issue for some, but at that point you can stick to the StatefulSets.
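The scale-down path described here — a few HTTP requests to the leader followed by resizing the StatefulSet — might look something like this. The StatefulSet name, service DNS name, and node-ID scheme are hypothetical; only the `/remove` endpoint shape comes from rqlite itself:

```shell
# Hypothetical operator scale-down step: for each pod index at or above
# the desired replica count, ask the leader to remove that node, then
# shrink the StatefulSet so Kubernetes terminates the pods.
DESIRED=1
CURRENT=3
for i in $(seq "$DESIRED" $((CURRENT - 1))); do
  # Assumes node IDs match pod ordinals and the leader is rqlite-0.
  curl -XDELETE "http://rqlite-0.rqlite-svc:4001/remove" \
       -d "{\"id\": \"$i\"}"
done
kubectl scale statefulset rqlite --replicas="$DESIRED"
```

An operator would do the same thing via the Kubernetes API rather than shelling out, but the sequence of operations is the point here.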
It might start that way, but I'm sure its functionality would grow over time. It would also be a maintenance burden to implement the same functionality twice just to cover use cases where deploying an operator isn't possible.
10-20mb of what? Memory? Disk space? I was referring to resources in general, e.g. CPU, memory, disk space, etc.
@otoolep is there a technical limitation preventing the nodes from joining the cluster back automatically if for example the
I just realized that even scaling down isn't automatic when reducing the number of replicas and the
I was referring to memory usage. CPU and disk should be minimal. All an operator does is listen for changes through the Kubernetes API and then apply changes through the Kubernetes API. At the end of the day, there is no reason why we can't keep using StatefulSets. The operator can just be something that adds additional features. I'm assuming the operator will leverage StatefulSets anyway; it just automates some tasks that are not feasible through StatefulSets alone.
@sgalsaleh -- I don't follow you 100%. Can you give me a step-by-step listing of the sequence you want to be addressed automatically? It may be possible.
@otoolep you mentioned that if you remove a node you must explicitly tell it to rejoin the cluster. From the description of the main issue here, it seems that 2 nodes were removed (manually?), and that simply scaling the number of replicas back up isn't enough to have the 2 nodes join the cluster again. Now, I'm wondering, in that scenario, why wouldn't scaling the replicas back up be sufficient to have the 2 nodes rejoin the cluster? If it should be sufficient, is it possible to implement, or is there a technical limitation that prevents that?
"simply scaling the number of replicas back up" implies the following happened:
Couple of things:
@jmordica requested
Using this cmd flag is only recommended when using local disk and NOT PVCs. You want the new node coming online to not have any persisted data when joining the cluster when using this flag.
OK, I have a question for you folks. Say I change the replica count in this file from 3->1->3, one step after another. Does k8s guarantee I get the same nodes and states again when I return to 3? Or are the PVCs deprovisioned automatically by k8s when replica count goes to 1? I guess I could test this myself, but want to know what people think should happen. |
When you scale down a StatefulSet, Kubernetes does not automatically remove the associated PVCs. The PVCs will still persist even after you've removed some replicas. This is by design, to ensure that data is not accidentally lost when scaling down. If you want to remove the PVCs, you'll need to delete them manually.
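For reference, the manual PVC cleanup after a scale-down looks like this. The label selector and PVC names assume a `volumeClaimTemplate` named `rqlite-file` on a StatefulSet named `rqlite`, as in the example configs; substitute your own names:

```shell
# List the PVCs left behind after scaling the StatefulSet down.
kubectl get pvc -l app=rqlite

# Delete the claims that belonged to the removed replicas
# (StatefulSet PVCs are named <template>-<statefulset>-<ordinal>).
kubectl delete pvc rqlite-file-rqlite-1 rqlite-file-rqlite-2
```

Deleting the PVC also releases (or deletes, depending on the reclaim policy) the underlying PersistentVolume, so the data under those replicas is gone for good.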
I think yet again this is where an operator becomes useful. The operator would automatically delete these PVCs. One way around it is to have the replica automatically delete all of its data on a graceful shutdown, but I feel like that's dangerous.
Or maybe have some flag that will reset the statefulset on startup. Not entirely sure what the best approach is. |
I'll do more testing later, but I think using the new flag could be useful. It may just come down to "do you want the leader node to continue trying to contact pods that have been gracefully shut down, or not? Do you want to have to explicitly remove nodes from the cluster after they are shut down?" The answer may vary from person to person. As for adding functionality to rqlite itself to clean up Kubernetes-level resources, such as deleting all its data on graceful shutdown (if that's what you're suggesting), no way I'm adding that. :-) Things like that should only be done by the operator explicitly.
For example -- and I'm not sure if anyone is actually suggesting this -- rqlite will never reach out to the Kubernetes API and delete PVCs on shutdown. That would be crossing a line, and is something that should only ever be done by a separate system or process. |
@otoolep yeah I agree with you, I think that is the right choice. If I have time I might experiment with an operator and I'll link it to this issue if any good progress is made. |
@AdamJSoftware -- if you create an Operator that is useful and you believe it would help others, I would always be happy to add it to https://github.com/rqlite/kubernetes-configuration for reference by others. |
BTW I'm not yet convinced there is an issue here. If you could list out step-by-step what you're doing, I can see if I can repro it. Show us the commands you issue at each stage, what you expect should happen, and what you actually see. That would help.
Just to rephrase. I follow the steps listed on your documentation page (https://rqlite.io/docs/guides/kubernetes/). I then shell into the first container and tell it to remove nodes 2 and 3 (.remove ). I scale it down to one node by updating the StatefulSet to 1 replica. I test and make sure everything works (which it does). And then I try to scale it back up to 3 nodes by updating the StatefulSet. It seems that nodes 2 and 3 are unable to properly communicate with the leader (node 1). It's a bit of a weird use case, but at a larger scale it makes sense. There are some points where more replicas are required to handle a larger load, and some points where fewer are required to handle a smaller load and save money. It's not a critical bug, but it might definitely be a pain point when someone eventually tries to scale down and then back up.
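The sequence described above can be expressed as commands. The StatefulSet/pod names follow the example configs (`rqlite`, pods `rqlite-0` through `rqlite-2`); the `.remove` arguments are illustrative since the thread elides the exact node IDs used:

```shell
# 1. Shell into the first pod and remove the other two nodes from the
#    cluster using the rqlite shell's .remove command.
kubectl exec -it rqlite-0 -- rqlite
#   .remove <node-2-id>
#   .remove <node-3-id>

# 2. Scale the StatefulSet down to a single replica.
kubectl scale statefulset rqlite --replicas=1

# 3. Later, scale back up to three replicas. Before the 7.18.2 fix
#    discussed below, the two returning nodes failed to rejoin here.
kubectl scale statefulset rqlite --replicas=3
```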
OK, I think what you are trying to do should work, but let me test it more closely and get back to you.
You should be clear that rqlite does not horizontally scale, in the same way that Consul and etcd don't horizontally scale either. Adding more nodes to rqlite will generally not increase its performance. It can increase read performance, but your clients must then use a read-consistency of
Yeah, that makes sense. I knew write performance wouldn't increase, but I assumed read performance would (guess it's only in certain cases). Were you able to replicate the issue, though? I still believe that we should be able to scale up and down (which makes sense if you want increased read performance).
Haven't tried it yet, was focused on getting 7.18.0 out -- which I just did. Should have time to test now. |
So this is what I see. I am testing with minikube, the configs here, and rqlite 7.18.1.
Then I did steps again, except this time I added
which is what @AdamJSoftware saw. So this is what I think:
What happens with the two nodes that were removed, and the reason they log that message, is that the cluster config sitting in their data directory contains no reference to themselves. Does that make sense? That is because they were gracefully removed from the cluster (and updated the cluster config in their data directory) before they were terminated. So they come up (due to the cluster being scaled to 3 nodes a second time), read the config underneath themselves, and find it doesn't refer to them. So they are not part of a "stable configuration". I need to make this clearer in the docs. But scaling down requires a) the scaled-down nodes to be removed from the cluster, AND b) the data directory that sat under those nodes to be deleted. Further feedback welcome.
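Under that hypothesis, a safe scale-down needs both steps a) and b). A sketch of what that looks like on Kubernetes, with the same illustrative names as before (leader at `rqlite-0.rqlite-svc`, PVC template `rqlite-file`, node IDs matching pod names — all assumptions, not from the thread):

```shell
# a) Remove the departing nodes from the cluster config via the leader,
#    then shrink the StatefulSet.
curl -XDELETE http://rqlite-0.rqlite-svc:4001/remove -d '{"id": "rqlite-2"}'
curl -XDELETE http://rqlite-0.rqlite-svc:4001/remove -d '{"id": "rqlite-1"}'
kubectl scale statefulset rqlite --replicas=1

# b) Delete the data that sat under those nodes (here, their PVCs), so
#    a later scale-up brings them back with no stale cluster config.
kubectl delete pvc rqlite-file-rqlite-1 rqlite-file-rqlite-2
```

With step b) done, the returning pods have empty data directories and bootstrap like brand-new nodes instead of finding a config that no longer refers to them.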
Well, that's interesting -- I just tried to repro the second set of steps. I'll need to dig into the code to figure out why the behaviour on my machine, just running the nodes, is different from what I see on minikube. My original hypothesis may have been incorrect.
This may also simply be a timing issue -- self-removal of a node gets to complete some activities (or not) when running on my local machine, but timing is slightly different in minikube. I'll continue to look into this.
OK, I think I see the issue -- and it's specific to DNS-based clustering, which Kubernetes uses. I need to think about this a bit more. It's only a problem if you want to scale up using nodes that you previously explicitly removed from the cluster.
I believe I have fixed this issue -- the logic that runs bootstrapping of a node was too conservative, refusing to reach out to the peers returned by DNS if the node had already been part of a cluster. This was preventing scaling up after being previously scaled down. But there is no harm in reaching out to those nodes again, so I just removed the logic which prevented it. I tested a 3-node cluster on minikube and could scale it up and down, and up again, and it all worked fine.
Scaling up and down, and up again, should work well now on k8s. |
Fix will be in 7.18.2. |
Updated the docs: https://rqlite.io/docs/guides/kubernetes/#scaling-the-cluster |
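With the fix in place, the round trip that originally failed should reduce to plain StatefulSet resizes (StatefulSet name `rqlite` assumed from the example configs; see the updated docs for any remaining caveats about removing nodes before shrinking):

```shell
# Scale up, down, and back up again -- the case fixed in 7.18.2.
kubectl scale statefulset rqlite --replicas=3
kubectl scale statefulset rqlite --replicas=1
kubectl scale statefulset rqlite --replicas=3
```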
I was playing around with the Kubernetes operator for rqlite. I got it to scale from 1 to 3, and then from 3 to 1 after removing nodes 1 and 2 (node 0 being the leader node). However, I had issues bringing it back up to 3. It seems that the nodes do not want to reconnect, and I am not sure why. Looking at nodes 1 and 2, I see that they have an error that says "read protobuf length: EOF", and in their logs it shows
Do I need to somehow re-enable these nodes?