
Autoscaling Kubernetes on Jetstream with Cluster Autoscaler #15

Closed
zonca opened this issue Apr 20, 2019 · 34 comments
zonca (Owner) commented Apr 20, 2019

Cluster Autoscaler is the official component for autoscaling Kubernetes clusters on AWS and Google Cloud.
OpenStack support was under development.
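As background, Cluster Autoscaler typically runs as a Deployment inside the cluster, pointed at a cloud provider and at a node group with min/max size bounds. A minimal sketch of the container spec (the image tag, the node-group name `default-worker`, and the bounds are illustrative assumptions, not from this thread):

```yaml
# Illustrative fragment of a cluster-autoscaler Deployment spec.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.15.0   # tag is a placeholder
    command:
      - ./cluster-autoscaler
      - --cloud-provider=magnum            # OpenStack Magnum provider
      - --nodes=1:5:default-worker         # min:max:node-group (assumed name)
      - --cloud-config=/config/cloud-config  # OpenStack credentials file
```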

zonca (Owner) commented Apr 20, 2019

OpenStack support was merged in March: kubernetes/autoscaler#1690
I will test and report back here.

zonca (Owner) commented Apr 20, 2019

It is based on Magnum, so we would have to abandon Kubespray anyway. Still, I think it is worth a try, as long as it doesn't require too much effort.

zonca (Owner) commented Jun 14, 2019

Deployment with Magnum works (see #16); I'll work on this next.

zonca (Owner) commented Sep 5, 2019

@rsignell-usgs @julienchastang @ktyle

I deployed the autoscaler on top of the Magnum deployment; it authenticates fine, and when there are many pods pending it requests a new node from the OpenStack API.
The only problem is that it times out after 10 minutes, while Jetstream takes about 15 minutes to provision a new node and configure it for Kubernetes.

I asked XSEDE if there is anything we can do to speed this up: if users are waiting for a Jupyter Notebook, ideally they should wait ~5 minutes. Otherwise, I'll have to recompile the container, as the wait time is a constant in the Go codebase.
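For reference, a scale-up cycle like the one described can be watched from the client side. A sketch (the autoscaler's namespace and deployment name are assumptions about how it was installed):

```shell
# Pods stuck in Pending are what trigger the autoscaler to request a node
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Watch new nodes join as OpenStack finishes provisioning them
kubectl get nodes --watch

# Follow the autoscaler's scale-up/timeout decisions (deployment name assumed)
kubectl -n kube-system logs -f deployment/cluster-autoscaler
```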

julienchastang (Contributor) commented:

One thing I have noticed is really long dpkg processes upon VM startup when doing "manual" scaling. Perhaps if we had an updated image, the dpkg step would not take as long. Ping @jlf599.

jlf599 commented Sep 5, 2019

On the IU cloud, all new deployments get a forced update. If your image is old and has lots of pending updates, that might be the cause.

You can override it with cloud-init by doing something like this:

#cloud-config
packages: []

package_update: false
package_upgrade: false
package_reboot_if_required: false

final_message: "Boot completed in $UPTIME seconds"


in a script. This is noted here -- http://wiki.jetstream-cloud.org/Using+cloud-init+scripts+with+the+Jetstream+API

From the CLI it's invoked with the --file switch at launch. With Terraform or other tools I'm less sure how to include it, but hopefully it's possible.

Looks like it might be:
https://www.terraform.io/docs/providers/openstack/r/compute_instance_v2.html
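Following up on the Terraform pointer above, the OpenStack provider's compute instance resource accepts a `user_data` argument that can carry exactly this kind of cloud-config. A minimal sketch (the resource name, image, and flavor values are placeholders):

```terraform
resource "openstack_compute_instance_v2" "node" {
  name        = "k8s-node"
  image_name  = "Fedora-AtomicHost-28-Updated-9-6-2019"
  flavor_name = "m1.medium"

  # Pass a cloud-config like the one above, e.g. to skip the forced package update
  user_data = file("cloud-init.yaml")
}
```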

zonca (Owner) commented Sep 5, 2019

Thanks! It looks like a good idea. I need to understand how to modify the Heat templates in Magnum to provide a cloud-config.

possibly https://github.com/openstack/heat-templates/blob/master/hot/software-config/example-templates/example-cloud-init.yaml

I'll try and report back here.

julienchastang (Contributor) commented:

Also, simply working with a more up-to-date image that does not have so many out-of-date Debian packages could be another solution, and maybe a more secure one too.

jlf599 commented Sep 5, 2019

That is the best solution. However, if it's the Fedora image that Magnum depends on, we've been trying to update to CoreOS (as that's what Magnum is moving to) and it's just not working correctly. If you're using an Ubuntu image, creating an up-to-date snapshot on a regular basis is not a bad idea.

jlf599 commented Sep 5, 2019

Also, @zonca is this issue directly related to the ticket you opened today?

zonca (Owner) commented Sep 5, 2019

@jlf599 yes, exactly

jlf599 commented Sep 5, 2019

Can you try an experiment: update the image you're using with all of the latest updates and see whether you get faster boot/cluster growth?

zonca (Owner) commented Sep 5, 2019

I'm using Fedora-AtomicHost-28-20180625, which was the only one working with Magnum. @jlf599, should I update that?

jlf599 commented Sep 5, 2019

If Julien is seeing lots of dpkg updates, that's an Ubuntu/Debian-based image, though.

You could yum update the Fedora image, snapshot it, and try using it. It should work, but we all know how "should" messes with things. :)
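The update-and-snapshot cycle described here can be done with the OpenStack CLI. A sketch with placeholder names (the instance name, flavor, key, SSH user, and new image name are all assumptions):

```shell
# Boot a throwaway instance from the stale image
openstack server create --image Fedora-AtomicHost-28-20180625 \
    --flavor m1.small --key-name mykey update-vm

# Apply updates on the instance, then shut it down cleanly
ssh fedora@<instance-ip> "sudo yum -y update && sudo shutdown -h now"

# Snapshot the updated instance as a new image
openstack server image create --name Fedora-AtomicHost-28-updated update-vm
```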

zonca (Owner) commented Sep 5, 2019

@julienchastang is referring to the kubespray deployment; with Magnum I have Fedora, but it could be the same problem. Anyway, it doesn't hurt to have an updated image ;)

jlf599 commented Sep 5, 2019

That image for kubespray could probably stand to be updated too, then. Can you let me know how things go so I can handle the ticket accordingly?

zonca (Owner) commented Sep 5, 2019

See the attached log of an instance (console.txt); I think there is a big delay well before packages are updated.

zonca (Owner) commented Sep 5, 2019

@jlf599 even with an updated image I still see ~15 min provisioning of 1 node

zonca (Owner) commented Sep 5, 2019

@jlf599 I realized I didn't actually update the image; I am retesting it now.

zonca (Owner) commented Sep 6, 2019

That was it: with my updated image Fedora-AtomicHost-28-20180625-updated-20190905, it completes in 4 minutes. Thanks! Now I'll get back to testing the autoscaler.

zonca (Owner) commented Sep 6, 2019

My impression is that the older kernel was hanging on something, while the new one works fine.
I attach the boot log for reference, but there is no need to dig into this.
console_updated_image.txt

ktyle commented Sep 6, 2019

Nice ... I just tried it (looks like the image name is now Fedora-AtomicHost-28-Updated-9-6-2019) and it completed in a little less than 8 minutes ... way faster than with the last image.

jlf599 commented Sep 6, 2019 via email

jlf599 commented Sep 6, 2019

Okay. We've deactivated the old Atomic image and put this one on both clouds:

035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019

We'll be updating that every month or so with our featured images.

Thanks for pointing out the issue!

zonca (Owner) commented Sep 12, 2019

@jlf599 unfortunately it looks like the new image, both your version and mine, gets to "CREATE_COMPLETE", but the Kubernetes cluster is broken.

For example, in my current deployment the master node, even though it appears to be running, is not recognized by Kubernetes:

kubectl get nodes
NAME                        STATUS   ROLES    AGE   VERSION
k8s-7hkcfyagetqm-minion-0   Ready    <none>   5m    v1.11.1

It looks like there are some specific instructions on how to update the images for Magnum: https://docs.openstack.org/magnum/mitaka/dev/build-atomic-image.html

Can you please remind me which version of OpenStack is on Jetstream at IU? And can you recover the old image so I can try to update it using those instructions?
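When a Magnum cluster reports CREATE_COMPLETE but nodes are missing from `kubectl get nodes`, the underlying Heat stack can be inspected for the resource that misbehaved. A sketch (requires the Magnum and Heat CLI plugins; the stack ID is a placeholder):

```shell
# Cluster status as Magnum sees it
openstack coe cluster list

# Find the Heat stack backing the cluster, then inspect its nested resources
openstack stack list
openstack stack resource list -n 2 <stack-id>
```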

jlf599 commented Sep 12, 2019

We're on the Rocky release presently, planning to go to Stein by year's end.

We didn't delete the old image -- just made it inactive -- so we can re-enable it and see about getting it updated.

jlf599 commented Sep 12, 2019

I reactivated

5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30 | Fedora-AtomicHost-28-20180625

and deactivated

035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019

I'm trying to find Rocky instructions like those above for Mitaka but haven't found them yet. Granted, I haven't spent much time on it yet.

zonca (Owner) commented Sep 12, 2019

thanks @jlf599 for the prompt response!
I cannot find that image:

Could not find resource 5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30

zonca (Owner) commented Sep 12, 2019

I had the wrong hash; however, that image is still deactivated:

openstack image list | grep Atomic
| 4043bfa2-289b-4112-961a-64e2aeaedc3f | Fedora-Atomic-27-20180419                             | active      |
| 358cc5d0-e780-4ef5-acf6-9e37de560efa | Fedora-AtomicHost-28-20180625-updated-20190905        | active      |
| 035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019                 | deactivated |

jlf599 commented Sep 12, 2019

Hrm:

5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30 | Fedora-AtomicHost-28-20180625 | active

I did deactivate the one that wasn't working. Do you want that one back on?

jlf599 commented Sep 12, 2019

(openstack) [IU] [Entropy] jeremy ~-->os image set --activate 035d9554-086e-40f9-8da2-db023ea4b941

Back on

zonca (Owner) commented Sep 12, 2019

OK, I tested with the older Fedora Atomic 27, and that worked fine! It doesn't have the slow boot of Fedora Atomic 28, and everything now works.

zonca (Owner) commented Sep 12, 2019

@julienchastang @jlf599 @ktyle @rsignell-usgs

See the tutorial: https://zonca.github.io/2019/09/kubernetes-jetstream-autoscaler.html

I'll do more testing in the next weeks and improve the tutorial, but everything seems to be working fine.

zonca (Owner) commented Sep 24, 2019

Next: simulate load, see #23.

zonca closed this as completed Nov 1, 2019
zonca added a commit that referenced this issue May 6, 2020
Implement data volume via block store