
Autoscaling Kubernetes on Jetstream with Cluster Autoscaler #15

Closed
zonca opened this issue Apr 20, 2019 · 34 comments
zonca (Owner) commented Apr 20, 2019

Cluster Autoscaler is the official component for autoscaling Kubernetes clusters on AWS and Google Cloud.
OpenStack support was under development.
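As background, Cluster Autoscaler typically runs as a Deployment inside the cluster, pointed at a cloud provider and at a node group with min/max size bounds. A minimal sketch of the container spec (the image tag, the node-group name `default-worker`, and the bounds are illustrative assumptions, not from this thread):

```yaml
# Illustrative fragment of a cluster-autoscaler Deployment spec.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.15.0   # tag is a placeholder
    command:
      - ./cluster-autoscaler
      - --cloud-provider=magnum            # OpenStack Magnum provider
      - --nodes=1:5:default-worker         # min:max:node-group (assumed name)
      - --cloud-config=/config/cloud-config  # OpenStack credentials file
```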

zonca (Owner) commented Apr 20, 2019

OpenStack support was merged in March: kubernetes/autoscaler#1690
I will test and report back here.

zonca (Owner) commented Apr 20, 2019

It is based on Magnum, so we would have to abandon Kubespray anyway. Still, I think it is worth a try, as long as it doesn't require too much effort.

zonca (Owner) commented Jun 14, 2019

Deployment with Magnum works (see #16); I'll work on this next.

zonca (Owner) commented Sep 5, 2019

@rsignell-usgs @julienchastang @ktyle

I deployed the autoscaler on top of the Magnum deployment; it authenticates fine, and when there are many pods pending it requests a new node from the OpenStack API.
The only problem is that it times out after 10 minutes, while Jetstream takes about 15 minutes to provision a new node and configure it for Kubernetes.

I asked XSEDE if there is anything we can do to speed this up: if users are waiting for a Jupyter Notebook, ideally they should wait ~5 minutes. Otherwise, I'll have to recompile the container, as the wait time is a constant in the Go codebase.
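For reference, a scale-up cycle like the one described can be watched from the client side. A sketch (the autoscaler's namespace and deployment name are assumptions about how it was installed):

```shell
# Pods stuck in Pending are what trigger the autoscaler to request a node
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Watch new nodes join as OpenStack finishes provisioning them
kubectl get nodes --watch

# Follow the autoscaler's scale-up/timeout decisions (deployment name assumed)
kubectl -n kube-system logs -f deployment/cluster-autoscaler
```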

julienchastang (Contributor) commented:

One thing I have noticed is really long dpkg processes upon VM startup when doing "manual" scaling. Perhaps if we had an updated image, the dpkg step would not take as long. Ping @jlf599.

jlf599 commented Sep 5, 2019

On the IU cloud, all new deployments get a forced update. If your image is old and has lots of pending updates, that might be the cause.

You can override it with cloud-init by doing something like this:

#cloud-config
packages: []

package_update: false
package_upgrade: false
package_reboot_if_required: false

final_message: "Boot completed in $UPTIME seconds"


in a script. This is noted here -- http://wiki.jetstream-cloud.org/Using+cloud-init+scripts+with+the+Jetstream+API

From the CLI it's invoked with the --file switch at launch. With Terraform or other tools I'm less sure how to include it, but hopefully it's possible.

Looks like it might be:
https://www.terraform.io/docs/providers/openstack/r/compute_instance_v2.html
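Following up on the Terraform pointer above, the OpenStack provider's compute instance resource accepts a `user_data` argument that can carry exactly this kind of cloud-config. A minimal sketch (the resource name, image, and flavor values are placeholders):

```terraform
resource "openstack_compute_instance_v2" "node" {
  name        = "k8s-node"
  image_name  = "Fedora-AtomicHost-28-Updated-9-6-2019"
  flavor_name = "m1.medium"

  # Pass a cloud-config like the one above, e.g. to skip the forced package update
  user_data = file("cloud-init.yaml")
}
```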

zonca (Owner) commented Sep 5, 2019

Thanks! It looks like a good idea. I need to understand how to modify the Heat templates in Magnum to provide a cloud-config.

possibly https://github.com/openstack/heat-templates/blob/master/hot/software-config/example-templates/example-cloud-init.yaml

I'll try and report back here.

julienchastang (Contributor) commented:

Also, simply working with a more up-to-date image that does not have so many out-of-date Debian packages could be another solution, and maybe a more secure one too.

jlf599 commented Sep 5, 2019

That is the best solution. However, if it's the Fedora image that Magnum depends on, we've been trying to update to CoreOS (as that's what Magnum is moving to) and it's just not working correctly. If you're using an Ubuntu image, creating an up-to-date snapshot on a regular basis is not a bad idea.

jlf599 commented Sep 5, 2019

Also, @zonca is this issue directly related to the ticket you opened today?

zonca (Owner) commented Sep 5, 2019

@jlf599 yes, exactly

jlf599 commented Sep 5, 2019

Can you try an experiment: update the image you're using with all of the latest updates and see whether you get faster boot/cluster growth?

zonca (Owner) commented Sep 5, 2019

I'm using Fedora-AtomicHost-28-20180625, which was the only one working with Magnum. @jlf599, should I update that?

jlf599 commented Sep 5, 2019

If Julien is seeing lots of dpkg updates, that's an Ubuntu/Debian-based image, though.

You could yum update the Fedora image, snapshot it, and try using it. It should work, but we all know how "should" messes with things. :)
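The update-and-snapshot cycle described here can be done with the OpenStack CLI. A sketch with placeholder names (the instance name, flavor, key, SSH user, and new image name are all assumptions):

```shell
# Boot a throwaway instance from the stale image
openstack server create --image Fedora-AtomicHost-28-20180625 \
    --flavor m1.small --key-name mykey update-vm

# Apply updates on the instance, then shut it down cleanly
ssh fedora@<instance-ip> "sudo yum -y update && sudo shutdown -h now"

# Snapshot the updated instance as a new image
openstack server image create --name Fedora-AtomicHost-28-updated update-vm
```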

zonca (Owner) commented Sep 5, 2019

@julienchastang is referring to the kubespray deployment; with Magnum I have Fedora, but it could be the same problem. Anyway, it doesn't hurt to have an updated image ;)

jlf599 commented Sep 5, 2019

That image for kubespray could probably stand to be updated too, then. Can you let me know how things go so I can handle the ticket accordingly?

zonca (Owner) commented Sep 5, 2019

See the attached log of an instance (console.txt); I think there is a big delay well before packages are updated.

zonca (Owner) commented Sep 5, 2019

@jlf599 even with an updated image I still see ~15 min provisioning of 1 node

zonca (Owner) commented Sep 5, 2019

@jlf599 I realized I didn't actually update the image; I am retesting it now.

zonca (Owner) commented Sep 6, 2019

That was it: with my updated image Fedora-AtomicHost-28-20180625-updated-20190905, it completes in 4 minutes. Thanks! Now I'll get back to testing the autoscaler.

zonca (Owner) commented Sep 6, 2019

My impression is that the older kernel was hanging on something, while the new one works fine.
I attach the boot log for reference, but there is no need to dig into this.
console_updated_image.txt

ktyle commented Sep 6, 2019

Nice ... I just tried it (looks like the image name is now Fedora-AtomicHost-28-Updated-9-6-2019) and it completed in a little less than 8 minutes ... way faster than with the last image.

jlf599 commented Sep 6, 2019 via email

jlf599 commented Sep 6, 2019

Okay. We've deactivated the old Atomic image and put this one on both clouds:

035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019

We'll be updating that every month or so with our featured images.

Thanks for pointing out the issue!

zonca (Owner) commented Sep 12, 2019

@jlf599 unfortunately it looks like the new image, both your version and mine, gets to "CREATE_COMPLETE", but the Kubernetes cluster is broken.

For example, in my current deployment the master node, even though it appears to be running, is not recognized by Kubernetes:

kubectl get nodes
NAME                        STATUS   ROLES    AGE   VERSION
k8s-7hkcfyagetqm-minion-0   Ready    <none>   5m    v1.11.1

It looks like there are some specific instructions on how to update the images for Magnum: https://docs.openstack.org/magnum/mitaka/dev/build-atomic-image.html

Can you please remind me which version of OpenStack is on Jetstream at IU? And can you recover the old image so I can try to update it using those instructions?
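When a Magnum cluster reports CREATE_COMPLETE but nodes are missing from `kubectl get nodes`, the underlying Heat stack can be inspected for the resource that misbehaved. A sketch (requires the Magnum and Heat CLI plugins; the stack ID is a placeholder):

```shell
# Cluster status as Magnum sees it
openstack coe cluster list

# Find the Heat stack backing the cluster, then inspect its nested resources
openstack stack list
openstack stack resource list -n 2 <stack-id>
```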

jlf599 commented Sep 12, 2019

We're on the Rocky release presently, planning to go to Stein by year's end.

We didn't delete the old image -- just made it inactive -- so we can re-enable it and see about getting it updated.

jlf599 commented Sep 12, 2019

I reactivated

5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30 | Fedora-AtomicHost-28-20180625

and deactivated

035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019

I'm trying to find Rocky instructions like those above for Mitaka but haven't found them yet. Granted, I haven't spent much time on it yet.

zonca (Owner) commented Sep 12, 2019

thanks @jlf599 for the prompt response!
I cannot find that image:

Could not find resource 5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30

zonca (Owner) commented Sep 12, 2019

I had the wrong hash; however, that image is still deactivated:

openstack image list | grep Atomic
| 4043bfa2-289b-4112-961a-64e2aeaedc3f | Fedora-Atomic-27-20180419                             | active      |
| 358cc5d0-e780-4ef5-acf6-9e37de560efa | Fedora-AtomicHost-28-20180625-updated-20190905        | active      |
| 035d9554-086e-40f9-8da2-db023ea4b941 | Fedora-AtomicHost-28-Updated-9-6-2019                 | deactivated |

jlf599 commented Sep 12, 2019

Hrm:

5f2f28a4-6e7c-4515-86c7-f7cbfaa19a30 | Fedora-AtomicHost-28-20180625 | active

I did deactivate the one that wasn't working. Do you want that one back on?

jlf599 commented Sep 12, 2019

(openstack) [IU] [Entropy] jeremy ~-->os image set --activate 035d9554-086e-40f9-8da2-db023ea4b941

Back on

zonca (Owner) commented Sep 12, 2019

OK, I tested with the older Fedora Atomic 27, and that worked fine! It doesn't have the slow boot of Fedora Atomic 28, and everything now works.

zonca (Owner) commented Sep 12, 2019

@julienchastang @jlf599 @ktyle @rsignell-usgs

See the tutorial: https://zonca.github.io/2019/09/kubernetes-jetstream-autoscaler.html

I'll do more testing in the next weeks and improve the tutorial, but everything seems to be working fine.

zonca (Owner) commented Sep 24, 2019

Next: simulate load, see #23.

zonca closed this as completed Nov 1, 2019
zonca added a commit that referenced this issue May 6, 2020
Implement data volume via block store