Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

salt-cloud being strangely slow #34776

Closed
alexykot opened this issue Jul 19, 2016 · 14 comments

Comments

Projects
None yet
7 participants
@alexykot
Copy link

commented Jul 19, 2016

Description of the problem

I'm comparing salt-cloud and terraform as tools to manage our infrastructure at GCE. We use salt stack to manage VM instance configurations, so I would naturally prefer to use salt-cloud as an integral part of the stack and phase out terraform as a legacy thing.

However my use case is critical on VM deployment time because we offer PaaS solution with VMs deployed on customer request, so need to deliver ready VMs on a click of a button within seconds.

And what puzzles me is why salt-cloud takes so long to deploy basic machines.

I have created neck-to-neck simple test with deploying three VMs based on default GCE CentOS7 image using both terraform and salt-cloud (both in parallel mode). And the time difference is stunning - where terraform needs around 30 seconds to deploy requested machines (which is similar to time needed to deploy same instances through GCE GUI), salt-cloud takes around 220 seconds to deploy exactly same machines under same account in the same zone. Especially strange is that first 130 seconds salt-cloud does not start deploying and does seemingly nothing at all, and only after around 130 seconds pass it shows message deploying VMs and those VMs appear in GUI as in deployment.

Is there something obvious that I'm missing about salt-cloud that makes it so slow? Can it be sped up somehow? I would prefer to user full salt stack, but with this issue it is not possible in my use case.

Setup

I'm using salt-cloud v2016.3.1 versus terraform v0.6.16, both most recent versions available at a time of test. The master instance that executes this provisioning sits in the same GCE DC, but in a different subnet on a different GCE account. Same instance is used to execute salt-cloud and terraform tests.

Notes

I've measured time from start till end of hosts bootstrap. Salt-cloud tells you in console when initial machine bootstrap is over and it moves on to provisioning, i.e. minion installation, master-minion connection and certificates, further provision according to salt states etc. So that is excluded from the timeline I've explained above.

@alexykot alexykot changed the title salt-cloud being ridiculously slow salt-cloud being strangely slow Jul 19, 2016

@techhat

This comment has been minimized.

Copy link
Member

commented Jul 19, 2016

Hi @alexykot. With most providers, it really does only take a few seconds to request a machine. The bulk of the time spent waiting is for machines to become responsive so that Salt Cloud can log in, upload files, and run the deploy script.

You should be able to see all of this activity by adding -l debug to your salt-cloud command. If you see specific areas that can be optimized, we would love to do so.

@alexykot

This comment has been minimized.

Copy link
Author

commented Jul 19, 2016

I would expect so and it would be totally fine to wait until machines come up online. And fine to wait little longer to provision salt-minion and do keys exchange. But it seems to take much longer than needed to do that. I will retry it with -l debug and report.

@alexykot

This comment has been minimized.

Copy link
Author

commented Jul 19, 2016

I ran another deployment with -l debug, and it output massive log (as expected). However the log doesn't have timestamps attached.
Is it possible to add timestamps to log entries to see what is happening when and recognise when the delay is happening?

@s0undt3ch

This comment has been minimized.

@s0undt3ch s0undt3ch closed this Jul 19, 2016

@s0undt3ch s0undt3ch reopened this Jul 19, 2016

@s0undt3ch

This comment has been minimized.

Copy link
Member

commented Jul 19, 2016

Sorry... Dam phone...

More particularly, %(asctime)s,%(msecs)03.0f

@cachedout

This comment has been minimized.

Copy link
Collaborator

commented Jul 19, 2016

@s0undt3ch Should we make timestamps the default?

@techhat

This comment has been minimized.

Copy link
Member

commented Jul 20, 2016

@s0undt3ch

This comment has been minimized.

Copy link
Member

commented Jul 20, 2016

Its the default for log files, and I honestly also set that for the console on my setups.... Yes?

@alexykot

This comment has been minimized.

Copy link
Author

commented Jul 21, 2016

I think I know where the speed problem lies. I did something that totally changed the way it works, and here is what is was.

Initially I was doing the test using the same salt stack instance I use for managing main infrastructure we have. I have the provider for production GCE cloud account configured in there, and I've added another provider for this performance testing, so I had two providers configured at once.
And I've had a test instance profile configured that was using test provider.
And with this configuration provisioning speed was not great as I've described above.

So what I did was commenting out the production provider. I've just renamed it's .conf file into .conf.bak, checked that salt-cloud doesn't see it anymore, and rerun the same test on the test provider. And the speed has increased twofold. It is still taking some time, but much less than before.

And I can point out the exact place in the log where the time difference happened.
I've rerun test provisioning with and without production provider configured and all the difference is between these two lines:

with prod provider:

10:45:13,558 [salt.utils.lazy                          ][DEBUG   ] LazyLoaded rackspace.list_locations
10:49:39,140 [salt.loaded.int.clouds.gce               ][INFO    ] Creating GCE instance saltcloud-test-3 in europe-west1-b

without prod provider:

10:35:20,585 [salt.utils.lazy                          ][DEBUG   ] LazyLoaded rackspace.list_locations
10:36:46,114 [salt.loaded.int.clouds.gce               ][INFO    ] Creating GCE instance saltcloud-test-2 in europe-west1-b

That's 86 seconds versus 266 seconds.

I have the rest of the log, but I don't see much of other interesting things in there.

An important background note - our production provider has around 300 live instances at the moment.

@alexykot

This comment has been minimized.

Copy link
Author

commented Aug 8, 2016

Hey, I see the status here now, saying "Fixed Pending Verification". Shall I retest it now with latest develop branch?

@cachedout

This comment has been minimized.

Copy link
Collaborator

commented Aug 15, 2016

@alexykot I don't see any specific commit referenced, so I don't think a change has been made in regard to this PR. I'll remove the label since I think it was added in error.

@rosscdh

This comment has been minimized.

Copy link

commented Nov 29, 2016

+1 this becomes a real problem at around 120 machines

@s-schweer

This comment has been minimized.

Copy link
Contributor

commented Nov 29, 2016

+1 yeah, with up to 100 machines it's quite fast but beyond that salt-cloud takes about 15 minutes before it even starts to provision a machine (at least with the nova driver), i think it's querying the existing machines that causes that...

@stale

This comment has been minimized.

Copy link

commented Aug 1, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Aug 1, 2018

@stale stale bot closed this Aug 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.