
Need help improving performance of our Salt environment - it is unacceptably slow #35746

Closed
mwtzzz-zz opened this issue Aug 24, 2016 · 4 comments
Labels: info-needed waiting for more info
Milestone: Blocked

Comments

mwtzzz-zz commented Aug 24, 2016

Problem:

  • salt masters are running at 50% CPU wait on I/O (as reported by vmstat)
  • salt minions take 30+ minutes to run a highstate
  • strace on the salt-minion process shows lots of "resource temporarily unavailable" and timeout messages

Effect:

  • this is making it hard for us to maintain our environment
  • a simple package update across the environment can sometimes take hours

Environment:

  • five salt masters in Amazon AWS (Amazon Linux)
  • 1500 salt minions in Amazon AWS (Amazon Linux)
  • salt minions configured with random_master (see the config check sketched below)
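
A rough way to confirm how a minion is pointed at the masters (assuming the default config path /etc/salt/minion; the config.get calls are just illustrative):

# show the master list and random_master setting from the minion config
grep -E -A 6 '^(master|random_master)' /etc/salt/minion
# or ask the minion itself, bypassing the master
salt-call --local config.get master
salt-call --local config.get random_master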

Versions:

  • salt masters: 2016.3.2, ZMQ = 4.0.5, PyZMQ = 14.5.0
  • salt minions: 2016.3.2, ZMQ = 4.0.5, PyZMQ = 14.5.0
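
(The version numbers above were presumably gathered with the built-in versions report, something like:)

# on a master
salt --versions-report
# on a minion
salt-call --versions-report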

Details:
salt master vmstat:

[root@ec2- salt-master101 ~]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  6      0 3875856 747976 1593464    0    0     1   276   56   47 20  3 45 32  0
 1  7      0 3869740 747976 1593472    0    0     0   572 2568 2446  8  1 13 79  0
 0  6      0 3876328 747976 1593484    0    0     0   512 2079 1973  2  0 27 71  0
 0  8      0 3876188 747976 1593472    0    0     0  1736 1964 1786  1  1 44 54  0
 0  8      0 3876172 747976 1593472    0    0     0   464 1192 1180  0  0  0 99  0
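
The wa column stays high, so we also looked at per-device stats to find what is saturated (see the follow-up comment below); roughly:

# extended per-device statistics, refreshed every second; watch the %util column
iostat -x 1
# optionally, see which processes are generating the I/O (if iotop is installed)
iotop -o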

STRACE on minion during salt-call state.highstate:

poll([{fd=7, events=POLLIN}, {fd=14, events=POLLIN}], 2, 0) = 1 ([{fd=7, revents=POLLIN}])
poll([{fd=14, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=14, events=POLLIN}], 1, 0)    = 0 (Timeout)
write(12, "\1\0\0\0\0\0\0\0", 8)        = 8
fstat(7, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
lseek(7, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
read(7, "x", 8192)                      = 1
read(7, 0x3858f15, 8191)                = -1 EAGAIN (Resource temporarily unavailable)
fstat(7, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
lseek(7, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
read(7, 0x3858f54, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1472066441, 123603}, NULL) = 0
gettimeofday({1472066441, 123640}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {10019313, 260326271}) = 0
poll([{fd=7, events=POLLIN}, {fd=14, events=POLLIN}], 2, 0) = 0 (Timeout)
poll([{fd=14, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=14, events=POLLIN}], 2, 59998^CProcess 14365 detached
 <detached ...>
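
(For reference, the trace above was captured along these lines, by attaching to the running minion process while a highstate ran; the PID lookup here is illustrative:)

# attach to the salt-minion process (and its children) during salt-call state.highstate
strace -f -p "$(pgrep -f salt-minion | head -n 1)"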

Ch3LL commented Aug 25, 2016

@mwtzzz okay, can you please provide/answer the following:

  1. sanitized versions of both your master and minion configs
  2. Is this new behavior for your environment? If so, when did you start seeing it? Maybe after an upgrade? Any other information about anything new added to the environment would help.
  3. Can you ensure there are no defunct processes? When running service salt-master stop, please check that all processes are actually killed (a rough check is sketched after this list). Please do this on one of the minions showing this behavior as well.
  4. Are there any network issues in your environment, such as slowness?
  5. I'm curious whether random_master is causing the issue. If you remove it, do you still see the problem?
  6. Anything relevant in the debug logs on both the master and minion?
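
For item 3, a rough check along these lines is what I mean (service names assume a standard package install):

# stop the master and make sure no salt-master processes are left behind
service salt-master stop
ps aux | grep '[s]alt-master'    # should print nothing
# repeat on one of the affected minions
service salt-minion stop
ps aux | grep '[s]alt-minion'    # should print nothing; kill any leftovers before restarting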

@Ch3LL Ch3LL added the info-needed waiting for more info label Aug 25, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Aug 25, 2016

mwtzzz-zz (Author) commented

Hi Megan,

Thanks for your reply. It looks like the issue might have been "hardware".
Specifically, two things:

a) a couple of salt masters were pegged at 100% util in iostat and were throwing
odd errors about the disk being full even though df showed plenty of free space

b) some of the salt masters were using the root disk instead of the faster EBS
volume; in other words, the salt masters were not configured identically.

In preliminary testing, the slowness has gone away after eliminating (a) and
configuring the remaining servers to serve Salt from the EBS disk (the quick checks
are sketched below). I'll post an update in a couple of days after letting the changes bake in.
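
The quick checks behind (a) and (b) were along these lines (paths assume the default Salt locations):

# any device pegged near 100% in %util?
iostat -x 1 5
# which device/mount actually backs the Salt file roots and job cache?
df -h /srv/salt /var/cache/salt
# confirm the EBS volume is attached and mounted where expected
lsblk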



Ch3LL commented Sep 15, 2016

@mwtzzz any updates?

mwtzzz-zz (Author) commented

Yes, it appears it has been consistently faster since I added the faster hard drives. Feel free to close this ticket. Thanks!
