Issue with multi-master and local job cache: An inconsistency occurred #20197

Closed
jhenry82 opened this Issue Jan 29, 2015 · 22 comments

Comments

@jhenry82

jhenry82 commented Jan 29, 2015

Since updating from 2014.1.7 to 2014.7.1, we are seeing a large number (like, tens of thousands per day) of messages in the master log like the following:

2015-01-28 14:08:35,600 [salt.loaded.int.returner.local_cache ][ERROR ] An inconsistency occurred, a job was received with a job id that is not present in the local cache: 20150128140503219478

We run multi-master with 4 masters, all listed in every minion's config.
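
For reference, the minion side is nothing exotic; a minimal sketch of the relevant piece of each minion's config (the hostnames here are placeholders, not our real ones):

    # /etc/salt/minion -- multi-master: every master listed on every minion
    master:
      - saltmaster1.example.com
      - saltmaster2.example.com
      - saltmaster3.example.com
      - saltmaster4.example.com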

In issue #18322, @basepi suggested this was probably due to some changes in multi-master in the 2014.7 timeframe and should be fixed. I just wanted to formally open an issue so it's not lost.

CentOS 6.2

           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Dec  7 2011, 20:48:22)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.1.9.final
   msgpack-pure: Not Installed
       pycrypto: 2.6
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.0.2
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: Not Installed
@rallytime

Contributor

rallytime commented Jan 30, 2015

Thanks for filing this @jhenry82. We shall take a look here!

ping @cachedout

@jasee

jasee commented Feb 3, 2015

I found something interesting.
I have two masters, call them A and B. After all the minions restart, both A and B can run test.ping without any problem. But if I run state.highstate from master A, then A can no longer get the minions' responses, while at that moment B can still run state.highstate and test.ping normally. It seems that whichever master starts a highstate first loses the minions' responses.
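
Roughly, the sequence looks like this (just a sketch of the commands involved):

    # right after all the minions restart, both masters are fine
    salt '*' test.ping          # works from A and from B

    # run a highstate from master A
    salt '*' state.highstate

    # after that, A stops getting returns; B still works
    salt '*' test.ping          # no responses on A, normal on B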

CentOS 6.3
           Salt: 2014.7.0
         Python: 2.6.6 (r266:84292, Jun 18 2012, 14:18:47)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.1.13
   msgpack-pure: Not Installed
       pycrypto: 2.0.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.3.1
           RAET: Not Installed
            ZMQ: 3.2.4
           Mako: Not Installed
@DaveQB

Contributor

DaveQB commented Mar 4, 2015

I think I am suffering from something like this.
There's a thread on Google Groups:

https://groups.google.com/forum/#!topic/salt-users/bP49NTWyLyo

@rallytime

Contributor

rallytime commented Mar 4, 2015

@jhenry82 or @DaveQB Are either of you in a position to give #19790 a try to see if that helps alleviate this issue? We had a race condition in the event handler for multi-master and I am wondering if that is the underlying bug you are seeing. That fix was backported to the 2015.2 branch (in #20964), so it will be available in the 2015.2.0 release.

@DaveQB

Contributor

DaveQB commented Mar 6, 2015

@rallytime I will give this a go today. It's been a hellish week as a result of our upgrade from 2014.1.3 to 2014.7.1. It started off looking safe but has caused 18-hour work days, etc.

I hope this is the fix. Thanks.

@DaveQB

Contributor

DaveQB commented Mar 6, 2015

@rallytime You mean just this diff? https://github.com/saltstack/salt/pull/19790/files
Comparing the whole file against the minion.py I have for 2014.7.1, there are a lot of other changes.

@DaveQB

Contributor

DaveQB commented Mar 6, 2015

@rallytime I can't apply the patch, as none of the lines it is replacing exist in 2014.7.1.
I am a bit concerned about applying the whole file, since it is so different.

@rallytime

Contributor

rallytime commented Mar 6, 2015

@DaveQB Ah, yes, my apologies. That was why this fix was backported to the 2015.2 branch instead of the 2014.7 branch. The bug was fixed originally on develop, which contains many refactors affecting this area of the code. Since 2015.2 hasn't diverged from develop as much as 2014.7 has, and contains many of those refactors, it backported just fine. As you've seen, however, that is not the case for the 2014.7 branch.

I suppose what I really meant in my earlier comment, but articulated incorrectly, was to ask if you were in a position to give either the HEAD of the 2015.2 branch a try in a testing environment, or even the first release candidate of 2015.2 (2015.2.0rc1), to see if the issue is resolved for you.

@DaveQB

Contributor

DaveQB commented Mar 6, 2015

@rallytime I see. It's really disappointing that multi-master support has gone backwards with the latest release. I'll look into the 2015 RC.

@jhenry82 Are you seeing minions dropping off from your masters? That is, a large number of minions showing up in 'salt-run manage.down'? And then once you get them back up, they fall off again?

@jhenry82

jhenry82 commented Mar 6, 2015

@rallytime I'll try to get the HEAD of 2015.2 into our test environment next week, but like @DaveQB, work has been crazy lately and I'm not sure I'll have time in the near future.

And yes. Basically what we saw was huge swathes of minions dropping off all but one master. For example, masterA could only test.ping one small subset of the total number of minions, masterB could only test.ping a different subset, and so forth. Restarting the minions and/or the masters never fully resolved the issue. As you said, it was so completely broken that I didn't spend a ton of time playing with it before simply rolling back to the 2014.1 series.

@DaveQB

Contributor

DaveQB commented Mar 7, 2015

@cachedout

Contributor

cachedout commented Mar 19, 2015

@rallytime Which issue is the blocker here? If it's the issue of a highstate causing minions to drop, that is a duplicate of #19932, which is resolved by #21795.

If it is the issue about inconsistent returns, I can confirm it in the 2014.7.2 tag, as @jhenry82 originally pointed out, but not in the HEAD of 2014.7.

@rallytime

Contributor

rallytime commented Mar 19, 2015

@cachedout You're right; I didn't realize these were duplicate issues. I'll test your fix on 2015.2, since the PR was submitted there, and see if I can backport the fixes to 2014.7, where both of the referenced issues were filed. If they can't be backported, then we'll need to look at a different fix here. I'll keep you posted.

@cachedout

Contributor

cachedout commented Mar 19, 2015

@rallytime Sounds good, thanks.

@rallytime

Contributor

rallytime commented Mar 19, 2015

I patched both of my masters and both of my test minions with the fix in #21795 and I am still seeing events return to the wrong master after running a state.highstate.

All masters and minions in this setup are running a recent install of the HEAD of 2014.7:

# salt --versions
           Salt: 2014.7.2-372-g493a97c
         Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.0.1
           RAET: Not Installed
            ZMQ: 4.0.4
           Mako: Not Installed

@cachedout cachedout removed the Blocker label Mar 20, 2015

@rallytime

Contributor

rallytime commented Mar 20, 2015

@jhenry82 and @DaveQB we've been working to make significant improvements to multi-master over the last couple of days. I believe we have addressed all of the concerns here, but if you're able to test this (from the HEAD of 2014.7) and confirm that improvements have been made, that would certainly be appreciated.

@rallytime

Contributor

rallytime commented Mar 21, 2015

There is still one bug, which I recently discovered when running a highstate command in a particular manner, that we are working out. I'll post an update here when we have it resolved.

@DaveQB

Contributor

DaveQB commented Mar 23, 2015

Thank you very much for your efforts @rallytime @cachedout

Looking forward to it.

@rvora

rvora commented Mar 24, 2015

@rallytime I am seeing these logs, but I don't have a multi-master environment. I have a master-of-masters and a bunch of ephemeral masters that run a syndic pointing to the master-of-masters. By ephemeral masters, I mean that these masters are spawned by a CI/CD pipeline and are eventually destroyed by it. Each minion points to only one master.

A couple of notes about the setup: the master-of-masters is running 2014.7.1, whereas the syndics are running 2014.7.0. They are all in open mode.
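
The syndic wiring is the standard one; a minimal sketch of what the configs look like (hostnames are placeholders):

    # master config on the master-of-masters
    order_masters: True

    # master config on each ephemeral (syndic) master
    syndic_master: master-of-masters.example.com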

I get a lot of these in the master-of-masters logs:
2015-03-24 12:55:53,1000 [salt.loaded.int.returner.local_cache ][WARNING ] Could not write job invocation cache file: [Errno 2] No such file or directory: '/var/cache/salt/master/jobs/10/f6620f0de42f7b15db44ca73bebb01/.load.p'
2015-03-24 12:55:54,003 [salt.loaded.int.returner.local_cache ][ERROR ] An inconsistency occurred, a job was received with a job id that is not present in the local cache: 20150324125535753347

@basepi

Collaborator

basepi commented Mar 24, 2015

@rvora The first thing to check is that there is only one syndic process running on each master. Additionally, some recent syndic fixes have gone in which may fix this issue, though I haven't actually seen this particular issue with a syndic setup.
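
A quick way to verify is to list the syndic processes on each master and make sure exactly one shows up; for example:

    # should print exactly one salt-syndic process per master
    ps aux | grep '[s]alt-syndic'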

@jfindlay jfindlay added the P1 label Apr 8, 2015

@jfindlay jfindlay added Core and removed Core labels May 26, 2015

@jhenry82

jhenry82 commented May 30, 2015

I updated to 2015.5.1 this week and am no longer seeing this behavior. As far as I am concerned, it is fixed and the issue can be closed. Thanks for the awesome release!

I do NOT have any syndics in my environment so I can't comment on @rvora's problem.

@rallytime

Contributor

rallytime commented May 31, 2015

@jhenry82 That is excellent news to hear. We made some significant fixes and improvements to multi-master and I am so glad that 2015.5.1 is working better for you. I know you had some significant issues with MM in the 2014.7.x releases. I am going to go ahead and close this one.

@rvora If you continue to have a problem with your syndic setup after trying the suggestions from @basepi, please open a new issue with as much useful information as you can provide, and we can address those concerns in the new issue.

@rallytime rallytime closed this May 31, 2015
