
Fix a race condition in manage runner #44958

Merged
merged 2 commits into saltstack:2017.7 on Dec 13, 2017

Conversation

terminalmage
Contributor

@terminalmage commented Dec 13, 2017

The runner was not connecting the client event listener to the event bus. This meant that all events between the `client.run_job()` and when `client.get_cli_event_returns()` began polling for events would be lost. Before `get_cli_event_returns` (via LocalClient's `get_iter_returns`) gets around to polling for events, it first checks to see if the job cache has a record of the jid it's querying. When using a custom returner for the job cache, one which has even a little bit of latency, return events from minions may sneak past before the jid lookup is complete, meaning that `get_iter_returns` will not return them and the manage runner will assume the minion did not respond.

Connecting to the event bus before we run the `test.ping` ensures that we do not miss any of these events.

Resolves #44820

NOTE: you can test this fix by adding a `time.sleep(5)` after the `client.run_job()`. This will introduce enough latency between `run_job` and `get_cli_event_returns` to trigger the behavior. With the `connect_pub` line commented out, the minion will be reported as not responding, even though with debug logging turned on you can clearly see the return come in. With the fix in place, the sleep does not prevent the return event from being processed by `get_cli_event_returns`, and the return from `salt-run manage.status` is as expected.
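The race described above can be illustrated outside of Salt with a minimal sketch (all names here are hypothetical stand-ins, not Salt's actual API): if the listener subscribes to the event bus only after the job has already been published, any return events that arrive in between are silently dropped.

```python
import queue


class EventBus:
    """Toy in-process event bus; a hypothetical stand-in for Salt's event system."""

    def __init__(self):
        self._subscribers = []

    def connect_pub(self):
        # Only events published *after* this call are delivered to the new queue.
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, event):
        for q in self._subscribers:
            q.put(event)


def status_buggy(bus):
    # Buggy order: the minion's return event fires before we subscribe,
    # so polling finds nothing and the minion looks unresponsive.
    bus.publish({"id": "minion1", "return": True})
    events = bus.connect_pub()
    try:
        return events.get_nowait()
    except queue.Empty:
        return None


def status_fixed(bus):
    # Fixed order: connect to the event bus first, then run the job,
    # so no return event can slip past the listener.
    events = bus.connect_pub()
    bus.publish({"id": "minion1", "return": True})
    return events.get_nowait()


print(status_buggy(EventBus()))  # None: the return event was lost
print(status_fixed(EventBus()))  # {'id': 'minion1', 'return': True}
```

Adding a delay between publishing and subscribing in `status_buggy` (mirroring the `time.sleep(5)` test above) only widens the window; the ordering, not the timing, is the actual bug.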

@terminalmage
Contributor Author

@msteed the fix here (as well as the addition of the sleep to aid in confirmation) should be simple enough for you to try on the master you're using for testing purposes.

@terminalmage
Contributor Author

Actually, I noticed that the `run_job` func has a `listen` arg that does the same thing I was doing manually in my initial commit. I just pushed a cleaner fix.
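The cleaner fix folds the same subscribe-before-publish ordering into the job-submission call itself. A minimal sketch of that pattern (illustrative names only; this is not Salt's actual `LocalClient` implementation):

```python
import queue


class Client:
    """Sketch of a client whose run_job can subscribe before publishing,
    mirroring the idea of a `listen` keyword argument."""

    def __init__(self):
        self._subscribers = []
        self.events = None

    def _connect_pub(self):
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def run_job(self, listen=False):
        if listen:
            # Subscribe before the job goes out, so no return can slip past.
            self.events = self._connect_pub()
        # Publish the job; the minion's return arrives on the bus.
        for q in self._subscribers:
            q.put({"return": True})


client = Client()
client.run_job(listen=True)
print(client.events.get_nowait())  # {'return': True}
```

With `listen=False` (the default here) nothing is subscribed and the return event is lost, which is exactly the behavior the runner exhibited before the fix.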

@cachedout cachedout merged commit dad2d72 into saltstack:2017.7 Dec 13, 2017
@msteed
Contributor

msteed commented Dec 13, 2017

@terminalmage : Thanks, this works perfectly

@terminalmage
Contributor Author

👍

rallytime pushed a commit that referenced this pull request Dec 14, 2017
@terminalmage terminalmage deleted the issue44820 branch January 5, 2018 19:27