Fix a race condition in manage runner #44958
Merged
The runner was not connecting the client event listener to the event bus, so every event fired between `client.run_job()` and the moment `client.get_cli_event_returns()` began polling was lost. Before `get_cli_event_returns` (via LocalClient's `get_iter_returns`) starts polling for events, it first checks whether the job cache has a record of the jid it is querying. With a custom returner for the job cache, one that has even a little latency, return events from minions may slip past before the jid lookup completes. `get_iter_returns` then never returns them, and the manage runner assumes the minion did not respond.

Connecting to the event bus before we run the test.ping ensures that we do not miss any of these events.
Resolves #44820
NOTE: you can test this fix by adding a `time.sleep(5)` after the `client.run_job()` call. This introduces enough latency between `run_job` and `get_cli_event_returns` to trigger the behavior. With the `connect_pub` line commented out, the minion is reported as not responding, even though with debug logging turned on you can clearly see the return come in. With the fix in place, the sleep does not prevent the return event from being processed by `get_cli_event_returns`, and the return from `salt-run manage.status` is as expected.
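
For readers who want to see the ordering problem in isolation, here is a minimal toy model of the race. It does not use Salt at all: `ToyEventBus`, `ping`, `connect_first`, and `lookup_latency` are hypothetical names invented for this sketch, and the only assumption carried over from the description above is that a listener sees only events fired after it connects.

```python
import time


class ToyEventBus:
    """Hypothetical stand-in for an event bus; this is not Salt's API."""

    def __init__(self):
        self._subscribers = []

    def connect(self):
        # A listener only receives events fired after it connects.
        queue = []
        self._subscribers.append(queue)
        return queue

    def fire(self, event):
        # Deliver the event to every listener connected right now.
        for queue in self._subscribers:
            queue.append(event)


def ping(bus, connect_first, lookup_latency=0.01):
    """Return the set of minion ids the listener actually saw."""
    # The fix: connect the listener before the job is published.
    queue = bus.connect() if connect_first else None

    # Stand-in for run_job(): the minion's return arrives right away.
    bus.fire({"id": "minion1", "return": True})

    # Stand-in for the slow job-cache jid lookup (or the time.sleep in the note).
    time.sleep(lookup_latency)

    if queue is None:
        # Without the fix the listener connects only when polling starts,
        # so the return event fired above is never delivered to it.
        queue = bus.connect()

    return {event["id"] for event in queue}


print(ping(ToyEventBus(), connect_first=False))  # set()
print(ping(ToyEventBus(), connect_first=True))   # {'minion1'}
```

The first call models the runner before this change (the return is missed and the minion looks down), and the second models connecting to the bus before publishing the job.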