
Don't subscribe to events if not sure it would read them. #36024

Merged

Conversation

@DmitryKuzmenko (Contributor) commented Sep 2, 2016

What does this PR do?

The Reactor uses LocalClient to send messages but does not read the resulting
events, which produces a memory leak in EventPublisher: the sent messages
accumulate there in the IOStream class.

If not receiving an answer becomes a problem because the client is not subscribed to the bus, we should change the logic there to subscribe to it implicitly or explicitly, but we shouldn't subscribe when we're not sure we'll actually read the events.
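
To make the failure mode concrete, here is a minimal toy sketch (illustrative only; `ToyPublisher` is not the real EventPublisher or IOStream) of why a client that subscribes to a bus but never reads from it causes unbounded buffering on the publisher side:

```
# Toy model only, not the actual Salt EventPublisher/IOStream: a publisher
# that buffers messages per subscriber grows without bound when a subscriber
# connects "just in case" but never drains its queue.
import collections

class ToyPublisher:
    def __init__(self):
        self.subscriber_buffers = []

    def subscribe(self):
        buf = collections.deque()
        self.subscriber_buffers.append(buf)
        return buf

    def publish(self, event):
        for buf in self.subscriber_buffers:
            buf.append(event)  # stays queued until the subscriber reads it

pub = ToyPublisher()
never_read = pub.subscribe()            # subscribed, but never read (the Reactor case)
for i in range(10000):
    pub.publish({'tag': 'job', 'id': i})
print(len(never_read))                  # 10000 buffered events: the memory leak
```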

What issues does this PR fix or reference?

#31454

Tests written?

No

@DmitryKuzmenko changed the base branch from develop to 2016.3 on September 2, 2016 14:26
@DmitryKuzmenko (Contributor, Author)

@skizunov could you please take a look?

```
# Ensure that the event subscriber is connected.
# If not, we won't get a response, so error out
if not self.event.connect_pub(timeout=timeout):
    raise SaltReqTimeoutError()
```
Contributor

It seems like the intent of this code is to make sure the salt-master is running before invoking channel.send. In theory, channel.send should itself time out if the salt-master is not running, so this change should be harmless.

However, a deeper look tells me there would be a slight difference in behavior. The intent of the code may be to give the master a little bit of time to come up (due to the pausing inherent in self.event.connect_pub) so that channel.send will work in cases where things are slightly out-of-sync coming up.

If the master is not up at the exact time channel.send is called, channel.send would always fail with a timeout even if the master came up slightly afterwards (and still within the timeout period) because the sent packet is lost (at least in a connection-less transport like ZMQ). However, with the original code intact, as long as the master came up within the timeout period, no errors would occur.

Removing this may make such test cases slightly more unreliable.

So I would say try to remove this and see what happens. If you see more occasional test case failures, then this would be the explanation for that.
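
If removing the pre-wait does surface that race in tests, one possible caller-side workaround is to retry the send until the master answers or an overall deadline expires. This is only a sketch, assuming a Salt-style request channel whose `send()` raises `SaltReqTimeoutError` on timeout; the helper name is illustrative and it is not part of this PR:

```
# Sketch only: retry channel.send() so a master that comes up slightly late,
# but still within the overall deadline, is tolerated (assumed channel API).
import time
from salt.exceptions import SaltReqTimeoutError

def send_with_retry(channel, load, total_timeout=60, attempt_timeout=5):
    deadline = time.time() + total_timeout
    while True:
        try:
            return channel.send(load, timeout=attempt_timeout)
        except SaltReqTimeoutError:
            if time.time() >= deadline:
                raise  # the master never came up within the deadline
            time.sleep(1)  # give a slow-starting master a moment, then retry
```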

Contributor

Good analysis. I suspect that the test suite will be tolerant of this change but this is something to keep in mind. I'd like to get this in and see how things go.

Thanks @DmitryKuzmenko and @skizunov as always.

Contributor Author

@skizunov thank you for the review!

@cachedout merged commit cd60ec5 into saltstack:2016.3 on Sep 2, 2016
@DmitryKuzmenko deleted the bugs/31454_local_client_memleak branch on September 5, 2016 09:01
@ahammond (Contributor)

Does this need to be patched on just the master, on the minions, or on both?

@DmitryKuzmenko (Contributor, Author)

@ahammond the master side is enough.

@ahammond (Contributor)

Thanks!


skizunov added a commit to skizunov/salt that referenced this pull request Sep 30, 2016
It has been observed that when running this command:

```
salt "*" test.ping
```

sometimes the command would return `Minion did not return. [No response]`
for some of the minions even though the minions did indeed respond
(reproduced running Windows salt-master on Python 3 using the TCP
transport).

After investigating this further, it seems that there is a race condition:
if the response event arrives before events are being listened for, the
response is lost. For instance, `salt.client.LocalClient.cmd_cli`, which is
what is invoked in the command above, won't start listening for events until
`get_cli_event_returns`, which invokes `get_iter_returns`, which invokes
`get_returns_no_block`, which invokes `self.event.get_event`, which connects
to the event bus if it hasn't connected yet (which is the case the first time
it hits this code). But events may be fired any time after `self.pub()` is
executed, which occurs before this code.

We need to ensure that events are being listened for before it is
possible they return. We also want to avoid issue saltstack#31454 which is what
PR saltstack#36024 fixed but in turn caused this issue. This is the approach I
have taken to try to tackle this issue:

It doesn't seem possible to generically discern if events can be
returned by a given function that invokes `run_job` and contains an
event searching function such as `get_cli_event_returns`. So for all
such functions that could possibly need to search the event bus, we
do the following:
- Record if the event bus is currently being listened to.
- When invoking `run_job`, ensure that `listen=True` so that `self.pub()`
will ensure that the event bus is listened to before sending the payload.
- When all possible event bus activities are concluded, if the event
bus was not originally being listened to, stop listening to it. This is
designed so that issue saltstack#31454 does not reappear. We do this via a
try/finally block in all instances of such code.

Signed-off-by: Sergey Kizunov <sergey.kizunov@ni.com>
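
A minimal sketch of the try/finally pattern the commit message above describes; the `cpub` flag, `close_pub()` method, and helper name are assumptions for illustration rather than the exact Salt API:

```
# Sketch only: remember whether the event bus was already being listened to,
# force listening on for this job (listen=True), and restore the original
# state afterwards so unread events don't pile up again (issue #31454).
def cmd_with_listen(client, tgt, fun, arg=(), **kwargs):
    was_listening = getattr(client.event, 'cpub', False)   # assumed flag
    try:
        pub_data = client.run_job(tgt, fun, arg=arg, listen=True, **kwargs)
        if not pub_data:
            return []                    # publish failed, nothing to wait for
        return list(client.get_cli_event_returns(pub_data['jid'],
                                                 pub_data['minions']))
    finally:
        if not was_listening and hasattr(client.event, 'close_pub'):
            client.event.close_pub()     # stop listening if we started it here
```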
@skizunov (Contributor)

@DmitryKuzmenko : It looks like I missed something when I previously evaluated the code change in this PR. I describe the issue and offer a solution to it in PR #36720. Any comments on my solution are welcome.

DmitryKuzmenko pushed a commit to DSRCorporation/salt that referenced this pull request Oct 20, 2016