
Don't subscribe to events if not sure it would read them. #36024

Merged

Conversation

@DmitryKuzmenko (Contributor) commented Sep 2, 2016

What does this PR do?

The Reactor uses LocalClient to send messages but does not read the resulting
events, which produces a memory leak in EventPublisher: the sent messages
accumulate there in the IOStream class.

If not receiving an answer becomes a problem because the client is not subscribed to the bus, we should change the logic there to subscribe to it implicitly or explicitly, but we shouldn't subscribe when we're not sure we'll actually read the events.
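
To make the failure mode concrete, here is a minimal toy sketch (illustrative only; `ToyPublisher` is not the real EventPublisher or IOStream) of why a client that subscribes to a bus but never reads from it causes unbounded buffering on the publisher side:

```
# Toy model only, not the actual Salt EventPublisher/IOStream: a publisher
# that buffers messages per subscriber grows without bound when a subscriber
# connects "just in case" but never drains its queue.
import collections

class ToyPublisher:
    def __init__(self):
        self.subscriber_buffers = []

    def subscribe(self):
        buf = collections.deque()
        self.subscriber_buffers.append(buf)
        return buf

    def publish(self, event):
        for buf in self.subscriber_buffers:
            buf.append(event)  # stays queued until the subscriber reads it

pub = ToyPublisher()
never_read = pub.subscribe()            # subscribed, but never read (the Reactor case)
for i in range(10000):
    pub.publish({'tag': 'job', 'id': i})
print(len(never_read))                  # 10000 buffered events: the memory leak
```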

What issues does this PR fix or reference?

#31454

Tests written?

No

@DmitryKuzmenko changed the base branch from develop to 2016.3 on September 2, 2016 14:26
@DmitryKuzmenko (Contributor, Author)

@skizunov could you please take a look?

```
# Ensure that the event subscriber is connected.
# If not, we won't get a response, so error out
if not self.event.connect_pub(timeout=timeout):
    raise SaltReqTimeoutError()
```
Contributor

It seems like the intent of this code is to make sure the salt-master is running before invoking channel.send. In theory, channel.send should itself time out if the salt-master is not running, so this change should be harmless.

However, a deeper look tells me there would be a slight difference in behavior. The intent of the code may be to give the master a little bit of time to come up (due to the pausing inherent in self.event.connect_pub) so that channel.send will work in cases where things are slightly out-of-sync coming up.

If the master is not up at the exact time channel.send is called, channel.send would always fail with a timeout even if the master came up slightly afterwards (and still within the timeout period) because the sent packet is lost (at least in a connection-less transport like ZMQ). However, with the original code intact, as long as the master came up within the timeout period, no errors would occur.

Removing this may make such test cases slightly more unreliable.

So I would say try to remove this and see what happens. If you see more occasional test case failures, then this would be the explanation for that.
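
If removing the pre-wait does surface that race in tests, one possible caller-side workaround is to retry the send until the master answers or an overall deadline expires. This is only a sketch, assuming a Salt-style request channel whose `send()` raises `SaltReqTimeoutError` on timeout; the helper name is illustrative and it is not part of this PR:

```
# Sketch only: retry channel.send() so a master that comes up slightly late,
# but still within the overall deadline, is tolerated (assumed channel API).
import time
from salt.exceptions import SaltReqTimeoutError

def send_with_retry(channel, load, total_timeout=60, attempt_timeout=5):
    deadline = time.time() + total_timeout
    while True:
        try:
            return channel.send(load, timeout=attempt_timeout)
        except SaltReqTimeoutError:
            if time.time() >= deadline:
                raise  # the master never came up within the deadline
            time.sleep(1)  # give a slow-starting master a moment, then retry
```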

Contributor

Good analysis. I suspect that the test suite will be tolerant of this change but this is something to keep in mind. I'd like to get this in and see how things go.

Thanks @DmitryKuzmenko and @skizunov as always.

Contributor Author

@skizunov thank you for the review!

@cachedout merged commit cd60ec5 into saltstack:2016.3 on Sep 2, 2016
@DmitryKuzmenko deleted the bugs/31454_local_client_memleak branch on September 5, 2016 09:01
@ahammond (Contributor)

Does this need to be patched on just the master, on the minions, or on both?

@DmitryKuzmenko (Contributor, Author)

@ahammond the master side is enough.

@ahammond (Contributor)

Thanks!


skizunov added a commit to skizunov/salt that referenced this pull request Sep 30, 2016
It has been observed that when running this command:

```
salt "*" test.ping
```

sometimes the command would return `Minion did not return. [No response]`
for some of the minions even though the minions did indeed respond
(reproduced running Windows salt-master on Python 3 using the TCP
transport).

After investigating this further, it seems that there is a race condition:
if the response event arrives before events are being listened for, the
response is lost. For instance, `salt.client.LocalClient.cmd_cli`, which is
what is invoked in the command above, won't start listening for events until
`get_cli_event_returns`, which invokes `get_iter_returns`, which invokes
`get_returns_no_block`, which invokes `self.event.get_event`, which connects
to the event bus if it hasn't connected yet (which is the case the first time
it hits this code). But events may be fired any time after `self.pub()` is
executed, which occurs before this code.

We need to ensure that events are being listened for before it is
possible they return. We also want to avoid issue saltstack#31454 which is what
PR saltstack#36024 fixed but in turn caused this issue. This is the approach I
have taken to try to tackle this issue:

It doesn't seem possible to generically discern if events can be
returned by a given function that invokes `run_job` and contains an
event searching function such as `get_cli_event_returns`. So for all
such functions that could possibly need to search the event bus, we
do the following:
- Record if the event bus is currently being listened to.
- When invoking `run_job`, ensure that `listen=True` so that `self.pub()`
will ensure that the event bus is listened to before sending the payload.
- When all possible event bus activities are concluded, if the event
bus was not originally being listened to, stop listening to it. This is
designed so that issue saltstack#31454 does not reappear. We do this via a
try/finally block in all instances of such code.

Signed-off-by: Sergey Kizunov <sergey.kizunov@ni.com>
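
A minimal sketch of the try/finally pattern the commit message above describes; the `cpub` flag, `close_pub()` method, and helper name are assumptions for illustration rather than the exact Salt API:

```
# Sketch only: remember whether the event bus was already being listened to,
# force listening on for this job (listen=True), and restore the original
# state afterwards so unread events don't pile up again (issue #31454).
def cmd_with_listen(client, tgt, fun, arg=(), **kwargs):
    was_listening = getattr(client.event, 'cpub', False)   # assumed flag
    try:
        pub_data = client.run_job(tgt, fun, arg=arg, listen=True, **kwargs)
        if not pub_data:
            return []                    # publish failed, nothing to wait for
        return list(client.get_cli_event_returns(pub_data['jid'],
                                                 pub_data['minions']))
    finally:
        if not was_listening and hasattr(client.event, 'close_pub'):
            client.event.close_pub()     # stop listening if we started it here
```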
@skizunov (Contributor)

@DmitryKuzmenko : It looks like I missed something when I previously evaluated the code change in this PR. I describe the issue and offer a solution to it in PR #36720. Any comments on my solution are welcome.

DmitryKuzmenko pushed a commit to DSRCorporation/salt that referenced this pull request Oct 20, 2016