
get events from python #37018

Closed
tsaridas opened this issue Oct 14, 2016 · 28 comments
Labels: Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), fixed-pls-verify (fix is linked, bug author to confirm fix)
Milestone: Approved

@tsaridas (Contributor) commented Oct 14, 2016

Description of Issue/Question

Trying to follow the guide in the Saltstack documentation to get events from a Python script using the salt.utils.event class, I noticed that when the master is restarted the events stop coming.

I noticed @DmitryKuzmenko made some changes for the minions and syndics so they reconnect after a disconnection. Maybe he could suggest something similar for this specific case?

I also tried using iter_events with the same results.

Also, get_event_block in a while loop throws an exception, but when I try to reconnect it fails.

Steps to Reproduce Issue

Follow the steps in the documentation and restart the master.

Versions Report

latest 2016.3.3

@Ch3LL (Contributor) commented Oct 14, 2016

Yes, I believe @DmitryKuzmenko would be the one with the definitive answer.

But I would think the reason event monitoring stops after the master is restarted is that you need to re-initialize the connection to the master. The same thing happens if I run salt-run state.event pretty=True: if I restart the master I have to re-run that command to view events. But I think Dmitry will have more knowledge around this and might have a workaround/suggestion. Thanks

@Ch3LL added the Question (The issue is more of a question rather than a bug or a feature request) label Oct 14, 2016
@Ch3LL added this to the Approved milestone Oct 14, 2016
@DmitryKuzmenko (Contributor)

@Ch3LL is right. When the master restarts it recreates the unix domain PUB socket, so after a master restart the client keeps listening on the removed socket file while the master publishes to the new one. But I'll take a closer look to provide a way to handle this.
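
As a minimal sketch of what this means for a client, a replaced PUB socket file can be detected by comparing inode numbers; the path and helper below are illustrative assumptions, not part of Salt's API:

# Minimal sketch, not Salt API: detect that the PUB socket file on disk is no
# longer the one the client originally connected to.
import os

SOCK_PATH = '/var/run/salt/master/master_event_pub.ipc'  # assumed default path

def socket_was_replaced(inode_at_connect, sock_path=SOCK_PATH):
    """Return True if the socket file was removed or recreated since connect."""
    try:
        return os.stat(sock_path).st_ino != inode_at_connect
    except OSError:
        return True  # the socket file is gone entirely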

@tsaridas (Contributor, Author)

thanks @DmitryKuzmenko,

I understand that it is the same socket file, and it would be nice if I had a method to check the connection or maybe to reconnect automatically.
connect_pub() returns True after restart but it still doesn't show any events after reconnecting.

I tried to see exactly where this gets stuck, but I wasn't able to.

ps: I already applied your patch for syndic reconnection thinking it might help, but no luck.

@DmitryKuzmenko (Contributor)

@tsaridas
After thinking about it a bit more, I realized that when the publisher closes the socket, all the clients actually get a StreamClosedError, and the Syndic fix mentioned above introduces handling for that error. After looking through the code, I found that we now have a way to handle the error in async mode, as the syndic does, but we have no way to detect a disconnection in the sync client mode described in the docs. For the user, a disconnect looks the same as a timeout because of this:
http://github.com/saltstack/salt/blob/develop/salt/utils/event.py#L532-L533

So the client code gets None in both cases: timeout and error. We have to fix this.
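
A simplified sketch of the pattern behind those lines (not the exact Salt source) shows why: both a timeout and a closed stream surface to the caller as None.

# Simplified sketch of the behavior described above (not the exact Salt code):
# read_next_event is a stand-in for the internal read; a timeout and a
# disconnect both come back to the caller as None.
def get_event_sync_sketch(read_next_event):
    try:
        return read_next_event()
    except Exception:        # StreamClosedError is swallowed here
        return None          # indistinguishable from "no event yet"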

@DmitryKuzmenko added the Bug, Core, and Confirmed labels and removed the Question label Oct 17, 2016
@tsaridas (Contributor, Author) commented Oct 17, 2016

@DmitryKuzmenko thanks for checking.

Would you be able to provide a quick fix so that the client reconnects automatically?

For some reason, even when I catch the exception from get_event_noblock() and re-initialise with var = salt.utils.event.get_event(), I cannot reconnect any more.

var.get_event_noblock() just hangs.

I tried setting all the variables to None but I still get the same result.

Thanks

@DmitryKuzmenko (Contributor)

@tsaridas ah, sure, there are the more specialized get_event_noblock() and get_event_block() methods that don't hide the exception. I've tried to do what you want, and everything appears to work:

#!/usr/bin/env python
import salt.config
import salt.utils.event
import sys
import time
import tornado.iostream

opts = salt.config.client_config('/etc/salt/master')

event = salt.utils.event.get_event(
        'master',
        sock_dir=opts['sock_dir'],
        transport=opts['transport'],
        opts=opts)

while True:
    try:
        while True:
            data = event.get_event_block()
            print(data)
    except tornado.iostream.StreamClosedError as ex:
        # The master went away: drop the stale subscriber and poll until
        # connect_pub() succeeds against the recreated socket.
        print(ex)
        event.close_pub()
        tries = 0
        while not event.cpub:
            if tries > 0:
                time.sleep(1)
            tries += 1
            print('Reconnecting #{0}'.format(tries))
            event.connect_pub()
    except:
        print('Some exception: {0}'.format(sys.exc_info()[0]))
        break

print('Done')

@DmitryKuzmenko added the Question label and removed the Bug, Confirmed, and Core labels Oct 18, 2016
@tsaridas (Contributor, Author) commented Oct 18, 2016

@DmitryKuzmenko,

That doesn't work for me. It gets stuck, as I expected, at event.connect_pub() and never times out.

I also tried calling event.connect_pub(timeout=10), but without success.

I see that you are working with the develop branch, so the issue might be fixed there.

I'll test with the develop branch later on this evening and post the results.

PS: Did you try restarting the master while running the script? :) Just checking.

Thanks

@tsaridas (Contributor, Author) commented Oct 18, 2016

Tested, and it works in the develop branch. Not sure which ticket solved this, though.

I guess it was my mistake that I didn't test in develop. Sorry for the inconvenience, and thanks for the help @DmitryKuzmenko

@szjur commented Oct 18, 2016

@DmitryKuzmenko, any idea how to make it work in 2016.3.3? There is no close_pub() there; I tried to do what it does based on the latest code, but it always hangs on connect_pub(), even if you wait a minute and the master is surely back up by that time.

@DmitryKuzmenko (Contributor)

@szjur you can either re-create the SaltEvent or use the following code instead of the missing close_pub():

        event.subscriber.close()
        event.subscriber = None
        event.pending_events = []
        event.cpub = False
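
For 2016.3.3 a small helper wrapping those four lines can then stand in for the event.close_pub() call in the script above (attribute names are taken from the snippet; this is a sketch, not an official API):

def manual_close_pub(event):
    # Sketch of a close_pub() stand-in for builds that lack it: reset the
    # subscriber state so a later connect_pub() starts from scratch.
    if event.subscriber is not None:
        event.subscriber.close()
    event.subscriber = None
    event.pending_events = []
    event.cpub = False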

@tsaridas (Contributor, Author)

@Ch3LL @DmitryKuzmenko thank you for your help.
Issue can be closed.

@szjur commented Oct 19, 2016

@DmitryKuzmenko you do realise that what you suggested is exactly what I said I tried? That doesn't help, at least not with 2016.3.3 on RHEL6. @Ch3LL, I don't know why @tsaridas closed it - most likely he found the code change that fixes this bug, but just closing it like that doesn't really help the stable branches get better and may be confusing to anyone stumbling on this bug and reading this thread.

@DmitryKuzmenko (Contributor)

@szjur ah! I see, sorry. I'll take a closer look.

@tsaridas (Contributor, Author)

I could reopen the ticket if you think it should track the bug in the stable version.

FYI, the PR that fixed the issue in 2016.3.3 was #32329.

Not sure why, though, since that one has to do with the serializer ... I only patched ipc.py and transport/frame.py together with your syndic patch.

I will take another look at the changes and try to find the exact line that fixed it.

Thanks

@DmitryKuzmenko added the Bug and Core labels and removed the Question label Oct 20, 2016
@DmitryKuzmenko (Contributor)

I agree with @szjur. Also, I've checked the latest 2016.3 branch and the bug is there: SaltEvent hangs on reconnecting to the event bus.
The issue is fixed in the carbon branch by this PR: #36720
I think it needs to be backported into the 2016.3 branch.

@DmitryKuzmenko (Contributor)

I've backported the fix to 2016.3. @cachedout could you please take a look?

@DmitryKuzmenko (Contributor)

@szjur, @tsaridas thank you for help!

@szjur commented Oct 20, 2016

Excellent, thanks a lot @DmitryKuzmenko :-) We'll also test it thoroughly tomorrow.

@szjur commented Oct 21, 2016

@DmitryKuzmenko, I tested 2016.3.3 with just the second part of the backport (d7e3209) plus manually doing what the missing close_pub() does, and that does the trick in the scenario you proposed (using get_event_block() and catching tornado.iostream.StreamClosedError). I gave the other part of the backport (82e2763) a miss for now, because it didn't apply cleanly over stock 2016.3.3 and I didn't feel like getting too hung up on that.
By the way, iter_events() still blocks forever after a master restart. I'm fine using get_event_block(), but maybe at some point it could also be addressed, as that problem limits the usability of iter_events().

@DmitryKuzmenko (Contributor)

I'd like to address this question to @cachedout and @thatch45.
In simple words: salt.utils.event has different error handling in the functions used to get events:
get_event_block(), get_event_noblock(): return None if there's no data, and raise StreamClosedError if the remote peer is disconnected.
get_event(): returns None if there's no data or an error occurred, so we have no way to detect errors.
iter_events(): keeps waiting for events when it gets None from get_event(), i.e. it hangs if the connection is broken.
Shouldn't we change the behavior of get_event() and iter_events()?
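
To make the last point concrete, a simplified sketch of what iter_events() effectively does (not the exact Salt source) shows why it hangs on a broken connection:

# Simplified sketch of iter_events() (not the exact Salt code): None from
# get_event() is treated as "nothing yet", so a dead connection loops forever.
def iter_events_sketch(event):
    while True:
        data = event.get_event(full=True)
        if data is None:
            continue   # also reached when the connection is broken
        yield data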

@thatch45 (Contributor)

This is a good question. I think the optimal approach would be to allow the user to pass an option into get_event() to expose raised errors.
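
For illustration only, the kind of opt-in flag being suggested could look like the wrapper below; the raise_errors name and the wrapper itself are assumptions, not Salt's actual API:

import tornado.iostream

def get_event_optional_raise(event, raise_errors=False):
    # Hypothetical opt-in wrapper (raise_errors is not a real Salt parameter):
    # legacy callers keep the None-on-error behavior, opted-in callers see the error.
    try:
        return event.get_event_block()
    except tornado.iostream.StreamClosedError:
        if raise_errors:
            raise
        return None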

@szjur commented Oct 24, 2016

In any case, iter_events() hanging indefinitely after a master restart, with no way whatsoever to catch and handle it, is beyond any user's expectation. It should at least return None.

@jeanpralo (Contributor)

I am just wondering why we should be passing an option to raise an exception. I mean, if it does not raise anything, this is sort of broken by default, no?

Anyway, all in favor of fixing it for get_event() and therefore for iter_events() as well.

@cachedout added the fixed-pls-verify (fix is linked, bug author to confirm fix) label Nov 4, 2016
@tsaridas (Contributor, Author) commented Nov 6, 2016

With a quick search I see that iter_events is used in the reactor and in CherryPy, so yes, it would make sense to raise an exception and catch it in all the places it's used; otherwise all of the above would have issues when the master is restarted.

@thatch45 (Contributor) commented Nov 7, 2016

My concern is to avoid changing behavior without explicitly adding the option. Many people use the event system API, and I don't like pulling the rug out from under them unless we have to. That is why I would rather make it an option.

@jeanpralo (Contributor)

True, makes sense.

Well, #37438 is fixing this, so I guess we can close #37335?

@tsaridas (Contributor, Author) commented Nov 7, 2016

Verified the fix for 2016.3.4 with get_event_noblock.

@DmitryKuzmenko (Contributor)

It looks like I'm done here.
