Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client clash with Tornado IOLoop #33697

Closed
asloboda-cisco opened this issue Jun 2, 2016 · 25 comments
Closed

Client clash with Tornado IOLoop #33697

asloboda-cisco opened this issue Jun 2, 2016 · 25 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt Needs-Testcase PR needs test cases written, or the issue is about a bug/feature that needs test cases P2 Priority 2 severity-critical top severity, seen by most users, serious issues severity-high 2nd top severity, seen by most users, causes major problems ZRELEASED - Boron ZRELEASED - 2016.3.2
Milestone

Comments

@asloboda-cisco
Copy link

After upgrade to 2016.3 (from 2015.8) my application started crashing on Salt client. The application is using Tornado 4 as well and client crashes with IOLoop is already running

I tracked down the relevant piece of code to SMinion constructor io_loop.run_sync and by making change to this io_loop initialization I made it go away:

        #io_loop = LOOP_CLASS.current()
        io_loop = zmq.eventloop.ioloop.ZMQIOLoop()

I am not familiar with nuances of io_loop constructors or why it has changed between Salt versions but it doesn't seem to break anything.

@cachedout
Copy link
Contributor

cachedout commented Jun 2, 2016

@asloboda-cisco I really don't consider this a bug in Salt. We use a Tornado ioloop on salt's main thread in the running daemons. If your application is running in the same process, you need to either use Salt's IOLoop or re-architect it such that your loop is running on its own thread.

@Ch3LL Ch3LL added the expected-behavior intended functionality label Jun 2, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Jun 2, 2016
@Ch3LL Ch3LL added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Jun 2, 2016
@asloboda-cisco
Copy link
Author

@cachedout I get that this way a separation is a requirement, but I got a bit lost in your answer. Is the Salt daemon a Salt master server? If so then how does it affect what's going on in client and why was the change that breaks things required? AFAIK it did use IOLoop before that too.

I can't find anything about IOthread in Salt, can you point me to it?

@ypoison
Copy link

ypoison commented Jun 3, 2016

@asloboda-cisco This may be associated with ZMQ version, I by updating the python-zmq and zeromq to solve the problem.

@asloboda-cisco
Copy link
Author

asloboda-cisco commented Jun 3, 2016

@ypoison I read that somewhere but apparently pyzmq==14.0.1 (& ZMQ: 4.0.4) should be fine. Which version do you suggest?

@cachedout
Copy link
Contributor

@asloboda-cisco Sorry, I don't know why I wrote IOThread. I meant IOLoop. (I must be tired. Apologies.)

Could you clarify how you're using the salt client and give more details about how your application is interacting with Salt?

@asloboda-cisco
Copy link
Author

@cachedout it is a WSGI app using Tornado web server to serve pages and when it starts HTTPServer it also starts IOLoop. I did not write the code but looking at the Tornado documentation it seems to be the thing to do.

It is working just fine until some request will result in an action which tries to initialize Salt client.

@asloboda-cisco
Copy link
Author

@cachedout I managed to get together an example which works with previous Salt version:

$ links -dump http://localhost:8888
   3 no error

but does this now:

$ links -dump http://localhost:8888
   4 error Traceback (most recent call last): File "tornado4.py", line 12, in
   get salt.client.Caller() File
   "/usr/local/lib/python2.7/dist-packages/salt/client/__init__.py", line
   1717, in __init__ self.sminion = salt.minion.SMinion(self.opts) File
   "/usr/local/lib/python2.7/dist-packages/salt/minion.py", line 592, in
   __init__ lambda: self.eval_master(self.opts, failed=True) File
   "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 448, in
   run_sync self.start() File
   "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 748, in
   start raise RuntimeError("IOLoop is already running") RuntimeError: IOLoop
   is already running

Here's the code:

import tornado.web
import tornado.httpserver
import tornado.ioloop
import salt
import salt.client

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("%s" % tornado.process.task_id())

        try:
            salt.client.Caller()
            self.write(" no error")
        except:
            import traceback
            self.write(" error %s" % traceback.format_exc())

app = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    server = tornado.httpserver.HTTPServer(app)
    server.bind(8888)
    server.start(0)
    tornado.ioloop.IOLoop.instance().start()

@danlsgiga
Copy link
Contributor

I'm having the very same issue after upgrading from 2015.8.10 to 2016.3.0.
Every time I tried to call my runners (any) from salt-api, the mentioned error comes up in the master.log.
I've also made the same change suggested by @asloboda-cisco and everything magically started working after restarting the master. (I didn't restart the minion, so, I'm supposing this is a master issue).

Can you guys please take a deeper look on this issue?

Thanks in advance

@cachedout
Copy link
Contributor

@danlsgiga Which salt-api backend? CherryPy or Tornado?

@danlsgiga
Copy link
Contributor

@cachedout rest_cherrypi

@cachedout
Copy link
Contributor

@danlsgiga Oh, wow. OK. I had assumed Tornado. Interesting. We'll take a look.

@cachedout cachedout added Bug broken, incorrect, or confusing behavior severity-critical top severity, seen by most users, serious issues ZRELEASED - Boron and removed expected-behavior intended functionality Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged labels Jun 6, 2016
@Ch3LL Ch3LL added P2 Priority 2 severity-high 2nd top severity, seen by most users, causes major problems Core relates to code central or existential to Salt labels Jun 27, 2016
@Ch3LL Ch3LL modified the milestones: Approved, Blocked Jun 27, 2016
@fracklen
Copy link
Contributor

@cachedout It seems the problem isn't in salt-api...

Here's a relatively easy way to reproduce:

Setup

The error_runner.py looks like this:

import salt.client

def test_func():
    caller = salt.client.Caller()
    return caller.cmd('test.ping')

Steps to Reproduce Issue

$ sudo salt-run error_runner.test_func
[ERROR   ] Exception in callback <functools.partial object at 0x7fccda2dd8e8>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 343, in wrapped
    raise_exc_info(exc)
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/crypt.py", line 462, in _authenticate
    io_loop=self.io_loop)
  File "/usr/lib/python2.7/dist-packages/salt/transport/client.py", line 105, in factory
    return salt.transport.zeromq.AsyncZeroMQReqChannel(opts, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 84, in __new__
    new_obj.__singleton_init__(opts, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 155, in __singleton_init__
    io_loop=self._io_loop,
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 820, in __init__
    self._init_socket()
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 869, in _init_socket
    self.stream = zmq.eventloop.zmqstream.ZMQStream(self.socket, io_loop=self.io_loop)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 107, in __init__
    self._init_io_state()
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 528, in _init_io_state
    self.io_loop.add_handler(self.socket, self._handle_events, self._state)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 704, in add_handler
    self._impl.register(fd, events | self.ERROR)
TypeError: argument must be an int, or have a fileno() method.

@Ch3LL has bisected it to a37a270

This has been blocking us from upgrading to 2016.3.x
It was working in 2015.8.10.

@danlsgiga
Copy link
Contributor

hey @fracklen, thanks for this.
That makes total sense since I use lots of salt client calls inside my runner that is being called by salt-api.

Good catch guys!! Thanks!!

@fracklen
Copy link
Contributor

@cachedout @Ch3LL would it be possible to create an integration test for this, based on the relatively simple runner with the salt.client.Caller()?

@meggiebot meggiebot modified the milestones: C 7, Approved Jun 30, 2016
@cachedout
Copy link
Contributor

I'm looking at this today but also wanted to bring in @skizunov for additional insight.

@skizunov
Copy link
Contributor

skizunov commented Jul 1, 2016

@cachedout : The code seems to be using a Tornado IO Loop tornado.ioloop.IOLoop. It needs to be using a ZMQ IO Loop zmq.eventloop.ioloop.IOLoop instead. ZMQ IO Loops are required only when using ZeroMQ (which is the case in this call stack) Also, ZMQ needs to have been initialized with this before using your first ZMQ IO Loop:

zmq.eventloop.ioloop.install()

In theory, the code should correctly select which type of IO Loop it uses (based on whether or not ZMQ is installed). Perhaps there is a bug in that logic or there is a missed use case.

@cachedout
Copy link
Contributor

@skizunov I've been troubleshooting the runner issue outlined by @fracklen specifically. There's somewhere inside of Salt where that seems to happen when a client is initialized inside of a runner. I'm still chasing that one down...

@thatch45
Copy link
Contributor

thatch45 commented Jul 5, 2016

@asloboda-cisco @danlsgiga @fracklen can you please test my pr:
#34456
and let me know if it solves the issue?

@danlsgiga
Copy link
Contributor

@thatch45 Your fix works fine!!! Just applied to my dev environment and my custom runners that uses salt client internally are running fine now.

Thanks a lot!!

Any idea of which release this fix will be applied to?

@thatch45
Copy link
Contributor

thatch45 commented Jul 5, 2016

good to hear! We are trying to finalize 2016.3.2 right now, if this fix is good we are down to just one blocker bug I think.

@fracklen
Copy link
Contributor

fracklen commented Jul 5, 2016

@thatch45 works like a dream. Awesome!

Sounds great if it could hit 2016.3.2.

@thatch45
Copy link
Contributor

thatch45 commented Jul 5, 2016

it is merged into the branch, so this fix will be in 2016.3.2, glad I could help :)

@thatch45
Copy link
Contributor

thatch45 commented Jul 5, 2016

with multiple validations on the fix and the pr being merged I am going to close this out as fixed

@thatch45 thatch45 closed this as completed Jul 5, 2016
@fracklen
Copy link
Contributor

fracklen commented Jul 5, 2016

@thatch45 wrt the tests, can this be an issue for other transports as well?

Would be nice to have a test that verifies that a runner can connect to a running IOLoop using the client - using each transport.

@thatch45
Copy link
Contributor

thatch45 commented Jul 5, 2016

Yes, good call,I did validate this fix on the other transports and a test that checks for this will be run on our multiple transport test platforms.
But yes, we do need to get a test in for this

@thatch45 thatch45 added the Needs-Testcase PR needs test cases written, or the issue is about a bug/feature that needs test cases label Jul 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt Needs-Testcase PR needs test cases written, or the issue is about a bug/feature that needs test cases P2 Priority 2 severity-critical top severity, seen by most users, serious issues severity-high 2nd top severity, seen by most users, causes major problems ZRELEASED - Boron ZRELEASED - 2016.3.2
Projects
None yet
Development

No branches or pull requests

9 participants