Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Launching nanogui Screen in a separate thread #35

Closed
mrzv opened this issue Dec 16, 2015 · 38 comments
Closed

Launching nanogui Screen in a separate thread #35

mrzv opened this issue Dec 16, 2015 · 38 comments

Comments

@mrzv
Copy link
Contributor

mrzv commented Dec 16, 2015

I would like to launch a nanogui Screen in a thread of its own. This works as intended on Linux, but crashes and burns on a Mac. Any suggestions and pointers would be very welcome.

#include <iostream>
#include <thread>
#include <nanogui/screen.h>

namespace ng = nanogui;

int main(int argc, char *argv[])
{
    std::thread t([]()
    {
        try
        {
            nanogui::init();

            ng::Screen* app = new ng::Screen(ng::Vector2i { 800, 600 }, "Test");

            app->drawAll();
            app->setVisible(true);

            nanogui::mainloop();

            delete app;

            nanogui::shutdown();
        } catch (const std::runtime_error &e)
        {
            std::string error_msg = std::string("Caught a fatal error: ") + std::string(e.what());
            #if defined(WIN32)
                MessageBoxA(nullptr, error_msg.c_str(), NULL, MB_ICONERROR | MB_OK);
            #else
                std::cerr << error_msg << '\n';
            #endif
            return -1;
        }
    });
    t.join();

    return 0;
}
@mrzv
Copy link
Contributor Author

mrzv commented Dec 16, 2015

The errors I get on a Mac OS X (El Capitan):

2015-12-16 11:06:21.563 test[38913:3591970] *** Assertion failure in +[NSUndoManager _endTopLevelGroupings], /Library/Caches/com.apple.xbs/Sources/Foundation/Foundation-1255.1/Misc.subproj/NSUndoManager.m:359
2015-12-16 11:06:21.564 test[38913:3591970] +[NSUndoManager(NSInternal) _endTopLevelGroupings] is only safe to invoke on the main thread.
2015-12-16 11:06:21.565 test[38913:3591970] (
    0   CoreFoundation                      0x00007fff9f89ce32 __exceptionPreprocess + 178
    1   libobjc.A.dylib                     0x00007fff9686cdd4 objc_exception_throw + 48
    2   CoreFoundation                      0x00007fff9f8a1b0a +[NSException raise:format:arguments:] + 106
    3   Foundation                          0x00007fff89d57e16 -[NSAssertionHandler handleFailureInMethod:object:file:lineNumber:description:] + 198
    4   Foundation                          0x00007fff89cdcf31 +[NSUndoManager(NSPrivate) _endTopLevelGroupings] + 170
    5   AppKit                              0x00007fff9a65bdca -[NSApplication run] + 844
    6   libnanogui.dylib                    0x000000010df8520b _ZN7nanogui8ComboBoxD2Ev + 145275
    7   libnanogui.dylib                    0x000000010df814c5 _ZN7nanogui8ComboBoxD2Ev + 129589
    8   libnanogui.dylib                    0x000000010df55f5d _ZNK7nanogui18AdvancedGridLayout6anchorEPKNS_6WidgetE + 2013
    9   test                                0x000000010df1656f _ZZ4mainENK3$_0clEv + 319
    10  test                                0x000000010df161e9 _ZNSt3__114__thread_proxyINS_5tupleIJZ4mainE3$_0EEEEEPvS4_ + 393
    11  libsystem_pthread.dylib             0x00007fff999ae9b1 _pthread_body + 131
    12  libsystem_pthread.dylib             0x00007fff999ae92e _pthread_body + 0
    13  libsystem_pthread.dylib             0x00007fff999ac385 thread_start + 13
)
2015-12-16 11:06:21.566 test[38913:3591970] *** Assertion failure in +[NSUndoManager _endTopLevelGroupings], /Library/Caches/com.apple.xbs/Sources/Foundation/Foundation-1255.1/Misc.subproj/NSUndoManager.m:359
2015-12-16 11:06:21.566 test[38913:3591970] An uncaught exception was raised
2015-12-16 11:06:21.567 test[38913:3591970] +[NSUndoManager(NSInternal) _endTopLevelGroupings] is only safe to invoke on the main thread.
2015-12-16 11:06:21.567 test[38913:3591970] (
    0   CoreFoundation                      0x00007fff9f89ce32 __exceptionPreprocess + 178
    1   libobjc.A.dylib                     0x00007fff9686cdd4 objc_exception_throw + 48
    2   CoreFoundation                      0x00007fff9f8a1b0a +[NSException raise:format:arguments:] + 106
    3   Foundation                          0x00007fff89d57e16 -[NSAssertionHandler handleFailureInMethod:object:file:lineNumber:description:] + 198
    4   Foundation                          0x00007fff89cdcf31 +[NSUndoManager(NSPrivate) _endTopLevelGroupings] + 170
    5   AppKit                              0x00007fff9a65be66 -[NSApplication run] + 1000
    6   libnanogui.dylib                    0x000000010df8520b _ZN7nanogui8ComboBoxD2Ev + 145275
    7   libnanogui.dylib                    0x000000010df814c5 _ZN7nanogui8ComboBoxD2Ev + 129589
    8   libnanogui.dylib                    0x000000010df55f5d _ZNK7nanogui18AdvancedGridLayout6anchorEPKNS_6WidgetE + 2013
    9   test                                0x000000010df1656f _ZZ4mainENK3$_0clEv + 319
    10  test                                0x000000010df161e9 _ZNSt3__114__thread_proxyINS_5tupleIJZ4mainE3$_0EEEEEPvS4_ + 393
    11  libsystem_pthread.dylib             0x00007fff999ae9b1 _pthread_body + 131
    12  libsystem_pthread.dylib             0x00007fff999ae92e _pthread_body + 0
    13  libsystem_pthread.dylib             0x00007fff999ac385 thread_start + 13
)
2015-12-16 11:06:21.568 test[38913:3591970] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: '+[NSUndoManager(NSInternal) _endTopLevelGroupings] is only safe to invoke on the main thread.'
*** First throw call stack:
(
    0   CoreFoundation                      0x00007fff9f89ce32 __exceptionPreprocess + 178
    1   libobjc.A.dylib                     0x00007fff9686cdd4 objc_exception_throw + 48
    2   CoreFoundation                      0x00007fff9f8a1b0a +[NSException raise:format:arguments:] + 106
    3   Foundation                          0x00007fff89d57e16 -[NSAssertionHandler handleFailureInMethod:object:file:lineNumber:description:] + 198
    4   Foundation                          0x00007fff89cdcf31 +[NSUndoManager(NSPrivate) _endTopLevelGroupings] + 170
    5   AppKit                              0x00007fff9a65be66 -[NSApplication run] + 1000
    6   libnanogui.dylib                    0x000000010df8520b _ZN7nanogui8ComboBoxD2Ev + 145275
    7   libnanogui.dylib                    0x000000010df814c5 _ZN7nanogui8ComboBoxD2Ev + 129589
    8   libnanogui.dylib                    0x000000010df55f5d _ZNK7nanogui18AdvancedGridLayout6anchorEPKNS_6WidgetE + 2013
    9   test                                0x000000010df1656f _ZZ4mainENK3$_0clEv + 319
    10  test                                0x000000010df161e9 _ZNSt3__114__thread_proxyINS_5tupleIJZ4mainE3$_0EEEEEPvS4_ + 393
    11  libsystem_pthread.dylib             0x00007fff999ae9b1 _pthread_body + 131
    12  libsystem_pthread.dylib             0x00007fff999ae92e _pthread_body + 0
    13  libsystem_pthread.dylib             0x00007fff999ac385 thread_start + 13
)
libc++abi.dylib: terminating with uncaught exception of type NSException
[1]    38913 abort      DYLD_LIBRARY_PATH=~/Build/nanogui ./test

@wjakob
Copy link
Owner

wjakob commented Dec 16, 2015

Hi Dmitry,

this is a serious issue on Macs, where the GLFW event loop is only allowed to run on the main thread (probably due to assumptions in Cocoa) -- see the GLFW web page for details.

It is possible that you could do something horribly hacky to launch the event loop on the main thread and return to Python on another thread using longjmp() or so -- shudder! :)

If you do that, I'd be really curious to know if it worked.
Wenzel

@wjakob
Copy link
Owner

wjakob commented Dec 16, 2015

(let me know in case my description doesn't make sense)

@mrzv
Copy link
Contributor Author

mrzv commented Dec 16, 2015

No, it makes perfect sense. Thanks for the pointer and the idea (I use coroutines in a different project, so it doesn't make me shudder). I'll dig into it. Thanks again.

@wjakob
Copy link
Owner

wjakob commented Dec 16, 2015

If it works, it might even make sense to make this an official function of the Python bindings of nanogui (e.g. nanogui.main_loop(detach=True) or so)

@mrzv
Copy link
Contributor Author

mrzv commented Dec 17, 2015

Ok, it seems to be working (using libcoro for context switching — contrary to what that page implies, it's using setjmp/longjmp). Take a look, let me know what you think: https://github.com/mrzv/nanogui-thread (I've only tested it on my Mac for now; I'll check Linux tomorrow.)

It's a bit annoying that you have to call finalize explicitly, but for now it's the best I can do.

There is a test.py in there, but it's much nicer to just manually launch it from an interactive shell and actually interact with Python while the GUI is running. Somewhat amusingly, IPython, unlike Python, insists on quiting from the original thread that launched it (otherwise SQLite that it uses under the hood crashes and burns), but it's definitely the cleaner way to quit anyway.

@wjakob
Copy link
Owner

wjakob commented Dec 17, 2015

Hi Dmitry,

that's awesome! -- the bit with finalize is not too surprising I guess.

I see that libcoro is mostly a portability layer for Windows, which doesn't have longjmp().

One thing I'm wondering is whether the coroutine approach is actually needed on Windows? If it just works on that platform, then my preferred approach to integrate it would be to just use longjmp() on OSX and Linux and do a no-op on Windows.

Wenzel

@wjakob
Copy link
Owner

wjakob commented Dec 17, 2015

(or maybe even just on OSX)

@mrzv
Copy link
Contributor Author

mrzv commented Dec 17, 2015

Well, the bit with finalize is annoying because I'd rather have it in a destructor, and assign an Environment as a module variable -- basically, use RAII pattern. That doesn't seem to work; it's too late to call finalize from the destructor.

I wouldn't characterize libcoro as a portability layer for Windows -- it has a lot of good stuff in there; if anything, it's a workaround for Mac/Linux not having fibers. But that's semantics. I agree with your larger point that since we only need this workaround on a Mac, it makes sense to strip it down to only what's necessary (basically, the CORO_SJLJ subset) and enable all this yoga only on OS X.

I'll look into how to strip it down -- I haven't done this before, but I should learn the details anyway.

@wjakob
Copy link
Owner

wjakob commented Dec 17, 2015

Out of curiosity: CORO has multiple different SJLJ "drivers", one of them being setjmp()/longjmp() which works out of the box on OSX. What kind of extra yoga is needed in this case? Just calling these two is not enough?

@mrzv
Copy link
Contributor Author

mrzv commented Dec 17, 2015

Well, at the very least you have to create an extra stack.

@wjakob
Copy link
Owner

wjakob commented Dec 17, 2015

Ouch, I forgot about that -- good point. Regarding RAII: could atexit perhaps be used? (https://docs.python.org/2/library/atexit.html). This could be invoked via pybind to install a handler that switches back to the main thread. But maybe that is also already too late in the shutdown phase.

@wjakob
Copy link
Owner

wjakob commented Dec 17, 2015

Thinking a bit more about the stacks -- are you sure that it's needed? The goal is really just to switch stacks -- the newly created thread begins executing Python code, using the stack that was previously associated with the main thread. The main thread enters the event loop, using the stack of the newly created thread.

So perhaps you'll really just need tje two longjmp()/setjmp(), plus some mutexes like you're doing now. What do you think?

@mrzv
Copy link
Contributor Author

mrzv commented Dec 17, 2015

Yeah, I was just thinking about this -- the extra thread does have its own stack. I think you are right. I need to think a little more about it, but maybe the whole thing can be drastically simplified.

@mrzv
Copy link
Contributor Author

mrzv commented Dec 17, 2015

Ok, I can't seem to get it to work with just longjmp/setjmp. I'll revisit it later.

@mrzv
Copy link
Contributor Author

mrzv commented Dec 19, 2015

Ok, I think I got it; see the same repository as before. I need to use a couple of intermediate threads (I need some intermediate stack to jump to), but it seems to work. Obviously, it's not extensively tested, but it works reliably for me (on a Mac — no idea about other platforms; not that we need them).

The annoyance with the explicit finalize is still there. If you have an idea how to get rid of it, it would be great.

@mrzv
Copy link
Contributor Author

mrzv commented Dec 19, 2015

Actually, this explicit finalize is a problem in IPython, not in regular Python. I get the impression that IPython tries to do some critical SQLite operation (that has to be done from the original thread) before Python starts deleting its objects. Hence the problem that I'm seeing. In Python proper, finalize in a destructor does the job fine.

@mrzv
Copy link
Contributor Author

mrzv commented Dec 28, 2015

Hi Wenzel,

What are your thoughts on this? Do you want to build this functionality into nanogui (with 'mainloop(detach=True)' swapping threads in Mac and shutdown taking care of finalize)? Or should I keep it local to my application?

Happy holidays.
Dmitriy

@wjakob
Copy link
Owner

wjakob commented Dec 29, 2015

I would love to have this as an option in the python library part of nanogui -- please do go ahead and submit a PR.

Thanks,
Wenzel

@wjakob
Copy link
Owner

wjakob commented Jan 18, 2016

hey -- just a gentle ping regarding the PR :)

@mrzv
Copy link
Contributor Author

mrzv commented Jan 19, 2016

Hey Wenzel,

There was a problem with the code that I hadn't realized was there, and I haven't been able to fix it. Basically, currently the best I can offer is to do this using libcoro. My direct setjmp/longjmp doesn't work. I don't know how you feel about including libcoro into nanogui.

Dmitriy

@wjakob
Copy link
Owner

wjakob commented Jan 19, 2016

Hm, I'm not so excited about bundling another dependency (at least conceptually, I still feel that this should be possible with the builtin tools). What was the problem you encountered?

@mrzv
Copy link
Contributor Author

mrzv commented Jan 19, 2016

I wish I knew what the problem was. It was a spontaneous segfault, occurring with different frequency in different cases. I never managed to get to the bottom of it. For what it's worth, libcoro is a tiny dependency you could simply include in your own source.

@mrzv
Copy link
Contributor Author

mrzv commented Jan 19, 2016

I've pushed the changes I've made to https://github.com/mrzv/nanogui in case you want to take a look.

@wjakob wjakob closed this as completed in db2ce58 Apr 25, 2016
@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

Hi Dmitry,
this feature finally landed. I used your libcoro approach, but with some extra steps that keep Python thread states & threads in sync.
Please let me know if it works for you (I am especially interested if this addresses the IPython cleanup crashes you mentioned).
Thanks,
Wenzel

@mrzv
Copy link
Contributor Author

mrzv commented Apr 25, 2016

Hi Wenzel,

Here's the code I used to test whether what I was doing with threading was working (slightly, adjusted for how you added detach — correct me if I'm wrong).

import nanogui

from nanogui import *

import time

quit = False

class TestApp(Screen):
    def __init__(self):
        super(TestApp, self).__init__(Vector2i(1024, 768), "NanoGUI Test")

        window = Window(self, "Button demo")
        window.setPosition(Vector2i(15, 15))
        window.setLayout(GroupLayout())

        Label(window, "Push buttons", "sans-bold")
        b = Button(window, "Plain button")
        def cb():
            print("pushed!")
        b.setCallback(cb)

        b = Button(window, "Quit")
        def cb2():
            global quit
            quit = True
            self.setVisible(False)
        b.setCallback(cb2)

        self.performLayout()


if __name__ == "__main__":
    nanogui.init()
    test = TestApp()
    test.drawAll()
    test.setVisible(True)

    print "Calling mainloop"
    h = nanogui.mainloop(detach=test)

    print "Back in Python context"

    for i in xrange(10):
        print i
        time.sleep(1)
        if quit:
            break

    h.join()
    nanogui.shutdown()

When I run it with your new code, it doesn't quite work. Sometimes it does, but sometimes it segfaults. I think at the very least I need to acquire GIL in callbacks. So for example the bindings for setCallback for Button look like:

        .def("setCallback", [](Button& b, std::function<void()> f)
                            {
                                b.setCallback([=]() { py::gil_scoped_acquire gil; f(); });
                            }, D(Button, setCallback))

but this doesn't seem to be enough.

Does the above code work for you? Should it be done differently? Any ideas why it fails?

Thanks.
Dmitriy

@mrzv
Copy link
Contributor Author

mrzv commented Apr 25, 2016

Actually, I was wrong. With the changes to the callback, it's working. (I was making the change incorrectly.) This is amazing. I need to study your code to figure out what you are doing differently. I could never fix up Python state correctly after the context switch.

I'll experiment more to make sure no random segfaults pop up, but this is great.

Why do you choose to not have detach option in C++? Fiddling with it on Mac OS X seems like a tricky exercise, with or without Python.

@mrzv
Copy link
Contributor Author

mrzv commented Apr 25, 2016

Hm, I do occasionally get a segfault in the beginning. Before any GUI is shown. Perhaps, I'm doing something wrong in the initialization?

@wjakob wjakob reopened this Apr 25, 2016
@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

For reference, on what Python version is this?

@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

I just committed some fixes for Python 2.7.x, and a tweak that always acquires the GIL before any kind of callback.

Does this fix things for you?

Regarding not having the detach option in C++: swapping the thread states like this is a sort of sledgehammer peration, and the only reason why it is needed is because both Cocoa and Python are extremely inflexible when it comes to multithreaded execution. In a normal C++ program there are many more ways to restructure the program to avoid this. In doubt, it's easy to pick the appropriate pieces from python.cpp, but I'm not sure I want to provide it as a default option.

@mrzv
Copy link
Contributor Author

mrzv commented Apr 25, 2016

I am indeed using Python 2.7.11. The callback issue is definitely fixed. The initial segfault also hasn't triggered, but I haven't really stressed the code. Do you know why that could have been happening?

In any case, thanks a lot for this. I'm going to explore the code later and try more serious examples that I originally wanted to implement. You can probably close this, and if anything comes up, I'll file a new bug, or reply here, so you can re-open.

Also, understood on the C++ front. The reasoning makes perfect sense.

BTW, do I understand correctly that substantial part of the magic that made this work happened through the changes in pybind itself?

@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

Turns out that the PyGILState is just a broken mess that the Python maintainers have themselves sort of given up (https://bugs.python.org/issue10915, https://bugs.python.org/issue15751). So the main change was indeed in reimplementing a more flexible non-broken PyGILState API in pybind11.

I suspect the startup crash you've been seeing before was due to an issue where Python 2.7.x behaves differently from Python 3.x (which is now fixed): pybind/pybind11@2f6662e

In any case, hopefully this all works now. :)

@wjakob wjakob closed this as completed Apr 25, 2016
@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

There were a few more issues with some versions of python.. the latest pybind11 versions should resolve them.

@wjakob
Copy link
Owner

wjakob commented Apr 25, 2016

(35028da)

@mrzv
Copy link
Contributor Author

mrzv commented May 17, 2016

I finally got to my Linux box, and I hate to report that python/example3.py doesn't work. It segfaults some time after the mainloop launch. I'll dig into it to see where the problem actually occurs, but for now I wanted to re-open this issue.

@mrzv
Copy link
Contributor Author

mrzv commented May 17, 2016

I don't know what exactly the problem is, but if I enable all the machinery that you enable only on __APPLE__, everything works. So I guess there is an easy solution.

On an unrelated note, python/python.cpp needs to include <mutex> and <condition_variable>. On the Linux box, I have GCC 6.1.1, and it's much stricter about the headers.

@wjakob
Copy link
Owner

wjakob commented May 17, 2016

Ok, great to hear -- do you want to submit a PR for that, since you got it running on your machine?

@mrzv
Copy link
Contributor Author

mrzv commented May 17, 2016

I've created PR #68. The only thing is that I haven't tested it on Windows, so who knows what happens there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants