enableWorkerThreads: crash upon SIGINT #419

Closed
mihails-strasuns opened this Issue Dec 1, 2013 · 28 comments

3 participants
@mihails-strasuns
Contributor

mihails-strasuns commented Dec 1, 2013

Using a trivial listener app:

import vibe.http.server;
import vibe.core.core : enableWorkerThreads;

void handleRequest(HTTPServerRequest req, HTTPServerResponse res)
{
    res.writeBody("Hello", "text/plain");
}

shared static this()
{   
    enableWorkerThreads();
    listenHTTP(new HTTPServerSettings(), &handleRequest);
}

Crashes after hitting Ctrl+C (Arch Linux x64).

Backtrace:

Program terminated with signal 11, Segmentation fault.
#0  0x000000000061fb94 in vibe.utils.hashmap.__T7HashMapTC4core6thread6ThreadTS4vibe4core7drivers9libevent210ThreadSlotZ.HashMap.resize() (this=0x7f2d54ac5ba8, k=0x0) at ../../../../../.dub/packages/vibe-d-master/source/vibe/utils/hashmap.d:205
205                 m_hasher = k => k.toHash();
(gdb) bt
#0  0x000000000061fb94 in vibe.utils.hashmap.__T7HashMapTC4core6thread6ThreadTS4vibe4core7drivers9libevent210ThreadSlotZ.HashMap.resize() (this=0x7f2d54ac5ba8, k=0x0) at ../../../../../.dub/packages/vibe-d-master/source/vibe/utils/hashmap.d:205
#1  0x0000000000661872 in vibe.utils.hashmap.__T7HashMapTC4core6thread6ThreadTS4vibe4core7drivers9libevent210ThreadSlotZ.HashMap.findIndex() (this=0x7f2d54ac5ba8, key=0x0) at ../../../../../.dub/packages/vibe-d-master/source/vibe/utils/hashmap.d:165
#2  0x000000000061f882 in vibe.utils.hashmap.__T7HashMapTC4core6thread6ThreadTS4vibe4core7drivers9libevent210ThreadSlotZ.HashMap.__T13opBinaryRightVAyaa2_696eZ.opBinaryRight() (this=0x7f2d54ac5ba8, key=0x0) at ../../../../../.dub/packages/vibe-d-master/source/vibe/utils/hashmap.d:121
#3  0x000000000065e0e5 in vibe.core.drivers.libevent2.Libevent2ManualEvent.amOwner() (this=0x7f2d54ac5b80) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/drivers/libevent2.d:511
#4  0x000000000065d985 in vibe.core.drivers.libevent2.Libevent2ManualEvent.wait() (this=0x7f2d54ac5b80, reference_emit_count=2) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/drivers/libevent2.d:449
#5  0x000000000065d910 in vibe.core.drivers.libevent2.Libevent2ManualEvent.wait() (this=0x7f2d54ac5b80) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/drivers/libevent2.d:444
#6  0x0000000000681ada in vibe.core.core.exitEventLoop() (shutdown_workers=true) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/core.d:113
#7  0x0000000000683c81 in onSignal (signal=2) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/core.d:101
#8  <signal handler called>
#9  0x00007f2d53527a73 in epoll_wait () from /usr/lib/libc.so.6
#10 0x00007f2d545af793 in ?? () from /usr/lib/libevent-2.0.so.5
#11 0x00007f2d5459bae7 in event_base_loop () from /usr/lib/libevent-2.0.so.5
#12 0x000000000065ae0b in vibe.core.drivers.libevent2.Libevent2Driver.runEventLoop() (this=0x7f2d54ac5e00) at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/drivers/libevent2.d:141
#13 0x00000000006818f2 in vibe.core.core.runEventLoop() () at ../../../../../.dub/packages/vibe-d-master/source/vibe/core/core.d:84
#14 0x00000000006a2716 in D main () at ../../../../../.dub/packages/vibe-d-master/source/vibe/appmain.d:54
#15 0x000000000072a304 in rt.dmain2._d_run_main() ()
#16 0x000000000072a25e in rt.dmain2._d_run_main() ()
#17 0x000000000072a2c4 in rt.dmain2._d_run_main() ()
#18 0x000000000072a25e in rt.dmain2._d_run_main() ()
#19 0x000000000072a1df in _d_run_main ()
#20 0x00000000006a2765 in main ()

Contributor

mihails-strasuns commented Dec 6, 2013

Looks like worker threads are currently completely broken. For a simple application, just enabling this functionality makes the server hang after completing the initial TCP handshake, with no data written.

Member

s-ludwig commented Dec 9, 2013

I've set up an Arch VM and can't reproduce it - both Ctrl+C and making requests work fine. It also works on Windows and Ubuntu.

Contributor

mihails-strasuns commented Dec 9, 2013

I can reliably reproduce it on both a native desktop installation and a VirtualBox instance, with both 0.7.18 and master. dmd / dub were installed from the repository. What was your VM setup?

Member

s-ludwig commented Dec 9, 2013

Hm ok, did you try with vibe.d master? There have been a couple of fixes since then. I'll try 0.7.18 to be sure it's nothing system-specific.

Contributor

mihails-strasuns commented Dec 9, 2013

Both 0.7.18 and master

Member

s-ludwig commented Dec 9, 2013

Oh sorry, overlooked that. So that's strange then. I'm testing on a fresh system: Up-to-date Arch Linux x86-64, DMD 2.064.2, vibe.d master, DUB master, everything "dub upgrade"d - running on VMware. Tried with your example code, as well as the "bench-http-server" example.

Contributor

mihails-strasuns commented Dec 9, 2013

Did you install dmd using pacman, or use the one from the .zip? Another obvious difference is VMware vs. VirtualBox. It must be something in the environment, but I don't see anything weird at first glance. I'll investigate the actual crash dumps in more detail, but that will take some time.

Member

s-ludwig commented Dec 9, 2013

The one from the zip file. I guess VMware vs. VirtualBox shouldn't make a difference at this level, but who knows.

Contributor

mihails-strasuns commented Dec 9, 2013

Smells bad. Will retest with the binary from the .zip shortly.

Contributor

mihails-strasuns commented Dec 9, 2013

Phew, luckily it still crashes :)

Contributor

mihails-strasuns commented Dec 9, 2013

Ok, will report after more detailed investigation.

Contributor

mihails-strasuns commented Dec 10, 2013

Looks like a corrupted fiber switch: all core dumps have garbage in the registers, like this:

(gdb) info registers 
rax            0x7f3081b70ba8   139846311414696
rbx            0x7f3081b70b80   139846311414656
rcx            0x7f3081b70ba8   139846311414696
rdx            0x61fc28 6421544
rsi            0x0  0
rdi            0x0  0
rbp            0x7fff9e1310b0   0x7fff9e1310b0
rsp            0x7fff9e1310a0   0x7fff9e1310a0
r8             0x7f3081b70b10   139846311414544
r9             0x1  1
r10            0x0  0
r11            0x246    582
r12            0x28e4214    42877460
r13            0x28e4208    42877448
r14            0x28e4d18    42880280
r15            0xffffffff   4294967295
rip            0x61fc3c 0x61fc3c <vibe.core.file.getFileInfo()+276>
eflags         0x10206  [ PF IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0

This will need a very detailed investigation of the event loop / fiber code to proceed, so it is likely to be put on pause for now =/

Contributor

etcimon commented Dec 10, 2013

libevent2.d:427

~this()
{
    foreach (ts; m_waiters)
        event_free(ts.event);
}

Should be

~this()
{
    onThreadShutdown();
}

Member

s-ludwig commented Dec 10, 2013

Unfortunately it can't, because the class is GC-managed. This means that any other GC references (including the mutex in particular) might already be finalized before the destructor is called. Also, locking a mutex in a destructor is generally useless, because any competing thread will access the already destructed (and possibly freed) object after the mutex is unlocked, still resulting in a crash. The higher-level code needs to ensure that nobody accesses the object anymore once its destructor is called.

In this particular case (the call stack above) it also doesn't make sense for the destructor to be called, because the ManualEvent is still referenced by the global vibe.core.core.st_workerTaskSignal and the main thread also owns it. However, Libevent2Driver.registerObject and unregisterObject look suspicious. m_ownedObjects has to be protected by the mutex, because Libevent2Object.~this may be called by the GC in a different thread than the one it was created in (which is bad, but what can you do).

I'll set up the VM with multiple cores; thinking about it, that seems to be the most likely cause for the differences.

Member

s-ludwig commented Dec 10, 2013

Still can't get it, even with multiple cores. I'll push a fix for the (un)register methods.

s-ludwig added a commit that referenced this issue Dec 10, 2013

Protect m_ownedObjects by a mutex as unregisterObject() may be called from a foreign thread.

The GC calls finalizers in whatever thread the collection happened in, which can cause unregisterObject() to be executed in a different thread than where the driver and the object live(d). See also #419.

Contributor

mihails-strasuns commented Dec 10, 2013

Using master with that fix, I was reliably getting the same backtrace, with the crash on m_hasher = k => k.toHash(); because k is null. The registers are still a bit weird, but that is likely just a coincidence. After hacking that to be m_hasher = k => k ? k.toHash() : 0, I got a triggered assert in libevent2.d:475: assert(!amOwner());

Any suggestions?

Member

s-ludwig commented Dec 11, 2013

I've understood the issue now and will prepare a fix. Windows and Linux work very differently; Windows in particular currently "sidesteps" the issue, because the break signal arrives in a special thread where s_eventLoopRunning is naturally false and exit(1) is used to terminate the process.

On Linux, several issues are working together. First, the default hasher in HashMap for class and pointer types needs to be adjusted to cope with null values. Second, Libevent2ManualEvent.wait() currently can't handle being called from outside of a Task. Third (not happening here), exitEventLoop doesn't shut down the main event loop when called from a non-main thread, which is probably not what is desired. I still can't reproduce the crash for partially unknown reasons, but fortunately that shouldn't matter anymore.

So basically, the thread shutdown procedure needs to be adjusted to work from any thread and to use a core.sync.condition.Condition instead of Libevent2ManualEvent. Improving Libevent2ManualEvent.wait() would be even better, but that is a lot more involved.

Contributor

mihails-strasuns commented Dec 12, 2013

Have tested the fix. Good news: it does not crash. Bad news: it hangs forever upon exit :) Though AFAIR you mentioned this as some inherent libevent2 issue; is that still true?

Contributor

mihails-strasuns commented Dec 12, 2013

Though it seems to ignore even SIGTERM now; only SIGKILL / SIGABRT work.

Member

s-ludwig commented Dec 12, 2013

That sucks! I need to somehow be able to reproduce that. For me it shuts down in a perfectly clean way... Could there be any setting that influences how things are handled for your OS installation? Otherwise, you could try to put debug log statements on (almost) each line of onSignal, exitEventLoop and runEventLoop so that the thread interactions become visible (with setLogFormat(FileLogger.Format.threadTime)).

Member

s-ludwig commented Dec 12, 2013

BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?

Contributor

mihails-strasuns commented Dec 12, 2013

Pretty much a default Arch x64 VirtualBox instance here, 1-CPU setup. The only additional packages installed from the repos are git and all the dlang stuff, so I can't imagine any special local configuration or overly clever daemons. I will provide extra trace information once I hit some pause at work.

BTW, SIGINT and SIGTERM have been handled the same before, too. Did SIGINT crash for you, but SIGTERM didn't?

I didn't check SIGTERM before; I only noticed that when trying to kill the process. Btw, it does print "Received signal XX. Shutting down.", but nothing observable happens after that.

Member

s-ludwig commented Dec 12, 2013

Wait a minute. Using SIGTERM just outputs that the signal was received a bunch of times and then hangs for me, too. I'm investigating.

s-ludwig added a commit that referenced this issue Dec 12, 2013

Streamline thread shutdown handling and don't wait in the signal handler for all threads to shut down.

Also replaces the implicitly "__gshared" s_core.m_eventLoopRunning by a thread-local s_eventLoopRunning. This sometimes caused threads to not exit when they should. See #419.

Member

s-ludwig commented Dec 12, 2013

There were still two issues: one shared variable that was thought to be thread-local, and the signal handler taking too long to finish. The code is now also written to unify handling across worker threads and normal threads, which makes it easier to understand. In the current state everything works reliably for me using either "kill" or Ctrl+C.

Contributor

mihails-strasuns commented Dec 12, 2013

Yay, works for me! You can call yourself a "telepathy debugging master" now :)
Thanks, this was partially blocking my attempt to create automatic perf tests for vibe.d; back to it.

Contributor

mihails-strasuns commented Dec 12, 2013

P.S. This has also fixed the issue with hanging requests when worker threads / distribution are enabled.

Contributor

etcimon commented Dec 12, 2013

That's some purely awesome multi-thread handling with atomics. I'm always impressed by it because it's so modern.

Member

s-ludwig commented Dec 12, 2013

Phew! :)

Okay, now it's time for me to finally take a closer look at the session topics!
