replace signal() call with sigaction() for better portability #127

Merged
merged 2 commits into from Jun 18, 2014

Projects

None yet

3 participants

@blaenk

I came across issue #51 in which Solaris users encounter crashes upon receipt of SIGUSR1 and I have a feeling I know what's happening. I don't think the problem is Solaris' implementation of pthread_kill as that is defined by the POSIX standards and Solaris is apparently fully POSIX-complaint, more than can be said of Linux, and yet those of us on Linux don't experience this problem.

Rather, signal() is known to have portability problems (the POSIX standard allows it presumably due to historical reasons). Some systems such as old Unix systems and System V employ a behavior in which the signal disposition is reset to SIG_DFL (default) upon invocation of a signal handler by the delivery of a signal.

The only portable use of signal() is to set a signal's disposition to
SIG_DFL or SIG_IGN. The semantics when using signal() to establish a
signal handler vary across systems (and POSIX.1 explicitly permits this
variation); do not use it for this purpose.

From: http://man7.org/linux/man-pages/man2/signal.2.html

Not to mention:

The effects of signal() in a multithreaded process are unspecified.

I guess so far we have gotten lucky.

Solaris is a system which employs the SIG_DFL reset behavior even in version 11.1.

See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

If signal() is used, disp is the address of a signal handler, and sig is not SIGILL,
SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before
executing the signal handler.

On Solaris the default signal disposition for SIGUSR1 is to exit the application.

See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html

If I'm correct in assuming this is the problem Solaris users are encountering, the solution to this (and likely any other portability issues that may arise) is to use sigaction() which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux, glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the behavior detailed above), which is likely why few others have experienced these problems.

See http://man7.org/linux/man-pages/man2/signal.2.html

I really hope this helps you guys out with Solaris (and maybe other systems which employ this behavior). I don't have Solaris myself so I can't test it, but it's a drop-in replacement of signal() so it should have no effect on other systems and hopefully fix the problem on Solaris. All I could do on my Linux machine was add a torrent and force a re-hash, which I believe is what fires SIGUSR1 (or so I inferred from the talk on the issue). The re-hash went through just fine.

@blaenk blaenk replace signal() call with sigaction() for better portability
signal() is known to have portability problems. Some systems such as
old Unix systems and System V employ a behavior in which the signal
disposition is reset to SIG_DFL (default) upon invocation of a signal
handler by the delivery of a signal.

See issue #51 in which Solaris users experience signal-based issues.

Solaris is a system which employs the SIG_DFL reset behavior even in
version 11.1.

See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

> If signal() is used, disp is the address of a signal handler, and sig is not SIGILL,
> SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before
> executing the signal handler.

On Solaris the default signal disposition for SIGUSR1 is to exit the application.

See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html

If I'm correct in assuming this is the problem Solaris users are encountering,
the solution to this (and likely any other portability issues that may arise)
is to use sigaction() which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux,
glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the
behavior detailed above), which is likely why few others have experienced these problems.

See http://man7.org/linux/man-pages/man2/signal.2.html
2cd70f4
@blaenk blaenk added SA_RESTART to automatically restart system calls interrupted by…
… signals (as was already done by signal()) as well as error checking on sigaction()
49d0028
@blaenk

I noticed there seems to be some effort into using something other than signals, at least for the thing SIGUSR1 is currently used for. I believe my changes are still useful, however, for any other signal handlers which may be established, considering it's a drop-in replacement of signal() and signal() is well known to have portability problems for establishing signal handlers (the other use of it in rtorrent, for setting SIG_DFL and SIG_IGN, is apparently fine as it is well defined in the POSIX standards).

In fact, we would probably experience these issues on Linux were it not for the _BSD_SOURCE feature test macro being implicitly defined in glibc:

By default, in glibc 2 and later, the signal() wrapper function does not
invoke the kernel system call. Instead, it calls sigaction(2) using
flags that supply BSD semantics. This default behavior is provided as
long as the _BSD_SOURCE feature test macro is defined. By default,
_BSD_SOURCE is defined; it is also implicitly defined if one defines
_GNU_SOURCE, and can of course be explicitly defined.

On glibc 2 and later, if the _BSD_SOURCE feature test macro is not
defined, then signal() provides System V semantics. (The default
implicit definition of _BSD_SOURCE is not provided if one invokes gcc(1)
in one of its standard modes (-std=xxx or -ansi) or defines various other
feature test macros such as _POSIX_SOURCE, _XOPEN_SOURCE, or
_SVID_SOURCE; see feature_test_macros(7).)

Hope that helps.

@lotheac

This PR seems to fix #51 for me on Illumos (OmniOS).

@lotheac

I just noticed that SIGWINCH will still cause rtorrent to exit, though. Maybe the signal handler for that has the same issue.

@blaenk

Eh, accidentally deleted the comment when I meant to edit it.

That's weird, I just checked and the handler for SIGWINCH is established using the same function that was patched. Perhaps the bug is some place else? Could you perhaps provide a backtrace or other thing we could analyze?

@lotheac
@blaenk

I think you can stop the program with CTRL-C and then run info threads to get information on the various threads and what they're doing. You can then change threads with thread #, e.g. thread 3, and then do the regular bt to perform the backtrace.

Supposedly you can pause execution of the program on receipt of SIGWINCH with:

handle SIGWINCH stop print pass

You should also ignore SIGUSR1 and SIGPIPE since I think rtorrent uses those:

handle SIGUSR1 nostop noprint
handle SIGPIPE nostop noprint

You should be able to run the same commands found in #51 to get the same backtrace output that user got.

It's weird that it exits that way though. I tried searching the code base and I couldn't find mention of that error message, are you running the latest version with this patch applied? Because it seems to me like the latest version uses the error string "SCGI listener port received an error event."

It's the same error that is mentioned in issue #51 , so perhaps my patch didn't fix it after all? Or did you say it fixed the SIGUSR1 crash? I also found this old issue. It seems like switching to select-based polling fixed it. Worth a try:

export RTORRENT_POLL=select
rtorrent

However, I don't think rtorrent uses Solaris' OS-specific I/O multiplexing mechanism (/dev/poll or event ports I think) to begin with, so it should be that you're already using select.

@blaenk

Nevermind, I realized that the error message originates from libtorrent. I think I may be on to what is causing the problem.

@blaenk

I think it has to do with libtorrent treating the exceptfds FD set in select() as being outright errors. From what I've read, they're not necessarily errors. The man page for Solaris for select() says:

If a socket has a pending error, it is considered to have an exceptional condition pending. Otherwise, what constitutes an exceptional condition is file type-specific. For a file descriptor for use with a socket, it is protocol-specific except as noted below. For other file types, if the operation is meaningless for a particular file type, select() or pselect() indicates that the descriptor is ready for read or write operations and indicates that the descriptor has no exceptional condition pending.
...
A socket is considered to have an exceptional condition pending if a receive operation with O_NONBLOCK clear for the open file description and with the MSG_OOB flag set would return out-of-band data without blocking. (It is protocol-specific whether the MSG_OOB flag would be used to read out-of-band data.) A socket will also be considered to have an exceptional condition pending if an out-of-band data mark is present in the receive queue.

So it seems that it either could be a socket error or out of bound data. I was wondering if it could be something else other than that on Solaris but couldn't find more information. What we can do is rule out that it is a socket error. If this is the case, then it's something else and it's probably safe to ignore the exception in the program.

To do this we have to make the "Listener port received an error event." message a little bit more descriptive.

Clone the latest libtorrent and go to src/net/listen.cc and change the Listen::event_error() function to this, then recompile rtorrent so that it uses this modified libtorrent:

void
Listen::event_error() {
  int socket = get_fd().get_fd();
  int error = 0;
  socklen_t errorLen = sizeof(error);

  std::string errorMsg = "Listener port received an error event.";

  if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) == -1)
    throw internal_error(errorMsg.c_str());
  else {
    errorMsg += std::string(" ") + strerror(error);
    throw internal_error(errorMsg.c_str());
  }
}

I haven't tried it (since I don't encounter this issue) but it compiles. What it should do is fetch the error associated with the socket, if there is one, and display it a long with the original error message.

I honestly have no clue about Solaris nor am I very familiar with the rtorrent or libtorrent codebase, I'm just trying to help. My suspicion is that Solaris is a little liberal with what it considers to be an exception, which, to reiterate, everywhere I've read including man pages explicitly emphasize that an exception is not necessarily an error. For example on Linux, it's usually either:

  • state change on pseudoterminal slave connected to master that is in packet mode
  • out-of-band data received on stream socket

A file descriptor present in the exception FD set is what triggers this C++ exception to be thrown which is what causes rtorrent to exit. So if we can establish that it really isn't being caused due to an error then perhaps we can do away with the current behavior of treating anything present in the exception FD set as a dramatic error necessitating absolute program termination.

@lotheac
@lotheac
@blaenk

Alright then. If you want, what you can try is making it so that it only throws a C++ exception (and thus exits the program) if there is indeed a socket error:

void
Listen::event_error() {
  int socket = get_fd().get_fd();
  int error = 0;
  socklen_t errorLen = sizeof(error);

  if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) != -1 && error != 0) {
    std::string errorMsg = std::string("Listener port received an error event: ") + strerror(error);
    throw internal_error(errorMsg.c_str());
  }
}

That should check to see if there's a socket error and if so throw the exception with the socket error message. I'm not sure what the implications could be of ignoring the exception, but to reiterate, everywhere I've read has emphasized that a file descriptor being present in the exceptions set doesn't necessarily indicate an error, but an "exceptional condition," such as out-of-band data being present. I can only imagine Solaris' criteria for what can show up there is a bit wider and so you're witnessing this in what is otherwise a normal use case on other platforms.

@blaenk

If this helps and seems to cause no problems in prolonged use, I'll submit it as a separate PR.

@blaenk

Just to clarify, there are two separate kinds of exceptions we're talking about in this issue/fix.

  • C++ exception: the throw internal_error part is the C++ exception being thrown by rtorrent whenever it detects that a file descriptor is present in the exception file descriptor set
  • exception file descriptor set: the select() function monitors/returns three different sets of file descriptors: those that are ready to be read, those that are ready to be written to, and those that have reached an "exceptional condition"

The problem that we're trying to solve is that, when rtorrent detects that a file descriptor is present in the exception FD set as returned by select(), it assumes that to be an outright, stop-everything-you're-doing error, and so it throws the C++ exception which subsequently terminates rtorrent. The problem is that from what I've read, a file descriptor being present in the exceptions file descriptor set doesn't necessarily mean an error, but rather it means that the file descriptor has reached an "exceptional condition" such as out-of-band data being present on a stream socket.

The fix simply uses the socket API to check if there actually is an error present with the socket. If so then indeed throw the exception with a detailed message about the error. Otherwise ignore the file descriptor exception.

@lotheac
@lotheac
@blaenk
@rakshasa rakshasa merged commit 49d0028 into rakshasa:master Jun 18, 2014
@lotheac lotheac added a commit to niksula/omnios-build-scripts that referenced this pull request Oct 7, 2014
@lotheac lotheac patch rtorrent signal handling 82131aa
@lotheac lotheac added a commit to niksula/omnios-build that referenced this pull request Oct 7, 2014
@lotheac lotheac patch rtorrent signal handling 5b190c2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment