exceptions are not scalable #73
Looking at gcc's source code, …

Some very partial progress on the situation: …
In continuation-passing style, IMHO with promise-future, the continuation is still explicitly expressed (it might be called 'continuation-chaining style'?). We could avoid using C++ …
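For illustration only, here is a hypothetical sketch of what "passing the error along the continuation chain as a value, instead of throwing" could look like; the `result`/`and_then` names are made up for this example and are not Seastar API:

```cpp
#include <string>
#include <variant>

// Illustrative error-as-value type: holds either a T or an error message.
template <typename T>
using result = std::variant<T, std::string>;

// Run the explicit continuation f on success; short-circuit on error.
// No C++ exception machinery (and hence no unwinder locks) is involved.
template <typename T, typename F>
auto and_then(result<T> r, F f) -> decltype(f(std::get<T>(r))) {
    if (auto* err = std::get_if<std::string>(&r)) {
        return *err;              // propagate the error along the chain
    }
    return f(std::get<T>(r));     // invoke the explicit continuation
}
```

For example, `and_then(parse(input), [](int v) -> result<int> { return v * 2; })` passes either the doubled value or the original error to the next link in the chain.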
I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
Thanks to @gleb-cloudius, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68297 was solved in gcc 7, and std::make_exception_ptr() no longer involves throwing an actual exception. Gleb, can you please summarize the state of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744, i.e., the attempt to allow concurrent exception throwing? I see there was a lot of activity on that issue, but I don't understand what the conclusion was. Note that even though the above fix, and several others mentioned in the comments above, reduced the amount of exception throwing, some still remains, so it would be nice to make that scalable as well.
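For context, a small sketch of the difference (the second function approximates what older libstdc++ did internally; the exact internals may differ):

```cpp
#include <exception>
#include <stdexcept>

// With the gcc 7 fix for bug 68297, libstdc++ constructs the
// exception_ptr directly, with no hidden throw/catch round-trip:
std::exception_ptr make_direct() {
    return std::make_exception_ptr(std::runtime_error("timed out"));
}

// The pre-fix behaviour was roughly equivalent to this: a real throw,
// a real unwind, and hence a trip through the (locked) unwinder machinery,
// even though the exception is caught immediately.
std::exception_ptr make_via_throw() {
    try {
        throw std::runtime_error("timed out");
    } catch (...) {
        return std::current_exception();
    }
}
```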
There is no conclusion. I think there is an understanding of the problem and willingness to address it, but not at all costs. It involves coordination between gcc and glibc and hence needs much more time dedicated to it. I proposed a solution that requires adding a new ABI function in glibc that has to be used by newer gcc during unwind. There are comments on the function implementation itself (how it achieves parallelism) and on adding a new ABI function as opposed to doing something with symbol versioning (not sure how the difference in locking behaviour can be addressed by symbol versioning, but then I have not looked into it enough). That's more or less the state.

--
Gleb.
In #399, @gleb-cloudius explains the current state of this issue.
Basically, Seastar no longer has this bug: one lock was eliminated by gcc 7 (so switch to gcc 7!), and the second lock we work around by reimplementing dl_iterate_phdr() ourselves in core/exception_hacks.cc.
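For illustration, a minimal sketch of the caching trick, assuming the glibc dl_iterate_phdr interface from <link.h>; the names and error handling here are simplified compared to the real core/exception_hacks.cc (compile with -ldl):

```cpp
#include <link.h>
#include <dlfcn.h>
#include <vector>

namespace {

using phdr_fn = int (*)(int (*)(dl_phdr_info*, size_t, void*), void*);

std::vector<dl_phdr_info> phdr_cache;

// Snapshot the shared-object list once, before any thread starts throwing.
__attribute__((constructor))
void build_phdr_cache() {
    // Look up the real glibc implementation, which we are about to shadow.
    auto real = reinterpret_cast<phdr_fn>(dlsym(RTLD_NEXT, "dl_iterate_phdr"));
    real([] (dl_phdr_info* info, size_t, void*) {
        phdr_cache.push_back(*info);
        return 0;
    }, nullptr);
}

} // anonymous namespace

// Our definition shadows glibc's; the unwinder now iterates the cached
// list without taking the loader lock.
extern "C" int dl_iterate_phdr(int (*cb)(dl_phdr_info*, size_t, void*),
                               void* data) {
    for (auto& info : phdr_cache) {
        if (int r = cb(&info, sizeof(info), data)) {
            return r;
        }
    }
    return 0;
}
```

The obvious trade-off is that the cache goes stale if anything calls dlopen()/dlclose() after it is built, which is exactly the restriction the patch quoted below also documents.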
Relates: https://st.yandex-team.ru/

Multiple threads on multiple cores should be able to throw exceptions concurrently without bothering one another. Unfortunately, in the current implementation of libstdc++ and/or glibc, the stack unwinding process takes a global lock (while getting the list of shared objects, and perhaps for other things), which serializes parallel exception throwing and can dramatically slow down the program.

Some might dismiss this inefficiency with the standard "exceptions should be rare" excuse. They should be rare. But sometimes they are not, leading to a catastrophic collapse in performance. We saw an illustrative example of an "exception storm" in an application of ours. This application can handle lots and lots of requests per second on many cores. Some unexpected circumstance caused the application to slow down somewhat, which led to some of the requests timing out. The timeout was implemented as an exception, so now we had thousands of exceptions being thrown on all cores in parallel. This led to the application's threads starting to hang, once in a while, on the lock(s) inside "throw". This in turn made the application even slower and created even more timeouts, which in turn resulted in even more exceptions. In this way the number of exceptions per second escalated, until most of the work the application was doing was fighting over the "throw" locks, and no useful work was being done.

This patch eliminates the "throw" lock by supplying our own "dl_iterate_phdr" function which operates over a cached list of shared objects. This should mitigate the blocking behavior in an exception-storm scenario, but as a trade-off it disables dynamic loading/unloading during the component system lifetime: there is no thread-safe and robust way to synchronize that with the cache we've got. If one really needs dlopen/dlclose outside of a component constructor/destructor, this optimization can be disabled via the USERVER_DISABLE_PHDR_CACHE cmake option.

In benchmarks which just throw+catch in parallel we see, as expected, an improvement by a factor of X (the number of threads): it took some 20+ seconds for 8 threads to throw a million exceptions each in parallel, and now it takes a mere ~2s. This also improves RPS by a factor of 2+ when an endpoint under load just throws std::runtime_error.

Some references:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
scylladb/seastar@464f5e3
scylladb/seastar#73
https://stackoverflow.com/questions/26257343/does-stack-unwinding-really-require-locks
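For reference, a standalone microbenchmark in the spirit of the one described above; the thread and iteration counts mirror the quoted numbers, and the timings will of course vary by machine and toolchain:

```cpp
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <thread>
#include <vector>

int main() {
    constexpr int n_threads = 8;
    constexpr int n_throws  = 1'000'000;
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([] {
            for (int i = 0; i < n_throws; ++i) {
                try {
                    throw std::runtime_error("timeout");
                } catch (const std::exception&) {
                    // swallow: we only measure throw/unwind cost
                }
            }
        });
    }
    for (auto& th : threads) th.join();
    std::chrono::duration<double> secs =
        std::chrono::steady_clock::now() - start;
    std::printf("%d threads x %d throws: %.2f s\n",
                n_threads, n_throws, secs.count());
}
```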
This is an issue to think about - it's not urgent, and I don't have any idea how we can solve it.
Seastar goes out of its way not to use any locks or even atomic operations, because these do not scale as the number of cores grows. In particular, we have our own single-thread versions of std::shared_ptr and std::string, because the standard ones use atomic operations so that they can be shared across threads.
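A minimal sketch of the idea behind such a single-thread smart pointer, assuming a plain non-atomic reference count; this is an illustration, not Seastar's actual lw_shared_ptr:

```cpp
#include <utility>

// A shared pointer whose refcount is a plain long: no atomic ops, no
// locks. Safe only because each object is confined to a single core.
template <typename T>
class local_shared_ptr {
    struct block { T value; long refs; };
    block* _b = nullptr;
public:
    explicit local_shared_ptr(T v) : _b(new block{std::move(v), 1}) {}
    local_shared_ptr(const local_shared_ptr& o) : _b(o._b) {
        if (_b) ++_b->refs;                       // non-atomic increment
    }
    ~local_shared_ptr() {
        if (_b && --_b->refs == 0) delete _b;     // non-atomic decrement
    }
    local_shared_ptr& operator=(const local_shared_ptr&) = delete; // omitted for brevity
    T& operator*() const { return _b->value; }
    T* operator->() const { return &_b->value; }
};
```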
One unscalable thing we're left with is exception handling: std::exception_ptr uses atomic operations (just like std::shared_ptr). But more worryingly, throwing an exception appears to take global locks during stack unwinding (see for example http://stackoverflow.com/questions/26257343/does-stack-unwinding-really-require-locks), which means one thread throwing an exception can block another thread that is also trying to throw an exception. And blocking is really bad in Seastar's single-thread-per-core design.
Obviously, the best solution is to use exceptions as little as possible. But when your server is handling 1 million requests per second, you need to be really careful to avoid any possibility of exceptions in the course of request handling. Note that exceptions are known to be slow; that is fine. What is not fine is that an exception on one thread can block other threads on a machine with many cores.
I don't know if we can ever solve this issue without modifying/overriding libgcc, but the minimum we should do is to document this issue and warn against using exceptions too much in Seastar.
Another idea worth looking into is whether we can implement a future's exception state without actually throwing exceptions: in a lot of Seastar code, we do not throw an exception, but rather return a make_exception_future<...>(). Commit 44e35a4 prevents a bunch of wasteful rethrows of this stored exception, but we still have two problems: 1) make_exception_future internally throws an exception to build a std::exception_ptr, and 2) code which uses then_wrapped() usually rethrows the exception when calling get(). Is there a way to support exceptional futures without the overheads of actual exception handling?
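A sketch of what consuming an exceptional future without a rethrow could look like, assuming Seastar's then_wrapped()/failed()/get_exception() API (the header path follows the current Seastar layout, and get0()/get() naming varies across versions); this addresses problem 2 above rather than problem 1:

```cpp
#include <exception>
#include <stdexcept>
#include <seastar/core/future.hh>

// Stand-in for some operation that may fail. Building the exceptional
// future from make_exception_ptr avoids a throw here (post gcc 7).
seastar::future<int> do_request() {
    return seastar::make_exception_future<int>(
        std::make_exception_ptr(std::runtime_error("timed out")));
}

seastar::future<int> handle() {
    return do_request().then_wrapped([] (seastar::future<int> f) {
        if (f.failed()) {
            // Extract the error as an exception_ptr: no rethrow, no unwind.
            std::exception_ptr ep = f.get_exception();
            return seastar::make_exception_future<int>(std::move(ep));
        }
        return seastar::make_ready_future<int>(f.get0());
    });
}
```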