
lgamma is not thread safe #1250

Closed
wds15 opened this issue May 23, 2019 · 34 comments

@wds15
Contributor

wds15 commented May 23, 2019

Description

We have moved away from using Boost functions for most math special functions in commit 01bb664. This affected the log gamma function in particular. Instead we now use the std::lgamma function, which turns out to cause problems in threading contexts. The behaviour is generally undefined with respect to thread safety, see here.

On macOS Mojave the lgamma implementation of clang locks a global mutex, as can be seen from the terrible scaling of multi-threaded benchmarks, see here.

This is a big problem for any threading-based application involving Stan (parallel chains or httpstan).

Example

See the discourse thread linked above for an example.

Expected Output

We should use a thread-safe version of lgamma, and ideally one which does not lock a global mutex, which drastically limits parallelisation performance under threading. At worst we should distinguish between threading and non-threading.

Right now the behaviour of lgamma when using threads is implementation-specific and, to my understanding, generally undefined.

Current Version:

v2.19.1

@bob-carpenter
Contributor

Are you suggesting using the Boost version of lgamma? It's an easy fix, but we have to be careful again about edge conditions and where we want to throw. I still don't know that exception behavior for our math library has been determined, so I'm punting to @seantalts and @syclik for clarification.

@seantalts
Member

It's whatever the doc string for lgamma says: http://mc-stan.org/math/d4/d84/namespacestan_1_1math.html#aead76f03bdbc60484ad760fc31bad40f

@bob-carpenter
Contributor

bob-carpenter commented May 23, 2019 via email

@bob-carpenter
Contributor

bob-carpenter commented May 23, 2019 via email

@seantalts
Member

Yeah, what we promise in the doxygen comments is what we're holding ourselves to in theory.

Since it doesn't mention exceptions it seems like we have made no promises there. Would it be hard to match whatever behavior we currently have?

@wds15
Contributor Author

wds15 commented May 24, 2019

Yes, I am suggesting we revert the commit quoted above, which replaced the Boost version with the std one. However, another area for potential trouble is performance: from my quick tests the Boost version is 3x slower than the std version. While the Boost version has a well-defined precision according to a policy, the std version does not let us control this in any way. Thus, we may have to balance performance against accuracy, but we don't know where we stand right now.

I can start on a PR where I revert the above commit.

@seantalts
Member

We should make a note of how the behavior changes, just to be polite, and add it to the release notes one day (maybe put a section on that at the top of the PR or issue, something like "for release notes: lgamma now throws exceptions..." or whatever). I think that seems reasonable. Another weird thing we could do would be to use std::lgamma if we're not in threading mode, but I think that would be too confusing.

@bob-carpenter
Contributor

I was hoping instead of laying down a letter of the law based on our spotty code documentation, that we'd be moving to consistent treatment of edge cases (+inf, -inf, poles, NaN, and out-of-domain inputs). I also hoped we'd be adding documentation to clarify behavior rather than avoiding it to allow us wiggle room w.r.t. backward compatibility.

@seantalts
Member

That sounds good, why don't you write up a proposal for how you want to do that as a design doc?

@sakrejda
Contributor

Just linking previous discussion

@rok-cesnovar rok-cesnovar self-assigned this May 25, 2019
@syclik
Member

syclik commented May 27, 2019

@wds15: does std::lgamma misbehave with threading? Can you write a test that exhibits the failure? For a bigger question: what other parts of the library aren't really going to work with threading?

I think the issue itself could use more work. What is the ideal outcome? Will using std::lgamma with our current compilers fail us, or is it just slow? Does this only affect threaded applications where a Stan program uses lgamma?

That sounds good, why don't you write up a proposal for how you want to do that as a design doc?

@seantalts: I'm not sure how a policy discussion would fall into a design doc (for code)? Maybe my imagination isn't as good as yours.

@seantalts
Member

@seantalts: I'm not sure how a policy discussion would fall into a design doc (for code)? Maybe my imagination isn't as good as yours.

Yeah, design doc (for code) is not the right kind - but some kind of Request For Comments draft would probably be a good way to get traction on that issue.

@wds15
Contributor Author

wds15 commented May 27, 2019

@syclik std::lgamma is not guaranteed to be thread-safe per its documentation. In practice you get some implementation on a given platform whose behaviour with respect to threading is undefined. On my macOS system the implementation evidently uses a global mutex internally to access a global variable. I am not sure how I could write a test for such an internal mechanism of the function (maybe by using negative and positive arguments, but I do not know). Writing a test for this is very hard; right now I could only detect it using performance benchmarks which do not scale with multiple threads, but that isn't a good unit test, right?

Using std::lgamma has the disadvantages of

  • having undefined behaviour across platforms with respect to threading. On macOS, performance is seriously hampered when used in parallel due to an internal mutex.
  • uncontrolled output accuracy; at least I cannot find any notes on the accuracy of that function.

However, when I drop-in replaced std::lgamma with the Boost version, the Boost version was ~3x slower than std.

The best outcome is to use the Boost version of lgamma for all our applications; hopefully, when the Boost version is asked for accuracy similar to the std version, the two functions will be equally fast.

And yes, this problem probably affects a few more gamma functions, and maybe even more functions which we do not know about yet.

@bob-carpenter
Contributor

bob-carpenter commented May 28, 2019 via email

@wds15
Contributor Author

wds15 commented May 28, 2019

That's right - all of stan-math is not thread safe unless you run it with STAN_THREADS turned on. In that case we can run things concurrently with map_rect. In this setting we have a problem with std::lgamma for these reasons:

  1. There is no thread safety guarantee. So concurrent calls to std::lgamma from concurrent map_rect contexts will be problematic.
  2. It turns out that on macOS the behavior of std::lgamma whenever threads are used is such that a global mutex is used to synchronize the calls (likely to deal with the sign of the gamma function). This synchronization makes the implementation on macOS per se thread safe, but it kills off any performance gains through parallelization.
  3. The worst thing is that the thread behavior of that function is implementation dependent (OS, compiler, libstdc++, whatever)

Does that explain?

@bob-carpenter
Contributor

bob-carpenter commented May 28, 2019 via email

@seantalts
Member

seantalts commented May 28, 2019

STAN_THREADS only makes the memory access thread safe. It doesn't make things like containers (Eigen::Matrix or std::vector) or our stan::math::vari instances thread safe for mutability.

That's why we only support threading via map_rect :D

@bob-carpenter
Contributor

bob-carpenter commented May 28, 2019 via email

@seantalts
Member

My understanding is that std::lgamma's lack of thread safety becomes a problem precisely because map_rect might call it from multiple threads.

Yep. Just highlighting that the only sense of "thread safety" we need to worry about is via map_rect. If the design doc for parallel for goes through, that will need to use the same mechanisms, which I believe is the plan.

@wds15
Contributor Author

wds15 commented May 28, 2019

Things need to be safe if concurrently executed in different local contexts... and static variables can be a source of problems if they are not const, yes.

@syclik
Member

syclik commented May 30, 2019

@bob-carpenter is right. map_rect is imposing thread-safety requirements on any Stan function that might be evaluated in the function passed to map_rect. We haven't been so careful about this since thread safety wasn't a requirement, but I think we should be much more careful going forward.

Fortunately, I think lgamma is one of the few functions we would use that are known to not be thread-safe. See this list http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01. (I see it linked everywhere... I don't know if it covers all of the standard library or if we really need to be digging through the doc there.)

We don't have this problem with MPI because it uses separate processes.

Part of me thinks we should reevaluate including threading for now while we sort out these problems; we essentially need to search through all library functions we use to see if they're not thread-safe. But we can pick that discussion up on discourse.

@syclik
Member

syclik commented May 30, 2019

For now, I think it's reasonable to describe the qualities of lgamma that we want, to create some tests for it, and to verify that the boost version has the characteristics that we want. Then switch to that implementation.

I am not sure how I could write a test for such an internal mechanism of the function (maybe by using negative and positive arguments, but I do not know). Writing a test for this is very hard; right now I could only detect it using performance benchmarks which do not scale with multiple threads, but that isn't a good unit test, right?

@wds15: we need to write a test that flexes what we think could be an issue when running the function with threading. Yes, it might not be easy, but this is a necessary condition. How else will you prevent someone in the future from looking at boost::lgamma and thinking... "oh, I'll switch that to std::lgamma"? And maybe we can replace it in the future. Or we might be able to avoid a problem with a future implementation.

Re: nomenclature. I don't care if it's not called a unit test, but it's what we need to test.

@bgoodri
Contributor

bgoodri commented May 30, 2019

The link says

The POSIX version of lgamma is not thread-safe: each execution of the function stores the sign of the gamma function of arg in the static external variable signgam.

which I have always thought was a non-issue for us because nothing in Stan Math is or should be writing or reading signgam.

@syclik
Member

syclik commented May 30, 2019

@bgoodri: that makes sense. Does that mean the results will be safe with multiple threads? Does it explain why @wds15 has seen a performance hit?

@bgoodri
Contributor

bgoodri commented May 30, 2019

The title of this issue shouldn't be "lgamma is not thread safe"; rather it should be "signgam is not thread safe". So, if it is true that no one is using signgam then the results are safe and I wouldn't be surprised if the std version were faster than the Boost version. The C++ standards gurus also require accuracy within 1 ULP or something, so we are probably fine on that front.

For statistics in general and HMC in particular, I have a hard time thinking of an example where someone would want to access signgam. Typically, we already are requiring the input to be non-negative; otherwise HMC would have the problems it already has with abs or square on input that can be either negative or positive. Worse, the gamma function is crazy for negative input with oscillations and singularities, so I don't see how anyone would use Stan to do statistical inference unless the input was already constrained to be non-negative.

We could introduce a lgamma_r function to Stan Math and / or have stan::math::lgamma set signgam to -2147483647.

@wds15 wds15 mentioned this issue May 30, 2019
@sakrejda
Contributor

@bob-carpenter re: boost::math and thread safety: https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/threads.html

@bob-carpenter
Contributor

Tests are great if feasible. It can be really hard to test these threading bugs---I've seen cases that take 48 hours on 72 concurrent threads to arise.

How else will you prevent someone in the future from looking at boost::lgamma and thinking... "oh, I'll switch that to std::lgamma"?

lgamma isn't special---we have the same requirements on all of our functions. The only difference is that it has a tempting replacement in std::lgamma. My suggestion is to doc where Boost's version is used saying that the std:: version isn't thread safe.

I don't know how to stop people from calling std::lgamma elsewhere when they shouldn't. Currently, the only place std::lgamma is used is in stan/math/prim/scal/fun/lgamma.hpp and that should probably be the only place Boost's gamma is used.

@bob-carpenter
Contributor

bob-carpenter commented May 30, 2019

https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/threads.html

That's convenient. It just says:

The library is fully thread safe and re-entrant for all functions regard[les]s of the data type they are instantiated on. Thread safety limitations relating to user defined types present in previous releases (prior to 1.50.0) have been removed.

@bgoodri
Contributor

bgoodri commented May 30, 2019

lgamma isn't special---we have the same requirements on all of our functions. The only difference is that it has a tempting replacement in std::lgamma.

What other standard functions have the side effect of setting a global variable like signgam?

@bob-carpenter
Contributor

bob-carpenter commented May 31, 2019 via email

@bob-carpenter
Contributor

bob-carpenter commented May 31, 2019 via email

@syclik
Member

syclik commented May 31, 2019

What other standard functions have the side effect of setting a global variable like signgam?

@bgoodri: it looks like functions that are related (I'm seeing the same list on different sites and can find a source if you want it):
gamma(), lgamma(), lgammaf(), lgammal()

I don't know if there are other functions in Eigen or elsewhere that have these behaviors.

More importantly, this means that any Math function implementation can't really use static variables to compute things. This is a new requirement. That would have been bad practice anyway, but we haven't explicitly checked for it or forbidden it. The only place where I think we might even attempt to do that is in optimized code. There's a lot of cleverness there with separate memory management, and we should probably comb through it very carefully.

By imposing thread safety requirements and assuming people know the language. It's how we do everything. We can't guard against generic programmer error. We have exactly the same problem with every one of our functions, though they don't come with such a tempting replacement.

Yup. What's the best way to impose thread safety requirements?

If you can formulate a test that's reliable, that's great, but it's a notoriously hard problem in a multithreaded context to trigger bugs. I've seen threading bugs that took on the order of days to trigger given dozens of concurrent threads.

@wds15 was able to demonstrate that std::lgamma performs differently from boost::lgamma with a few threads. That's what generated this issue. That should be one of the tests, if not the test.

@bob-carpenter
Contributor

bob-carpenter commented May 31, 2019 via email

@wds15
Contributor Author

wds15 commented May 31, 2019

A test for this will make everything brittle... I need to launch 4 threads and observe a 4x speedup; doubling was not enough. This implies I need 4 idle cores for the test to run fine. On top of that, this is the behavior on macOS; I don't know how things are under Linux or Windows. It could be different there.

8 participants