Checklist
Please ensure the following tasks are completed before filing an issue.
- Read and understood the Code of Conduct.
- Searched for existing issues and pull requests.
- If this is a general question, searched the FAQ for an existing answer.
- If this is a feature request, the issue name begins with `RFC:`.
Description
Complex trig functions, to name one example, require computation of both sine and cosine. A first cut for their implementations would just evaluate them independently, but it would be nice to take advantage of the ability to compute sine and cosine simultaneously to speed things up, if possible.
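For concreteness, here's a minimal sketch (plain JavaScript; the `csin` name and calling convention are just for illustration) of why the two evaluations share an argument:

```js
// Naive complex sine: csin( x + iy ) = sin(x)*cosh(y) + i*cos(x)*sinh(y).
// Takes the real and imaginary parts as separate doubles for simplicity;
// returns [ re, im ].
function csin( x, y ) {
	// Two independent evaluations at the same argument; a `sincos( x )`
	// kernel could supply both in one pass:
	var s = Math.sin( x );
	var c = Math.cos( x );
	return [ s * Math.cosh( y ), c * Math.sinh( y ) ];
}

console.log( csin( 1.0, 1.0 ) );
// => [ ~1.2985, ~0.6350 ]
```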
Notes:
- Trying to figure out at which level they're evaluated together. If at the Horner's method level, then it'd be sending alternating powers of `x` to one of two summations (see the sketch after this list). I'm not convinced this would be much of an improvement since it just interleaves work and doesn't seem like it'd reduce multiplications. This suggests to me maybe scaling and reduction are really the only simultaneous part, unless there are fancy tricks. But that still might be worthwhile.
- `sincos` seems to be a thing that exists, but I haven't found a good source that explains how to do it.
- It seems like boost just does the scaling and the reduction simultaneously, though I wonder if somewhere it's farming this out to the `fsincos` instruction. I'm having a pretty difficult time navigating the boost source TBH.
- This presentation says sincos is as fast as sine alone (tested in single precision, in C code, I think?).
- Stack Overflow: "What is the fastest way to compute sin and cos together?" seems mostly focused on approximations and how to get the GNU stdlib to optimize via the `sincos` instruction.
- CORDIC doesn't seem appropriate. That's for when multiplication isn't available.
- Intel has an interesting approach with a bunch of methods like `ippsSinCos_64f_A26` that guarantee, say, 26 correct bits so that you can taylor it (that's a pun) to your needs. No implementation hints though, obv. Interesting, but this is getting off topic.
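To make the Horner's-method point above concrete, here's a rough sketch using truncated Taylor coefficients (illustrative only; a real kernel would use minimax polynomials over the reduced range). Both series are polynomials in x², so the squaring is shared, but each retains its own multiply-add chain:

```js
// Illustrative truncated Taylor coefficients (NOT production polynomials):
var S = [ 1.0, -1.0/6.0, 1.0/120.0, -1.0/5040.0 ]; // sin(x)/x in powers of x^2
var C = [ 1.0, -1.0/2.0, 1.0/24.0, -1.0/720.0 ];   // cos(x) in powers of x^2

function sincosKernel( x ) {
	var z = x * x; // shared work: squared once for both polynomials
	var ps = S[ 3 ];
	var pc = C[ 3 ];
	var i;
	for ( i = 2; i >= 0; i-- ) {
		// Two independent multiply-add chains; interleaving them overlaps
		// latency but does not reduce the multiplication count:
		ps = ( ps * z ) + S[ i ];
		pc = ( pc * z ) + C[ i ];
	}
	return [ x * ps, pc ]; // [ sin(x), cos(x) ] for small |x|
}

console.log( sincosKernel( 0.1 ) );
// => [ ~0.0998, ~0.9950 ]
```

Interleaving the two chains in one loop might help instruction-level parallelism, but the operation count is unchanged, which matches the skepticism above.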
Anyway, just a thought. Conclusion: maybe it's worthwhile to scale and reduce together and just use the existing sine/cosine kernels (sketched below). This would mean passing two numbers back as a result. I was surprised by the complex inverse benchmarks to see that writing into an existing output array actually seemed a bit slower than returning a new one. So if this approach is pursued, it would definitely need to be benchmarked to confirm it's actually an improvement.
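A sketch of the "reduce once, reuse kernels" idea (assumptions: the naive round-based reduction and the `Math.sin`/`Math.cos` placeholders stand in for a real high-accuracy reduction, e.g. Payne-Hanek, and the existing polynomial kernels):

```js
var HALF_PI = Math.PI / 2.0;

function kernelSin( r ) { return Math.sin( r ); } // placeholder kernel
function kernelCos( r ) { return Math.cos( r ); } // placeholder kernel

function sincos( x ) {
	var n = Math.round( x / HALF_PI ); // shared: reduction done once, not twice
	var r = x - ( n * HALF_PI );       // reduced argument, |r| <= pi/4
	var s = kernelSin( r );
	var c = kernelCos( r );
	// Quadrant bookkeeping maps the kernel results back to sin(x) and cos(x):
	switch ( ( ( n % 4 ) + 4 ) % 4 ) {
	case 0: return [ s, c ];
	case 1: return [ c, -s ];
	case 2: return [ -s, -c ];
	default: return [ -c, s ];
	}
}

console.log( sincos( 2.0 ) );
// => [ ~0.9093, ~-0.4161 ]
```

Returning a fresh two-element array is the baseline suggested by the benchmark observation above; an output-array signature would be the variant to benchmark against it.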
If this isn't worthwhile, then evaluating them independently seems fine.