It looks like someone already implemented a "blocked" version of the covariance functions, to compute them faster. oneTBB already has functions that do something similar, but thread the computation instead. Since I'm already familiar with these, I'm going to implement a threaded TBB version and, as a first pass, check it against the existing tests for numerical accuracy and speed.
You can see a blocked implementation of a covariance function here: https://github.com/stan-dev/math/blob/develop/stan/math/prim/fun/gp_exp_quad_cov.hpp
And TBB has built-in functions which I suppose do something similar: https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/elements/onetbb/source/algorithms/blocked_ranges/blocked_range_cls
I'll also try to put the threaded code behind C++ preprocessor directives, so that we can just add a command-line option, -D thread or something, and the user can choose whether they want to use global multithreading or not (if that's how it works downstream in cmdstan). This is just a prototype, and when/if I finally get something running I'm open to better design choices. But this should be safer because, if my understanding is correct, the threaded code won't even compile unless the directive is used.
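To make the idea concrete, here is a rough sketch of what I have in mind, not the actual Stan Math code. The macro name GP_COV_USE_TBB, the helper fill_row, and the function exp_quad_cov are all made up for illustration, and the inputs are plain 1-D vectors rather than Stan's types. The only TBB calls used are tbb::parallel_for and tbb::blocked_range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// GP_COV_USE_TBB is a hypothetical macro name (an assumption, not an
// existing Stan flag). The TBB headers are only pulled in when it is defined.
#ifdef GP_COV_USE_TBB
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#endif

// Hypothetical helper: fill row i of the row-major n x n covariance matrix
// with sigma^2 * exp(-(x_i - x_j)^2 / (2 * length_scale^2)).
inline void fill_row(const std::vector<double>& x, double sigma,
                     double length_scale, std::size_t i,
                     std::vector<double>& cov) {
  const std::size_t n = x.size();
  const double sigma_sq = sigma * sigma;
  const double inv_two_l_sq = 1.0 / (2.0 * length_scale * length_scale);
  for (std::size_t j = 0; j < n; ++j) {
    const double d = x[i] - x[j];
    cov[i * n + j] = sigma_sq * std::exp(-d * d * inv_two_l_sq);
  }
}

// Hypothetical function: squared-exponential covariance for 1-D inputs.
inline std::vector<double> exp_quad_cov(const std::vector<double>& x,
                                        double sigma, double length_scale) {
  assert(length_scale > 0.0);
  const std::size_t n = x.size();
  std::vector<double> cov(n * n);
#ifdef GP_COV_USE_TBB
  // Threaded path: TBB splits the row range into blocks and runs them in
  // parallel. Each row writes a disjoint slice of cov, so no locking is needed.
  tbb::parallel_for(
      tbb::blocked_range<std::size_t>(0, n),
      [&](const tbb::blocked_range<std::size_t>& rows) {
        for (std::size_t i = rows.begin(); i != rows.end(); ++i) {
          fill_row(x, sigma, length_scale, i, cov);
        }
      });
#else
  // Serial path: the only code the compiler sees when the macro is absent.
  for (std::size_t i = 0; i < n; ++i) {
    fill_row(x, sigma, length_scale, i, cov);
  }
#endif
  return cov;
}
```

Compiled without the macro, none of the TBB code is seen by the compiler; compiling with -DGP_COV_USE_TBB (and linking against TBB) would switch to the threaded path. Both paths call the same fill_row, so the existing numerical tests should apply to either build.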