
Use Eigen nullaryExpr for rvalue indexing with multi_index #3046

Merged (10 commits) on Jun 17, 2021

Conversation


@t4c1 t4c1 commented Jun 9, 2021

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

By using Eigen's NullaryExpr for indexing with multi_index, these indexing functions can propagate expressions, potentially reducing the number of times memory needs to be accessed.

I have measured that there is no slowdown for simple indexing like a[b]. The speedup for a[b] + a[c] is around 15% for the prim signature and 10% for the rev signature.

Intended Effect

Speedup indexing with multi_index when used in expressions.

How to Verify

Run tests for rvalue. Benchmark indexing with multi_index.

Side Effects

None.

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Tadej Ciglarič

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@SteveBronder (Collaborator)

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

http://eigen.tuxfamily.org/index.php?title=3.4

@andrjohns (Contributor)

Do these need to be returned using Holder?

@rok-cesnovar (Member)

rok-cesnovar commented Jun 10, 2021

Should we wait to do this till Eigen 3.4 comes out?

What is the typical timeline for Eigen releases? Weeks or months after the RC? We will also have to wait for RcppEigen to upgrade, though I guess we can help with that.

If this brings a 10% speedup, I think it's worth doing now.

@t4c1 (Collaborator, Author)

t4c1 commented Jun 10, 2021

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

No. This is for multi indexing, which cannot use slices anyway. Once we do have Eigen 3.4 we can use those for indices, but I don't think it will affect performance.

Do these need to be returned using Holder?

Correctness-wise it is not strictly necessary. These use a lambda to capture the variable that would otherwise be local and would need a holder. Since the CwiseNullaryOp stores the lambda, the variable does not go out of scope. However, I tested it, and indexing turns out to be a bit faster with the holder. I believe that is because holder avoids copying the data when it is copied or moved.

@SteveBronder (Collaborator)

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

No. This is for multi indexing, which cannot use slices anyway. Once we do have Eigen 3.4 we can use those for indices, but I don't think it will affect performance.

See the docs under "Array of indices", which is exactly what we want to do here. 3.4 should come out pretty soon (they are on the RC right now), and then we would just pass the std::vector for multi-indexing to operator(). Unless NullaryExpr does something more performant than the 3.4 operator(), I'd prefer we wait to do this until 3.4 comes out.

@t4c1 (Collaborator, Author)

t4c1 commented Jun 11, 2021

I see. I expect using "Array of indices" would perform exactly as fast as what I have in this PR using NullaryExpr. The only difference is that we can use NullaryExpr now. If you prefer the code style of 3.4 indexing (and I agree it is a bit nicer), we can still switch after we start using Eigen 3.4.

@stan-buildbot (Contributor)


Name  Old Result  New Result  Ratio  Performance change (1 - new/old)
gp_pois_regr/gp_pois_regr.stan 1.99 2.03 0.98 -1.79% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.03 2.68% faster
eight_schools/eight_schools.stan 0.07 0.07 1.01 1.47% faster
gp_regr/gp_regr.stan 0.1 0.1 1.0 0.47% faster
irt_2pl/irt_2pl.stan 3.12 3.18 0.98 -2.15% slower
performance.compilation 64.75 60.79 1.07 6.11% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 6.18 6.14 1.01 0.57% faster
pkpd/one_comp_mm_elim_abs.stan 18.35 18.15 1.01 1.11% faster
sir/sir.stan 78.88 78.7 1.0 0.22% faster
gp_regr/gen_gp_data.stan 0.02 0.02 1.04 3.41% faster
garch/garch.stan 0.33 0.31 1.08 6.99% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.25 0.25 0.99 -0.84% slower
arK/arK.stan 1.23 1.25 0.98 -1.69% slower
arma/arma.stan 0.43 0.43 1.0 -0.29% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.02 2.01 1.0 0.41% faster
Mean result: 1.01196931038

Jenkins Console Log
Blue Ocean
Commit hash: bec9800


Machine information: Mac OS X 10.14.6 (Build 18G8022)

CPU:
Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz

G++:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Clang:
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

@SteveBronder (Collaborator) left a comment

Couple review comments. I'm fine with this PR but imo I'd rather just wait and use Eigen's indexing stuff.

Comment on lines 269 to 276
[name, &idx](auto& x_ref) {
return plain_type_t<EigMat>::NullaryExpr(
idx.ns_.size(), x_ref.cols(),
[name, &idx, &x_ref](Eigen::Index i, Eigen::Index j) {
return x_ref.coeff(idx.ns_[i] - 1, j);
});
},
stan::math::to_ref(x));
Collaborator

Do you need name in either of these lambdas?

Collaborator Author

No. Removed.

return v_ref.coeff(idx.ns_[i] - 1);
});
},
stan::math::to_ref(v));
Collaborator

Why the to_ref here?

Collaborator Author

Indexing with a multi index is often used with an index much larger than the indexed vector, so each element of the vector may appear multiple times in the output. The to_ref makes sure that if the indexed vector is an expression, each element is calculated only once.

const int n = idx.ns_[i];
math::check_range("matrix[multi] row indexing", name, x_ref.rows(), n);
x_ret.row(i) = x_ref.row(n - 1);
math::check_range("matrix[multi] row indexing", name, x.rows(), idx.ns_[i]);
Collaborator

Should this check range happen in the lambda like the others?

Collaborator Author

No. We only need this check once per row of the result. In the lambda it would happen once for each element of the result.

Comment on lines +154 to +156
inline auto rvalue(EigVec&& v, const char* name, const index_multi& idx) {
return stan::math::make_holder(
[name, &idx](auto& v_ref) {
Collaborator

If idx is a temporary, could it fall out of scope before the expression is executed?

Collaborator Author

In general C++ code, yes. In the C++ generated by stanc, no. This is the same as for any function in Math. So as not to litter the whole math library with holders, we decided to ignore such cases.

Collaborator

I'm not sure I'm following, if stanc generated

auto blah = rvalue(rvalue(v, "blah1", index_multi(std::vector<int>{1, 2, 3})), "blah1", index_uni(3));

Wouldn't the index_multi fall out of scope here?

Collaborator Author

Nope. A temporary stays alive until the entire statement containing it finishes executing.

Collaborator

Let me play with this on godbolt for a minute and if I can't get it to break then I'm cool with it

Collaborator

Okay, so I'm cool with this, as long as we agree that we won't ever use auto in transformed data / transformed parameters in the compiler, because then we hit a no-no. But don't we want auto in the compiler for the OpenCL stuff? How much less performant is it to just give the multi index a perfect-forwarding template and forward the object along into the lambda?

So as not to litter the whole math library with holders, we decided to ignore such cases.

Do you have a link to where that decision was made? IIRC didn't we want to add those?

https://godbolt.org/z/jEYTde3G1

Collaborator Author

But don't we want auto in the compiler for the OpenCL stuff?

No need. We can directly use matrix_cl. As for expressions, we can handle them more or less exactly like Eigen ones.

How less performant is it to just make the multi index have a perfect forwarding template and forward the object along into the lambda?

That would not work if the same index is used multiple times.

Do you have a link to where that decision was made at? tmk didn't we want to add those?

stan-dev/math#1470 (comment)

Collaborator

Alright then I'm good with this

Two resolved (outdated) review threads on src/stan/model/indexing/rvalue.hpp.
@stan-buildbot (Contributor)


Name  Old Result  New Result  Ratio  Performance change (1 - new/old)
gp_pois_regr/gp_pois_regr.stan 2.03 1.97 1.03 2.78% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.01 1.43% faster
eight_schools/eight_schools.stan 0.07 0.06 1.02 1.88% faster
gp_regr/gp_regr.stan 0.11 0.11 0.99 -0.98% slower
irt_2pl/irt_2pl.stan 3.15 3.09 1.02 1.9% faster
performance.compilation 65.82 61.05 1.08 7.24% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 6.26 6.13 1.02 2.09% faster
pkpd/one_comp_mm_elim_abs.stan 18.49 18.11 1.02 2.06% faster
sir/sir.stan 79.45 79.1 1.0 0.44% faster
gp_regr/gen_gp_data.stan 0.02 0.02 1.02 1.95% faster
garch/garch.stan 0.32 0.32 1.01 1.4% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.26 0.25 1.02 1.68% faster
arK/arK.stan 1.22 1.24 0.99 -1.17% slower
arma/arma.stan 0.44 0.43 1.02 2.18% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.0 2.01 1.0 -0.5% slower
Mean result: 1.01691402747

Jenkins Console Log
Blue Ocean
Commit hash: fa9b7d3



@SteveBronder (Collaborator) left a comment

Good!

@SteveBronder (Collaborator) left a comment

Whoops, hang on one sec. I'm looking over the stanc code for closures right now, which uses auto, and just want to check that it's OK.

@SteveBronder (Collaborator) left a comment

False alarm: that auto is only used to construct the closure, so this wouldn't affect it.

@t4c1 t4c1 merged commit eb49531 into stan-dev:develop Jun 17, 2021
5 participants