
Use Eigen nullaryExpr for rvalue indexing with multi_index #3046

Merged (10 commits) on Jun 17, 2021

Conversation


@t4c1 t4c1 commented Jun 9, 2021

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

By using Eigen's NullaryExpr for indexing with multi_index, these indexing functions can propagate expressions, potentially reducing the number of times memory needs to be accessed.

I have measured that there is no slowdown for simple indexing like a[b]. The speedup for a[b] + a[c] is around 15% for the prim signature and 10% for the rev signature.

Intended Effect

Speedup indexing with multi_index when used in expressions.

How to Verify

Run tests for rvalue. Benchmark indexing with multi_index.

Side Effects

None.

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Tadej Ciglarič

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@SteveBronder (Collaborator)

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

http://eigen.tuxfamily.org/index.php?title=3.4

@andrjohns (Contributor)

Do these need to be returned using Holder?

@rok-cesnovar (Member)

rok-cesnovar commented Jun 10, 2021

Should we wait to do this till Eigen 3.4 comes out?

What is the typical timeline for Eigen releases? Weeks or months after the RC? We will also have to wait for RcppEigen to upgrade, though I guess we can help with that.

If this brings a 10% speedup, I think it's worth doing now.

@t4c1 (Collaborator, Author)

t4c1 commented Jun 10, 2021

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

No. This is for multi indexing, which cannot use slices anyway. Once we do have Eigen 3.4 we can use those for indices, but I don't think it will affect performance.

Do these need to be returned using Holder?

Correctness-wise it is not strictly necessary. These use a lambda to capture the variable that would otherwise be local and would need a holder. Since the CwiseNullaryOp stores the lambda, the variable does not go out of scope. However, I tested it, and indexing turns out to be a bit faster with the holder. I believe that is because holder avoids copying the data when it is copied or moved.

@SteveBronder (Collaborator)

Should we wait to do this till Eigen 3.4 comes out? Because then I think we can use the slices and index views for this

No. This is for multi indexing, which cannot use slices anyway. Once we do have Eigen 3.4 we can use those for indices, but I don't think it will affect performance.

See the docs under "Array of indices", which is exactly what we want to do here. 3.4 should come out pretty soon (they are on the RC right now), and then we would just pass the std::vector for multi-indexing to operator(). Unless NullaryExpr does something more performant than the 3.4 operator(), I'd prefer we wait to do this until 3.4 comes out.

@t4c1 (Collaborator, Author)

t4c1 commented Jun 11, 2021

I see. I expect using "Array of indices" would perform exactly as fast as what I have in this PR using NullaryExpr. The only difference is that we can use NullaryExpr now. If you prefer the code style of 3.4 indexing (and I agree it is a bit nicer), we can still switch after we start using Eigen 3.4.

@stan-buildbot (Contributor)


Name  Old Result  New Result  Ratio  Performance change (1 - new/old)
gp_pois_regr/gp_pois_regr.stan 1.99 2.03 0.98 -1.79% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.03 2.68% faster
eight_schools/eight_schools.stan 0.07 0.07 1.01 1.47% faster
gp_regr/gp_regr.stan 0.1 0.1 1.0 0.47% faster
irt_2pl/irt_2pl.stan 3.12 3.18 0.98 -2.15% slower
performance.compilation 64.75 60.79 1.07 6.11% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 6.18 6.14 1.01 0.57% faster
pkpd/one_comp_mm_elim_abs.stan 18.35 18.15 1.01 1.11% faster
sir/sir.stan 78.88 78.7 1.0 0.22% faster
gp_regr/gen_gp_data.stan 0.02 0.02 1.04 3.41% faster
garch/garch.stan 0.33 0.31 1.08 6.99% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.25 0.25 0.99 -0.84% slower
arK/arK.stan 1.23 1.25 0.98 -1.69% slower
arma/arma.stan 0.43 0.43 1.0 -0.29% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.02 2.01 1.0 0.41% faster
Mean result: 1.01196931038

Jenkins Console Log
Blue Ocean
Commit hash: bec9800


Machine information: Mac OS X 10.14.6 (Build 18G8022)

CPU:
Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz

G++:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Clang:
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

@SteveBronder (Collaborator) left a comment

Couple review comments. I'm fine with this PR but imo I'd rather just wait and use Eigen's indexing stuff.

Comment on lines 269 to 276
[name, &idx](auto& x_ref) {
return plain_type_t<EigMat>::NullaryExpr(
idx.ns_.size(), x_ref.cols(),
[name, &idx, &x_ref](Eigen::Index i, Eigen::Index j) {
return x_ref.coeff(idx.ns_[i] - 1, j);
});
},
stan::math::to_ref(x));
Collaborator

Do you need name in either of these lambdas?

Collaborator Author

No. Removed.

return v_ref.coeff(idx.ns_[i] - 1);
});
},
stan::math::to_ref(v));
Collaborator

Why the to_ref here?

Collaborator Author

Indexing with a multi index is often used with an index much larger than the indexed vector, so each element of the vector may appear multiple times in the output. The to_ref makes sure that if the indexed vector is an expression, each element is calculated only once.

const int n = idx.ns_[i];
math::check_range("matrix[multi] row indexing", name, x_ref.rows(), n);
x_ret.row(i) = x_ref.row(n - 1);
math::check_range("matrix[multi] row indexing", name, x.rows(), idx.ns_[i]);
Collaborator

Should this check range happen in the lambda like the others?

Collaborator Author

No. We only need this check once per row of the result. In the lambda it would happen once for each element of the result.

Comment on lines +154 to +156
inline auto rvalue(EigVec&& v, const char* name, const index_multi& idx) {
return stan::math::make_holder(
[name, &idx](auto& v_ref) {
Collaborator

If idx is a temporary, could it fall out of scope before the expression is executed?

Collaborator Author

In general C++ code, yes. In the C++ generated by stanc, no. This is the same as for any function in Math. So as not to litter the whole math library with holders, we decided to ignore such cases.

Collaborator

I'm not sure I'm following, if stanc generated

auto blah = rvalue(rvalue(v, "blah1", index_multi(std::vector<int>{1, 2, 3})), "blah1", index_uni(3));

Wouldn't the index_multi fall out of scope here?

Collaborator Author

Nope. A temporary stays alive until the entire statement containing it finishes executing.

Collaborator

Let me play with this on godbolt for a minute and if I can't get it to break then I'm cool with it

Collaborator

Okay, so I'm cool with this, as long as we agree that we won't ever use auto in transformed data / transformed parameters in the compiler, because then we hit a no-no. But don't we want auto in the compiler for the OpenCL stuff? How much less performant is it to just give the multi index a perfect-forwarding template and forward the object along into the lambda?

So as not to litter the whole math library with holders, we decided to ignore such cases.

Do you have a link to where that decision was made? IIRC didn't we want to add those?

https://godbolt.org/z/jEYTde3G1

Collaborator Author

But don't we want auto in the compiler for the OpenCL stuff?

No need. We can directly use matrix_cl. As for expressions, we can handle them more or less exactly like Eigen ones.

How less performant is it to just make the multi index have a perfect forwarding template and forward the object along into the lambda?

That would not work if the same index is used multiple times.

Do you have a link to where that decision was made at? tmk didn't we want to add those?

stan-dev/math#1470 (comment)

Collaborator

Alright then I'm good with this

Two resolved (outdated) review threads on src/stan/model/indexing/rvalue.hpp.
@stan-buildbot (Contributor)


Name  Old Result  New Result  Ratio  Performance change (1 - new/old)
gp_pois_regr/gp_pois_regr.stan 2.03 1.97 1.03 2.78% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.01 1.43% faster
eight_schools/eight_schools.stan 0.07 0.06 1.02 1.88% faster
gp_regr/gp_regr.stan 0.11 0.11 0.99 -0.98% slower
irt_2pl/irt_2pl.stan 3.15 3.09 1.02 1.9% faster
performance.compilation 65.82 61.05 1.08 7.24% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 6.26 6.13 1.02 2.09% faster
pkpd/one_comp_mm_elim_abs.stan 18.49 18.11 1.02 2.06% faster
sir/sir.stan 79.45 79.1 1.0 0.44% faster
gp_regr/gen_gp_data.stan 0.02 0.02 1.02 1.95% faster
garch/garch.stan 0.32 0.32 1.01 1.4% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.26 0.25 1.02 1.68% faster
arK/arK.stan 1.22 1.24 0.99 -1.17% slower
arma/arma.stan 0.44 0.43 1.02 2.18% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.0 2.01 1.0 -0.5% slower
Mean result: 1.01691402747

Jenkins Console Log
Blue Ocean
Commit hash: fa9b7d3



@SteveBronder (Collaborator) left a comment

Good!

@SteveBronder (Collaborator) left a comment

Whoops, hang on one sec. I'm looking over the stanc code for closures right now, which uses auto, and just want to check that it's OK.

@SteveBronder (Collaborator) left a comment

False alarm: that auto is only used to construct the closure, so this wouldn't affect it.

@t4c1 t4c1 merged commit eb49531 into stan-dev:develop Jun 17, 2021
5 participants