[ignore] MPI Code / Design Review #2
base: develop
@@ -0,0 +1,40 @@
WARNING: Boost MPI is BROKEN in 1.64.0; I used 1.65.1, but 1.65.0
seems to be OK as well, see here:
https://svn.boost.org/trac10/ticket/12723

Build Boost Serialization and Boost MPI as shared libraries:

./b2 --prefix=$HOME/local --with-mpi --with-serialization -j4 install

In make/local you need to set

CC=mpicxx

and ensure that LDFLAGS finds the Boost installation of MPI and
Serialization.

... and of course, an OpenMPI implementation needs to be
available. The MacPorts OpenMPI works great.

The use of mpicxx is recommended by OpenMPI, but not required. That is,
for OpenMPI you can find out the extra compiler flags needed for
linking against OpenMPI with the command

mpicxx --showme

The mpicxx man page has more details on this.

The minimal tests are in

test/unit/math/rev/arr/functor

map_rect_mpi_test.cpp
map_rect_lpdf_mpi_test.cpp

To compile these, there is a makefile in their directory. After
compilation they can be started with

mpirun -np 4 ./map_rect_mpi_test

which runs the test on 4 CPUs.
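Before running the Stan tests it can help to check that mpicxx, the Boost
libraries and mpirun are wired up correctly. The following stand-alone
program is only a sanity-check sketch (it is not part of this pull request);
the install prefix $HOME/local matches the ./b2 command above.

// Minimal Boost.MPI check (not part of the PR). Compile with, e.g.:
//   mpicxx -I$HOME/local/include -L$HOME/local/lib mpi_hello.cpp \
//     -lboost_mpi -lboost_serialization -o mpi_hello
// Run with: mpirun -np 4 ./mpi_hello
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>

int main(int argc, char** argv) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;
  // every process reports its rank; with -np 4 you should see ranks 0..3
  std::cout << "rank " << world.rank() << " of " << world.size() << std::endl;
  return 0;
}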
@@ -0,0 +1,31 @@
#pragma once

#ifdef STAN_HAS_MPI
#include <stan/math/prim/arr/functor/map_rect_mpi.hpp>
#endif

#include <cstddef>
#include <vector>

namespace stan {
  namespace math {

    template <typename F>
    std::vector<double>
    map_rect(const std::vector<double>& eta,
             const std::vector<std::vector<double> >& theta,
             const std::vector<std::vector<double> >& x_r,
             const std::vector<std::vector<int> >& x_i,
             const std::size_t uid) {
#ifdef STAN_HAS_MPI
      // MPI build: hand the jobs off to the distributed implementation
      return map_rect_mpi<F>(eta, theta, x_r, x_i, uid);
#else
      // serial fallback: apply F to each parameter set and concatenate
      // the per-job outputs in order
      std::vector<double> res;
      const std::size_t N = theta.size();

      for (std::size_t i = 0; i != N; ++i) {
        const std::vector<double> f = F::apply(eta, theta[i], x_r[i], x_i[i]);
        res.insert(res.end(), f.begin(), f.end());
      }
      return res;
#endif
    }

  }
}
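For reference, here is a small self-contained sketch of how a user-side
functor plugs into map_rect. The functor and its toy computation are made up
for illustration; only the static apply() signature is taken from the header
above, which is assumed to be included (with STAN_HAS_MPI undefined, so the
serial path runs).

// hypothetical functor: only the static apply() signature matters
#include <iostream>
#include <vector>

struct toy_functor {
  static std::vector<double> apply(const std::vector<double>& eta,
                                   const std::vector<double>& theta,
                                   const std::vector<double>& x_r,
                                   const std::vector<int>& x_i) {
    // one output per job: shared parameter plus job-specific parameter
    return std::vector<double>(1, eta[0] + theta[0]);
  }
};

int main() {
  std::vector<double> eta(1, 1.0);
  std::vector<std::vector<double> > theta(3, std::vector<double>(1, 2.0));
  std::vector<std::vector<double> > x_r(3);
  std::vector<std::vector<int> > x_i(3);

  // uid tags the data block so workers could cache it across calls
  const std::vector<double> res
      = stan::math::map_rect<toy_functor>(eta, theta, x_r, x_i, 1);

  std::cout << "number of outputs: " << res.size() << std::endl;  // prints 3
  return 0;
}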
@@ -0,0 +1,31 @@
#pragma once

#include <cstddef>
#include <set>
#include <vector>

#include <stan/math/prim/mat/fun/to_array_1d.hpp>
#include <stan/math/prim/arr/functor/mpi_command.hpp>

namespace stan {
  namespace math {

    template <typename F>
    double
    map_rect_lpdf_mpi(const std::vector<double>& eta,
                      const std::vector<std::vector<double> >& theta,
                      const std::vector<std::vector<double> >& x_r,
                      const std::vector<std::vector<int> >& x_i,
                      const std::size_t uid) {
      // serial evaluation: accumulate the per-job log densities
      double res = 0;
      const std::size_t N = theta.size();

      for (std::size_t i = 0; i != N; ++i) {
        res += F::apply(eta, theta[i], x_r[i], x_i[i]);
      }

      return res;
    }

  }
}

Review comment (on the map_rect_lpdf_mpi signature): Why don't you want the primitive version to parallelize?

Reply: Oh, we do. I just did not see an elegant way to reuse the MPI code in a clever way. As I only cared about getting my model to fit faster, I did not bother with this. So we either copy the rev code and simplify it, or we find a clever way to generalize the rev code to handle both cases in a smart way.

Reply: Gotcha. Maybe we can figure something out... I'll think more on it.
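One possible way to address the reuse question from the thread above, sketched
purely as an illustration (neither lpdf_as_vector nor map_rect_lpdf_via_map_rect
exists in the PR): wrap a scalar-returning lpdf functor so each job yields a
length-1 vector, push it through map_rect, and sum the gathered outputs.
Whether this interacts well with the MPI overloads is exactly the open question.

// sketch only: lets a scalar-returning lpdf functor ride on map_rect
#include <cstddef>
#include <numeric>
#include <vector>

template <typename F>
struct lpdf_as_vector {
  static std::vector<double> apply(const std::vector<double>& eta,
                                   const std::vector<double>& theta,
                                   const std::vector<double>& x_r,
                                   const std::vector<int>& x_i) {
    // each job contributes a single log-density value
    return std::vector<double>(1, F::apply(eta, theta, x_r, x_i));
  }
};

template <typename F>
double map_rect_lpdf_via_map_rect(const std::vector<double>& eta,
                                  const std::vector<std::vector<double> >& theta,
                                  const std::vector<std::vector<double> >& x_r,
                                  const std::vector<std::vector<int> >& x_i,
                                  const std::size_t uid) {
  // the heavy per-job work happens inside map_rect; only the sum is local
  const std::vector<double> per_job
      = stan::math::map_rect<lpdf_as_vector<F> >(eta, theta, x_r, x_i, uid);
  return std::accumulate(per_job.begin(), per_job.end(), 0.0);
}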
@@ -0,0 +1,147 @@
#pragma once

#include <map>
#include <set>
#include <vector>

#include <stan/math/prim/mat/fun/to_array_1d.hpp>
#include <stan/math/prim/arr/functor/mpi_command.hpp>
#include <stan/math/prim/arr/functor/mpi_cluster.hpp>

namespace stan {
  namespace math {

    namespace internal {
      // forward declare
      void
      distribute_map_rect_data(const std::vector<std::vector<double> >& x_r,
                               const std::vector<std::vector<int> >& x_i,
                               const std::size_t uid);

      // data distribution needs to go to prim
Review comment: What does this mean?

Reply: I think the data distribution should be inside prim. That mechanism should be independent of rev/fwd/whatever. It's prim, not more.

Reply: Ah, yes. I read it as a yet-to-be-accomplished directive, oops.
      struct distributed_map_rect_data {
        distributed_map_rect_data(std::size_t uid) : uid_(uid), N_(-1) {}
        std::size_t uid_;
        std::size_t N_;
        std::vector<std::vector<double> > x_r_;
        std::vector<std::vector<int> > x_i_;
        static void distributed_apply() {
          // called on workers to register data
          distribute_map_rect_data(std::vector<std::vector<double> >(),
                                   std::vector<std::vector<int> >(), 0);
        }
      };

      // per-process cache of distributed data, keyed by uid
      typedef boost::serialization::singleton<
          std::map<std::size_t, const distributed_map_rect_data> >
          distributed_data;
Review comment: Why is uid a size_t?

Reply: size_t made sense to me, but it can be anything we agree on. I think using integers would be sensible. I think Bob wanted to have longs here. Whatever works.
      void
      distribute_map_rect_data(const std::vector<std::vector<double> >& x_r,
                               const std::vector<std::vector<int> >& x_i,
                               const std::size_t uid) {
        boost::mpi::communicator world;

        const std::size_t W = world.size();
        const std::size_t R = world.rank();

        // first check if uid is already registered
        if (distributed_data::get_const_instance().find(uid)
            == distributed_data::get_const_instance().cend()) {
          //std::cout << "Job " << R << " registers data..." << std::endl;
Review comment: Is this stubbed out for now or old debug code?

Reply: Yep, this is old debug code. Not meant to stay.
        } else {
          //std::cout << "UID = " << uid << " is ALREADY distributed." << std::endl;
          return;
        }

        // meta info: uid, number of jobs, reals per job, ints per job
        std::vector<int> meta(4, 0);

        if (R == 0) {
          // on the root, broadcast the command which makes the workers
          // call this function as well
          mpi_broadcast_command<stan::math::mpi_distributed_apply<distributed_map_rect_data> >();

          meta[0] = uid;
          meta[1] = x_r.size();
          meta[2] = x_r[0].size();
          meta[3] = x_i[0].size();
        }

        boost::mpi::broadcast(world, meta.data(), 4, 0);

        distributed_map_rect_data data(meta[0]);

        const std::size_t N = meta[1];
        const std::size_t X_r = meta[2];
        const std::size_t X_i = meta[3];

        data.N_ = N;

        //std::cout << "worker " << R << " / " << W << " registers shapes " << N << ", " << X_r << ", " << X_i << std::endl;

        const std::vector<int> chunks = mpi_map_chunks(N, 1);

        data.x_r_.resize(chunks[R]);
        data.x_i_.resize(chunks[R]);

        // flatten data and send out/receive using scatterv
        if (X_r > 0) {
          const std::vector<double> world_x_r = to_array_1d(x_r);
          const std::vector<int> chunks_x_r = mpi_map_chunks(N, X_r);
          std::vector<double> flat_x_r_local(chunks_x_r[R]);

          boost::mpi::scatterv(world, world_x_r.data(), chunks_x_r, flat_x_r_local.data(), 0);

          // now register data
          for (std::size_t i = 0; i != chunks[R]; ++i)
            data.x_r_[i] = std::vector<double>(flat_x_r_local.begin() + i * X_r,
                                               flat_x_r_local.begin() + (i + 1) * X_r);

          //std::cout << "Job " << R << " got " << flat_x_r_local.size() << " real data " << std::endl;
        }
        if (X_i > 0) {
          const std::vector<int> world_x_i = to_array_1d(x_i);
          const std::vector<int> chunks_x_i = mpi_map_chunks(N, X_i);
          std::vector<int> flat_x_i_local(chunks_x_i[R]);

          boost::mpi::scatterv(world, world_x_i.data(), chunks_x_i, flat_x_i_local.data(), 0);

          // now register data
          for (std::size_t i = 0; i != chunks[R]; ++i)
            data.x_i_[i] = std::vector<int>(flat_x_i_local.begin() + i * X_i,
                                            flat_x_i_local.begin() + (i + 1) * X_i);

          //std::cout << "Job " << R << " got " << flat_x_i_local.size() << " int data " << std::endl;
        }

        distributed_data::get_mutable_instance().insert(std::make_pair(uid, data));

        //std::cout << "Job " << R << " done caching data for uid = " << meta[0] << "." << std::endl;

        return;
      }
    }
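distribute_map_rect_data sizes its scatterv calls with mpi_map_chunks, whose
definition lives in mpi_cluster.hpp and is not part of this diff. The helper
below is only a guess at its behavior, inferred from the call sites above and
named differently to make that clear: split the N jobs as evenly as possible
over the processes, then scale each share by the per-job width.

// hypothetical stand-in for mpi_map_chunks, inferred from its call sites;
// the real implementation in mpi_cluster.hpp may differ
#include <boost/mpi/communicator.hpp>
#include <cstddef>
#include <vector>

std::vector<int> guess_mpi_map_chunks(std::size_t num_jobs,
                                      std::size_t chunk_size) {
  boost::mpi::communicator world;
  const std::size_t W = world.size();

  // hand out num_jobs jobs as evenly as possible over W processes
  std::vector<int> chunks(W, num_jobs / W);
  for (std::size_t r = 0; r != num_jobs % W; ++r)
    ++chunks[r];

  // scale by the per-job width so the counts can feed scatterv directly
  for (std::size_t r = 0; r != W; ++r)
    chunks[r] *= chunk_size;

  return chunks;
}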
Review comment (on the map_rect_mpi signature): This primitive version also shouldn't parallelize?

Reply: It should... but I would be fine with making it non-parallel for the very first pull. The thing is, I wanted this stuff to work in an actual Stan program; until I figure out the right MPI code I am very reluctant to copy & paste the code which is right now in rev. Once we think the rev code is good to copy (and simplify), we should do that. Even better would be to figure out code which handles all cases at the same time.
    template <typename F>
    std::vector<double>
    map_rect_mpi(const std::vector<double>& eta,
                 const std::vector<std::vector<double> >& theta,
                 const std::vector<std::vector<double> >& x_r,
                 const std::vector<std::vector<int> >& x_i,
                 const std::size_t uid) {
      // serial evaluation for now: apply F to each job and concatenate
      // the per-job outputs (see the review discussion above)
      std::vector<double> res;
      const std::size_t N = theta.size();

      for (std::size_t i = 0; i != N; ++i) {
        const std::vector<double> f = F::apply(eta, theta[i], x_r[i], x_i[i]);
        res.insert(res.end(), f.begin(), f.end());

        //for (std::size_t j = 0; j != f.size(); j++)
        //  std::cout << "res[" << i << ", " << j << "] = " << f[j] << std::endl;
      }

      return res;
    }

  }
}

BOOST_CLASS_EXPORT(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>);
BOOST_CLASS_TRACKING(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>, track_never);
BOOST_SERIALIZATION_FACTORY_0(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>)
Review comment: Can you not unify this with the one in rev by using a template instead of double here (and instead of var in the rev version)? They seem to be identical otherwise.

Reply: Hmm... I think you are right in that we can template this and have a single version. Not sure why I did it this way (I added the map_rect quite late). However, I hope this does not interact badly with the function overloading which we do for the MPI version.
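A sketch of the templating idea from this thread, written only as an
illustration (the name map_rect_serial and the single scalar template
parameter T are assumptions, not code from the PR): one serial loop that
could be instantiated with double in prim and with the reverse-mode autodiff
type in rev, leaving the MPI specializations as overloads on top.

// illustration of the proposed unification; T would be double in prim
// and the autodiff scalar type in rev
#include <cstddef>
#include <vector>

template <typename F, typename T>
std::vector<T> map_rect_serial(const std::vector<T>& eta,
                               const std::vector<std::vector<T> >& theta,
                               const std::vector<std::vector<double> >& x_r,
                               const std::vector<std::vector<int> >& x_i) {
  std::vector<T> res;
  const std::size_t N = theta.size();

  for (std::size_t i = 0; i != N; ++i) {
    // F::apply has to be written against the scalar type T
    const std::vector<T> f = F::apply(eta, theta[i], x_r[i], x_i[i]);
    res.insert(res.end(), f.begin(), f.end());
  }

  return res;
}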