[ignore] MPI Code / Design Review #2

Open: wants to merge 27 commits into base: develop

Changes from all commits (27 commits):
963f867  mpi draft design, first rect version  (Aug 9, 2017)
fde02c7  add Stan user functor example  (Aug 9, 2017)
89208b8  code up sending of problem data  (Aug 12, 2017)
49f4c38  Fully running prototype  (Aug 12, 2017)
19257a6  refactored to remove much code duplication & ensure that constructor/…  (Aug 13, 2017)
d0e989b  added data distributing facility  (Aug 14, 2017)
4e78c9b  implement caching approach  (Aug 15, 2017)
12f4aff  simplify command communication, make locally cached data correctly sh…  (Aug 20, 2017)
607eb81  renamed mpi worker to cluster  (Aug 20, 2017)
c537899  turned version tracking of boost serialization to speed things up  (Aug 20, 2017)
95d998d  force user to define uid; distribute data on first occasion only; add…  (Aug 21, 2017)
9025924  added build parts of boost library to lib and integrated (roughly) in…  (Aug 21, 2017)
ca8716f  turn off outputs and add build notes  (weberse2, Aug 23, 2017)
3c0cdb8  Merge branch 'feature/concept-mpi' of https://github.com/stan-dev/mat…  (Sep 2, 2017)
d62fcf6  added shared parameter vector eta as function argument  (Sep 2, 2017)
9371ba4  update doc  (Sep 2, 2017)
4e1f17b  remove unneded X_r and X_i member  (Sep 3, 2017)
f9e1e26  rename inner loop counter  (Sep 3, 2017)
6c88c35  add accumulating lpdf version of map rect function  (Sep 3, 2017)
f184221  Merge branch 'feature/concept-mpi' of https://github.com/stan-dev/mat…  (weberse2, Sep 3, 2017)
dc069a0  various fixes  (weberse2, Sep 3, 2017)
41dcc49  Merge branch 'feature/concept-mpi' of https://github.com/stan-dev/mat…  (Sep 4, 2017)
f922dad  rename mpi subsystem a bit  (Sep 5, 2017)
269fe98  Merge remote-tracking branch 'origin/develop' into feature/concept-mpi  (Sep 9, 2017)
8ab8c90  integrated with stan-math properly; also providing non-MPI calls  (Sep 9, 2017)
1a51cf3  avoid use of mpi reduce  (weberse2, Sep 10, 2017)
62862de  Revert lib changes for easy code review; temp  (seantalts, Sep 11, 2017)
40 changes: 40 additions & 0 deletions MPI_NOTES
@@ -0,0 +1,40 @@

WARNING: Boost MPI is BROKEN in 1.64.0; I used 1.65.1, but 1.65.0
seems to be OK as well, see here:
https://svn.boost.org/trac10/ticket/12723

Build boost serialization and boost mpi as shared libraries

./b2 --prefix=$HOME/local --with-mpi --with-serialization -j4 install

In make/local you need to set

CC=mpicxx

and ensure that LDFLAGS points at the Boost MPI and serialization installation.
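For concreteness, a make/local sketch; the library path is an assumption matching the --prefix=$HOME/local install from the b2 command above:

```make
# make/local (sketch; install prefix assumed to match b2 --prefix=$HOME/local)
CC = mpicxx
LDFLAGS += -L$(HOME)/local/lib -lboost_mpi -lboost_serialization
```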

... and of course, an OpenMPI installation needs to be available. The MacPorts OpenMPI works well.

The use of mpicxx is recommended by OpenMPI, but not required. You can find out the extra compiler and linker flags needed for linking against OpenMPI with the command

mpicxx --showme

The mpicxx man page has more details on this.

The minimal tests are in

test/unit/math/rev/arr/functor

map_rect_mpi_test.cpp
map_rect_lpdf_mpi_test.cpp

To compile these, there is a makefile in their directory. After
compilation they can be started with

mpirun -np 4 ./map_rect_mpi_test

which runs the test with 4 processes.

30 changes: 29 additions & 1 deletion make/libraries
@@ -7,6 +7,11 @@ GTEST ?= $(MATH)lib/gtest_1.7.0
CPPLINT ?= $(MATH)lib/cpplint_4.45
CVODES ?= $(MATH)lib/cvodes_2.9.0

BOOST_LIB ?= $(BOOST)/libs

## TODO: autodetect somehow?
MPI_INCLUDE ?= /opt/local/include/openmpi-mp/

##
# Build rules for cvodes
##
@@ -18,6 +23,8 @@ SUNDIALS_CVODES := $(patsubst %.c,%.o,\
SUNDIALS_NVECSERIAL := $(patsubst %.c,%.o,\
$(addprefix $(CVODES)/src/, nvec_ser/nvector_serial.c sundials/sundials_math.c))

BOOST_SERIALIZATION := $(patsubst %.cpp,%.o,$(wildcard $(BOOST_LIB)/serialization/src/*.cpp))
BOOST_MPI := $(patsubst %.cpp,%.o,$(wildcard $(BOOST_LIB)/mpi/src/*.cpp))

$(sort $(SUNDIALS_CVODES) $(SUNDIALS_NVECSERIAL)) : %.o : %.c
@mkdir -p $(dir $@)
@@ -31,12 +38,33 @@ $(CVODES)/lib/libsundials_nvecserial.a: $(SUNDIALS_NVECSERIAL)
@mkdir -p $(dir $@)
$(AR) -rs $@ $^

$(BOOST_SERIALIZATION) : %.o : %.cpp
@mkdir -p $(dir $@)
$(COMPILE.c) -O$(O) -isystem $(BOOST) $< -o $@

$(BOOST)/lib/libboost_serialization.a: $(BOOST_SERIALIZATION)
@mkdir -p $(dir $@)
$(AR) -rs $@ $^

$(BOOST_MPI) : %.o : %.cpp
@mkdir -p $(dir $@)
$(COMPILE.c) -O$(O) -isystem $(BOOST) -isystem $(MPI_INCLUDE) -DBOOST_MPI_SOURCE=1 $< -o $@

$(BOOST)/lib/libboost_mpi.a: $(BOOST_MPI)
@mkdir -p $(dir $@)
$(AR) -rs $@ $^

LIBMPI = $(BOOST)/lib/libboost_serialization.a $(BOOST)/lib/libboost_mpi.a

LIBCVODES = $(CVODES)/lib/libsundials_nvecserial.a $(CVODES)/lib/libsundials_cvodes.a


print-% : ; @echo $* = $($*)

STAN_CVODES_HEADERS := $(shell find stan -name *cvodes*.hpp)
$(STAN_CVODES_HEADERS) : $(LIBCVODES)


.PHONY: clean-libraries
clean-libraries:
$(RM) $(sort $(SUNDIALS_CVODES) $(SUNDIALS_NVECSERIAL) $(LIBCVODES))
$(RM) $(sort $(SUNDIALS_CVODES) $(SUNDIALS_NVECSERIAL) $(BOOST_SERIALIZATION) $(BOOST_MPI) $(LIBCVODES) $(LIBMPI) )
1 change: 1 addition & 0 deletions stan/math/prim/arr.hpp
@@ -36,6 +36,7 @@
#include <stan/math/prim/arr/functor/coupled_ode_observer.hpp>
#include <stan/math/prim/arr/functor/coupled_ode_system.hpp>
#include <stan/math/prim/arr/functor/integrate_ode_rk45.hpp>
#include <stan/math/prim/arr/functor/map_rect.hpp>

#include <stan/math/prim/scal.hpp>

31 changes: 31 additions & 0 deletions stan/math/prim/arr/functor/map_rect.hpp
@@ -0,0 +1,31 @@
#pragma once

#ifdef STAN_HAS_MPI
#include <stan/math/prim/arr/functor/map_rect_mpi.hpp>
#endif

namespace stan {
namespace math {

template <typename F>
std::vector<double>
Review comment (owner author):

Can you not unify this with the one in rev by using a template instead of double here (and instead of var in the rev version)? They seem to be identical otherwise.

Reply:

hmm... I think you are right in that we can template this and have a single version. Not sure why I did it this way (I added the map_rect quite late).

However, I hope this does not interact badly with the function overloading which we do for the mpi version.

map_rect(const std::vector<double>& eta,
const std::vector<std::vector<double> >& theta,
const std::vector<std::vector<double> >& x_r,
const std::vector<std::vector<int> >& x_i,
const std::size_t uid) {
#ifdef STAN_HAS_MPI
return(map_rect_mpi<F>(eta, theta, x_r, x_i, uid));
#else
std::vector<double> res;
const std::size_t N = theta.size();

for(std::size_t i = 0; i != N; ++i) {
const std::vector<double> f = F::apply(eta, theta[i], x_r[i], x_i[i]);
res.insert(res.end(), f.begin(), f.end());
}
return(res);
#endif
}
}
}
31 changes: 31 additions & 0 deletions stan/math/prim/arr/functor/map_rect_lpdf_mpi.hpp
@@ -0,0 +1,31 @@
#pragma once

#include <vector>
#include <set>

#include <stan/math/prim/mat/fun/to_array_1d.hpp>
#include <stan/math/prim/arr/functor/mpi_command.hpp>

namespace stan {
namespace math {

template <typename F>
double
map_rect_lpdf_mpi(const std::vector<double>& eta,
Review comment (@seantalts, Sep 11, 2017):

Why don't you want the primitive version to parallelize?

Reply:

oh, we do. I just did not see an elegant way to reuse the mpi code in a clever way. As I only cared about getting my model faster to fit, I did not bother with this. So we either copy the rev code and simplify it or we find a clever way to generalize the rev to handle both cases in a smart way.

Reply (owner author):

Gotcha. Maybe we can figure something out... I'll think more on it.

const std::vector<std::vector<double> >& theta,
const std::vector<std::vector<double> >& x_r,
const std::vector<std::vector<int> >& x_i,
const std::size_t uid) {

double res = 0;
const std::size_t N = theta.size();

for(std::size_t i = 0; i != N; ++i) {
res += F::apply(eta, theta[i], x_r[i], x_i[i]);
}

return(res);
}

}
}
147 changes: 147 additions & 0 deletions stan/math/prim/arr/functor/map_rect_mpi.hpp
@@ -0,0 +1,147 @@
#pragma once

#include <vector>
#include <set>

#include <stan/math/prim/mat/fun/to_array_1d.hpp>
#include <stan/math/prim/arr/functor/mpi_command.hpp>
#include <stan/math/prim/arr/functor/mpi_cluster.hpp>

namespace stan {
namespace math {

namespace internal {
// forward declare
void
distribute_map_rect_data(const std::vector<std::vector<double> >& x_r,
const std::vector<std::vector<int> >& x_i,
const std::size_t uid);


// data distribution needs to go to prim
Review comment (owner author):

what does this mean?

Reply:

I think the data distribution should be inside prim. That mechanism should be independent of rev/fwd/whatever. It's prim, not more.

Reply (owner author):

Ah, yes. I read it as a yet-to-be-accomplished directive, oops.

struct distributed_map_rect_data {
distributed_map_rect_data(std::size_t uid) : uid_(uid), N_(-1) {}
std::size_t uid_;
std::size_t N_;
std::vector<std::vector<double> > x_r_;
std::vector<std::vector<int> > x_i_;
static void distributed_apply() {
// called on workers to register data
distribute_map_rect_data(std::vector<std::vector<double> >(), std::vector<std::vector<int> >(), 0);
}
};

typedef boost::serialization::singleton<std::map<std::size_t,const distributed_map_rect_data> > distributed_data;
Review comment (@seantalts, Sep 11, 2017):

why is uid a size_t?

Reply:

size_t made sense to me, but it can be anything we agree on. I think using integers would be sensible. I think Bob wanted to have longs here. Whatever works.


void
distribute_map_rect_data(const std::vector<std::vector<double> >& x_r,
const std::vector<std::vector<int> >& x_i,
const std::size_t uid) {

boost::mpi::communicator world;

const std::size_t W = world.size();
const std::size_t R = world.rank();

// first check if uid is already registered
if(distributed_data::get_const_instance().find(uid) == distributed_data::get_const_instance().cend()) {
//std::cout << "Job " << R << " registers data..." << std::endl;
Review comment (owner author):

Is this stubbed out for now or old debug code?

Reply:

yep, this is old debug code. Not meant to stay.

} else {
//std::cout << "UID = " << uid << " is ALREADY distributed." << std::endl;
return;
}

std::vector<int> meta(4, 0);

if(R == 0) {
// initiate on the root call of this function on the workers
mpi_broadcast_command<stan::math::mpi_distributed_apply<distributed_map_rect_data> >();

meta[0] = uid;
meta[1] = x_r.size();
meta[2] = x_r[0].size();
meta[3] = x_i[0].size();
}

boost::mpi::broadcast(world, meta.data(), 4, 0);

distributed_map_rect_data data(meta[0]);

const std::size_t N = meta[1];
const std::size_t X_r = meta[2];
const std::size_t X_i = meta[3];

data.N_ = N;

//std::cout << "worker " << R << " / " << W << " registers shapes " << N << ", " << X_r << ", " << X_i << std::endl;

const std::vector<int> chunks = mpi_map_chunks(N, 1);

data.x_r_.resize(chunks[R]);
data.x_i_.resize(chunks[R]);

// flatten data and send out/receive using scatterv
if(X_r > 0) {
const std::vector<double> world_x_r = to_array_1d(x_r);
const std::vector<int> chunks_x_r = mpi_map_chunks(N, X_r);
std::vector<double> flat_x_r_local(chunks_x_r[R]);

boost::mpi::scatterv(world, world_x_r.data(), chunks_x_r, flat_x_r_local.data(), 0);

// now register data
for(std::size_t i = 0; i != chunks[R]; ++i)
data.x_r_[i] = std::vector<double>(flat_x_r_local.begin() + i * X_r, flat_x_r_local.begin() + (i+1) * X_r);

//std::cout << "Job " << R << " got " << flat_x_r_local.size() << " real data " << std::endl;
}
if(X_i > 0) {
const std::vector<int> world_x_i = to_array_1d(x_i);
const std::vector<int> chunks_x_i = mpi_map_chunks(N, X_i);
std::vector<int> flat_x_i_local(chunks_x_i[R]);

boost::mpi::scatterv(world, world_x_i.data(), chunks_x_i, flat_x_i_local.data(), 0);

// now register data
for(std::size_t i = 0; i != chunks[R]; ++i)
data.x_i_[i] = std::vector<int>(flat_x_i_local.begin() + i * X_i, flat_x_i_local.begin() + (i+1) * X_i);

//std::cout << "Job " << R << " got " << flat_x_i_local.size() << " int data " << std::endl;
}

distributed_data::get_mutable_instance().insert(std::make_pair(uid, data));

//std::cout << "Job " << R << " done caching data for uid = " << meta[0] << "." << std::endl;

return;
}
}

template <typename F>
std::vector<double>
map_rect_mpi(const std::vector<double>& eta,
Review comment (@seantalts, Sep 11, 2017):

This primitive version also shouldn't parallelize?

Reply:

It should... but I would be fine to make it non-parallel for the very first pull. The thing is, I wanted this stuff to work in an actual Stan program; until I figure out the right MPI code I am very reluctant to copy&paste the code which is right now in rev. Once we think the rev code is good to copy (and simplify) we should do that. Even better would be to figure out code which handles all cases at the same time.

const std::vector<std::vector<double> >& theta,
const std::vector<std::vector<double> >& x_r,
const std::vector<std::vector<int> >& x_i,
const std::size_t uid) {

std::vector<double> res;
const std::size_t N = theta.size();

for(std::size_t i = 0; i != N; ++i) {
const std::vector<double> f = F::apply(eta, theta[i], x_r[i], x_i[i]);
res.insert(res.end(), f.begin(), f.end());

//for(std::size_t j = 0; j != f.size(); j++)
// std::cout << "res[" << i << ", " << j << "] = " << f[j] << std::endl;
}

return(res);
}

}
}


BOOST_CLASS_EXPORT(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>);
BOOST_CLASS_TRACKING(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>,track_never);
BOOST_SERIALIZATION_FACTORY_0(stan::math::mpi_distributed_apply<stan::math::internal::distributed_map_rect_data>)