
Add OpenMP support #189

Closed
rbchan opened this issue Jul 21, 2020 · 11 comments · Fixed by #195

@rbchan
Owner

rbchan commented Jul 21, 2020

Should be easy to add OpenMP statements like:

  omp_set_num_threads(nthreads);
  #pragma omp parallel for reduction (-:nll)

before the for loops that compute the negative log-likelihood in the C++ code. Some other clauses, like shared, might be necessary. This would require an nthreads argument for each fitting function.

OpenMP should result in huge speed improvements on Linux. My understanding is that support for OpenMP is limited, but getting better, on Windows and macOS.
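A minimal sketch of this idea, with hypothetical names (site_nll stands in for a per-site negative log-likelihood term, and total_nll is not unmarked's actual code); a + reduction is shown since nll is accumulated by addition:

```cpp
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical per-site contribution: -log Poisson density,
// dropping the log(y!) constant.
double site_nll(double y, double lambda) {
  return lambda - y * std::log(lambda);
}

double total_nll(const std::vector<double>& y, double lambda, int nthreads) {
#ifdef _OPENMP
  omp_set_num_threads(nthreads);
#endif
  double nll = 0.0;
  // Each iteration accumulates into a private copy of nll;
  // the copies are summed at the end of the loop.
  #pragma omp parallel for reduction(+:nll)
  for (long i = 0; i < (long)y.size(); i++) {
    nll += site_nll(y[i], lambda);
  }
  return nll;
}
```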

@kenkellner
Collaborator

I experimented with this a bit:

library(unmarked)
library(microbenchmark)

n <- 10000  # number of sites
T <- 4    # number of primary periods
J <- 3    # number of secondary periods
     
lam <- 3
phi <- 0.5
p <- 0.3

y <- array(NA, c(n, T, J))
M <- rpois(n, lam)          # Local population size
N <- matrix(NA, n, T)       # Individuals available for detection

for(i in 1:n) {
         N[i,] <- rbinom(T, M[i], phi)
         y[i,,1] <- rbinom(T, N[i,], p)    # Observe some
         Nleft1 <- N[i,] - y[i,,1]         # Remove them
         y[i,,2] <- rbinom(T, Nleft1, p)   # ...
         Nleft2 <- Nleft1 - y[i,,2]
         y[i,,3] <- rbinom(T, Nleft2, p)
}
     
y.ijt <- cbind(y[,1,], y[,2,], y[,3,], y[,4,])
umf1 <- unmarkedFrameGMM(y=y.ijt, numPrimary=T, type="removal")
     
microbenchmark(
  m1 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=1),
  m2 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=2),
  m4 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=4),
  times=5)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval cld
   m1 9.099484 9.111422 9.372926 9.418311 9.614503 9.620908     5   b
   m2 5.414858 5.548022 5.629184 5.631519 5.680126 5.871395     5  a
   m4 5.349892 5.388611 5.446901 5.482736 5.503174 5.510094     5  a

Looks like there definitely is a performance improvement. However, 2 threads and 4 threads performed about the same for me. My laptop has 2 cores with 2 threads per core, so that might be related. My CPU usage did spike to 100% for the 4-thread test.

I had to change the indexing around a bit for gmultmix, but otherwise didn't have to change any code. I also had to add the -fopenmp compiler flag for GCC, although I think there is a way to have Rcpp do that automatically.

@rbchan
Owner Author

rbchan commented Jul 21, 2020

Sounds promising. You may have had some other programs tying up the other two threads. What OS were you using? I've had great luck lately with OpenMP on Ubuntu.

BTW1: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-basics.html

BTW2: I'm thinking threads would be a better argument name than nthreads. Also, we might want to set it up so that threads=1 disables all the OpenMP stuff, so people can get the old behavior. I worry that adding OpenMP could open a can of worms on certain platforms. Perhaps we should keep it out of the master branch until we've had plenty of time to test it.

@kenkellner
Collaborator

I'm also on Ubuntu. I have a pretty limited understanding of cores, threads, and hyperthreading, but it seems like going beyond 2 threads on my CPU gets into hyperthreading, and you aren't expected to see much of a performance boost from that. I have access to another Linux machine with many more cores, so I'll compile it on that and see if performance scales further.

Looks like you can enable OpenMP conditionally pretty easily:

#pragma omp parallel for if(condition_holds)
for(...) {
}
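Concretely, the if() clause could be used so that threads=1 reproduces the old serial behavior exactly (a sketch; the function name and signature are made up for illustration):

```cpp
#include <vector>

// If the if() clause evaluates to false, the region runs serially on the
// calling thread, so threads = 1 gives the old single-threaded behavior.
double sum_with_optional_omp(const std::vector<double>& x, int threads) {
  double total = 0.0;
  bool use_omp = threads > 1;
  #pragma omp parallel for reduction(+:total) if(use_omp) num_threads(threads)
  for (long i = 0; i < (long)x.size(); i++) {
    total += x[i];
  }
  return total;
}
```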

The other issue is whether this code would compile without issues on CRAN/Windows/Mac. For example, Dirk warns about this in section 2.10.3 here:
http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf
This might be a little out of date re: macOS OpenMP support, but even if it's possible to compile on macOS, that doesn't mean it will work easily for R packages. Unfortunately, macOS is the one platform I have no way to test on.

I agree this is best kept in a branch for the time being. Maybe it could be worked on together with modernizing the interface with Rcpp that I mentioned a while back.

@kenkellner
Collaborator

On a machine with 4 cores and 2 threads per core, testing 1, 2, 4, and 8 threads:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
   m1 637.3505 637.6977 639.8118 640.2461 640.2783 643.4865     5
   m2 367.2343 369.1187 370.4815 369.8476 370.8641 375.3429     5
   m4 231.8589 233.7292 237.0996 236.9478 239.3805 243.5815     5
   m8 218.4185 219.7414 231.8048 220.5617 221.1067 279.1956     5

Continued speed-up with more cores, and as before, going beyond the number of cores doesn't speed things up as much.

@rbchan
Owner Author

rbchan commented Jul 21, 2020 via email

@kenkellner
Collaborator

I'm working on this here. For example, code for pcount. Even after reading the R-exts guidelines, it's not totally clear to me whether the #pragma statement itself needs to be protected in an _OPENMP conditional block. The #pragma should be ignored when compiling if OpenMP is unavailable, perhaps with a warning. We could always duplicate the loop code, with one copy inside the guarded block, but it would be nice to avoid that.
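For what it's worth, my reading of the guard pattern is sketched below (not the actual pcount code): guard the omp.h include and any omp_* runtime calls with #ifdef _OPENMP, and leave the #pragma line bare, since compilers without OpenMP skip unknown pragmas:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// The include and the omp_set_num_threads() call are guarded; the #pragma
// itself is left unguarded because compilers without OpenMP ignore unknown
// pragmas (possibly with a warning), so the loop body is never duplicated.
double guarded_parallel_sum(const double* x, long n, int threads) {
#ifdef _OPENMP
  omp_set_num_threads(threads);
#endif
  double total = 0.0;
  #pragma omp parallel for reduction(+:total)
  for (long i = 0; i < n; i++) {
    total += x[i];
  }
  return total;
}
```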

I looked at the source code for a few of the distribution functions and I think they're relatively thread-safe. Technically, Rf_dpois and the like can throw warnings under unusual circumstances, for example if lambda < 0. If we've written the likelihood correctly, this should never happen. If this is still a concern, we could also use the Rf_*_raw versions of the distribution functions, which generally don't do input sanity checks. As further evidence, after asking this question, Murray Efford seems to still be using distribution functions from the R API in parallel C++ code in secr.

On the other hand, the random number generators in the R API are definitely NOT thread safe from what I've seen, which makes sense.

@rbchan
Owner Author

rbchan commented Jul 27, 2020

The pcount code looks good to me, and based on what you said, it's fine with me if we go ahead and assume the density functions are thread-safe. It would be nice to find someone with a Mac to test it. Do you see any warnings from R CMD check --as-cran?

One thing I wonder about is the importance of setting OMP_PLACES and OMP_PROC_BIND, especially when running OpenMP code on multiple cores with the parallel package. I've been doing this on another project, and it seems like more than one OpenMP chunk can land on the same thread. This might happen in unmarked if people are using parboot in parallel. To be safe, I've been setting:

OMP_PLACES="threads"
OMP_PROC_BIND="spread"

but I can't say for sure if this is necessary. More info here and here.

@kenkellner
Collaborator

Setting OMP_PROC_BIND="spread" has a pretty big negative impact on performance for me when I set threads greater than the number of cores. I'm setting it in the pragma statement with proc_bind(spread) (see here).
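For reference, a sketch of the per-region clause approach (the function name is made up; proc_bind is an OpenMP 4.0 clause, so it needs a reasonably recent compiler):

```cpp
// proc_bind(spread) requests that this region's threads be spread across
// places, without touching the OMP_PROC_BIND environment variable globally.
double spread_sum(const double* x, long n) {
  double total = 0.0;
  #pragma omp parallel for proc_bind(spread) reduction(+:total)
  for (long i = 0; i < n; i++) {
    total += x[i];
  }
  return total;
}
```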

I'm not sure how to properly set the OMP_PLACES variable from Rcpp; it seems like it's complicated.

Regardless of where we land on the above, I wonder if we should just force threads=1 for any parallel operations in unmarked, e.g. in parboot. Intuitively, it doesn't seem like using more threads in the likelihood calculation will help if multiple models are already being run in parallel.

@rbchan
Owner Author

rbchan commented Jul 27, 2020 via email

@kenkellner
Collaborator

Also worth considering: Armadillo uses OpenMP automatically for some computations when it's available (see http://arma.sourceforge.net/faq.html). I've not been able to determine exactly what conditions trigger this, but it appears to happen consistently for occuMulti. If you try to layer additional parallelization on top of it, either by using OpenMP in C++ code or by using parallel in R, you get much slower runtimes than if you had avoided running in parallel in the first place.

@rbchan
Owner Author

rbchan commented Sep 14, 2020 via email
