
Add OpenMP support #189

Closed
rbchan opened this issue Jul 21, 2020 · 11 comments · Fixed by #195

@rbchan
Owner

rbchan commented Jul 21, 2020

Should be easy to add OpenMP statements like:

  omp_set_num_threads(nthreads);
  #pragma omp parallel for reduction (-:nll)

before the for loops that compute the negative log-likelihood in the C++ code. Some other clauses, like shared, might be necessary. This would require an nthreads argument for each fitting function.

OpenMP should result in huge speed improvements on Linux. My understanding is that support for OpenMP is limited, but getting better, on Windows and macOS.
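A minimal sketch of this idea, with hypothetical names (site_nll stands in for a per-site negative log-likelihood term, and total_nll is not unmarked's actual code); a + reduction is shown since nll is accumulated by addition:

```cpp
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical per-site contribution: -log Poisson density,
// dropping the log(y!) constant.
double site_nll(double y, double lambda) {
  return lambda - y * std::log(lambda);
}

double total_nll(const std::vector<double>& y, double lambda, int nthreads) {
#ifdef _OPENMP
  omp_set_num_threads(nthreads);
#endif
  double nll = 0.0;
  // Each iteration accumulates into a private copy of nll;
  // the copies are summed at the end of the loop.
  #pragma omp parallel for reduction(+:nll)
  for (long i = 0; i < (long)y.size(); i++) {
    nll += site_nll(y[i], lambda);
  }
  return nll;
}
```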

@kenkellner
Collaborator

I experimented with this a bit:

library(unmarked)
library(microbenchmark)

n <- 10000  # number of sites
T <- 4    # number of primary periods
J <- 3    # number of secondary periods
     
lam <- 3
phi <- 0.5
p <- 0.3

y <- array(NA, c(n, T, J))
M <- rpois(n, lam)          # Local population size
N <- matrix(NA, n, T)       # Individuals available for detection

for(i in 1:n) {
         N[i,] <- rbinom(T, M[i], phi)
         y[i,,1] <- rbinom(T, N[i,], p)    # Observe some
         Nleft1 <- N[i,] - y[i,,1]         # Remove them
         y[i,,2] <- rbinom(T, Nleft1, p)   # ...
         Nleft2 <- Nleft1 - y[i,,2]
         y[i,,3] <- rbinom(T, Nleft2, p)
}
     
y.ijt <- cbind(y[,1,], y[,2,], y[,3,], y[,4,])
umf1 <- unmarkedFrameGMM(y=y.ijt, numPrimary=T, type="removal")
     
microbenchmark(
  m1 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=1),
  m2 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=2),
  m4 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=4),
  times=5)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval cld
   m1 9.099484 9.111422 9.372926 9.418311 9.614503 9.620908     5   b
   m2 5.414858 5.548022 5.629184 5.631519 5.680126 5.871395     5  a
   m4 5.349892 5.388611 5.446901 5.482736 5.503174 5.510094     5  a

Looks like there definitely is a performance improvement. However, 2 threads and 4 threads performed about the same for me. My laptop has 2 cores with 2 threads per core, so that might be related. My CPU usage did spike to 100% for the 4-thread test.

I had to change the indexing around a bit for gmultmix, but otherwise didn't have to change any code. I also had to add the -fopenmp compiler flag for GCC, although I think there is a way to have Rcpp do that automatically.

@rbchan
Owner Author

rbchan commented Jul 21, 2020

Sounds promising. You may have had some other programs tying up the other two threads. What OS were you using? I've had great luck lately with OpenMP on Ubuntu.

BTW1: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-basics.html

BTW2: I'm thinking threads would be a better argument name than nthreads. Also, we might want to set it up so that threads=1 disables all the OpenMP stuff, so people can get the old behavior. I worry that adding OpenMP could open a can of worms on certain platforms. Perhaps we should keep it out of the master branch until we've had plenty of time to test it.

@kenkellner
Collaborator

I'm also on Ubuntu. I have a pretty limited understanding of cores, threads, and hyperthreading, but it seems like going beyond 2 threads on my CPU gets into hyperthreading, and you aren't expected to see much of a performance boost from that. I have access to another Linux machine with many more cores, so I'll compile it on that and see if performance scales further.

Looks like you can enable OpenMP conditionally pretty easily:

#pragma omp parallel for if(condition_holds)
for(...) {
}
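Concretely, the if() clause could be used so that threads=1 reproduces the old serial behavior exactly (a sketch; the function name and signature are made up for illustration):

```cpp
#include <vector>

// If the if() clause evaluates to false, the region runs serially on the
// calling thread, so threads = 1 gives the old single-threaded behavior.
double sum_with_optional_omp(const std::vector<double>& x, int threads) {
  double total = 0.0;
  bool use_omp = threads > 1;
  #pragma omp parallel for reduction(+:total) if(use_omp) num_threads(threads)
  for (long i = 0; i < (long)x.size(); i++) {
    total += x[i];
  }
  return total;
}
```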

The other issue is whether this code would compile without issues on CRAN/Windows/Mac. For example, Dirk warns about this in section 2.10.3 here:
http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf
This might be a little out of date re: macOS OpenMP support, but even if it's possible to compile on macOS, that doesn't mean it will work easily for R packages. Unfortunately, macOS is the one platform I have no way to test on.

I agree this is best kept in a branch for the time being. Maybe it could be worked on together with modernizing the interface with Rcpp that I mentioned a while back.

@kenkellner
Collaborator

On a machine with 4 cores and 2 threads per core, testing 1, 2, 4, and 8 threads:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
   m1 637.3505 637.6977 639.8118 640.2461 640.2783 643.4865     5
   m2 367.2343 369.1187 370.4815 369.8476 370.8641 375.3429     5
   m4 231.8589 233.7292 237.0996 236.9478 239.3805 243.5815     5
   m8 218.4185 219.7414 231.8048 220.5617 221.1067 279.1956     5

Continued speed-up with more cores, and as before, going beyond the number of cores doesn't speed things up as much.

@rbchan
Owner Author

rbchan commented Jul 21, 2020 via email

@kenkellner
Collaborator

I'm working on this here. For example, code for pcount. Even after reading the R-exts guidelines, it's not totally clear to me whether the #pragma statement itself needs to be protected in an _OPENMP conditional block. The #pragma should be ignored when compiling if OpenMP is unavailable, perhaps with a warning. We could always duplicate the loop code, with one copy inside the guarded block, but it would be nice to avoid that.
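For what it's worth, my reading of the guard pattern is sketched below (not the actual pcount code): guard the omp.h include and any omp_* runtime calls with #ifdef _OPENMP, and leave the #pragma line bare, since compilers without OpenMP skip unknown pragmas:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// The include and the omp_set_num_threads() call are guarded; the #pragma
// itself is left unguarded because compilers without OpenMP ignore unknown
// pragmas (possibly with a warning), so the loop body is never duplicated.
double guarded_parallel_sum(const double* x, long n, int threads) {
#ifdef _OPENMP
  omp_set_num_threads(threads);
#endif
  double total = 0.0;
  #pragma omp parallel for reduction(+:total)
  for (long i = 0; i < n; i++) {
    total += x[i];
  }
  return total;
}
```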

I looked at the source code for a few of the distribution functions and I think they're relatively thread-safe. Technically, Rf_dpois and the like can throw warnings under unusual circumstances, for example if lambda < 0. If we've written the likelihood correctly, this should never happen. If this is still a concern, we could also use the Rf_*_raw versions of the distribution functions, which generally don't do input sanity checks. As further evidence, after asking this question, Murray Efford seems to still be using distribution functions from the R API in parallel C++ code in secr.

On the other hand, the random number generators in the R API are definitely NOT thread safe from what I've seen, which makes sense.

@rbchan
Owner Author

rbchan commented Jul 27, 2020

The pcount code looks good to me, and based on what you said, it's fine with me if we go ahead and assume the density functions are thread-safe. It would be nice to find someone with a Mac to test it. Do you see any warnings from R CMD check --as-cran?

One thing I wonder about is the importance of setting OMP_PLACES and OMP_PROC_BIND, especially when running OpenMP code on multiple cores with the parallel package. I've been doing this on another project, and it seems like more than one OpenMP chunk can land on the same thread. This might happen in unmarked if people are using parboot in parallel. To be safe, I've been setting:

OMP_PLACES="threads"
OMP_PROC_BIND="spread"

but I can't say for sure if this is necessary. More info here and here.

@kenkellner
Collaborator

Setting OMP_PROC_BIND="spread" has a pretty big negative impact on performance for me when I set threads greater than the number of cores. I'm setting it in the pragma statement with proc_bind(spread) (see here).
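For reference, a sketch of the per-region clause approach (the function name is made up; proc_bind is an OpenMP 4.0 clause, so it needs a reasonably recent compiler):

```cpp
// proc_bind(spread) requests that this region's threads be spread across
// places, without touching the OMP_PROC_BIND environment variable globally.
double spread_sum(const double* x, long n) {
  double total = 0.0;
  #pragma omp parallel for proc_bind(spread) reduction(+:total)
  for (long i = 0; i < n; i++) {
    total += x[i];
  }
  return total;
}
```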

I'm not sure how to properly set the OMP_PLACES variable from Rcpp; it seems like it's complicated.

Regardless of where we land on the above, I wonder if we should just force threads=1 for any parallel operations in unmarked, e.g. in parboot. Intuitively, it doesn't seem like using more threads in the likelihood calculation will help if multiple models are already being run in parallel.

@rbchan
Owner Author

rbchan commented Jul 27, 2020 via email

@kenkellner
Collaborator

Also worth considering: Armadillo uses OpenMP automatically for some computations when it's available (see http://arma.sourceforge.net/faq.html). I've not been able to determine exactly what conditions trigger this, but it appears to happen consistently for occuMulti. If you try to layer additional parallelization on top of it, either by using OpenMP in C++ code or by using parallel in R, you get much slower runtimes than if you had avoided running in parallel in the first place.

@rbchan
Owner Author

rbchan commented Sep 14, 2020 via email
