## OpenMP benchmark for Rcpp-based code

Here I test whether OpenMP speeds up some of the computations.

In [1]:
attach(readRDS('em_optim_difference.rds'))

Here, the sample size `N` is around 800, the number of variables `P` is around 600, and 50 conditions are involved.

In [2]:
X = cbind(X,X,X)

In [3]:
dim(X)

In [4]:
dim(Y)

In [5]:
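devtools::load_all('~/GIT/software/mmbr')
# Fit a deep copy of model `m` to data `d` using `n_thread` threads.
# Cloning keeps each benchmark run independent of earlier fits.
omp_test = function(m, d, n_thread) {
    x = m$clone(deep=TRUE)
    x$set_thread(n_thread)
    x$fit(d)
    return(0)
}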

Loading mmbr

Loading required package: mashr

Loading required package: ashr

Loading required package: susieR



I will benchmark on my computer, which has 40 CPU threads, using thread counts from 1 to 96.
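To put the thread counts in context, the sketch below checks how many logical cores R can see and flags which settings request more threads than that. It relies only on base R's `parallel::detectCores()`; the names `thread_grid` and `oversubscribed` are mine and are not part of the benchmark code.

```r
# Logical cores visible to R; can be NA where it cannot be determined
n_cores <- parallel::detectCores()

# Thread counts used in the benchmarks in this notebook
thread_grid <- c(1, 2, 3, 4, 8, 12, 24, 40, 96)

# TRUE where more threads are requested than cores are available
oversubscribed <- if (is.na(n_cores)) {
  rep(NA, length(thread_grid))
} else {
  thread_grid > n_cores
}

data.frame(threads = thread_grid, oversubscribed = oversubscribed)
```

On a machine with 40 logical cores, only the 96-thread setting would be flagged.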

## Center and scale the data

In [6]:
d = DenseData$new(X,Y)
d$standardize(TRUE,TRUE)

mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), resid_Y, mash_init)

In [7]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)

In [8]:
summary(res)[,c('expr', 'mean', 'median')]

| expr | mean | median |
|------|------|--------|
| c1 | 112.8897 | 79.78656 |
| c2 | 123.7933 | 87.43624 |
| c3 | 130.1537 | 84.48205 |
| c4 | 122.3935 | 70.99884 |
| c8 | 151.0569 | 91.92863 |
| c12 | 128.0823 | 83.81152 |
| c24 | 119.8294 | 88.32749 |
| c40 | 147.7913 | 102.26707 |
| c96 | 290.5133 | 268.58782 |


There is no advantage here, as expected: when the data are centered and scaled, parallelization happens at the mixture prior level, and with only one mixture component there is nothing to parallelize. Requesting 96 threads on this 40-thread machine more than doubles the run time.
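A toy model makes the point concrete. The sketch below assumes equal-cost, independent work units split evenly across threads with no overhead, which is an idealization rather than how the package schedules its work; `ideal_speedup` is a hypothetical helper, and the 10-unit case is purely illustrative.

```r
# Ideal speedup when n_units equal-cost, independent tasks are split across
# n_threads threads with no scheduling or memory overhead (an idealization).
ideal_speedup <- function(n_units, n_threads) {
  rounds <- ceiling(n_units / n_threads)  # tasks handled by the busiest thread
  n_units / rounds
}

threads <- c(1, 2, 4, 8, 40)

# One mixture component: a single work unit, so extra threads cannot help
ideal_speedup(1, threads)

# Hypothetical case with 10 work units: speedup is capped at 10
ideal_speedup(10, threads)
```

With a single unit the bound is 1 for every thread count, so any thread management cost can only slow things down.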

## Do not center and scale the data

This is more computationally intensive than the previous run because `sbhat` here differs for every variable, but parallelization now happens at the variable level.

In [9]:
d = DenseData$new(X,Y)
d$standardize(FALSE,FALSE)
mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), resid_Y, mash_init)

In [10]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)

In [11]:
summary(res)[,c('expr', 'mean', 'median')]

| expr | mean | median |
|------|------|--------|
| c1 | 367.9432 | 329.76213 |
| c2 | 279.7087 | 230.23741 |
| c3 | 222.2039 | 167.56115 |
| c4 | 175.0598 | 133.68217 |
| c8 | 164.1772 | 120.50207 |
| c12 | 156.2795 | 107.18487 |
| c24 | 153.163 | 97.05375 |
| c40 | 133.3515 | 106.20786 |
| c96 | 247.2431 | 238.34919 |


Multiple threads help here. Mean run time keeps decreasing as the number of threads increases up to 40 (the capacity of my computer), although the median is lowest at 24 threads. Requesting more threads than that degrades performance. Four threads seem to strike a good balance, cutting compute time by more than half relative to a single thread.
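The speedup and parallel efficiency implied by these timings can be computed directly. The sketch below copies the numbers from the table above; the variable names are mine, and the units are whatever `microbenchmark` reported.

```r
# Timings copied from the table above
threads  <- c(1, 2, 3, 4, 8, 12, 24, 40, 96)
mean_t   <- c(367.9432, 279.7087, 222.2039, 175.0598, 164.1772,
              156.2795, 153.1630, 133.3515, 247.2431)
median_t <- c(329.76213, 230.23741, 167.56115, 133.68217, 120.50207,
              107.18487, 97.05375, 106.20786, 238.34919)

speedup    <- mean_t[1] / mean_t   # relative to the single-thread run
efficiency <- speedup / threads    # fraction of ideal linear scaling

round(data.frame(threads, speedup, efficiency), 2)

# Thread count with the lowest mean and the lowest median run time
threads[which.min(mean_t)]
threads[which.min(median_t)]
```

Efficiency falls quickly after 4 threads, which is why that setting looks like a reasonable trade-off between run time and resources used.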

## Center and scale data but using mixture prior

Since we now use a mixture prior, the advantage of parallelization should kick in: with a common `sbhat`, we parallelize over the components of the prior mixture.

In [12]:
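d = DenseData$new(X,Y)  # rebuild: `d` from the previous section was left unstandardized
d$standardize(TRUE,TRUE)
mash_init = MashInitializer$new(create_cov_canonical(ncol(Y)), 1)
B = MashRegression$new(ncol(X), resid_Y, mash_init)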

In [13]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)

In [14]:
summary(res)[,c('expr', 'mean', 'median')]

| expr | mean | median |
|------|------|--------|
| c1 | 2342.0104 | 2340.3615 |
| c2 | 1272.294 | 1274.5796 |
| c3 | 960.206 | 919.9432 |
| c4 | 741.7434 | 712.733 |
| c8 | 425.8473 | 419.3659 |
| c12 | 344.0645 | 329.085 |
| c24 | 323.5198 | 279.0587 |
| c40 | 332.9058 | 312.0563 |
| c96 | 491.0983 | 452.3548 |


The advantage of multiple threads is clear when the mixture prior has a large number of components (about 60 for the canonical prior in this case). Mean run time drops from about 2342 with one thread to about 324 with 24 threads, then levels off and rises again at 96 threads.
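One rough way to read the plateau is through Amdahl's law, $S(n) = 1 / ((1 - p) + p/n)$, where $p$ is the fraction of the work that runs in parallel. The sketch below solves for $p$ at each thread count using the mean timings from the table above. This is my own back-of-the-envelope reading, not part of the original analysis, and it ignores thread overhead and oversubscription, so the 96-thread run is left out.

```r
# Mean timings copied from the table above (single-thread run kept separate)
threads <- c(2, 3, 4, 8, 12, 24, 40)
mean_t1 <- 2342.0104
mean_t  <- c(1272.2940, 960.2060, 741.7434, 425.8473,
             344.0645, 323.5198, 332.9058)

speedup <- mean_t1 / mean_t

# Invert Amdahl's law: 1/S = (1 - p) + p/n  =>  p = (1 - 1/S) / (1 - 1/n)
parallel_fraction <- (1 - 1 / speedup) / (1 - 1 / threads)

round(data.frame(threads, speedup, parallel_fraction), 3)
```

Under these assumptions the implied parallel fraction stays roughly between 0.88 and 0.94, which is consistent with the speedup flattening out once a few dozen threads are in use.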