
nullmodel changes in 2.5-0 #255

Closed
jarioksa opened this issue Oct 11, 2017 · 2 comments

jarioksa commented Oct 11, 2017

The nullmodel() functions have changed considerably during 2.5-0 development, and parts of the code are still in flux: some functions and choices may be removed before the release, and some may need rethinking. Here is a summary of the major changes against 2.4-x:

  • Compiled code now uses the .Call() interface with proper registration. The effect on speed is probably marginal, but the interface is much cleaner.
  • The methods picking 2x2 submatrices are now much more parsimonious with random numbers, which gives a considerable speed-up but also means that the results differ from the previous implementation when using the same seed. This concerns methods quasiswap, swap, tswap, quasiswap_count, swap_count and abuswap_*, plus methods using quasiswap internally (swsh_*). Profiling (also of compiled code) showed that these methods spent most of their time generating random numbers. We used four random numbers per 2x2 submatrix (two row indices, two column indices). Now we have two alternative schemes. The "3minus" scheme first picks one element directly from the matrix and then a second row (2 random numbers); if these give a submatrix that cannot be swapped (both cells 1 or both 0) or quasiswapped (both 0), it skips drawing the second column and starts a new cycle. This uses 3 or 2 random numbers per cycle (see the analysis in commit message 83281ae); a simplified sketch of the idea appears after this list. In the "2x2" scheme we draw two diagonal elements directly with 2 random numbers and take the antidiagonal elements from their row and column indices (implemented in 33e9813), so it always uses exactly 2 random numbers. This was implemented in the quantitative swap methods (810d902), but surprisingly it was usually slower and at best only marginally faster in binary quasiswaps, where we therefore use the "3minus" scheme (analysis in commit message 72b2d93).
  • There are two new quasiswap algorithms on steroids (merged in 175889c). In greedyqswap the first element is picked from the >1 elements, increasing the chances of sum-of-squares-reducing quasiswaps. The search for the last >1 elements takes most of the time in quasiswap, and being greedy gives a huge speed-up. Greedy steps are biased, but if we thin to, say, 1% greedy steps among ordinary quasiswaps, we can still double the speed with little risk of bias. The other method is boostedqswap, which is based on the same idea as curveball: in curveball we find the set of species that occur in only one of two compared sites, and in boostedqswap we find the species that are more abundant in one of the sites (1 against 0, 2 against 1, etc.) and quasiswap an equal number of these up and down on the two rows. My first tests indicate that this is biased (although the similar curveball should be unbiased). If so, it will be removed and not released. Both methods need testing, and neither will be released if it appears to be biased (there is no lack of poor null models in this world).
  • Backtracking now uses compiled code, with a huge speed-up. Backtracking is biased, but it is a classic method that may be useful for comparative purposes.
  • There is a new quantitative method that fills a count matrix honouring row and column sums. The method is non-public and has no make.commsim() interface, but it can be called as .Call("do_rcfill", n, rs, cs), where n is the number of simulated matrices and rs and cs are the row and column sums (a hedged usage sketch also follows this list). The main reason for implementing this method was that posts on R-devel and StackOverflow claimed that stats::r2dtable() (which we use extensively internally) gives too regular data. I checked the Miklós & Podani paper and found that they used a function giving more dispersed data as the initial matrix in their quasiswap, and implemented that method. My analysis indicates that the huge number of steps we need in quasiswap guarantees that the initial matrix does not influence the result, but the new function is faster than r2dtable() and may speed up simulations. Another use for this function would be as a null model that generates count data with larger variance than the r2dtable null model we have now. However, I still hesitate to release this function, because there really is no lack of poor null models (and by now I think all quantitative null models for counts are poor).
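
As a forward-referenced illustration of the "3minus" scheme: the sketch below is a simplified R version of the random-number accounting described above, for binary matrices only. It is not the vegan implementation (which is in C); the function name is hypothetical, and details such as indexing and rejection handling may differ.

    ## Hypothetical, simplified R sketch of the "3minus" 2x2 sampling idea:
    ## draw one cell and a second row (2 random numbers); draw a second
    ## column (a 3rd number) only if the column pair could still be swapped.
    swap_3minus_step <- function(m) {
        nr <- nrow(m); nc <- ncol(m)
        ## 1st random number: one cell drawn directly from the matrix
        cell <- sample.int(nr * nc, 1L)
        i1 <- (cell - 1L) %% nr + 1L
        j1 <- (cell - 1L) %/% nr + 1L
        ## 2nd random number: a second row
        i2 <- sample.int(nr, 1L)
        ## both cells 0 or both 1: no swap is possible, so skip the
        ## second column and start a new cycle (only 2 numbers used)
        if (i2 == i1 || m[i1, j1] == m[i2, j1])
            return(m)
        ## 3rd random number: the second column
        j2 <- sample.int(nc, 1L)
        ## swap only if the 2x2 submatrix has the checkerboard pattern
        if (j2 != j1 && m[i1, j2] != m[i2, j2] && m[i1, j1] != m[i1, j2]) {
            m[c(i1, i2), c(j1, j2)] <- 1L - m[c(i1, i2), c(j1, j2)]
        }
        m
    }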
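
The internal count-filling routine can be exercised roughly as below. This is a hedged sketch: do_rcfill is unexported and undocumented, the PACKAGE argument and integer coercion are assumptions, and both the entry point and its return value may change or disappear before a release.

    library(vegan)
    data(BCI)
    rs <- as.integer(rowSums(BCI))
    cs <- as.integer(colSums(BCI))

    ## one simulated count matrix honouring the row and column sums
    ## (unexported entry point; shape of the result is not documented here)
    sim <- .Call("do_rcfill", 1L, rs, cs, PACKAGE = "vegan")
    str(sim)

    ## for comparison, the r2dtable() fill currently used internally
    ref <- r2dtable(1, rs, cs)[[1]]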

jarioksa commented Oct 11, 2017

Here are microbenchmark results showing the effect of do_rcfill on the quasiswap methods.
[Figure: microbenchmark timings of quasiswap variants with r2dtable vs. rcfill initial matrices (mb_rcquasi)]
Currently we use r2dtable() to get the initial matrices for the quasiswap methods, and the uppermost pair shows the speed advantage of switching to rcfill. The graph also shows the intolerable slowness of quasiswap: because the basic quasiswap is so slow, the switch to rcfill has little effect there. The greedyqswap pair has thin=100, meaning that only the first of every 100 steps is greedy (1% greediness), and even this gives a huge speed-up. quasiswap_count spent some 75% of its time in r2dtable, so the boost from using rcfill instead of r2dtable was also large there.
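
A comparison along these lines could be set up with microbenchmark roughly as below. This is only a sketch under assumptions: do_rcfill is unexported and may not be callable like this in a released version, and the timings will of course depend on the machine.

    library(vegan)
    library(microbenchmark)
    data(BCI)
    rs <- as.integer(rowSums(BCI))
    cs <- as.integer(colSums(BCI))

    ## time the two ways of generating 100 initial count matrices
    microbenchmark(
        r2dtable = r2dtable(100, rs, cs),
        rcfill   = .Call("do_rcfill", 100L, rs, cs, PACKAGE = "vegan"),
        times = 10
    )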


jarioksa commented Jan 17, 2018

Some information about timing. We already got some speed-up within the 2.4-x releases:

  • the random-number routines no longer check for sufficient data at every step (the check is done once beforehand)
  • make.commsim uses a snappier loop
  • "backtrack" was made faster (but still uses R)

The major changes in 2.5-0:

  • the .Call() interface is used with faster internal C code: fewer random numbers are generated, and loops were moved from R to C
  • "backtrack" is implemented in C

Benchmarking was performed with the microbenchmark package. Times are averages over 100 replicates and give the running time in seconds per 100 simulations of the BCI data (a rough setup sketch follows the table):

Model          Arguments                   2.4-0    2.4-5    2.5-0   Notes
backtrack                                  32.504   23.282   2.085   biased
swap           thin=2000, burnin=1000       0.295    0.283   0.175   sequential, biased
tswap          thin=40000, burnin=25000     0.232    0.226   0.121   sequential
curveball      thin=200                     0.064    0.052   0.037   sequential
quasiswap      thin=1000                    8.675    8.559   7.262   mild bias with thin=0
greedyqswap    thin=100                                      2.035   biased with thin=0
greedyqswap    thin=20                                       1.000
boostedqswap                                                 1.234   biased b067566
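
A benchmark of this kind could be set up roughly as below. This is a sketch, not the script behind the table: it uses only methods available in released vegan (quasiswap and curveball), and the thin value follows the curveball row of the table; the development-only methods would be substituted where available.

    library(vegan)
    library(microbenchmark)
    data(BCI)

    ## time 100 simulations of two null models on the BCI data
    nm_quasi <- nullmodel(BCI, "quasiswap")
    nm_curve <- nullmodel(BCI, "curveball")

    microbenchmark(
        quasiswap = simulate(nm_quasi, nsim = 100),
        curveball = simulate(nm_curve, nsim = 100, thin = 200),
        times = 10
    )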
