Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Call interface for nullmodel simulation #197

Closed
jarioksa opened this issue Sep 15, 2016 · 7 comments
Closed

.Call interface for nullmodel simulation #197

jarioksa opened this issue Sep 15, 2016 · 7 comments
Labels

Comments

@jarioksa
Copy link
Contributor

@jarioksa jarioksa commented Sep 15, 2016

I have started implementing .Call() interface to nullmodel simulation. Currently we have several compiled C functions and we access these with .C() interface as set up in make.commsim() and used then by simulate.nullmodel(). I wanted to see if these functions can be made more efficient by switching from .C to .Call.

In .C interface we pass basic C structures, typically R data as (pointers to) C vectors. The passing of arguments and return of results happens via arguments. In .Call interface we pass R structures and can return R structures. I have now implemented two test cases which together cover all variants in nestedness.c (i.e., sequential and non-sequential models). For sequential methods I have implemented .Call to trial swap, and for non-sequential the .Call to quasiswap. The traditional interface to trial swap is:

    out <- array(0L, c(nr, nc, n))
    out[, , 1] <- .C("trialswap", m = x, nr, nc, thin, PACKAGE = "vegan")$m
    for (k in seq_len(n - 1)) out[, , k + 1] <- .C("trialswap", 
        m = out[, , k], nr, nc, thin, PACKAGE = "vegan")$m
    out

We first create an empty 3D array out, then fill the first facet (third dimension, out[,,1]) with swapped input matrix (m), and then we loop over all n simulations and fill the next matrix with the swapped previous one with repeated .C("trialswap", ...). The new interface is:

 .Call("do_tswap", as.matrix(x), n, thin, PACKAGE = "vegan")

That's all. We pass an R matrix, create the 3D array within the compiled code, loop over its third facet in compiled code, and finally return the 3D array. The do_tswap function will call the very same trialswap C code as the old interface, but this call happens within compiled code instead of crossing R/C border. Using .Call interface normally means moving a part of R into C while keeping the old C.

I would appreciate tests and comments on the new interface. The new interface is now available in branch nestedness-Calls of the devel branch of vegan. The new interfaces can be defined in the nullmodel command or in oecosimu. For instance, the version above can be invoked as:

nm <- nullmodel(BCI, "dotswap")
oecosimu(BCI, nestedchecker, "dotswap", thin=1000)

The new alternative models are "dotswap" , "do2tswap" (an alternative implementation of the same) and "doquasiswap", and the calls to old interfaces also work ("tswap", "quasiswap"). The old and new implementation should give equal results with the same random number seed so it should be testing of robustness, flexibility and speed.

All other nullmodels can be made as a variant of these or even with the same functions with extra argument of method, but I want to gain some experience on the code before implementing these for all null models. In particular, I have just started looking at writing code for .Call() interface a couple of weeks ago, and I need to check the code thoroughly. Do you have time to look at these, @psolymos ?

My first results are that trial swap indeed is much faster with the .Call() interface: I have micorbenchmarked 7-fold speed-up for larger data sets (such as BCI and bryceveg), but the effect varies with platform and data properties. However, I have not found any clear benefits with quasiswap. At best I have had <1% speed-up, but the results are inconsistent and sometimes (though more rarely) the new interface is even slower than the old interface. It seems that quasiswap spends most of its time within compiled code anyway, and mere interface is such a small extra cost that it can hardly be measured. I looked at the working of the C code, and it seems that for instance in BCI we need usually over 1.5 Million cycles to qusiswap the input matrix into 0/1 matrix. Even the tiny sipoo data need 150,000 passes to quasiswap the input matrix. The quasiswap algorithm could be easily parallelized, and that could give a clear advantage, and would probably be easier in R than within C. Perhaps it makes sense to keep the old interface for computer-heavy non-sequential methods, but implement sequential methods with .Call.

@psolymos
Copy link
Contributor

@psolymos psolymos commented Sep 15, 2016

@jarioksa : I will have a look and run some tests.

@jarioksa
Copy link
Contributor Author

@jarioksa jarioksa commented Sep 16, 2016

There are a couple of surprising issues with the current implementation.

Currently I do not duplicate input arguments before they are modified. This can give surprising behaviour like this:

x <- r2dtable(1, rowSums(sipoo), colSums(sipoo))[[1]] # create one matrix
xwork <- x
xnew <- .Call("do_quasiswap", xwork, 1, 1)

After this xnew will contain quasi-swapped matrix of zeros and ones, but actually xwork and x were modified as well and are identical to xnew. To avoid this, we should duplicate arguments before modifying them.

Even when make.commsim sets storage.mode(out) <- "double", the output of do_quasiswap is integer and the set storage.mode would be lost. There are conflicts now in swsh_samp, swsh_samp_r, swsh_samp_c. All these set mode = "double" and set storage.mode(out) <- "double" before calling quasiswap. If this quasiswap call is changed to do_quasiswap, out will be changed to integer. However, as .Call("do_quasiswap", ...) is not faster than .C("quasiswap", ...) we probably need not change that and problems won't emerge. Something to think, though.

@jarioksa
Copy link
Contributor Author

@jarioksa jarioksa commented Sep 19, 2016

There is now an alternative .Call interface to all basic C level functions. The old interface is still there, and the new is invoked with the same nullmodel (make.commsim method) name with prefix do, i.e., "doswap", "dotswap", "docurveball", "doquasiswap", "doswap_count", "doquasiswap_count", "doabuswap_c", "doabuswap_r".

Quasiswap function is called in many quantitative null models, but these interfaces still use the old .C interface. My tests on Mac and Linux show that .Call interface is not faster for quasiswap, and we shall keep the .C interface if this is true. This needs testing.

@jarioksa
Copy link
Contributor Author

@jarioksa jarioksa commented Sep 20, 2016

My test with self-built current R-devel on Linux:
mb

The model parameters were selected so that all examples run in ca. 100 ms in release (2.4-1) version of nullmodel. The do-model is the one using .Call interface, and the plain name is for the old .C interface. The parameters for approximately equal running time are highly dependent on the system used and on the data properties (numbers of rows and columns, fills, abundance values). There were also changes in C files so that some speed-up to release version came from this. In particular, the RNG generation is faster now. There is a radical improvement in all cases except in non-sequential null models (quasiswap_count, quasiswap). The quasiswap method is particularly slow and cannot be helped with .Call.

Here the command I used in benchmarking with microbenchmark package:

mb <- microbenchmark(
    simulate(nullmodel(BCI, "swap"), 400, thin=15),
    simulate(nullmodel(BCI, "doswap"), 400, thin=15),
    simulate(nullmodel(BCI, "tswap"), 400, thin=400),
    simulate(nullmodel(BCI, "dotswap"), 400, thin=400),
    simulate(nullmodel(BCI, "curveball"), 300, thin=50),
    simulate(nullmodel(BCI, "docurveball"), 300, thin=50),
    simulate(nullmodel(BCI, "swap_count"), 400),
    simulate(nullmodel(BCI, "doswap_count"), 400),
    simulate(nullmodel(BCI, "abuswap_r"), 300),
    simulate(nullmodel(BCI, "doabuswap_r"), 300),
    simulate(nullmodel(BCI, "abuswap_c"), 300),
    simulate(nullmodel(BCI, "doabuswap_c"), 300),
    simulate(nullmodel(BCI, "quasiswap_count"), 60),
    simulate(nullmodel(BCI, "doquasiswap_count"), 60),
    simulate(nullmodel(BCI, "quasiswap"), 2),
    simulate(nullmodel(BCI, "doquasiswap"), 2))
@jarioksa
Copy link
Contributor Author

@jarioksa jarioksa commented Sep 23, 2016

I have run some more tests, and it seems that .Call interface is as robust as the earlier one. With the same random number seed we get same results with both interfaces. In Linux and in macOS the .Call is also clearly faster for sequential methods. I haven't tried in Windows. There is not much difference in non-sequential methods (quasiswap, quasiswap_count). I may leave the .C interface for quasiswap, but I'll switch sequential methods to .Call (and I don't know yet what I do with quasiswap_count which has a small advantage with .Call). I'll remove the old .C interface and switch the method names to use the .Call. I'll do that in a separate branch so that the old branch is still available for testing of .C and .Call alternatives.

I have been baffled by the differences in running time. After all, both .Call and .C use exactly the same C code. With .C interface we call that code directly. With .Call interface we call another C function that does some extra things and then calls the same C code as .C did directly.

I profiled trial swap with 9999 simulations and thinning 1000 for BCI, and found out that it spent less than 20% of its running time in .C and the rest in the loop calling .C. It seems like the out[,,k+1] <- .C("trialswap", m = out[,,k], ...) means too much traffic with the out array. I found out that we can actually speed up the calculations by using this variant of the function:

tswap2 <-  
    function (x, n, nr, nc, rs, cs, rf, cf, s, fill, thin) 
{
    out <- array(0L, c(nr, nc, n))
    for (k in seq_len(n)) {
        x <- .C("trialswap", m = x, nr, nc, thin, PACKAGE = "vegan")$m
        out[,,k] <- x
    }
    out
}

Here we sequentially change x and save the current update to out, but we do not need to touch out so many times. This reduced the time in one profile run from 2.9 sec to 1.9 sec (i.e, by 0.1 msec/simulation) and spent nearly 50% of its time in .C. However, the .Call interface was still one second faster (0.9 sec running time) and spent 90% of its time in .Call. The R overhead was almost completely made of setting dimension names which took nearly 10% of running time. Here still a graph of microbenchmark with 999 simulations: nm is the current "tswap" , nm2 is with the code above, and nmdo is the "dotswap" model with .Call inteface
mbtswap

Profiling of quasiswap revealed that it spent some 97% to 98% of its time in .C or in .Call (99 simulations with BCI, no thinning). The rest 2% was mainly spent in r2dtable and the overhead of looping and storing results was minimal, and there is little we can do to speed up function with improved inteface. Quasiswap C code is much slower so that R house keeping is not so important a part of running time. With this BCI data set it seems that we can run 15,000 trial swaps with thinning 1000 in in the same time as one quasiswap. I assume the quasiswap problems are in the inherent slowness of the algorithm. Finding and quasiswapping the last above-one cell needs a huge amount of cycles. Perhaps we should update our recommendation: sequential methods are so much faster even with large thinning that better quality data can be found with them much more conveniently than with quasiswap.

@psolymos
Copy link
Contributor

@psolymos psolymos commented Sep 26, 2016

@jarioksa here are microbenchmark results from Windows 10 and R 3.3.1:

image

image

Timings are very similar to what you have found on Linux and Mac OS X.

Regarding the recommendations: I think it would be best to dichotomize it, i.e. seqential versions might run faster, but there are certain advantages/guarantees to quasiswap if your latest fix makes it truly unbiased.

@jarioksa
Copy link
Contributor Author

@jarioksa jarioksa commented Sep 29, 2016

Finally a comparison of new null models in vegan 2.5-0 (master branch here in github) against the latest vegan release 2.4-1. The new 2.5-0 was run on self-built R-devel (2016-09-27 r71386) and the release version was run in R 3.2.3 (2015-12-10) which ships with the latest Ubuntu Linux. This is not the latest official R release, but it is the one that comes with my OS. There are usually drastic changes. A part of these changes are probably due to faster random number generation in the new code, but for sequential methods it is mainly due to the new .Call interface. NB., the "quasiswap" method uses the old .C interface and there is no difference between old and new versions except that RNG is faster, but still the old version is faster than the new one. I have noticed in some other tests that the stock R 3.2.3 may be faster than my self-built R-devel. I don't know if the difference is in R or in the way the binaries were built. However, even with potentially slower R the NEW versions are much faster.
mb250vs241

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.