
Implement residualized predictor permutation as an option in anova()/permutest() methods for CCA/RDA/dbRDA #542

Closed
gavinsimpson opened this issue Oct 26, 2022 · 5 comments

Comments

@gavinsimpson (Contributor)

Cajo has a new paper (ter Braak & te Beest, 2022) out showing that the residualized response permutation method in Canoco 5 (versions < 5.15) and vegan doesn't work so well (it has grossly inflated Type I error rates) in situations where (quoting from the abstract):

the abundance data are both overdispersed and highly variable in site total. In contrast, residualized predictor permutation controlled the type I error rate and had good power, also when the predictors were skewed or binary. After square-root or log transformation of the abundance data, the differences between the permutation […]

{ade4} has response permutation and so it isn't affected, but it can't test partial ordinations. So, either we warn people about the specific issues of overdispersion in the presence of highly variable site totals (and we know how that will go down – they won't read the help and/or will ignore warnings we print), or we do that and also implement the residualized predictor permutation method that Cajo & co describe in the paper. There is R code associated with the paper, so implementing it shouldn't be hard, though it might need converting to vegan-like R code (I don't know, I haven't looked at it yet) and checking the licence under which it was distributed (or asking Cajo for permission to include it in vegan), or we write our own implementation from the description in the paper.

(@jarioksa if you don't have access to the paper, let me know and I'll send the PDF your way)

References

ter Braak, C.J.F. & te Beest, D.E. (2022). Testing environmental effects on taxonomic composition with canonical correspondence analysis: alternative permutation tests are not equal. Environmental and Ecological Statistics 29, 849–868. https://doi.org/10.1007/s10651-022-00545-4

@jarioksa (Contributor)

I read the paper, and it makes sense. Now the question is how to implement this in vegan.


jarioksa commented Oct 29, 2022

An intriguing detail is that the CCA permutation tests re-weighted X-variates by permuted community weights as long as we used R code, that is, up to vegan release 2.4-6 (permutest.cca was unchanged since 2.4-2). Therefore cca tests were measurably slower than unweighted (rda & friends) tests. When I implemented the permutation tests in C with the .Call interface, I first used the same re-weighting scheme, but then dropped it for Canoco compatibility (and the speed difference between cca and the unweighted methods disappeared). Early in developing the C code I still had the tools for re-weighting, but they may not have been final and polished. The re-weighting scheme was removed in commits bfc211e and 9a98a8b; relevant comments and changes are also in e78bbee. However, I have no idea whether this re-weighting scheme was correct and would fix the issues Cajo raises. If there is a test set in Cajo's paper or elsewhere, we could check whether vegan 2.4-6 fares better than the current version. Release 2.5-1 was the first with the current unweighted scheme.


jarioksa commented Oct 31, 2022

Here is a quick test that verifies that the current anova for CCA is biased in vegan. P-values from a random model should have a uniform distribution in an unbiased test and show a linear relationship in a Q-Q plot against the uniform distribution. I tested this with the mite data set of vegan: it has large variation in site totals and so should expose cases where re-weighting matters. It is obvious that the tests with the current release version of vegan (and all versions from 2.5-1) produce too many low P-values.

There is no similar problem in unweighted analysis (RDA), and indeed, vegan 2.4-6 also seems to be unbiased. The tests became biased when I implemented them in C and dropped the weighting, as described in the commit messages cited in my previous comment.

I don't remember whether I dropped the re-weighting scheme before getting it to work. However, building blocks for re-weighted simulations should be available in old GitHub commits.

[Figure "ccabias": Q-Q plots of permutation P-values against the uniform distribution]
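As a sanity check of the diagnostic itself, here is a vegan-independent toy version of the same idea in base R (made-up data and a simple correlation statistic instead of the CCA pseudo-F, so this is only an illustration, not the test above): under the null hypothesis, permutation P-values should be roughly uniform, so their mean is near 0.5 and about 5% of them fall below 0.05.

```r
## Toy illustration of the Q-Q diagnostic (base R only, not vegan):
## under the null, permutation P-values should be ~uniform, so a biased
## test shows up as an excess of low P-values.
set.seed(42)
nsim <- 200; n <- 20; nperm <- 99
pvals <- replicate(nsim, {
    x <- rnorm(n); y <- rnorm(n)            # null model: x and y independent
    obs <- abs(cor(x, y))                   # test statistic
    perm <- replicate(nperm, abs(cor(sample(x), y)))
    (1 + sum(perm >= obs)) / (1 + nperm)    # permutation P-value
})
mean(pvals)            # should be close to 0.5
mean(pvals <= 0.05)    # should be close to 0.05
## qqplot(ppoints(nsim), pvals) should be roughly linear for an unbiased test
```

In the real check, the P-values come from repeated anova() calls on CCA fits with randomized predictors instead of this toy statistic.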


jarioksa commented Nov 4, 2022

Cajo's paper concerns CCA, i.e. permutation in weighted ordination. The crux is that Canoco and vegan 2.5-1 to 2.6-4 ignored the weights and just permuted the internal matrices ("working data"). However, when we permute predictors, they should be re-weighted using the weights of their new sampling units. I have now implemented this in branch biased-anova-cca, and the new results seem to be unbiased under the null model. This means that permutation P-values have a uniform distribution with randomized or random predictors. The implementation differs from the one Cajo outlined, because we use a completely different algebraic tool set (QR decomposition, different handling of predictors in parallel models), although the results are equal.
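For intuition, here is a minimal base-R sketch of that re-weighting step with made-up data (the actual vegan implementation works on QR decompositions in C; this only illustrates the algebra of permuting Y, X and w together and residualizing X on a re-weighted Z):

```r
## Sketch of re-weighted residualized-predictor permutation (toy data, not vegan code)
set.seed(1)
n <- 10
Y <- matrix(rnorm(n * 3), n)       # community data (toy)
X <- matrix(rnorm(n * 2), n)       # constrained predictors
Z <- cbind(1, rnorm(n))            # conditioning variables (partial model)
w <- runif(n, 0.2, 2)              # CCA row weights

perm <- sample(n)                  # permute Y, X and w with the same permutation
Yp <- Y[perm, , drop = FALSE]
Xp <- X[perm, , drop = FALSE]
wp <- w[perm]

## residualize the permuted X on Z under the permuted weights
beta <- qr.solve(sqrt(wp) * Z, sqrt(wp) * Xp)   # weighted least squares
Xres <- Xp - Z %*% beta
max(abs(crossprod(Z, wp * Xres)))  # ~0: residuals are wp-orthogonal to Z
```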

Up to CRAN release 2.4-6, vegan permutation tests had re-weighting (and were not Canoco-compliant); the first version with the current non-reweighted permutations was 2.5-1 (Apr 14, 2018). This switch was made together with the move from R code to C in the permutation tests. I had first started to implement re-weighted permutation in C, but then decided to go for simpler Canoco-compliant code and removed the re-weighting code. So the greatest change in this branch was reverting those two commits in 0fbded3 and 740f434. This gave working compiled code, but permutest.cca had changed so much that it needed further changes. I got so many confusing test results that I changed things that should not have been touched. However, it now seems that the code is good except for final tests and cleaning up. If at all possible, please test the code to see that it really works.

In this process I also found a noteworthy inefficiency in the design of permutest.cca for parallel processing. The C code is written to analyse several permutations in one function call, but parallel execution submitted only one permutation per call, so the C loop was run only once. This put considerable overhead on launching and setting up the function, and in many cases parallel execution was clearly slower than the non-parallel sequential execution that could process all permutations with one call. In my microbenchmark tests with vegan data sets, parallel processing with 4 cores could take 50% longer than non-parallel processing. In the revised code, permutation matrices are submitted in chunks, one per thread, so that the data need to be set up only once and all permutations in a chunk can be run in a loop within the C code. However, on my computer this is hardly faster – but neither is it much slower (commit 882f48b). You may want to have a look at this performance issue as well.
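The chunking idea can be sketched in plain R (here lapply stands in for the parallel workers, and a placeholder statistic stands in for the C code that receives each chunk of the permutation matrix via .Call):

```r
## Chunked dispatch: one call per worker, each worker loops over its chunk
set.seed(1)
n <- 10; nperm <- 999; ncores <- 4
permat <- t(replicate(nperm, sample(n)))          # one permutation per row
chunks <- split(seq_len(nperm),
                cut(seq_len(nperm), ncores, labels = FALSE))
stat <- function(p) sum(p * seq_len(n))           # placeholder test statistic
res <- unlist(lapply(chunks, function(idx)        # one call per chunk...
    apply(permat[idx, , drop = FALSE], 1, stat))) # ...loop inside the call
length(res)   # nperm results from only ncores calls
```

The earlier design corresponds to lapply over single rows of permat: nperm calls with per-call setup cost, which is what made parallel runs slower than serial ones.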

There is still one issue that needs scrutiny and a fix, though it seems we can live with it – if we are careful. The C function changes its input, and if it is called repeatedly with the same data, the data will change and the results will drift. This happens only in weighted analysis (CCA), and I haven't yet found the reason. This drift was what took most of my time: I have parallel processing as the default in my working environment, and my tests consistently failed, so I was looking for the reason in the C code (which proved to be correct) or in permutest.cca. Only after finding out that non-parallel runs gave unbiased results was I back on the right track, and I finally identified the inefficient repeated one-permutation calls in the parallel code as the problem. However, the C code should be fixed so that it does not change data in the user's environment.

Update 5/11/22: We cannot live with this. Tests also fail with anova(<cca-result>, by = "axis"), which makes repeated calls to permutest, and these change the model matrix X. Got to dig up the problem and really fix it.

Update 6/11/22: The QR decomposition must be duplicated in src/getF.c. QR decompositions are modified in the permutation test of CCA (weighted analysis), and the modified QR components will replace the originals in the ordination object unless duplicated (commit 2593333). Moreover, wcentre no longer overwrites its input matrix and is safer to use (commit 742cfa9).

@jarioksa (Contributor)

I have finally finished the upgraded anova.cca/permutest.cca. This version passes all test cases in Cajo's papers. The previous versions from last year already passed all but one test: the Pinho data set. The upgrade only concerns anova.cca, where the data must be re-weighted by the CCA row weights when the weights change. The previous version already almost did this, but the Pinho data has extremely variable weights: from 35 to 36898, or more than a 1000-fold range.

[Figure "newCCApermutation": distribution of permutation P-values with the upgraded code]

Cajo's test only concerned coverage: 5% of randomized P-values should be at P < 0.05. I also looked at the full distribution of the permutation P-values, which should be uniform with randomized data.

I will merge this change into the master branch, but I will still have a look at two issues (and more may appear in testing):

  1. I now make the permutation test that Cajo outlined the main approach: permute the response Y, predictors X and weights w with the same permutation, re-weight Z for the permuted weights, and find the residual predictors of X (permuted) on Z (re-weighted by the permuted w). Cajo also suggested another approach where you only permute the de-weighted X, re-weight the permuted X by the weights w, and find the residualized X. This could be simpler to implement and may also be faster.
  2. I was first wondering why you need to keep [Y Z w] together against [X]. I think this is because the default is to use the reduced model in permutation. That is, we permute Residuals(Y ~ Z), and therefore we need to keep Y and Z in identical order – even when randomized models permuted the input [X Z] (as in the graph in this message). However, vegan also has the direct model where we permute Y directly, ignoring the effect of Z, and there we may need the pairing [Y w] vs [X Z]. This may be inspected later; as soon as we can verify that the current design (or its alternative implementation) works and gives unbiased P-values, we should push a new CRAN release of vegan.
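To make the alternative scheme in point 1 concrete, here is a base-R sketch under my reading of it, with made-up data (the de-weighting convention here is an assumption and may differ from vegan's internals):

```r
## Alternative scheme (sketch): de-weight X, permute it, re-weight by the
## original w, then residualize on Z by weighted least squares. Toy data.
set.seed(2)
n <- 10
X <- matrix(rnorm(n * 2), n)       # weighted working predictors (toy)
Z <- cbind(1, rnorm(n))            # conditioning variables
w <- runif(n, 0.2, 2)              # CCA row weights

perm <- sample(n)
Xp <- sqrt(w) * (X / sqrt(w))[perm, , drop = FALSE]  # de-weight, permute, re-weight
beta <- qr.solve(sqrt(w) * Z, sqrt(w) * Xp)          # weighted least squares on Z
Xres <- Xp - Z %*% beta                              # residualized predictors
max(abs(crossprod(Z, w * Xres)))                     # ~0: w-orthogonal to Z
```

Note that here only X moves: Y, Z and w stay in their original order, which is what could make this variant simpler and faster than permuting Y, X and w together.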

jarioksa added a commit that referenced this issue Mar 11, 2023

This implements re-weighted permutation tests with residualized
permutations as suggested by ter Braak, C.J.F. & te Beest, D.E. Testing
environmental effects on taxonomic composition with canonical
correspondence analysis: alternative permutation tests are not equal.
Environ Ecol Stat 29, 849–868 (2022).

This also fixes some smaller details, such as a clumsy interface that made
parallel execution slower than serial execution.