Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scoping issue in anova.ccabymargin #100

Closed
gavinsimpson opened this issue Feb 23, 2015 · 13 comments
Closed

Scoping issue in anova.ccabymargin #100

gavinsimpson opened this issue Feb 23, 2015 · 13 comments
Labels
Milestone

Comments

@gavinsimpson
Copy link
Contributor

Not sure if we can do much about this now or in the future, but I'm recording it here in case we update something in the future?

The following example illustrates a scoping bug/infelicity that, ideally, we'd be able to handle in vegan

library("vegan")
data(dune, dune.env)
foo <- function(x, env) {
    m <- rda(x ~ Manure + A1, data = env)
    anova(m, by = "margin")
}
out <- lapply(dune, foo, env = dune.env)

which fails with:

> out <- lapply(dune, foo, env = dune.env)
Error in eval(expr, envir, enclos) : object 'x' not found

the traceback() is:

> traceback()
18: eval(expr, envir, enclos)
17: eval(expr, p)
16: eval.parent(specdata, n = envdepth)
15: ordiParseFormula(formula, data, na.action = na.action, subset = substitute(subset))
14: rda.formula(formula = x ~ A1, data = env)
13: rda(formula = x ~ A1, data = env)
12: eval(expr, envir, enclos)
11: eval(call, parent.frame())
10: update.default(object, paste(".~.-", nm))
9: update(object, paste(".~.-", nm))
8: permutest(update(object, paste(".~.-", nm)), permutations, ...)
7: FUN(c("Manure", "A1")[[1L]], ...)
6: lapply(trmlab, function(nm, ...) permutest(update(object, paste(".~.-", 
       nm)), permutations, ...), ...)
5: anova.ccabymargin(object, permutations = permutations, model = model, 
       parallel = parallel, scope = scope)
4: anova.cca(m, by = "margin")
3: anova(m, by = "margin") at #3
2: FUN(X[[1L]], ...)
1: lapply(dune, foo, env = dune.env)

The obvious workaround is

outs <- vector(mode = "list", length = NCOL(dune))
for (i in seq_len(NCOL(dune))) {
    resp <- dune[, i, drop = FALSE]
    m <- rda(resp ~ Manure + A1, data = dune.env)
    outs[[i]] <- anova(m, by = "margin")
}

or, retaining lapply():

bar <- function(i, env) {
    m <- rda(dune[,i] ~ Manure + A1, data = env)
    anova(m, by = "margin")
}
out <- lapply(seq_len(NCOL(dune)), bar, env = dune.env)
@jarioksa
Copy link
Contributor

There are numerous scoping issues in cca/rda/capscale/prc and their anova.cca* methods. Functions capscale and prc are particularly problematic because they call rda so that rda is nested one level deeper than usually. Then anova.cca* methods update a model formula, and that is nested at different depths depending on whether they update rda or capscale result. The depth of nesting is also the problem in your lapply example. There are now numerous kluges to handle these cases. For instance, ordiGetData is nothing but a kluge, and interpretation in ordiParseFormula is very fragile. Some new problems were generated in vegan 2.2-0 when we changed to permute package, and some tests in tests/vegan-tests.R no longer passed and were commented out.

These scoping issues should be fixed. I know this can be difficult, and there should be regression tests in place to see that nothing is broken when fixing some corners. I fixed one very bad one in https://github.com/jarioksa/vegan/tree/scoping-issue but many other should be fixed. This may require rethinking and simplification of the whole formula interface in constrained ordination.

@jarioksa
Copy link
Contributor

Incidentally, the example that @gavinsimpson gave works after jarioksa@b858209 that I made to fix another issue yesterday. out is a long list, but here one species from the result:

> out$Poatriv
Permutation test for rda under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 999

Model: rda(formula = x ~ Manure + A1, data = env)
         Df Variance      F Pr(>F)   
Manure    4   4.7257 5.7006  0.007 **
A1        1   0.0153 0.0736  0.802   
Residual 14   2.9014                 

Perhaps I need to merge this.

@jarioksa
Copy link
Contributor

Here one example that still fails. This is based on tests/vegan-tests.R where we have a working variant. However, this fails:

X <- matrix(rnorm(30*6), 30, 6)
A <- factor(rep(rep(c("a","b"), each=3),5))
C <- factor(rep(c(1:5), each=6))
mod <- rda(X ~ A + Condition(C))
anova(mod, by="margin")
##Error in model.frame.default(P.formula, data, na.action = na.pass, xlev = zlev) : 
##  object is not a matrix
mod <- rda(X ~ A + C)
anova(mod, by="term")
##Error in model.frame.default(formula, data, na.action = na.pass, xlev = xlev) : 
## invalid type (closure) for variable 'C'
anova(mod, by="margin")
##Error in model.frame.default(formula, data, na.action = na.pass, xlev = xlev) : 
##  object is not a matrix
mod <- rda(X ~ C)
anova(mod, by="term")
##Error in model.frame.default(formula, data, na.action = na.pass, xlev = xlev) : 
##  object is not a matrix
anova(mod, by="margin")
##Permutation test for rda under NA model
##Marginal effects of terms
##Permutation: free
##Number of permutations: 999
##
##Model: rda(formula = X ~ C)
##        Df Variance      F Pr(>F)
##C         4   0.9696 1.2446  0.201
##Residual 25   4.8688          

The object that is not a matrix but a closure is C: it is a function stats::C that sets contrasts to factors. However, the last case succeeds, but why?

@eduardszoecs
Copy link
Contributor

Related: #16

@jarioksa
Copy link
Contributor

Yes, and the case in issue #16 still persists. Issue #16 also has some longish discussion and analysis that I already forgot. Scoping is painful.

@gavinsimpson
Copy link
Contributor Author

@jarioksa If we were to rethink the formula interface and rewrite that code, would you consider using the Formula package, such that we can have a two-part formula, one for the constraints and one for the conditional (partial) constraints (aka covariables in Canoco, and our current Condition())? Such a formula would look like:

y ~ x1 + x2 | z1 + z2

where xi are the constraints and the zi are the conditional constraints.

I've been thinking about this for a while as one potential alternative way to specify the formula which seems more natural to me, is similar to the way random effects are specified, and we can leverage a widely-used, small, well-tested package for the interface/processing of the formula. Using Formula would add a dependency.

Secondly, would it make sense to separate the computational code for say RDA or CCA into simpler, internal, unexported functions, which implement only one aspect of what current rda() or cca() do now. For example, we'd have .fitPCA() and .fitCA() functions if rda() or cca() were called only with a response matrix/data frame. Likewise .fitRDA() and .fitCCA() for the case of response and constraints and .fitpRDA() and .fitpCCA() if we have conditions as well as constraints? (I'm not wedded to the names of these functions, they are just examples.)

The formula interface would then only exist to resolve the data required to fit the model, then call the foo.default() method, which depending on how it was called would do some common operations but pass out to the relevant internal function depending on what data was specified in the formula.

We could also consider exposing these internal functions if other package writers want to build upon these simpler fitting functions. In my analogue package for example, I wanted a quicker version of rda() and even though I depend on vegan I ended up cooking up my own RDA function, copying just the numerical steps from rda() that I needed (no Z, no need for checking as I'd done this in my function and wanted to call this RDA function repeatedly in a loop, etc.)

@jarioksa
Copy link
Contributor

Quite a few issues here, @gavinsimpson

  1. When I first implemented partial ordination, my desire was to use the ~ x | z syntax, but then I did not know how to do it. The formula interpreter had specials and that was what I used (Condition is a case of specials). The problems we have are not about interpreting formula, but about scoping of items in the formula. I don't know the Formula package, but I don't assume that they have a better implementation of scope. Moreover, they would push scoping beyond our control. Still I'd prefer | to Condition(). BTW, have you checked lm? It was several years ago when I checked lm to copy the scoping patterns to vegan, but I found out that lm failed in more cases than cca etc. If they do better now, we could try to adopt their strategies.
  2. I have considered factorization of cca for a long time, but I didn't want to break things that work. I think that this could be done, and it would be really good, but would take several release to get back the things that work now. You need more modules than you outlined. You listed the numerical core routines, but you also need pre-processing and post-processing. This is something that I would like to have, but to build it you first need to break what we have now. Anybody is welcome to do it. I would be really delighted. However, I have enough trouble without doing that myself.
  3. There really is a lot of overhead in rda etc, and faster and simpler things are often needed. In fact, permutest.cca fast-codes just the things you need to evaluate the significance. In addition, we have function simpleRDA2 that we use in varpart for lightweight and fast analysis of some basic things. However, I do think that combining such fast shortcuts must be done separately, and it may not be possible or sensible to build rda, cca etc so that one of their core routines would be such a shortcut.

@gavinsimpson
Copy link
Contributor Author

@jarioksa followups:

  1. As I understand it, with Formula you can rely on their canned model.frame() methods which work separately on the two parts of the formula, or you can extract the formula() class objects for each part separately and use the default or your own methods to build the model frames. Hence we should be able to use our own scoping code if required to locate the relevant data.

    I suppose I could have a go at this in a separate branch as it would only involve adding an Import on Formula and adding a method cca.Formula(), so it wouldn't interfere with the existing code, plus if we did think this worked and were happy with the new approach we could retain backwards compatibility by leaving the cca.formula methods as they are. Backwards compatibility was something I was concerned with when i started thinking about changing the formula interface.

  2. I envisaged the pre- and post-processing as happening in the foo.default() methods, but I see that having those coded as separate functions to be called as needed would also be useful. I guess the only way to know if this is worth it is to do it. I may take a look at this when I have some spare time in the evenings later in the year.

I guess what I wanted to convey was that if we were going to try to reimplement the scoping, we might consider some larger changes as part of that process.

@jarioksa
Copy link
Contributor

Currently the errors come form model.frame. Formula can help in adding a new and improved UI, but it may not help at all in the scoping issues.

Factoring rda & cca should not be too difficult. They are already 3x almost the same code in one function body. The difference between cca and rda is weighting. So the basic brick would be weighted RDA which with proper weights would be equal to CCA. example(wcmdscale) shows one approach to the unified UI. Must be in separate branch, though.

@gavinsimpson
Copy link
Contributor Author

Currently the errors come form model.frame. Formula can help in adding a new and improved UI, but it may not help at all in the scoping issues.

Agreed, but if we're going to do one (fix or work on the scoping) we could look at the other (improving the UI) as part of those changes.

I'm currently digesting what Hadley Wickham has to say about non-standard evaluation and robust ways of implementing this from his Advanced R book. I'll take a look at ordiGetData, ordiParseFormula(), and the current code we use, plus other implementations in R (like lm()) and see if I can see any areas where we might improve how we do things.

@jarioksa
Copy link
Contributor

The Condition() was introduced in vegan 1.4-0 on May 3, 2002. Think about that when you plan to break a UI.

@gavinsimpson
Copy link
Contributor Author

@jarioksa Agreed. That's why I was happy to see that I could implement a cca.Formula() method independently of the rest of the code base, with a clean formula processing basis, without affecting the other parts of vegan. Once I'd implemented that bit I would then need to duplicate the anova.cca behaviour as separate functions, with tests against all the use cases we have for failures due to scoping issues. Only once I had all that sorted out would I consider adding this to the master branch and I would not consider removing the current behaviour, so I'd need to preserve that internally, perhaps by dispatching to the anova.cca family if the formula of a fitted model was a standard formula, and to new code if it was a Formula formula.

Hopefully, by starting from scratch with a clean implementation we might identify the scoping issues which could be backported to the older interface.

As I mentioned earlier in this thread; I don't want to break the current UI, just add to it if possible and only if it helps.

@jarioksa
Copy link
Contributor

PR #104 seems to fix all these issues, as well as issue #16.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants