influence statistics for cca & rda (& deprecate as.mlm methods) #234

Closed
jarioksa opened this issue Apr 21, 2017 · 1 comment

@jarioksa jarioksa commented Apr 21, 2017

We have relied on the functions as.mlm.cca and as.mlm.rda to cast cca and rda result objects to multiple linear model objects (class "mlm"), and then obtained influence statistics from these with the mlm and univariate lm methods. I started to look at implementing these statistics directly. At first I did this out of intellectual curiosity, but then I understood that we can develop more adequate tools with dedicated methods (and I also found out that some of the stats methods gave wrong results for "mlm" objects in R).

Basically we need only three (or two) basic statistics to derive the main influence statistics:

  1. Hatvalues giving the leverage of each point. These are defined only by the constraints, and the dependent (community) data have no influence on them. They are also analogous to the hatvalues of univariate lm, so they are easy and unproblematic.

  2. Raw residuals. Basically these are the differences between observed and fitted values, but this concept is far from clear in constrained ordination. Currently I have implemented two alternatives: type = "response" uses the residuals of species after fitting the constraints, and type = "canoco" uses the differences of WA and LC scores. The problem with response residuals is that they have the same dimensions as the original community data, which can be too much for intuitive use. Further, they have no clear correspondence to ordination axes, because they are based on all constraints together. The canoco residuals are given for each axis and are much more manageable, but I fail to see what makes model$CCA$wa - model$CCA$u a meaningful residual. This difference is in no way related to the residual variation after constrained ordination.

  3. Residual standard deviation or sigma. This is actually not a completely independent statistic, because it should be consistent with the raw residuals (and I have now implemented it so that it is derived from the sum of squared raw residuals). For response residuals, it is based on the column sums of the squared residual community matrix, colSums(ordiYbar(model, "CA")^2), or, with the old "cca" class, on colSums(model$CA$Xbar^2). For canoco residuals, the sigma is based on colSums(model$CCA$wa^2) - 1 and has no clear relation to residual variation (but it has a very simple relationship to the species-environment correlation); it is not even monotonic with the constrained eigenvalues.
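
As a rough illustration of the three statistics above, here is a minimal base R sketch for the unweighted, RDA-like case with type = "response". It is not the code in the influence-cca branch: the simulated data and all object names (X, Y, Xc, Yc, Q, hat, res, sig) are assumptions made for this example, and the first response column is compared against univariate lm only as a sanity check.

```r
## Hypothetical data: 20 sites, 2 constraints, 4 "species" (all simulated)
set.seed(42)
n <- 20
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))   # constraints
Y <- matrix(rnorm(n * 4), nrow = n)         # stand-in for community data

## Centre both matrices, as in an unweighted RDA-like model
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)
Q  <- qr(Xc)                                # QR decomposition of constraints

## 1. Hatvalues: depend only on the constraints; 1/n accounts for centring
hat <- rowSums(qr.Q(Q)^2) + 1 / n

## 2. Raw "response" residuals: species residuals after fitting constraints
res <- qr.resid(Q, Yc)

## 3. Sigma per species, derived from the sum of squared raw residuals
rdf <- n - Q$rank - 1                       # residual degrees of freedom
sig <- sqrt(colSums(res^2) / rdf)

## Sanity check against univariate lm for the first species
fit1 <- lm(Y[, 1] ~ X)
all.equal(unname(hat), unname(hatvalues(fit1)))
all.equal(sig[1], summary(fit1)$sigma)
```

The hatvalues never touch Y, which is the point made in item 1; only the residuals and sigma change with the community data.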

Branch influence-cca has the functions hatvalues.cca, hatvalues.rda and sigma.cca, which are all methods for standard R generic functions that mainly work with lm objects. There is no dedicated function for raw residuals (at the time of writing); these are extracted within the specific functions. Based on these, I have now implemented rstandard.cca, rstudent.cca and cooks.distance.cca, which all work with both cca and rda results and all have an option to select type = "response" or "canoco". What we would really need is an option that combines the good sides of both types: one that gives influence statistics for an axis (or cumulative axes) like "canoco", but uses a meaningful residual error like "response".
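
To show how the derived statistics follow from hatvalues, raw residuals and sigma, the following base R sketch computes standardized residuals, Studentized residuals and Cook's distances for every response column at once, using the usual lm formulas. This is an illustration under assumptions (unweighted RDA-like model, "response" residuals, simulated data, hypothetical object names), not the actual rstandard.cca, rstudent.cca or cooks.distance.cca code.

```r
## Hypothetical data and the three basic statistics (all simulated)
set.seed(42)
n <- 20
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Y <- matrix(rnorm(n * 4), nrow = n)
Xc <- scale(X, scale = FALSE)
Yc <- scale(Y, scale = FALSE)
Q  <- qr(Xc)
hat <- rowSums(qr.Q(Q)^2) + 1 / n           # leverages
res <- qr.resid(Q, Yc)                      # raw "response" residuals
p   <- Q$rank + 1                           # constraints + intercept
rdf <- n - p                                # residual degrees of freedom
sig <- sqrt(colSums(res^2) / rdf)           # sigma per response column

## Standardized residuals: e_ij / (sigma_j * sqrt(1 - h_i))
rstd <- sweep(res, 2, sig, "/") / sqrt(1 - hat)

## Cook's distance: rstd^2 * h / ((1 - h) * p)
cook <- rstd^2 * hat / ((1 - hat) * p)

## Studentized residuals use a leave-one-out sigma for each observation
ss      <- matrix(colSums(res^2), nrow = n, ncol = ncol(res), byrow = TRUE)
sig.del <- sqrt((ss - res^2 / (1 - hat)) / (rdf - 1))
rstu    <- res / (sig.del * sqrt(1 - hat))

## Sanity check against univariate lm for the first response column
fit1 <- lm(Y[, 1] ~ X)
all.equal(unname(rstd[, 1]), unname(rstandard(fit1)))
all.equal(unname(rstu[, 1]), unname(rstudent(fit1)))
all.equal(unname(cook[, 1]), unname(cooks.distance(fit1)))
```

Each result has one column per species, which illustrates why "response" statistics become hard to inspect for real community data.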

I haven't implemented dfbeta and dfbetas functions. Their results are so large even in univariate cases that they would be unmanageable and unusable in multivariate cases.

I have also implemented the method function vcov.cca (using SSD.cca), which can be used to find the covariance matrix and standard errors of the coefficients of constraints. At the time of writing, the only option is type = "canoco", but adding "response" should be as simple as adding a choice of raw residuals in SSD.cca. These functions also produce large matrices which can be difficult to inspect. However, the diagonal of this matrix contains the variances of the coefficients, and coef(model)/sqrt(diag(vcov(model))) gives the same t-values for the coefficients as summary(as.mlm(model)). These are based on the "canoco" style sigma, though, which seems to make them rather meaningless. We really need a better sigma.
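
The relation between the sums of squares and products (SSD) matrix, the covariance matrix of coefficients and the t-values can be sketched in base R as follows. This assumes an unweighted RDA-like model with "response" residuals and simulated data; the object names (B, SSD, V, tval) are hypothetical, and the Kronecker construction mirrors how covariance matrices are usually built for "mlm" fits rather than reproducing vcov.cca.

```r
## Hypothetical data (all simulated)
set.seed(42)
n <- 20
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Y <- matrix(rnorm(n * 4), nrow = n)
Xc <- scale(X, scale = FALSE)
Yc <- scale(Y, scale = FALSE)
Q  <- qr(Xc)

B   <- qr.coef(Q, Yc)                 # coefficients: constraints x responses
res <- qr.resid(Q, Yc)                # raw "response" residuals
rdf <- n - Q$rank - 1                 # residual degrees of freedom

## Residual sums of squares and products, and the unscaled covariance
SSD    <- crossprod(res)
XtXinv <- solve(crossprod(Xc))

## Covariance matrix of all coefficients (response blocks on the diagonal)
V <- kronecker(SSD / rdf, XtXinv)

## t-values: coefficients divided by their standard errors
tval <- as.vector(B) / sqrt(diag(V))

## Sanity check: slopes of the first response against univariate lm
fit1 <- lm(Y[, 1] ~ X)
all.equal(tval[1:2], unname(summary(fit1)$coefficients[-1, "t value"]))
```

With 2 constraints and 4 responses V is already 8 x 8, so for real community data only the diagonal is practical to inspect.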

If these functions are merged into master, there really is no longer any need for the as.mlm methods and we could deprecate them. I think all usable statistics can be found with the proposed new functions, and we can also easily provide better alternatives than the inbuilt "canoco" type of as.mlm. Developing new features is also easier with the proposed structure than by relying on casting to "mlm" objects.

@jarioksa jarioksa added this to the 2.5-0 milestone May 17, 2017
@jarioksa jarioksa mentioned this issue May 18, 2017
12 of 12 tasks complete

@jarioksa jarioksa commented May 18, 2017

Closed with merges 8bdd218 (influence statistics for constrained ordination) and 6c24e43 (deprecate as.mlm).

@jarioksa jarioksa closed this May 18, 2017