Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide less verbose output in summary.foo for ordinations #203

Open
gavinsimpson opened this issue Oct 5, 2016 · 25 comments
Open

Provide less verbose output in summary.foo for ordinations #203

gavinsimpson opened this issue Oct 5, 2016 · 25 comments
Assignees

Comments

@gavinsimpson
Copy link
Contributor

@tpoisot comment on Twitter how verbose the outputs from some summary methods are for ordination models. These don't so much as summarise a fit but provide detailed output including lots of axis score data.

I copied vegan when writing summary() methods for my cocorresp pkg and am not changing those because the verbosity of output just isn't useful to anyone.

Would anyone freak out too much if I took a look at vegan's summary methods for ordination objects and propose changes that would shorten them?

@jarioksa
Copy link
Contributor

jarioksa commented Oct 5, 2016

summary.cca is from 2002, and hindsight says that it could have been made differently. Coding style was not so very standardized then, and I also thought there must be a method that shows there are scores, too. However, in that time it was the standard that S3 methods have both print and summary if nothing else. I think we could have a more compact summary, but the R standard still seems to be that in most (but not in all) cases summary is more verbose then print. That makes sense because nobody uses print() explicitly, but print just happens if you do nothing.

For the record, here a directory listing of cca methods in vegan_1.4-0 (May 3, 2002):

anova.cca.R     plot.cca.R      print.permutest.cca.R   summary.cca.R
cca.R           print.anova.cca.R   print.summary.cca.R
permutest.cca.R     print.cca.R     scores.cca.R

Most summarizing functions were introduces only later (inertcomp, spenvcor, eigenvals etc). However, we do have scores and therefore we hardly need them repetead in summary (as long as we tell users that they do exist). I'll be happy to accept a new, more compact summary.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 5, 2016

Here some summarizing vegan functions that could be a part of new summary.cca: eigenvals (already is), goodness, inertcomp, intersetcor, RsquareAdj, spenvcor, vif.cca. I think summary could also give hints to methods accessing the results (use 'scores' to see the scaled scores, use 'plot' for graphical display etc.), although this is non-standard. However, I think people often do not realize what all they can do with the results.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 6, 2016

First we should decide what we want to have in summary and then we should put it there. Now it is essentially eigenvals + scores. I was thinking this overnight when I slept, but I really don't know what else there should be: the other diagnostic functions really don't fit there (and some of them exist just for legacy compatibility and should not be used). The quickest hack is to keep summary like it is, but change print.summary.cca to work like head.summary.cca.

I'm satisfied with the current print.cca and I don't want it to change. The print should be short because nobody uses print: it is what you get when you type a -- when you type summary(a) you should get more voluminous output than when you just type a.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 6, 2016

A quick fix to cut down the volume but keep the current design is

diff --git a/R/print.summary.cca.R b/R/print.summary.cca.R
index 180220e..a59fdeb 100644
--- a/R/print.summary.cca.R
+++ b/R/print.summary.cca.R
@@ -1,5 +1,5 @@
 `print.summary.cca` <-
-    function (x, digits = x$digits, head=NA, tail=head, ...) 
+    function (x, digits = x$digits, head=3, tail=head, ...)
 {
     hcat <- function(x, head=head, tail=tail, ...) {
         if (!is.na(head) && !is.na(tail) && head+tail+4 < nrow(x))
diff --git a/R/print.summary.decorana.R b/R/print.summary.decorana.R
index 1326697..d481702 100644
--- a/R/print.summary.decorana.R
+++ b/R/print.summary.decorana.R
@@ -1,5 +1,5 @@
 `print.summary.decorana` <-
-function (x, head=NA, tail=head, ...) 
+function (x, head=3, tail=head, ...)
 {
     digits <- x$digits
     hcat <- function(x, head=head, tail=tail, ...) {
diff --git a/man/decorana.Rd b/man/decorana.Rd
index 686d2e4..84dcafe 100644
--- a/man/decorana.Rd
+++ b/man/decorana.Rd
@@ -30,7 +30,7 @@ decorana(veg, iweigh=0, iresc=4, ira=0, mk=26, short=0,
 \method{summary}{decorana}(object, digits=3, origin=TRUE,
         display=c("both", "species","sites","none"), ...)

-\method{print}{summary.decorana}(x, head = NA, tail = head, ...)
+\method{print}{summary.decorana}(x, head = 3, tail = head, ...)

 downweight(veg, fraction = 5)

diff --git a/man/plot.cca.Rd b/man/plot.cca.Rd
index 2fe4db5..8e090c6 100644
--- a/man/plot.cca.Rd
+++ b/man/plot.cca.Rd
@@ -35,7 +35,7 @@
         display = c("sp", "wa", "lc", "bp", "cn"),
         digits = max(3, getOption("digits") - 3),
         correlation = FALSE, hill = FALSE, ...)
-\method{print}{summary.cca}(x, digits = x$digits, head = NA, tail = head, ...)
+\method{print}{summary.cca}(x, digits = x$digits, head = 3, tail = head, ...)
 \method{head}{summary.cca}(x, n = 6, tail = 0, ...)
 \method{tail}{summary.cca}(x, n = 6, head = 0, ...)
 }

I haven't committed these changes. It is unnecessary, since the change can be tested using print(summary(<ordination_result>), head = 3). If this looks like an improvement, the change in default can be made. If something more fundamental is desired, then there is no need to this change. I leave this up to discussion.

@gavinsimpson
Copy link
Contributor Author

I was thinking along the lines of:

print.cca

Call: cca(formula = dune ~ Management, data = dune.env)

              Inertia Proportion Rank
Total          2.1153     1.0000     
Constrained    0.6038     0.2855    3
Unconstrained  1.5114     0.7145   16
Inertia is mean squared contingency coefficient

print.summary.cca

Call: cca(formula = dune ~ Management, data = dune.env)

              Inertia Proportion Rank
Total          2.1153     1.0000     
Constrained    0.6038     0.2855    3
Unconstrained  1.5114     0.7145   16
Inertia is mean squared contingency coefficient

Importance of components:
                        CCA1    CCA2    CCA3    CA1     CA2     CA3     CA4
Eigenvalue            0.3186 0.18247 0.10274 0.4474 0.20300 0.16301 0.13457
Proportion Explained  0.1506 0.08626 0.04857 0.2115 0.09597 0.07706 0.06362
Cumulative Proportion 0.1506 0.23690 0.28547 0.4970 0.59294 0.67000 0.73362
                          CA5     CA6     CA7     CA8     CA9    CA10   CA11
Eigenvalue            0.12940 0.09494 0.07904 0.06526 0.05004 0.04321 0.0387
Proportion Explained  0.06117 0.04488 0.03737 0.03085 0.02366 0.02043 0.0183
Cumulative Proportion 0.79479 0.83968 0.87704 0.90790 0.93155 0.95198 0.9703
                         CA12    CA13     CA14     CA15     CA16
Eigenvalue            0.02385 0.01773 0.009172 0.007959 0.004157
Proportion Explained  0.01128 0.00838 0.004340 0.003760 0.001970
Cumulative Proportion 0.98155 0.98994 0.994270 0.998030 1.000000

i.e. abbreviate current print output to remove eigenvals output and extend the print output via summary(eigenvals(foo)) for summary methods.

@gavinsimpson
Copy link
Contributor Author

An alternative would be to keep the current print behaviour (showing eigenvalues) with summary method extending these to include variance explained etc.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 12, 2016

I think that the length of print.cca is rarely a problem. We do already truncate the vectors of eigenvalues if these tend to become very long. We don't do this to constrained eigenvalues. I think that the number of constraints should be pretty small in general, and if there is a huge number of constrained axes, this is a user problem and I don't feel very helpful in this self-caused mess. The only reason cutting the print.cca output is to make space for more compact print.summary.cca. If there is no space for print.summary.cca, why not throwing away the whole summary.cca instead of crippling print.cca? I'm quite ready for this. I think print.cca shows now the essential information without need of typing any other commands, and it is a disservice to people to force them to type those extra commands just to justify a more compact print.summary.cca.

I would suggest deprecation of summary.cca via changing print.summary.cca to default to head=2 or head=3.

Currently summary.cca is only a combination of print.cca tabulation, summary(eigenvals()), summary.eigenvals(..., constrained = TRUE)), scores.cca plus informative text on the type of scaling of scores. Nothing very unique, and nothing will be lost if we remove summary.cca. Please note that also the two summary.eigenvals() tabulations can be really long.

A historic note about having summary.cca in the first place. In those early times R and the small number of packages (about 120 when vegan was released) used to have two functions to display the results: print for a brief presentation, and summary for more detailed one. The first ordination method in vegan was decorana in release 1.2-0 in November 2001. Its print looked like this (it still works with the current decorana):

Call:
decorana(veg = dune) 

Detrended correspondence analysis, with 26 segments.
Rescaling of axes with 4 iterations.

               DCA1   DCA2   DCA3   DCA4
Eigenvalues  0.5117 0.3036 0.1213 0.1427
Axis lengths 3.7004 3.1166 1.3005 1.4789

This shows the essential information on call, used options and the results, including diagnostic statistics on axes. The decorana object was not very large, and its summary showed its ordination scores which are the most of the object. When cca was released one year later (release 1.4-0 in May 2002), it was modelled similarly: the print showed the same information. It looked something like this:

Call:
cca(X = varespec, Y = varechem[, 1:3])

Mean squared contingency coefficient: 2.0832 
                         Constrained: 0.4464   ( rank 3 )
                       Unconstrained: 1.6368   ( rank 20 )

Eigenvalues for constrained axes:
  CCA1   CCA2   CCA3 
0.1931 0.1627 0.0906 

Eigenvalues for unconstrained axes:
    CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8 
0.44953 0.28701 0.18772 0.16748 0.12805 0.10497 0.07497 0.06287 
(Showed only 8 of all 20 unconstrained eigenvalues)

(This does not quite work with modern cca which has Call in different place.) There is more information and more axes than for decorana, but the output is comparable. What probably was unfortunate was that the style of summary also was copied from decorana. The summary did not much bother me mainly because I didn't use it. After one meeting with you in a course, I added head and tail just for you to allow making nicer Sweave tutorials (release 1.15-1, December 2008). However, now I think we could let summary.cca go, and probably summary.decorana and some others could go along. All information in summary can be accessed by other means. If we do not want to remove it completely, we could just tell people to use various tools to extract information from the ordination object. Perhaps something like:

"summary.cca" <-
    function(object)
{
    structure(
        list("class" = class(object),
             "methods" = methods(class = "cca")),
        class = "summary.cca")
}

"print.summary.cca" <-
    function(x, ...)
{
    cat("object of class", paste(x$class, collapse = ", "), "\n\n")
    cat("use following methods to extract or analyse the result:\n\n")
    print(x$methods)
}

@eduardszoecs
Copy link
Contributor

eduardszoecs commented Oct 12, 2016

I would keep the summary function.
Perhabs it is enough to change the default to display=NULL?

> summary(mod, display = NULL)

Call:
cca(formula = dune ~ Management, data = dune.env) 

Partitioning of mean squared contingency coefficient:
              Inertia Proportion
Total          2.1153     1.0000
Constrained    0.6038     0.2855
Unconstrained  1.5114     0.7145

Eigenvalues, and their contribution to the mean squared contingency coefficient 

Importance of components:
                        CCA1    CCA2    CCA3    CA1     CA2     CA3     CA4     CA5     CA6
Eigenvalue            0.3186 0.18247 0.10274 0.4474 0.20300 0.16301 0.13457 0.12940 0.09494
Proportion Explained  0.1506 0.08626 0.04857 0.2115 0.09597 0.07706 0.06362 0.06117 0.04488
Cumulative Proportion 0.1506 0.23690 0.28547 0.4970 0.59294 0.67000 0.73362 0.79479 0.83968
                          CA7     CA8     CA9    CA10   CA11    CA12    CA13     CA14     CA15
Eigenvalue            0.07904 0.06526 0.05004 0.04321 0.0387 0.02385 0.01773 0.009172 0.007959
Proportion Explained  0.03737 0.03085 0.02366 0.02043 0.0183 0.01128 0.00838 0.004340 0.003760
Cumulative Proportion 0.87704 0.90790 0.93155 0.95198 0.9703 0.98155 0.98994 0.994270 0.998030
                          CA16
Eigenvalue            0.004157
Proportion Explained  0.001970
Cumulative Proportion 1.000000

Accumulated constrained eigenvalues
Importance of components:
                        CCA1   CCA2   CCA3
Eigenvalue            0.3186 0.1825 0.1027
Proportion Explained  0.5277 0.3022 0.1701
Cumulative Proportion 0.5277 0.8299 1.0000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions

Perhabs, if display = NULL, the last three lines could be skipped?
And display only the first 10 components?

So that it looks like:

Call:
cca(formula = dune ~ Management, data = dune.env) 

Partitioning of mean squared contingency coefficient:
              Inertia Proportion
Total          2.1153     1.0000
Constrained    0.6038     0.2855
Unconstrained  1.5114     0.7145

Eigenvalues, and their contribution to the mean squared contingency coefficient 

Importance of components:
                        CCA1    CCA2    CCA3    CA1     CA2     CA3     CA4     CA5     CA6
Eigenvalue            0.3186 0.18247 0.10274 0.4474 0.20300 0.16301 0.13457 0.12940 0.09494
Proportion Explained  0.1506 0.08626 0.04857 0.2115 0.09597 0.07706 0.06362 0.06117 0.04488
Cumulative Proportion 0.1506 0.23690 0.28547 0.4970 0.59294 0.67000 0.73362 0.79479 0.83968

Accumulated constrained eigenvalues
Importance of components:
                        CCA1   CCA2   CCA3
Eigenvalue            0.3186 0.1825 0.1027
Proportion Explained  0.5277 0.3022 0.1701
Cumulative Proportion 0.5277 0.8299 1.0000

But actually, I don't get @tpoisot complaints...

@gavinsimpson
Copy link
Contributor Author

@Edild the issue with current summary is most easily discerned if you have a data set of a thousand samples or, as I do now, 13000+ OTUs. No one wants or needs those scores dumped to the console.

@eduardszoecs
Copy link
Contributor

OK, but isn't that what display = NULL is for?

@jarioksa
Copy link
Contributor

@gavinsimpson : I really think there currently is a summary problem if you have thousands (or even hundreds) of axes. Even if we only have eigenvals, they also need a cutter to truncate the output.

@gavinsimpson
Copy link
Contributor Author

gavinsimpson commented Oct 12, 2016

@Edild That's a side effect; it's not the default so you get bitten by huge output until you've been bitten enough times to remember, assuming you work it out, to use display = NULL. If you want the scores, you should ask for them via scores().

@jarioksa re print.cca OK, agreed. I was perhaps too zealous in cutting it in order to get a trimmed summary output that was somewhat more substantial than the print method.

I don't think that the correct place to educate users about the availability of methods for ordination classed objects is the output of summary. If those are methods are not well-known, a simple HTML vignette illustrating their availability would be a useful addition. I would be happy to include a list of the types of scores available for an object, perhaps.

I don't think we need to remove summary.cca, so how about:

  • print.cca stays the same as it is now
  • print.summary.cca adds to print.cca:
    • the full output of summary.eigenvals(..., constrained = TRUE) and,
    • the truncated output of summary.eigenvals(...) for the unconstrained axes, truncating at say 8 unconstrained axes.

Optionally it would include a list of the types of scores available, say something like the way capabilities() outputs:

Site (WA) Site (LC)   Species    Biplot  Centroid 
     TRUE      TRUE      TRUE     FALSE      TRUE

It also strikes me that eigenvals() should gain argument unconstrained to control whether the unconstrained eigenvalues are returned or not. At the moment, you either get all the eigenvalues (constrained = FALSE) or just the constrained ones (constrained = TRUE)

@gavinsimpson
Copy link
Contributor Author

By adding to the print.cca output I mean rather than just showing eigenvalues summary should show the eigenvalues, proportion explained, cumulative explained data, split up by constrained and unconstrained axes, with the latter truncated at 8 axes.

@jarioksa
Copy link
Contributor

@gavinsimpson : it is indeed important to make a difference between summary and print.summary. The summary should be complete and precise because people may use it to extract data, but print.summary is for the display and it can truncate and round. This has been forgotten sometimes, even in base R:

    r71150 | maechler | 2016-08-25 22:57:19 +0300 (Thu, 25 Aug 2016) | 1 line
    summary.default() no longer rounds by default; just *prints* rounded

(This R fix will improve some of the vegan output and make some earlier tweaks no longer necessary.) Removing or radically changing summary.cca will break many things. Not necessarily in packages, but in the ways people work. Just a couple of hours after my summary rant I got an email where summary.cca was used to extract results.

The modern Big Data applications mean that we really should have some truncation in several print methods, even in things like eigenvals. People are also using constraints in the way that I didn't imagine, for instance with PCNM constraints. We do need some way of clean display for these cases, but that should always be only in print.anything.

@gavinsimpson
Copy link
Contributor Author

@jarioksa If people have used summary.cca to extract results, that's really their own fault and I will gladly incur their wrath for breaking those things if the result is we get a cleaner printed output from summary.cca. However, such a change must come in a point release, so 2.5-0, not before.

I'm not sold on truncating eigenvals output via print; I don't see the point in having specific print methods for this, and things like this, because if I want to show 10 not 8 axes (or whatever default we choose) I don't want to have to do print(eigenvals(mod), n = 10). In this case having eigenvals return an object that inherits from a vector or a matrix would solve all problems (assuming people use the language to manipulate things). (It currently doesn't work for summary.eigenvals because we return a list with a single component importance, which seems unnecessarily complicated.)

If you had a linear regression model with 100 coefficients, print.lm wouldn't truncate them after a certain number and I don't believe we should either. (print.summary.cca is different in case in unconstrained axes.)

@jarioksa
Copy link
Contributor

jarioksa commented Oct 12, 2016

@gavinsimpson : using summary to extract results is a standard R idiom -- that is the way you get information from an lm object, for instance. Actually, this was one of my first real R style lessons: Bill Venables complained me sometimes in 2002 that he could not extract some decorana results because summary.decorana only printed those instead of first extracting and then printing (I said that Ripley does the same in MASS, but Bill Venables told me that what suits Ripley does not suit me -- or "Quod licet Iovi, non licet bovi" in Latin). That is the reason why you have separate summary and print.summary, and print.summary.default if you don't have your own. The essential difference between summary and mere print should be that print shows only data that already are in the result, but summary may do calculations with the result to derive new results.

@gavinsimpson
Copy link
Contributor Author

@jarioksa I don't disagree with what you write. You should be able to extract the data as the underlying R objects that get printed via the print.summary.cca method. I'm fully aware of this convention and have written all my summary and print.summary methods over the years to work accordingly. The reasoning is clear here; summary methods do work, which seemingly is useful to the user if you are going to print that work (or a representation of it) to the console, and as such should return those objects for use by the user, with the print method doing any printing and associated formatting for print.

Where I disagree with the line of reasoning is, just because in the past people have used summary.cca to extract scores, that shouldn't be used as an argument to not change summary.cca. I agree that whatever we return via summary.cca should be the raw objects we will then print. I do not agree that these raw objects should be matrices of scores if we no longer choose to print them.

It's not like anyone is proposing to delete scores().

@jarioksa
Copy link
Contributor

@gavinsimpson , I agree with scores: you should not use summary to extract those. I was ready to let summary go, but I think that may break a plenty of user habits (even if not so many packages). However, scores output should be removed. Also, the eigenvals output could be truncated in extreme cases. The limit could be higher than in print.cca, though. There we print up to 16 axes, and if there are more, we print 8 with a truncation message (and I think it is good to have this convention of different truncation limit (higher) and printing limit with truncation (lower)).

@gavinsimpson
Copy link
Contributor Author

OK, let me implement something in a branch and do a PR. We can discuss the detail and implementation there referencing this thread/issue.

@jarioksa
Copy link
Contributor

@gavinsimpson : this is another open issue. Should something be done for the "imminent" 2.5-0 release?

@jarioksa
Copy link
Contributor

jarioksa commented Apr 4, 2024

@gavinsimpson I dropped scores from summary.cca. I think this was the minimum change suggested in this thread. Feel free to provide a different summary, but for me this is OK now.

@jarioksa
Copy link
Contributor

jarioksa commented May 7, 2024

The commit 85e3e28 of April 4, 2024 breaks BiodiversityR::caprescale which uses summary to get the scores. I have emailed to the maintainer (Roeland Kindt) with a suggested patch that switches to use scores instead of summary.

Found in vegan reverse dependencies tests.

@jarioksa
Copy link
Contributor

jarioksa commented May 7, 2024

Dropping scores form summary.cca also breaks bipartite. I have submitted a pull request. There are also problems in building vignettes in bipartiteD3. I haven't checked that issue yet, but it may be that it is caused by the bipartite problems.

Found in vegan reversed dependencies tests.

@jarioksa jarioksa reopened this May 7, 2024
@jarioksa
Copy link
Contributor

jarioksa commented May 7, 2024

pctax seems to use summary to get scores. I have committed a pull request in github.

Actually, old summary.cca provided a list of scores, and so does scores.cca even with the same names. Number of axes and kind of scores may need to be set explicitly.

@jarioksa
Copy link
Contributor

jarioksa commented May 11, 2024

Dropping scores broke four packages in reverse dependencies tests: BiodiversityR, bipartite, bipartiteD3 and pctax. All these used summary to get scores, and can be fixed by using scores. I have made pull requests for github packages and emailed patch files to the maintainers of other packages. I suspect there are more breaking packages, but they do not test these functions, and these are tedious to find because summary is such a common function.

To reduce the disruption with CRAN release, I modified summary.cca to return scores (invisibly) with the object in the cran-2.6-6 branch (commit cf075c5). This will give maintainers some time to adapt to the change – if they notice the need. However, this quick temporary fix fails in BiodiversityR. Looks like scoping issue: summary.cca sees the name of the dot variable, but cannot find its numeric value. The patch I suggested to the maintainer will fix the function (BiodiversityR::caprescale).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants