
rrarefy: fix silent failure with numeric count table #259

Closed
rrohwer opened this issue Dec 1, 2017 · 17 comments

Comments

@rrohwer

commented Dec 1, 2017

There is a bug in the rrarefy() function when the count table is stored as numeric/double rather than integer, caused by the behavior of the base R function rep() on line 18 of the function:

row <- sample(rep(nm, times = x[i, ]), sample[i])

rep() always truncates (rounds down) non-integer times values, so it gives unexpected results when a very large matrix of doubles that is essentially whole numbers carries tiny rounding errors. Here's a simple example of the unexpected behavior:

> rep(x = letters[1:3], times = c(.99999999, 0.0000001, 1))
[1] "c"

This is a fix to get the desired behavior:

> rep(x = letters[1:3], times = as.integer(round(x = c(.99999999, 0.0000001, 1), digits = 0)))
[1] "a" "c"

This also means that if you are subsampling down to the smallest sample, you will sometimes get an error from sample() saying that replace = FALSE and there aren't enough things to sample from. I suggest fixing it by adding this after the other checks in the rrarefy() function:

if (!is.integer(x)) {
   x <- round(x, digits = 0)
   save.row.names <- rownames(x)
   x <- apply(X = x, MARGIN = 2, FUN = as.integer)
   rownames(x) <- save.row.names
   rm(save.row.names)
}
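The failure mode can be reproduced without vegan at all; the values below are made up to carry a tiny floating-point error that still passes an all.equal() near-integer check:

```r
# Counts that look integer (and pass all.equal against their rounded
# values) still lose individuals when expanded with rep(), because
# rep() truncates fractional 'times' values.
counts <- c(3, 5, 2) - 1e-12             # "integers" with tiny FP error
isTRUE(all.equal(counts, round(counts))) # TRUE -- passes the check
pool <- rep(c("a", "b", "c"), times = counts)
length(pool)                             # 7, not the expected 10
```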

Thanks!
Robin

@jarioksa

Contributor

commented Dec 1, 2017

An intriguing point of view. Can you trigger this error with real data? You mean this is not sufficient to catch non-integer data:

    if (!identical(all.equal(x, round(x)), TRUE)) 
        stop("function is meaningful only for integers (counts)")
@gavinsimpson

Contributor

commented Dec 1, 2017

This'll pass that test

> vals <- c(.999999999999999999999, 0.0000000000000000000001, 1)
> all.equal(vals, round(vals))
[1] TRUE

So I presume what's happening is that some trivial floating point error is sneaking past this check but causing the reported behaviour later on.

It can't hurt to implement a modification of the proposed change, along the lines of:

x <- round(x, digits = 0)
storage.mode(x) <- 'integer'

applied to anything that sneaks by. Or we could tighten up the check and force the user to fix up their data before using rrarefy()?

@gavinsimpson

Contributor

commented Dec 1, 2017

...and given @rrohwer's area of research I can well imagine such issues being all too common an occurrence in real data.

@gavinsimpson gavinsimpson added the bug label Dec 1, 2017

@rrohwer

Author

commented Dec 1, 2017

Yes it occurred with my data. It happened because I had normalized the data instead of rarefying, but later converted the normalized OTU table back to counts to simulate how results changed with different sequencing depths. Since the data table was very large, I was getting a difference of over 30,000 between the length and the sum!!

Since the first check:

if (!identical(all.equal(x, round(x)), TRUE)) 
    stop("function is meaningful only for integers (counts)")

makes sure you are using whole numbers already, I think it would be OK to convert to integer structure automatically. Plus, I think most downstream things that might require a numeric structure would automatically convert it back.

@jarioksa

Contributor

commented Dec 2, 2017

Good morning, America!

I think this is the simplest fix:

diff --git a/R/rrarefy.R b/R/rrarefy.R
index e58c749..61bea6e 100644
--- a/R/rrarefy.R
+++ b/R/rrarefy.R
@@ -6,6 +6,7 @@
     if (!identical(all.equal(x, round(x)), TRUE)) 
         stop("function is meaningful only for integers (counts)")
     x <- as.matrix(x)
+    x <- round(x)
     if (ncol(x) == 1)
         x <- t(x)
     if (length(sample) > 1 && length(sample) != nrow(x))

@gavinsimpson storage.mode() would not do because it truncates (and R is dynamically typed and any function can duplicate and change the storage.mode).

There are many other functions where we check the integerness with near-equality (all.equal()) and which may suffer from the same problem. All must be fixed.
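The recurring pattern could be factored into one helper; a sketch along those lines (the name coerce_counts is hypothetical, not vegan code):

```r
# Hypothetical helper for the recurring pattern: verify the values are
# integers within all.equal() tolerance, then snap them to exact
# integers with round().  Illustrative only -- not part of vegan.
coerce_counts <- function(x) {
    x <- as.matrix(x)
    if (!identical(all.equal(x, round(x)), TRUE))
        stop("function is meaningful only for integers (counts)")
    round(x)    # exact integers; storage mode remains double
}

coerce_counts(c(3, 5, 2) - 1e-12)   # returns exactly 3, 5, 2
```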

@jarioksa

Contributor

commented Dec 2, 2017

We have two alternatives: the easy and the hard.

The easy way is the one I outlined above: we check that input are integers within a tolerance, and then safely round() them to exact integers. With this, vegan functions work correctly, and the users will be bitten by some other functions.

The harder way is to test for exact integers and let users sort out the problem before using our functions. We already have bug reports from users who say that vegan functions claim they have negative data entries although they don't. Now we would get bug reports from users who complain that vegan claims they have non-integer data, although they can see that they only have integers.

The difference is

> all.equal(as.matrix(BCI), (sqrt(BCI))^2)
[1] TRUE # approximately integer
> all(as.matrix(BCI) == (sqrt(BCI))^2)
[1] FALSE # not exactly integer
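The same contrast can be shown without BCI; here is a self-contained stand-in where the back-transformation error is injected explicitly (values chosen only for illustration):

```r
# Slack vs exact integer tests on "counts" carrying a tiny error:
x <- c(2, 3, 5, 7)
y <- x - 1e-13            # back-transformed "counts", off by 1e-13
isTRUE(all.equal(x, y))   # TRUE : integer within tolerance
all(x == y)               # FALSE: not exactly integer
as.integer(y)             # 1 2 4 6 -- as.integer() truncates
all(x == round(y))        # TRUE : round() recovers the counts
```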

We have similar integer tests in several functions. In the following tests I compare data BCI and (sqrt(BCI))^2, which differ by a magnitude of 10^-14. Rarefaction methods are compared at size 100.

  • rarefy: no effect; the test can be slack, and rounding is not needed.
  • rarecurve: not yet analysed.
  • rareslope: no effect.
  • rrarefy: the difference in rrarefied values is of magnitude 5, so we really need a stricter test/round.
  • drarefy: gave a superfluous and wrong warning (now fixed in 4342cdc), but there was no effect.
  • as.fisher: no effect.
  • as.preston: large effect in tied values; needs a stricter test/round.
  • decostand: here the test actually concerns having values < 1 in Marti Anderson's "log", and the effect is tiny (magnitude 10^-16). No changes needed, and round() makes no sense.
  • chaodist: tiny effect, of magnitude 10^-16.
  • estimateR: huge effect, up to a difference of 1100 in S.chao1.
  • vegdist: we use a more drastic test here than in other functions: x against as.integer(x), and as.integer() truncates. This influences the indices "morisita", "chao" and "cao", but the effect is of magnitude 10^-16; we could adopt a slacker test, and we do not even need to round.
@psolymos

Contributor

commented Dec 2, 2017

Hi All, I would vote for silent rounding when needed, and soft testing where it doesn't matter.

@jarioksa

Contributor

commented Dec 3, 2017

I think I'll go with practically-integer test + round to exact integer. This gives the least disruption in normal usage, although users may be burnt somewhere else.

jarioksa added a commit that referenced this issue Dec 3, 2017
take care that input really are integer, not only practically
github issue #259: backtransformation to integers can be inexact,
but analysis assumes exact integers because input are truncated.
Now we first check that the values are practically integers, and
then silently take care that they are exactly integers with round().
@gavinsimpson

Contributor

commented Dec 3, 2017

@jarioksa re:

@gavinsimpson storage.mode() would not do because it truncates (and R is dynamically typed and any function can duplicate and change the storage.mode).

Yup, hence the explicit round() preceding it. I was thinking we might require explicit integer counts and either enforce it or fail if not integer, with advice on how to fix this.

I'm not a fan of silently doing things - if we change the input we should emit at least a message().

@jarioksa jarioksa referenced this issue Dec 3, 2017
@jarioksa

Contributor

commented Dec 3, 2017

@gavinsimpson when we say foo(x) we do many things silently. Now I suggest that (1) we check that the input can be seen as integers, and (2) we take care that it is treated as exact integers. We do not change the input, and I don't see any reason for the message(): the data remain unchanged, we only analyse them as if they were integers, and we do so by rounding to the nearest integer when the values are within 10^-8 of the corresponding integer, instead of failing or truncating them.

Please note that storage.mode() in one way does more: it truncates the data and is equivalent to floor(). On the other hand, it does nothing in the sense that storage mode is not sticky in a dynamically typed language like R: the next function that handles the data as double will silently do so. The storage.mode is only needed when we pass data to functions written in typed languages via the .C() or .Fortran() interface. Even with the .Call() interface to C code, the data can be silently truncated to integer. Currently we do so, e.g., in permutation tests for cca() & friends, where the permutation matrix is truncated to integers if it happens to be double (as it can be when it was not directly produced by permute package functions). However, any R command between setting storage.mode and calling .C can change the data back to double.

@rrohwer

Author

commented Dec 3, 2017

I'd just point out that R already silently changes between integer and double all the time. The reason this error occurred is that the base R function rep() silently converted doubles to integers, but with truncation instead of rounding. The fix you committed just corrects the silent conversion, so I think it's fine to leave it silent. Thanks everyone!!

@gavinsimpson

Contributor

commented Dec 3, 2017

@jarioksa we're talking at cross-purposes. My suggestion was

x <- round(x)
storage.mode(x) <- "integer"

& I only meant it in the sense of

> m <- m + sample(c(0.00000000000000000000001, 0, 0.999999999999999999), 9, replace = TRUE)
> storage.mode(m)
[1] "double"
> m
     [,1] [,2] [,3]
[1,]    3    5    8
[2,]    8    4    5
[3,]    3    4    3
> is.integer(m)
[1] FALSE
> mm <- round(m)
> storage.mode(mm)
[1] "double"
> is.integer(mm)
[1] FALSE
> storage.mode(mm) <- "integer"
> is.integer(mm)
[1] TRUE
> mm
     [,1] [,2] [,3]
[1,]    3    5    8
[2,]    8    4    5
[3,]    3    4    3

So, as we've already round()ed the data storage.mode isn't doing anything

> all.equal(mm, round(m))
[1] TRUE

further to the data, but it is marking the matrix as explicitly integer. I had in mind that we might remove the existing check and simply enforce an integer matrix, sensu stopifnot(is.matrix(x) & is.integer(x)), perhaps suggesting how to create such a thing in ?rrarefy

I appreciate that conversion to/from doubles happens all the time. I also appreciate I wasn't very clear initially.

I remain a little concerned that rrarefy() and related functions that say they require integers will work silently for non-integer data, by rounding, and give the impression this is OK. Given how trivial it is to wrap anything that offends in suppressMessages(), a message() seems a small cost. Perhaps I'm being too fastidious?
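A sketch of what that message-emitting variant could look like (the name round_counts is hypothetical; this is not the committed vegan behaviour):

```r
# Hypothetical variant: accept near-integer input but say so via
# message(), which callers can wrap in suppressMessages().
round_counts <- function(x) {
    if (!identical(all.equal(x, round(x)), TRUE))
        stop("input must be integer counts")
    if (any(x != round(x))) {
        message("input rounded to exact integer counts")
        x <- round(x)
    }
    x
}

suppressMessages(round_counts(c(3, 5, 2) - 1e-12))  # 3 5 2, silently
```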

@jarioksa

Contributor

commented Dec 4, 2017

The problem with R is that it really does not have types: storage.mode only sets the storage mode (i.e., how data are stored in memory), but it does not set the type (i.e., how data are used). Normally R integers are not stored as integers:

> data(BCI) # data are counts (integers)
> is.integer(BCI)
[1] FALSE
> ints <- c(1,3,4) # integers
> is.integer(ints)
[1] FALSE
> storage.mode(ints) <- "integer"
> is.integer(ints)
[1] TRUE
> ints[1]/ints[3]
[1] 0.25 # integer division should give integer 0, but it gives a float

Numbers are always treated as non-integer in R, and storage.mode is only needed when you pass them to languages that have a distinct integer type, that is, to code called via .C() or .Fortran(). Even in the .Call() interface you can change the type, and we do this in many places in vegan (try grep coerceVector src/*.c in bash). In simulate.nullmodel() we set storage.mode only to save space: integers are stored more compactly. However, R will still treat these numbers as if they were double (floating point numbers).
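The non-stickiness is easy to demonstrate in plain base R:

```r
# Integer storage does not survive ordinary arithmetic: R silently
# promotes back to double as soon as a non-integer operation occurs.
ints <- c(1L, 3L, 4L)
is.integer(ints)          # TRUE
is.integer(ints / 2)      # FALSE: "/" always returns double
is.integer(ints + 0L)     # TRUE : integer arithmetic stays integer
is.integer(ints + 0.0)    # FALSE: mixing with double promotes
```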

We could argue that it is always an error to treat data as integer in R. If we do so, we must be very careful, and obviously we were not in rrarefy. I had considered the problem, but I was only thinking of a use case where data are read into R as integer-looking (numeric, but with zero decimal part); I did not consider back-transformation toward integers. Having such data is an error, and it is a matter of taste whether that is a coding error or a user error: if you back-transform toward integers, you also need round() to back-transform all the way to integers. In the long run, it could be better for the user's sake to be strict in vegan and flag an error when we see that a number that looks like 2 when you stare at it on the screen is actually

> print(sqrt(2)^2, digits=17)
[1] 2.0000000000000004
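A strict test of that kind is simple to write (the name is_exact_integer is hypothetical; vegan chose the permissive check-and-round route instead):

```r
# Exact-integer test: flags values that merely print like integers.
is_exact_integer <- function(x) all(x == round(x))

is_exact_integer(2)          # TRUE
is_exact_integer(sqrt(2)^2)  # FALSE: value is 2.0000000000000004
```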
jarioksa added a commit that referenced this issue Dec 4, 2017
@jarioksa

Contributor

commented Dec 4, 2017

There is still one issue with rrarefy(): I am not at all proud of its code. It is a slow memory hog. I wrote it rather quickly for an immediate need in a neighbouring office, and I was ashamed of the implementation even as I wrote it. It must take ages to run the function on data sets of the size @rrohwer had. While looking at this integer issue, I also considered rewriting the key part of rrarefy in C, but I am not really sure I will spend the time writing and testing the code (and I also need to think about how to do it fast and memory-efficiently). Actually, I am surprised that people use this function: I don't think random rarefaction is a good idea, since you introduce both error and bias into your data, which reduces their quality. However, this integer issue can be closed with PR #260.

@jarioksa jarioksa closed this Dec 4, 2017

@rrohwer

Author

commented Dec 4, 2017

@jarioksa I wouldn't bother with the speed unless there is a clear need for more use. It actually only took ~20 sec to run on my full count table, and rarefaction is out of favor these days. (I was using it for a quick simulation to compare effects of sequencing depth on some of my analysis metrics.)

@jarioksa

Contributor

commented Dec 4, 2017

Good to hear that random rarefaction is out of fashion. It never sounded like a good idea to me, and I disliked it even when I added rrarefy to vegan. Something like 20 secs really is what I expect for larger data sets, and I think that is much too long. However, it would take me far longer still to make the function faster, and my time has value to me. Probably I'll just let it be.

@jarioksa

Contributor

commented Dec 8, 2017

@gavinsimpson : I was permissive with rrarefy and other functions, but I think we should be strict with permutation matrices. Unmodified permutation matrices from permute are always integers, but we cannot be sure that user-supplied ones are, so I added a strict test in getPermuteMatrix (commit dc6d6bd). Opinions?

jarioksa added a commit that referenced this issue Dec 11, 2017
take care that input really are integer, not only practically

(cherry picked from commit 87d76f0)