
rrarefy: fix silent failure with numeric count table #259

Closed
rrohwer opened this issue Dec 1, 2017 · 17 comments

Comments

@rrohwer

commented Dec 1, 2017

There is a bug in the rrarefy() function when the count table is stored as numeric/double rather than integer, caused by the behavior of the base R function rep() on line 18 of the function:

row <- sample(rep(nm, times = x[i, ]), sample[i])

rep() always truncates (rounds down) non-integer times values, so it gives unexpected results when a very large matrix of doubles that is essentially whole numbers carries tiny rounding errors. Here's a simple example of the unexpected behavior:

> rep(x = letters[1:3], times = c(.99999999, 0.0000001, 1))
[1] "c"

This is a fix to get the desired behavior:

> rep(x = letters[1:3], times = as.integer(round(x = c(.99999999, 0.0000001, 1), digits = 0)))
[1] "a" "c"

This also means that if you are subsampling down to the smallest sample, you will sometimes get an error from sample() saying that replace = FALSE and there aren't enough things to sample from. I suggest fixing it by adding this after the other checks in the rrarefy() function:

if (!is.integer(x)) {
   x <- round(x, digits = 0)
   save.row.names <- rownames(x)
   x <- apply(X = x, MARGIN = 2, FUN = as.integer)
   rownames(x) <- save.row.names
   rm(save.row.names)
}
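The failure mode can be reproduced without vegan at all; the values below are made up to carry a tiny floating-point error that still passes an all.equal() near-integer check:

```r
# Counts that look integer (and pass all.equal against their rounded
# values) still lose individuals when expanded with rep(), because
# rep() truncates fractional 'times' values.
counts <- c(3, 5, 2) - 1e-12             # "integers" with tiny FP error
isTRUE(all.equal(counts, round(counts))) # TRUE -- passes the check
pool <- rep(c("a", "b", "c"), times = counts)
length(pool)                             # 7, not the expected 10
```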

Thanks!
Robin

@jarioksa

Contributor

commented Dec 1, 2017

An intriguing point of view. Can you trigger this error with real data? You mean this is not sufficient to catch non-integer data:

    if (!identical(all.equal(x, round(x)), TRUE)) 
        stop("function is meaningful only for integers (counts)")
@gavinsimpson

Contributor

commented Dec 1, 2017

This'll pass that test

> vals <- c(.999999999999999999999, 0.0000000000000000000001, 1)
> all.equal(vals, round(vals))
[1] TRUE

So I presume what's happening is that some trivial floating point error is sneaking past this check but causing the reported behaviour later on.

It can't hurt to implement a modification of the proposed change, along the lines of:

x <- round(x, digits = 0)
storage.mode(x) <- 'integer'

applied to anything that sneaks by. Or we could tighten up the check and force the user to fix up their data before using rrarefy()?

@gavinsimpson

Contributor

commented Dec 1, 2017

...and given @rrohwer's area of research I can well imagine such issues being all too common an occurrence in real data.

@gavinsimpson gavinsimpson added the bug label Dec 1, 2017

@rrohwer

Author

commented Dec 1, 2017

Yes it occurred with my data. It happened because I had normalized the data instead of rarefying, but later converted the normalized OTU table back to counts to simulate how results changed with different sequencing depths. Since the data table was very large, I was getting a difference of over 30,000 between the length and the sum!!

Since the first check:

if (!identical(all.equal(x, round(x)), TRUE)) 
    stop("function is meaningful only for integers (counts)")

makes sure you are using whole numbers already, I think it would be OK to convert to integer structure automatically. Plus, I think most downstream things that might require a numeric structure would automatically convert it back.

@jarioksa

Contributor

commented Dec 2, 2017

Good morning, America!

I think this is the simplest fix:

diff --git a/R/rrarefy.R b/R/rrarefy.R
index e58c749..61bea6e 100644
--- a/R/rrarefy.R
+++ b/R/rrarefy.R
@@ -6,6 +6,7 @@
     if (!identical(all.equal(x, round(x)), TRUE)) 
         stop("function is meaningful only for integers (counts)")
     x <- as.matrix(x)
+    x <- round(x)
     if (ncol(x) == 1)
         x <- t(x)
     if (length(sample) > 1 && length(sample) != nrow(x))

@gavinsimpson storage.mode() would not do because it truncates (and R is dynamically typed and any function can duplicate and change the storage.mode).

There are many other functions where we check the integerness with near-equality (all.equal()) and which may suffer from the same problem. All must be fixed.
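The recurring pattern could be factored into one helper; a sketch along those lines (the name coerce_counts is hypothetical, not vegan code):

```r
# Hypothetical helper for the recurring pattern: verify the values are
# integers within all.equal() tolerance, then snap them to exact
# integers with round().  Illustrative only -- not part of vegan.
coerce_counts <- function(x) {
    x <- as.matrix(x)
    if (!identical(all.equal(x, round(x)), TRUE))
        stop("function is meaningful only for integers (counts)")
    round(x)    # exact integers; storage mode remains double
}

coerce_counts(c(3, 5, 2) - 1e-12)   # returns exactly 3, 5, 2
```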

@jarioksa

Contributor

commented Dec 2, 2017

We have two alternatives: the easy and the hard.

The easy way is the one I outlined above: we check that input are integers within a tolerance, and then safely round() them to exact integers. With this, vegan functions work correctly, and the users will be bitten by some other functions.

The harder way is to test for exact integers and let users sort out the problem before using our functions. We already have bug reports from users who say that vegan functions claim they have negative data entries although they don't. Now we would get bug reports from users who complain that vegan claims they have non-integer data, although they can see that they only have integers.

The difference is

> all.equal(as.matrix(BCI), (sqrt(BCI))^2)
[1] TRUE # approximately integer
> all(as.matrix(BCI) == (sqrt(BCI))^2)
[1] FALSE # not exactly integer
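The same contrast can be shown without BCI; here is a self-contained stand-in where the back-transformation error is injected explicitly (values chosen only for illustration):

```r
# Slack vs exact integer tests on "counts" carrying a tiny error:
x <- c(2, 3, 5, 7)
y <- x - 1e-13            # back-transformed "counts", off by 1e-13
isTRUE(all.equal(x, y))   # TRUE : integer within tolerance
all(x == y)               # FALSE: not exactly integer
as.integer(y)             # 1 2 4 6 -- as.integer() truncates
all(x == round(y))        # TRUE : round() recovers the counts
```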

We have similar integer tests in several functions. In the following tests I compare data BCI and (sqrt(BCI))^2, which differ by a magnitude of 10^-14. Rarefaction methods are compared at size 100.

  • rarefy: no effect; the test can be slack, and rounding is not needed.
  • rarecurve: not yet analysed.
  • rareslope: no effect.
  • rrarefy: the difference in rrarefied values is of magnitude 5, so we really need a stricter test/round.
  • drarefy: gave a superfluous and wrong warning (now fixed in 4342cdc), but there was no effect.
  • as.fisher: no effect.
  • as.preston: large effect in tied values; needs a stricter test/round.
  • decostand: here the test actually concerns having values < 1 in Marti Anderson's "log", and the effect is tiny (magnitude 10^-16). No changes needed, and round() makes no sense.
  • chaodist: tiny effect, of magnitude 10^-16.
  • estimateR: huge effect, up to a difference of 1100 in S.chao1.
  • vegdist: we use a more drastic test here than in other functions: x against as.integer(x), and as.integer() truncates. This influences the indices "morisita", "chao" and "cao", but the effect is of magnitude 10^-16; we could adopt a slacker test, and we do not even need to round.
@psolymos

Contributor

commented Dec 2, 2017

Hi All, I would vote for silent rounding when needed, and soft testing where it doesn't matter.

@jarioksa

Contributor

commented Dec 3, 2017

I think I'll go with practically-integer test + round to exact integer. This gives the least disruption in normal usage, although users may be burnt somewhere else.

jarioksa added a commit that referenced this issue Dec 3, 2017
take care that input really are integer, not only practically
github issue #259: backtransformation to integers can be inexact,
but analysis assumes exact integers because input are truncated.
Now we first check that the values are practically integers, and
then silently take care that they are exactly integers with round().
@gavinsimpson

Contributor

commented Dec 3, 2017

@jarioksa re:

@gavinsimpson storage.mode() would not do because it truncates (and R is dynamically typed and any function can duplicate and change the storage.mode).

Yup, hence the explicit round() preceding it. I was thinking we might require explicit integer counts and either enforce it or fail if not integer, with advice on how to fix this.

I'm not a fan of silently doing things - if we change the input we should emit at least a message().

@jarioksa jarioksa referenced this issue Dec 3, 2017
@jarioksa

Contributor

commented Dec 3, 2017

@gavinsimpson when we say foo(x) we do many things silently. Now I suggest that (1) we check that the input can be seen as integers, and (2) we take care that it is treated as exact integers. We do not change the input, and I don't see any reason for the message(): the data remain unchanged, we only analyse them as if they were integers, and we do so by rounding to the nearest integer when the values are within 10^-8 of the corresponding integer, instead of failing or truncating them.

Please note that storage.mode() in one way does more: it truncates the data and is equivalent to floor(). On the other hand, it does nothing in the sense that storage mode is not sticky in a dynamically typed language like R: the next function that handles the data as double will silently do so. The storage.mode is only needed when we pass data to functions written in typed languages via the .C() or .Fortran() interface. Even with the .Call() interface to C code, the data can be silently truncated to integer. Currently we do so, e.g., in permutation tests for cca() & friends, where the permutation matrix is truncated to integers if it happens to be double (as it can be when it was not directly produced by permute package functions). However, any R command between setting storage.mode and calling .C can change the data back to double.

@rrohwer

Author

commented Dec 3, 2017

I'd just point out that R already silently changes between integer and double all the time. The reason this error occurred is that the base R function rep() silently converted doubles to integers, but with truncation instead of rounding. The fix you committed just corrects the silent conversion, so I think it's fine to leave it silent. Thanks everyone!!

@gavinsimpson

Contributor

commented Dec 3, 2017

@jarioksa we're talking at cross-purposes. My suggestion was

x <- round(x)
storage.mode(x) <- "integer"

& I only meant it in the sense of

> m <- m + sample(c(0.00000000000000000000001, 0, 0.999999999999999999), 9, replace = TRUE)
> storage.mode(m)
[1] "double"
> m
     [,1] [,2] [,3]
[1,]    3    5    8
[2,]    8    4    5
[3,]    3    4    3
> is.integer(m)
[1] FALSE
> mm <- round(m)
> storage.mode(mm)
[1] "double"
> is.integer(mm)
[1] FALSE
> storage.mode(mm) <- "integer"
> is.integer(mm)
[1] TRUE
> mm
     [,1] [,2] [,3]
[1,]    3    5    8
[2,]    8    4    5
[3,]    3    4    3

So, as we've already round()ed the data storage.mode isn't doing anything

> all.equal(mm, round(m))
[1] TRUE

further to the data, but it is marking the matrix as explicitly integer. I had in mind that we might remove the existing check and simply enforce an integer matrix, sensu stopifnot(is.matrix(x) & is.integer(x)), perhaps suggesting how to create such a thing in ?rrarefy

I appreciate that conversion to/from doubles happens all the time. I also appreciate I wasn't very clear initially.

I remain a little concerned that rrarefy() and related functions that say they require integers will work silently for non-integer data, by rounding, and give the impression this is OK. Given how trivial it is to wrap anything that offends in suppressMessages(), a message() seems a small cost. Perhaps I'm being too fastidious?
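A sketch of what that message-emitting variant could look like (the name round_counts is hypothetical; this is not the committed vegan behaviour):

```r
# Hypothetical variant: accept near-integer input but say so via
# message(), which callers can wrap in suppressMessages().
round_counts <- function(x) {
    if (!identical(all.equal(x, round(x)), TRUE))
        stop("input must be integer counts")
    if (any(x != round(x))) {
        message("input rounded to exact integer counts")
        x <- round(x)
    }
    x
}

suppressMessages(round_counts(c(3, 5, 2) - 1e-12))  # 3 5 2, silently
```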

@jarioksa

Contributor

commented Dec 4, 2017

The problem with R is that it really does not have types: storage.mode only sets the storage mode (i.e., how data are stored in memory), but it does not set the type (i.e., how data are used). Normally R integers are not stored as integers:

> data(BCI) # data are counts (integers)
> is.integer(BCI)
[1] FALSE
> ints <- c(1,3,4) # integers
> is.integer(ints)
[1] FALSE
> storage.mode(ints) <- "integer"
> is.integer(ints)
[1] TRUE
> ints[1]/ints[3]
[1] 0.25 # integer division should give integer 0, but it gives a float

Numbers are always treated as non-integer in R, and storage.mode is only needed when you pass them to languages that have a distinct integer type, that is, to code called via .C() or .Fortran(). Even in the .Call() interface you can change the type, and we do this in many places in vegan (try grep coerceVector src/*.c in bash). In simulate.nullmodel() we set storage.mode only to save space: integers are stored more compactly. However, R will still treat these numbers as if they were double (floating point numbers).
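The non-stickiness is easy to demonstrate in plain base R:

```r
# Integer storage does not survive ordinary arithmetic: R silently
# promotes back to double as soon as a non-integer operation occurs.
ints <- c(1L, 3L, 4L)
is.integer(ints)          # TRUE
is.integer(ints / 2)      # FALSE: "/" always returns double
is.integer(ints + 0L)     # TRUE : integer arithmetic stays integer
is.integer(ints + 0.0)    # FALSE: mixing with double promotes
```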

We could argue that it is always an error to treat data as integer in R. If we do so, we must be very careful, and obviously we were not in rrarefy. I had considered the problem, but I was only thinking of a use case where data are read into R as integer-looking (numeric, but with zero decimal part); I did not consider back-transformation toward integers. Having such data is an error, and it is a matter of taste whether that is a coding error or a user error: if you back-transform toward integers, you also need round() to back-transform all the way to integers. In the long run, it could be better for the user's sake to be strict in vegan and flag an error when we see that a number that looks like 2 when you stare at it on the screen is actually

> print(sqrt(2)^2, digits=17)
[1] 2.0000000000000004
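A strict test of that kind is simple to write (the name is_exact_integer is hypothetical; vegan chose the permissive check-and-round route instead):

```r
# Exact-integer test: flags values that merely print like integers.
is_exact_integer <- function(x) all(x == round(x))

is_exact_integer(2)          # TRUE
is_exact_integer(sqrt(2)^2)  # FALSE: value is 2.0000000000000004
```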
jarioksa added a commit that referenced this issue Dec 4, 2017
@jarioksa

Contributor

commented Dec 4, 2017

There is still one issue with rrarefy(): I am not at all proud of its code. It is a slow memory hog. I wrote it rather quickly for an immediate need in a neighbouring office, and I was ashamed of the implementation even as I wrote it. It must take ages to run the function on data sets of the size @rrohwer had. While looking at this integer issue, I also considered rewriting the key part of rrarefy in C, but I am not really sure I will spend the time writing and testing the code (and I also need to think about how to do it fast and memory-efficiently). Actually, I am surprised that people use this function: I don't think random rarefaction is a good idea, since you introduce both error and bias into your data, which reduces their quality. However, this integer issue can be closed with PR #260.

@jarioksa jarioksa closed this Dec 4, 2017

@rrohwer

Author

commented Dec 4, 2017

@jarioksa I wouldn't bother with the speed unless there is a clear need for more use. It actually only took ~20 sec to run on my full count table, and rarefaction is out of favor these days. (I was using it for a quick simulation to compare effects of sequencing depth on some of my analysis metrics.)

@jarioksa

Contributor

commented Dec 4, 2017

Good to hear that random rarefaction is out of fashion. It never sounded like a good idea to me, and I disliked it even when I added rrarefy to vegan. Something like 20 secs really is what I expect for larger data sets, and I think that is much too long. However, it would take me far longer still to make the function faster, and my time has value to me. Probably I'll just let it be.

@jarioksa

Contributor

commented Dec 8, 2017

@gavinsimpson : I was permissive with rrarefy and other functions, but I think we should be strict with permutation matrices. Unmodified permutation matrices from permute are always integers, but we cannot be sure that user-supplied ones are, so I added a strict test in getPermuteMatrix (commit dc6d6bd). Opinions?

jarioksa added a commit that referenced this issue Dec 11, 2017
take care that input really are integer, not only practically

(cherry picked from commit 87d76f0)