rrarefy- fix silent failure with numeric count table #259
An intriguing point of view. Can you trigger this error with real data? You mean this is not sufficient to catch non-integer data:

    if (!identical(all.equal(x, round(x)), TRUE))
        stop("function is meaningful only for integers (counts)")
This'll pass that test:

    > vals <- c(.999999999999999999999, 0.0000000000000000000001, 1)
    > all.equal(vals, round(vals))
    [1] TRUE

So I presume what's happening is that some trivial floating point error is sneaking past this check but causing the reported behaviour later on. It can't hurt to implement a modification of the proposed change, along the lines of:

    x <- round(x, digits = 0)
    storage.mode(x) <- 'integer'

to anything that sneaks by. Or we can tighten up the check and force the user to fix up their data before using
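To make the point above concrete, here is a minimal sketch with made-up values (not data from this report) showing that the tolerance-based check accepts a matrix that an exact comparison rejects, and what the proposed coercion does to it. The variable names are assumptions chosen for illustration.

```r
# Hypothetical near-integer counts: some values are off by a tiny amount.
x <- matrix(c(3, 5, 8, 8, 4, 5) + c(-1e-10, 0, 1e-10), nrow = 2)

# Tolerance-based check (the form used in the existing test): accepts x.
soft_ok <- identical(all.equal(x, round(x)), TRUE)

# Exact comparison: rejects x.
exact_ok <- all(x == round(x))

# Proposed coercion: round first, then mark the storage mode as integer.
xi <- round(x, digits = 0)
storage.mode(xi) <- "integer"

soft_ok         # TRUE
exact_ok        # FALSE
is.integer(xi)  # TRUE
```

Rounding before changing the storage mode matters, because coercion to integer truncates towards zero: a value such as 2.9999999999 would otherwise become 2.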
...and given @rrohwer's area of research I can well imagine such issues being all too common an occurrence in real data.
Yes, it occurred with my data. It happened because I had normalized the data instead of rarefying, but later converted the normalized OTU table back to counts to simulate how results changed with different sequencing depths. Since the data table was very large, I was getting a difference of over 30,000 between the length and the sum! Since the first check:
makes sure you are using whole numbers already, I think it would be OK to convert to integer structure automatically. Plus, I think most downstream things that might require a numeric structure would automatically convert it back.
Good morning, America! I think this is the simplest fix:

    diff --git a/R/rrarefy.R b/R/rrarefy.R
    index e58c749..61bea6e 100644
    --- a/R/rrarefy.R
    +++ b/R/rrarefy.R
    @@ -6,6 +6,7 @@
         if (!identical(all.equal(x, round(x)), TRUE))
             stop("function is meaningful only for integers (counts)")
         x <- as.matrix(x)
    +    x <- round(x)
         if (ncol(x) == 1)
             x <- t(x)
         if (length(sample) > 1 && length(sample) != nrow(x))

@gavinsimpson There are many other functions where we check the integerness with near-equality (
We have two alternatives: the easy and the hard. The easy way is the one I outlined above: we check that the input values are integers within a tolerance, and then safely round them. The harder way is to test for exact integers, and let users sort out the problem before using our functions. We already have bug reports from users who say that vegan functions report negative data entries although they have none. Now we would get bug reports from users who complain that vegan claims they have non-integer data, although they can see that they only have integers. The difference is

    > all.equal(as.matrix(BCI), (sqrt(BCI))^2)
    [1] TRUE # approximately integer
    > all(as.matrix(BCI) == (sqrt(BCI))^2)
    [1] FALSE # not exactly integer

We have similar integer tests in several functions. In the following tests I compare data
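The two alternatives can be sketched as a pair of small functions. Both `counts_soft()` and `counts_strict()` are hypothetical names invented for this illustration, not vegan functions, and the input is a made-up vector rather than the BCI data.

```r
# Hypothetical helpers, not part of vegan.

# "Easy" alternative: accept practically-integer input, then round it.
counts_soft <- function(x) {
    if (!identical(all.equal(x, round(x)), TRUE))
        stop("function is meaningful only for integers (counts)")
    round(x)
}

# "Hard" alternative: insist on exactly integer values.
counts_strict <- function(x) {
    if (!all(x == round(x)))
        stop("input contains non-integer values")
    x
}

counts <- c(2, 3, 5, 7)
back <- sqrt(counts)^2                   # back-transformed, no longer exact

soft <- counts_soft(back)                # passes and returns exact values
strict <- tryCatch(counts_strict(back),  # fails on the same input
                   error = function(e) conditionMessage(e))
```

With the soft variant the user never sees the discrepancy; with the strict variant the same back-transformed data triggers an error even though every printed value looks like an integer.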
Hi All, I would vote for silent rounding when needed, and soft testing where it doesn't matter.
I think I'll go with practically-integer test + round to exact integer. This gives the least disruption in normal usage, although users may be burnt somewhere else.
github issue #259: backtransformation to integers can be inexact, but analysis assumes exact integers because input are truncated. Now we first check that the values are practically integers, and then silently take care that they are exactly integers with round().
@jarioksa re:
Yup, hence the explicit I'm not a fan of silently doing things - if we change the input we should emit at least a
@gavinsimpson when we say Please note that
I'd just point out that R already silently changes between integer and double all the time. The reason this error occurred is because the base R function
@jarioksa we're talking at cross-purposes. My suggestion was

    x <- round(x)
    storage.mode(x) <- "integer"

& I only meant it in the sense of

    > m <- m + sample(c(0.00000000000000000000001, 0, 0.999999999999999999), 9, replace = TRUE)
    > storage.mode(m)
    [1] "double"
    > m
         [,1] [,2] [,3]
    [1,]    3    5    8
    [2,]    8    4    5
    [3,]    3    4    3
    > is.integer(m)
    [1] FALSE
    > mm <- round(m)
    > storage.mode(mm)
    [1] "double"
    > is.integer(mm)
    [1] FALSE
    > storage.mode(mm) <- "integer"
    > is.integer(mm)
    [1] TRUE
    > mm
         [,1] [,2] [,3]
    [1,]    3    5    8
    [2,]    8    4    5
    [3,]    3    4    3

So, as we've already

    > all.equal(mm, round(m))
    [1] TRUE

further to the data, but it is marking the matrix as explicitly integer. I had in mind that we might remove the existing check and simply enforce an integer matrix, sensu I appreciate that conversion to/from doubles happens all the time. I also appreciate I wasn't very clear initially. I remain a little concerned that behaviour of
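One way to combine "enforce an integer matrix" with the earlier concern about silently changing input is sketched below. `as_count_matrix()` is a hypothetical helper written for this illustration; it is not in vegan, and the choice of `warning()` as the signal is an assumption.

```r
# Hypothetical helper, not part of vegan: enforce an integer matrix and
# report when doing so changed any value.
as_count_matrix <- function(x) {
    x <- as.matrix(x)
    r <- round(x)
    changed <- sum(r != x)
    if (changed > 0)
        warning(changed, " value(s) were rounded to integers")
    storage.mode(r) <- "integer"
    r
}

# Same matrix as printed above, with one value nudged off by a tiny amount.
m <- matrix(c(3, 8, 3, 5, 4, 4, 8, 5, 3), nrow = 3)
noisy <- m
noisy[1, 1] <- noisy[1, 1] - 1e-12

clean <- suppressWarnings(as_count_matrix(noisy))
msg <- tryCatch(as_count_matrix(noisy),
                warning = function(w) conditionMessage(w))
```

Here `clean` is stored as integer and matches `m`, while `msg` holds the warning text, so the caller can see that the input was altered.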
The problem with R is that it really does not have types:

    > data(BCI) # data are counts (integers)
    > is.integer(BCI)
    [1] FALSE
    > ints <- c(1,3,4) # integers
    > is.integer(ints)
    [1] FALSE
    > storage.mode(ints) <- "integer"
    > is.integer(ints)
    [1] TRUE
    > ints[1]/ints[3]
    [1] 0.25 # integer division should give integer 0, but it gives a float

Numbers are always treated as non-integer in R, and We could argue that it is always an error to consider data as integer in R. If we do so, we must be very careful, and obviously we were not in

    > print(sqrt(2)^2, digits=17)
    [1] 2.0000000000000004
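For reference, a short sketch of how base R distinguishes integer from double storage, and of the operators involved. It uses only base R; the variable names are illustrative.

```r
ints <- c(1, 3, 4)          # whole numbers, but stored as double
typeof(ints)                # "double"

lits <- c(1L, 3L, 4L)       # the L suffix creates integer storage
typeof(lits)                # "integer"

q_slash <- lits[1] / lits[3]     # `/` always returns a double: 0.25
q_intdiv <- lits[1] %/% lits[3]  # `%/%` on integers stays integer: 0L

# Back-transformation is not exact in floating point.
exact <- sqrt(2)^2 == 2                   # FALSE
close <- isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE
```

So an integer result is available through `%/%`, but ordinary arithmetic moves values back to double, which is why a near-equality test alone does not guarantee exact counts.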
There is still one issue with
@jarioksa I wouldn't bother with the speed unless there is a clear need for more use. It actually only took ~20 sec to run on my full count table, and rarefaction is out of favor these days. (I was using it for a quick simulation to compare effects of sequencing depth on some of my analysis metrics.)
Good to hear that random rarefaction is out of fashion. It never sounded like a good idea to me, and I disliked the idea even when I added
@gavinsimpson: I was permissive with
github issue #259: backtransformation to integers can be inexact, but analysis assumes exact integers because input are truncated. Now we first check that the values are practically integers, and then silently take care that they are exactly integers with round(). (cherry picked from commit 87d76f0)
There is a bug in the rrarefy() function if the structure of the count table is numeric/double instead of integer because of the behavior of the base R function rep() in line 18 of the function:
rep always rounds down, so it gives unexpected results when a very large matrix of doubles that is basically whole numbers has tiny rounding errors. Here's a simple example of the unexpected behavior:
This is a fix to get the desired behavior:
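The snippets from the original report are not shown in this copy. The following is a minimal sketch with made-up values, not the reporter's code, illustrating both the truncation in `rep()` and a rounding fix along the lines discussed in this thread.

```r
# Made-up row of counts stored as doubles with tiny negative errors.
counts <- c(a = 3, b = 5, c = 2) - 1e-9
depth <- 10                                  # intended row total

# rep() truncates non-integer 'times' towards zero, so the pool is short.
pool <- rep(names(counts), times = counts)
n_pool <- length(pool)                       # 7, not 10

# Sampling 'depth' items without replacement then fails.
res <- tryCatch(sample(pool, depth), error = function(e) "error")

# Sketch of a fix: round first, then store as integer.
fixed <- round(counts)
storage.mode(fixed) <- "integer"
pool_fixed <- rep(names(fixed), times = fixed)
n_fixed <- length(pool_fixed)                # 10
```

Each near-integer count loses one item to truncation, so across a very large table the shortfall between the pool length and the row sum can become large.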
This means that if you are subsampling to the shortest sample, the function will sometimes throw an error saying replace=F and there aren't enough things to sample from. I suggest fixing it by adding this after the other checks in the rrarefy function:
Thanks!
Robin