-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggestion: rank transformation in decostand()? #225
Comments
It looks that the following code would do: rank = {
is.na(x) <- x == 0
x <- apply(x, MARGIN, rank, na.last = "keep")
x[is.na(x)] <- 0
x
} Comments are welcome, both on the idea and on the implementation. What worries me here is that the maximum rank is related to the frequency of species: is it really so that more common (= higher number of occurrences) species should get higher abundance values? |
Thanks, yes, this implementation is exactly what I had in mind, although I was only thinking about the situation where MARGIN = 1 (rows). I'm not sure in what situation one would want to rank by columns (species), although I guess it doesn't hurt to provide the option, but perhaps provide MARGIN = 1 as the default? rank = {
if (missing(MARGIN)) MARGIN <- 1
is.na(x) <- x == 0
x <- apply(x, MARGIN, rank, na.last = "keep")
tmp <- apply(x, MARGIN, max, na.rm = T)
x <- sweep(x, MARGIN, tmp, "/")
x[is.na(x)] <- 0
} |
Actually, probably better to standardise by margin total? rank = {
if (missing(MARGIN)) MARGIN <- 1
is.na(x) <- x == 0
x <- apply(x, MARGIN, rank, na.last = "keep")
tmp <- apply(x, MARGIN, sum, na.rm = T)
x <- sweep(x, MARGIN, tmp, "/")
x[is.na(x)] <- 0
} |
Using
instead of
is slightly faster. |
My suggestion is indeed a bit slow, but microbenchmark did not find any clear or consistent differences with normal data sizes. Here the test results: library(vegan)
library(microbenchmark)
data(BCI)
bci <- as.matrix(BCI) ## as.matrix will be used in decostand
fun1 <- function(x) is.na(x) <- x == 0 ## my original
fun2 <- function(x) x[x==0] <- NA ## the classic
fun3 <- function(x) x <- replace(x, x == 0, NA) ## @EDiLD
microbenchmark(fun1(bci), fun2(bci), fun3(bci))
#Unit: microseconds
# expr min lq mean median uq max neval
# fun1(bci) 101.587 118.7785 204.8792 125.1175 208.7510 2885.952 100
# fun2(bci) 89.953 105.0370 179.6405 108.0380 182.4740 3350.545 100
# fun3(bci) 91.386 105.5060 189.9827 107.1530 183.2845 2994.159 100 |
Jupp, no consistent differences (though fun1() seems to be slightly slower). |
@elaliberte : it seems that About standardization: If the number of above-zero values (number of occurrences or number of species depending on the MARGIN) is f, then the sum of ranks will be (f^2 + f)/2, and the maximum rank is f (and lower when there are ties at highest rank). I think a different standardization is needed unless we want to down-weight OTU-rich sampling units. Divide by f to give proportional ranks in 0..1? |
Suggested and discussed in github issue #225 by @elaliberte
I have now implemented the To compare different options, let us see the default case of
It is difficult to say which of these is generally most useful. Two important points are:
Currently I am not sure how to standardize the results. I am mainly hovering between two alternatives: do nothing or use relative ranks (max rank). |
Hi, |
Good points @jarioksa, and I agree that it is not obvious what is generally most useful. In this case, perhaps the best strategy would be to just provide ranks (i.e. do nothing). |
I could not make up my minds, and I implemented both raw ranks ( |
This sounds good to me. Thanks again! |
closes wishlist issue #225 in github
Closed with merge commit 0fe6a5d |
High-throughput sequencing methods for characterizing e.g., soil microbial communities generate data matrices of 'sites' (or samples) as rows, and operational taxonomic units (OTUs) as columns. Values of cells are typically number of 'reads' (i.e. number of amplicons sequenced belonging to a particular OTU). Many studies treat these number of 'reads' as raw abundance data (and use vegan functions for community analyses).
However, many people criticize this usage because number of reads is not always clearly linked to abundance, due to PCR biases and all sorts of other problems. Some people therefore transform these data to presence-absence, but this is quite an extreme transformation and you end up losing a lot of potentially valuable information. Everyone seems to agree that these types of data could probably be treated at least semi-quantitatively. Therefore, my suggestion would be to add another method in decostand() to rank-transform. One important issue however, is that zeroes should stay zeroes during the rank transformation, see example below.
Is this an option that could be worth adding to decostand()?
It is worth noting that as absolute numbers of reads (i.e. row total) are meaningless in these studies because they are a direct result of PCR efficiency and a bunch of other things, everyone standardises these data to look at differences in relatives 'abundances' but not absolute differences. I realize that standardizing ranks is unorthodox, but to me it is less bad than not doing anything and treating number 'of reads' as quantitative data (i.e. 10 times more reads means that OTU was 10 times more abundant).
The text was updated successfully, but these errors were encountered: