Suggestion: rank transformation in decostand()? #225

elaliberte · 2017-02-08T16:59:33Z

High-throughput sequencing methods for characterizing e.g., soil microbial communities generate data matrices of 'sites' (or samples) as rows, and operational taxonomic units (OTUs) as columns. Values of cells are typically number of 'reads' (i.e. number of amplicons sequenced belonging to a particular OTU). Many studies treat these number of 'reads' as raw abundance data (and use vegan functions for community analyses).

However, many people criticize this usage because number of reads is not always clearly linked to abundance, due to PCR biases and all sorts of other problems. Some people therefore transform these data to presence-absence, but this is quite an extreme transformation and you end up losing a lot of potentially valuable information. Everyone seems to agree that these types of data could probably be treated at least semi-quantitatively. Therefore, my suggestion would be to add another method in decostand() to rank-transform. One important issue however, is that zeroes should stay zeroes during the rank transformation, see example below.

Is this an option that could be worth adding to decostand()?

It is worth noting that as absolute numbers of reads (i.e. row total) are meaningless in these studies because they are a direct result of PCR efficiency and a bunch of other things, everyone standardises these data to look at differences in relatives 'abundances' but not absolute differences. I realize that standardizing ranks is unorthodox, but to me it is less bad than not doing anything and treating number 'of reads' as quantitative data (i.e. 10 times more reads means that OTU was 10 times more abundant).

# example: different number of amplicons ('reads') of different OTUs in a sample
seqs <- c(0, 0, 2, 4, 19, 23)

# rank transformation to treat these data semi-quantitatively
seqs.rank <- rank(seqs)
seqs.rank # not ok; note that 0's were converted 
# [1] 1.5 1.5 3.0 4.0 5.0 6.0

# solution: first replace 0's with NA's
seqs.na <- seqs
seqs.na[seqs.na == 0] <- NA
seqs.na
# [1] NA NA  2  4 19 23

# rank again, keeping NA's
seqs.na.rank <- rank(seqs.na, na.last = 'keep')
seqs.na.rank # ok now, but need to replace NA's with 0's
# [1] NA NA  1  2  3  4

# final step
seqs.na.final <- seqs.na.rank
seqs.na.final[is.na(seqs.na.final)] <- 0
seqs.na.final # now ok: non-zero values, ranked, but 0's (absences) preserved
# [1] 0 0 1 2 3 4

jarioksa · 2017-02-08T18:04:42Z

It looks that the following code would do:

rank = {
      is.na(x) <- x == 0
      x <- apply(x, MARGIN, rank, na.last = "keep")
      x[is.na(x)] <- 0
      x
}

Comments are welcome, both on the idea and on the implementation.

What worries me here is that the maximum rank is related to the frequency of species: is it really so that more common (= higher number of occurrences) species should get higher abundance values?

elaliberte · 2017-02-08T19:17:05Z

Thanks, yes, this implementation is exactly what I had in mind, although I was only thinking about the situation where MARGIN = 1 (rows). I'm not sure in what situation one would want to rank by columns (species), although I guess it doesn't hurt to provide the option, but perhaps provide MARGIN = 1 as the default?
You're right about the maximum rank being a function of species richness, but I initially didn't see this as a big problem since people standardise by row after (though probably best not to assume they would). In this case perhaps one option would be to standardize by the maximum rank in each row to obtain "relative ranks"? Something like:

rank = {
     if (missing(MARGIN)) MARGIN <- 1
     is.na(x) <- x == 0
     x <- apply(x, MARGIN, rank, na.last = "keep")
     tmp <- apply(x, MARGIN, max, na.rm = T)
     x <- sweep(x, MARGIN, tmp, "/")
     x[is.na(x)] <- 0
}

elaliberte · 2017-02-08T19:30:37Z

Actually, probably better to standardise by margin total?

rank = {
     if (missing(MARGIN)) MARGIN <- 1
     is.na(x) <- x == 0
     x <- apply(x, MARGIN, rank, na.last = "keep")
     tmp <- apply(x, MARGIN, sum, na.rm = T)
     x <- sweep(x, MARGIN, tmp, "/")
     x[is.na(x)] <- 0
}

eduardszoecs · 2017-02-08T19:43:04Z

Using

x <- replace(x, x == 0, NA)

instead of

 is.na(x) <- x == 0

is slightly faster.

jarioksa · 2017-02-09T08:23:20Z

My suggestion is indeed a bit slow, but microbenchmark did not find any clear or consistent differences with normal data sizes. Here the test results:

library(vegan)
library(microbenchmark)
data(BCI)
bci <- as.matrix(BCI) ## as.matrix will be used in decostand
fun1 <- function(x) is.na(x) <- x == 0           ## my original
fun2 <- function(x) x[x==0] <- NA                ## the classic
fun3 <- function(x) x <- replace(x, x == 0, NA)  ## @EDiLD
microbenchmark(fun1(bci), fun2(bci), fun3(bci))
#Unit: microseconds
#      expr     min       lq     mean   median       uq      max neval
# fun1(bci) 101.587 118.7785 204.8792 125.1175 208.7510 2885.952   100
# fun2(bci)  89.953 105.0370 179.6405 108.0380 182.4740 3350.545   100
# fun3(bci)  91.386 105.5060 189.9827 107.1530 183.2845 2994.159   100

eduardszoecs · 2017-02-09T09:31:14Z

Jupp, no consistent differences (though fun1() seems to be slightly slower).
The classic one is IMO more readable...

jarioksa · 2017-02-09T10:15:36Z

@elaliberte : it seems that apply(x, 1, ...) will transpose x and we need to back-transpose.

About standardization: If the number of above-zero values (number of occurrences or number of species depending on the MARGIN) is f, then the sum of ranks will be (f^2 + f)/2, and the maximum rank is f (and lower when there are ties at highest rank). I think a different standardization is needed unless we want to down-weight OTU-rich sampling units. Divide by f to give proportional ranks in 0..1?

@elaliberte

Suggested and discussed in github issue #225 by @elaliberte

jarioksa · 2017-02-10T08:04:52Z

I have now implemented the rank option for decostand in a separate branch. This performs the unadjusted rank-standardization at the moment, but does no adjustment to ranks, but these vary with the number of non-zero items. I was originally considering to adjust these by maximum possible rank (not the maximum observed rank which can be lower if the highest values are tied), and @elaliberte suggested adjustment by total. At the moment I am not sure what to do, and I also have considered leaving the rank standardization like it is.

To compare different options, let us see the default case of MARGIN = 1 which standardizes by sampling units, and let S be the number of species / OTUs / taxa for each unit. Then -- in the absence of ties -- the different adjustments will give us:

Adjustment	Ranks	Total
None	0, 1, ..., S	S(S + 1)/2
Max rank	0, 1/S, ..., 1	(S + 1)/2
Total	0, 2/ (S^2 + S), ..., 2/(S + 1)	1

It is difficult to say which of these is generally most useful. Two important points are:

What is the meaning of S? Is the number of species a meaningful value ("alpha diversity") or a consequence of sampling intensity, sampling success or capture rate? The original request quoted example of high-throughput sequencing where S is hardly meaningful. Other typical examples are samples of benthic invertebrates with variable and unknown capture rate. Even if S is meaningful, do we want to change a superabundant monodominant species to 1 and a scarce species in a diverse community to S?
How these transformed abundances are used? Several later analytic tools perform their own adjustments of raw values, and these can over-turn any adjustment we did previously. Primarily I see this rank adjustment as a way of fixing strongly skewed abundance values for data where ranks may be meaningful and ties are rare. This is the case for sequence data, count data of individuals etc. It does not serve any other useful purpose. For instane, any real differences in total abundances will vanish. Apart from skewness, many analytic tools will include semi-automatic correction for unequal and unknown capture rate. For instance, correspondence analysis includes a double standardization which is just designed for these kind of data.

Currently I am not sure how to standardize the results. I am mainly hovering between two alternatives: do nothing or use relative ranks (max rank).

BErfanian · 2017-02-11T04:54:10Z

Hi,
I had the problem of "superabundant monodominant species" in calculating rank-abundance plots, the two dominant species alter when I compute rank-abundance plot for P/A and abundance data (and both of them are not the "real" type I observed in field, because of a superabundant and occasional species).
After some thinking, I used the frequency of a species (number of presence in SUs / total number of SUs) as a downweighting coefficient for abundance of that species. After using this, the results looks better.

elaliberte · 2017-02-13T15:00:11Z

Good points @jarioksa, and I agree that it is not obvious what is generally most useful. In this case, perhaps the best strategy would be to just provide ranks (i.e. do nothing).

jarioksa · 2017-02-14T08:34:17Z

I could not make up my minds, and I implemented both raw ranks (rank) and relative ranks (rrank). OK?

elaliberte · 2017-02-16T15:14:39Z

This sounds good to me. Thanks again!

closes wishlist issue #225 in github

jarioksa · 2017-02-17T08:11:50Z

Closed with merge commit 0fe6a5d

jarioksa added the feature-request label Feb 8, 2017

jarioksa pushed a commit that referenced this issue Feb 9, 2017

implement rank standardization as suggested by Etienne Laliberté

23790a6

Suggested and discussed in github issue #225 by @elaliberte

jarioksa pushed a commit that referenced this issue Feb 17, 2017

Merge branch 'decostand-rank'

0fe6a5d

closes wishlist issue #225 in github

jarioksa closed this as completed Feb 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestion: rank transformation in decostand()? #225

Suggestion: rank transformation in decostand()? #225

elaliberte commented Feb 8, 2017 •

edited by jarioksa

Loading

jarioksa commented Feb 8, 2017 •

edited

Loading

elaliberte commented Feb 8, 2017 •

edited

Loading

elaliberte commented Feb 8, 2017 •

edited

Loading

eduardszoecs commented Feb 8, 2017

jarioksa commented Feb 9, 2017

eduardszoecs commented Feb 9, 2017

jarioksa commented Feb 9, 2017 •

edited

Loading

jarioksa commented Feb 10, 2017 •

edited

Loading

BErfanian commented Feb 11, 2017 •

edited

Loading

elaliberte commented Feb 13, 2017

jarioksa commented Feb 14, 2017

elaliberte commented Feb 16, 2017

jarioksa commented Feb 17, 2017

Suggestion: rank transformation in decostand()? #225

Suggestion: rank transformation in decostand()? #225

Comments

elaliberte commented Feb 8, 2017 • edited by jarioksa Loading

jarioksa commented Feb 8, 2017 • edited Loading

elaliberte commented Feb 8, 2017 • edited Loading

elaliberte commented Feb 8, 2017 • edited Loading

eduardszoecs commented Feb 8, 2017

jarioksa commented Feb 9, 2017

eduardszoecs commented Feb 9, 2017

jarioksa commented Feb 9, 2017 • edited Loading

jarioksa commented Feb 10, 2017 • edited Loading

BErfanian commented Feb 11, 2017 • edited Loading

elaliberte commented Feb 13, 2017

jarioksa commented Feb 14, 2017

elaliberte commented Feb 16, 2017

jarioksa commented Feb 17, 2017

elaliberte commented Feb 8, 2017 •

edited by jarioksa

Loading

jarioksa commented Feb 8, 2017 •

edited

Loading

elaliberte commented Feb 8, 2017 •

edited

Loading

elaliberte commented Feb 8, 2017 •

edited

Loading

jarioksa commented Feb 9, 2017 •

edited

Loading

jarioksa commented Feb 10, 2017 •

edited

Loading

BErfanian commented Feb 11, 2017 •

edited

Loading