how many people have computed jaccard distances incorrectly using vegdist? #153

Open
CarlyRae opened this Issue Jan 20, 2016 · 5 comments

Since I realized that you must specify binary = T in vegdist to compute distance matrices on presence/absence data as opposed to abundance data, I keep seeing more and more places where others are incorrectly using jaccard in vegdist (by not specifying binary = T).

For at least three years I assumed that specifying method = "jaccard" would make the command calculate the distance on presence/absence data, and I think most people expect that. Granted, the help file says that this is not the case, but it is still tricky.

Vegan is used by several other packages for computing distances, and the issue is not limited to jaccard: a few other metrics that one would assume operate on presence/absence data also give unexpected results unless binary = T is specified. Again, I am noticing more and more places where people miss this critical point.
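
A minimal sketch of the difference described above, assuming vegan is installed. The two-site abundance matrix and the object names are made up for illustration:

```r
library(vegan)

# Hypothetical abundance data: 2 sites (rows) x 3 species (columns)
x <- rbind(site1 = c(10, 0, 3),
           site2 = c(2, 5, 3))

# Default (binary = FALSE): quantitative Jaccard computed on the abundances
d_quant <- vegdist(x, method = "jaccard")

# binary = TRUE: the data are converted to presence/absence first
d_pa <- vegdist(x, method = "jaccard", binary = TRUE)

as.numeric(d_quant)  # 13/18, about 0.722
as.numeric(d_pa)     # 1/3, about 0.333
```

On data that are not already 0/1, the two calls give different dissimilarities, so the choice of binary matters.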

TWO questions:

  1. Is there any way to adjust the code so that people (not me anymore, at least!) do not forget to specify binary = T when computing measures that are known to be computed only on presence/absence data?

  2. May I also ask how it is possible that "Jaccard index is computed as 2B/(1+B), where B is Bray–Curtis dissimilarity"? I can't seem to find much information about the mathematical relationship between the two measures, and I am no math whiz. I'm just curious how that is possible, as I have been taught that these are two separate measures, not that Bray-Curtis is a transformation of Jaccard plus abundance information.
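
The quoted relationship can be checked numerically in base R. This is only a sketch: it assumes the usual textbook definitions (Bray-Curtis as the sum of absolute differences over the sum of totals, quantitative Jaccard as one minus the sum of minima over the sum of maxima), and the two abundance vectors are made up:

```r
# Hypothetical abundances of three species at two sites
a <- c(10, 0, 3)
b <- c(2, 5, 3)

# Bray-Curtis dissimilarity
bray <- sum(abs(a - b)) / sum(a + b)

# Quantitative Jaccard dissimilarity
jaccard <- 1 - sum(pmin(a, b)) / sum(pmax(a, b))

# Applying 2B/(1+B) to Bray-Curtis reproduces Jaccard
all.equal(2 * bray / (1 + bray), jaccard)
```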

Thank you!

Contributor

jarioksa commented Jan 21, 2016

  1. This is possible, but in my opinion, not really desirable: Quantitative indices are useful and I cannot see why we should remove them (like you seem to suggest). There are a couple of indices that only work with binary data, but (A+B-2*J)/(A+B-J) looks pretty general to me, and I cannot see why we should restrict it so that A, B and J are derived from binarized data. Probably you're correct when you say that many people have been misled by the naming, although this is documented both in the help files and in the vegan FAQ. If there is a helpful solution that does not strip functionality, I'll gladly add that in vegan.

  2. Isn't it nice to learn something new? I haven't seen this in any book either, but it is not really higher mathematics. Just put x=(A+B-2J)/(A+B) in 2x/(1+x), expand and simplify and you end up with (A+B-2*J)/(A+B-J). There really are not so many different ways you can combine three terms (A,B and J above, or a, b, c for those who prefer them), and several indices with different names are mathematically equivalent, and mathematically equivalent ways can have several names. Check function betadiver in vegan which has all indices in its source paper, but only a few of these are distinct -- and many are Sørensen with another name.
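
The substitution described in point 2 can be written out step by step. This is just the algebra, treating A, B and J as the three terms used in the formulas above:

```latex
x = \frac{A + B - 2J}{A + B}
\quad\Longrightarrow\quad
\frac{2x}{1 + x}
  = \frac{\dfrac{2(A + B - 2J)}{A + B}}{\dfrac{(A + B) + (A + B - 2J)}{A + B}}
  = \frac{2(A + B - 2J)}{2(A + B - J)}
  = \frac{A + B - 2J}{A + B - J}
```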

Thank you for the response. I use the package often and appreciate your contribution.

  1. I was not suggesting removing quantitative indices. I wanted to point out that it is easy to calculate jaccard distances incorrectly when using vegdist. Since Jaccard and a few other indices (Raup-Crick, others?) are the only ones that are easy to calculate incorrectly, would it be possible to return a warning message? For instance, if you run:

vegdist(x, method = "jaccard")

vegdist(x, method = "raup")

the warning message could say:

Warning: if you wish to calculate distances on a presence/absence matrix, please specify binary = T

  2. Thank you for the response. That clarifies things for me. I have learned many of these indices as a giant list of various equations and have never really thought about how there are really only so many ways you can combine the terms.
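
The warning idea in point 1 can be sketched as a thin wrapper around vegdist. The function vegdist_checked is hypothetical and not part of vegan, and it only handles the jaccard case:

```r
library(vegan)

# Hypothetical wrapper, NOT part of vegan: warns when a quantitative
# Jaccard is about to be computed on data that are not already 0/1.
vegdist_checked <- function(x, method = "bray", binary = FALSE, ...) {
  x <- as.matrix(x)
  is_pa <- all(x %in% c(0, 1))  # TRUE only if every entry is 0 or 1
  if (identical(method, "jaccard") && !binary && !is_pa) {
    warning("if you wish to calculate distances on a presence/absence ",
            "matrix, please specify binary = TRUE", call. = FALSE)
  }
  vegdist(x, method = method, binary = binary, ...)
}

# Usage with made-up abundance data: this call emits the warning
x <- rbind(c(10, 0, 3), c(2, 5, 3))
d <- suppressWarnings(vegdist_checked(x, method = "jaccard"))
```

Unlike the unconditional message proposed above, this sketch stays quiet when the input is already a 0/1 matrix or when binary = TRUE is given, because the quantitative and binary results coincide in those cases.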

georgeblck commented Jun 15, 2016 edited

I don't understand how this could be a problem. As long as your input is a presence/absence matrix, that is, a matrix filled only with 0/1, it doesn't matter whether you specify binary = TRUE or binary = FALSE in the call to vegdist(x, method = "jaccard").
And if your input matrix isn't a presence/absence matrix, well, then there is no use in calculating the Jaccard distance without first doing a presence/absence standardization by setting binary = TRUE.
Or maybe I am missing something essential?

Here is example code that produces the two scenarios:

library(vegan)
library(ggplot2)
library(reshape2)

# set binary: TRUE if presence/absence data
#             FALSE if abundance data
binary <- TRUE

m <- 500
n <- m
set.seed(12345)
if(binary == TRUE){
  # Create matrix with random 0/1 values
  x <- sample.int (2, m*n, TRUE)-1L
  dim(x) <- c(m,n)  
} else {
  # Or create matrix with values in the range 0-10
  x <- matrix(round(runif(n*m, 0, 10)), nrow = m, ncol = n)  
}

#### Test the standardization 
jaccard <- vegdist(x, method = "jaccard", binary = FALSE)
jaccard.stand <- vegdist(x, method = "jaccard", binary = TRUE)
### Are they equal (up to floating-point tolerance)?
isTRUE(all.equal(as.vector(jaccard), as.vector(jaccard.stand)))

### Plot for safety
jacc.df <- melt(cbind(jaccard, jaccard.stand))
ggplot(jacc.df, aes(x = value, fill = Var2, colour = Var2)) + 
  geom_histogram(alpha = 0.2, position = "identity", binwidth = 0.01)

Contributor

EDiLD commented Jul 26, 2016

Maybe the binary argument should be removed from vegdist():
It should be left to the user to transform the data as they want (binary is just a call to decostand).
This could resolve part of the confusion?
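
A minimal sketch of the equivalence mentioned above, assuming vegan is installed. The data are made up, and decostand(x, method = "pa") is vegan's presence/absence standardization:

```r
library(vegan)

# Hypothetical abundance data: 2 sites x 3 species
x <- rbind(site1 = c(10, 0, 3),
           site2 = c(2, 5, 3))

# Explicit presence/absence transformation, then the default vegdist call
x_pa <- decostand(x, method = "pa")
d_manual <- vegdist(x_pa, method = "jaccard")

# The shortcut via the binary argument
d_binary <- vegdist(x, method = "jaccard", binary = TRUE)

# Both routes give the same dissimilarities
all.equal(as.numeric(d_manual), as.numeric(d_binary))
```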

Contributor

jarioksa commented Jul 29, 2016 edited

The ChangeLog says that the argument binary was added in version 1.6-5 (Oct 12, 2004) and gives this motivation:

  • vegdist: an option for binary indices, since some users believed
    these are not in vegan, although you can get them with
    'decostand'.

Even some printed papers claimed that you cannot have certain indices (such as Sørensen or binary Jaccard) in vegan, and I don't want to go back to that situation.
