how many people have computed jaccard distances incorrectly using vegdist? #153

Open
CarlyMuletzWolz opened this issue Jan 20, 2016 · 9 comments

Comments

@CarlyMuletzWolz

Since I realized that you must specify binary = T in vegdist to compute distance matrices on presence/absence data as opposed to abundance data, I keep seeing more and more places where others are incorrectly using jaccard in vegdist (by not specifying binary = T).

For at least three years I assumed that specifying distance = jaccard would calculate the distance on presence/absence data. I think most people expect that. Granted, the help file states that this is not the case, but it is still tricky.

Vegan is used by multiple other packages for computing distances, and the issue is not limited to jaccard: for a few other metrics that one would assume are computed on presence/absence data, you do not get the expected result unless you specify binary = T. Again, I am noticing more and more places where people are missing this critical point.
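A minimal sketch of the difference, assuming the vegan package is installed. The two-site abundance matrix `abund` is made up for illustration. With the default `binary = FALSE`, `vegdist` computes the quantitative Jaccard index from the abundances; with `binary = TRUE`, the data are reduced to presence/absence first.

```r
library(vegan)

# Hypothetical abundance data: 2 sites (rows) x 3 species (columns)
abund <- rbind(site1 = c(10, 0, 3),
               site2 = c(1, 5, 3))

# Default: quantitative Jaccard computed from the abundances
d_quant <- vegdist(abund, method = "jaccard")

# Presence/absence Jaccard: data are reduced to 0/1 first
d_bin <- vegdist(abund, method = "jaccard", binary = TRUE)

# Hand calculations for comparison
quant_by_hand <- 1 - sum(pmin(abund[1, ], abund[2, ])) /
  sum(pmax(abund[1, ], abund[2, ]))
pa <- abund > 0
bin_by_hand <- 1 - sum(pa[1, ] & pa[2, ]) / sum(pa[1, ] | pa[2, ])

c(quantitative = as.numeric(d_quant), binary = as.numeric(d_bin))
c(quantitative = quant_by_hand, binary = bin_by_hand)
```

For these two sites the quantitative value is 28/36 (about 0.78) while the presence/absence value is 1/3 (about 0.33), so the choice of `binary` changes the result substantially.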

TWO questions:

  1. Is there any way to adjust the code so that people (not me anymore, at least!) would not forget to specify binary = T when computing measures that are known to be based only on presence/absence data?

  2. May I also ask how it is possible that "Jaccard index is computed as 2B/(1+B), where B is Bray–Curtis dissimilarity." I can't seem to find much information about the mathematical relationship between the two measures and am no math whiz. I'm just curious how that is possible as I have been taught that those are two separate measures and not that Bray-Curtis is a transformation of Jaccard + abundance information.

Thank you!

@jarioksa
Contributor

  1. This is possible, but in my opinion, not really desirable: Quantitative indices are useful and I cannot see why we should remove them (like you seem to suggest). There are a couple of indices that only work with binary data, but (A+B-2*J)/(A+B-J) looks pretty general to me, and I cannot see why we should restrict it so that A, B and J are derived from binarized data. Probably you're correct when you say that many people have been misled by the naming, although this is documented both in the help files and in the vegan FAQ. If there is a helpful solution that does not strip functionality, I'll gladly add that in vegan.

  2. Isn't it nice to learn something new? I haven't seen this in any book either, but it is not really higher mathematics. Just put x=(A+B-2J)/(A+B) in 2x/(1+x), expand and simplify and you end up with (A+B-2*J)/(A+B-J). There really are not so many different ways you can combine three terms (A,B and J above, or a, b, c for those who prefer them), and several indices with different names are mathematically equivalent, and mathematically equivalent ways can have several names. Check function betadiver in vegan which has all indices in its source paper, but only a few of these are distinct -- and many are Sørensen with another name.
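The substitution described in point 2 can be written out step by step. Here x denotes the Bray–Curtis dissimilarity; the reading of the symbols is an assumption taken from the vegdist help page, where, in the quantitative case, A and B are the site totals and J is the sum of the species-wise minima (in the binary case they are the species counts of the two sites and the number of shared species).

```latex
\begin{align*}
x &= \frac{A + B - 2J}{A + B} \\[4pt]
\frac{2x}{1 + x}
  &= \frac{\dfrac{2(A + B - 2J)}{A + B}}{\dfrac{(A + B) + (A + B - 2J)}{A + B}}
   = \frac{2(A + B - 2J)}{2A + 2B - 2J}
   = \frac{A + B - 2J}{A + B - J}
\end{align*}
```

The last expression is the Jaccard dissimilarity in the same notation, which is why it can be obtained from Bray–Curtis for both quantitative and binarized data.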

@CarlyMuletzWolz
Author

Thank you for the response. I use the package often and appreciate your contribution.

  1. I was not suggesting removing quantitative indices. I wanted to indicate that it is easy to incorrectly calculate jaccard distances when using vegdist. Since Jaccard and only a few other indices (Raup-Crick, others?) are the only indices that may be calculated incorrectly easily, would it be possible to return a warning message? For instance, if you run:

vegdist(x, method = "jaccard")

vegdist(x, method = "raup")

the warning message says:

Warning: if you wish to calculate distances on a presence/absence matrix, please specify binary = T

  2. Thank you for the response. That clarifies things for me. I have learned many of these indices as a giant list of various equations and have never really thought about how there are really only so many ways you can combine the terms.
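The warning suggested in point 1 could be sketched as a small user-side check run before calling vegdist. This is not part of vegan: the helper name `warn_if_quantitative` and its `pa_methods` argument are hypothetical, and the set of methods to flag is an assumption.

```r
# Hypothetical helper (not a vegan function): warns when a method that many
# users expect to be presence/absence is about to be run on abundance data
# without binary = TRUE.
warn_if_quantitative <- function(x, method, binary = FALSE,
                                 pa_methods = c("jaccard")) {
  x <- as.matrix(x)
  # TRUE only if every entry is exactly 0 or 1 (NA counts as non-binary)
  is_pa <- all(x %in% c(0, 1))
  if (method %in% pa_methods && !binary && !is_pa) {
    warning("if you wish to calculate distances on a presence/absence ",
            "matrix, please specify binary = TRUE", call. = FALSE)
  }
  invisible(is_pa)
}

# Made-up abundance data for illustration
abund <- rbind(site1 = c(10, 0, 3),
               site2 = c(1, 5, 3))

warn_if_quantitative(abund, "jaccard")                 # emits the warning
warn_if_quantitative(abund, "jaccard", binary = TRUE)  # silent
warn_if_quantitative(abund > 0, "jaccard")             # silent: already 0/1
```

The check only inspects the data and the arguments; the distance itself would still be computed by a separate call such as `vegdist(abund, method = "jaccard", binary = TRUE)`.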

@georgeblck

georgeblck commented Jun 15, 2016

I don't understand how this could be a problem. As long as your input is a presence/absence matrix, that is, a matrix filled only with 0/1, it doesn't matter whether you specify binary = TRUE or binary = FALSE in the call to vegdist(x, method = "jaccard").
And if your input matrix isn't a presence/absence matrix, then there is no use in calculating the Jaccard distance without first doing a presence/absence standardization by setting binary = TRUE.
Or maybe I am missing something essential?

Here is example code that produces the two scenarios:

library(vegan)
library(ggplot2)
library(reshape2)

# set binary: TRUE if presence/absence data
#             FALSE if abundance data
binary <- TRUE

m <- 500
n <- m
set.seed(12345)
if(binary == TRUE){
  # Create matrix with random 0/1 values
  x <- sample.int (2, m*n, TRUE)-1L
  dim(x) <- c(m,n)  
} else {
  # Or create matrix with values in the range 0-10
  x <- matrix(round(runif(n*m, 0, 10)), nrow = m, ncol = n)  
}

#### Test the standardization 
jaccard <- vegdist(x, method = "jaccard", binary = FALSE)
jaccard.stand <- vegdist(x, method = "jaccard", binary = TRUE)
### Are they equal?
all(jaccard == jaccard.stand)

### Plot for safety
jacc.df <- melt(cbind(jaccard, jaccard.stand))
ggplot(jacc.df, aes(x = value, fill = Var2, colour = Var2)) + 
  geom_histogram(alpha = 0.2, position = "identity", binwidth = 0.01)

@eduardszoecs
Contributor

Maybe the binary argument should be removed from vegdist():
It should be the user's responsibility to transform the data as they want (binary is just a call to decostand).
This could resolve part of the confusion?
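A minimal sketch of the equivalence mentioned above, assuming the vegan package is installed. The matrix `abund` is made up for illustration. Transforming the data to presence/absence with `decostand(x, method = "pa")` and then calling `vegdist` should give the same distances as passing `binary = TRUE` directly.

```r
library(vegan)

# Hypothetical abundance data: 3 sites (rows) x 3 species (columns)
abund <- rbind(site1 = c(10, 0, 3),
               site2 = c(1, 5, 3),
               site3 = c(0, 2, 8))

# Route 1: let vegdist binarize the data
d_arg <- vegdist(abund, method = "jaccard", binary = TRUE)

# Route 2: binarize explicitly, then compute the distance
d_deco <- vegdist(decostand(abund, method = "pa"), method = "jaccard")

# Compare the numeric values only; the stored call attributes differ
all.equal(as.vector(d_arg), as.vector(d_deco))
```

The comparison uses `as.vector()` because the two `dist` objects record different calls, so comparing them with their attributes would report a mismatch even when the distances agree.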

@jarioksa
Contributor

jarioksa commented Jul 29, 2016

The ChangeLog says that the argument binary was added in version 1.6-5 (Oct 12, 2004) and gives this motivation:

  • vegdist: an option for binary indices, since some users believed
    these are not in vegan, although you can get them with
    'decostand'.

Even some printed papers claimed that you cannot have certain indices (such as Sørensen or binary Jaccard) in vegan, and I don't want to go back to that situation.

@JChristopherEllis

I like EDiLD's comment; perhaps an error would be better if the input is NOT in binary format.

@jarioksa
Contributor

@micromania2 Why should a legal, well and extensively documented call trigger an error? If you want a binary index, call a binary index. That's all that's needed. (The extensive documentation includes this thread, which I have not closed.)

@jarioksa
Contributor

I beg to disagree with @Edild: removing the argument binary would add to the confusion. If we did this, there would be no way of calculating the binary variant of an index without external functions to transform the input to binary. How would that help anybody? With the argument binary we can make users aware of the issue, but if we drop the argument we just hide the issue and mislead people. My piece of advice is that if you want a binary variant of an index, use the argument binary = TRUE. If you don't do so but use the default binary = FALSE, don't expect to get a binary variant except in the cases where no other alternatives are available, as documented in ?vegdist.

@CarlyMuletzWolz
Author

I still hold that a simple warning message would prevent any misunderstandings. This was my comment back in Jan 2016.

...it is easy to incorrectly calculate jaccard distances (most are taught it is on presence/absence data) when using vegdist. Since Jaccard and only a few other indices (Raup-Crick, others?) are the only indices that are generally thought of as presence/absence metrics/measures, would it be possible to return a warning message? For instance, if you run:
vegdist(x, method = "jaccard")

vegdist(x, method = "raup")

the warning message says:

Warning: if you wish to calculate distances on a presence/absence matrix, please specify binary = T

I often use the package phyloseq, which depends on vegan, and they have since updated their help files to indicate the need to specify binary = T. It is now just a standard part of my code and I had since forgotten about this post. I still think that for new users of vegan it may be helpful to clear up some of the confusion, but Jari, you are in charge, so I support your decision!
