Canned dissimilarities? #196

Open
jarioksa opened this Issue Sep 14, 2016 · 0 comments

jarioksa commented Sep 14, 2016 edited

Function designdist is currently faster than vegdist. With "binary" and "quadratic" terms it is much faster. With "minimum" terms (used by most dissimilarity indices in vegdist) it used to be slower, but I wrote C code with a .Call() interface to find those minimum terms (5fb205d), and now even these are faster than in vegdist. The speed comes at some cost:

  • The memory footprint of designdist is higher. I gave vegdist a .Call() interface, which further reduced its memory footprint, so the difference is even larger in 2.5-0 than it used to be (and still is in 2.4-1). In the same process I also made vegdist faster: it now matches stats::dist(), which used to be much faster. However, this does not close the gap to designdist (major changes in 8125d43).
  • Missing values in input data give missing dissimilarities (NA) in designdist, whereas vegdist can use "pairwise deletion". For "minimum" terms this is the main reason why designdist is faster.
  • designdist coefficients must be designed and written by the user, which may be tricky for some users.
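
To make the last point concrete, here is a minimal base R sketch of what the J, A and B terms stand for with one pair of sites, following the term definitions in the designdist documentation. The helper `pair_terms` and the toy vectors are hypothetical and only illustrate the notation; designdist itself works on whole data matrices.

```r
## Hypothetical helper: J, A and B for two abundance vectors,
## following the term definitions documented for designdist
pair_terms <- function(x, y, terms = c("binary", "minimum", "quadratic")) {
    terms <- match.arg(terms)
    switch(terms,
           ## shared species and species counts per site
           binary = list(J = sum(x > 0 & y > 0), A = sum(x > 0), B = sum(y > 0)),
           ## sum of pairwise minima and site totals
           minimum = list(J = sum(pmin(x, y)), A = sum(x), B = sum(y)),
           ## cross-product and sums of squares
           quadratic = list(J = sum(x * y), A = sum(x^2), B = sum(y^2)))
}

## Toy abundances for two sites (made up for illustration)
x <- c(3, 0, 2, 1)
y <- c(1, 4, 0, 1)

## The same formula gives different indices with different terms
sor <- with(pair_terms(x, y, "binary"), (A + B - 2 * J) / (A + B))
bray <- with(pair_terms(x, y, "minimum"), (A + B - 2 * J) / (A + B))
cosine <- with(pair_terms(x, y, "quadratic"), 1 - J / sqrt(A * B))
```

With "minimum" terms the formula reduces to the familiar Bray-Curtis form sum(abs(x - y)) / sum(x + y), which is one way to check a hand-written coefficient.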

The last point could be solved by providing a function with canned dissimilarity indices. We could have a long list of dissimilarity indices defined in designdist terms, and these could be selected by index name. The following function demonstrates the concept:

canneddist <-
    function(x, method)
{
    ## each entry gives a designdist formula and the terms it uses
    index <- list(
        "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
        "whittaker" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
        "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic"))
    ## index names can be abbreviated (partial matching)
    ind <- match.arg(method, names(index))
    z <- index[[ind]]
    designdist(x, method = z$method, terms = z$terms, name = ind)
}
## use this as
library(vegan)
data(dune)
canneddist(dune, "och")

The list of indices could grow to any desired size. For instance, an article by Z. Hubalek lists 86 binary indices, and there are many more.

The function is simple, but the real challenge is documentation. The list of indices is dynamic, and when it reaches something like 200 alternatives, we also need ways of paging the output, filtering the results, finding synonyms (there are synonyms even in the list above), etc. Currently I have a simple help argument in betadiver which lists the seventeen indices available there, but this would not be sufficient for this choice of canned dissimilarities.
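
One possible way to handle filtering and synonyms is to keep the indices in a data frame and query it with small helpers. The sketch below is only an illustration in base R: the names `index_table`, `find_index` and `find_synonyms` are hypothetical, and synonyms are detected naively as entries with an identical formula string and identical terms, which would miss algebraically equivalent formulas written differently.

```r
## Hypothetical registry: one row per canned index (entries from the list above)
index_table <- data.frame(
    name = c("sorensen", "bray", "whittaker", "ochiai", "cosine"),
    method = c(rep("(A+B-2*J)/(A+B)", 3), rep("1-J/sqrt(A*B)", 2)),
    terms = c("binary", "minimum", "binary", "binary", "quadratic"),
    stringsAsFactors = FALSE)

## Hypothetical helper: filter by a name pattern and/or by terms
find_index <- function(pattern = NULL, terms = NULL, table = index_table) {
    keep <- rep(TRUE, nrow(table))
    if (!is.null(pattern))
        keep <- keep & grepl(pattern, table$name, ignore.case = TRUE)
    if (!is.null(terms))
        keep <- keep & table$terms %in% terms
    table[keep, , drop = FALSE]
}

## Hypothetical helper: other names with the same formula string and terms
find_synonyms <- function(name, table = index_table) {
    i <- match(name, table$name)
    if (is.na(i))
        stop("unknown index: ", name)
    same <- table$method == table$method[i] & table$terms == table$terms[i]
    setdiff(table$name[same], name)
}

find_index(terms = "binary")   # all indices based on binary terms
find_synonyms("sorensen")      # other names for the same definition
```

A data frame also prints as a table and can be subset with ordinary R tools, so paging a long list would not need special code.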

We would probably also want optional fields like synonym and note, which could print a message() with the canonical name or implementation specifics for certain indices. An entry for source could also be useful to give a literature reference for each index (usually not the original but a textbook or similar), but this would call for a more complicated design: the same sources would be repeated across indices, and we do not want to write them out in full for each index.
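
A rough sketch of how such optional fields might work: synonyms point to a canonical entry, notes are printed with message(), and each reference is written once in a separate table and referred to by key. Everything here is hypothetical (the function `resolve_index`, the field layout, the keys `ref1` and `ref2`), and the note and reference texts are placeholders, not real annotations or citations.

```r
## Hypothetical source table: each reference is written once (placeholder texts)
sources <- c(ref1 = "<full reference 1>", ref2 = "<full reference 2>")

## Hypothetical entries: a synonym only names its canonical entry
index <- list(
    sorensen = list(method = "(A+B-2*J)/(A+B)", terms = "binary",
                    source = "ref1"),
    whittaker = list(synonym = "sorensen"),
    cosine = list(method = "1-J/sqrt(A*B)", terms = "quadratic",
                  note = "<placeholder note on implementation>",
                  source = "ref2"))

## Hypothetical helper: follow synonyms, print notes, look up the source
resolve_index <- function(name, index, sources) {
    name <- match.arg(name, names(index))
    z <- index[[name]]
    if (!is.null(z$synonym)) {
        message("'", name, "' is a synonym of '", z$synonym, "'")
        name <- z$synonym
        z <- index[[name]]
    }
    if (!is.null(z$note))
        message(z$note)
    src <- if (is.null(z$source)) NA_character_ else unname(sources[z$source])
    list(name = name, method = z$method, terms = z$terms, source = src)
}

r <- resolve_index("whit", index, sources)
## r$method and r$terms could then be passed on to designdist()
```

Keeping the references in their own table means several indices can share one key, at the price of one extra lookup step.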

What do you think of this idea? Should we have a function like this?

This popped up in issue #182 but I decided to make this a separate issue.
