Canned dissimilarities? #196
jarioksa added the feature-request label on Sep 14, 2016
jarioksa referenced this issue on Sep 14, 2016: designdist faster than vegdist for binary distances #182 (closed)
jarioksa added the request-for-comments label on Dec 30, 2016
jarioksa commented Sep 14, 2016 (edited)
Function `designdist` is currently faster than `vegdist`. With `"binary"` and `"quadratic"` terms it is much faster than `vegdist`. With `"minimum"` terms (used by most dissimilarity functions in `vegdist`) it used to be slower than `vegdist`, but I wrote C code with a `.Call()` interface to find those minimum terms (5fb205d), and now even these are faster than in `vegdist`. The speed comes at some cost:

- The memory footprint of `designdist` is higher. I gave `vegdist` a `.Call()` interface, which further reduced the memory footprint of `vegdist` and makes the difference even larger in 2.5-0 than it used to be (and still is in 2.4-1). In the same process I also made `vegdist` faster, and it now matches `stats::dist()`, which used to be much faster. However, this does not close the gap to `designdist` (major changes in 8125d43).
- Missing values (`NA`) cannot be handled in `designdist`, but in `vegdist` we can use "pairwise deletion". For `"minimum"` terms this is the main reason for the faster `designdist`.
- `designdist` coefficients must be designed and written, which may be tricky for some users.

The last point could be solved by providing a function of canned dissimilarity functions. We could have a long list of dissimilarity indices defined in `designdist` terms, and these could be selected with an index name. The following function demonstrates the concept:

The list of indices could grow to any desired size. For instance, an article by Z. Hubalek lists 86 binary indices, and there are many more.
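As a rough illustration of the canned-index idea described above (not the function originally posted in this issue), a lookup table of `designdist` definitions keyed by index name might look like the sketch below. The names `canned_indices` and `canned_dist` are hypothetical, and the four entries are only a small sample; the formulas follow the term definitions documented for `designdist`.

```r
# Hypothetical table of canned indices: each entry holds the arguments
# that would be passed on to vegan::designdist(). Illustrative sample only.
canned_indices <- list(
    sorensen  = list(method = "(A+B-2*J)/(A+B)",   terms = "binary"),
    jaccard   = list(method = "(A+B-2*J)/(A+B-J)", terms = "binary"),
    bray      = list(method = "(A+B-2*J)/(A+B)",   terms = "minimum"),
    euclidean = list(method = "sqrt(A+B-2*J)",     terms = "quadratic")
)

# Hypothetical wrapper: select a definition by (possibly abbreviated)
# name and hand it to designdist().
canned_dist <- function(x, index, table = canned_indices) {
    index <- match.arg(index, names(table))  # errors on unknown names
    def <- table[[index]]
    vegan::designdist(x, method = def$method, terms = def$terms,
                      name = index)
}

# Usage, assuming the vegan package is installed
if (requireNamespace("vegan", quietly = TRUE)) {
    data(dune, package = "vegan")
    d <- canned_dist(dune, "jaccard")
}
```

Adding an index then only means adding one list entry, which is what would let the table grow without touching the wrapper.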
The function is simple, but the real challenge is documentation. The list of indices is dynamic, and when it reaches something like 200 alternatives, we also need ways of paging the output, filtering the results, finding synonyms (there are synonyms even in the list above) etc. Currently I have a simple `help` argument in `betadiver` which lists the seventeen indices available there, but this would not be sufficient for this choice of canned dissimilarities.

Probably we would also want to have optional fields like `synonym` and `note` which could print a `message()` of canonical names or implementation specifics for certain indices. Perhaps an entry on `source` could also be useful to give the literature reference for each index (not usually the original but a text book or similar), but this would call for a more complicated design, as the same sources are duplicated and we do not want to write them in full for each index.

What do you think of this idea? Should we have a function like this?
This popped up in issue #182 but I decided to make this a separate issue.
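To make the documentation question more concrete, here is a minimal sketch of how the optional `synonym`, `note` and `source` fields and simple filtering could work. Everything in it is an assumption for illustration: the names `index_sources`, `index_table`, `resolve_index` and `list_indices` are hypothetical, the synonym shown is only an example, and the source entries are placeholders rather than real citations. Sources are stored once in a shared table and referenced by key, so they need not be written in full for each index.

```r
# Hypothetical shared reference table; entries are placeholders.
index_sources <- c(
    ref1 = "<full citation of a text book, written once>",
    ref2 = "<full citation of a review article, written once>"
)

# Hypothetical index table with optional synonym, note and source fields.
index_table <- list(
    sorensen = list(method = "(A+B-2*J)/(A+B)", terms = "binary",
                    synonym = "dice", source = "ref1"),
    jaccard  = list(method = "(A+B-2*J)/(A+B-J)", terms = "binary",
                    source = c("ref1", "ref2")),
    bray     = list(method = "(A+B-2*J)/(A+B)", terms = "minimum",
                    note = "equals 'sorensen' for presence/absence data",
                    source = "ref2")
)

# Map a user-supplied name to its canonical name, reporting synonyms
# with message() as suggested in the text.
resolve_index <- function(name, table = index_table) {
    name <- tolower(name)
    if (name %in% names(table))
        return(name)
    for (canonical in names(table)) {
        if (name %in% table[[canonical]]$synonym) {
            message("'", name, "' is a synonym of '", canonical, "'")
            return(canonical)
        }
    }
    stop("unknown index: ", name)
}

# Filter index names by a regular expression and/or by designdist terms.
list_indices <- function(pattern = "", terms = NULL, table = index_table) {
    keep <- grepl(pattern, names(table))
    if (!is.null(terms))
        keep <- keep & vapply(table, function(e) e$terms %in% terms,
                              logical(1))
    names(table)[keep]
}

# Usage
resolve_index("Dice")            # message about the synonym
list_indices(terms = "binary")   # only presence/absence indices
index_sources[index_table$jaccard$source]  # look up shared sources by key
```

Paging of long listings is left out here; with a table of a few hundred entries the filtered names could be passed to an existing pager rather than printed directly.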