Add global option old.bruvo.model; update docs

This global option gives users a single switch for the old model, rather than an argument that has to be set in every function that uses Bruvo's distance.
zkamvar · Aug 13, 2017 · 486928a
1 parent 6e2aaaf
commit 486928a

Showing 5 changed files with 85 additions and 54 deletions.
diff --git a/R/bruvo.r b/R/bruvo.r
@@ -96,7 +96,7 @@
 #'   distinct alleles at each locus, so you end up with genotypes that appear to
 #'   have a lower ploidy level than the organism.
 #'   
-#'   To help deal with these situations, Bruvo has suggested three methods for 
+#'   To help deal with these situations, Bruvo has suggested three methods for
 #'   dealing with these differences in ploidy levels: \itemize{ \item
 #'   \strong{Infinite Model} - The simplest way to deal with it is to count all
 #'   missing alleles as infinitely large so that the distance between it and
@@ -107,34 +107,48 @@
 #'   replace with all possible combinations of the observed alleles in the
 #'   shorter genotype}. For example, if there is a genotype of [69, 70, 0, 0]
 #'   where 0 is a missing allele, the possible combinations are: [69, 70, 69,
-#'   69], [69, 70, 69, 70], and [69, 70, 70, 70]. The resulting distances are
-#'   then averaged over the number of comparisons. \item \strong{Genome Loss
-#'   Model} - This is similar to the genome addition model, except that it
-#'   assumes that there was a recent genome reduction event and uses \strong{the
-#'   observed values in the full genotype to fill the missing values in the
-#'   short genotype}. As with the Genome Addition Model, the resulting distances
-#'   are averaged over the number of comparisons. \item \strong{Combination
-#'   Model} - Combine and average the genome addition and loss models. } As
-#'   mentioned above, the infinite model is biased, but it is not nearly as
+#'   69], [69, 70, 69, 70], [69, 70, 70, 69], and [69, 70, 70, 70]. The
+#'   resulting distances are then averaged over the number of comparisons. \item
+#'   \strong{Genome Loss Model} - This is similar to the genome addition model,
+#'   except that it assumes that there was a recent genome reduction event and
+#'   uses \strong{the observed values in the full genotype to fill the missing
+#'   values in the short genotype}. As with the Genome Addition Model, the
+#'   resulting distances are averaged over the number of comparisons. \item
+#'   \strong{Combination Model} - Combine and average the genome addition and
+#'   loss models. } 
+#'   
+#'   As mentioned above, the infinite model is biased, but it is not nearly as
 #'   computationally intensive as either of the other models. The reason for
 #'   this is that both of the addition and loss models requires replacement of
 #'   alleles and recalculation of Bruvo's distance. The number of replacements
-#'   required is equal to the multiset coefficient: \eqn{\left({n \choose
-#'   k}\right) == {(n+k-1) \choose k}}{choose(n+k-1, k)} where \emph{n} is the
-#'   number of potential replacements and \emph{k} is the number of alleles to
-#'   be replaced. So, for the example given above, The genome addition model
-#'   would require \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3}
-#'   calculations of Bruvo's distance, whereas the genome loss model would
-#'   require \eqn{\left({4 \choose 2}\right) = 10}{choose(4+2-1, 2) == 10}
-#'   calculations.
-#'   
+#'   required is equal to n^k where \emph{n} is the number of potential
+#'   replacements and \emph{k} is the number of alleles to be replaced.
+#'   
 #'   To reduce the number of calculations and assumptions otherwise, Bruvo's 
 #'   distance will be calculated using the largest observed ploidy in pairwise 
 #'   comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0], 
 #'   they will be treated as triploids.
 #'   }
 #'   
-#' @note Do not use missingno with this function. 
+#' @note Do not use missingno with this function.
+#'   \subsection{Missing alleles and Bruvo's distance in \pkg{poppr} versions < 2.5}{
+#'   In earlier versions of \pkg{poppr}, the authors had assumed that, because
+#'   the calculation of Bruvo's distance does not rely on ordered sets of
+#'   alleles, the imputation methods in the genome addition and genome loss
+#'   models would also assume unordered alleles for creating the hypothetical
+#'   genotypes. This means that the results from this imputation did not
+#'   consider all possible combinations of alleles, resulting in either an over-
+#'   or underestimation of Bruvo's distance between two samples with two or
+#'   more missing alleles. This version of \pkg{poppr} considers all possible
+#'   combinations when calculating Bruvo's distance for incomplete genotypes,
+#'   with a negligible increase in computation time.
+#'   
+#'   If you want to see the effect of this change on your data, you can use the
+#'   global \pkg{poppr} option \code{old.bruvo.model}. By default, this option
+#'   is \code{FALSE}; you can switch it on with
+#'   \code{options(old.bruvo.model = TRUE)}, but make sure to reset it to
+#'   \code{FALSE} afterwards.
+#'   }
 #'   \subsection{Repeat Lengths (replen)}{
 #'   The \code{replen} argument is crucial for proper analysis of Bruvo's
 #'   distance since the calculation relies on the knowledge of the number of
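The change in the replacement count documented above can be checked with a few lines of base R. This is only an illustrative sketch, not the package's C implementation: it enumerates the ordered fills for the example genotype [69, 70, 0, 0] and compares the count with the multiset coefficient that the old documentation used. The variable names are made up for the example.

```r
# Observed alleles in the shorter genotype [69, 70, 0, 0]
observed  <- c(69, 70)
n_missing <- 2

# Ordered fills: each missing position can take any observed allele (n^k rows)
fills <- expand.grid(rep(list(observed), n_missing))
ordered_genotypes <- cbind(
  matrix(observed, nrow = nrow(fills), ncol = length(observed), byrow = TRUE),
  as.matrix(fills)
)
dimnames(ordered_genotypes) <- NULL

# Number of replacements under the new (ordered) documentation: n^k
n_ordered <- nrow(ordered_genotypes)

# Number of replacements under the old (unordered) documentation:
# the multiset coefficient choose(n + k - 1, k)
n_unordered <- choose(length(observed) + n_missing - 1, n_missing)
```

Here `n_ordered` equals `length(observed)^n_missing`, and the extra row relative to `n_unordered` corresponds to the [69, 70, 70, 69] genotype that this commit adds to the documented list.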

diff --git a/R/internal.r b/R/internal.r
@@ -946,7 +946,7 @@ fix_negative_branch <- function(tre){
 #==============================================================================#
 
 bruvos_distance <- function(bruvomat, funk_call = match.call(), add = TRUE, 
-                            loss = TRUE, by_locus = FALSE, old_model = FALSE){
+                            loss = TRUE, by_locus = FALSE){
   x      <- bruvomat@mat
   ploid  <- bruvomat@ploidy
   replen <- bruvomat@replen
@@ -960,7 +960,14 @@ bruvos_distance <- function(bruvomat, funk_call = match.call(), add = TRUE,
   perms <- .Call("permuto", ploid, PACKAGE = "poppr")
 
   # Calculating bruvo's distance over each locus. 
-  distmat <- .Call("bruvo_distance", x, perms, ploid, add, loss, old_model, PACKAGE = "poppr")
+  distmat <- .Call("bruvo_distance", 
+                   x,     # data matrix
+                   perms, # permutation vector (0-indexed)
+                   ploid, # maximum ploidy
+                   add,   # Genome addition model switch
+                   loss,  # Genome loss model switch
+                   getOption("old.bruvo.model"), # switch to use unordered genotypes
+                   PACKAGE = "poppr")
 
   # If there are missing values, the distance returns 100, which means that the
   # comparison is not made. These are changed to NA.
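One thing to keep in mind when reading the option inside the function: `getOption()` returns `NULL` when an option has not been set, and in this diff the default is registered in `.onAttach`, which R runs when the package is attached. A defensive read could look like the sketch below; `use_old_bruvo_model` is a hypothetical helper for illustration and is not part of poppr.

```r
# Hypothetical helper (not part of poppr): read the switch defensively so that
# an unset option (NULL) or a non-logical value is treated as FALSE.
use_old_bruvo_model <- function() {
  isTRUE(getOption("old.bruvo.model", default = FALSE))
}
```

With a helper like this, the value handed to the C routine is always a single `TRUE` or `FALSE`, whether or not the defaults have been registered.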

diff --git a/R/zzz.r b/R/zzz.r
@@ -45,7 +45,8 @@
 .onAttach <- function(...) {
   op <- options()
   op.poppr <- list(
-    poppr.debug = FALSE # flag for verbosity
+    poppr.debug = FALSE, # flag for verbosity
+    old.bruvo.model = FALSE # flag for using the old model of Bruvo's distance.
   )
   toset <- !(names(op.poppr) %in% names(op))
   if(any(toset)) options(op.poppr[toset])
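Because the switch is global, it is easy to forget to turn it off again. A minimal sketch of one way to scope the change is shown below, using only base R; `with_old_bruvo_model` is a hypothetical wrapper for illustration and is not part of poppr.

```r
# Hypothetical helper (not part of poppr): evaluate `expr` with the old model
# switched on, then restore whatever value was set before, even on error.
with_old_bruvo_model <- function(expr) {
  old <- options(old.bruvo.model = TRUE)  # options() returns the previous value
  on.exit(options(old), add = TRUE)
  force(expr)                             # lazy argument is evaluated here
}
```

Any call that uses Bruvo's distance could be passed as `expr`; for example, `with_old_bruvo_model(getOption("old.bruvo.model"))` evaluates the argument while the option is `TRUE`, and the previous setting is back in place once the call returns.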

diff --git a/man/bruvo.dist.Rd b/man/bruvo.dist.Rd
diff --git a/vignettes/algo.Rnw b/vignettes/algo.Rnw
@@ -429,9 +429,9 @@ with these differences in ploidy levels \citep{Bruvo:2004}:
   through a recent genome expansion, the missing alleles will be replace with
   all possible combinations of the observed alleles in the shorter genotype. For
   example, if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele,
-  the possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], and [69,
-  70, 70, 70]. The resulting distances are then averaged over the number of
-  comparisons.
+  the possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], 
+  [69, 70, 70, 69], and [69, 70, 70, 70]. The resulting distances are then 
+  averaged over the number of comparisons.
   \item{Genome Loss Model -} This is similar to the genome addition model,
   except that it assumes that there was a recent genome reduction event and uses
   the observed values in the full genotype to fill the missing values in the
@@ -444,18 +444,14 @@ with these differences in ploidy levels \citep{Bruvo:2004}:
 As mentioned above, the infinite model is biased, but it is not nearly as
 computationally intensive as either of the other models. The reason for this is
 that both of the addition and loss models requires replacement of alleles and
-recalculation of Bruvo's distance. The number of replacements required is equal
-to the multiset coefficient: $\left({n \choose k}\right) == {(n-k+1) \choose k}$
+recalculation of Bruvo's distance. The number of replacements required is $n^k$
 where $n$ is the number of potential replacements and $k$ is the number of
-alleles to be replaced. So, for the example given above, The genome addition
-model would require $\left({2 \choose 2}\right) = 3$ calculations of Bruvo's
-distance, whereas the genome loss model would require $\left({4 \choose
-2}\right) = 10$ calculations.
+alleles to be replaced.
 
 To reduce the number of calculations and assumptions otherwise, Bruvo's distance
 will be calculated using the largest observed ploidy in pairwise comparisons. 
-This means that when
-comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.
+This means that when comparing [69,70,71,0] and [59,60,0,0], they will be 
+treated as triploids.
 
 \subsubsection{Choosing a model}
 \label{appendix:algorithm:bruvomodel}