
abs in centroid distance calculation #306

Closed · itcarroll opened this issue Feb 2, 2019 · 19 comments

itcarroll commented Feb 2, 2019

My colleague @mavolio and I have noticed that the betadisper function can give a result that differs from Primer-e. This situation arises when the input distance matrix violates the triangle inequality, for example:

> library(vegan)
> df <- read.csv(text = '
    Group, SpA, SpB
        1,   0,   2
        1,   1,   2
        1,   1,   0')
> dst <- vegdist(df[,-1], 'bray')
> dst
    1   2
2 0.2    
3 1.0 0.5
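
A quick base R check on the matrix form of dst confirms the violation:

> m <- as.matrix(dst)
> m[1, 3] <= m[1, 2] + m[2, 3]
[1] FALSE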

The "distance" from 1 to 3 is 1.0, but it's shorter to go the 0.2 to 2 and then 0.5 to 3. The distances to the group centroid that betadisper returns are all real values, due to the abs call in L130.

> dsp <- betadisper(dst, df$Group, type = 'centroid')
> dsp$distances
[1] 0.4509250 0.2160247 0.5228129

Without that abs, the result would appear to be meaningless because it is complex valued:

> dsp$distances
[1] 0.4509250+0.0000000i 0.0000000+0.2160247i 0.5228129+0.0000000i

It's not clear to us that a correct resolution of this situation is described in the literature anywhere. We think the result returned by betadisper may be faulty for two reasons:

  1. It's not what Primer-e returns. Primer-e returns only the real part of the complex array above.
  2. The sum of squared distances (0.523) does not equal the sum of squared inter-point distances divided by the number of points (0.43).

The second point arises from our reading of page 17 of the Primer-e manual. The sum of squares of the complex array above (0.43 + 0i) does pass the test, so despite being complex it may be the correct answer. In an upcoming patch to the NCEAS/codyn package we currently plan to return NA for centroid distances that fall into this category, because we don't feel the topic has passed through peer review. How did the vegan developers settle on the current approach of taking an absolute value?
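
For reference, both quantities in point 2 can be computed directly from the objects above:

> sum(dst^2) / 3        # squared inter-point distances over the number of points
[1] 0.43
> sum(dsp$distances^2)  # squared centroid distances as returned (with abs)
[1] 0.5233333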

jarioksa (Contributor) commented Feb 2, 2019

The vegan behaviour is documented (with an equation) on the betadisper help page. There we refer to Anderson (Biometrics 62, 245–253; 2006). We could also have referred to J. C. Gower, Properties of Euclidean and non-Euclidean distance matrices, Linear Algebra and its Applications 67, 81–97 (1985), which gives the basic equations, though (naturally) not yet in the context of betadisper. So I think it is documented, and it passed peer review some 34 years ago.
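
For reference, the documented equation (from the betadisper help page, following Anderson 2006; the notation here is approximate) combines the real (positive-eigenvalue) and imaginary (negative-eigenvalue) PCoA axes when computing the distance of point j in group i to its group centroid:

$$z_{ij}^c = \sqrt{\Delta^2(u_{ij}^+, c_i^+) - \Delta^2(u_{ij}^-, c_i^-)}$$

so a negative value under the square root is exactly the case at issue in this thread.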

jarioksa (Contributor) commented Feb 2, 2019

@gavinsimpson may remember more details, but we had a long discussion about this behaviour before implementation. However, it seems to be so ancient that the commits were before version control.

gavinsimpson (Contributor) commented:

@itcarroll thanks for the question & example. As @jarioksa mentions, we followed Marti Anderson's Biometrics paper closely, implementing the equations presented there. It has been quite some time since I looked at that paper, but I'll take a look next week and provide more detail if it jogs my memory. I'd suggest you also take a look at it, and if an issue remains we can revisit this.

Part of the difficulty here is that Primer-e is not open source, so I am unsure what it actually implements; I know there are some newer developments that Marti has included in Primer that, as far as I am aware, have not been published, which makes it even harder to know what that software is doing.

itcarroll (Author) commented:

Thank you, @gavinsimpson, for offering to look further into the question. I am familiar with Anderson's 2006 paper, and I'm pretty sure @jarioksa is referring to equation 3. The equation does not include taking an absolute value before taking the square root. In fact, on page 248, Anderson writes:

Although neither (3) nor (4) precludes the possibility of either z^c or z^m being complex values, in practice this would be highly unlikely to occur, provided a reasonable dissimilarity measure was chosen.

I think the example I've given above is one "highly unlikely" case, and am interested to know whether you find it so too.

It is unfortunate that Primer-e is not open source; while we cannot see what the code does, we can see the result. I've run this matrix through Primer-e, and the result is consistent with simply dropping the imaginary part of the centroid distances. My points 1 and 2 above address specific inconsistencies between the betadisper result and Marti Anderson's description and calculation (via Primer-e).

jarioksa (Contributor) commented Feb 3, 2019

@gavinsimpson it seems that we talked about the negative eigenvalues in 2009, but my archive is very patchy. However, ignoring negative eigenvalues and using only the real eigenvectors is clearly wrong (@itcarroll). The question is only how to handle the negative eigenvalues. In all decent vegan functions we handle negative eigenvalues & complex eigenvectors (dbrda, adonis, wcmdscale) – the only exception being capscale, which is there just for the "inertia" (= people have used it, and we keep it to maintain the history). capscale deals only with the real part of the dissimilarities, but we now think that is wrong. We wanted to keep the imaginary eigenvectors in betadisper, and I think you, @gavinsimpson, even consulted Marti Anderson when implementing the current behaviour, but I cannot find those emails ten years later.

itcarroll (Author) commented:

I'm not totally sure yet that the specific concern I'm raising is understood, so to be clear: I do not suggest ignoring components associated with negative eigenvalues. I'm interested in the case where dist.neg is non-zero (i.e. there is a negative eigenvalue) AND greater than dist.pos. I cannot find any discussion in the literature of rescuing this case from producing a complex-valued square root by taking the absolute value, as in line 130. Thanks for pursuing the source of this decision.
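
For concreteness, here is a minimal sketch of how these quantities arise (standard PCoA via Gower double-centring; the names dist.pos and dist.neg echo betadisper's internals, but this is an illustration rather than vegan's exact code). For the single-group example above, the group centroid is the PCoA origin:

d <- as.matrix(dst)                      # dst from the opening post
A <- -0.5 * d^2
G <- sweep(sweep(A, 1, rowMeans(A)), 2, colMeans(A)) + mean(A)   # Gower centring
e <- eigen(G, symmetric = TRUE)
scores <- e$vectors %*% diag(sqrt(abs(e$values)))
dist.pos <- rowSums(scores[, e$values >  1e-8, drop = FALSE]^2)  # real axes
dist.neg <- rowSums(scores[, e$values < -1e-8, drop = FALSE]^2)  # imaginary axes
dist.pos - dist.neg             # squared distances; the 2nd is negative
sqrt(abs(dist.pos - dist.neg))  # the contested abs: 0.451 0.216 0.523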

jarioksa (Contributor) commented Feb 4, 2019

@itcarroll: things come back to my mind by and by. We really did have a discussion on exactly this issue (negative squared distances larger than positive ones), but that was back in 2008, which is older than my email archive and older than having vegan in version control. However, there is an old ChangeLog entry on the issue from May 1, 2008:

  • betadisper: was not calculating distance to centroid correctly
    for observations where the imaginary distance to centroid was
    greater than the real distance (resulting in negative distance)
    which resulted in NaN when we took the square root. betadisper()
    now takes the absolute value of the combined distance before taking
    the square root. This is in-line with Marti Anderson's PERMDISP2.

I am quite sure that @gavinsimpson did the background work here (and I think all the work), and we also had a longer discussion about this issue, but I can't find any email from 2008 on that subject to see what kind of background work we had. I think Gavin consulted Marti, but I really cannot give any supporting evidence.

itcarroll (Author) commented Feb 4, 2019

Interesting, thanks. Hopefully Gavin has some recollection to share. For completeness, here's the output I get in the trial version of Primer 7, with a partial mismatch to the betadisper output (in the top post) for the three centroid distances.

[Screenshot: Primer 7 PERMDISP output for the example data, taken 2019-02-04]

jarioksa (Contributor) commented Feb 4, 2019

The Primer window banner seems to refer to this analysis as PERMDISP1, whereas our ChangeLog entry was about PERMDISP2 compatibility. I have no idea what these alternative PERMDISPs are, but how do their outputs compare? (I don't have a Primer licence nor a Windows computer.)

itcarroll (Author) commented:

Good question. Best I can tell, the 1 corresponds to "Data1" and "Resem1", which is just the default name given to windows in this workspace. If I load up a second workspace I get "Data2" and "PERMDISP2". Lord help us if that changes the underlying algorithm. I cannot find any reference in the software manual to different versions of PERMDISP.

jarioksa (Contributor) commented Feb 4, 2019

I think Marti Anderson had a stand-alone Windows binary called PERMDISP2, but I never tried that.

Here is a graph of your example case, where the PCoA has one positive and one negative eigenvalue. Point S2 clearly has a longer imaginary distance (negative squared distance) to the centroid (red cross) than real distance (positive squared distance; this is zero for the middle point). Primer gives that as a zero distance, we give a longish distance, and you suggest reporting it as a NaN (or NA) distance. I am undecided at the moment.
[Plot: PCoA of the example points with the group centroid marked by a red cross]

itcarroll (Author) commented:

A fourth option is to return a complex-valued "distance". An argument for doing so is that the result appears to agree with what Anderson calls Huygens' theorem, as applied to Euclidean distances. An argument against is that it's uninterpretable.
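
For reference, the identity in question – Huygens' theorem, stated here for ordinary Euclidean coordinates – says that the sum of squared distances from n points to their centroid equals the sum of squared inter-point distances divided by n:

$$\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 = \frac{1}{n} \sum_{i<j} \lVert x_i - x_j \rVert^2$$

This is the identity that the complex-valued result satisfies in the example above (both sides give 0.43).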

jarioksa (Contributor) commented Feb 6, 2019

@itcarroll: I had a look at the issue yesterday, and after sleeping on it I think you are right: the distance should be complex valued. This is indeed unfortunate. In general, we want methods to help people understand their data, and returning imaginary distances is not very helpful. You already have an n-dimensional space, but then a point finds a wormhole in that space and shoots into imaginary space.

In the one-class case, the sum of squared distances should equal the sum of eigenvalues in betadisper with type = "centroid" (in the multi-class case, the sum of squared distances should equal the sum of residual eigenvalues (the CA component) in the corresponding dbrda). In your example, this holds only if the squared distance can be negative, meaning that the corresponding distance must be complex valued.

> x
     [,1] [,2]
[1,]    0    2
[2,]    1    2
[3,]    1    0
> m <- betadisper(vegdist(x), rep("a",3), type="centr")
> m$eig
      PCoA1       PCoA2 
 0.50637605 -0.07637605 
> sum(m$eig)
[1] 0.43
> m$distances # 2nd distance should be imaginary
[1] 0.4509250 0.2160247 0.5228129 
> sum(m$distances^2)
[1] 0.5233333
> sum(m$distances[c(1,3)]^2) - m$distances[2]^2 # 2nd square should be negative
[1] 0.43

The absolute value that we report for that imaginary component is correct in magnitude, but we give no indication that the value should be complex valued, and this is not correct. Giving the value as zero is incorrect as well. Giving it as NA is correct in some sense, but it does not solve the inconsistency between squared distances and eigenvalues, and it would be confusing, since users would see in plot and other analyses that the point is not at zero distance from the centroid. Moreover, some methods (anova, boxplot, TukeyHSD) will just omit the NA and give misleading results, permutest will fail, and print will show the whole class average as NA. Returning these values as negative distances could actually be more correct (although confusing), but naturally the squares of negative values are positive and will not give the correct sum of squared distances.
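
A small illustration of the silent omission (hypothetical toy data; lm drops NA rows by default):

y <- c(0.45, NA, 0.52, 0.40, 0.44)   # NA standing in for the problem distance
g <- factor(c(1, 1, 1, 2, 2))
anova(lm(y ~ g))                     # fits on 4 observations, with no warning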

I think something must be changed in betadisper, but I am not quite sure what. The first implementation gave these cases as NaN (square root of a negative number), which we changed in 2008 in response to user feedback. @gavinsimpson what do you think?

itcarroll (Author) commented Feb 10, 2019

While hoping to hear better ideas from Gavin, I've been wondering ...

Is there any hope that anova.betadisper could be correctly implemented, given that only squared values go into calculating the test statistic? Usually any notion of squaring a complex z = a + bi means Conj(z)*z = a^2 + b^2. Blindly* taking it instead to mean a^2 - b^2 guarantees real sums of squares, though not positive ones, but what happens if you stick with it anyway?

Set up the problem with a second (well-behaved) group:

df <- read.csv(text = '
    Group, SpA, SpB
        1,   0,   2
        1,   1,   2
        1,   1,   0
        2,   0,   2
        2,   1,   2')
dst <- vegdist(df[,-1])
dsp <- betadisper(dst, df$Group, 'centroid')
dsp$distances[[2]] <- dsp$distances[[2]] * 1i ## the "correction"

Calculate the usual one-way ANOVA F statistic but with z^2 = a^2 - b^2:

# means (replicated to length y); magrittr provides %>%
library(magrittr)
y <- dsp[['distances']]
n <- aggregate(y, dsp['group'], length)[[2]]
y.. <- mean(y) %>% mapply(rep, ., n) %>% unlist
yk. <- aggregate(y, dsp['group'], mean)[[2]] %>% mapply(rep, ., n) %>% unlist

# explained
mdl <- yk. - y..
XSS <- sum(Re(mdl)^2 - Im(mdl)^2)
MXSS <- XSS / (length(n) - 1)

# residual
rsd <- y - yk.
RSS <- sum(Re(rsd)^2 - Im(rsd)^2)
MRSS <- RSS / (sum(n) - length(n))

Could the mean sum of squares ratio (here MXSS / MRSS, equal to 1.26) possibly be F distributed?

*uh oh. this kind of looks like https://en.wikipedia.org/wiki/Minkowski_space...
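
One way to probe the F-distribution question empirically (a sketch, assuming dsp and the complex "correction" from the setup above; note this permutes the distances directly, which is cruder than permutest's scheme):

pseudoF <- function(y, g) {
  n   <- tapply(y, g, length)
  gm  <- ave(Re(y), g) + 1i * ave(Im(y), g)    # group means, replicated to length y
  XSS <- sum(Re(gm - mean(y))^2 - Im(gm - mean(y))^2)
  RSS <- sum(Re(y - gm)^2 - Im(y - gm)^2)
  (XSS / (length(n) - 1)) / (RSS / (sum(n) - length(n)))
}
obs  <- pseudoF(dsp$distances, dsp$group)      # 1.26, as above
perm <- replicate(999, pseudoF(dsp$distances, sample(dsp$group)))
mean(c(obs, perm) >= obs)                      # crude permutation p-value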

jarioksa (Contributor) commented:

It is obvious that abs is wrong and should be changed. The question is only how to change it.

The easiest way is to use zero (0), probably with a warning about negative squared distances. The current practice of taking abs inflates within-group dispersion. Zero is still too long, but it is more correct than abs. Zero is my recommendation for the first line of defence.

Using NA sounds good, but then we run into complications in the support functions. For instance, NAs are silently removed in plots and in parametric anova, and this again inflates dispersions by removing very short distances. If we go for NA, many support functions must be changed to handle these.

Complex-valued output makes sense, but leaves people in trouble, since they would need to handle complex vectors. The current support functions do not handle them and would all have to be rewritten.

The complex distances arise when we have negative squared distances. If we can work with squared distances, we avoid the explicit problem. As @itcarroll demonstrates above, some of the support functions (parametric anova) can be changed to handle these negative squared distances correctly.

My suggestion is to first change negative squared distances to zero distances, and then at a second stage consider whether we should change some support functions to use negative squared distances.
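
In code terms, the first-line fix might look like this (a sketch of the idea only, not the actual patch; dist.pos and dist.neg as sketched earlier in the thread):

sq <- dist.pos - dist.neg
if (any(sq < 0))
  warning("some squared distances to centroid were negative and were set to zero")
z <- sqrt(pmax(sq, 0))   # zero, rather than abs(), for negative squared distances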

jarioksa pushed a commit that referenced this issue Mar 19, 2019
Some squared distances to centroid can be negative, giving complex-valued distances. We used to take the Mod of these complex values (or the sqrt of the abs difference), which replaces these "more similar than identical" distances with a positive value, inflating estimates of within-group distances. Now we take only the real part (or zero) for these distances. This is still biased, but less so than the old practice.
jarioksa (Contributor) commented:

I pushed branch issue-#306 to GitHub. Please work on it if you want to develop the approach I took. If you want something radically different, publish a new alternative branch. Nothing is merged yet.

itcarroll (Author) commented:

I think your approach is a good plan. My only suggestion is to consider issuing a warning when negative squared distances are encountered.

jarioksa (Contributor) commented:

My suggested change already issues a warning on negative squared distances. @itcarroll, do you suggest I should reconsider this and remove the warning? (I do know – life teaches you – that this warning will cause confusion and many email messages to me.)

itcarroll (Author) commented:

Sorry @jarioksa, I had not looked at your branch. Carry on (and sorry for the emails)!

jarioksa pushed a commit that referenced this issue Mar 28, 2019 (cherry picked from commit 48b616e, with the same message as the Mar 19 commit above)

jarioksa pushed a commit that referenced this issue Apr 3, 2019
jarioksa closed this as completed Sep 8, 2019