
abs in centroid distance calculation #306

Closed · itcarroll opened this issue Feb 2, 2019 · 19 comments

itcarroll commented Feb 2, 2019

My colleague @mavolio and I have noticed that the betadisper function can give a result that differs from Primer-e. This situation arises when the input distance matrix violates the triangle inequality, for example:

> library(vegan)
> df <- read.csv(text = '
    Group, SpA, SpB
        1,   0,   2
        1,   1,   2
        1,   1,   0')
> dst <- vegdist(df[,-1], 'bray')
> dst
    1   2
2 0.2    
3 1.0 0.5
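
A quick base R check on the matrix form of dst confirms the violation:

> m <- as.matrix(dst)
> m[1, 3] <= m[1, 2] + m[2, 3]
[1] FALSE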

The "distance" from 1 to 3 is 1.0, but it's shorter to go the 0.2 to 2 and then 0.5 to 3. The distances to the group centroid that betadisper returns are all real values, due to the abs call in L130.

> dsp <- betadisper(dst, df$Group, type = 'centroid')
> dsp$distances
[1] 0.4509250 0.2160247 0.5228129

Without that abs, the result would appear to be meaningless because it is complex valued:

> dsp$distances
[1] 0.4509250+0.0000000i 0.0000000+0.2160247i 0.5228129+0.0000000i

It's not clear to us that a correct resolution of this situation is described in the literature anywhere. We think the result returned by betadisper may be faulty for two reasons:

  1. It's not what Primer-e returns. Primer-e returns only the real part of the complex array above.
  2. The sum of squared distances (0.523) does not equal the sum of squared inter-point distances divided by the number of points (0.43).

The second point arises from our reading of page 17 of the Primer-e manual. The sum of squares of the complex array above (0.43 + 0i) does pass the test, so despite being complex it may be the correct answer. In an upcoming patch to the NCEAS/codyn package we currently plan to return NA for centroid distances that fall into this category, because we don't feel the topic has passed through peer review. How did the vegan developers settle on the current approach of taking an absolute value?
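
For reference, both quantities in point 2 can be computed directly from the objects above:

> sum(dst^2) / 3        # squared inter-point distances over the number of points
[1] 0.43
> sum(dsp$distances^2)  # squared centroid distances as returned (with abs)
[1] 0.5233333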

jarioksa (Contributor) commented Feb 2, 2019

The vegan behaviour is documented (with an equation) on the betadisper help page. There we refer to Anderson (Biometrics 62, 245–253; 2006). We could also have referred to J. C. Gower, Properties of Euclidean and non-Euclidean distance matrices, Linear Algebra and its Applications 67, 81–97 (1985), which gives the basic equations, though (naturally) not yet in the context of betadisper. So I think it is documented, and it passed peer review some 34 years ago.
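
For reference, the documented equation (from the betadisper help page, following Anderson 2006; the notation here is approximate) combines the real (positive-eigenvalue) and imaginary (negative-eigenvalue) PCoA axes when computing the distance of point j in group i to its group centroid:

$$z_{ij}^c = \sqrt{\Delta^2(u_{ij}^+, c_i^+) - \Delta^2(u_{ij}^-, c_i^-)}$$

so a negative value under the square root is exactly the case at issue in this thread.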

jarioksa (Contributor) commented Feb 2, 2019

@gavinsimpson may remember more details, but we had a long discussion about this behaviour before implementation. However, it seems to be so ancient that the commits were before version control.

gavinsimpson (Contributor) commented:

@itcarroll thanks for the question & example. As @jarioksa mentions, we followed Marti Anderson's Biometrics paper closely, implementing the equations presented there. It has been quite some time since I looked at that paper, but I'll take a look next week and provide more detail if it jogs my memory. I'd suggest you also take a look at it, and if an issue remains we can revisit this.

Part of the difficulty here is that Primer-e is not open source, so I am unsure what it actually implements; I know there are some newer developments that Marti has included in Primer that, as far as I am aware, have not been published, which makes it even harder to know what that software is doing.

itcarroll (Author) commented:

Thank you, @gavinsimpson, for offering to look further into the question. I am familiar with Anderson's 2006 paper, and I'm pretty sure @jarioksa is referring to equation 3. The equation does not include taking an absolute value before taking the square root. In fact, on page 248, Anderson writes:

Although neither (3) nor (4) precludes the possibility of either z^c or z^m being complex values, in practice this would be highly unlikely to occur, provided a reasonable dissimilarity measure was chosen.

I think the example I've given above is one "highly unlikely" case, and am interested to know whether you find it so too.

It is unfortunate that Primer-e is not open source; while we cannot see what the code does, we can see the result. I've run this matrix through Primer-e, and the result is consistent with simply dropping the imaginary part of the centroid distances. My points 1 and 2 above address specific inconsistencies between the betadisper result and Marti Anderson's description and calculation (via Primer-e).

jarioksa (Contributor) commented Feb 3, 2019

@gavinsimpson it seems that we talked about the negative eigenvalues in 2009, but my archive is very patchy. However, ignoring negative eigenvalues and using only the real eigenvectors is clearly wrong (@itcarroll). The question is only how to handle the negative eigenvalues. In all decent vegan functions we handle negative eigenvalues & complex eigenvectors (dbrda, adonis, wcmdscale) – the only exception being capscale, which is there just for the "inertia" (= people have used it, and we keep it to maintain the history). capscale deals only with the real part of the dissimilarities, but we now think that is wrong. We wanted to keep the imaginary eigenvectors in betadisper, and I think you, @gavinsimpson, even consulted Marti Anderson when implementing the current behaviour, but I cannot find those emails ten years later.

itcarroll (Author) commented:

I'm not totally sure yet that the specific concern I'm raising is understood, so to be clear: I do not suggest ignoring components associated with negative eigenvalues. I'm interested in the case where dist.neg is non-zero (i.e. there is a negative eigenvalue) AND greater than dist.pos. I cannot find any discussion in the literature of rescuing this case from producing a complex-valued square root by taking the absolute value, as in line 130. Thanks for pursuing the source of this decision.
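
For concreteness, here is a minimal sketch of how these quantities arise (standard PCoA via Gower double-centring; the names dist.pos and dist.neg echo betadisper's internals, but this is an illustration rather than vegan's exact code). For the single-group example above, the group centroid is the PCoA origin:

d <- as.matrix(dst)                      # dst from the opening post
A <- -0.5 * d^2
G <- sweep(sweep(A, 1, rowMeans(A)), 2, colMeans(A)) + mean(A)   # Gower centring
e <- eigen(G, symmetric = TRUE)
scores <- e$vectors %*% diag(sqrt(abs(e$values)))
dist.pos <- rowSums(scores[, e$values >  1e-8, drop = FALSE]^2)  # real axes
dist.neg <- rowSums(scores[, e$values < -1e-8, drop = FALSE]^2)  # imaginary axes
dist.pos - dist.neg             # squared distances; the 2nd is negative
sqrt(abs(dist.pos - dist.neg))  # the contested abs: 0.451 0.216 0.523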

jarioksa (Contributor) commented Feb 4, 2019

@itcarroll: things come back to my mind by and by. We really did have a discussion on exactly this issue (negative squared distances larger than positive ones), but that was back in 2008, which is older than my email archive and older than having vegan in version control. However, there is an old ChangeLog entry on the issue from May 1, 2008:

  • betadisper: was not calculating distance to centroid correctly
    for observations where the imaginary distance to centroid was
    greater than the real distance (resulting in negative distance)
    which resulted in NaN when we took the square root. betadisper()
    now takes the absolute value of the combined distance before taking
    the square root. This is in-line with Marti Anderson's PERMDISP2.

I am quite sure that @gavinsimpson did the background work here (and I think all the work), and we also had a longer discussion about this issue, but I can't find any email from 2008 on that subject to see what kind of background work we had. I think Gavin consulted Marti, but I really cannot give any supporting evidence.

itcarroll (Author) commented Feb 4, 2019

Interesting, thanks. Hopefully Gavin has some recollection to share. For completeness, here's the output I get in the trial version of Primer 7, with a partial mismatch to the betadisper output (in the top post) for the three centroid distances.

[Screenshot: Primer 7 PERMDISP output for the example data, taken 2019-02-04]

jarioksa (Contributor) commented Feb 4, 2019

The Primer window banner seems to refer to this analysis as PERMDISP1, whereas our ChangeLog entry was about PERMDISP2 compatibility. I have no idea what these alternative PERMDISPs are, but how do their outputs compare? (I don't have a Primer licence nor a Windows computer.)

itcarroll (Author) commented:

Good question. Best I can tell, the 1 corresponds to "Data1" and "Resem1", which is just the default name given to windows in this workspace. If I load up a second workspace I get "Data2" and "PERMDISP2". Lord help us if that changes the underlying algorithm. I cannot find any reference in the software manual to different versions of PERMDISP.

jarioksa (Contributor) commented Feb 4, 2019

I think Marti Anderson had a stand-alone Windows binary called PERMDISP2, but I never tried that.

Here is a graph of your example case, where the PCoA has one positive and one negative eigenvalue. Point S2 clearly has a longer imaginary distance (negative squared distance) to the centroid (red cross) than real distance (positive squared distance; this is zero for the middle point). Primer gives that as a zero distance, we give a longish distance, and you suggest reporting it as a NaN (or NA) distance. I am undecided at the moment.
[Plot: PCoA of the example points with the group centroid marked by a red cross]

itcarroll (Author) commented:

A fourth option is to return a complex-valued "distance". An argument for doing so is that the result appears to agree with what Anderson calls Huygens' theorem, as applied to Euclidean distances. An argument against is that it's uninterpretable.
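
For reference, the identity in question – Huygens' theorem, stated here for ordinary Euclidean coordinates – says that the sum of squared distances from n points to their centroid equals the sum of squared inter-point distances divided by n:

$$\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 = \frac{1}{n} \sum_{i<j} \lVert x_i - x_j \rVert^2$$

This is the identity that the complex-valued result satisfies in the example above (both sides give 0.43).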

jarioksa (Contributor) commented Feb 6, 2019

@itcarroll: I had a look at the issue yesterday, and after sleeping on it I think you are right: the distance should be complex valued. This is indeed unfortunate. In general, we want methods to help people understand their data, and returning imaginary distances is not very helpful. You already have an n-dimensional space, but then a point finds a wormhole in that space and shoots into imaginary space.

In the one-class case, the sum of squared distances should equal the sum of eigenvalues in betadisper with type = "centroid" (in the multi-class case, the sum of squared distances should equal the sum of residual eigenvalues (the CA component) in the corresponding dbrda). In your example, this holds only if the squared distance can be negative, meaning that the corresponding distance must be complex valued.

> x
     [,1] [,2]
[1,]    0    2
[2,]    1    2
[3,]    1    0
> m <- betadisper(vegdist(x), rep("a",3), type="centr")
> m$eig
      PCoA1       PCoA2 
 0.50637605 -0.07637605 
> sum(m$eig)
[1] 0.43
> m$distances # 2nd distance should be imaginary
[1] 0.4509250 0.2160247 0.5228129 
> sum(m$distances^2)
[1] 0.5233333
> sum(m$distances[c(1,3)]^2) - m$distances[2]^2 # 2nd square should be negative
[1] 0.43

The absolute value that we report for that imaginary component is correct in magnitude, but we give no indication that the value should be complex valued, and this is not correct. Giving the value as zero is incorrect as well. Giving it as NA is correct in some sense, but it does not solve the inconsistency between squared distances and eigenvalues, and it would be confusing, since users would see in plot and other analyses that the point is not at zero distance from the centroid. Moreover, some methods (anova, boxplot, TukeyHSD) will just omit the NA and give misleading results, permutest will fail, and print will show the whole class average as NA. Returning these values as negative distances could actually be more correct (although confusing), but naturally the squares of negative values are positive and will not give the correct sum of squared distances.
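
A small illustration of the silent omission (hypothetical toy data; lm drops NA rows by default):

y <- c(0.45, NA, 0.52, 0.40, 0.44)   # NA standing in for the problem distance
g <- factor(c(1, 1, 1, 2, 2))
anova(lm(y ~ g))                     # fits on 4 observations, with no warning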

I think something must be changed in betadisper, but I am not quite sure what. The first implementation gave these cases as NaN (square root of a negative number), which we changed in 2008 in response to user feedback. @gavinsimpson what do you think?

itcarroll (Author) commented Feb 10, 2019

While hoping to hear better ideas from Gavin, I've been wondering ...

Is there any hope that anova.betadisper could be correctly implemented, given that only squared values go into calculating the test statistic? Usually any notion of squaring a complex z = a + bi means Conj(z)*z = a^2 + b^2. Blindly* taking it instead to mean a^2 - b^2 guarantees real sums of squares, though not positive ones, but what happens if you stick with it anyway?

Set up the problem with a second (well-behaved) group:

df <- read.csv(text = '
    Group, SpA, SpB
        1,   0,   2
        1,   1,   2
        1,   1,   0
        2,   0,   2
        2,   1,   2')
dst <- vegdist(df[,-1])
dsp <- betadisper(dst, df$Group, 'centroid')
dsp$distances[[2]] <- dsp$distances[[2]] * 1i ## the "correction"

Calculate the usual one-way ANOVA F statistic but with z^2 = a^2 - b^2:

# means (replicated to length y); magrittr provides %>%
library(magrittr)
y <- dsp[['distances']]
n <- aggregate(y, dsp['group'], length)[[2]]
y.. <- mean(y) %>% mapply(rep, ., n) %>% unlist
yk. <- aggregate(y, dsp['group'], mean)[[2]] %>% mapply(rep, ., n) %>% unlist

# explained
mdl <- yk. - y..
XSS <- sum(Re(mdl)^2 - Im(mdl)^2)
MXSS <- XSS / (length(n) - 1)

# residual
rsd <- y - yk.
RSS <- sum(Re(rsd)^2 - Im(rsd)^2)
MRSS <- RSS / (sum(n) - length(n))

Could the mean sum of squares ratio (here MXSS / MRSS, equal to 1.26) possibly be F distributed?

*uh oh. this kind of looks like https://en.wikipedia.org/wiki/Minkowski_space...
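
One way to probe the F-distribution question empirically (a sketch, assuming dsp and the complex "correction" from the setup above; note this permutes the distances directly, which is cruder than permutest's scheme):

pseudoF <- function(y, g) {
  n   <- tapply(y, g, length)
  gm  <- ave(Re(y), g) + 1i * ave(Im(y), g)    # group means, replicated to length y
  XSS <- sum(Re(gm - mean(y))^2 - Im(gm - mean(y))^2)
  RSS <- sum(Re(y - gm)^2 - Im(y - gm)^2)
  (XSS / (length(n) - 1)) / (RSS / (sum(n) - length(n)))
}
obs  <- pseudoF(dsp$distances, dsp$group)      # 1.26, as above
perm <- replicate(999, pseudoF(dsp$distances, sample(dsp$group)))
mean(c(obs, perm) >= obs)                      # crude permutation p-value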

jarioksa (Contributor) commented:

It is obvious that abs is wrong and should be changed. The question is only how to change it.

The easiest way is to use zero (0), probably with a warning about negative squared distances. The current practice of taking abs inflates within-group dispersion. Zero is still too long, but it is more correct than abs. Zero is my recommendation for the first line of defence.

Using NA sounds good, but then we run into complications in the support functions. For instance, NAs are silently removed in plots and in parametric anova, and this again inflates dispersions by removing very short distances. If we go for NA, many support functions must be changed to handle these.

Complex-valued output makes sense, but leaves people in trouble, since they would need to handle complex vectors. The current support functions do not handle them and would all have to be rewritten.

The complex distances arise when we have negative squared distances. If we can work with squared distances, we avoid the explicit problem. As @itcarroll demonstrates above, some of the support functions (parametric anova) can be changed to handle these negative squared distances correctly.

My suggestion is to first change negative squared distances to zero distances, and then at a second stage consider whether we should change some support functions to use negative squared distances.
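
In code terms, the first-line fix might look like this (a sketch of the idea only, not the actual patch; dist.pos and dist.neg as sketched earlier in the thread):

sq <- dist.pos - dist.neg
if (any(sq < 0))
  warning("some squared distances to centroid were negative and were set to zero")
z <- sqrt(pmax(sq, 0))   # zero, rather than abs(), for negative squared distances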

jarioksa pushed a commit that referenced this issue Mar 19, 2019
Some squared distances to centroid can be negative, giving complex-valued distances. We used to take the Mod of these complex values (or the sqrt of the abs difference), which replaces these "more similar than identical" distances with a positive value, inflating estimates of within-group distances. Now we take only the real part (or zero) for these distances. This is still biased, but less so than the old practice.
jarioksa (Contributor) commented:

I pushed branch issue-#306 to GitHub. Please work on it if you want to develop the approach I took. If you want something radically different, publish a new alternative branch. Nothing is merged yet.

itcarroll (Author) commented:

I think your approach is a good plan. My only suggestion is to consider issuing a warning when negative squared distances are encountered.

jarioksa (Contributor) commented:

My suggested change already issues a warning on negative squared distances. @itcarroll, do you suggest I should reconsider this and remove the warning? (I do know – life teaches you – that this warning will cause confusion and many email messages to me.)

itcarroll (Author) commented:

Sorry @jarioksa, I had not looked at your branch. Carry on (and sorry for the emails)!

jarioksa pushed a commit that referenced this issue Mar 28, 2019 (cherry picked from commit 48b616e, with the same message as the Mar 19 commit above)

jarioksa pushed a commit that referenced this issue Apr 3, 2019
jarioksa closed this as completed Sep 8, 2019