Discrepancy between expected and observed coalescence time in beta coalescent for large values of alpha #2256

Closed
Sendrowski opened this issue Feb 5, 2024 · 23 comments · Fixed by #2257 or #2267

Comments

@Sendrowski

Sendrowski commented Feb 5, 2024

Hey!

There appears to be an upward bias in the simulated coalescence times of the beta coalescent, relative to my expectation based on the time scaling given in the documentation. The bias becomes noticeable from alpha=1.8 and increases substantially for values closer to 2.

Relative error between observed and expected coalescent times plotted against alpha:
[image attachment]

Code to generate figure:

import multiprocessing as mp

import matplotlib.pyplot as plt
import msprime
import numpy as np
import scipy.special


def compute_beta_timescale(pop_size, alpha, ploidy):
    """
    Compute the generation time for the beta coalescent exactly as done in msprime when testing.
    https://github.com/tskit-dev/msprime/blob/804e0361c4f8b5f5051a9fbf411054ee8be3426a/verification.py#L3447
    """

    if ploidy > 1:
        N = pop_size / 2
        m = 2 + np.exp(
            alpha * np.log(2) + (1 - alpha) * np.log(3) - np.log(alpha - 1)
        )
    else:
        N = pop_size
        m = 1 + np.exp((1 - alpha) * np.log(2) - np.log(alpha - 1))
    ret = np.exp(
        alpha * np.log(m)
        + (alpha - 1) * np.log(N)
        - np.log(alpha)
        - scipy.special.betaln(2 - alpha, alpha)
    )
    return ret


def simulate_tree_height(alpha):
    """
    Simulate genealogies under the beta coalescent and compute the average tree height.
    Since we only have two lineages, the mean coalescence time should coincide with `compute_beta_timescale()`.
    """

    # simulate genealogies under the beta coalescent
    g = msprime.sim_ancestry(
        samples=2,
        num_replicates=100000,
        model=msprime.BetaCoalescent(alpha=alpha),
        ploidy=1
    )

    return np.mean([ts.first().total_branch_length / 2 for ts in g])


if __name__ == "__main__":
    alphas = np.linspace(1.99, 1.999, 30)

    # simulate average tree heights in parallel
    with mp.Pool() as pool:
        heights_observed = np.array(pool.map(simulate_tree_height, alphas))

    # compute theoretical tree heights
    heights_theoretical = np.array([compute_beta_timescale(1, alpha, 1) for alpha in alphas])

    # compute relative difference between observed and theoretical
    diff_rel = (heights_observed - heights_theoretical) / heights_theoretical

    # plot relative difference against alpha
    plt.plot(alphas, diff_rel)
    plt.margins(0)
    plt.xlabel('alpha')
    plt.ylabel('diff_rel')
    plt.show()

Am I perhaps missing something?

@jeromekelleher
Member

Thanks for the report @Sendrowski - the figure seems to have gone missing, would you mind posting again please?

@Sendrowski
Author

Hm that’s interesting. Did it work this time?

[image attachment]

@jeromekelleher
Member

No - maybe a problem with GitHub?

@Sendrowski
Author

Possibly. I managed to view it on a different device. Here is the link to the image, in case that helps.

@jeromekelleher
Member

Sorry, still not showing up for me (I got a 404 on that link)

@Sendrowski
Author

And here is an iCloud link. Hope it works :)

@jeromekelleher
Member

[screenshot attachment: Screenshot from 2024-02-05 15-08-50]

@jeromekelleher
Member

Any thoughts here @JereKoskela ?

@JereKoskela
Member

I ran a C++ script simulating Beta(2 - alpha, alpha) random variables for alpha very close to 2, and relative errors of 5 or 6 per cent in an empirical mean from 10k samples are common because the true mean is so small. I think that explains what is going on below alpha = 1.995 or so.
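As a rough sketch of that sampling noise (illustrative only, not the C++ script mentioned above; the alpha value and batch size are chosen to match the numbers quoted here):

import numpy as np

rng = np.random.default_rng(1)
alpha = 1.995

# The mean of a Beta(2 - alpha, alpha) random variable is (2 - alpha) / 2,
# which is tiny for alpha near 2, so empirical means from 10k draws routinely
# miss it by several per cent in either direction.
true_mean = (2 - alpha) / 2
for _ in range(5):
    draws = rng.beta(2 - alpha, alpha, size=10_000)
    print((draws.mean() - true_mean) / true_mean)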

As for the spike very close to 2, the timescale of the Schweinsberg Beta-coalescent model is unpleasant. It goes to infinity at alpha = 1 and to zero at alpha = 2, so I wouldn't be surprised if numerical issues creep in near those boundaries. At the alpha = 2 boundary in particular, you're essentially sampling 0/0. The practical thing to do is just use the Hudson model, since the probability of seeing a multiple merger in the Beta-coalescent is effectively zero in that regime.
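For completeness, switching the reproduction script above to the standard coalescent is just a change of the model argument (a sketch; msprime.StandardCoalescent() is equivalent to leaving model unset):

import msprime
import numpy as np

# Same setup as the script in the first comment, but under the Hudson model.
g = msprime.sim_ancestry(
    samples=2,
    num_replicates=100000,
    model=msprime.StandardCoalescent(),
    ploidy=1
)
# Expected pairwise coalescence time is ploidy * population_size = 1 here.
print(np.mean([ts.first().total_branch_length / 2 for ts in g]))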

@Sendrowski
Author

I agree that a relative error of this magnitude might not seem so surprising (100k replicates in my code example), but it is important to note that the simulated values are consistently larger than the theoretical ones (I am not taking absolute values). For alpha=1.9, I obtain a relative error of ~2.7% (1e6 replicates, a very stable estimate), and for this value we have a (not so extreme) time scaling of about 0.14 relative to the Kingman coalescent. This plot probably does a better job (iCloud link in case you still can't see my images):

[image attachment]

@JereKoskela
Member

You're right @Sendrowski, I missed the lack of a modulus sign, and evidently also one zero on the number of replicates. I'll do some more digging. It looks like a very odd bug: the code just calculates the timescale and multiplies it by an Exp(1) random variable, and somehow the resulting mean does not equal the timescale.
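To spell out why that is surprising, scaling Exp(1) draws by the timescale trivially reproduces the timescale as the mean (a sketch in plain numpy, reusing compute_beta_timescale from the script above rather than the msprime internals):

import numpy as np

rng = np.random.default_rng(2)
timescale = compute_beta_timescale(1, 1.99, 1)  # helper from the first comment

# Two lineages coalesce after an Exp(1) waiting time multiplied by the timescale,
# so the empirical mean should match the timescale up to Monte Carlo noise.
draws = timescale * rng.exponential(1.0, size=1_000_000)
print(draws.mean() / timescale)  # ~1.0, no systematic bias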

@Sendrowski
Author

Yes, very odd indeed... The bias also appears to drop somewhat near alpha=1.99 before shooting up again. I hope it won't be too hard to track down. I didn't debug the C code, so perhaps the time scaling is simply computed differently there. Or perhaps it is caused by the incomplete beta function, although I could not reproduce that when translating the calculation into native Python.

@JereKoskela
Member

JereKoskela commented Feb 6, 2024

Ok, this turned out to be an insufficiently sensitive polynomial approximation for evaluating (essentially) x^2 / x^2 for very small x. It was using a polynomial approximation for x < 1e-9, which I've now hiked to x < 1e-5. @Sendrowski, your Python script now yields unbiased estimators with errors on both sides of zero.
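As a generic illustration of this kind of failure (not the actual msprime code): when numerator and denominator both vanish like x^2, direct floating point evaluation loses accuracy to cancellation well before x reaches 1e-9, which is why the series branch has to take over earlier.

import numpy as np

def ratio_direct(x):
    # (1 - cos(x)) / (x**2 / 2) -> 1 as x -> 0, but the subtraction 1 - cos(x)
    # cancels catastrophically for small x and underflows to 0 around x = 1e-8.
    return (1 - np.cos(x)) / (x ** 2 / 2)

def ratio_series(x):
    # Low-order Taylor expansion of 1 - cos(x), namely x**2/2 - x**4/24,
    # stays accurate all the way down to x = 0.
    return (x ** 2 / 2 - x ** 4 / 24) / (x ** 2 / 2)

for x in [1e-4, 1e-6, 1e-8]:
    print(f"x={x:.0e}  direct={ratio_direct(x):.6f}  series={ratio_series(x):.6f}")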

@jeromekelleher, any comments or are you happy to merge the linked PR? I've rerun the tests in verification.py for the Beta-coalescent and all still look good.

@jeromekelleher
Member

Thanks for sorting this out so quickly @JereKoskela! Change LGTM.

@Sendrowski, are you happy with the changes? Shall we push out a bugfix release?

@Sendrowski
Author

Also looks good to me! Thank you for the quick fix @JereKoskela, and I look forward to meeting you in Warwick in April. As for the release, it’s totally up to you how soon you would like to push it out.

mergify bot closed this as completed in #2257 on Feb 8, 2024
@Sendrowski
Author

Hey again. The bias due to numerical imprecision for values of alpha very close to 2 persists, however. Maybe one should prohibit values greater than 1.99 or so to make sure this won't cause any problems.

[image attachment]

@jeromekelleher
Member

Sorry, still can't see the image @Sendrowski. I think maybe you're uploading a tiff instead of a png or something?

@Sendrowski
Author

Hm, it should be a png. In any case, the graph I attached looks identical to the one I originally posted.

[image attachment: msprime_bug]

@JereKoskela
Member

Yeah, the timescale will always break down close to alpha = 1 or 2. There is no way to avoid that. I suppose we could disallow values very close to the boundaries as a result. I'd like to make sure that 1.01 and 1.99 work even if they come from floating point calculations in some other script, so maybe 1.009 and 1.991?
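In the meantime, a user-side guard along these lines is easy to write (a hypothetical helper; the bounds 1.009 and 1.991 are just the values floated above, not anything msprime currently enforces):

import msprime

def safe_beta_model(alpha, lower=1.009, upper=1.991):
    """Construct a BetaCoalescent only if alpha is well away from the unstable boundaries."""
    if not (lower <= alpha <= upper):
        raise ValueError(
            f"alpha={alpha} is outside [{lower}, {upper}]; the beta coalescent "
            "timescale is numerically unstable this close to 1 or 2"
        )
    return msprime.BetaCoalescent(alpha=alpha)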

@Sendrowski
Author

Sounds good! It is perhaps worth noting that I could not observe the same kind of numerical instability for values near 1, at least for similar distances to the boundary.

@JereKoskela
Member

Thanks, that's useful to know. It's possible that we can get much closer to 1 than to 2, but eventually everything will be infinite for alpha too close to 1. I'll see if I can find a reasonable lower bound.
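For a quick feel of the lower boundary, evaluating the compute_beta_timescale helper from the reproduction script close to alpha = 1 shows the divergence (illustrative only; requires that function and its imports):

# The haploid timescale contains a -log(alpha - 1) term, so it blows up as alpha -> 1.
for alpha in [1.1, 1.01, 1.001, 1.0001]:
    print(alpha, compute_beta_timescale(1, alpha, 1))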

@jeromekelleher
Member

Shall we reopen this issue, or make a new one?

JereKoskela reopened this on Feb 12, 2024
@JereKoskela
Member

I've reopened this one. I'm hoping to get to this within a few days.
