Mean time in discretised distribution #240

hyanwong opened this issue Jan 12, 2023 · 12 comments
@hyanwong commented Jan 12, 2023

I'm trying to find a sensible calculation for the mean time of a node in tsdate, given the time-discretisation scheme. More specifically, if we take the $n$ probabilities $p_i$ (for $i = 1..n$) associated with the timepoints $t_i$ to mean the probability that the true time lies between $t_i$ and $t_{i+1}$, then the last value, $p_n$, is the probability that the node time lies anywhere from timepoint $t_n$ to $\infty$. In other words, this is an open-ended distribution.

If we want to find the mean while including the mass at the top end, we need to calculate the mean within the top region and add its weighted contribution to the total. Here's one way to do it. Does it sound correct?

Assume that the highest category fits a negative exponential distribution (as expected from coalescent theory) with mean $2N_e$, i.e. rate $1/(2N_e)$ (this will be correct for the oldest node, but presumably not for younger nodes). Since the total mass in this category is $p_n$, the equivalent portion of the exponential distribution runs from $a$ to $\infty$, where

$$\int_a^\infty \frac{1}{2N_e} e^{-x/(2N_e)}\,dx = p_n$$

$$a = -2N_e \log(p_n)$$

The extra contribution to the mean from that region is

$$\int_a^\infty \frac{x}{2N_e} e^{-x/(2N_e)}\,dx$$

Substituting in $a$, this equates exactly to $2 N_e p_n (1-\log(p_n))$ (nice!)
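
As a quick numerical check of that closed form (just scipy quadrature on the exponential tail; the values of $N_e$ and $p_n$ below are arbitrary):

import numpy as np
from scipy.integrate import quad

Ne = 50      # arbitrary effective population size for this check
p_n = 0.05   # mass assigned to the open-ended top interval

# lower limit of the equivalent exponential tail region, a = -2*Ne*log(p_n)
a = -2 * Ne * np.log(p_n)

# tail contribution to the mean, by numerical integration of x * f(x)
numeric, _ = quad(lambda x: x * np.exp(-x / (2 * Ne)) / (2 * Ne), a, np.inf)

# closed form derived above
closed_form = 2 * Ne * p_n * (1 - np.log(p_n))

print(numeric, closed_form)  # both ~19.98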

So I think that to get the mean we can take the probability-weighted midpoints of all but the last interval, then simply add on $2 N_e p_n (1-\log(p_n))$ for the top end. Here's an example:

import scipy.stats
import tsdate
import numpy as np
from tsdate.prior import PriorParams

def test_approx_mean(distr, Ne, n_tips, samples_under_node):
    p = tsdate.prior.ConditionalCoalescentTimes(n_tips+1, Ne=Ne, prior_distr=distr)
    p.add(n_tips)  # add prior parameters for a tree with n_tips sample tips
    timepoints = tsdate.prior.create_timepoints(p, n_points=21)

    node_parameters = p[n_tips]

    if p.prior_distr == "lognorm":
        func = scipy.stats.lognorm
        main_param = np.sqrt(node_parameters[:, PriorParams.field_index("beta")])
        scale_param = np.exp(node_parameters[:, PriorParams.field_index("alpha")])
    elif p.prior_distr == "gamma":
        func = scipy.stats.gamma
        main_param = node_parameters[:, PriorParams.field_index("alpha")]
        scale_param = 1 / node_parameters[:, PriorParams.field_index("beta")]

    # CDF at each timepoint; differencing (with 1 appended) gives the mass in
    # each interval, the final value being the mass in the open-ended top interval
    prior_node = func.cdf(
        timepoints,
        main_param[samples_under_node],
        scale=scale_param[samples_under_node])
    probs = np.diff(np.concatenate((prior_node, [1])))

    assert len(probs) == len(timepoints)
    assert np.isclose(np.sum(probs), 1)

    print(
        f"Mean from discretised probabilities from {p.prior_distr} is:",
        # probability-weighted midpoints of the finite intervals ...
        np.sum(probs[:-1] * (timepoints[1:] + timepoints[:-1]) / 2) +
        # ... plus the exponential-tail contribution for the open top interval
        2 * Ne * probs[-1] * (1 - np.log(probs[-1])),
        "but is expected to be",
        func.stats(main_param[samples_under_node],
                   scale=scale_param[samples_under_node],
                   moments='m')
    )

Here are some cases:

>>> test_approx_mean("lognorm", 50, 2, 2)
Mean from discretised probabilities from lognorm is: 99.30403663288052 but is expected to be 100.0000
>>> test_approx_mean("gamma", 50, 2, 2)
Mean from discretised probabilities from gamma is: 100.36711196465089 but is expected to be 100.0
>>> test_approx_mean("lognorm", 40, 1000, 100)
Mean from discretised probabilities from lognorm is: 7.945020255145858 but is expected to be 7.92000

This is better than anything else I've come up with. Can you think of an alternative approach @nspope ?

@nspope commented Jan 12, 2023

Hmm ...
I'm probably misunderstanding what happens at the inference step, but I'm assuming that the likelihood is discretized in the same way as the prior, e.g. you have a likelihood for each bin including the last (open) interval. Then, the posterior mean is,

$$\mathbb{E}[t \mid y] \propto \mathbb{L}(t \geq x_N \mid y) \times \mathbb{P}(t \geq x_N) \times \mathbb{E}[t \mid t \geq x_N] + \sum_{i=0}^{N-1} \mathbb{L}(x_i \leq t < x_{i+1} \mid y) \times \mathbb{P}(x_i \leq t < x_{i+1}) \times \mathbb{E}(t \mid x_i \leq t < x_{i+1})$$

where $\mathbb{L}(. \mid y)$ is the likelihood given data $y$, $x$ are the $N$ finite breakpoints, and the normalizing constant is the sum of likelihood × prior over bins. In the current scheme $\mathbb{E}(t \mid x_i \leq t < x_{i+1}) = \frac{x_i + x_{i+1}}{2}$, because the prior is getting chopped up into a piecewise-uniform distribution.

So to be able to calculate a posterior mean over the discretized time distribution, you need to know the prior expectation in the last time bin (that is $\mathbb{E}[t \mid t \geq x_N]$). You're proposing using an exponential as a model. Do I have this right?

If so, this seems reasonable, but why not use the exact expectations under the lognormal/gamma prior (and do the same for the "interior bins")? These would be

$$\mathbb{E}[t \mid x_i \leq t < x_{i+1}] = \frac{\int_{x_i}^{x_{i+1}} t p(t) dt}{\int_{x_i}^{x_{i+1}} p(t) dt} = \frac{\int_{x_i}^{x_{i+1}} t p(t) dt}{\mathbb{P}(x_i \leq t < x_{i+1})}$$

where $p(t)$ is the gamma or lognormal pdf. So the posterior expectation above becomes

$$\mathbb{E}[t \mid y] \propto \mathbb{L}(t \geq x_N \mid y) \int_{x_N}^{\infty} t p(t) dt + \sum_{i=0}^{N-1} \mathbb{L}(x_i \leq t < x_{i+1} \mid y) \int_{x_i}^{x_{i+1}} t p(t) dt$$

If the likelihood is constant ( $\mathbb{L}( . \mid y) = 1$ ) then this reduces to the exact prior expectation (e.g. the time discretization is irrelevant). When the likelihood is not constant, the posterior expectation is an interval-reweighting of the prior expectation. The integrals could be calculated ahead of time in a lookup table for each node, using quadrature.
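
For concreteness, here's a rough sketch of that lookup-table idea (not tsdate's API; the gamma parameters, breakpoints and per-bin likelihoods below are made up purely for illustration), using scipy quadrature for the per-bin integrals:

import numpy as np
import scipy.stats
from scipy.integrate import quad

# hypothetical prior for one node, and an arbitrary time grid
prior = scipy.stats.gamma(a=2.0, scale=50.0)              # mean = 100
breakpoints = np.array([0., 25., 50., 100., 200., 400.])  # finite breakpoints x_0..x_N

# lookup table: integral of t * p(t) over each bin; the last bin is open-ended
uppers = np.append(breakpoints[1:], np.inf)
partial_means = np.array([quad(lambda t: t * prior.pdf(t), lo, hi)[0]
                          for lo, hi in zip(breakpoints, uppers)])
bin_masses = np.diff(np.append(prior.cdf(breakpoints), 1.0))

# made-up per-bin likelihoods L(x_i <= t < x_{i+1} | y)
likelihood = np.array([0.1, 0.8, 1.0, 0.4, 0.05, 0.01])
posterior_mean = np.sum(likelihood * partial_means) / np.sum(likelihood * bin_masses)
print(posterior_mean)

# sanity check: with a flat likelihood the discretization is irrelevant and
# the exact prior mean (100) is recovered
flat = np.ones_like(likelihood)
print(np.sum(flat * partial_means) / np.sum(flat * bin_masses), prior.mean())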

@hyanwong commented Jan 12, 2023

You're proposing using an exponential as a model. Do I have this right?

Yes, exactly

why not use the exact expectations under the lognormal/gamma prior

The issue is that I want to calculate the posterior mean, and after running the inside/outside step, you are no longer assuming that the distribution is anything like a gamma or lognormal, so you don't have a fitted function for it (indeed, if you wanted to fit a function, you would want a way to calculate the mean anyway!). It could even, e.g., be bimodal, depending on where the weights are coming from. It is nicer to rely on the distributional assumptions only for the prior and not for the posterior.

Additionally, I'm not sure that the lognormal holds well as an approximation right in the top end of the tail. I also worried that the centre of mass of the top end of a lognormal might be at infinity, but I don't think that can really be the case.

@nspope commented Jan 12, 2023

The issue is that I want to calculate this for the posterior, and after running the inside/outside step, you no longer are assuming that the distribution is anything like a gamma or lognormal, so you don't have a fitted function for it

Right, but presumably the likelihood and prior factorize over the time discretization like I wrote in the previous comment? In which case, the posterior can be multimodal (e.g. if the likelihood is multimodal over bins, or if the likelihood strongly disagrees with the prior). It's just that if the likelihood is discretized over time (as with interval censoring in survival analysis), then the posterior expectation can be written in terms of reweighted piecewise integrals over the prior. The current choice of prior is piecewise uniform, with the intervals/weights for each piece coming from the PPF/CDF of the gamma or lognormal. What this means is that if the time discretization changes, then the prior changes (e.g. its moments will change). Instead, if the "pieces" are truncated gamma/lognormal distributions, then changing the time discretization does not alter the prior distribution (although it would change the likelihood and posterior).

But, I'm blathering without really understanding what's going on in the inference step, so let me take a look at the code! Regardless, I think using an exponential as a model for the upper bin makes sense. In which case, it doesn't seem like a big deal if it doesn't exactly match the original (undiscretized) prior, because that'll get distorted anyway by the piecewise-uniform approximation.

@hyanwong

I guess I might also want to do something like this for the variance, although we don't actually use the variance anywhere important. Discretising the distribution is known to bias the variance, which can be countered by using Sheppard's correction for symmetrical distributions with equal bin widths. We don't have either of those properties here, however.
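
For reference, here's Sheppard's correction on a toy example (a standard normal discretised on an equal-width grid; nothing tsdate-specific):

import numpy as np
import scipy.stats

h = 0.5                                 # equal bin width
edges = np.arange(-10, 10 + h, h)       # bin edges covering essentially all the mass
mids = (edges[:-1] + edges[1:]) / 2
probs = np.diff(scipy.stats.norm.cdf(edges))   # mass in each bin

mean = np.sum(probs * mids)
var_grouped = np.sum(probs * (mids - mean) ** 2)   # variance from bin midpoints
var_corrected = var_grouped - h ** 2 / 12          # Sheppard's correction

print(var_grouped, var_corrected)   # ~1.0208 vs ~1.0000 (true variance is 1)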

@hyanwong

Regardless, I think using an exponential as a model for the upper bin makes sense

I think so. It also doesn't bake the prior choice into the posterior, which I somewhat prefer. I'm unsure if I should be using the same exponential mean (i.e. $2N_e$) for the upper end of all the nodes, but I suspect it won't matter too much.

@hyanwong

Also, I don't know how we would apply this if there is a variable Ne over time. It would be nice to be able to fit the tail end of the exponential by using some internal feature of the posterior (e.g. the median, or the mean without the tail region) rather than having to rely on something baked into the prior.
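
One way might be to infer the rate from the posterior weights themselves, e.g. by assuming exponential decay across the last finite interval. A sketch of the idea only (`timepoints`/`probs` follow the convention in the code above):

import numpy as np

def tail_mean_contribution(timepoints, probs):
    # probs[i] is the mass between timepoints[i] and timepoints[i+1];
    # probs[-1] is the mass above timepoints[-1] (the open-ended interval)
    t_last, t_prev = timepoints[-1], timepoints[-2]
    p_tail, p_prev = probs[-1], probs[-2]
    # assume exponential decay from t_prev onwards, so that
    # P(t >= t_last | t >= t_prev) = exp(-rate * (t_last - t_prev))
    rate = -np.log(p_tail / (p_tail + p_prev)) / (t_last - t_prev)
    return p_tail * (t_last + 1 / rate)  # mass * conditional mean of the tail

# toy check: a discretised exponential with mean 100
t = np.linspace(0, 500, 21)
p = np.diff(np.append(1 - np.exp(-t / 100), 1.0))
print(tail_mean_contribution(t, p), p[-1] * (t[-1] + 100))  # both ~4.04

This avoids $N_e$ (and hence the variable-$N_e$ problem) entirely.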

@hyanwong

Actually, another issue I just thought about: for very young nodes (e.g. when most of the mass is in the youngest relevant timeslice), we probably shouldn't find the mean using the midpoints of the time slices. Instead we should maybe fit some sort of decaying exponential. This may be the reason that we overestimate the mean ages of the youngest nodes. I'm not sure what the best approach is here. I feel it would be nice to have a method that worked uniformly for all nodes, but which didn't depend on either Ne or the prior distribution; rather, one that somehow adaptively accounted for the shape of the posterior, using something like Simpson's 1/3 rule. I may be talking nonsense here though.
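
As a toy illustration of why midpoints overestimate when the mass piles up in the youngest slices, and of the sort of quadrature alternative I mean (this treats the posterior as density values at the timepoints, which is an assumption, not how tsdate currently stores it):

import numpy as np
from scipy.integrate import simpson

# toy posterior: an exponential with mean 10, i.e. concentrated near zero
rate = 0.1
t = np.linspace(0, 200, 21)            # coarse time grid (spacing 10)
density = rate * np.exp(-rate * t)     # density evaluated at the timepoints

# midpoint-of-bins estimate (analogous to the current approach for finite intervals)
masses = np.diff(1 - np.exp(-rate * t))
midpoint_mean = np.sum(masses * (t[:-1] + t[1:]) / 2) / np.sum(masses)

# Simpson's-rule estimate using the density values at the timepoints
simpson_mean = simpson(t * density, x=t) / simpson(density, x=t)

print(midpoint_mean, simpson_mean)   # ~10.8 vs ~9.8; the true mean is 10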

@nspope commented Jan 12, 2023

One possibility would be to use the "binned" posterior to fit a nonparametric density estimate per node -- rather than use a gamma, exponential or whatever -- and use the expectation of these to fill out ages in the tree sequence. This would be more of a postprocessing step.

A quick search indicates that there are some proposed methods for doing this (e.g. going from a histogram to a density estimator), but these will surely require some tuning (bandwidth) and may not do that well at the boundary (which, it sounds like, is where the problems are?)

Another approach would be to treat the posterior as interval-censored data and fit a mixture distribution, e.g. a mixture of gammas.
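
A rough sketch of the interval-censored fitting idea, using a single gamma rather than a mixture to keep it short (the breakpoints and weights below are made up):

import numpy as np
import scipy.stats
from scipy.optimize import minimize

# hypothetical binned posterior for one node
breakpoints = np.array([0., 25., 50., 100., 200., 400.])
weights = np.array([0.05, 0.25, 0.40, 0.20, 0.08, 0.02])  # last weight = open tail

def neg_log_prob(params):
    # cross-entropy between the binned posterior and an interval-censored gamma
    shape, scale = np.exp(params)      # optimise on the log scale to stay positive
    cdf = np.append(scipy.stats.gamma.cdf(breakpoints, shape, scale=scale), 1.0)
    bin_probs = np.maximum(np.diff(cdf), 1e-300)  # guard against log(0)
    return -np.sum(weights * np.log(bin_probs))

fit = minimize(neg_log_prob, x0=[0.0, np.log(50.0)], method="Nelder-Mead")
shape, scale = np.exp(fit.x)
print(shape * scale)   # mean of the fitted gamma, as a point estimate of node age

A mixture would just add more components (and mixing weights) to the same objective.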

@hyanwong commented Jan 12, 2023

If the distribution is considered on a log scale, then we could treat the right and left sides as both open (and perhaps fit the tails using a lognormal rather than an exponential). That would be pleasingly symmetrical. But I am slightly worried that the lognormal is too "fat-tailed" given that these times are generated by a coalescent process, and I suspect that it might bias the mean toward older times for some of the oldest (and hardest to date) nodes.

@hyanwong

I'm hoping this will fix #229

@nspope commented Jan 16, 2023

Probably relevant to the comment here. I think the confusion for me stems from what sort of distribution is actually being approximated -- the posterior mean is calculated from a finite set of atoms (timepoints) with weights, rather than as expectations of a continuous piecewise-uniform distribution over time intervals. So it's tricky to think about combining this with a model over an interval for the uppermost timeslice. If I'm understanding correctly, the proposal is to:

  1. Use an exponential model for the node age after the last finite timepoint
  2. Calculate the rate parameter of this exponential for each node using the discrete time points & their posterior weights
  3. Calculate posterior mean for $i$-th node that is $$\sum_{k=0}^{K-1} P_{ik} t_k +P_{iK} (t_{K-1} + \bar{t})$$ where $t_k$ are the $K$ finite timepoints, $P_{ik}$ are posterior weights, and $\bar{t}$ is the expectation of the fitted exponential

While this seems reasonable, I'm struggling to figure out what the approximation is of, exactly. If the timepoints were midpoints of the intervals then it'd be something like the expectation of a piecewise-uniform distribution over time intervals, with an exponential tacked on at the last (open) interval.
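
Assembling the three steps above into a minimal sketch (made-up weights and timepoints; for step 2 I've just used a fixed exponential mean of $2N_e$ for brevity, rather than fitting the rate from the weights):

import numpy as np

Ne = 50                                        # hypothetical Ne; exponential tail mean = 2*Ne
timepoints = np.array([0., 10., 25., 50., 100., 250.])          # K finite timepoints t_0..t_{K-1}
weights = np.array([0.02, 0.18, 0.35, 0.25, 0.12, 0.05, 0.03])  # K+1 posterior weights P_i0..P_iK

tail_mean = timepoints[-1] + 2 * Ne            # steps 1-2: expectation beyond the last timepoint
posterior_mean = np.sum(weights[:-1] * timepoints) + weights[-1] * tail_mean  # step 3
print(posterior_mean)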

@nspope commented Jan 25, 2023

After futzing around for a while, it seems to me that the algorithm works quite well as is, and that the most reliable way to improve accuracy may come from optimizing the time grid, rather than from something like quadrature or post-processing the posterior.

In essence, the variable Ne prior in #248 provides a way to arbitrarily adjust the spacing of the time grid; so iteratively improving the prior via pairwise coalescence rates (or the marginal likelihood produced by the inside algorithm) could possibly have a substantial impact on accuracy.
