Mean time in discretised distribution #240
Comments
Hmm ... so to be able to calculate a posterior mean over the discretized time distribution, you need to know the prior expectation in the last time bin, that is $E[t \mid t > t_n]$. If so, this seems reasonable, but why not use the exact expectations under the lognormal/gamma prior (and do the same for the "interior bins")? These would be

$$E[t \mid t_i < t < t_{i+1}] = \frac{\int_{t_i}^{t_{i+1}} t \, \pi(t) \, dt}{\int_{t_i}^{t_{i+1}} \pi(t) \, dt}$$

where $\pi$ is the prior density. If the likelihood is constant within each bin, these conditional expectations can simply be reweighted by the posterior bin masses.
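For a gamma prior these per-bin conditional expectations have a closed form, via the identity $t \, f_k(t) = k\theta \, f_{k+1}(t)$ for a gamma density with shape $k$ and scale $\theta$. A minimal sketch (not tsdate's actual code; the function name and grid are made up):

```python
# Sketch: exact prior expectation within each time bin, for a gamma prior.
# Uses the identity  ∫_a^b t f_k(t) dt = k*theta * [F_{k+1}(b) - F_{k+1}(a)],
# where F_{k+1} is the CDF of a gamma with shape k+1 and the same scale.
import numpy as np
from scipy.stats import gamma

def gamma_bin_means(shape, scale, edges):
    """Expected time within each bin (edges[i], edges[i+1]) under a
    gamma(shape, scale) prior; the last edge may be np.inf."""
    cdf = gamma.cdf(edges, shape, scale=scale)
    cdf_shifted = gamma.cdf(edges, shape + 1, scale=scale)
    mass = np.diff(cdf)                              # prior mass per bin
    partial = shape * scale * np.diff(cdf_shifted)   # ∫ t f(t) dt per bin
    return partial / mass, mass

edges = np.array([0.0, 1.0, 2.0, 5.0, np.inf])       # hypothetical grid
means, mass = gamma_bin_means(2.0, 1.0, edges)
# posterior mean under a bin-wise likelihood `lik` would then be
#   np.sum(lik * mass * means) / np.sum(lik * mass)
```

The mass-weighted bin means recover the overall prior mean exactly, so no bias is introduced by the discretisation itself.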
Yes, exactly
The issue is that I want to calculate the posterior mean, and after running the inside/outside step, you are no longer assuming that the distribution is anything like a gamma or lognormal, so you don't have a fitted function for it (indeed, if you wanted to fit a function, you would want a way to calculate the mean anyway!). It could even, e.g., be bimodal, depending on where the weights are coming from. It is nicer to rely on the distributional assumptions only for the prior and not for the posterior. Additionally, I'm not sure that the lognormal holds up well as an approximation right at the top end of the tail. I also worried that the centre of mass of the top end of a lognormal might be at infinity, but I don't think that can really be the case.
Right, but presumably the likelihood and prior factorize over the time discretization like I wrote in the previous comment? In which case, the posterior can be multimodal (e.g. if the likelihood is multimodal over bins or the likelihood strongly disagrees with the prior). It's just that if the likelihood is discretized over time (like with interval censoring in survival analysis), then the posterior expectation can be written in terms of reweighted piecewise integrals over the prior. The current choice of prior is piecewise uniform, with the intervals/weights for each piece coming from the PPF/CDF of a gamma or lognormal. What this means is that if the time discretization changes, then the prior changes (e.g. its moments will change). Instead, if the "pieces" are truncated gamma/lognormal distributions, then changing the time discretization does not alter the prior distribution (although it would change the likelihood & posterior).

But I'm blathering without really understanding what's going on in the inference step, so let me take a look at the code! Regardless, I think using an exponential as a model for the upper bin makes sense. In which case, it doesn't seem like a big deal if it doesn't exactly match the original (undiscretized) prior, because that'll get distorted anyway by the piecewise-uniform approximation.
I guess I might also want to do something like this for the variance, although we don't actually use the variance anywhere important. Discretising the distribution is known to bias the variance, which can be countered by using "Sheppard's correction" for symmetrical distributions with even bin widths. We don't have those assumptions, however.
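A small numerical illustration of the bias Sheppard's correction targets (a standalone sketch, using a standard normal and even bins, neither of which matches tsdate's actual grids): replacing a smooth density by midpoint atoms inflates the variance by roughly $h^2/12$ for bin width $h$.

```python
# Sketch: binning a smooth symmetric density at midpoints adds about h^2/12
# to the variance (Sheppard's correction). Standard normal example.
import numpy as np
from scipy.stats import norm

h = 0.5                                   # bin width
edges = np.arange(-8, 8 + h, h)
mids = (edges[:-1] + edges[1:]) / 2
mass = np.diff(norm.cdf(edges))           # probability per bin

mean_binned = np.sum(mids * mass)                         # ~ 0
var_binned = np.sum((mids - mean_binned) ** 2 * mass)     # ~ 1 + h**2/12
var_corrected = var_binned - h ** 2 / 12                  # ~ true variance 1
```

With uneven bins and skewed per-node posteriors this exact correction does not apply, as noted above; the sketch only shows the size of the effect.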
I think so. It also doesn't bake the prior choice into the posterior, which I somewhat prefer. I'm unsure if I should be using the same exponent (i.e. 2Ne) for the upper end of all the nodes, but I suspect it won't matter too much.
Also, I don't know how we would apply this if there is a variable Ne over time. It would be nice to be able to fit the tail end of the exponential by using some internal feature of the posterior (e.g. the median, or the mean without the tail region) rather than having to rely on something baked into the prior.
Actually, another issue I just thought about: for very young nodes (e.g. when most of the mass is in the youngest relevant timeslice), we probably shouldn't find the mean using the midpoints of the time slices. Instead we should maybe fit some sort of decaying exponential. This may be the reason that we overestimate the mean ages of the youngest nodes. I'm not sure what the best approach is here. I feel it would be nice to have a method that worked uniformly for all nodes, but which didn't depend on either Ne or the prior distribution. Rather, it would somehow adaptively account for the shape of the posterior, using something like e.g. Simpson's 1/3 rule. I may be talking nonsense here, though.
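For what a Simpson's-rule mean might look like: if one had (unnormalised) posterior density values at the grid points rather than interval masses, the mean could be taken as a ratio of two quadratures. A sketch under that assumption (the grid and density here are hypothetical, not tsdate output):

```python
# Sketch: posterior mean via Simpson's rule over density values at grid
# points, instead of midpoint sums over interval masses.
import numpy as np
from scipy.integrate import simpson

t = np.linspace(0, 10, 101)     # even grid; Simpson wants an odd point count
dens = np.exp(-t)               # hypothetical unnormalised posterior density
mean = simpson(t * dens, x=t) / simpson(dens, x=t)
# close to 1, the mean of a unit-rate exponential truncated at t = 10
```

Note this only depends on the shape of the posterior, not on Ne or the prior, which is the property asked for above; but it still truncates whatever mass lies beyond the last grid point.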
One possibility would be to use the "binned" posterior to fit a nonparametric density estimate per node -- rather than use a gamma, exponential or whatever -- and use the expectations of these to fill out ages in the tree sequence. This would be more of a postprocessing step. A quick search indicates that there are some proposed methods for how to do this (e.g. go from a histogram to a density estimator), but all of these will surely require some tuning (bandwidth) and may not do that well at the boundary (which, it sounds like, is where the problems are?). Another approach would be to treat the posterior as interval-censored data and fit a mixture distribution, e.g. a mixture of gammas.
If the distribution is considered on a log scale, then we could treat the right and left sides as both open (and perhaps fit the tails using a lognormal rather than an exponential). That would be pleasingly symmetrical. But I am slightly worried that the lognormal is too "fat-tailed" given that these are generated by a coalescent process, and suspect that it might bias the mean toward older times for some of the oldest (and hardest to date) nodes.
I'm hoping this will fix #229 |
Probably relevant to the comment here. I think the confusion for me stems from what sort of distribution is actually being approximated -- the posterior mean is calculated from a finite set of atoms (timepoints) with weights, rather than as expectations of a continuous piecewise-uniform distribution over time intervals. So it's tricky to think about combining this with a model over an interval for the uppermost timeslice. If I'm understanding correctly, the proposal is to:

1. use the interval midpoints to get the contribution of each interior bin to the mean; and
2. model the mass $p_n$ in the last (open) interval as the tail of an exponential with mean $2 N_e$, contributing $2 N_e \, p_n (1 - \log p_n)$.
While this seems reasonable I'm struggling to figure out what the approximation is of, exactly. If the timepoints were midpoints of the intervals then it'd be something like the expectation of a piecewise uniform distribution over time intervals, with an exponential tacked on at the last (open) interval. |
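Under that reading, the approximation could be computed as below (a sketch, not tsdate's code; `t`, `p`, and `Ne` are hypothetical inputs):

```python
# Sketch: posterior mass p[i] on interval (t[i], t[i+1]) treated as uniform
# (contributing its midpoint), with the open-ended last interval modelled as
# an exponential tail of mean 2*Ne, contributing 2*Ne*p[-1]*(1 - log(p[-1]))
# as derived in this issue.
import numpy as np

def posterior_mean(t, p, Ne):
    """t: grid of n timepoints; p: posterior mass with p[i] on
    (t[i], t[i+1]) for i < n-1 and p[-1] on (t[-1], inf); p sums to 1."""
    mids = (t[:-1] + t[1:]) / 2
    interior = np.sum(p[:-1] * mids)
    tail = 2 * Ne * p[-1] * (1 - np.log(p[-1]))
    return interior + tail
```

As a sanity check, if the true posterior really is an exponential with mean $2 N_e$, the tail term is exact and only the midpoint treatment of the interior bins introduces (small) error.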
After futzing around for a while, it seems to me that the algorithm works quite well as-is, and that the most reliable way to improve accuracy may come from optimizing the time grid, rather than from something like quadrature or post-processing the posterior. In essence, the variable Ne prior in #248 provides a way to arbitrarily adjust the spacing of the time grid, so iteratively improving the prior via pairwise coalescence rates (or the marginal likelihood produced by the inside algorithm) could possibly have a substantial impact on accuracy.
I'm trying to find a sensible calculation for the mean time of a node in tsdate, given the time discretisation scheme. More specifically, if we take the $n$ probabilities $p_i$ for $i = 1 \ldots n$ associated with each timepoint $t_i$ to mean the probability that the true time lies between $t_i$ and $t_{i+1}$, then the last value $p_n$ is the probability that the node time will be from timepoint $t_n$ to $\infty$. In other words, this is an open-ended distribution.
If we want to find the mean, while including the mass at the top end, we need to calculate the mean in the top region and add its weight to the total. Here's one way to do it. Does it sound correct?
Assume that the highest category fits a negative exponential distribution (as expected from coalescent theory) with mean $2 N_e$, i.e. density $f(t) = e^{-t/(2 N_e)} / (2 N_e)$ (this will be correct for the oldest node, but presumably not for younger nodes). Since the total mass is $p_n$, the equivalent portion of the exponential distribution runs from $a \ldots \infty$, where

$$\int_a^\infty \frac{e^{-t/(2 N_e)}}{2 N_e} \, dt = e^{-a/(2 N_e)} = p_n \quad\Rightarrow\quad a = -2 N_e \log(p_n)$$

The extra mass in that region that needs to be incorporated is

$$\int_a^\infty t \, \frac{e^{-t/(2 N_e)}}{2 N_e} \, dt = (a + 2 N_e) \, e^{-a/(2 N_e)} = p_n \, (a + 2 N_e)$$

Substituting in $a$, this equates exactly to $2 N_e \, p_n \, (1 - \log(p_n))$ (nice!)
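The substitution can be checked against direct numerical integration of the assumed exponential tail (a standalone sketch; function names are made up):

```python
# Check the closed form 2*Ne*p*(1 - log(p)) against numerical integration of
# ∫_a^∞ t * exp(-t/(2*Ne))/(2*Ne) dt, with a = -2*Ne*log(p) so that the tail
# mass beyond a equals p.
import numpy as np
from scipy.integrate import quad

def tail_contribution(p, Ne):
    # closed form derived above
    return 2 * Ne * p * (1 - np.log(p))

def tail_numeric(p, Ne):
    a = -2 * Ne * np.log(p)          # lower limit chosen so tail mass is p
    val, _ = quad(lambda t: t * np.exp(-t / (2 * Ne)) / (2 * Ne), a, np.inf)
    return val
```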
So I think that to get the mean we can take the midpoints of most of the intervals, then simply add on $2 N_e \, p_n \, (1 - \log(p_n))$ for the top end.
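The tail term behaves sensibly in the limiting cases (illustrative values only; `Ne = 10_000` is an arbitrary example, not a tsdate default):

```python
# Limiting cases of the tail term 2*Ne*p_n*(1 - log(p_n)).
import numpy as np

def tail_term(p_n, Ne):
    return 2 * Ne * p_n * (1 - np.log(p_n))

Ne = 10_000
tail_term(1.0, Ne)    # all mass above t_n: recovers the exponential mean 2*Ne
tail_term(1e-9, Ne)   # negligible mass in the top bin: contribution near zero
```

Since $\frac{d}{dp}\,p(1 - \log p) = -\log p > 0$ for $p < 1$, the contribution grows monotonically with the mass in the top bin, as it should.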
This is better than anything else I've come up with. Can you think of an alternative approach @nspope?