-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Returned posteriors incorrect for 2 tip tree with mutation_rate=None #230
Comments
Actually, this is not a bug in the import tsdate
import tskit
tables = tskit.Tree.generate_comb(3).tree_sequence.dump_tables()
ts = tables.tree_sequence()
display(ts.draw_svg())
grid = tsdate.build_prior_grid(ts, Ne=40, prior_distribution="gamma", timepoints=20)
ts, posteriors = tsdate.date(ts, mutation_rate=None, priors=grid, return_posteriors=True)
print(posteriors[ts.first().root]) |
I think this is the same bug that we identified previously, although I can't find the issue just now. The problem is that the priors do not properly account for tips at time 0. Here's a trivial example: import tsdate
import tskit
ts = tskit.Tree.generate_comb(2).tree_sequence
grid = tsdate.build_prior_grid(ts, Ne=40, prior_distribution="gamma", timepoints=20)
print(grid[2]) # Gives array([0., 1., 1., ..., 1., 1., 1.]) rather than a proper distribution The reason for this is that we place a zero as the first element of the priors array here: Line 979 in c9db1d9
This means that any node whose children have their entire weight at zero get probabilities of (essentially) zero passed to them from their children. I think the fix should be to consider the probabilities as being between each of the timepoints, and for the last probability to cover the probability of the node landing in the timeslice from |
Finally fixes tskit-dev#7 and produces a much nicer looking fit for large tree sequences, but leads to slightly worse performance for tiny tree sequences such as those tested in test_accuracy.py, because of tskit-dev#230. When we fix that, this PR should provide uniformly better performance, I hope
In might be the same issue being picked up in https://github.com/hyanwong/molecular-sample-dating/blob/main/notebooks/uniform_prior.ipynb |
Another place that we ignore the zeroth timeslice is when normalising: Line 135 in e9bd098
|
I actually don't think this is a bug or related to having zero mass at the first timepoint-- it is expected behavior given how the default timegrid is constructed from quantiles. That is, the default timegrid takes equally spaced quantiles of the prior as the timepoints (here), concatenates these quantiles across nodes (with some thinning), then assigns mass that is the difference in the prior CDF between adjacent timepoints (here). When there's only one internal node, there's only one prior distribution from which to take quantiles, and because the quantiles are equally spaced the CDF differences are all the same ( For example, with a non-default timegrid, we don't get uniform probabilities because the timepoints don't correspond to equally-spaced quantiles: import tsdate
import tskit
import numpy as np
ts = tskit.Tree.generate_comb(2).tree_sequence
prior = tsdate.prior.MixturePrior(ts, prior_distribution="lognorm")
grid_default = prior.make_discretized_prior(population_size=40, timepoints=6)
grid_custom = prior.make_discretized_prior(
population_size=40, timepoints=np.array([0, 0.1, 1, 10, 100, 1000])
)
print(grid_default[2])
# [0. 1. 1. 1. 1. 1. 1.]
print(grid_custom[2])
# [0.00000000e+00 1.44197669e-02 1.00000000e+00 5.18212919e-01 1.12178648e-03 2.08604998e-09] It may seem a bit pathological, but it seems to work well enough in practice. An alternative, which wouldn't have this behavior, would be to use the prior PDF at each timepoint rather than the difference in CDF between adjacent timepoints. This would, in effect, be defining the discretized prior in terms of discrete atoms rather than time intervals -- and this is consistent with how the likelihoods are calculated. I think this would give very similar results (to the current approach) when there's a lot of timepoints. I do remember seeing some discrepancies (between PDF v CDF priors) while working on a quadrature scheme, but there could be other issues at play there. |
Conceptually, I think it's cleaner to have both the prior and posterior be defined over discrete timepoints (all this would require is calculating the prior probs from PDF rather than CDF). Then, to extend the algorithm to timeslices, we'd use this discretization with a quadrature scheme. However, the current setup seems to work pretty well, and is super convenient for the variable Ne prior. |
Finally fixes tskit-dev#7 and produces a much nicer looking fit for large tree sequences, but leads to slightly worse performance for tiny tree sequences such as those tested in test_accuracy.py, because of tskit-dev#230. When we fix that, this PR should provide uniformly better performance, I hope
Finally fixes tskit-dev#7 and produces a much nicer looking fit for large tree sequences, but leads to slightly worse performance for tiny tree sequences such as those tested in test_accuracy.py, because of tskit-dev#230. When we fix that, this PR should provide uniformly better performance, I hope
Finally fixes tskit-dev#7 and produces a much nicer looking fit for large tree sequences, but leads to slightly worse performance for tiny tree sequences such as those tested in test_accuracy.py, because of tskit-dev#230. When we fix that, this PR should provide uniformly better performance, I hope
For example:
All the posterior probabilities are one in this case. I think that's wrong
The text was updated successfully, but these errors were encountered: