-
Notifications
You must be signed in to change notification settings - Fork 79
Add tests for genetic_relatedness #1023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
📖 Docs for this PR can be previewed here |
Well, should it? It seems like it should, since "zero weight" corresponds to bits of the tree sequence not inherited by anyone, and "total weight" corresponds to bits inherited by everyone in the sample sets. (So, for instance, a mutation inherited by none of the samples has weight zero, and shouldn't contribute to trait covariance.) So, if Now, "full sample weight" would be |
|
I suspect this means we need to redefine it a bit to account for the not-all-sample-sets-the-same-size case. (But in a way that's equivalent if they are.) |
Ah, right. I think what you've done (redefine ... ? |
|
I've carefully read the python 'test' versions, by the way, and they look great. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a look through @brieuclehmann, and it all looks good to me (although I've not thought through the exact details). I'm happy to merge whenever @petrelharp is.
|
There's still a fair bit amount of work to do as I haven't started looking at the branch or node statistics. I'm a bit swamped at the moment but will have a go over the next few days. |
One way to do this would be: This is compatible with the previous definition since EDIT: The following is also consistent and passes most of the tests for SiteGeneticRelatedness (whereas the above does not) It fails on the ghost allele test and single_tree_jukes_cantor. I haven't changed the C implementation yet, but can push what I have now if that would help? |
|
I'm also getting lots of warnings: which I'm pretty sure is to do with dividing by zero when number of segregating is zero. |
|
For zero segregating sites, I guess we should be returning |
|
And, I like both your suggestions! Do you think one makes more sense than the other? (e.g., which one has a easier/more sensible verbal definition?) |
|
I think the second one makes more sense (and requires less work :D); it corresponds to what you should get if you take the definition of (site) genetic_relatedness to be the centered genotype matrix covariance, and so has the same trait covariance interpretation from the overleaf document. |
|
I agree! I guess we just should be more careful about what "centering" means? |
1a390ca to
9afe1cf
Compare
|
I've just pushed the latest changes for site_genetic_relatedness. This still fails on two tests ( I haven't yet looked at node_ or branch_genetic_relatedness either; at the moment, these are just copied from divergence from now with the plumbing changed to include |
Taking our definition-in-words in the documentation should make this clear, hopefully? And if not, updating the definition? That's the idea - implement the description-in-words in as simple and obviously-correct a way as possible. The divergence code should be a good place to start. |
9afe1cf to
4ec6ac4
Compare
|
I've added branch_genetic_relatedness, which is passing all the tests! It follows the same logic as the site_genetic_relatedness, but treating each branch as a site. Just node_genetic_relatedness and sorting the two failing tests in site_genetic_relatedness to go... |
|
So I think the issue with test_ghost_allele and test_single_tree_jukes_cantor is that G = ts.genotype_matrix() contains values other than {0,1}. For example, in test_ghost_allele, we have G = [2 0 0 0]. We should be treating this as G = [1 0 0 0] and computing the covariance based on this - this is what ts.genetic_relatedness() is currently doing, but not the naive implementation, so I would need to tweak that. Does this sound about right @petrelharp ? |
2676487 to
c8145a2
Compare
Codecov Report
@@ Coverage Diff @@
## main #1023 +/- ##
=======================================
Coverage 93.68% 93.68%
=======================================
Files 26 26
Lines 20985 20991 +6
Branches 875 875
=======================================
+ Hits 19659 19665 +6
Misses 1289 1289
Partials 37 37
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Oh, right, that'll be it - good spotting! |
|
Yep, I think I've fixed it! Everything passing without issues now. |
|
Woot! Sounds like we're nearly there @brieuclehmann, we just need the node stat now? |
|
The node stat is done too @jeromekelleher ! |
|
Excellent! I think we just need @petrelharp's approval on the PR and we can merge then. |
c8145a2 to
d7a2676
Compare
|
Looking good, and, nearly there! But, gee - this has turned up an oversight in the testing set-up! We weren't running tests were the union of all sample sets was not equal to all the samples. I've fixed this over in #1039; you could merge that in to get tests that fail for the issue related to centering I'm pointing out in the comments. |
|
@brieuclehmann I can merge if you like - you want #1039 to be part of this PR, yes? |
|
Um, possibly! I'd like to apply the tests that @petrelharp has added so that I can check that my fixes have worked. |
|
I've rebased #1039 onto #1023 at https://github.com/benjeffery/tskit/tree/1023-1039. Running the tests there are few failures. |
d7a2676 to
9fe3639
Compare
|
Ah right, I just pushed through my proposed changes. How should I go about checking these on https://github.com/benjeffery/tskit/tree/1023-1039 ? |
|
I've just updated and am running the tests! The changes in 1039 are quite small - you could prob cut and paste them as a quick fix. |
Good idea! I've done this and it seems to be working for the genetic_relatedness tests on my machine... Any suggestions? |
|
Add them to this PR and push to see if CI is happy? |
9fe3639 to
28b2348
Compare
|
The factor of 2 is bothering me. Can you do a spot check that this definition agrees with the equation we cite, in the case of a binary alleles matrix? We could just tack one more test on to check the normalization, like: |
That's great! The fixed you made to the tests look right after a quick glance to me, so I believe it... |
28b2348 to
5b1f10c
Compare
I've added a test @petrelharp - let me know if that's what you had in mind! |
Yes, perfect! Just two things: (1) put in a comment at the top of the test what it's doing (comparing to Speed & Balding's K_{c0}), and (2) move the test to |
5b1f10c to
cb9037c
Compare
|
How's this looking @brieuclehmann? It would be super-nice if we could get the last of @petrelharp's comments sorted out so we can ship 0.3.3 today. |
|
Just waiting for @petrelharp 's approval on the latest changes I believe @jeromekelleher . |
|
Ah, so you've implemented the Speed and Balding method in a test? I think we're good to go then. |
cb9037c to
9792946
Compare
|
Looks to me like the test has been implemented, so we should be pretty air-tight now. If there's any remaining issues we can follow them up and fix (with a bugfix release, if necessary). This is great, congratulations and thanks @brieuclehmann! |
9792946 to
46b2e3f
Compare
|
Ah good, yes, sorry, I was off in the woods. |
This is to add more extensive tests for genetic_relatedness, under the test_tree_stats framework.
I've run into a few issues so far:
ts.genetic_relatednesshas the additional argumentproportion=True(to divide by segregating_sites) which requires some additional plumbing in the overall testing framework. I'm not sure how best to set this up!ts.segregating_sitesreturns a vector when we have multiple windows. This was causing some unexpected output when dividing by segregating_sites (n_window x n_window) instead of (n_window) whenproportion=True. I've fixed this by adding[:, None](see line 5603 in trees.py). Is this OK?test_single_tree_infinite_sitescase. I'm getting the following error:ValueError: Summary function does not return zero for both zero weight and total weight.which I'm not sure how to fix.Fixes #974 .