
Conversation

@brieuclehmann
Contributor

This is to add more extensive tests for genetic_relatedness, under the test_tree_stats framework.

I've run into a few issues so far:

  • ts.genetic_relatedness has the additional argument proportion=True (to divide by segregating_sites) which requires some additional plumbing in the overall testing framework. I'm not sure how best to set this up!
  • ts.segregating_sites returns a vector when we have multiple windows. This was causing unexpected output of shape (n_windows, n_windows) instead of (n_windows,) when dividing by segregating_sites with proportion=True. I've fixed this by adding [:, None] (see line 5603 in trees.py). Is this OK?
  • I've only been looking at the site tests so far, and just the test_single_tree_infinite_sites case. I'm getting the following error, which I'm not sure how to fix: ValueError: Summary function does not return zero for both zero weight and total weight.
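For the second point, the broadcasting behind the [:, None] fix can be sketched in isolation (shapes and names here are illustrative, not the actual trees.py code):

```python
import numpy as np

# Hypothetical shapes: per-window statistics and per-window segregating sites.
num = np.ones((3, 4))        # (n_windows, n_stats)
denom = np.arange(1.0, 4.0)  # (n_windows,), as returned by ts.segregating_sites

# denom[:, None] has shape (3, 1), so the division broadcasts row-wise
# and the result keeps the (n_windows, n_stats) shape.
out = num / denom[:, None]
print(out.shape)  # (3, 4)
```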

Fixes #974 .

@AdminBot-tskit
Collaborator

📖 Docs for this PR can be previewed here

@brieuclehmann brieuclehmann marked this pull request as draft November 20, 2020 19:32
@petrelharp
Contributor

Summary function does not return zero for both zero weight and total weight.

Well, should it? It seems like it should, since "zero weight" corresponds to bits of the tree sequence not inherited by anyone, and "total weight" corresponds to bits inherited by everyone in the sample sets. (So, for instance, a mutation inherited by none of the samples has weight zero, and shouldn't contribute to trait covariance.) So, if x=[0,0,...,0] then in fact f(x) = 0, which is good:

        def f(x):
            mx = np.mean(x)
            return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

Now, "full sample weight" would be x = n, where n[i] = len(sample_sets[i]), I think. And if all the sample sets have the same size, then f(n) = 0, again. But I guess this would not be zero for sample sets of different sizes? For instance, if sample_sets = [[0], [2,3]] (the covariance between a haploid and a diploid), and indexes=[(0,0), (0,1), (1,1)], then f([1,2]) = [0.25, -0.25, 0.25], I think? Is this correct?
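A quick numeric check of these properties with this f (the indexes and weights below are illustrative, matching the haploid-vs-diploid example):

```python
import numpy as np

indexes = [(0, 0), (0, 1), (1, 1)]

def f(x):
    mx = np.mean(x)
    return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

# Zero weight: f returns exactly zero.
print(f(np.zeros(2)))  # [0. 0. 0.]

# Total weight with equal-sized sample sets (n = [2, 2]): also zero.
print(f(np.array([2.0, 2.0])))  # [0. 0. 0.]

# Total weight with unequal sizes (n = [1, 2]): not zero.
print(f(np.array([1.0, 2.0])))
```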

@petrelharp
Contributor

I suspect this means we need to redefine it a bit to account for the not-all-sample-sets-the-same-size case. (But in a way that's equivalent if they are.)

@petrelharp
Contributor

ts.genetic_relatedness has the additional argument proportion=True (to divide by segregating_sites) which requires some additional plumbing in the overall testing framework. I'm not sure how best to set this up!

Ah, right. I think what you've done (redefine verify_definition) is a good way to go - good job figuring out the spaghetti in there. Now, you could add proportion as an argument to it, and then do:

    def verify_sample_sets_indexes(self, ts, sample_sets, indexes, windows):
        def f(x):
            mx = np.mean(x)
            return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

        for proportion in [True, False]:
            self.verify_definition(
                ts,
                sample_sets,
                indexes,
                windows,
                f,
                ts.genetic_relatedness,
                genetic_relatedness,
                proportion=proportion,
            )

... ?

@petrelharp
Contributor

I've carefully read the python 'test' versions, by the way, and they look great.

Member

@jeromekelleher jeromekelleher left a comment


I've had a look through @brieuclehmann, and it all looks good to me (although I've not thought through the exact details). I'm happy to merge whenever @petrelharp is.

@brieuclehmann
Contributor Author

There's still a fair amount of work to do as I haven't started looking at the branch or node statistics. I'm a bit swamped at the moment but will have a go over the next few days.

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I suspect this means we need to redefine it a bit to account for the not-all-sample-sets-the-same-size case. (But in a way that's equivalent if they are.)

One way to do this would be:

n = np.array([len(x) for x in sample_sets])
mn = np.mean(n)

def f(x):
    mx = np.mean(x)
    return np.array(
        [(mn * x[i] / n[i] - mx) * (mn * x[j] / n[j] - mx) / 2 for i, j in indexes]
    )

This is compatible with the previous definition since mn / n[i] = 1 for all i when all the sample sets have the same length. Does this seem reasonable @petrelharp ?

EDIT: The following is also consistent and passes most of the tests for SiteGeneticRelatedness (whereas the above does not)

n = np.array([len(x) for x in sample_sets])
n_total = sum(n)

def f(x):
    mx = np.sum(x) / n_total
    return np.array([(x[i] - n[i] * mx) * (x[j] - n[j] * mx) / 2 for i, j in indexes])

It fails on the ghost allele test and single_tree_jukes_cantor.
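As a sanity check (with purely illustrative sample sets), both proposed definitions collapse to the original summary function when all sample sets have the same size:

```python
import numpy as np

indexes = [(0, 0), (0, 1), (1, 1)]
sample_sets = [[0, 1], [2, 3]]  # equal sizes, so all three should agree
n = np.array([len(s) for s in sample_sets])
mn = np.mean(n)
n_total = n.sum()

def f_orig(x):
    mx = np.mean(x)
    return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

def f_first(x):
    mx = np.mean(x)
    return np.array(
        [(mn * x[i] / n[i] - mx) * (mn * x[j] / n[j] - mx) / 2 for i, j in indexes]
    )

def f_second(x):
    mx = np.sum(x) / n_total
    return np.array([(x[i] - n[i] * mx) * (x[j] - n[j] * mx) / 2 for i, j in indexes])

x = np.array([1.0, 2.0])
print(np.allclose(f_orig(x), f_first(x)))   # True
print(np.allclose(f_orig(x), f_second(x)))  # True
```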

I haven't changed the C implementation yet, but can push what I have now if that would help?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I'm also getting lots of warnings:

RuntimeWarning: invalid value encountered in true_divide
    return numerator / denominator

which I'm pretty sure is to do with dividing by zero when the number of segregating sites is zero.
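One sketch of how the division could return nans quietly for windows with no segregating sites (the array values here are made up):

```python
import numpy as np

numerator = np.array([0.5, 0.0])
denominator = np.array([2.0, 0.0])  # zero segregating sites in the second window

# Suppress the divide-by-zero/invalid warnings; 0/0 comes out as nan.
with np.errstate(divide="ignore", invalid="ignore"):
    out = numerator / denominator
print(out)  # [0.25  nan]
```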

@petrelharp
Contributor

For zero segregating sites, I guess we should be returning nans. See (I think) Fst for how to do this?

@petrelharp
Contributor

petrelharp commented Nov 23, 2020

And, I like both your suggestions! Do you think one makes more sense than the other? (e.g., which one has an easier/more sensible verbal definition?)

@brieuclehmann
Contributor Author

I think the second one makes more sense (and requires less work :D); it corresponds to what you get if you take the definition of (site) genetic_relatedness to be the covariance of the centred genotype matrix, and so it has the same trait covariance interpretation as in the overleaf document.

@petrelharp
Contributor

I agree! I guess we just should be more careful about what "centering" means?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I've just pushed the latest changes for site_genetic_relatedness. This still fails on two tests (test_ghost_allele and test_single_tree_jukes_cantor) - I haven't looked into why this is the case yet.

I haven't yet looked at node_ or branch_genetic_relatedness either; at the moment, these are just copied from divergence, with the plumbing changed to include proportion. I should have time later this week to have a go at these. It's not immediately clear to me what the 'naive' definitions should be, but I haven't given it a great deal of thought yet!

@petrelharp
Contributor

It's not immediately clear to me what the 'naive' definitions should be but I haven't given it a great deal of thought yet!

Taking our definition-in-words in the documentation should make this clear, hopefully? And if not, updating the definition? That's the idea - implement the description-in-words in as simple and obviously-correct a way as possible. The divergence code should be a good place to start.

@brieuclehmann
Contributor Author

I've added branch_genetic_relatedness, which is passing all the tests! It follows the same logic as the site_genetic_relatedness, but treating each branch as a site.

Just node_genetic_relatedness and sorting the two failing tests in site_genetic_relatedness to go...

@brieuclehmann
Contributor Author

So I think the issue with test_ghost_allele and test_single_tree_jukes_cantor is that G = ts.genotype_matrix() contains values other than {0,1}. For example, in test_ghost_allele, we have G = [2 0 0 0]. We should be treating this as G = [1 0 0 0] and computing the covariance based on this - this is what ts.genetic_relatedness() is currently doing, but not the naive implementation, so I would need to tweak that. Does this sound about right @petrelharp ?
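If helpful, the "treat G = [2 0 0 0] as [1 0 0 0]" tweak amounts to collapsing each genotype row to presence/absence before computing the covariance (a sketch, not the actual test code):

```python
import numpy as np

# Hypothetical genotype row with a non-{0,1} allele index, as in test_ghost_allele.
G_row = np.array([2, 0, 0, 0])

# Collapse to carrier/non-carrier before computing the covariance.
x = (G_row != 0).astype(int)
print(x)  # [1 0 0 0]
```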

@codecov

codecov bot commented Nov 26, 2020

Codecov Report

Merging #1023 (46b2e3f) into main (1e66249) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1023   +/-   ##
=======================================
  Coverage   93.68%   93.68%           
=======================================
  Files          26       26           
  Lines       20985    20991    +6     
  Branches      875      875           
=======================================
+ Hits        19659    19665    +6     
  Misses       1289     1289           
  Partials       37       37           
Flag Coverage Δ
c-tests 92.50% <100.00%> (+<0.01%) ⬆️
lwt-tests 92.80% <ø> (ø)
python-c-tests 94.82% <100.00%> (+<0.01%) ⬆️
python-tests 98.62% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
c/tskit/trees.c 94.80% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 97.46% <100.00%> (+<0.01%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e66249...46b2e3f.

@brieuclehmann brieuclehmann marked this pull request as ready for review November 26, 2020 09:26
@petrelharp
Contributor

G = ts.genotype_matrix() contains values other than {0,1}.

Oh, right, that'll be it - good spotting! site_genetic_relatedness can be fixed up by rewriting using the more general x != y formulation from the theory paper.

@brieuclehmann
Contributor Author

Yep, I think I've fixed it! Everything passing without issues now.

@jeromekelleher
Member

Woot! Sounds like we're nearly there @brieuclehmann, we just need the node stat now?

@brieuclehmann
Contributor Author

The node stat is done too @jeromekelleher !

@jeromekelleher
Member

Excellent! I think we just need @petrelharp's approval on the PR and we can merge then.

@petrelharp
Contributor

Looking good, and nearly there! But, gee - this has turned up an oversight in the testing set-up! We weren't running tests where the union of all sample sets was not equal to all the samples. I've fixed this over in #1039; you could merge that in to get tests that fail for the centering issue I'm pointing out in the comments.

@brieuclehmann
Contributor Author

I've fixed this over in #1039; you could merge that in to get tests that fail for the issue related to centering I'm pointing out in the comments.

I fear my (lack of) git skills might cause some issues. How should I merge #1039 to my current branch?

@benjeffery
Member

@brieuclehmann I can merge if you like - you want #1039 to be part of this PR, yes?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 26, 2020

Um, possibly! I'd like to apply the tests that @petrelharp has added so that I can check that my fixes have worked.

@benjeffery
Member

I've rebased #1039 onto #1023 at https://github.com/benjeffery/tskit/tree/1023-1039. Running the tests there, there are a few failures.

@brieuclehmann
Contributor Author

Ah right, I just pushed through my proposed changes. How should I go about checking these on https://github.com/benjeffery/tskit/tree/1023-1039 ?

@benjeffery
Member

benjeffery commented Nov 26, 2020

I've just updated and am running the tests! The changes in 1039 are quite small - you could prob cut and paste them as a quick fix.
EDIT: Still seeing failures.

@brieuclehmann
Contributor Author

you could prob cut and paste them as a quick fix.

Good idea! I've done this and it seems to be working for the genetic_relatedness tests on my machine... Any suggestions?

@benjeffery
Member

Add them to this PR and push to see if CI is happy?

@petrelharp
Contributor

The factor of 2 is bothering me. Can you do a spot check that this definition agrees with the equation we cite, in the case of a binary alleles matrix? We could just tack one more test on to check the normalization, like:

class TestSiteGeneticRelatedness(TestGeneticRelatedness, MutatedTopologyExamplesMixin):
    mode = "site"

    def test_one_more_thing(self):
        ts = msprime.simulate(
            10,
            mutation_rate=0.01,
            sequence_length=100,
            recombination_rate=0.01,
            random_seed=23,
        )
        A = ts.genetic_relatedness(
            [[0, 1], [2, 3], [4, 5]], mode="site", span_normalise=False, ...
        )
        B = < computed as in Speed and Balding >
        self.assertArrayAlmostEqual(A, B)

@petrelharp
Contributor

Good idea! I've done this and it seems to be working for the genetic_relatedness tests on my machine... Any suggestions?

That's great! The fixes you made to the tests look right to me after a quick glance, so I believe it...

@brieuclehmann
Contributor Author

We could just tack one more test on to check the normalization, like:

I've added a test @petrelharp - let me know if that's what you had in mind!

@petrelharp
Contributor

I've added a test @petrelharp - let me know if that's what you had in mind!

Yes, perfect! Just two things: (1) put in a comment at the top of the test what it's doing (comparing to Speed & Balding's K_{c0}), and (2) move the test to TestSiteGeneticRelatedness (since right now it's getting called three times, including under TestBranchGeneticRelatedness and TestNodeGeneticRelatedness, which inherit from the class it's defined in).

@jeromekelleher
Member

How's this looking @brieuclehmann? It would be super-nice if we could get the last of @petrelharp's comments sorted out so we can ship 0.3.3 today.

@brieuclehmann
Contributor Author

Just waiting for @petrelharp 's approval on the latest changes I believe @jeromekelleher .

@jeromekelleher
Member

jeromekelleher commented Nov 27, 2020

Ah, so you've implemented the Speed and Balding method in a test? I think we're good to go then.

@jeromekelleher
Member

Looks to me like the test has been implemented, so we should be pretty air-tight now. If there are any remaining issues we can follow them up and fix them (with a bugfix release, if necessary).

This is great, congratulations and thanks @brieuclehmann!

@mergify mergify bot merged commit 63566d3 into tskit-dev:main Nov 27, 2020
@petrelharp
Contributor

Ah good, yes, sorry, I was off in the woods.
