
Conversation

@brieuclehmann
Contributor

This is to add more extensive tests for genetic_relatedness, under the test_tree_stats framework.

I've run into a few issues so far:

  • ts.genetic_relatedness has the additional argument proportion=True (to divide by segregating_sites) which requires some additional plumbing in the overall testing framework. I'm not sure how best to set this up!
  • ts.segregating_sites returns a vector when we have multiple windows. This was causing unexpected output of shape (n_windows, n_windows) instead of (n_windows,) when dividing by segregating_sites with proportion=True. I've fixed this by adding [:, None] (see line 5603 in trees.py). Is this OK?
  • I've only been looking at the site tests so far, and just the test_single_tree_infinite_sites case. I'm getting the following error, which I'm not sure how to fix: ValueError: Summary function does not return zero for both zero weight and total weight.
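For the second point, the broadcasting behind the [:, None] fix can be sketched in isolation (shapes and names here are illustrative, not the actual trees.py code):

```python
import numpy as np

# Hypothetical shapes: per-window statistics and per-window segregating sites.
num = np.ones((3, 4))        # (n_windows, n_stats)
denom = np.arange(1.0, 4.0)  # (n_windows,), as returned by ts.segregating_sites

# denom[:, None] has shape (3, 1), so the division broadcasts row-wise
# and the result keeps the (n_windows, n_stats) shape.
out = num / denom[:, None]
print(out.shape)  # (3, 4)
```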

Fixes #974 .

@AdminBot-tskit
Collaborator

📖 Docs for this PR can be previewed here

@brieuclehmann brieuclehmann marked this pull request as draft November 20, 2020 19:32
@petrelharp
Contributor

Summary function does not return zero for both zero weight and total weight.

Well, should it? It seems like it should, since "zero weight" corresponds to bits of the tree sequence not inherited by anyone, and "total weight" corresponds to bits inherited by everyone in the sample sets. (So, for instance, a mutation inherited by none of the samples has weight zero, and shouldn't contribute to trait covariance.) So, if x=[0,0,...,0] then in fact f(x) = 0, which is good:

        def f(x):
            mx = np.mean(x)
            return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

Now, "full sample weight" would be x = n, where n[i] = len(sample_sets[i]), I think. And if all the sample sets have the same size, then f(n) = 0, again. But I guess this would not be zero for sample sets of different sizes? For instance, if sample_sets = [[0], [2,3]] (the covariance between a haploid and a diploid), and indexes=[(0,0), (0,1), (1,1)], then f([1,2]) = [0.25, -0.25, 0.25], I think? Is this correct?
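A quick numeric check of these properties with this f (the indexes and weights below are illustrative, matching the haploid-vs-diploid example):

```python
import numpy as np

indexes = [(0, 0), (0, 1), (1, 1)]

def f(x):
    mx = np.mean(x)
    return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

# Zero weight: f returns exactly zero.
print(f(np.zeros(2)))  # [0. 0. 0.]

# Total weight with equal-sized sample sets (n = [2, 2]): also zero.
print(f(np.array([2.0, 2.0])))  # [0. 0. 0.]

# Total weight with unequal sizes (n = [1, 2]): not zero.
print(f(np.array([1.0, 2.0])))
```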

@petrelharp
Contributor

I suspect this means we need to redefine it a bit to account for the not-all-sample-sets-the-same-size case. (But in a way that's equivalent if they are.)

@petrelharp
Contributor

ts.genetic_relatedness has the additional argument proportion=True (to divide by segregating_sites) which requires some additional plumbing in the overall testing framework. I'm not sure how best to set this up!

Ah, right. I think what you've done (redefine verify_definition) is a good way to go - good job figuring out the spaghetti in there. Now, you could add proportion as an argument to it, and then do:

    def verify_sample_sets_indexes(self, ts, sample_sets, indexes, windows):
        def f(x):
            mx = np.mean(x)
            return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

        for proportion in [True, False]:
            self.verify_definition(
                ts,
                sample_sets,
                indexes,
                windows,
                f,
                ts.genetic_relatedness,
                genetic_relatedness,
                proportion=proportion,
            )

... ?

@petrelharp
Contributor

I've carefully read the python 'test' versions, by the way, and they look great.

Member

@jeromekelleher jeromekelleher left a comment


I've had a look through @brieuclehmann, and it all looks good to me (although I've not thought through the exact details). I'm happy to merge whenever @petrelharp is.

@brieuclehmann
Contributor Author

There's still a fair amount of work to do as I haven't started looking at the branch or node statistics. I'm a bit swamped at the moment but will have a go over the next few days.

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I suspect this means we need to redefine it a bit to account for the not-all-sample-sets-the-same-size case. (But in a way that's equivalent if they are.)

One way to do this would be:

n = np.array([len(x) for x in sample_sets])
mn = np.mean(n)

def f(x):
    mx = np.mean(x)
    return np.array(
        [(mn * x[i] / n[i] - mx) * (mn * x[j] / n[j] - mx) / 2 for i, j in indexes]
    )

This is compatible with the previous definition since mn / n[i] = 1 for all i when all the sample sets have the same length. Does this seem reasonable @petrelharp ?

EDIT: The following is also consistent and passes most of the tests for SiteGeneticRelatedness (whereas the above does not)

n = np.array([len(x) for x in sample_sets])
n_total = sum(n)

def f(x):
    mx = np.sum(x) / n_total
    return np.array([(x[i] - n[i] * mx) * (x[j] - n[j] * mx) / 2 for i, j in indexes])

It fails on the ghost allele test and single_tree_jukes_cantor.
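As a sanity check (with purely illustrative sample sets), both proposed definitions collapse to the original summary function when all sample sets have the same size:

```python
import numpy as np

indexes = [(0, 0), (0, 1), (1, 1)]
sample_sets = [[0, 1], [2, 3]]  # equal sizes, so all three should agree
n = np.array([len(s) for s in sample_sets])
mn = np.mean(n)
n_total = n.sum()

def f_orig(x):
    mx = np.mean(x)
    return np.array([(x[i] - mx) * (x[j] - mx) / 2 for i, j in indexes])

def f_first(x):
    mx = np.mean(x)
    return np.array(
        [(mn * x[i] / n[i] - mx) * (mn * x[j] / n[j] - mx) / 2 for i, j in indexes]
    )

def f_second(x):
    mx = np.sum(x) / n_total
    return np.array([(x[i] - n[i] * mx) * (x[j] - n[j] * mx) / 2 for i, j in indexes])

x = np.array([1.0, 2.0])
print(np.allclose(f_orig(x), f_first(x)))   # True
print(np.allclose(f_orig(x), f_second(x)))  # True
```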

I haven't changed the C implementation yet, but can push what I have now if that would help?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I'm also getting lots of warnings:

RuntimeWarning: invalid value encountered in true_divide
    return numerator / denominator

which I'm pretty sure is to do with dividing by zero when the number of segregating sites is zero.
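One sketch of how the division could return nans quietly for windows with no segregating sites (the array values here are made up):

```python
import numpy as np

numerator = np.array([0.5, 0.0])
denominator = np.array([2.0, 0.0])  # zero segregating sites in the second window

# Suppress the divide-by-zero/invalid warnings; 0/0 comes out as nan.
with np.errstate(divide="ignore", invalid="ignore"):
    out = numerator / denominator
print(out)  # [0.25  nan]
```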

@petrelharp
Contributor

For zero segregating sites, I guess we should be returning nans. See (I think) Fst for how to do this?

@petrelharp
Contributor

petrelharp commented Nov 23, 2020

And, I like both your suggestions! Do you think one makes more sense than the other? (e.g., which one has an easier/more sensible verbal definition?)

@brieuclehmann
Contributor Author

I think the second one makes more sense (and requires less work :D); it corresponds to what you get if you take the definition of (site) genetic_relatedness to be the covariance of the centred genotype matrix, and so it has the same trait covariance interpretation as in the overleaf document.

@petrelharp
Contributor

I agree! I guess we just should be more careful about what "centering" means?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 23, 2020

I've just pushed the latest changes for site_genetic_relatedness. This still fails on two tests (test_ghost_allele and test_single_tree_jukes_cantor) - I haven't looked into why this is the case yet.

I haven't yet looked at node_ or branch_genetic_relatedness either; at the moment, these are just copied from divergence, with the plumbing changed to include proportion. I should have time later this week to have a go at these. It's not immediately clear to me what the 'naive' definitions should be, but I haven't given it a great deal of thought yet!

@petrelharp
Contributor

It's not immediately clear to me what the 'naive' definitions should be but I haven't given it a great deal of thought yet!

Taking our definition-in-words in the documentation should make this clear, hopefully? And if not, updating the definition? That's the idea - implement the description-in-words in as simple and obviously-correct a way as possible. The divergence code should be a good place to start.

@brieuclehmann
Contributor Author

I've added branch_genetic_relatedness, which is passing all the tests! It follows the same logic as the site_genetic_relatedness, but treating each branch as a site.

Just node_genetic_relatedness and sorting the two failing tests in site_genetic_relatedness to go...

@brieuclehmann
Contributor Author

So I think the issue with test_ghost_allele and test_single_tree_jukes_cantor is that G = ts.genotype_matrix() contains values other than {0,1}. For example, in test_ghost_allele, we have G = [2 0 0 0]. We should be treating this as G = [1 0 0 0] and computing the covariance based on this - this is what ts.genetic_relatedness() is currently doing, but not the naive implementation, so I would need to tweak that. Does this sound about right @petrelharp ?
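If helpful, the "treat G = [2 0 0 0] as [1 0 0 0]" tweak amounts to collapsing each genotype row to presence/absence before computing the covariance (a sketch, not the actual test code):

```python
import numpy as np

# Hypothetical genotype row with a non-{0,1} allele index, as in test_ghost_allele.
G_row = np.array([2, 0, 0, 0])

# Collapse to carrier/non-carrier before computing the covariance.
x = (G_row != 0).astype(int)
print(x)  # [1 0 0 0]
```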

@codecov

codecov bot commented Nov 26, 2020

Codecov Report

Merging #1023 (46b2e3f) into main (1e66249) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1023   +/-   ##
=======================================
  Coverage   93.68%   93.68%           
=======================================
  Files          26       26           
  Lines       20985    20991    +6     
  Branches      875      875           
=======================================
+ Hits        19659    19665    +6     
  Misses       1289     1289           
  Partials       37       37           
Flag Coverage Δ
c-tests 92.50% <100.00%> (+<0.01%) ⬆️
lwt-tests 92.80% <ø> (ø)
python-c-tests 94.82% <100.00%> (+<0.01%) ⬆️
python-tests 98.62% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
c/tskit/trees.c 94.80% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 97.46% <100.00%> (+<0.01%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e66249...46b2e3f.

@brieuclehmann brieuclehmann marked this pull request as ready for review November 26, 2020 09:26
@petrelharp
Contributor

G = ts.genotype_matrix() contains values other than {0,1}.

Oh, right, that'll be it - good spotting! site_genetic_relatedness can be fixed up by rewriting using the more general x != y formulation from the theory paper.

@brieuclehmann
Contributor Author

Yep, I think I've fixed it! Everything passing without issues now.

@jeromekelleher
Member

Woot! Sounds like we're nearly there @brieuclehmann, we just need the node stat now?

@brieuclehmann
Contributor Author

The node stat is done too @jeromekelleher !

@jeromekelleher
Member

Excellent! I think we just need @petrelharp's approval on the PR and we can merge then.

@petrelharp
Contributor

Looking good, and nearly there! But, gee - this has turned up an oversight in the testing set-up! We weren't running tests where the union of all sample sets was not equal to all the samples. I've fixed this over in #1039; you could merge that in to get tests that fail for the centering issue I'm pointing out in the comments.

@brieuclehmann
Contributor Author

I've fixed this over in #1039; you could merge that in to get tests that fail for the issue related to centering I'm pointing out in the comments.

I fear my (lack of) git skills might cause some issues. How should I merge #1039 to my current branch?

@benjeffery
Member

@brieuclehmann I can merge if you like - you want #1039 to be part of this PR, yes?

@brieuclehmann
Contributor Author

brieuclehmann commented Nov 26, 2020

Um, possibly! I'd like to apply the tests that @petrelharp has added so that I can check that my fixes have worked.

@benjeffery
Member

I've rebased #1039 onto #1023 at https://github.com/benjeffery/tskit/tree/1023-1039. Running the tests there, there are a few failures.

@brieuclehmann
Contributor Author

Ah right, I just pushed through my proposed changes. How should I go about checking these on https://github.com/benjeffery/tskit/tree/1023-1039 ?

@benjeffery
Member

benjeffery commented Nov 26, 2020

I've just updated and am running the tests! The changes in 1039 are quite small - you could prob cut and paste them as a quick fix.
EDIT: Still seeing failures.

@brieuclehmann
Contributor Author

you could prob cut and paste them as a quick fix.

Good idea! I've done this and it seems to be working for the genetic_relatedness tests on my machine... Any suggestions?

@benjeffery
Member

Add them to this PR and push to see if CI is happy?

@petrelharp
Contributor

The factor of 2 is bothering me. Can you do a spot check that this definition agrees with the equation we cite, in the case of a binary alleles matrix? We could just tack one more test on to check the normalization, like:

class TestSiteGeneticRelatedness(TestGeneticRelatedness, MutatedTopologyExamplesMixin):
    mode = "site"

    def test_one_more_thing(self):
        ts = msprime.simulate(
            10,
            mutation_rate=0.01,
            sequence_length=100,
            recombination_rate=0.01,
            random_seed=23,
        )
        A = ts.genetic_relatedness(
            [[0, 1], [2, 3], [4, 5]], mode="site", span_normalise=False, ...
        )
        B = < computed as in Speed and Balding >
        self.assertArrayAlmostEqual(A, B)

@petrelharp
Contributor

Good idea! I've done this and it seems to be working for the genetic_relatedness tests on my machine... Any suggestions?

That's great! The fixes you made to the tests look right to me after a quick glance, so I believe it...

@brieuclehmann
Contributor Author

We could just tack one more test on to check the normalization, like:

I've added a test @petrelharp - let me know if that's what you had in mind!

@petrelharp
Contributor

I've added a test @petrelharp - let me know if that's what you had in mind!

Yes, perfect! Just two things: (1) put in a comment at the top of the test what it's doing (comparing to Speed & Balding's K_{c0}), and (2) move the test to TestSiteGeneticRelatedness (since right now it's getting called three times, including under TestBranchGeneticRelatedness and TestNodeGeneticRelatedness, which inherit from the class it's defined in).

@jeromekelleher
Member

How's this looking @brieuclehmann? It would be super-nice if we could get the last of @petrelharp's comments sorted out so we can ship 0.3.3 today.

@brieuclehmann
Contributor Author

Just waiting for @petrelharp 's approval on the latest changes I believe @jeromekelleher .

@jeromekelleher
Member

jeromekelleher commented Nov 27, 2020

Ah, so you've implemented the Speed and Balding method in a test? I think we're good to go then.

@jeromekelleher
Member

Looks to me like the test has been implemented, so we should be pretty air-tight now. If there are any remaining issues we can follow them up and fix them (with a bugfix release, if necessary).

This is great, congratulations and thanks @brieuclehmann!

@mergify mergify bot merged commit 63566d3 into tskit-dev:main Nov 27, 2020
@petrelharp
Contributor

Ah good, yes, sorry, I was off in the woods.
