Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nasty gotcha in the AFS calculation #2729

Open
lukashuebner opened this issue Apr 3, 2023 · 3 comments · May be fixed by #2734
Open

Nasty gotcha in the AFS calculation #2729

lukashuebner opened this issue Apr 3, 2023 · 3 comments · May be fixed by #2734

Comments

@lukashuebner
Copy link

lukashuebner commented Apr 3, 2023

I get strange results when trying to calculate the AFS for the following dataset:

// Same as the example from the PLOS paper but with back and recurring mutations
/*
0.25┊     8   ┊         ┊         ┊
    ┊   ┏━┻━┓ ┊         ┊         ┊
0.20┊   ┃   ┃ ┊         ┊   7     ┊
    ┊   ┃   ┃ ┊         ┊ ┏━┻━┓   ┊
0.17┊   6   ┃ ┊   6     ┊ ┃   ┃   ┊
    ┊ ┏━┻┓  ┃ ┊ ┏━┻━┓   ┊ ┃   ┃   ┊
0.09┊ ┃  5  ┃ ┊ ┃   5   ┊ ┃   5   ┊
    ┊ ┃ ┏┻┓ ┃ ┊ ┃ ┏━┻┓  ┊ ┃ ┏━┻┓  ┊
0.07┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃  4  ┊ ┃ ┃  4  ┊
    ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.00┊ 0 1 3 2 ┊ 0 1 2 3 ┊ 0 1 2 3 ┊
  0.00      2.00      7.00      10.00
*/
char const* multi_tree_back_recurrent_nodes = "1  0       -1   0\n"
                                              "1  0       -1   0\n"
                                              "1  0       -1   1\n"
                                              "1  0       -1   1\n"
                                              "0  0.071   -1   -1\n"
                                              "0  0.090   -1   -1\n"
                                              "0  0.170   -1   -1\n"
                                              "0  0.202   -1   -1\n"
                                              "0  0.253   -1   -1\n";

char const* multi_tree_back_recurrent_edges = "2 10 4 2\n"
                                              "2 10 4 3\n"
                                              "0 10 5 1\n"
                                              "0 2  5 3\n"
                                              "2 10 5 4\n"
                                              "0 7  6 0,5\n"
                                              "7 10 7 0,5\n"
                                              "0 2  8 2,6\n";

char const* multi_tree_back_recurrent_sites = "1      0\n"
                                              "4.5    0\n"
                                              "8.5    0\n";

/* site, node, derived_state, [parent, time] */
char const* multi_tree_back_recurrent_mutations = "0    6   1\n"  // to derived state
                                                  "0    5   0\n"  // back to ancestral state
                                                  "0    3   1\n"  // and once more to the derived state
                                                  "1    5   1\n"  // to derived state
                                                  "1    4   0\n"  // back to ancestral state
                                                  "2    4   1\n"; // to derived state

/* Two (diploid) individuals */
char const* multi_tree_back_recurrent_individuals = "0      0.2,1.5    -1,-1\n"
                                                    "0      0.0,0.0    -1,-1\n";

// [...]
tsk_treeseq_t tskit_tree_sequence;
tsk_id_t      samples[]          = {0, 1, 2, 3};
tsk_size_t    sample_set_sizes[] = {4, 0};
double        ref_result[5];
int           ret;

tsk_treeseq_from_text(
      &tskit_tree_sequence,
      10,
      multi_tree_back_recurrent_nodes,
      multi_tree_back_recurrent_edges,
      NULL,
      multi_tree_back_recurrent_sites,
      multi_tree_back_recurrent_mutations,
      multi_tree_back_recurrent_individuals,
      NULL,
      0
);

// Polarised AFS
ret = tsk_treeseq_allele_frequency_spectrum(
      &tskit_tree_sequence,
      1,
      sample_set_sizes,
      samples,
      0,
      NULL,
      TSK_STAT_POLARISED,
      ref_result
);
assert(ret == 0);
fmt::print("tskit AFS: {}\n", fmt::join(ref_result, ", "));
// Output:          tskit AFS: 0, 0, 1, 1, 0
// Expected output: tskit AFS: 0, 1, 2, 0, 0

Am I doing something wrong when calling the tskit functions? Shouldn't the entries in the AFS histogram always sum up to the number of sites (3 in this case)?

I also find something else unintuitive, maybe you can help: If all sites are in the derived state, why is the site counted as if all sites were in ancestral state?

@petrelharp
Copy link
Contributor

Thanks for the report - this is indeed the wrong AFS, although not technically a bug. The problem here is that the tables did not have compute_mutation_parents called on it, so the resulting state of the tables was incorrect. As I believe @jeromekelleher noted elsewhere recently, we probably ought to be doing this as part of creating a tree sequence. A fix for your particular code is to call tsk_table_collection_compute_mutation_parents on the tables before creating the tree sequence; really, this ought to happen in tsk_treeseq_from_text (although note that this latter function is not part of the public API).

Here's some python code that reproduces the problem:

import tskit

nodes = [[1,  0    ,   -1,    0],
         [1,  0    ,   -1,    0],
         [1,  0    ,   -1,    1],
         [1,  0    ,   -1,    1],
         [0,  0.071,   -1,   -1],
         [0,  0.090,   -1,   -1],
         [0,  0.170,   -1,   -1],
         [0,  0.202,   -1,   -1],
         [0,  0.253,   -1,   -1],
]

edges = [
  [2, 10, 4 , (  2,)],
  [2, 10, 4 , (  3,)],
  [0, 10, 5 , (  1,)],
  [0, 2 , 5 , (  3,)],
  [2, 10, 5 , (  4,)],
  [0, 7 , 6 , (0,5,)],
  [7, 10, 7 , (0,5,)],
  [0, 2 , 8 , (2,6,)],
]

sites = [1, 4.5, 8.5]

# /* site, node, derived_state, [parent, time] */
mutations = [
 [0,    6,   "1"], # to derived state
 [0,    5,   "0"], # back to ancestral state
 [0,    3,   "1"], # and once more to the derived state
 [1,    5,   "1"], # to derived state
 [1,    4,   "0"], # back to ancestral state
 [2,    4,   "1"], # to derived state
]

individuals  = [
    [0,      (0.2,1.5),    -1, -1],
    [0,      (0.0,0.0),    -1, -1],
]

t = tskit.TableCollection(sequence_length=10)

for flags, time, population, individual in nodes:
    t.nodes.add_row(flags=flags, time=time, population=population, individual=individual)

for left, right, parent, children in edges:
    for child in children:
        t.edges.add_row(left=left, right=right, parent=parent, child=child)

for s in sites:
    t.sites.add_row(position=s, ancestral_state="0")

for site, node, derived_state in  mutations:
    t.mutations.add_row(site=site, node=node, derived_state=derived_state)

for flags, location, _, _ in individuals:
    t.individuals.add_row(flags=flags, location=location)

ts = t.tree_sequence()
print("AFS before computing mutation parents:",
      ts.allele_frequency_spectrum(span_normalise=False, polarised=True))

t.compute_mutation_parents()

nts = t.tree_sequence()
print("AFS after computing mutation parents:",
      nts.allele_frequency_spectrum(span_normalise=False, polarised=True))

which outputs

AFS before computing mutation parents: [0. 0. 1. 1. 0.]
AFS after computing mutation parents: [0. 1. 2. 0. 0.]

@jeromekelleher
Copy link
Member

Well spotted @petrelharp! I agree this isn't a bug since the function isn't part of the API, but it is a nasty gotcha.

@lukashuebner lukashuebner changed the title Possible bug in the AFS calculation Nasty gotcha in the AFS calculation Apr 4, 2023
@lukashuebner lukashuebner linked a pull request Apr 4, 2023 that will close this issue
@lukashuebner
Copy link
Author

Thanks for the fast reply! :-) I created a PR with the proposed improvement: #2734

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants