Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Jul 1, 2021

Starts to address #1542. Currently this is a draft version of a change to the python-only Hartigan map_mutations code. I think this is right, but it could do with checking and also we should add a few round-trip tests with ancestral state specified (I'm not sure how comprehensive to make this, e.g. whether to patch it in to the round trip tests higher up).

The doesn't code up the C version, but I guess that should be easy. It also doesn't add the ancestral state function to the fitch map_mutations code, but we don't use that except for testing anyway.

I guess it's OK that the algorithm is allowed to put a mutation above the root, so that the ancestral state for the whole tree switches immediately?

@hyanwong hyanwong requested a review from jeromekelleher July 1, 2021 10:25
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks basically right to me.

@codecov
Copy link

codecov bot commented Jul 1, 2021

Codecov Report

Merging #1550 (401c1aa) into main (2c897a3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 401c1aa differs from pull request most recent head 2384f39. Consider uploading reports for the commit 2384f39 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1550   +/-   ##
=======================================
  Coverage   93.70%   93.70%           
=======================================
  Files          27       27           
  Lines       22759    22775   +16     
  Branches     1076     1076           
=======================================
+ Hits        21326    21342   +16     
  Misses       1399     1399           
  Partials       34       34           
Flag Coverage Δ
c-tests 92.03% <100.00%> (+<0.01%) ⬆️
lwt-tests 92.97% <ø> (ø)
python-c-tests 95.29% <100.00%> (+<0.01%) ⬆️
python-tests 98.80% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/core.c 97.09% <100.00%> (+0.01%) ⬆️
c/tskit/trees.c 94.85% <100.00%> (+<0.01%) ⬆️
python/_tskitmodule.c 91.84% <100.00%> (+0.01%) ⬆️
python/tskit/trees.py 97.94% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c897a3...2384f39. Read the comment docs.

@jeromekelleher
Copy link
Member

I guess it's OK that the algorithm is allowed to put a mutation above the root, so that the ancestral state for the whole tree switches immediately?

I don't understand what you mean here - the actual site.ancestral_state is guaranteed to be the value you asked for, right? Once this holds, then it's up to the algorithm to decide a parsimonious assignment, and putting a mutation over the root will often be the only way to do this (.e.g, genotypes = [0,...,0] and specifying ancestral state = 1).

@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

I don't understand what you mean here - the actual site.ancestral_state is guaranteed to be the value you asked for, right? Once this holds, then it's up to the algorithm to decide a parsimonious assignment, and putting a mutation over the root will often be the only way to do this (.e.g, genotypes = [0,...,0] and specifying ancestral state = 1).

Yep, it just depends if by ancestral state you mean the state of the MRCA. If that were the case, then the most parsimonious explanation would be to have all branches descending from the MRCA to have a separate switch. I guess it's just a question of making sure the documentation is clear that the ancestral state is not necessarily that shown by the grand MRCA, if mutations above the root are allowed (perhaps that's obvious, but it bears saying somewhere).

@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

This looks basically right to me.

Thanks @jeromekelleher . It would be useful to get this in for the tsdate paper revisions, so I'll have a go at the C side. Are there any tips on how in C you are supposed to check for a variable that can get passed in as either an integer or None in python?

@jeromekelleher
Copy link
Member

In the Python C API you'd make it an optional value. It's not obvious what the right approach to getting the plumbing set up, might be simplest if I put in a commit here doing the C side and then perhaps you could write more tests?

@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

might be simplest if I put in a commit here doing the C side and then perhaps you could write more tests?

If you have time, that would be great, thanks. Then I can see how it's done too.

@hyanwong hyanwong force-pushed the map_mutations_as branch from 16a340f to ee6896c Compare July 1, 2021 13:11
@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

Changes to tests now pushed - this now has compare_lib=True, so tests should fail until the C version is working.

@jeromekelleher
Copy link
Member

Done - C stuff was slightly tricky because we already had an ancestral_state parameter used to send it back. Using a flag to say that this is actually an input seems OK, though.

I think we should probably support specifying the ancestral state as a string as well, since it's easy enough to do. What do you think?

@jeromekelleher
Copy link
Member

Also, it's not clear to me when we'll have mutations over roots and mutations over the children of the root - it would be good to test this on a few different oddball scenarios to make sure it's doing what we think it's doing.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

Done - C stuff was slightly tricky because we already had an ancestral_state parameter used to send it back. Using a flag to say that this is actually an input seems OK, though.

Thanks - quick work.

I think we should probably support specifying the ancestral state as a string as well, since it's easy enough to do. What do you think?

If you don't mind variable type params, that's fine, yeah. I guess this would just be in Python though?

@jeromekelleher
Copy link
Member

If you don't mind variable type params, that's fine, yeah. I guess this would just be in Python though?

Yes, just at the top-level Python function. See the comment.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 1, 2021

I think that there's a slight wrinkle: at the moment we calculate the number of alleles as the maximum integer in the genotype array. But actually we should be calculating it as the length of the provided alleles list (or one less than that if the last item in the alleles list is None). Think of the case where we specify an alleles list of ["A", "B", "C"], have a genotype array of 0s and 1s, and specify the ancestral_state as 2 (or "C").

For the python code, we'd just have

num_alleles = len(alleles)
if alleles[-1] is None:
    num_alleles -= 1

(instead of num_alleles = np.max(genotypes[not_missing]) + 1)

I guess something similar could happen in the C code? I assume it was done like that previously simply for efficiency reasons, e.g. if the user specified a long list of alleles but only a few were actually present in the genotype array?

(edit) - or we could simply do

num_alleles = np.max(genotypes[not_missing]) + 1
if ancestral_state is not None and ancestral_state >= num_alleles:
    num_alleles = ancestral_state + 1

@jeromekelleher
Copy link
Member

This is a quirk all right, and we could work around it by letting anything < 64 be the ancestral state (from the C code's perspective, which doesn't know anything about alleles). My thought was though, that specifying a value 2 (say) as the ancestrel state when the genotypes array is full of 0s and 1s is probably a mistake.

We can change it though, if you think there's a reasonable use case.

@hyanwong hyanwong force-pushed the map_mutations_as branch from 3420384 to 9c3c1e7 Compare July 2, 2021 09:00
@hyanwong
Copy link
Member Author

hyanwong commented Jul 2, 2021

This is a quirk all right, and we could work around it by letting anything < 64 be the ancestral state (from the C code's perspective, which doesn't know anything about alleles). My thought was though, that specifying a value 2 (say) as the ancestral state when the genotypes array is full of 0s and 1s is probably a mistake.

We can change it though, if you think there's a reasonable use case.

I think it needn't be a mistake to specify an ancestral state that isn't in the haplotype array. Indeed, it is quite likely to happen in user code. We have had a fair number of examples brought up by users where the ancestral state in the VCF is, say, "C", but the samples that are actually used in the analysis have only "A" and "T". It would be a mistake, IMO, not to be able to lay down variation on these triallelic sites via parsimony in tsinfer, which is one of the use-cases of this new functionality.

I've just pushed a commit which changes both the C and python versions as appropriate, along with a few more tests. Note that I haven't added to the C tests, so the check in the C code that ancestral_state < HARTIGAN_MAX_ALLELES is not tested.

@hyanwong hyanwong force-pushed the map_mutations_as branch from 9c3c1e7 to c503dd2 Compare July 2, 2021 09:04
@jeromekelleher
Copy link
Member

Yep, looks good.

@hyanwong hyanwong force-pushed the map_mutations_as branch 2 times, most recently from 83b9631 to 1d886a4 Compare July 2, 2021 10:57
@hyanwong hyanwong marked this pull request as ready for review July 2, 2021 10:58
@hyanwong
Copy link
Member Author

hyanwong commented Jul 2, 2021

Squashed and ready for review

@hyanwong
Copy link
Member Author

hyanwong commented Jul 2, 2021

This is failing CI because the C checker is failing to notice that ancestral_state is set either in line 4338 (if (options & TSK_MM_FIXED_ANCESTRAL_STATE)) or 4407 (if (!(options & TSK_MM_FIXED_ANCESTRAL_STATE))). How do I get the checker to realise that this covers all the branches, given that there's a bit of code that needs to be run between the two if statements (because it depends on the first if, and is then used in the second if)?

@jeromekelleher
Copy link
Member

Just set ancestral_state to zero before the if, like

ancestral_state = 0; /* keep compiler happy */
``

@hyanwong hyanwong force-pushed the map_mutations_as branch from 1d886a4 to 1683fb8 Compare July 2, 2021 14:32
@hyanwong
Copy link
Member Author

hyanwong commented Jul 2, 2021

Just set ancestral_state to zero before the if, like

ancestral_state = 0; /* keep compiler happy */

Perfect, thanks. Now it's complaining about

num_alleles = ancestral_state + 1;

(error: conversion from ‘int’ to ‘int8_t’ {aka ‘signed char’} may change value)

presumably because although both num_alleles and ancestral_state are int8_t, the "1" isn't. Do I just cast the result, or can I convert the 1 to an int8_t value?

@jeromekelleher
Copy link
Member

Probably cast the 1 - either is fine though

@hyanwong hyanwong force-pushed the map_mutations_as branch 6 times, most recently from d85a39b to ace432b Compare July 3, 2021 09:13
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final things to clear up

@hyanwong hyanwong force-pushed the map_mutations_as branch 2 times, most recently from f691f6a to e6ab993 Compare July 5, 2021 10:01
@hyanwong
Copy link
Member Author

hyanwong commented Jul 5, 2021

Seems to all be passing now, and all comments addressed. Thanks for the review @jeromekelleher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants