Skip to content

Conversation

@saunack
Copy link
Contributor

@saunack saunack commented Apr 21, 2020

No description provided.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, thanks @saunack! Some comments above.

@codecov
Copy link

codecov bot commented Apr 23, 2020

Codecov Report

Merging #550 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #550      +/-   ##
==========================================
+ Coverage   87.27%   87.28%   +0.01%     
==========================================
  Files          21       21              
  Lines       16459    16474      +15     
  Branches     3234     3237       +3     
==========================================
+ Hits        14364    14379      +15     
  Misses       1029     1029              
  Partials     1066     1066              
Flag Coverage Δ
#c_tests 88.46% <100.00%> (+0.01%) ⬆️
#python_c_tests 90.36% <100.00%> (+0.01%) ⬆️
#python_tests 99.20% <100.00%> (+<0.01%) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.66% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b872e37...ae05b3d. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @saunack. Needs a few more updates and cleanups.

@saunack
Copy link
Contributor Author

saunack commented Apr 28, 2020

I translated the tree samples manually so that they are now labelled from 1 to n. The tests also seem to be working fine.
Any more tests which need to be added? I have added all the tests from TestNewick after appropriate modifications (the only ones left are the ones which use node_labels).
If this looks fine, I can go ahead and add user-defined translate blocks too, if that is a required feature

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @saunack. Some minor tidy-ups and we're good to merge I think.

@saunack
Copy link
Contributor Author

saunack commented Apr 29, 2020

There's also an option for specifying rooted and unrooted trees in Nexus. Should I use that too in the nexus output?

@saunack
Copy link
Contributor Author

saunack commented Apr 29, 2020

Also, a peculiar behavior that I have noticed. The BioNexus package doesn't actually do anything with the BEGIN TAXA block; it just stores it in another class. Even if I write incorrect labels there, it does not matter since the parsing for the nexus trees uses only the data in the TRANSLATE block and the TREE block. If the BioNexus package does not itself check for consistency across the TAXA and TRANSLATE block, should I be doing it instead?

@jeromekelleher
Copy link
Member

There's also an option for specifying rooted and unrooted trees in Nexus. Should I use that too in the nexus output?

If it's part of the standard I guess we should state that the trees are rooted, I suppose.

Also, a peculiar behavior that I have noticed. The BioNexus package doesn't actually do anything with the BEGIN TAXA block; it just stores it in another class. Even if I write incorrect labels there, it does not matter since the parsing for the nexus trees uses only the data in the TRANSLATE block and the TREE block. If the BioNexus package does not itself check for consistency across the TAXA and TRANSLATE block, should I be doing it instead?

Hmm. We could try parsing with the nexus package to see if it reads the TAXA block? No harm in testing against two different libraries, in case one interprets the standard differently to the other.

But yes, I guess we should make sure that the labels are correctly output. You should be able to use the output from BioPython to do this?

@saunack
Copy link
Contributor Author

saunack commented Apr 29, 2020

The python-nexus package that you mentioned also doesn't check for consistency between different blocks.
And yeah, I should be able to do it using BioPython.
Although now I doubt if we should check for it. Can you go through page 15 and page 42 of this paper on Nexus and let me know your views on it? From what I gathered, it is not necessary to write all taxons there and using numbers as taxlabels is not preferred.

@jeromekelleher
Copy link
Member

This is surprisingly complicated! I don't know enough about this to make calls on it really - @hyanwong, would you mind reviewing here please? In particular, can you clear up what the TAXLABELS and TRANSLATE blocks actually mean, and what we should use them for?

@jeromekelleher jeromekelleher marked this pull request as ready for review April 30, 2020 09:59
@hyanwong
Copy link
Member

hyanwong commented Apr 30, 2020

@saunack: I'm not sure why we want to define a TAXA block at all. We don't have to, do we? Then all this goes away.

The main sticking point here is whether you can have a TRANSLATE command without a TAXON block - i.e. can the second column of the TRANSLATE table simply refer to a taxon number, even if it hasn't been labelled in a TAXON block. I think this is OK - certainly at the moment we should be applying TRANSLATE to all the nodes, not just the sample ones, and as I read it, you are only putting the samples into the TAXA block anyway: i.e. there will be internal nodes that are TRANSLATEd but not in the TAXA block.

Some relevant text is:

The TRANSLATE statement maps arbitrary labels in the tree specification to
valid taxon names. If the arbitrary labels are integers, they are mapped onto the valid taxon names
as dictated by the TRANSLATE command, without any consideration of the order of the taxa in the
matrix. Thus, if an integer is encountered in the tree description, a program first checks to see if it
matches one of the arbitrary labels defined in the TRANSLATE command;

I'm assuming you are doing what I did and putting:

#nexus
...
begin trees;
  translate
    0 1,
    1 2,
    2 3,
    3 4,
    4 5
  ;
  tree one = [&U] (0,1,(2,(3,4));
  tree two = [&U] (0,2,(4,(1,3));
end;

so the question is whether, in the TRANSLATE line reading 0 1,, the number 1 is a "valid taxon name". The formal grammar proposed at https://informatics.nescent.org/wiki/Supporting_NEXUS#Testing_conformance_of_NEXUS_files (https://www.cs.nmsu.edu/~epontell/nexus/nexus_grammar) seems to suggest this is OK, but the spec wording does not include this possibility. I'll see if I know any experts who can help.

@saunack
Copy link
Contributor Author

saunack commented May 1, 2020

No, it's not required. Everything works well without it too.
All the problems go away then. Thanks @hyanwong
Although I do not translate all the nodes in the tree. Is it necessary since we're just checking for isomorphism? Traversing the tree from the leaf to the root and checking node lengths along the way should be fine, I guess

@hyanwong
Copy link
Member

hyanwong commented May 1, 2020

Although I do not translate all the nodes in the tree. Is it necessary

The reason I used a TRANSLATE block is that, AFAIK, zero is not a valid NEXUS taxon number. I think this caused problems when reading these nexus files into R. So I moved all the taxon numbers up by 1 using the TRANSLATE block. But then we have the issue that strictly, according to one version of the spec, we need to translate to names not numbers, and we are back to the problem of defining a name for each taxon (including internal nodes). Personally, I think we should treat the spec as allowing numbers as the second column in the translate block, otherwise it opens up a can of worms.

@hyanwong
Copy link
Member

hyanwong commented May 1, 2020

Perhaps we should add a flag to nexus conversion that says essentially "leave as 0-based node numbers" (this breaks the spec so may cause problems for some nexus readers) or "translate to 1-based node numbers" (it is unclear if this breaks the TRANSLATE block spec or not). What do you think @jeromekelleher ?

@jeromekelleher
Copy link
Member

Perhaps we should add a flag to nexus conversion that says essentially "leave as 0-based node numbers" (this breaks the spec so may cause problems for some nexus readers) or "translate to 1-based node numbers" (it is unclear if this breaks the TRANSLATE block spec or not). What do you think @jeromekelleher ?

The trouble is our sample IDs aren't necessarily 0,.., n - 1, so just adding one doesn't really work and makes things even more confusing. I don't think we should break the spec - can we not to something like making the names [f"tsk_{u}" for u in samples] and translate them to node numbers?

@hyanwong
Copy link
Member

hyanwong commented May 1, 2020

The trouble is our sample IDs aren't necessarily 0,.., n - 1, so just adding one doesn't really work and makes things even more confusing.

I wasn't meaning just sample IDs @jeromekelleher. Surely we want all the IDs (including internal nodes) in the trees, otherwise we can't compare tree-to-tree? This would make all node IDs equal to n+1, which is surely fine?

@jeromekelleher
Copy link
Member

Oh, so we label all internal nodes as well. That makes sense.

But isn't there still an issue issue that Nexus thinks numbers refer to (1 based) indexes into the Taxa list? I think your approach above is the correct one:

Personally, I think we should treat the spec as allowing numbers as the second column in the translate block, otherwise it opens up a can of worms.

Can you illustrate this with a quick example?

@hyanwong
Copy link
Member

hyanwong commented May 1, 2020

So as I understand it @jeromekelleher, we don't require a TAXA list (there is an example in the spec without one), and indeed it's not even clear from the spec that internal nodes can be labelled as taxa. So if we were using a 1-based system for node labels in the trees, everything would be grand - the numbers refer to positions in the (non-existent) TAXA list. The problem is, of course, that in reality we will have a node in out trees (could be internal or a tip) with the label 0. I have previously found that this causes problems with nexus parsers.

However, there is the option to pass a TRANSLATE block before the tree definitions, of the form TRANSLATE tree_label1 taxa_label1, tree_label2 taxa_label2 ... So previously, to allow a tree with a node labelled 0, I have set

TRANSLATE
0 1,
1 2,
2 3
...

simply to keep the nexus parsers happy. This means that the tree can continue to have a label 0 on a node, but for nexus purposes it is taken as referring to the first item in the taxon list, were a taxon list to exist.

The only problem now is that if you read the spec (bottom of page 613) strictly, you might think that taxa_label1, taxa_label2 (i.e. the 2nd "column") in the TRANSLATE block above have to be labels (strictly, valid-taxon-name) which might be taken to exclude the possibility that they are indexes into the TAXA list. I think this is too tight a reading of the spec. In fact, labels and indexes are treated as entirely interchangeable in the rest of the spec ("Characters, taxa, and trees can be referred to by their name or number;" and "Taxon names ... must not correspond to another taxon name or number; thus, 1 is not a valid name for the second taxon listed. "). Moreover, the proposed automatic nexus parser is specifically coded to allow the second column in the TRANSLATE block to be an index or a label, so I deem this an acceptable work-around.

@jeromekelleher
Copy link
Member

Great, thanks @hyanwong. It's very tricky having this interchangability between labels and numbers, so wouldn't it be better if we took away any possibility of ambiguity by defining our samples using the TAXA list like

TAXA
    tsk_0, tsk_1,...

where the taxon label is [f"tsk_{u}" for u in ts.samples()]? And we then consistently refer to the nodes using these labels?

It's horribly confusing having offset-by-one as the default, and I think this is the only way to both resolve the potential ambiguity between our node IDs and the taxon IDs and to avoid this nasty source of potential errors.

Let's not worry about the few extra bytes that this costs - efficiency is not a concern here.

@hyanwong
Copy link
Member

hyanwong commented May 4, 2020

wouldn't it be better if we took away any possibility of ambiguity by defining our samples using the TAXA list like

That's fine for the samples, but I don't know if we want to (or can) do that for the non-sample nodes, as these aren't really taxa. I do think we should leave the non-sample node numbers in the trees. I suppose we could use the TRANSLATE block, and list the samples only, in which case if the sample IDs are (say) 10, 11, 12, then we would have blocks like

begin taxa;
 dimensions ntax=X; taxlabels tsk_10 tsk_11 tsk_12 ...; end;
begin trees ; 
 TRANSLATE 10 1, 11 2, 12 3, ...;

And we would need to work out how to map the internal nodes and add them to the translate block too, at least for the nodes whose tskit ID lies between 0 and the number of listed taxa. Otherwise a 1 in the newick will be taken to refer to the first listed taxon (i.e. tsk_10). So either way we will need to do numerical remapping, I fear.

@jeromekelleher
Copy link
Member

I'm not convinced that internal nodes can't be thought of as taxa - suppose we have two closely related species and we know the name of the ancestral species. Does the spec specifically say that taxa can only be leaf nodes?

Perhaps we could try doing

taxa = [f"tsk_{u}" for u in ts.num_nodes()]

and see if the parsers choke on this? There's too much ambiguity if we don't stringify our node IDs.

@hyanwong
Copy link
Member

hyanwong commented May 4, 2020

The only reference I can find to internal nodes is this:

The label of the node is a NEXUS token that is either a taxon's defined name, a taxon's number, a taxon's label from the translation table, or a clade's defined name. The label is optional for internal nodes that are not observed taxa; it is not optional for terminal nodes. Internal nodes that have no label are represented implicitly by the parentheses containing the list of subclades.

Given that we could have terminal nodes that are not samples (e.g. in an ancestors TS), perhaps we should indeed list all nodes, including internal ones, as TAXA. It might be useful to have a different TAXA prefix for samples, though?

@jeromekelleher
Copy link
Member

Given that we could have terminal nodes that are not samples (e.g. in an ancestors TS), perhaps we should indeed list all nodes, including internal ones, as TAXA. It might be useful to have a different TAXA prefix for samples, though?

Good idea - how about ["tsk_{node.flags}_{node.id}" for node in ts.nodes()]? This will usually be zero or 1, but can contain more information if it's there. (Let's not worry about other node information for now - we can follow this up with a better encoding scheme if this works.)

@saunack, would you mind trying this idea out to see if BioPython will round-trip these node labels faithfully? It'll look something like

node_labels = {node.id: f"tsk_{node.flags}_{node.id}" for node in self.nodes()}
taxa_block = "TAXA " + ",".join(node_labels[node.id] for node in self.nodes()) + "\n"
# Pass node_labels in as the argument to tree.newick()

@hyanwong
Copy link
Member

hyanwong commented May 4, 2020

taxa_block = "TAXA " + ",".join
I don't think you join TAXA by a comma, BTW. I happen to know one of the principal authors of a well-used nexus parsing library, so I'm just chatting to him about conformance.

@jeromekelleher
Copy link
Member

I don't think you join TAXA by a comma, BTW.

Just going by your example above:

TAXA tsk_10, tsk_11, tsk_12
TRANSLATE 10 1, 11 2, 12 3

BTW. I happen to know one of the principal authors of a well-used nexus parsing library, so I'm just chatting to him about conformance.

Great, thanks.

@hyanwong
Copy link
Member

hyanwong commented May 4, 2020

Just going by your example above:

Oops. My bad :embarrassed:. Corrected now!

@hyanwong
Copy link
Member

hyanwong commented May 4, 2020

Apparently PAUP doesn't (didn't?) like having taxa as internal labels, but I can't see anything in the spec against it. And we aren't likely to want to read tskit output into PAUP, I guess.

[edit - the old version of PAUP didn't like this. The new version is fine with is, so I reckon just go for it]

@jeromekelleher
Copy link
Member

Excellent, consensus!

@saunack
Copy link
Contributor Author

saunack commented May 5, 2020

@saunack, would you mind trying this idea out to see if BioPython will round-trip these node labels faithfully? It'll look something like

node_labels = {node.id: f"tsk_{node.flags}_{node.id}" for node in self.nodes()}
taxa_block = "TAXA " + ",".join(node_labels[node.id] for node in self.nodes()) + "\n"
# Pass node_labels in as the argument to tree.newick()

This works without any issues. To be clear though, I haven't used the TRANSLATE block.

@jeromekelleher
Copy link
Member

OK, great! This is ready to merge @saunack, thanks very much.

I updated the PR with a few extra tests, making sure we deal with multiple trees correctly and that we're mapping from ID labels in a one-to-one manner. I also updated the docs a bit.

@hyanwong
Copy link
Member

hyanwong commented May 5, 2020

I'm not sure it it's a good idea to omit the TRANSLATE command? It does pass BioPython import, and maybe some others, but I think it might not be spec-compliant? ISTR it broke import into R, for example. It would be trivial to add, this surely @jeromekelleher ?

@hyanwong
Copy link
Member

hyanwong commented May 5, 2020

Mark Holder, who I consulted said that without the TRANSLATE command:

oddly, n+1 will be a default alias for n in your 0-based tree

Probably worth importing a small example into R to check.

@jeromekelleher jeromekelleher merged commit 5fd04ec into tskit-dev:master May 5, 2020
@jeromekelleher
Copy link
Member

I've just merged @hyanwong - would you mind checking it out in R, and adding any extra stuff you think it needs to be standards compliant?

@saunack
Copy link
Contributor Author

saunack commented May 5, 2020

Since I've renamed all the nodes to tsk_flag_id, shouldn't this no longer be a problem?
The newick format also uses these labels for printing the trees so if it was to be read by some other program, it should work fine too.
Although I may wrong due to the compliance part, which I don't know anything about.

@hyanwong
Copy link
Member

hyanwong commented May 5, 2020

@saunack - do you mean you have renamed them in the trees? Or simply in the TAXA block. If in the trees themselves, that's fine, of course, although it obviously bloats the file size, and it's not so obvious to pull a tree out of the text file and compare it to the Tree.newick() output.

@jeromekelleher
Copy link
Member

do you mean you have renamed them in the trees? Or simply in the TAXA block. If in the trees themselves, that's fine, of course, although it obviously bloats the file size, and it's not so obvious to pull a tree out of the text file and compare it to the Tree.newick() output.

Yes, it's in the trees. I don't think we really care about file size. It doesn't matter that Tree.newick() has a different default format.

@hyanwong
Copy link
Member

hyanwong commented May 5, 2020

Oh, well that's fine then. Sorry for the noise. I just assumed you were re-using the Tree.newick() output and sticking it into the nexus file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants