-
Notifications
You must be signed in to change notification settings - Fork 78
Added nexus output and test #550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start, thanks @saunack! Some comments above.
Codecov Report
@@ Coverage Diff @@
## master #550 +/- ##
==========================================
+ Coverage 87.27% 87.28% +0.01%
==========================================
Files 21 21
Lines 16459 16474 +15
Branches 3234 3237 +3
==========================================
+ Hits 14364 14379 +15
Misses 1029 1029
Partials 1066 1066
Continue to review full report at Codecov.
|
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks @saunack. Needs a few more updates and cleanups.
|
I translated the tree samples manually so that they are now labelled from 1 to n. The tests also seem to be working fine. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks @saunack. Some minor tidy-ups and we're good to merge I think.
|
There's also an option for specifying rooted and unrooted trees in Nexus. Should I use that too in the nexus output? |
|
Also, a peculiar behavior that I have noticed. The BioNexus package doesn't actually do anything with the BEGIN TAXA block; it just stores it in another class. Even if I write incorrect labels there, it does not matter since the parsing for the nexus trees uses only the data in the TRANSLATE block and the TREE block. If the BioNexus package does not itself check for consistency across the TAXA and TRANSLATE block, should I be doing it instead? |
If it's part of the standard I guess we should state that the trees are rooted, I suppose.
Hmm. We could try parsing with the nexus package to see if it reads the TAXA block? No harm in testing against two different libraries, in case one interprets the standard differently to the other. But yes, I guess we should make sure that the labels are correctly output. You should be able to use the output from BioPython to do this? |
|
The python-nexus package that you mentioned also doesn't check for consistency between different blocks. |
|
This is surprisingly complicated! I don't know enough about this to make calls on it really - @hyanwong, would you mind reviewing here please? In particular, can you clear up what the TAXLABELS and TRANSLATE blocks actually mean, and what we should use them for? |
|
@saunack: I'm not sure why we want to define a TAXA block at all. We don't have to, do we? Then all this goes away. The main sticking point here is whether you can have a TRANSLATE command without a TAXON block - i.e. can the second column of the TRANSLATE table simply refer to a taxon number, even if it hasn't been labelled in a TAXON block. I think this is OK - certainly at the moment we should be applying TRANSLATE to all the nodes, not just the sample ones, and as I read it, you are only putting the samples into the TAXA block anyway: i.e. there will be internal nodes that are TRANSLATEd but not in the TAXA block. Some relevant text is:
I'm assuming you are doing what I did and putting: so the question is whether, in the TRANSLATE line reading |
|
No, it's not required. Everything works well without it too. |
The reason I used a TRANSLATE block is that, AFAIK, zero is not a valid NEXUS taxon number. I think this caused problems when reading these nexus files into R. So I moved all the taxon numbers up by 1 using the TRANSLATE block. But then we have the issue that strictly, according to one version of the spec, we need to translate to names not numbers, and we are back to the problem of defining a name for each taxon (including internal nodes). Personally, I think we should treat the spec as allowing numbers as the second column in the translate block, otherwise it opens up a can of worms. |
|
Perhaps we should add a flag to nexus conversion that says essentially "leave as 0-based node numbers" (this breaks the spec so may cause problems for some nexus readers) or "translate to 1-based node numbers" (it is unclear if this breaks the TRANSLATE block spec or not). What do you think @jeromekelleher ? |
The trouble is our sample IDs aren't necessarily 0,.., n - 1, so just adding one doesn't really work and makes things even more confusing. I don't think we should break the spec - can we not to something like making the names |
I wasn't meaning just sample IDs @jeromekelleher. Surely we want all the IDs (including internal nodes) in the trees, otherwise we can't compare tree-to-tree? This would make all node IDs equal to n+1, which is surely fine? |
|
Oh, so we label all internal nodes as well. That makes sense. But isn't there still an issue issue that Nexus thinks numbers refer to (1 based) indexes into the Taxa list? I think your approach above is the correct one:
Can you illustrate this with a quick example? |
|
So as I understand it @jeromekelleher, we don't require a TAXA list (there is an example in the spec without one), and indeed it's not even clear from the spec that internal nodes can be labelled as taxa. So if we were using a 1-based system for node labels in the trees, everything would be grand - the numbers refer to positions in the (non-existent) TAXA list. The problem is, of course, that in reality we will have a node in out trees (could be internal or a tip) with the label However, there is the option to pass a simply to keep the nexus parsers happy. This means that the tree can continue to have a label The only problem now is that if you read the spec (bottom of page 613) strictly, you might think that |
|
Great, thanks @hyanwong. It's very tricky having this interchangability between labels and numbers, so wouldn't it be better if we took away any possibility of ambiguity by defining our samples using the TAXA list like where the taxon label is It's horribly confusing having offset-by-one as the default, and I think this is the only way to both resolve the potential ambiguity between our node IDs and the taxon IDs and to avoid this nasty source of potential errors. Let's not worry about the few extra bytes that this costs - efficiency is not a concern here. |
That's fine for the samples, but I don't know if we want to (or can) do that for the non-sample nodes, as these aren't really taxa. I do think we should leave the non-sample node numbers in the trees. I suppose we could use the TRANSLATE block, and list the samples only, in which case if the sample IDs are (say) 10, 11, 12, then we would have blocks like And we would need to work out how to map the internal nodes and add them to the translate block too, at least for the nodes whose tskit ID lies between 0 and the number of listed taxa. Otherwise a |
|
I'm not convinced that internal nodes can't be thought of as taxa - suppose we have two closely related species and we know the name of the ancestral species. Does the spec specifically say that taxa can only be leaf nodes? Perhaps we could try doing and see if the parsers choke on this? There's too much ambiguity if we don't stringify our node IDs. |
|
The only reference I can find to internal nodes is this:
Given that we could have terminal nodes that are not samples (e.g. in an ancestors TS), perhaps we should indeed list all nodes, including internal ones, as TAXA. It might be useful to have a different TAXA prefix for samples, though? |
Good idea - how about @saunack, would you mind trying this idea out to see if BioPython will round-trip these node labels faithfully? It'll look something like node_labels = {node.id: f"tsk_{node.flags}_{node.id}" for node in self.nodes()}
taxa_block = "TAXA " + ",".join(node_labels[node.id] for node in self.nodes()) + "\n"
# Pass node_labels in as the argument to tree.newick() |
|
Just going by your example above:
Great, thanks. |
Oops. My bad :embarrassed:. Corrected now! |
|
Apparently PAUP doesn't (didn't?) like having taxa as internal labels, but I can't see anything in the spec against it. And we aren't likely to want to read tskit output into PAUP, I guess. [edit - the old version of PAUP didn't like this. The new version is fine with is, so I reckon just go for it] |
|
Excellent, consensus! |
This works without any issues. To be clear though, I haven't used the TRANSLATE block. |
Combined newick and nexus test files to test_phylo_formats.py
|
OK, great! This is ready to merge @saunack, thanks very much. I updated the PR with a few extra tests, making sure we deal with multiple trees correctly and that we're mapping from ID labels in a one-to-one manner. I also updated the docs a bit. |
|
I'm not sure it it's a good idea to omit the TRANSLATE command? It does pass BioPython import, and maybe some others, but I think it might not be spec-compliant? ISTR it broke import into R, for example. It would be trivial to add, this surely @jeromekelleher ? |
|
Mark Holder, who I consulted said that without the TRANSLATE command:
Probably worth importing a small example into R to check. |
|
I've just merged @hyanwong - would you mind checking it out in R, and adding any extra stuff you think it needs to be standards compliant? |
|
Since I've renamed all the nodes to tsk_flag_id, shouldn't this no longer be a problem? |
|
@saunack - do you mean you have renamed them in the trees? Or simply in the TAXA block. If in the trees themselves, that's fine, of course, although it obviously bloats the file size, and it's not so obvious to pull a tree out of the text file and compare it to the |
Yes, it's in the trees. I don't think we really care about file size. It doesn't matter that Tree.newick() has a different default format. |
|
Oh, well that's fine then. Sorry for the noise. I just assumed you were re-using the |
No description provided.