Added nexus output and test #550

saunack · 2020-04-21T14:02:16Z

No description provided.

jeromekelleher

Great start, thanks @saunack! Some comments above.

python/tests/test_nexus.py

python/tskit/trees.py

codecov · 2020-04-23T07:48:55Z

Codecov Report

Merging #550 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #550      +/-   ##
==========================================
+ Coverage   87.27%   87.28%   +0.01%     
==========================================
  Files          21       21              
  Lines       16459    16474      +15     
  Branches     3234     3237       +3     
==========================================
+ Hits        14364    14379      +15     
  Misses       1029     1029              
  Partials     1066     1066

Flag	Coverage Δ
#c_tests	`88.46% <100.00%> (+0.01%)`	⬆️
#python_c_tests	`90.36% <100.00%> (+0.01%)`	⬆️
#python_tests	`99.20% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.66% <100.00%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b872e37...ae05b3d. Read the comment docs.

jeromekelleher

Looks good, thanks @saunack. Needs a few more updates and cleanups.

python/tskit/trees.py

python/tests/test_phylo_formats.py

python/tskit/trees.py

python/tests/test_phylo_formats.py

saunack · 2020-04-28T10:27:05Z

I translated the tree samples manually so that they are now labelled from 1 to n. The tests also seem to be working fine.
Any more tests which need to be added? I have added all the tests from TestNewick after appropriate modifications (the only ones left are the ones which use node_labels).
If this looks fine, I can go ahead and add user-defined translate blocks too, if that is a required feature

jeromekelleher

Looks good, thanks @saunack. Some minor tidy-ups and we're good to merge I think.

python/tests/test_phylo_formats.py

python/tskit/trees.py

saunack · 2020-04-29T15:15:29Z

There's also an option for specifying rooted and unrooted trees in Nexus. Should I use that too in the nexus output?

saunack · 2020-04-29T15:38:34Z

Also, a peculiar behavior that I have noticed. The BioNexus package doesn't actually do anything with the BEGIN TAXA block; it just stores it in another class. Even if I write incorrect labels there, it does not matter since the parsing for the nexus trees uses only the data in the TRANSLATE block and the TREE block. If the BioNexus package does not itself check for consistency across the TAXA and TRANSLATE block, should I be doing it instead?

jeromekelleher · 2020-04-29T15:55:50Z

There's also an option for specifying rooted and unrooted trees in Nexus. Should I use that too in the nexus output?

If it's part of the standard I guess we should state that the trees are rooted, I suppose.

Also, a peculiar behavior that I have noticed. The BioNexus package doesn't actually do anything with the BEGIN TAXA block; it just stores it in another class. Even if I write incorrect labels there, it does not matter since the parsing for the nexus trees uses only the data in the TRANSLATE block and the TREE block. If the BioNexus package does not itself check for consistency across the TAXA and TRANSLATE block, should I be doing it instead?

Hmm. We could try parsing with the nexus package to see if it reads the TAXA block? No harm in testing against two different libraries, in case one interprets the standard differently to the other.

But yes, I guess we should make sure that the labels are correctly output. You should be able to use the output from BioPython to do this?

saunack · 2020-04-29T17:06:58Z

The python-nexus package that you mentioned also doesn't check for consistency between different blocks.
And yeah, I should be able to do it using BioPython.
Although now I doubt if we should check for it. Can you go through page 15 and page 42 of this paper on Nexus and let me know your views on it? From what I gathered, it is not necessary to write all taxons there and using numbers as taxlabels is not preferred.

python/tskit/trees.py

jeromekelleher · 2020-04-30T09:08:39Z

This is surprisingly complicated! I don't know enough about this to make calls on it really - @hyanwong, would you mind reviewing here please? In particular, can you clear up what the TAXLABELS and TRANSLATE blocks actually mean, and what we should use them for?

hyanwong · 2020-04-30T10:14:20Z

@saunack: I'm not sure why we want to define a TAXA block at all. We don't have to, do we? Then all this goes away.

The main sticking point here is whether you can have a TRANSLATE command without a TAXON block - i.e. can the second column of the TRANSLATE table simply refer to a taxon number, even if it hasn't been labelled in a TAXON block. I think this is OK - certainly at the moment we should be applying TRANSLATE to all the nodes, not just the sample ones, and as I read it, you are only putting the samples into the TAXA block anyway: i.e. there will be internal nodes that are TRANSLATEd but not in the TAXA block.

Some relevant text is:

The TRANSLATE statement maps arbitrary labels in the tree specification to
valid taxon names. If the arbitrary labels are integers, they are mapped onto the valid taxon names
as dictated by the TRANSLATE command, without any consideration of the order of the taxa in the
matrix. Thus, if an integer is encountered in the tree description, a program first checks to see if it
matches one of the arbitrary labels defined in the TRANSLATE command;

I'm assuming you are doing what I did and putting:

#nexus
...
begin trees;
  translate
    0 1,
    1 2,
    2 3,
    3 4,
    4 5
  ;
  tree one = [&U] (0,1,(2,(3,4));
  tree two = [&U] (0,2,(4,(1,3));
end;

so the question is whether, in the TRANSLATE line reading 0 1,, the number 1 is a "valid taxon name". The formal grammar proposed at https://informatics.nescent.org/wiki/Supporting_NEXUS#Testing_conformance_of_NEXUS_files (https://www.cs.nmsu.edu/~epontell/nexus/nexus_grammar) seems to suggest this is OK, but the spec wording does not include this possibility. I'll see if I know any experts who can help.

saunack · 2020-05-01T08:38:35Z

No, it's not required. Everything works well without it too.
All the problems go away then. Thanks @hyanwong
Although I do not translate all the nodes in the tree. Is it necessary since we're just checking for isomorphism? Traversing the tree from the leaf to the root and checking node lengths along the way should be fine, I guess

hyanwong · 2020-05-01T14:51:44Z

Although I do not translate all the nodes in the tree. Is it necessary

The reason I used a TRANSLATE block is that, AFAIK, zero is not a valid NEXUS taxon number. I think this caused problems when reading these nexus files into R. So I moved all the taxon numbers up by 1 using the TRANSLATE block. But then we have the issue that strictly, according to one version of the spec, we need to translate to names not numbers, and we are back to the problem of defining a name for each taxon (including internal nodes). Personally, I think we should treat the spec as allowing numbers as the second column in the translate block, otherwise it opens up a can of worms.

hyanwong · 2020-05-01T14:55:24Z

Perhaps we should add a flag to nexus conversion that says essentially "leave as 0-based node numbers" (this breaks the spec so may cause problems for some nexus readers) or "translate to 1-based node numbers" (it is unclear if this breaks the TRANSLATE block spec or not). What do you think @jeromekelleher ?

jeromekelleher · 2020-05-01T15:00:06Z

Perhaps we should add a flag to nexus conversion that says essentially "leave as 0-based node numbers" (this breaks the spec so may cause problems for some nexus readers) or "translate to 1-based node numbers" (it is unclear if this breaks the TRANSLATE block spec or not). What do you think @jeromekelleher ?

The trouble is our sample IDs aren't necessarily 0,.., n - 1, so just adding one doesn't really work and makes things even more confusing. I don't think we should break the spec - can we not to something like making the names [f"tsk_{u}" for u in samples] and translate them to node numbers?

hyanwong · 2020-05-01T15:17:11Z

The trouble is our sample IDs aren't necessarily 0,.., n - 1, so just adding one doesn't really work and makes things even more confusing.

I wasn't meaning just sample IDs @jeromekelleher. Surely we want all the IDs (including internal nodes) in the trees, otherwise we can't compare tree-to-tree? This would make all node IDs equal to n+1, which is surely fine?

jeromekelleher · 2020-05-01T16:27:36Z

Oh, so we label all internal nodes as well. That makes sense.

But isn't there still an issue issue that Nexus thinks numbers refer to (1 based) indexes into the Taxa list? I think your approach above is the correct one:

Personally, I think we should treat the spec as allowing numbers as the second column in the translate block, otherwise it opens up a can of worms.

Can you illustrate this with a quick example?

hyanwong · 2020-05-01T17:23:32Z

So as I understand it @jeromekelleher, we don't require a TAXA list (there is an example in the spec without one), and indeed it's not even clear from the spec that internal nodes can be labelled as taxa. So if we were using a 1-based system for node labels in the trees, everything would be grand - the numbers refer to positions in the (non-existent) TAXA list. The problem is, of course, that in reality we will have a node in out trees (could be internal or a tip) with the label 0. I have previously found that this causes problems with nexus parsers.

However, there is the option to pass a TRANSLATE block before the tree definitions, of the form TRANSLATE tree_label1 taxa_label1, tree_label2 taxa_label2 ... So previously, to allow a tree with a node labelled 0, I have set

TRANSLATE
0 1,
1 2,
2 3
...

simply to keep the nexus parsers happy. This means that the tree can continue to have a label 0 on a node, but for nexus purposes it is taken as referring to the first item in the taxon list, were a taxon list to exist.

The only problem now is that if you read the spec (bottom of page 613) strictly, you might think that taxa_label1, taxa_label2 (i.e. the 2nd "column") in the TRANSLATE block above have to be labels (strictly, valid-taxon-name) which might be taken to exclude the possibility that they are indexes into the TAXA list. I think this is too tight a reading of the spec. In fact, labels and indexes are treated as entirely interchangeable in the rest of the spec ("Characters, taxa, and trees can be referred to by their name or number;" and "Taxon names ... must not correspond to another taxon name or number; thus, 1 is not a valid name for the second taxon listed. "). Moreover, the proposed automatic nexus parser is specifically coded to allow the second column in the TRANSLATE block to be an index or a label, so I deem this an acceptable work-around.

jeromekelleher · 2020-05-04T09:46:30Z

Great, thanks @hyanwong. It's very tricky having this interchangability between labels and numbers, so wouldn't it be better if we took away any possibility of ambiguity by defining our samples using the TAXA list like

TAXA
    tsk_0, tsk_1,...

where the taxon label is [f"tsk_{u}" for u in ts.samples()]? And we then consistently refer to the nodes using these labels?

It's horribly confusing having offset-by-one as the default, and I think this is the only way to both resolve the potential ambiguity between our node IDs and the taxon IDs and to avoid this nasty source of potential errors.

Let's not worry about the few extra bytes that this costs - efficiency is not a concern here.

hyanwong · 2020-05-04T12:12:19Z

wouldn't it be better if we took away any possibility of ambiguity by defining our samples using the TAXA list like

That's fine for the samples, but I don't know if we want to (or can) do that for the non-sample nodes, as these aren't really taxa. I do think we should leave the non-sample node numbers in the trees. I suppose we could use the TRANSLATE block, and list the samples only, in which case if the sample IDs are (say) 10, 11, 12, then we would have blocks like

begin taxa;
 dimensions ntax=X; taxlabels tsk_10 tsk_11 tsk_12 ...; end;
begin trees ; 
 TRANSLATE 10 1, 11 2, 12 3, ...;

And we would need to work out how to map the internal nodes and add them to the translate block too, at least for the nodes whose tskit ID lies between 0 and the number of listed taxa. Otherwise a 1 in the newick will be taken to refer to the first listed taxon (i.e. tsk_10). So either way we will need to do numerical remapping, I fear.

jeromekelleher · 2020-05-04T14:56:00Z

I'm not convinced that internal nodes can't be thought of as taxa - suppose we have two closely related species and we know the name of the ancestral species. Does the spec specifically say that taxa can only be leaf nodes?

Perhaps we could try doing

taxa = [f"tsk_{u}" for u in ts.num_nodes()]

and see if the parsers choke on this? There's too much ambiguity if we don't stringify our node IDs.

hyanwong · 2020-05-04T15:23:15Z

The only reference I can find to internal nodes is this:

The label of the node is a NEXUS token that is either a taxon's defined name, a taxon's number, a taxon's label from the translation table, or a clade's defined name. The label is optional for internal nodes that are not observed taxa; it is not optional for terminal nodes. Internal nodes that have no label are represented implicitly by the parentheses containing the list of subclades.

Given that we could have terminal nodes that are not samples (e.g. in an ancestors TS), perhaps we should indeed list all nodes, including internal ones, as TAXA. It might be useful to have a different TAXA prefix for samples, though?

jeromekelleher · 2020-05-04T16:23:27Z

Given that we could have terminal nodes that are not samples (e.g. in an ancestors TS), perhaps we should indeed list all nodes, including internal ones, as TAXA. It might be useful to have a different TAXA prefix for samples, though?

Good idea - how about ["tsk_{node.flags}_{node.id}" for node in ts.nodes()]? This will usually be zero or 1, but can contain more information if it's there. (Let's not worry about other node information for now - we can follow this up with a better encoding scheme if this works.)

@saunack, would you mind trying this idea out to see if BioPython will round-trip these node labels faithfully? It'll look something like

node_labels = {node.id: f"tsk_{node.flags}_{node.id}" for node in self.nodes()}
taxa_block = "TAXA " + ",".join(node_labels[node.id] for node in self.nodes()) + "\n"
# Pass node_labels in as the argument to tree.newick()

hyanwong · 2020-05-04T19:05:04Z

taxa_block = "TAXA " + ",".join
I don't think you join TAXA by a comma, BTW. I happen to know one of the principal authors of a well-used nexus parsing library, so I'm just chatting to him about conformance.

jeromekelleher · 2020-05-04T19:10:00Z

I don't think you join TAXA by a comma, BTW.

Just going by your example above:

TAXA tsk_10, tsk_11, tsk_12
TRANSLATE 10 1, 11 2, 12 3

BTW. I happen to know one of the principal authors of a well-used nexus parsing library, so I'm just chatting to him about conformance.

Great, thanks.

hyanwong · 2020-05-04T19:11:25Z

Just going by your example above:

Oops. My bad :embarrassed:. Corrected now!

hyanwong · 2020-05-04T19:14:06Z

Apparently PAUP doesn't (didn't?) like having taxa as internal labels, but I can't see anything in the spec against it. And we aren't likely to want to read tskit output into PAUP, I guess.

[edit - the old version of PAUP didn't like this. The new version is fine with is, so I reckon just go for it]

jeromekelleher · 2020-05-05T07:19:45Z

Excellent, consensus!

saunack · 2020-05-05T11:11:27Z

@saunack, would you mind trying this idea out to see if BioPython will round-trip these node labels faithfully? It'll look something like
node_labels = {node.id: f"tsk_{node.flags}_{node.id}" for node in self.nodes()}
taxa_block = "TAXA " + ",".join(node_labels[node.id] for node in self.nodes()) + "\n"
# Pass node_labels in as the argument to tree.newick()

This works without any issues. To be clear though, I haven't used the TRANSLATE block.

Combined newick and nexus test files to test_phylo_formats.py

jeromekelleher · 2020-05-05T15:59:50Z

OK, great! This is ready to merge @saunack, thanks very much.

I updated the PR with a few extra tests, making sure we deal with multiple trees correctly and that we're mapping from ID labels in a one-to-one manner. I also updated the docs a bit.

hyanwong · 2020-05-05T16:32:43Z

I'm not sure it it's a good idea to omit the TRANSLATE command? It does pass BioPython import, and maybe some others, but I think it might not be spec-compliant? ISTR it broke import into R, for example. It would be trivial to add, this surely @jeromekelleher ?

hyanwong · 2020-05-05T16:35:23Z

Mark Holder, who I consulted said that without the TRANSLATE command:

oddly, n+1 will be a default alias for n in your 0-based tree

Probably worth importing a small example into R to check.

jeromekelleher · 2020-05-05T16:51:04Z

I've just merged @hyanwong - would you mind checking it out in R, and adding any extra stuff you think it needs to be standards compliant?

saunack · 2020-05-05T16:53:59Z

Since I've renamed all the nodes to tsk_flag_id, shouldn't this no longer be a problem?
The newick format also uses these labels for printing the trees so if it was to be read by some other program, it should work fine too.
Although I may wrong due to the compliance part, which I don't know anything about.

hyanwong · 2020-05-05T16:56:13Z

@saunack - do you mean you have renamed them in the trees? Or simply in the TAXA block. If in the trees themselves, that's fine, of course, although it obviously bloats the file size, and it's not so obvious to pull a tree out of the text file and compare it to the Tree.newick() output.

jeromekelleher · 2020-05-05T17:02:47Z

do you mean you have renamed them in the trees? Or simply in the TAXA block. If in the trees themselves, that's fine, of course, although it obviously bloats the file size, and it's not so obvious to pull a tree out of the text file and compare it to the Tree.newick() output.

Yes, it's in the trees. I don't think we really care about file size. It doesn't matter that Tree.newick() has a different default format.

hyanwong · 2020-05-05T17:28:42Z

Oh, well that's fine then. Sorry for the noise. I just assumed you were re-using the Tree.newick() output and sticking it into the nexus file.

jeromekelleher reviewed Apr 21, 2020

View reviewed changes

saunack force-pushed the nexus branch from d9b752c to 5792373 Compare April 23, 2020 07:24

jeromekelleher reviewed Apr 23, 2020

View reviewed changes

saunack force-pushed the nexus branch from 5792373 to 3bd4b25 Compare April 28, 2020 10:22

jeromekelleher reviewed Apr 29, 2020

View reviewed changes

saunack force-pushed the nexus branch from 3bd4b25 to 1369fea Compare April 29, 2020 15:51

jeromekelleher reviewed Apr 30, 2020

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

jeromekelleher marked this pull request as ready for review April 30, 2020 09:59

saunack force-pushed the nexus branch from 1369fea to 147141f Compare May 1, 2020 08:40

saunack force-pushed the nexus branch from 147141f to fc04ec0 Compare May 5, 2020 11:15

saunack and others added 2 commits May 5, 2020 16:57

Added nexus output functionality in TreeSequence and corresponding tests

8f3f2bd

Combined newick and nexus test files to test_phylo_formats.py

Add some extra tests and update CHANGELOG.

ae05b3d

jeromekelleher force-pushed the nexus branch from fc04ec0 to ae05b3d Compare May 5, 2020 15:58

jeromekelleher merged commit 5fd04ec into tskit-dev:master May 5, 2020

saunack deleted the nexus branch June 12, 2020 12:06

jeromekelleher mentioned this pull request Aug 12, 2020

Functionality to output nexus-format list of trees #62

Closed

Added nexus output and test #550

Added nexus output and test #550

Uh oh!

Conversation

saunack commented Apr 21, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Apr 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

saunack commented Apr 28, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

saunack commented Apr 29, 2020

Uh oh!

saunack commented Apr 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Apr 29, 2020

Uh oh!

saunack commented Apr 29, 2020

Uh oh!

Uh oh!

jeromekelleher commented Apr 30, 2020

Uh oh!

hyanwong commented Apr 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

saunack commented May 1, 2020

Uh oh!

hyanwong commented May 1, 2020

Uh oh!

hyanwong commented May 1, 2020

Uh oh!

jeromekelleher commented May 1, 2020

Uh oh!

hyanwong commented May 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented May 1, 2020

Uh oh!

hyanwong commented May 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented May 4, 2020

Uh oh!

hyanwong commented May 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented May 4, 2020

codecov bot commented Apr 23, 2020 •

edited

Loading

saunack commented Apr 29, 2020 •

edited

Loading

hyanwong commented Apr 30, 2020 •

edited

Loading

hyanwong commented May 1, 2020 •

edited

Loading

hyanwong commented May 1, 2020 •

edited

Loading

hyanwong commented May 4, 2020 •

edited

Loading

hyanwong commented May 4, 2020 •

edited

Loading

hyanwong commented May 4, 2020 •

edited

Loading

hyanwong commented May 5, 2020 •

edited

Loading