Improving the class2tree resolution #611

gedankenstuecke · 2017-05-25T07:58:33Z

The function class2tree converts a list of taxonomic classifications into a tree. So far it only takes ranked levels into account. This leads to extreme multifurcations, as there are only so many ranked taxonomic levels, which limits the use of the resulting trees.

In comparison, this is what the NCBI Taxonomy tree for the same taxa looks like:

Based on an idea by @trvinh (also pinging as he's the one who had the issue in the first place) I started to explore the idea of using the total vectors of classifications for calculating the pairwise distances. The quick & dirty idea:

Calculate the distance between two taxa A and B as (A ∪ B) - (A ∩ B), reasoning that the closer the intersection of the two classifications comes to their union, the smaller the distance. An example implementation of that is here.
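A minimal sketch of that distance, written in Python purely for illustration (the thread itself targets R). It assumes each classification is a list of taxonomy IDs and reads the "distance" as the number of IDs in (A ∪ B) - (A ∩ B); the function names and the toy lineages are hypothetical and are not taken from the linked implementation.

```python
def classification_distance(a, b):
    """Number of IDs in (A ∪ B) - (A ∩ B) for two lineages."""
    a, b = set(a), set(b)
    return len((a | b) - (a & b))


def pairwise_distances(lineages):
    """Distance for every unordered pair of taxa, keyed by (name, name)."""
    names = sorted(lineages)
    return {
        (p, q): classification_distance(lineages[p], lineages[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Hypothetical lineages as lists of taxonomy IDs (toy values, not real NCBI IDs).
lineages = {
    "taxon_x": [1, 2, 3, 4, 5],
    "taxon_y": [1, 2, 3, 4, 6],
    "taxon_z": [1, 2, 7],
}

distances = pairwise_distances(lineages)
for pair, value in distances.items():
    print(pair, value)
```

Because the value counts unshared IDs, it depends on how many levels each lineage lists and not only on where two lineages diverge.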

This yields a better resolved tree, but the branching no longer fits the taxonomy. The reason for that in the example: Drosophila and Anopheles have a very detailed, shared sub-classification (starting at the phylum level), which joins them to the exclusion of basically all other taxa.

I guess my tl;dr is: The current class2tree solution is too low resolution to resolve meaningful trees. My proposed alternative has its own issues. What can we do to make it better? 😄

sckott · 2017-05-27T19:25:09Z

Thanks for this @gedankenstuecke Will think about this and try out your solution

pinging @Edild wrt class2tree

gedankenstuecke · 2017-05-29T06:55:38Z

Thanks, and I haven't heard from @trvinh, but he might have some new insights as well! 👍

sckott · 2017-06-27T22:00:19Z

any thoughts on this @Edild ?

gedankenstuecke · 2017-06-27T22:02:55Z

Let me push to @trvinh as well, I vaguely remember him telling me something about how he fixed this.

sckott · 2017-06-27T22:09:48Z

pinging @jarioksa in case he has any opinions

trvinh · 2017-06-28T08:03:59Z

I tried creating a matrix that contains an equal number of taxonomy ranks for all taxa, including the noranks. Missing rank IDs in the matrix were filled in with the previous available ID. Using that matrix to sort the taxa, I got what I expected.
However, I have to use a Perl script to create that matrix. It would be great if it could be done directly in R :)

gedankenstuecke · 2017-06-28T08:19:20Z

Ah, let me see whether I got this: So if you have the taxonomy-IDs like this

Species A: 1,2,3,4,5,6,7,8,9
Species B: 1,2,3,10,11

with 4,5,7,8 being unranked you will fill them up to

Species A: 1,2,3,4,5,6,7,8,9
Species B: 1,2,3,3,3,10,10,10,11

?

trvinh · 2017-06-28T08:36:22Z

yes, almost correct :)
However, I also fix the relative positions of all possible main ranks from strain to superkingdom (regardless of the taxa of interest), and also the relative positions of each individual taxon's rank IDs.

In your example, as 9,10,11 are main ranks, I already knew their relative positions: 9->10->11, therefore

Species A: 1,2,3, 4 , 5 , 6 , 7 , 8 , 9 ,(9),(9)
Species B: 1,2,3,(3),(3),(3),(3),(3),(3), 10, 11
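The filling step in this example can be sketched as follows, in Python for illustration only. It assumes the ordered list of levels is already known; the level labels (`L1` to `L8`, `rankX`, `rankY`, `rankZ`) and the function name are hypothetical stand-ins, since the thread does not name the levels. This is not the Perl script mentioned above.

```python
def fill_missing_levels(levels, lineage):
    """Align one lineage to a fixed, ordered list of levels (root first).

    levels  -- ordered level labels shared by all taxa
    lineage -- dict mapping level label -> taxonomy ID for one taxon
    A level missing from the lineage inherits the ID of the closest
    preceding level that has one (None if there is no such level).
    """
    row = []
    previous = None
    for level in levels:
        if level in lineage:
            previous = lineage[level]
        row.append(previous)
    return row


# Hypothetical labels: eight leading levels, then three main ranks whose
# relative order (rankX -> rankY -> rankZ) is fixed in advance.
levels = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8",
          "rankX", "rankY", "rankZ"]

species_a = {"L1": 1, "L2": 2, "L3": 3, "L4": 4, "L5": 5,
             "L6": 6, "L7": 7, "L8": 8, "rankX": 9}
species_b = {"L1": 1, "L2": 2, "L3": 3, "rankY": 10, "rankZ": 11}

row_a = fill_missing_levels(levels, species_a)
row_b = fill_missing_levels(levels, species_b)
print(row_a)
print(row_b)
```

With these inputs both rows have eleven entries and match the two filled rows shown in the example, with the parenthesised values being the inherited IDs.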

gedankenstuecke · 2017-06-28T08:52:59Z

Ah, okay! Is your Perl script somewhere on GitHub so that people can have a look and potentially adapt it to R?

trvinh · 2017-06-28T08:56:02Z

yup, here it is
sorry for the dirty code :P

sckott · 2017-09-07T21:58:10Z

@gedankenstuecke any thoughts on this?

gedankenstuecke · 2017-09-08T17:38:37Z

@sckott Due to thesis writing I didn't have any time to look into this in depth. I remember @trvinh and I discussed moving his solution from Perl to R, but I have no clue what the status is for that.

sckott · 2017-09-08T17:52:54Z

okay, thanks for update. will wait and see if something happens - will leave open for now

gedankenstuecke · 2017-09-08T17:54:11Z

Thanks! I will be back in the office on Monday and then @trvinh and I can discuss how to best convert his code. I think for the tool he's writing it would be necessary/ideal to get a native R-solution as well.

sckott · 2017-09-08T17:56:24Z

Great, 👌

jarioksa · 2017-09-10T14:33:23Z

I can't quite follow this. The current solution may not be satisfactory, but I think it is based on the data: if you have taxa that are separated at the same level, they will have "multifurcations". That is, if you only have information about families, all genera within the family are separated at the same level. You can produce fancier plots, but they are artifactual. If I get something reproducible, I may think differently. Do you have some information in addition to the classification which should be taken into account?

trvinh · 2017-09-10T17:13:35Z

@jarioksa: I hope that I understood your point correctly. Yes, you are right: if we only have information about families, we will have multifurcations for taxa within a family.
With our current idea, if the information about the splits is missing, we still have multifurcations. We are not trying to remove multifurcations by creating artifactual trees (plots). Instead, we just want to resolve them as far as we can by using more information (here, the IDs of "norank" levels between "main" levels such as strain, species, genus, ...).

jarioksa · 2017-09-10T18:03:14Z

So it's about not using existing information. Got to see how the information is extracted.

trvinh · 2017-09-21T14:11:47Z

So, now we can reconstruct the NCBI taxonomy tree successfully :)

Our solution: we first index ALL available levels (including the unranked ones) from the input data in ascending order. We then create a data frame whose columns are the levels and whose rows are the taxa, and place the IDs of each taxon into that data frame. Where an ID is missing at a given level, we assign it the ID of the previous level. That way all taxa have the same number of level IDs (all rows have the same length). That data frame is used as input for taxa2dist().

Since I am still a beginner in R, my code looks very ugly :-P I'd really appreciate it if you could improve it. Many thanks in advance!!
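To illustrate why equal-length rows help the distance step, here is a small Python sketch that compares filled rows level by level. It is a deliberately simple stand-in that counts the levels at which two rows differ; it is not vegan's taxa2dist() and makes no claim about how that function weights levels. Rows A and B are the filled rows from the worked example earlier in this thread; row C and the function name are hypothetical.

```python
from itertools import combinations


def mismatch_count(row_p, row_q):
    """Number of levels at which two equal-length rows carry different IDs."""
    if len(row_p) != len(row_q):
        raise ValueError("rows must have the same number of levels")
    return sum(1 for p, q in zip(row_p, row_q) if p != q)


# Filled rows: A and B come from the example earlier in the thread,
# C is a hypothetical third taxon added only for comparison.
table = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9],
    "B": [1, 2, 3, 3, 3, 3, 3, 3, 3, 10, 11],
    "C": [1, 2, 3, 4, 5, 12, 12, 12, 12, 12, 13],
}

# One value per unordered pair of taxa.
pairwise = {
    (p, q): mismatch_count(table[p], table[q])
    for p, q in combinations(sorted(table), 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```

In this toy table A and C share the extra levels holding IDs 4 and 5, so they end up closer to each other than either is to B, which is the kind of additional resolution the filled levels are meant to provide.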

sckott · 2017-09-21T16:24:00Z

thanks @trvinh !

@jarioksa can you chime in with any thoughts you have on the above solution/code?

jarioksa · 2017-09-21T16:56:53Z

Got to have a look at the details of the code. However, the solution is the correct one.

gedankenstuecke · 2017-09-21T17:27:44Z

I suspect there might be some ways to make the code much more R-like. Given that @trvinh and I are rather new to R (coming from Python and Perl) we learned quite a bit about R data types in the process. 😂

sckott · 2017-09-29T16:10:39Z

@gedankenstuecke and @trvinh - are you meaning to modify code more? you can send a PR, or would you rather i just add it?

trvinh · 2017-09-29T16:51:54Z

@sckott : we will send a PR next week.

sckott · 2017-09-29T16:53:18Z

sounds good

gedankenstuecke · 2017-10-07T08:10:08Z

Closed through #634 😊

gedankenstuecke mentioned this issue Oct 2, 2017

class2tree replacement #634

Merged

gedankenstuecke closed this as completed Oct 7, 2017

sckott added this to the v0.9.2 milestone Feb 3, 2018
