Improving the class2tree resolution #611

Closed
gedankenstuecke opened this issue May 25, 2017 · 26 comments

@gedankenstuecke
Contributor

The function class2tree converts a list of taxonomic classifications into a tree. So far it only takes ranked levels into account. This leads to extreme multifurcations, as there are only so many ranked taxonomic levels, which limits the usefulness of the resulting trees.

[Image: screen shot 2017-05-25 at 09 34 51]

In comparison, this is what the NCBI Taxonomy tree for the same taxa looks like:

[Image: screen shot 2017-05-25 at 09 38 01]

Based on an idea by @trvinh (also pinging him, as he's the one who had the issue in the first place), I started to explore using the full classification vectors to calculate the pairwise distances. The quick & dirty idea:

Calculate the distance between two taxa A and B as the size of (A ∪ B) - (A ∩ B), where A and B are the sets of taxon IDs in the two classifications. The reasoning: the closer the intersection of the two classifications comes to their union, the smaller the distance. An example implementation of that is here.
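For readers who want to try the idea, here is a minimal Python sketch of that distance. It is not the linked implementation; it simply treats each classification as a set of taxon IDs and counts the IDs that occur in only one of the two. The function name and the toy lineages are made up for illustration.

```python
def classification_distance(lineage_a, lineage_b):
    """Count the taxon IDs found in exactly one of the two lineages.

    This is |(A ∪ B) - (A ∩ B)|, the size of the symmetric difference.
    """
    a, b = set(lineage_a), set(lineage_b)
    return len((a | b) - (a & b))


# Hypothetical lineages: lists of taxon IDs from root to tip.
fly = [1, 2, 3, 4, 5, 6, 7]       # deep, detailed classification
mosquito = [1, 2, 3, 4, 5, 6, 8]  # shares almost all of it
worm = [1, 2, 9]                  # shallow classification

print(classification_distance(fly, mosquito))  # 2
print(classification_distance(fly, worm))      # 6
```

Note that the result depends on how deep each classification is: two taxa with long, mostly shared lineages end up close together, while a taxon with a short lineage is pushed away from everything.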

[Image: screen shot 2017-05-25 at 09 35 40]

This yields a better-resolved tree, but the branching no longer fits the taxonomy. The reason in this example: Drosophila and Anopheles share a very detailed sub-classification (starting at the phylum level), which joins them to the exclusion of basically all other taxa.

I guess my tl;dr is: The current class2tree solution is too low-resolution to resolve meaningful trees. My proposed alternative has its own issues. What can we do to make it better? 😄

@sckott
Contributor

sckott commented May 27, 2017

Thanks for this, @gedankenstuecke. Will think about this and try out your solution.

pinging @Edild wrt class2tree

@gedankenstuecke
Contributor Author

Thanks, and I haven't heard from @trvinh, but he might have some new insights as well! 👍

@sckott
Contributor

sckott commented Jun 27, 2017

any thoughts on this @Edild ?

@gedankenstuecke
Contributor Author

Let me push to @trvinh as well, I vaguely remember him telling me something about how he fixed this.

@sckott
Contributor

sckott commented Jun 27, 2017

pinging @jarioksa in case he has any opinions

@trvinh
Contributor

trvinh commented Jun 28, 2017

I tried creating a matrix that contains an equal number of taxonomy ranks for all taxa, including the noranks. Missing rank IDs in the matrix were filled in with the previous available ID. Using that matrix to sort the taxa, I got what I expected.
However, I have to use a Perl script to create that matrix. It would be great if it could be done directly in R :)

@gedankenstuecke
Contributor Author

Ah, let me see whether I got this: So if you have the taxonomy-IDs like this

Species A: 1,2,3,4,5,6,7,8,9
Species B: 1,2,3,10,11

with 4,5,7,8 being unranked you will fill them up to

Species A: 1,2,3,4,5,6,7,8,9
Species B: 1,2,3,3,3,10,10,10,11

?

@trvinh
Contributor

trvinh commented Jun 28, 2017

yes, almost correct :)
However, I also fix the relative positions of all possible main ranks from strain to superkingdom (regardless of the taxa of interest), and also the relative positions of each individual taxon's rank IDs.

In your example, as 9, 10, 11 are main ranks, I already know their relative positions: 9->10->11, therefore

Species A: 1,2,3, 4 , 5 , 6 , 7 , 8 , 9 ,(9),(9)
Species B: 1,2,3,(3),(3),(3),(3),(3),(3), 10, 11
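The fill-in step from this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the Perl script mentioned above: the function name is hypothetical, and the column positions assigned to each ID are an assumption read off the two rows shown.

```python
def fill_missing_levels(ids_by_column, n_columns):
    """Return a full-length row; an empty column repeats the last ID seen.

    Columns before the first known ID stay None.
    """
    row = []
    previous = None
    for column in range(n_columns):
        previous = ids_by_column.get(column, previous)
        row.append(previous)
    return row


# 0-based column positions for the example above. The positions are the
# assumption here; the IDs themselves come from the thread.
species_a = {i: i + 1 for i in range(9)}       # IDs 1..9 in columns 0..8
species_b = {0: 1, 1: 2, 2: 3, 9: 10, 10: 11}  # 10 and 11 are placed after 9

print(fill_missing_levels(species_a, 11))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9]
print(fill_missing_levels(species_b, 11))
# [1, 2, 3, 3, 3, 3, 3, 3, 3, 10, 11]
```

Both rows come out with the same length, which is what makes them comparable column by column.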

@gedankenstuecke
Contributor Author

Ah, okay! Is your perl script somewhere on GitHub so that people can have a look and potentially adapt it to R?

@trvinh
Contributor

trvinh commented Jun 28, 2017

yup, here it is
sorry for the dirty code :P

@sckott
Contributor

sckott commented Sep 7, 2017

@gedankenstuecke any thoughts on this?

@gedankenstuecke
Contributor Author

@sckott Due to thesis writing I didn't have any time to look into this in depth. I remember @trvinh and I discussed moving his solution from Perl to R, but I have no clue what the status is for that.

@sckott
Contributor

sckott commented Sep 8, 2017

okay, thanks for update. will wait and see if something happens - will leave open for now

@gedankenstuecke
Contributor Author

Thanks! I will be back in the office on Monday and then @trvinh and I can discuss how to best convert his code. I think for the tool he's writing it would be necessary/ideal to get a native R-solution as well.

@sckott
Contributor

sckott commented Sep 8, 2017

Great, 👌

@jarioksa
Contributor

I can't quite follow this. The current solution may not be satisfactory, but I think it is based on the data: if you have taxa that are separated at the same level, they will have "multifurcations". That is, if you have only information about families, all genera within the family are separated at the same level. You can produce fancier plots, but they are artifactual. If I get something reproducible, I may think differently. Do you have some information in addition to classification which should be taken into account?

@trvinh
Contributor

trvinh commented Sep 10, 2017

@jarioksa: I hope that I understood your point correctly. Yes, you are right, if we have only information about families, we will have multifurcations for taxa within a family.
With our current idea, if the information about the splits is missing, we still have multifurcations. We are not trying to remove multifurcations by creating artifactual trees (plots). Instead, we just want to resolve them as much as we can by using more information (here, the IDs of "norank" levels between "main" levels strain, species, genus, ....).

@jarioksa
Contributor

So it's about not using existing information. Got to see how the information is extracted.

@trvinh
Contributor

trvinh commented Sep 21, 2017

So, now we can reconstruct the NCBI taxonomy tree successfully :)
[Image: taxonomytree]

Our solution: we first index ALL available levels (including the unranked ones) from the input data in ascending order. Then we create a data frame whose columns are the levels and whose rows are the taxa, and place each taxon's IDs in it. Where a taxon has no ID for a level, we assign the ID of the previous level. That way all taxa have the same number of level IDs (all rows have the same length). That data frame is used as input for taxa2dist().
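As a rough illustration of the indexing and table-building steps described here, the following Python sketch merges the level order of several lineages into one column order and then fills the gaps. It is not the R code from this thread and does not call taxa2dist(). Assumptions: each lineage is a root-to-tip list of (level label, ID) pairs, every level (including unranked ones) has a unique label, and the function names, labels and IDs are invented for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def index_levels(lineages):
    """Merge the per-taxon level orders into one global column order."""
    sorter = TopologicalSorter()
    for lineage in lineages.values():
        levels = [level for level, _ in lineage]
        if not levels:
            continue
        sorter.add(levels[0])
        for upper, lower in zip(levels, levels[1:]):
            sorter.add(lower, upper)  # upper must come before lower
    # Raises graphlib.CycleError if two lineages disagree on the order.
    return list(sorter.static_order())


def build_table(lineages):
    """One row per taxon, one column per level; gaps repeat the previous ID."""
    columns = index_levels(lineages)
    table = {}
    for taxon, lineage in lineages.items():
        ids = dict(lineage)
        row, previous = [], None
        for level in columns:
            previous = ids.get(level, previous)
            row.append(previous)
        table[taxon] = row
    return columns, table


# Hypothetical input: made-up level labels and IDs.
lineages = {
    "taxon_x": [("kingdom", 1), ("phylum", 2), ("norank_a", 3),
                ("class", 4), ("order", 5)],
    "taxon_y": [("kingdom", 1), ("phylum", 2), ("class", 4),
                ("norank_b", 6), ("order", 7)],
    "taxon_z": [("kingdom", 1), ("phylum", 8), ("order", 9)],
}

columns, table = build_table(lineages)
print(columns)
for taxon, row in table.items():
    print(taxon, row)
```

With this input the level order is fully determined by the lineages, so every row has six entries. When the input leaves the relative order of two levels open, the topological sort picks one valid order, which is one reason a fixed ordering of the main ranks, as suggested earlier in the thread, is useful.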

Since I am still a beginner in R, my code looks very ugly :-P I'd really appreciate it if you could improve it. Many thanks in advance!!

@sckott
Contributor

sckott commented Sep 21, 2017

thanks @trvinh !

@jarioksa can you chime in with any thoughts you have on the above solution/code?

@jarioksa
Contributor

Got to have a look at the details of the code. However, the solution is the correct one.

@gedankenstuecke
Contributor Author

I suspect there might be some ways to make the code much more R-like. Given that @trvinh and I are rather new to R (coming from Python and Perl) we learned quite a bit about R data types in the process. 😂

@sckott
Contributor

sckott commented Sep 29, 2017

@gedankenstuecke and @trvinh - are you meaning to modify code more? you can send a PR, or would you rather i just add it?

@trvinh
Contributor

trvinh commented Sep 29, 2017

@sckott : we will send a PR next week.

@sckott
Contributor

sckott commented Sep 29, 2017

sounds good

@gedankenstuecke
Contributor Author

Closed through #634 😊

sckott added this to the v0.9.2 milestone Feb 3, 2018