Improving the class2tree resolution #611
Comments
Thanks for this @gedankenstuecke Will think about this and try out your solution pinging @Edild wrt
Thanks, and I haven't heard from @trvinh, but he might have some new insights as well! 👍
any thoughts on this @Edild ?
Let me push to @trvinh as well, I vaguely remember him telling me something about how he fixed this.
pinging @jarioksa in case he has any opinions
I tried creating a matrix that contains an equal number of taxonomy ranks for all taxa, including noranks. Missing rank IDs in the matrix were filled in with the previous available ID. Using that matrix to sort the taxa, I got what I expected.
Ah, let me see whether I got this: So if you have the taxonomy-IDs like this
with 4,5,7,8 being unranked you will fill them up to
?
yes, almost correct :) In your example, as 9,10,11 are main ranks, I already knew their relative positions: 9->10->11, therefore
Ah, okay! Is your Perl script somewhere on GitHub so that people can have a look and potentially adapt it to R?
yup, here it is
@gedankenstuecke any thoughts on this?
okay, thanks for update. will wait and see if something happens - will leave open for now
Thanks! I will be back in the office on Monday and then @trvinh and I can discuss how to best convert his code. I think for the tool he's writing it would be necessary/ideal to get a native R-solution as well.
Great, 👌
I can't quite follow this. The current solution may not be satisfactory, but I think it reflects the data: if you have taxa that are separated at the same level, they will have "multifurcations". That is, if you only have information about families, all genera within a family are separated at the same level. You can produce fancier plots, but they are artifactual. If I get something reproducible, I may think differently. Do you have some information in addition to the classification which should be taken into account?
@jarioksa: I hope that I understood your point correctly. Yes, you are right, if we have only information about families, we will have multifurcation for taxa within a family.
So it's about not using existing information. Got to see how the information is extracted.
So, now we can reconstruct the NCBI taxonomy tree successfully :) Our solution: we first index ALL available levels (including the unranked ones) from the input data in ascending order. Then we create a data frame whose columns are all the levels and whose rows are the taxa, and fill in the IDs for each taxon. Wherever a taxon has no ID at a certain level, we assign it the ID of the previous level. That way all taxa have the same number of level IDs (all rows have the same length). That data frame is used as input for taxa2dist(). Since I am still a beginner in R, my code looks very ugly :-P I'd really appreciate it if you could improve it. Many thanks in advance!!
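For readers following along, the fill-forward step described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the R code that was contributed; the function name `pad_lineages`, the level names, and the numeric IDs are all made up for the example, and it assumes the global order of levels is already known.

```python
from typing import Dict, List, Optional


def pad_lineages(
    lineages: Dict[str, Dict[str, int]],
    levels: List[str],
) -> Dict[str, List[Optional[int]]]:
    """Give every taxon one ID per level, reusing the previous level's ID for gaps.

    `lineages` maps a taxon name to its {level: id} pairs (hypothetical input
    shape); `levels` lists all levels seen in the data, root first.
    """
    table: Dict[str, List[Optional[int]]] = {}
    for taxon, ids in lineages.items():
        row: List[Optional[int]] = []
        previous: Optional[int] = None  # stays None if the very first level is missing
        for level in levels:
            current = ids.get(level, previous)  # missing level -> carry previous ID forward
            row.append(current)
            previous = current
        table[taxon] = row
    return table


# Hypothetical data: taxon_b lacks both unranked levels.
levels = ["kingdom", "norank_1", "phylum", "norank_2", "class"]
lineages = {
    "taxon_a": {"kingdom": 1, "norank_1": 2, "phylum": 3, "norank_2": 4, "class": 5},
    "taxon_b": {"kingdom": 1, "phylum": 3, "class": 6},
}
padded = pad_lineages(lineages, levels)
```

Every row of `padded` has the same length, which is the property the comment above relies on before handing the table to a distance function.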
Got to have a look at the details of the code. However, the solution is the correct one.
I suspect there might be some ways to make the code much more R-like. Given that @trvinh and I are rather new to R (coming from Python and Perl) we learned quite a bit about R data types in the process. 😂
@gedankenstuecke and @trvinh - are you meaning to modify code more? you can send a PR, or would you rather i just add it?
@sckott : we will send a PR next week.
sounds good
Closed through #634 😊
The function class2tree converts a list of taxonomic classifications into a tree. So far it only takes ranked levels into account. This leads to extreme multifurcations, as there are only so many ranked taxonomic levels, which limits the usefulness of the resulting trees. In comparison, this is what the NCBI Taxonomy tree for the same taxa looks like:
Based on an idea by @trvinh (also pinging as he's the one who had the issue in the first place) I started to explore the idea of using the total vectors of classifications for calculating the pairwise distances. The quick & dirty idea:
Calculating the distance between two taxa A and B as (A ∪ B) - (A ∩ B), reasoning that the closer the intersection of the two classifications gets to their union, the smaller the distance. An example implementation of that is here.
This yields a better-resolved tree; alas, the branchings no longer fit the taxonomy. The reason for that in the example: Drosophila and Anopheles have a very detailed, shared sub-classification (starting at the phylum level), which joins them to the exclusion of basically all other taxa.
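A minimal sketch of that distance, in Python rather than R and not the linked implementation: it assumes each classification is a list of node labels and takes the distance to be the number of labels in the union but not in the intersection. The function name `classification_distance` and the placeholder labels are invented for illustration.

```python
from typing import Iterable


def classification_distance(a: Iterable[str], b: Iterable[str]) -> int:
    """Size of (A ∪ B) - (A ∩ B), i.e. the symmetric difference of the two label sets."""
    return len(set(a) ^ set(b))


# Placeholder lineages: two taxa with a long shared classification, one with a short one.
fly = ["root", "a", "b", "c", "d", "e", "fly"]
mosquito = ["root", "a", "b", "c", "d", "e", "mosquito"]
beetle = ["root", "a", "b", "beetle"]

d_fly_mosquito = classification_distance(fly, mosquito)
d_fly_beetle = classification_distance(fly, beetle)
```

Here the two deeply annotated lineages end up much closer to each other than either is to the sparsely annotated one, which is the kind of effect described above for Drosophila and Anopheles: the distance depends on how detailed each classification is, not only on where the lineages diverge.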
I guess my tl;dr is: The current class2tree solution is too low-resolution to resolve meaningful trees. My proposed alternative has its own issues. What can we do to make it better? 😄