Add parentnames to output of bold_identify() #36

Closed
dougwyu opened this issue Dec 6, 2016 · 10 comments

dougwyu commented Dec 6, 2016

Hi there,

When I use bold_identify, I get the lowest-level taxonomic identification for that sequence (the taxonomicidentification field), but it would be very useful to also get the parentnames for that identification. The BOLD APIs do provide this information if I use three different bold package commands (see below), but I would then have to do some programming in R (not my strength) to insert the parentnames into the bold_identify output table. It seems to me that this would be better done within the bold package, if you fancy it.

also, maybe i have missed something, but i think it would be nicer to get parentnames from a taxid, not a taxonomicidentification field, given the (small) possibility for ambiguity.

thanks,
doug

library(bold)
library(plyr)
testseq <- list(eb4909 = "GAATAAATAATATAAGATTTTGATTACTCCCTCCTTCTTTATTtttATTAATTTTAAGAAATTTTATTGGAACGGGTGTAGGAACCGGATGAACTTTATATCCTCCTTTATCATCTATTGTTGGACATGATTCACCTTCTGTAGATTTAGGAATTttttCTATCCATATTGCTGGAATTTCCTCAATTATAGGATCAATTAATTTTATTGTTACTATTTTAAATATACacacaAaaaCTCATTCACTAAATTTTCTTCCTTTATTCACATGATCAATTTTAATTACAGCAATTCTTCTTCTGTTATCATTACCAGTTCTTGCAGGAGCAATTACTATACTTCTTACAGATCGAAATCTTAATACATCTTtttttGATCCCGCAGGTGGgggggATCCAATTTTATACCAACACTTATTTT")
# Step 1: identify the sequence against the public COI species database
boldoutput_public <- bold_identify(testseq, db="COX1_SPECIES_PUBLIC")
boldoutput_public.df <- ldply(boldoutput_public, data.frame)
# Step 2: look up the taxon record (including its taxid) for one particular identification (third row)
boldoutput_public_tax_name <- bold_tax_name(name=boldoutput_public.df[3,6])
# Step 3: fetch the parent taxa for that taxid
boldoutput_public_tax_name.parents <- bold_tax_id(id=boldoutput_public_tax_name$taxid, includeTree = TRUE)

sckott added this to the v0.4 milestone Dec 6, 2016

Contributor

sckott commented Dec 6, 2016

So you want parentnames included in the output for the bold_identify function? Or, do you want to have a separate function to get parentnames for some subset (or all) of the output from bold_identify?

> also, maybe i have missed something, but i think it would be nicer to get parentnames from a taxid, not a taxonomicidentification field, given the (small) possibility for ambiguity.

I don't follow this. You want to get parentnames using a taxonomic ID rather than taxonomic name? But isn't that what bold already allows you to do?

Author

dougwyu commented Jan 1, 2017

Yes, the convenient thing would be to have parentnames as additional columns in the bold_identify output. In general, the taxonomic position of a given genus and species is not obvious to me (e.g. what is the Class/Order/Family of Allomerus octospinosus?). Also, with parentname columns, I would be able to sort output tables by higher ranks (e.g. Insecta, Hymenoptera), which is quite useful.
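
To illustrate the sorting use case described here, the following is a minimal sketch on a toy table. The column names (`class`, `order`, `similarity`) and all values are assumptions made for the example, not actual bold_identify output.

```r
# Toy wide-format hit table; all values are illustrative, not real BOLD output.
hits <- data.frame(
  ID         = c("hit1", "hit2", "hit3", "hit4"),
  species    = c("Allomerus octospinosus", "Paratergatis longimanus",
                 "Apis mellifera", "Bombus terrestris"),
  class      = c("Insecta", "Malacostraca", "Insecta", "Insecta"),
  order      = c("Hymenoptera", "Decapoda", "Hymenoptera", "Hymenoptera"),
  similarity = c(0.98, 0.95, 0.99, 0.97),
  stringsAsFactors = FALSE
)

# Sort by higher ranks first, then by descending similarity within each group.
hits_sorted <- hits[order(hits$class, hits$order, -hits$similarity), ]

# Tally the number of hits per order.
order_counts <- table(hits$order)

# Keep only the Hymenoptera hits.
hymenoptera_hits <- hits[hits$order == "Hymenoptera", ]
```

With one row per hit, each of these operations is a single base R call, which is the convenience being asked for.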

The reason that I request using taxid, not taxonomicidentification is that sometimes genus names are used in different kingdoms (famously, Anura is a plant and a frog).
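
The ambiguity concern can be sketched with a small made-up lookup table. The taxid values below are hypothetical placeholders, not real BOLD identifiers; the point is only that a name can match more than one record while an ID matches exactly one.

```r
# Hypothetical lookup table: the taxid values are invented for illustration.
taxa <- data.frame(
  taxid        = c(101, 202),
  taxon        = c("Anura", "Anura"),
  tax_division = c("Animals", "Plants"),
  stringsAsFactors = FALSE
)

# Looking up by name returns both records, so the result is ambiguous.
by_name <- taxa[taxa$taxon == "Anura", ]

# Looking up by ID returns exactly one record.
by_id <- taxa[taxa$taxid == 101, ]
```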

Thanks for replying so quickly. I don't seem to be able to set up an email notification that you have replied on github. I'll look around.

Contributor

sckott commented Jan 2, 2017

> The reason that I request using taxid, not taxonomicidentification is that sometimes genus names are used in different kingdoms (famously, Anura is a plant and a frog).

Not sure we're on the same page here. Is this line of discussion about the bold_tax_id function? If so, that does accept a taxonomic ID, which is what you want, correct? If it's not that function, which one are you talking about?

Contributor

sckott commented Jan 2, 2017

for email notifications, perhaps go to this page https://github.com/settings/notifications

Contributor

sckott commented Jan 2, 2017

@dougwyu I started a new function. Reinstall with devtools::install_github("ropensci/bold")

see bold_identify_parents() and its examples

let me know what you think

Author

dougwyu commented Jan 5, 2017

Fantastic and thank you!

I successfully ran the new command and was initially confused by all the additional rows, but i see what you've done: each ID is effectively its own little dataframe.

I am thinking that it might be more useful for the output to be wider, such that each of the original hits remains one line. Here is an image of what I'm thinking. It maintains most of the newly added information but allows one to filter, sort, and tally the output more easily.

[Image: screen shot 2017-01-05 at 08 47 31]

Perhaps the number of returned taxids differs per sequence(?), but it seems fine to settle on a fixed set of taxonomic ranks: phylum, class, order, family, subfamily, genus, species.
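
A fixed set of ranks could be handled roughly as follows. This is only a sketch: `lineage_to_row` is a hypothetical helper, not a function in the bold package, and it assumes a long-format lineage with `tax_rank` and `taxon` columns. Ranks missing from a lineage are filled with NA so that every hit ends up with the same columns.

```r
fixed_ranks <- c("phylum", "class", "order", "family",
                 "subfamily", "genus", "species")

# Hypothetical helper (not part of the bold package): turn one long-format
# lineage (columns `tax_rank` and `taxon`) into a single wide row.
lineage_to_row <- function(lineage, ranks = fixed_ranks) {
  # match() returns NA for ranks absent from the lineage, which yields NA taxa.
  values <- lineage$taxon[match(ranks, lineage$tax_rank)]
  as.data.frame(as.list(setNames(values, ranks)), stringsAsFactors = FALSE)
}

# A lineage that lacks the subfamily and species ranks.
partial <- data.frame(
  tax_rank = c("phylum", "class", "order", "family", "genus"),
  taxon    = c("Arthropoda", "Insecta", "Hymenoptera", "Formicidae", "Allomerus"),
  stringsAsFactors = FALSE
)

wide_partial <- lineage_to_row(partial)
```

Because every row has the same columns, rows built this way for different hits can be stacked with `rbind` even when the number of returned ranks differs per sequence.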


ps this is what I ran:

testseq <- list(eb4909 = "GAATAAATAATATAAGATTTTGATTACTCCCTCCTTCTTTATTtttATTAATTTTAAGAAATTTTATTGGAACGGGTGTAGGAACCGGATGAACTTTATATCCTCCTTTATCATCTATTGTTGGACATGATTCACCTTCTGTAGATTTAGGAATTttttCTATCCATATTGCTGGAATTTCCTCAATTATAGGATCAATTAATTTTATTGTTACTATTTTAAATATACacacaAaaaCTCATTCACTAAATTTTCTTCCTTTATTCACATGATCAATTTTAATTACAGCAATTCTTCTTCTGTTATCATTACCAGTTCTTGCAGGAGCAATTACTATACTTCTTACAGATCGAAATCTTAATACATCTTtttttGATCCCGCAGGTGGgggggATCCAATTTTATACCAACACTTATTTT")
boldoutput_public <- bold_identify(testseq, db="COX1_SPECIES_PUBLIC")
boldoutput_public_parents <- bold_identify_parents(boldoutput_public)
boldoutput_public_parents.df <- ldply(boldoutput_public_parents, data.frame)

Contributor

sckott commented Jan 5, 2017

I thought about using wide format, but thought it made more sense to give back the results as I did with repeated rows for each record.

The data.frame for parents looks like this:

$`Paratergatis longimanus`
   taxid                   taxon  tax_rank tax_division parentid   parentname     taxonrep
1     20              Arthropoda    phylum      Animals        1         <NA>   Arthropoda
2     69            Malacostraca     class      Animals       20   Arthropoda Malacostraca
3    336                Decapoda     order      Animals       69 Malacostraca     Decapoda
4   1541               Xanthidae    family      Animals      336     Decapoda    Xanthidae
5 305321               Zosiminae subfamily      Animals     1541    Xanthidae         <NA>
6 322442            Paratergatis     genus      Animals   305321    Zosiminae         <NA>
7 503362 Paratergatis longimanus   species      Animals   322442 Paratergatis         <NA>

In your example above you use just two of those columns. When I was considering wide format, I thought it was way too many columns to add if we used all of them, but I guess if it's just two columns it's more palatable to add them.
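
As a rough illustration of the two-column idea, the sketch below takes the taxid, taxon, and tax_rank values from the table above and spreads them into a single row. The `<rank>_taxid` column naming is an assumption for this example and is not necessarily what the package produces.

```r
# Values copied from the parents table shown above.
parents <- data.frame(
  taxid    = c(20, 69, 336, 1541, 305321, 322442, 503362),
  taxon    = c("Arthropoda", "Malacostraca", "Decapoda", "Xanthidae",
               "Zosiminae", "Paratergatis", "Paratergatis longimanus"),
  tax_rank = c("phylum", "class", "order", "family",
               "subfamily", "genus", "species"),
  stringsAsFactors = FALSE
)

# One column per rank for the name, plus one "<rank>_taxid" column for the ID.
name_cols <- setNames(as.list(parents$taxon), parents$tax_rank)
id_cols   <- setNames(as.list(parents$taxid),
                      paste0(parents$tax_rank, "_taxid"))
wide_row  <- as.data.frame(c(name_cols, id_cols), stringsAsFactors = FALSE)
```

Seven ranks with two values each give 14 extra columns per hit, compared with seven extra rows per hit in the long format.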

Contributor

sckott commented Jan 5, 2017

@dougwyu try it again after reinstalling, see new parameter wide
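
For readers following along, a call using the new parameter might look like the sketch below. It assumes that `wide` is a logical flag and that `wide = TRUE` selects the one-row-per-hit layout; `identify_with_parents` is a hypothetical wrapper written for this example. Running it requires the bold package and network access to the BOLD API.

```r
# Hypothetical wrapper (not part of the bold package).
identify_with_parents <- function(seqs, db = "COX1_SPECIES_PUBLIC") {
  # Identify the sequences against the chosen BOLD database.
  hits <- bold::bold_identify(seqs, db = db)
  # Assumption: wide = TRUE keeps one row per hit and adds parent taxa as columns.
  bold::bold_identify_parents(hits, wide = TRUE)
}

# Usage (not run here, needs network access):
# res <- identify_with_parents(testseq)
# res_df <- plyr::ldply(res, data.frame)
```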

Author

dougwyu commented Jan 6, 2017

That works great! I have tried with one sequence and with 5 sequences. Exactly what I need (and I suspect many others). Thanks very much Scott.

Contributor

sckott commented Jan 6, 2017

great

sckott closed this as completed Jan 6, 2017