Tree data (#47) #62

Merged
merged 15 commits into from Apr 21, 2016

@dwinter dwinter commented Jan 25, 2016

@fmichonneau and @josephwb

What do you think about this as a solution to #47? Here's how it works at the moment.

Demonstration

There are two functions for getting external IDs, one for studies...

study_external_IDs("pg_1940")
External data identifiers for study 
 $doi:  10.1017/S001667231000008X 
 $pubmed_id:  20433773 
 $popset_ids: vector of 5 IDs 
 $nucleotide_ids: vector of 164 IDs
 $external_data_url http://purl.org/phylo/treebase/phylows/study/TB2:S10691 

... and another for taxa

taxon_external_IDs(712902)
  source      id
1   ncbi  325167
2   gbif 4827728
3   gbif 4267261
4  irmng 1249869
5  irmng 1452570
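
Since one taxon can map to several IDs per database, a natural next step is to group them by source before handing them to another package. This is only a sketch: it rebuilds the table printed above as a plain data.frame (the object actually returned by taxon_external_IDs may carry extra classes), and the names `taxon_ids` and `ids_by_source` are made up for illustration.

```r
# Rebuild the printed table as a plain data.frame. The values are copied from
# the output above; the real return value may have extra classes or attributes.
taxon_ids <- data.frame(
  source = c("ncbi", "gbif", "gbif", "irmng", "irmng"),
  id = c(325167, 4827728, 4267261, 1249869, 1452570),
  stringsAsFactors = FALSE
)

# Group the IDs by database so each set can be passed to the relevant package.
ids_by_source <- split(taxon_ids$id, taxon_ids$source)

ids_by_source$ncbi  # the single NCBI ID
ids_by_source$gbif  # both GBIF IDs
```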

Those two only get you as far as finding IDs, rather than importing data into an R session. I did write some (currently unexported) functions to summarise a set of nucleotide or popset IDs. Here's one example:

summarize_nucleotide_data(ids$nucleotide_ids[1:10])
                uid
295388256 295388256
295388254 295388254
295388253 295388253
295388251 295388251
295388249 295388249
295388247 295388247
295388245 295388245
295388243 295388243
295388241 295388241
295388239 295388239
                                                                                                                      title
295388256                Hirtodrosophila thoracis cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388254                 Hirtodrosophila duncani cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388253 Hirtodrosophila sp. KVDL-2010 cytochrome c oxidase subunit III-like (COIII) gene, partial sequence; mitochondrial
295388251                Mycodrosophila claytonae cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388249                     Drosophila pinicola cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388247                   Drosophila macrospina cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388245                    Drosophila guttifera cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388243                      Drosophila falleni cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388241                      Zaprionus indianus cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
295388239                     Zaprionus sepsoides cytochrome c oxidase subunit III (COIII) gene, partial cds; mitochondrial
          slen                      organism completeness
295388256  445      Hirtodrosophila thoracis             
295388254  445       Hirtodrosophila duncani             
295388253  363 Hirtodrosophila sp. KVDL-2010             
295388251  445      Mycodrosophila claytonae             
295388249  445           Drosophila pinicola             
295388247  445         Drosophila macrospina             
295388245  445          Drosophila guttifera             
295388243  445            Drosophila falleni             
295388241  445            Zaprionus indianus             
295388239  445           Zaprionus sepsoides 

I'm not sure how useful this really is. It might be better to just document this sort of workflow, including the packages that work with NCBI, GBIF and so on.

What's in the PR

In addition to the functions, there are:

  • Documentation for the exported functions
  • A new section in the mashups vignette demonstrating their use
  • Tests for these functions and their print method
  • A small tweak to strip_ott_ids that lets users optionally replace underscores with spaces
  • A new import for rentrez to get the NCBI data
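
To make the strip_ott_ids tweak concrete, here is a rough base-R sketch of the behaviour described in the list above. It is not the code in this PR: the function name `strip_ids_sketch`, the argument name `remove_underscores`, and the assumption that tip labels end in `_ott` followed by digits are all illustrative.

```r
# Hypothetical stand-in for the tweaked function, not rotl's implementation.
# Assumes tip labels look like "Genus_species_ott12345".
strip_ids_sketch <- function(tip_labels, remove_underscores = FALSE) {
  # Drop the trailing "_ott<digits>" suffix from each label.
  stripped <- sub("_ott[0-9]+$", "", tip_labels)
  if (remove_underscores) {
    # Optionally turn the remaining underscores into spaces.
    stripped <- gsub("_", " ", stripped)
  }
  stripped
}

labels <- c("Drosophila_falleni_ott123", "Zaprionus_indianus_ott456")
strip_ids_sketch(labels)
strip_ids_sketch(labels, remove_underscores = TRUE)
```

Keeping the replacement optional means existing code that relies on underscored labels keeps working unchanged.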

@fmichonneau

That looks great, @dwinter. Thanks for doing that!

Were you thinking that all of this should go into a vignette or just the summary functions?

@dwinter

dwinter commented Feb 1, 2016

Hi @fmichonneau,

So, I was really questioning whether the summarize* functions were going to be very useful. It's hard to predict what users will want to do with the IDs, and we can't really write wrappers for every package that might use them. So, it might make more sense to have rotl functions for gathering IDs, and let users do whatever they want with them.

If you and @josephwb agree, I'll modify this PR to remove the summarize* functions, but include a similar example in the vignette (maybe finding sequences for a given taxon). We can use that to talk about packages that can make use of the other IDs.

*SQUASH*
        rewrite tree_data docs

        Explicitly print for tests of printed output   (new for testthat 1.0)

        taxonomy_taxon -> taxonomy_taxon_info for v3

        Pick study_external test case that doesn't throw warning

        Comment out not-very-helpful summary functions

        Add sequencing fetching e.g. to data mashup vignette

        Use new strip_ott_id fxn in metaanalysis vignette
@dwinter

dwinter commented Apr 21, 2016

Hey @josephwb and @fmichonneau

I think this should now pass all tests.

Just to remind you, it adds two functions, study_external_IDs and taxon_external_IDs, which gather whatever external data is available for studies or taxa. There are also tests and vignette examples for these.

I decided to remove the summarize_nucleotide_data function, which I just don't think would be very helpful. Instead, I added an example to the vignette showing how to fetch DNA sequences using the external IDs.
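
For readers of this thread, the sequence-fetching idea could look roughly like the sketch below. It is not the vignette code: the helper `fetch_study_fasta` and its `n` argument are hypothetical, and it assumes the object returned by study_external_IDs exposes a `nucleotide_ids` element, as the printed output earlier in this thread suggests. It relies on `rentrez::entrez_fetch()` and needs network access when called.

```r
# Hypothetical helper, not part of rotl: fetch FASTA records for the first
# few nucleotide IDs attached to an Open Tree study.
fetch_study_fasta <- function(study_id, n = 5) {
  # Both packages are only needed when the function is called.
  stopifnot(
    requireNamespace("rotl", quietly = TRUE),
    requireNamespace("rentrez", quietly = TRUE)
  )
  ids <- rotl::study_external_IDs(study_id)
  # Assumption: nucleotide IDs are stored in `ids$nucleotide_ids`.
  nuc_ids <- utils::head(ids$nucleotide_ids, n)
  if (length(nuc_ids) == 0) {
    return(character(0))
  }
  # Returns the records as a single FASTA-formatted string.
  rentrez::entrez_fetch(db = "nuccore", id = nuc_ids, rettype = "fasta")
}

# Usage (requires network access):
# fasta <- fetch_study_fasta("pg_1940", n = 2)
# cat(fasta)
```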

Oh, and a new version of rentrez is about to go to CRAN, which should remove the message in the vignette about guessing that the encoding is UTF-8.

Tell me what you think.

@fmichonneau fmichonneau merged commit 3f883ca into master Apr 21, 2016
@fmichonneau

Thanks for this! I think it's going to be really useful.

@fmichonneau fmichonneau deleted the tree_data branch June 12, 2023 16:40