Add the newly compiled README

ropensci · Sep 22, 2015 · b0ba7d5
1 parent 1e723d2
commit b0ba7d5
Showing 1 changed file with 44 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -45,17 +45,12 @@ hox_data
 # elink object with contents:
 #  $links: IDs for linked records from NCBI
 # 
-hox_data$links
-# elink result with information from 14 databases:
-#  [1] pubmed_medgen              pubmed_mesh_major         
-#  [3] pubmed_nuccore             pubmed_nucleotide         
-#  [5] pubmed_pmc_refs            pubmed_protein            
-#  [7] pubmed_pubmed              pubmed_pubmed_alsoviewed  
-#  [9] pubmed_pubmed_citedin      pubmed_pubmed_combined    
-# [11] pubmed_pubmed_five         pubmed_pubmed_reviews     
-# [13] pubmed_pubmed_reviews_five pubmed_taxonomy_entrez
 ```
 
+In this case all the data is in the `links` element:
+
+    hox_data$links
+
 Each of the character vectors in this object contains unique IDs for records in the named databases. These functions try to make the most useful bits of the returned files available to users, but they also return the original file in case you want to dive into the XML yourself.
 
 In this case we'll get the protein sequences as fasta files, using `entrez_fetch`:
@@ -94,6 +89,11 @@ katipo_summs
 #  [1] uid        caption    title      extra      gi         settype   
 #  [7] createdate updatedate flags      taxid      authors    article   
 # [13] journal    strain     statistics properties oslt
+```
+
+And we can extract specific elements from a list of summary records with `extract_from_esummary`:
+
+``` r
 extract_from_esummary(katipo_summs, "title")
 #                                                                                                                                                                                                                  167843272 
 # "Latrodectus katipo 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence." 
@@ -125,15 +125,21 @@ write(COI, "Test/COI.fasta")
 write(trnL, "Test/trnL.fasta")
 ```
 
-Once you've got the sequences you can do what you want with them, but I wanted a phylogeny so let's do that with ape:
+Once you've got the sequences you can do what you want with them, but I wanted a phylogeny so let's do that. To get a nice tree with legible tip labels I'm going to use `stringr` to extract just the species names and `ape` to build and root a neighbor-joining tree:
 
 ``` r
 library(ape)
-coi <- read.dna("Test/COI.fasta", "fasta")
-coi_aligned <- clustal(coi)
+tf <- tempfile()
+write(COI, tf)
+coi <- read.dna(tf, format="fasta")
+coi_aligned <- muscle(coi)
 tree <- nj(dist.dna(coi_aligned))
+tree$tip.label <- stringr::str_extract(tree$tip.label, "Steatoda [a-z]+|Latrodectus [a-z]+")
+plot(root(tree, outgroup="Steatoda grossa"))
 ```
 
+![](http://i.imgur.com/OanQIgx.png)
+
 ### web\_history and big queries
 
 The NCBI provides search history features, which can be useful for dealing with large lists of IDs (which will not fit in a single URL) or repeated searches. As an example, imagine you wanted to learn something about all of the SNPs on the Y chromosome in humans. You could first find these SNPs using `entrez_search`
@@ -152,17 +158,35 @@ snp_search <- entrez_search(db="snp", term="y[CHR] AND Homo[ORGN]",  use_history
 snp_search
 # Entrez search result with 368846 hits (object contains 20 IDs and a web_history object)
 #  Search term (as translated):  y[CHR] AND "Homo"[Organism]
-snp_search$web_history
-# Web history object (QueryKey = 1, WebEnv = NCID_1_19150...)
 ```
 
-We can now use the `web_history` object to refer to all those IDs in later calls using `entrez_link` or `entrez_fetch`. Here we will just fetch complete records of the first 5 SNPs in tabular "cluster report" format:
+We can now use the `web_history` object to refer to all those IDs in later calls using `entrez_link` or `entrez_fetch`. Here we will just fetch complete records of the first 5 SNPs in tabular "cluster report" format:
 
 ``` r
 recs <- entrez_fetch(db="snp", web_history=snp_search$web_history, retmax=5, rettype="rsr")
-recs
-# [1] "snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID\n-----------   -------------    ----------------------------------------\n781782243\t1556739404\t1000GENOMES\tPHASE3_chrY_197777\n\nsnp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID\n-----------   -------------    ----------------------------------------\n781782114\t1556776570\t1000GENOMES\tPHASE3_chrY_234943\n\nsnp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID\n-----------   -------------    ----------------------------------------\n781781837\t1694440247\tEVA_EXAC\tEXAC_0.3.X:g595429g>a\n\nsnp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID\n-----------   -------------    ----------------------------------------\n781781523\t1536954443\tDDI\tkw335948\n781781523\t1577678940\tEVA_GENOME_DK\tgatk.Y:g19138349cta>c\n\nsnp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID\n-----------   -------------    ----------------------------------------\n781780734\t1553226422\t1000GENOMES\tPHASE3_chrX_246\n\n"
-y_chrom_linked_disaese <- entrez_link(dbfrom="snp", db="omim", web_history=snp_search$web_history)
+cat(recs, "\n")
+# snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID
+# -----------   -------------    ----------------------------------------
+# 781782243 1556739404  1000GENOMES PHASE3_chrY_197777
+# 
+# snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID
+# -----------   -------------    ----------------------------------------
+# 781782114 1556776570  1000GENOMES PHASE3_chrY_234943
+# 
+# snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID
+# -----------   -------------    ----------------------------------------
+# 781781837 1694440247  EVA_EXAC    EXAC_0.3.X:g595429g>a
+# 
+# snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID
+# -----------   -------------    ----------------------------------------
+# 781781523 1536954443  DDI kw335948
+# 781781523 1577678940  EVA_GENOME_DK   gatk.Y:g19138349cta>c
+# 
+# snp_id(rs)    subsnp_id(ss)    submitter handle    submitter snp ID
+# -----------   -------------    ----------------------------------------
+# 781780734 1553226422  1000GENOMES PHASE3_chrX_246
+# 
+# 
 ```
 
 ### Getting information about NCBI databases
@@ -201,7 +225,7 @@ entrez_db_summary("cdd")
 #  Description: Conserved Domain Database
 #  DbBuild: Build150814-1106.1
 #  Count: 50648
-#  LastUpdate: 2015/08/14 18:28
+#  LastUpdate: 2015/08/14 18:49
 ```
 
 `entrez_db_searchable()` lets you discover the fields available for search terms for a given database. You get back a named list whose names are the search fields. Each element has additional information about the named search field (you can also use `as.data.frame` to create a data frame with one search field per row):
@@ -212,7 +236,7 @@ search_fields$GRNT
 #  Name: GRNT
 #  FullName: Grant Number
 #  Description: NIH Grant Numbers
-#  TermCount: 2220883
+#  TermCount: 2220915
 #  IsDate: N
 #  IsNumerical: N
 #  SingleToken: Y