Add the newly compiled README
dwinter committed Sep 22, 2015
1 parent 1e723d2 commit b0ba7d5
Showing 1 changed file with 44 additions and 20 deletions.
64 changes: 44 additions & 20 deletions README.md
@@ -45,17 +45,12 @@ hox_data
# elink object with contents:
# $links: IDs for linked records from NCBI
#
hox_data$links
# elink result with information from 14 databases:
# [1] pubmed_medgen pubmed_mesh_major
# [3] pubmed_nuccore pubmed_nucleotide
# [5] pubmed_pmc_refs pubmed_protein
# [7] pubmed_pubmed pubmed_pubmed_alsoviewed
# [9] pubmed_pubmed_citedin pubmed_pubmed_combined
# [11] pubmed_pubmed_five pubmed_pubmed_reviews
# [13] pubmed_pubmed_reviews_five pubmed_taxonomy_entrez
```

In this case all the data is in the `links` element:

    hox_data$links

Each of the character vectors in this object contains unique IDs for records in the named databases. These functions try to make the most useful bits of the returned files available to users, but they also return the original file in case you want to dive into the XML yourself.
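
As a rough sketch of how such a list of ID vectors can be inspected, the snippet below uses a hypothetical stand-in for `hox_data$links` (the database names are taken from the output above; the IDs are placeholders, not real NCBI identifiers) and counts the linked IDs per database with base R:

``` r
# Hypothetical stand-in for hox_data$links: a named list of character vectors.
# The IDs are placeholders for illustration only, not real NCBI identifiers.
links <- list(
  pubmed_protein         = c("id_a", "id_b", "id_c"),
  pubmed_nuccore         = c("id_d", "id_e"),
  pubmed_taxonomy_entrez = character(0)
)

# Number of linked IDs found in each database
n_links <- lengths(links)
n_links

# Names of the databases that returned at least one linked record
linked_dbs <- names(links)[n_links > 0]
linked_dbs
```

The same two calls should work on a real `links` element, assuming it behaves like an ordinary named list of character vectors.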

In this case we'll get the protein sequences as FASTA files, using `entrez_fetch`:
@@ -94,6 +89,11 @@ katipo_summs
# [1] uid caption title extra gi settype
# [7] createdate updatedate flags taxid authors article
# [13] journal strain statistics properties oslt
```

And we can extract specific elements from a list of summary records with `extract_from_esummary`:

``` r
extract_from_esummary(katipo_summs, "title")
# 167843272
# "Latrodectus katipo 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence."
@@ -125,15 +125,21 @@ write(COI, "Test/COI.fasta")
write(trnL, "Test/trnL.fasta")
```

Once you've got the sequences you can do what you want with them, but I wanted a phylogeny so let's do that with ape:
Once you've got the sequences you can do what you want with them, but I wanted a phylogeny, so let's do that. To get a nice tree with legible tip labels I'm going to use `stringr` to extract just the species names and `ape` to build and root a neighbor-joining tree:

``` r
library(ape)
coi <- read.dna("Test/COI.fasta", "fasta")
coi_aligned <- clustal(coi)
tf <- tempfile()
write(COI, tf)
coi <- read.dna(tf, format="fasta")
coi_aligned <- muscle(coi)
tree <- nj(dist.dna(coi_aligned))
tree$tip.label <- stringr::str_extract(tree$tip.label, "Steatoda [a-z]+|Latrodectus [a-z]+")
plot(root(tree, outgroup="Steatoda grossa"))
```
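
To see what the label clean-up step does in isolation, here is a minimal sketch that applies the same regular expression to a few made-up sequence labels. The labels and the `accession_*` prefixes are hypothetical, and base R's `regexpr`/`regmatches` are swapped in for `stringr::str_extract` so the snippet has no package dependencies:

``` r
# Hypothetical FASTA-style labels; the accession prefixes are placeholders.
labels <- c(
  "accession_1 Latrodectus katipo cytochrome oxidase subunit I gene, partial cds",
  "accession_2 Steatoda grossa cytochrome oxidase subunit I gene, partial cds",
  "accession_3 unidentified spider"
)

# Same pattern as in the README: genus name followed by a lower-case epithet
pattern <- "Steatoda [a-z]+|Latrodectus [a-z]+"

# regexpr() returns -1 where there is no match; regmatches() returns only the
# matched substrings, so fill them into a vector that defaults to NA.
m <- regexpr(pattern, labels)
species <- rep(NA_character_, length(labels))
species[m > 0] <- regmatches(labels, m)
species
```

Labels that match neither genus come back as `NA`, which is worth checking for before assigning the result to the tip labels of a tree.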

![](http://i.imgur.com/OanQIgx.png)

### web\_history and big queries

The NCBI provides search history features, which can be useful for dealing with large lists of IDs (which will not fit in a single URL) or repeated searches. As an example, imagine you wanted to learn something about all of the SNPs on the Y chromosome in humans. You could first find these SNPs using `entrez_search`
@@ -152,17 +158,35 @@ snp_search <- entrez_search(db="snp", term="y[CHR] AND Homo[ORGN]", use_history
snp_search
# Entrez search result with 368846 hits (object contains 20 IDs and a web_history object)
# Search term (as translated): y[CHR] AND "Homo"[Organism]
snp_search$web_history
# Web history object (QueryKey = 1, WebEnv = NCID_1_19150...)
```

We can now use the `web_history` object to refer to all those IDs in later calls using `entrez_link` or `entrez_fetch`. Here we will just fetch complete records of the first 5 SNPs in tabular "cluster report" format:

``` r
recs <- entrez_fetch(db="snp", web_history=snp_search$web_history, retmax=5, rettype="rsr")
recs
# [1] "snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID\n----------- ------------- ----------------------------------------\n781782243\t1556739404\t1000GENOMES\tPHASE3_chrY_197777\n\nsnp_id(rs) subsnp_id(ss) submitter handle submitter snp ID\n----------- ------------- ----------------------------------------\n781782114\t1556776570\t1000GENOMES\tPHASE3_chrY_234943\n\nsnp_id(rs) subsnp_id(ss) submitter handle submitter snp ID\n----------- ------------- ----------------------------------------\n781781837\t1694440247\tEVA_EXAC\tEXAC_0.3.X:g595429g>a\n\nsnp_id(rs) subsnp_id(ss) submitter handle submitter snp ID\n----------- ------------- ----------------------------------------\n781781523\t1536954443\tDDI\tkw335948\n781781523\t1577678940\tEVA_GENOME_DK\tgatk.Y:g19138349cta>c\n\nsnp_id(rs) subsnp_id(ss) submitter handle submitter snp ID\n----------- ------------- ----------------------------------------\n781780734\t1553226422\t1000GENOMES\tPHASE3_chrX_246\n\n"
y_chrom_linked_disaese <- entrez_link(dbfrom="snp", db="omim", web_history=snp_search$web_history)
cat(recs, "\n")
# snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID
# ----------- ------------- ----------------------------------------
# 781782243 1556739404 1000GENOMES PHASE3_chrY_197777
#
# snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID
# ----------- ------------- ----------------------------------------
# 781782114 1556776570 1000GENOMES PHASE3_chrY_234943
#
# snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID
# ----------- ------------- ----------------------------------------
# 781781837 1694440247 EVA_EXAC EXAC_0.3.X:g595429g>a
#
# snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID
# ----------- ------------- ----------------------------------------
# 781781523 1536954443 DDI kw335948
# 781781523 1577678940 EVA_GENOME_DK gatk.Y:g19138349cta>c
#
# snp_id(rs) subsnp_id(ss) submitter handle submitter snp ID
# ----------- ------------- ----------------------------------------
# 781780734 1553226422 1000GENOMES PHASE3_chrX_246
#
#
```
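
Fetching more than a handful of records from a web history is usually done in pages. The sketch below shows one way to compute the page offsets and loop over them. `page_starts`, `fetch_in_pages` and the `fetch_page` callback are hypothetical helpers written for this illustration, not part of `rentrez`, and the commented-out callback assumes that `entrez_fetch` passes `retstart` through to the NCBI API in the same way it passes `retmax`:

``` r
# Zero-based offsets needed to fetch `total` records in pages of `page_size`
page_starts <- function(total, page_size) {
  stopifnot(total >= 0, page_size > 0)
  if (total == 0) return(integer(0))
  seq(0, total - 1, by = page_size)
}

# Call `fetch_page(retstart, retmax)` once per page and collect the results.
# `fetch_page` is a hypothetical callback that returns one string per page.
fetch_in_pages <- function(total, page_size, fetch_page) {
  vapply(
    page_starts(total, page_size),
    function(start) {
      # The last page may be shorter than page_size
      fetch_page(retstart = start, retmax = min(page_size, total - start))
    },
    character(1)
  )
}

# Offline check with a stand-in callback that only reports what it was asked for
pages <- fetch_in_pages(12, 5, function(retstart, retmax) paste(retstart, retmax))
pages

# With rentrez, the callback might look like this (assumption: retstart is
# forwarded to the API; not run here because it needs network access):
# fetch_page <- function(retstart, retmax) {
#   entrez_fetch(db = "snp", web_history = snp_search$web_history,
#                rettype = "rsr", retstart = retstart, retmax = retmax)
# }
```

Keeping the offset arithmetic separate from the network call makes it easy to test the paging logic without hitting the NCBI servers.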

### Getting information about NCBI databases
@@ -201,7 +225,7 @@ entrez_db_summary("cdd")
# Description: Conserved Domain Database
# DbBuild: Build150814-1106.1
# Count: 50648
# LastUpdate: 2015/08/14 18:28
# LastUpdate: 2015/08/14 18:49
```

`entrez_db_searchable()` lets you discover the fields available for search terms for a given database. You get back a named list whose names are the search fields. Each element has additional information about the named search field (you can also use `as.data.frame` to create a data frame with one search field per row):
@@ -212,7 +236,7 @@ search_fields$GRNT
# Name: GRNT
# FullName: Grant Number
# Description: NIH Grant Numbers
# TermCount: 2220883
# TermCount: 2220915
# IsDate: N
# IsNumerical: N
# SingleToken: Y