Skip to content
Permalink
master
Go to file
 
 
Cannot retrieve contributors at this time
609 lines (477 sloc) 27 KB
% Treebase: An R package for discovery, access and manipulation of online phylogenies
1. The TreeBASE portal is an important and rapidly growing repository of
phylogenetic data. The R statistical environment has also become
a primary tool for applied phylogenetic analyses across a range
of questions, from comparative evolution to community ecology to
conservation planning.
2. We have developed `treebase`, an open-source software package (freely available from
[http://cran.r-project.org/web/packages/treebase](http://cran.r-project.org/web/packages/treebase))
for the R programming environment, providing simplified, *programmatic* and
interactive access to phylogenetic data in the TreeBASE repository.
3. We illustrate how this package creates a bridge between the TreeBASE
repository and the rapidly growing collection of R packages for
phylogenetics that can reduce barriers to discovery and integration
across phylogenetic research.
4. We show how the `treebase` package can be used to facilitate replication
of previous studies and testing of methods and hypotheses across
a large sample of phylogenies, which may help make such
important reproducibility practices more common.
#### Keywords
R, software, API, TreeBASE, database, programmatic, workflow
Introduction
============
Applications that use phylogenetic information as part of their analyses
are becoming increasingly central to both evolutionary and ecological
research. The exponential growth in genetic sequence data available
for all forms of life has driven rapid advances in the methods that
can infer the phylogenetic relationships and divergence times across
different taxa [@Huelsenbeck2001b; @Stamatakis2006; @Drummond2007].
The product of one field has become the raw data of the next.
Unfortunately, while the discipline of bioinformatics has emerged to
help harness and curate the wealth of genetic data with cutting edge
computer science, statistics, and Internet technologies, its counterpart
in evolutionary informatics remains “scattered, poorly documented,
and in formats that impede discovery and integration” [@Parr2011a]. Our
goal in developing the `treebase` package is to provide steps to reduce
these challenges through programmatic and interactive access between the
repositories that store this data and the software tools commonly used
to analyse them.
The R statistical environment [@RTeam2012] has become a dominant
platform for researchers using phylogenetic data to address a
rapidly expanding set of questions in ecological and evolutionary
processes. These methods include, but are not limited to, ancestral
state reconstruction [@Paradis2004; @Butler2004], diversification
analysis [@Paradis2004; @Rabosky2006b; @Harmon2008], identifying
trait dependent speciation and extinction rates, [@Fitzjohn2010;
@Goldberg2011; @Stadler2011], quantifying the rate and tempo of
trait evolution [@Butler2004; @Harmon2008; @Eastman2011], identifying
evolutionary influences and proxies for community ecology [@Webb2008;
@Kembel2010], connecting phylogeny data to climate patterns [@Warren2008;
@Evans2009b], and simulation of speciation and character evolution
[@Harmon2008; @Stadler2011a; @Boettiger2012], as well as various
manipulations and visualizations of phylogenetic data [@Paradis2004;
@Schliep2010; @Jombart2010; @Revell2011]. A more comprehensive list
of R packages by analysis type is available on the phylogenetics taskview,
[http://cran.r-project.org/web/views/Phylogenetics.html](http://cran.r-project.org/web/views/Phylogenetics.html).
Libraries and packages are developed for use in other general purpose programming environments and languages, including Java [@Maddison2011],
MATLAB [@Blomberg2003] and Python [@Sukumaran2010] and online interfaces
[@Martins2004]. In particular, the Bio::Phylo toolkit [@Vos2011] not
only provides a PERL implementation of some of the common phylogenetic
simulation and visualization tools found in these R libraries, but can
already provide programmatic access to TreeBASE. Our goal is to bring
similar functionality to the larger suite of applied phylogenetics
methods and user in the R community.
TreeBASE ([http://treebase.org](http://treebase.org)) is an online
repository of phylogenetic data (e.g. trees of species, populations,
or genes) that have been published in a peer-reviewed academic journal,
book, thesis or conference proceedings [@Sanderson1994b; @Morell1996]. The
database can be searched through an online interface which allows users
to find a phylogenetic tree from a particular publication, author or
taxa of interest. TreeBASE provides an application programming interface
(API) that lets computer applications make queries to the database,
known as PhyloWS [@Vos2010]. Our `treebase` package uses this API to
create a direct link between this data and the R environment. This has
several immediate and important benefits:
1. *Data discovery.* Users can leverage the rich, higher-level
programming environment provided by the R environment to better identify
data sets appropriate for their research by iteratively constructing
queries for datasets that match appropriate metadata requirements.
2. *Programmatic data access.* Many tasks that are theoretically made
possible by the creation of the Web-base interface to the TreeBASE
repository are not pursued because they would be too laborious for an
exploratory analysis. The ability to use programmatic access across data
sets to automatically download and perform a reproducible and systematic
analysis using the rich set of tools available in R opens up new avenues
for research.
3. *Automatic updating*. The TreeBASE repository is expanding rapidly.
The scriptable nature of analyses run with our `treebase` package
means that a study can be rerun on the latest version of the repository
without additional effort but with potential new information.
Programmatic Web Access
-----------------------
The TreeBASE repository makes data accessible via Web queries through a
RESTful (REpresentational State Transfer) interface, which supplies search
conditions in the address URL. The repository returns the requested
data in XML (extensible markup language) format, conforming to the RSS1.0
standard. Because the RSS1.0 format allows the search results to also be
viewed in a human-readable format in common browsers such as Safari and
Firefox, the `treebase` package echoes this URL to the console, so that
the user can explore the results in the browser as well. The `treebase`
package uses the `RCurl` package [@RCurl] to make queries over the Web
to the repository, and the `XML` package [@XML] to parse the Web page
returned by the repository into meaningful R data objects. While these
querying and parsing functions comprise most of the code provided in the
`treebase` package, they are hidden from the end user who can interact
with these rich data retrieval and manipulation tools to access data
from these remote repositories in much the same way as data locally
available on the users hard-disk.
While the TreeBASE repository provides phylogenies in both the traditional
Nexus file format and the more data-rich NeXML format [@Vos2012], none
of the R packages currently available for phylogenetic research are
positioned to read these NeXML files. The next version of the `treebase`
package will provide extraction of metadata information from the NeXML
through XML parsing.
Basic queries
--------------
``` {r libs, echo=FALSE, cache=FALSE}
library(treebase)
library(ggplot2)
theme_set(theme_bw())
````
The `treebase` package allows these queries to be made directly from R.
Programmatic access also allows a user to go beyond the utilities of the
Web interface, constructing more complicated queries and allowing the
user to maintain a record of the commands used to collect their data as
an R script. Scripting the data-gathering process helps reduce errors
and assists in replicating the analysis later, either by the authors or
other researchers [@Peng2011a].
The `search_treebase` function forms the base of the `treebase` package.
Table 1 lists each of the types of queries available through the
`search_treebase` function. This list can also be found in the function
documentation through the R command `?search_treebase`.
Any of the queries available on the Web interface can now be made
directly from R, including downloading and importing a phylogeny into
the R session. For instance, one can search for phylogenies containing
dolphin taxa, "Delphinus," or all phylogenies submitted by a given
author, "Huelsenbeck" using the R commands
``` {r basicQueries, eval=FALSE }
search_treebase("Delphinus", by="taxon")
search_treebase("Huelsenbeck", by="author")
````
The `search_treebase` function returns the matching phylogenies as an
R object, ready for analysis. The package documentation provides many
examples of possible queries.
Search "by=" PURPOSE
--------------- -----------------------------------------------------
abstract search terms in the publication abstract
author match authors in the publication
subject Matches in the subject terms
doi The unique object identifier for the publication
ncbi NCBI identifier number for the taxon
kind.tree Kind of tree (Gene tree, species tree, barcode tree)
type.tree Type of tree (Consensus or Single)
ntax Number of taxa in the matrix
quality A quality score for the tree, if it has been rated.
study Match words in the title of the study or publication
taxon Taxon scientific name
id.study TreeBASE study ID
id.tree TreeBASE's unique tree identifier (Tr.id)
id.taxon Taxon identifier number from TreeBase
tree The title for the tree
Table: Queries available in `search_treebase`. The first argument is
the keyword used in the query such as an author's name and the second
argument indicates the type of query (_i.e._ "author").
Accessing all phylogenies
-------------------------
For certain applications, a user may wish to download all the available
phylogenies from TreeBASE. Using the `cache_treebase` function allows
a user to download a local copy of all trees. Because direct database
dumps are not currently available from treebase.org, this function has
intentional delays to avoid overtaxing the TreeBASE servers, and should
be allowed a full day to run.
``` {r buildcache, eval=FALSE }
treebase <- cache_treebase()
```
Users should still be mindful that these servers are a shared
community resource and not place many queries at once. Users
running large jobs should consider joining the TreeBASE mailing list
([http://sourceforge.net/mailarchive/forum.php?forum_name=treebase-users](http://sourceforge.net/mailarchive/forum.php?forum_name=treebase-users))
to discuss such queries ahead of time. When query
Once run, the cache is saved
compactly in memory where it can be easily and quickly restored. For
convenience, the `treebase` package comes with a copy already cached,
which can be loaded into memory.
``` {r loadcache}
data(treebase)
````
The cache included in the package will be updated during major package
revisions. The timestamp of the cache provided can be viewed in the help
file for the data object using the command `?treebase` (Current cache is May 14, 2012). All queries from `metadata()` and `search_treebase()` are run against the current online version of the database.
All of the examples shown in this manuscript are run as shown using
the `knitr` package for authoring dynamic documents [@knitr], which
helps ensure the results shown are reproducible. These examples can
be updated by copying and pasting the code shown into the R terminal,
or by recompiling the entire manuscript from the source files found on
the development Web page for the TreeBASE package,
[github.com/ropensci/treebase](https://github.com/ropensci/treebase).
Data was accessed to produce the examples shown on `r date()`.
Data discovery in TreeBASE
==========================
Data discovery involves searching for existing data that meets
certain desired characteristics. Such searches take advantage of
metadata -- summary information describing the data entries provided in
the repository. The Web repository uses separate interfaces (APIs)
to access metadata describing the publications associated with the
data entered, such as the publisher, year of publication, etc., and
a different interface to describe the metadata associated with an
individual phylogeny, such as the number of taxa or the kind of tree
(_e.g._ gene tree versus species tree). The `treebase` package can query
these individual sources of metadata separately, but this information
is most powerful when used in concert -- allowing the construction of
complicated searches that cannot be automated through the Web interface.
The `metadata` function updates a list of all available metadata from
both APIs and returns this information as an R `data.frame`.
```{r metadatatable }
meta <- metadata()
````
From the number of rows of the metadata list we see that there are currently
`r prettyNum(length(unique(meta[["Study.id"]])))` published studies in
the database. The field columns provided by `metadata` are listed in Table II.
metadata field description
--------------- ---------------------------------------------------
Study.id TreeBASE study ID
Tree.id TreeBASE's unique tree identifier
kind Kind of tree (Gene tree, species tree, barcode tree)
type Type of tree (Consensus or Single)
quality A quality score for the tree, if it has been rated.
ntaxa Number of taxa in the matrix
date Year the study was published
author First author in the publication
title The title of the publication
Table: Columns of metadata available from the `metadata` function
Metadata can also be used to reveal trends in the data
deposition which may be useful in identifying patterns or biases in
research or emerging potential types of data. As a simple example, we
look at trends in the submission patterns of publishers over time:
``` {r journals}
date <- meta[["date"]]
pub <- meta[["publisher"]]
````
Many journals have only a few submissions, so we will label any not
in the top ten contributing journals as "Other":
``` {r top_journals}
topten <- sort(table(pub), decreasing=TRUE)[1:10]
meta[["publisher"]] <- as.character(meta[["publisher"]])
meta[["publisher"]][!(pub %in% names(topten))] <- "Other"
meta[["publisher"]] <- as.factor(meta[["publisher"]])
````
We plot the distribution of publication years for phylogenies deposited
in TreeBASE, color coding by publisher in Figure 1.
``` {r dates, fig.width=8, fig.height=3.5, fig.cap="Histogram of publication dates by year, with the code required to generate the figure.", dev.opts=list(pointsize=8) }
library(ggplot2)
library(reshape2)
df <- acast(meta, date ~ publisher, value.var='publisher', length)
df <- melt(df, varnames=c("date", "publisher"))
ggplot(df) + geom_area(aes(x=date,y=value, fill = publisher))
````
Typically we are interested in the metadata describing the phylogenies
themselves rather than just in the publications in which they appeared.
Phylogenetic metadata includes features such as the number of taxa
in the tree, a quality score (if available), kind of tree (gene tree,
species tree, or barcode tree) or whether the phylogeny represents a
consensus tree from a distribution or just a single estimate.
Even simple queries can illustrate the advantage of interacting with
TreeBASE data through an R interface has over the Web interface. A Web
interface can only perform the tasks built in by design. For instance,
rather than performing six separate searches to determine the number
of consensus vs single phylogenies available for each kind of tree,
we can construct a 2 by 2 table with a single line of code:
``` {r kind, eval=FALSE }
table(meta[["kind"]], meta[["type"]])
````
``` {r kindtable, results='asis', echo=FALSE }
output <- table(meta[["kind"]], meta[["type"]])
xtable::xtable(output)
````
Reproducible computations
=========================
Reproducible research has become a topic of increasing interest in
recent years, and facilitating access to data and using scripts that
can replicate analyses can help lower barriers to the replication of
statistical and computational results [@Schwab2000; @Gentleman2004;
@Peng2011a]. The `treebase` package facilitates this process, as we
illustrate in a simple example.
Consider the shifts in speciation rate identified by @Derryberry2011
on a phylogeny of ovenbirds and treecreepers. We will seek to not
only replicate the results the authors obtained by fitting the models
provided in the R package `laser` [@Rabosky2006b], but also compare them
against methods presented in @Stadler2011 and implemented in the package
`TreePar`, which permits speciation models that were not available to
@Derryberry2011 at the time of their study.
Obtaining the tree
------------------
The most expedient way to identify the data uses the digital object identifier
(doi) at the top of most articles, which we use in a call to the
`search_treebase` function, such as
``` {r doiquery}
results <- search_treebase("10.1111/j.1558-5646.2011.01374.x", "doi")
````
The search returns a list, since some publications can contain many trees.
In this case our phylogeny is in the only element of the list.
Having imported the phylogenetic tree corresponding to this
study, we can quickly replicate their analysis of which diversification
process best fits the data. These steps can be easily implemented using
the phylogenetics packages we have just mentioned.
``` {r RRexamplelibs, echo=FALSE }
library(ape)
library(laser)
````
For instance, we can calculate the branching times of each node on the phylogeny,
``` {r RRexamp }
bt <- branching.times(results[[1]])
````
and then begin to fit each model the authors have tested, such as the pure birth model,
``` {r }
yule <- pureBirth(bt)
````
or the birth-death model,
``` {r }
birth_death <- bd(bt)
````
The estimated models are now available in the active R session where
we can further explore them as we go along. The appendix shows the
estimation and comparison of all the models originally considered by
@Derryberry2011.
In this fast-moving field, new methods often become available between the
time of submission and time of publication of a manuscript. For instance,
the more sophisticated models introduced in @Stadler2011 were not used
in the original study, but have since been made available in the recent
package, `TreePar`. These richer models permit a shift in the speciation
or extinction rate to occur multiple times throughout the course of
the phylogeny.
``` {r TreeSimError, echo=FALSE }
# Locale settings to be safe
Sys.setlocale(locale="C") -> locale
rm(list="locale") # clean up
````
We load the new method and format the phylogeny appropriately using the
R commands:
``` {r treepar }
library(TreePar)
x <- sort(getx(results[[1]]), decreasing = TRUE)
````
``` {r treepar_yule_demo }
yule_models <- bd.shifts.optim(x, sampling = c(1,1,1,1),
grid = 5, start = 0, end = 60, yule = TRUE)[[2]]
````
As a comparison of
speciation models is not the focus of this paper, the complete code
and explanation for these steps are provided in an appendix. Happily,
this analysis confirms the original author's conclusions, even when the
more general models of @Stadler2011 are considered.
Analyses across many phylogenies
================================
Large scale comparative analyses that seek to characterize evolutionary
patterns across many phylogenies are increasingly common in phylogenetic
methods research [_e.g._ @McPeek2007; @Phillimore2008; @McPeek2008;
@Quental2010; @Davies2011a]. Sometimes referred to by their authors as
meta-analyses, these approaches have focused on re-analyzing phylogenetic
trees collected from many different earlier publications. This is a
more direct approach than the traditional concept of meta-analysis where
statistical results from earlier studies are weighted by their sample size
without being able to access the raw data. Because the identical analysis
can be repeated on the original data from each study, this approach avoids
some of the statistical challenges inherent in traditional meta-analyses
summarizing results across heterogeneous approaches.
To date, researchers have gone through heroic efforts simply to assemble
these data sets from the literature. As described in @McPeek2007; (emphasis added)
> One data set was based on 163 published species-level molecular
phylogenies of arthropods, chordates, and mollusks. A PDF format file
of each article was obtained, and a digital snapshot of the figure was
taken in Adobe Acrobat 7.0. This image was transferred to a PowerPoint
(Microsoft) file and printed on a laser printer. The phylogenies included
in this study are listed in the appendix. _All branch lengths were
measured by hand from these printed sheets using dial calipers._
Despite the recent emergence of digital tools that could now facilitate
this analysis without mechanical calipers, [_e.g._ treesnatcher, @Laubach2007],
it is easier and less error-prone to pull properly
formatted phylogenies from the database for this purpose. Moreover, as the
available data increases with subsequent publications, updating earlier
meta-analyses can become increasingly tedious. Using `treebase`, a user
can apply any analysis they have written for a single phylogeny across
the entire collection of suitable phylogenies in TreeBASE, which can help
overcome such barriers to discovery and integration at this large scale.
Using the functions we introduced above, we provide a simple example that
computes the gamma statistic of @Pybus2000, which provides a measure
of when speciation patterns differ from the popular birth-death model.
Tests across many phylogenies
-----------------------------
A standard test of the constant rate of diversification is the gamma
statistic of @Pybus2000 which tests the null hypothesis that the rates
of speciation and extinction are constant. Under the null hypothesis,
The gamma statistic is normally distributed about 0; values larger than
0 indicate that internal nodes are closer to the tip then expected, while
values smaller than 0 indicate nodes farther from the tip then expected.
In this section, we collect all phylogenetic trees from TreeBASE and
select those with branch length data that we can time-calibrate using
tools available in R. We can then calculate the distribution of this
statistic for all available trees, and compare these results with those
from the analyses mentioned above. As we will use all trees in the
repository, we use the cached copy of TreeBASE phylogenies described
above to reduce load on TreeBASE servers.
We will only be able to use those phylogenies that include branch length data,
which we can determine from the `have_branchlength` function in the `treebase`
package. We drop those that do not from the data set,
``` {r branchlengthfilter}
have <- have_branchlength(treebase)
branchlengths <- treebase[have]
````
Like most comparative methods, this analysis will require ultrametric trees
(branch lengths proportional to time, rather than to the nucleotide substitution rate).
As most of these phylogenies are calibrated with branch length proportional
to mutational step, we must time-calibrate each of them first. The following function
drops trees which cannot meet the assumptions of the time-calibration function.
``` {r timetree, cache=TRUE}
timetree <- function(tree)
try(chronoMPL(multi2di(tree)), silent=TRUE)
tt <- drop_nontrees(sapply(branchlengths, timetree))
````
At this point we have `r length(tt)` time-calibrated phylogenies over which
we will apply the diversification rate analysis.
Computing the gamma test statistic to identify deviations from the
constant-rates model takes a single line,
``` {r }
gammas <- sapply(tt, gammaStat)
````
and the resulting distribution of the statistic across available
trees is shown Fig 2. While researchers have often considered this
statistic for individual phylogenies, we are unaware of any study that has
visualized the empirical distribution of this statistic across thousands
of phylogenies. The overall distribution appears slightly
skewed towards positive values. This could indicate increasing rate of
speciation or constant extinction rates. While differences in sampling
may account for much of the spread observed, the position and identity of
outlier phylogenies could suggest new hypotheses and potential directions for
further exploration.
``` {r gammadist, fig.width=6.2, fig.cap="Distribution of the gamma statistic across phylogenies in TreeBASE. Strongly positive values are indicative of an increasing rate of evolution (excess of nodes near the tips), very negative values indicate an early burst of diversification (an excess of nodes near the root)."}
qplot(gammas)+xlab("gamma statistic")
````
Conclusion
==========
While we have focused on examples that require no additional data
beyond the phylogeny, a wide array of methods combine this data with
information about the traits, geography, or ecological community of the
taxa represented. In such cases, we would need programmatic access to
the trait data as well as the phylogeny. The Dryad digital repository
([http://datadryad.org](http://datadryad.org)) is an effort in this
direction. While programmatic access to the repository is possible
through the `rdryad` package [@Chamberlain2012], variation in data
formatting must first be overcome before similar direct access to
the data is possible. Dedicated databases such as FishBASE
([http://fishbase.org](http://fishbase.org)) may be another alternative,
where morphological data can be queried for a list of species using the
`rfishbase` package [@rfishbase]. The development of similar software
for programmatic data access will rapidly extend the space and scale of
possible analyses.
The recent advent of mandatory data archiving in many of the major
journals publishing phylognetics-based research [_e.g._ @Fairbairn2010;
@Piwowar2011; @Whitlock2010], is a particularly promising development that
should continue to fuel the trend of submissions seen in Fig. 1. Accompanied
by faster and more inexpensive techniques of NextGen sequencing, and the
rapid expansion in phylogenetic applications, we anticipate this rapid
growth in available phylogenies will continue. Faced with this flood
of data, programmatic access becomes not only increasingly powerful but
an increasingly necessary way to ensure we can still see the forest for
all the trees.
Acknowledgements
================
CB wishes to thank S. Price for feedback on the manuscript, the
TreeBASE developer team for building and supporting the repository,
and all contributers to TreeBASE. CB is supported by a Computational
Sciences Graduate Fellowship from the Department of Energy under grant
number DE-FG02-97ER25308. The `treebase` package is is part of the
rOpenSci project ([ropensci.org](http://ropensci.org)).
``` {r save, echo=FALSE }
save(list=ls(), file="knit_treebase.rda")
````
# References
You can’t perform that action at this time.