Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
547 lines (511 sloc) 19.4 KB
% This is a Sweave file. It's a type of "literate progamming"
% developed by Donald Knuth for use with C programs. It's nice as you
% can mix LaTeX code (for document markup) with R code and generate
% reports that are reproducible and transparent. If you rerun this
% file with the same versions of R, diversitree, and assorted packages
% (see the end of the file) then you should get exactly the same
% output as I did when I produced the PDF.
%
% The simplest way of compiling the file is to type
% make
% This requires that you have the following programs installed and in
% your PATH:
% * make installed (most set-ups with a working compiler will have
% this installed, but this is least likely on Windows)
% * a working LaTeX distribution
% * R
% * Sweave.sh, available from ggorjan's website (I have version from 2009).
%
% You also need to have installed the R packages diversitree and
% cacheSweave.
%
% Some of the analysis takes a long time to run (minutes or hours).
% Using cacheSweave shortens most of the re-run time the file
% considerably, by reusing calculations when code chunks have not
% changed. In addition, the MCMC samples generated are manually
% cached in the 'cache' directory and can be directly loaded.
% The process (carried out by make, but documented here to some degree)
%
% 1. Sweave.sh is run on primates.Rnw. This extracts the code chunks
% (between "<<...>>" and "@" below), and runs them through R,
% replacing them with pretty input and output. This creates the file
% 'primates.tex'
%
% 2. LaTeX and BibTeX (and then LaTeX again) are run on primates.tex
% to generate primatex.pdf
%
% If you want to access the R session after it has finished, from an R
% session in this directory, type source(".interactive") to run the
% example, but keep R running. All the created objects are now usable.
\documentclass{article}
\usepackage{fullpage}
\usepackage{natbib}
\usepackage{amsmath}
\usepackage{color}
\definecolor{navy}{rgb}{0,0,0.4}
\usepackage[colorlinks,citecolor=navy,linkcolor=navy,urlcolor=navy]{hyperref}
\newcommand{\code}{\texttt}
\newcommand\codeh[1]{\texttt{\hyphenchar\font=45\relax #1}}
\usepackage{Sweave}
\usepackage{dtsweave}
\SweaveOpts{prefix.string=cache/primates,eps=FALSE}
<<results=hide,echo=FALSE>>=
dir.create("cache", FALSE)
if ( require(cacheSweave) )
setCacheDir("cache")
options(show.signif.stars=FALSE, continue=" ", width=70)
@
% Install using TeX's "getnonfreefonts" script, or comment out.
\usepackage[garamond]{mathdesign}
% These are for automatic inclusion into R's vignette system, once
% this file is added to the package.
%\VignetteIndexEntry{Primate MuSSE analysis}
%\VignettePackage{diversitree}
\title{Using MuSSE to investigate multitrait diversification in primates}
\author{Rich FitzJohn, March 2012}
% Not a date, but useful.
\date{Supplementary information to: ``Diversitree: Comparative
phylogenetic analyses of diversification in R''}
\begin{document}
\maketitle
% This is to flag the beginning of the section to be extracted for
% inclusion into the diversitree paper supp mat (see the Makefile).
%%% BEGIN
This is a full version of the analysis of social structure and mating
system in primates. However, it is not a reference and the reader is
directed to the diversitree help pages for further information (in
particular the help page \code{?make.musse.multitrait}).
% The language here assumes that this is part of the supp. mat. for
% the diversitree paper, or for my thesis. It should be changed in a
% standalone version, or removed when put into the main tutorial.
This section is a ``Sweave'' document: a mix of R and \LaTeX\ code
\citep[see][]{Sweave}.
%
It is possible to recompile the document, which runs the R code and
regenerates this output. For instructions on how to do this, please
see the instructions at the top of the \code{primates.Rnw} source
file, available on the diversitree github site\footnote{%
\url{https://github.com/richfitz/diversitree/tree/master/pub/example/}}.
%
To run the examples, or recompile this file, you will need two data
files: ``\code{primates-100.nex}'' containing the phylogenetic trees
and ``\codeh{primates-social.csv}'' containing the trait data. These
are also available on the diversitree github
site\footnote{%
\url{https://github.com/richfitz/diversitree/tree/master/pub/example/data},
or clone the diversitree github repository and run the analysis from
the \code{diversitree/pub/example} directory.}. I assume that these
files are in the directory ``\code{data}''.
<<>>=
library(diversitree)
@
First, load the distribution of trees. These were generated by Tyler
Kuhn, using the approach in \citet{Kuhn-2011-427}. Here, we will use
just the first tree to demonstrate the methods.
<<>>=
trees <- read.nexus("data/primates-100.nex")
tree <- trees[[1]]
@
%
Next, load the species trait data; the first line below creates a
data.frame with the species names as row names. The two data columns
are ``M'', and ``S'', which are \code{TRUE} when the species is
monogamous and social, respectively. Alternatively, these could be
integer values with ``0'' and ``1'' being equivalent to \code{FALSE}
and \code{TRUE}, respectively (I will refer to states 0 and 1 below).
<<>>=
dat <- read.csv("data/primates-social.csv", row.names=1)
head(dat)
@
Note that some of the species lack state information. The
distribution of traits on the phylogeny is shown in Figure
\ref{fig:primate-tree}.
\setkeys{Gin}{width=.95\textwidth}
\begin{figure}[p]
\centering
<<tree,fig=TRUE,width=6,height=6.5,echo=FALSE>>=
cols <- list(M=c("#a6cee3", "#1f78b4"), S=c("#fdbf6f", "#ff7f00"))
genus <- sub("_.+$", "", tree$tip.label) # extract genus name
trait.plot(tree, dat, cols, lab=c("Monogamous", "Solitary"),
str=c("no", "yes"), genus)
@
<<eval=FALSE>>=
<<tree>>
@
\caption[Primate phylogeny, showing distribution of monogamy and
solitariness among species]{Primate phylogeny, showing states of the
two traits considered here: monogamy (blue, inner circle) and
solitariness (orange, outer circle). Species are grouped by genus
(or groups of genera in the case of polyphyletic groupings).}
\label{fig:primate-tree}
\end{figure}
Start by creating a simple model, in which the speciation and
extinction rates do not depend on the character state, and the two
traits have forward ($0\to1$) and backward ($1\to0$) transition rates
that do not depend on the state of the other trait.
<<>>=
lik.0 <- make.musse.multitrait(tree, dat, depth=0)
@
All diversitree likelihood functions take as their first argument a
vector of parameters. To get the vector of names for the parameters,
use the \code{argnames} function:
<<>>=
argnames(lik.0)
@
%
This shows the six parameters: the speciation rate (\code{lambda0}),
extinction rate (\code{mu0}) and six transition rates (e.g.,
\code{qM01.0} is the rate of transition of the breeding system from
non-monogamous to monogamous). The ``\code{0}'' after all parameter
names indicates that these are intercepts.
To perform a maximum likelihood (ML) analysis, we search for the
parameter vector with the highest likelihood. To do this, diversitree
uses a numerical optimisation routine \citep[by default,
subplex;][]{subplex}; most algorithms start at some point in parameter
space and work ``uphill'' finding parameters that improve the
likelihood until no further improvement is possible. To start this
search, we therefore need a starting parameter vector from which the
ML point is reachable.
%
It is not possible in general to prove that a point will lead to the
ML point, or that the best point found really is the ML point.
However, the state-independent speciation and extinction, combined
with reasonable guesses for the character state transition rates
appears to converge with reasonable success:
%
<<>>=
p.0 <- c(starting.point.bd(tree), rep(.1, 4))
names(p.0) <- argnames(lik.0)
@
%
It is good practice to confirm that this point has finite likelihood:
<<>>=
lik.0(p.0)
@
%
We can then carry out the ML search, with \code{find.mle}:
<<ml0,cache=TRUE>>=
fit.0 <- find.mle(lik.0, p.0)
@
%
This returns an object of class ``\code{fit.mle}'', which has a
\code{lnLik} element with the log-likelihood at the ML point, and a
\code{par} element with the ML parameter vector. This object is
described in more detail in the help page \code{?find.mle}.
%
As with other model fits, the \code{coef} function can access parameters:
<<>>=
fit.0$lnLik
round(coef(fit.0), 4)
@
%$ % emacs
Now, we expand this model to allow state-dependent diversification.
To make a likelihood function that includes ``main effects'' of the
two traits for speciation and extinction, but leaves the character
state change parameters the same, we use \code{depth=c(1, 1, 0)}.
<<>>=
lik.1 <- make.musse.multitrait(tree, dat, depth=c(1, 1, 0))
argnames(lik.1)
@
%
(Specifying \code{depth=c(1,1,1)}, or equivalently \code{depth=1},
would also introduce main effects for the transition rates, which
would then mean we would have to determine why \code{lik.1} might fit
better than \code{lik.0} --- the additional degrees of freedom from
state dependent diversification or from correlated character
evolution.)
%
To find the ML point for this model, we again need a starting
parameter vector. The model \code{lik.0} is a special case of model
\code{lik.1}, so it makes sense to start the ML point found in
\code{fit.0}. To do this, expand the parameter vector of the model in
\code{lik.0} to include main effects, but set them equal to zero:
<<>>=
p.1 <- rep(0, length(argnames(lik.1)))
names(p.1) <- argnames(lik.1)
p.1[names(coef(fit.0))] <- coef(fit.0)
round(p.1, 4)
@
(compare this parameter vector to \code{coef(fit.0)}, above).
%
The parameter vector \code{p.1} must have the same likelihood as the
previous ML fit:
\begin{samepage}
<<>>=
lik.0(coef(fit.0))
lik.1(p.1)
@
\end{samepage}
Find the ML point for the model that includes main effects:
<<ml1,cache=TRUE>>=
fit.1 <- find.mle(lik.1, p.1)
@
This model fits substantially better than the state-independent model,
with a likelihood improvement of
%
\Sexpr{round(fit.1$lnLik - fit.0$lnLik, 1)}:
<<>>=
fit.1$lnLik - fit.0$lnLik
@
With a difference of $4$ parameters, we can compare twice this value to
a $\chi^2_4$ and see that the improvement is statistically
significant.
<<>>=
1 - pchisq(2*(fit.1$lnLik - fit.0$lnLik), 4)
@
The \code{anova} function does these likelihood ratio tests
automatically, also reporting AIC values:
<<>>=
anova(fit.1, noSDD=fit.0)
@
We can expand the model further to include interaction terms; is a
combination of mating system and sociality associated with elevated
speciation or extinction?
<<>>=
lik.2 <- make.musse.multitrait(tree, dat, depth=c(2, 2, 0))
argnames(lik.2)
@
Now fit the model, starting from the ML point found in \code{fit.1},
expanded to include intercept terms for both speciation and extinction
rates:
<<ml2,cache=TRUE>>=
p.2 <- rep(0, length(argnames(lik.2)))
names(p.2) <- argnames(lik.2)
p.2[names(coef(fit.1))] <- coef(fit.1)
fit.2 <- find.mle(lik.2, p.2)
@
Comparing this against the no-interaction fit, there is no significant
improvement in model fit:
<<>>=
anova(fit.2, noInteraction=fit.1)
@
The improvement in fit between \code{fit.1} and \code{fit.0} could be
due to either of the monogamy or sociality traits (or both). We can
determine the contribution of each by constructing models that omit
main effects of each trait.
%
The \code{constrain} function can be used to simplify models by
removing parameters. Parameters can either be set to be equal to a
different parameter or to a constant. For example, to remove the main
effect of the breeding system trait on speciation from the
\code{lik.1} model, we can enter:
<<>>=
lik.1M <- constrain(lik.1, lambdaM ~ 0)
@
Notice that \code{lambdaM} is no longer present in the \code{argnames}
of this likelihood function:
<<>>=
argnames(lik.1M)
@
Similarly, for sociality:
<<>>=
lik.1S <- constrain(lik.1, lambdaS ~ 0)
@
These models can then be fit as before (adjusting the starting point to
account for the reduction in the number of parameters)
<<mlMS,cache=TRUE>>=
fit.1M <- find.mle(lik.1M, p.1[argnames(lik.1M)])
fit.1S <- find.mle(lik.1S, p.1[argnames(lik.1S)])
@
There is a significant drop in likelihood when the breeding system
(monogamy/non-monogamy) speciation main effect is dropped from the
model, but the social/nonsocial trait association is marginally
non-significant (both comparisons here are made against \code{fit.1}):
<<>>=
anova(fit.1, noM=fit.1M, noS=fit.1S)
@
Alternatively, we might run use Markov chain Monte Carlo
(\textsc{mcmc}) to sample from the posterior distribution of the
\code{lambdaM} and \code{lambdaS} values.
%
This procedure requires a prior probability distribution for the
parameters. Here, I use a prior on the actual rates in the model, not
the multitrait parametrisation (partly because it is easier to
constrain the former to being positive; negative speciation and
extinction rates are not allowed, and partly because I do not have a
strong prior belief about the form of the main effect parameters).
%
Given a multi-trait parametrisation, the underlying rate parameters
can be found by passing in the argument \code{pars.only=TRUE} to the
likelihood function, which returns the underlying parameters without
computing the likelihood:
<<>>=
round(coef(fit.1), 3)
round(lik.1(coef(fit.1), pars.only=TRUE), 3)
@
To specify an exponential prior with a mean set to twice the
state-independent diversification rate:
<<>>=
r <- p.0[[1]] - p.0[[2]]
prior1 <- make.prior.exponential(1/(2*r))
@
Using the translation above, we can make a function that takes
as arguments the multitrait parametrisation and returns the prior
probability density:
<<>>=
prior <- function(pars)
prior1(lik.1(pars, pars.only=TRUE))
@
% If there is a file cache/samples.Rdata, this will be loaded instead,
% as the rerunning the output will take about 4 hours. This section
% uses a little bit of Sweave magic to make the output look nice while
% still remaining reproducible.
Running the \text{mcmc} for 10,000 steps (this takes several hours, and by
default will print the parameters visited on each step and their
posterior probability).
<<mcmc-musse,eval=FALSE>>=
samples <- mcmc(lik.1, p.1, nsteps=10000, w=0.5, prior=prior)
@
<<results=hide,echo=FALSE>>=
.save <- "cache/samples-musse.Rdata"
if ( file.exists(.save) ) {
load(.save)
} else {
<<mcmc-musse>>
save(samples, file=.save)
}
@
%
The posterior distributions for the main effects of the speciation
rates (\code{lambda.M} and \code{lambda.S}) are concentrated below
zero, with the 95\% credibility interval below zero (Figure
\ref{fig:primates-mcmc}). Therefore, we can conclude both traits have
a negative effect on speciation.
\vspace{2em}
For completeness, I will show how a similar analysis would proceed if
we treat each trait separately with BiSSE. First, we will need two
named state vectors --- one for each trait:
<<>>=
st.m <- dat$M
names(st.m) <- rownames(dat)
st.s <- dat$S
names(st.s) <- rownames(dat)
@
Then we build likelihood functions (using \code{make.bisse}).
<<>>=
lik.m <- make.bisse(tree, st.m)
lik.s <- make.bisse(tree, st.s)
@
We can then run \textsc{mcmc} chains from a ``sensible'' starting
point (see the help page \code{?starting.point.bisse}):
<<>>=
p <- starting.point.bisse(tree)
@
Running the chains:
<<mcmc-bisse,eval=FALSE>>=
samples.m <- mcmc(lik.m, p, nsteps=10000, w=.5, prior=prior1)
samples.s <- mcmc(lik.s, p, nsteps=10000, w=.5, prior=prior1)
@
<<results=hide,echo=FALSE>>=
## As above, cache these manually, though they are less slow to run
## (maybe 1.5 hours together).
.save <- "cache/samples-bisse.Rdata"
if ( file.exists(.save) ) {
load(.save)
} else {
<<mcmc-bisse>>
save(samples.m, samples.s, file=.save)
}
@
%
For a single trait, the difference in the speciation rates (i.e.,
$\lambda_1 - \lambda_0$) is mathematically equivalent to the main
effect of that trait.
%
Monogamy is significantly associated with decreased speciation rates
(the 95\% credibility interval of $\lambda_M$ is below zero; Figure
\ref{fig:primates-mcmc}). However, the effect of solitariness is no
longer significant.
\setkeys{Gin}{width=1.0\textwidth}
\begin{figure}[p]
\centering
<<profiles,fig=TRUE,width=7,height=5,echo=FALSE>>=
## par(bty="l", tcl=.3)
lab <- function(str) {
usr <- par("usr")
pin <- par("pin")
z <- 1.5*strheight("A", "inches")
text(usr[1] + diff(usr[1:2])*z/pin[1],
usr[4] - diff(usr[3:4])*z/pin[2],
str, adj=c(0, 1))
}
col <- c("#1f78b4", "#b2df8a")
col.fill <- diversitree:::add.alpha(col, .5)
cmp <- data.frame(M=(samples.m$lambda1 - samples.m$lambda0),
S=(samples.s$lambda1 - samples.s$lambda0))
par(mfrow=c(2, 1), mar=rep(.5, 4), oma=c(4, 4, 0, 0))
xlim <- c(-.3, 0.15)
profiles.plot(cmp[-(1:500),],
col.line=col, las=1, xlab="", xaxt="n",
ylab="", xlim=xlim)
axis(1, labels=FALSE, col=NA, col.ticks="black")
abline(v=0, lty=2)
lab("a) Independent\n(BiSSE)")
legend("topright", c("lambda (+monogamy)", "lambda (+solitary)"),
fill=col.fill, bty="n")
profiles.plot(samples[-(1:500),c("lambdaM", "lambdaS")],
col.line=col, las=1, xlab="",
ylab="", xlim=xlim)
abline(v=0, lty=2)
usr <- par("usr")
lab("b) Simultaneous\n(MuSSE multitrait)")
mtext("Trait effect on speciation", 1, 2, TRUE)
mtext("Posterior probability density", 2, 2, TRUE)
@
\caption{Posterior probability distributions for monogamy and
breeding system main effects on speciation in primates.
% TODO:
In the top panel,
%
The bars at the bottom of the distributions
and the shaded areas correspond to the 95\% credibility
intervals.}
\label{fig:primates-mcmc}
\end{figure}
\setkeys{Gin}{width=0.8\textwidth}
Similarly, we can run an ML analysis. First, fit the full six
parameter model for each trait:
<<ml-bisse,cache=TRUE>>=
fit.m <- find.mle(lik.m, p)
fit.s <- find.mle(lik.s, p)
@
Then fit constrained models where $\lambda_1$ is set equal to
$\lambda_0$ for both traits:
<<ml-bisse-MS,cache=TRUE>>=
lik.m.eqL <- constrain(lik.m, lambda1 ~ lambda0)
fit.m.eqL <- find.mle(lik.m.eqL, p[argnames(lik.m.eqL)])
lik.s.eqL <- constrain(lik.s, lambda1 ~ lambda0)
fit.s.eqL <- find.mle(lik.s.eqL, p[argnames(lik.s.eqL)])
@
%
Comparing the ``with state-dependent speciation'' model against the
simpler model that lacks state-dependent speciation with a likelihood
ratio test for the monogamy trait:
<<>>=
anova(fit.m, equal.lambda=fit.m.eqL)
@
and the solitariness trait
<<>>=
anova(fit.s, equal.lambda=fit.s.eqL)
@
%
confirms the previous BiSSE result: there is evidence of
state-dependent speciation for monogamy, but not for solitariness.
Note that it is not possible to directly compare the multitrait MuSSE
model with the BiSSE model, because these models explain different
data; MuSSE accounts for the observed distribution of multiple traits,
BiSSE accounts only for one trait. Likelihood ratio tests are only
valid where models are nested and where the data are identical between
both models. Attempting a non-nested comparison will generate an
error. Instead, to perform such comparisons, the user must generate
constrained models using MuSSE, which accounts for all of the trait
data but allows only one of the traits to affect diversification (as
performed using \code{lik.1M} and \code{lik.1S} above).
%%% END
\bibliographystyle{refstyle}
\bibliography{../refs}
\end{document}
% LocalWords: diversitree Sweave Rnw DOI dat csv lapply speciation lik qM mle
% LocalWords: argnames polyphyletic Euoticus lnLik coef signif pchisq anova
% LocalWords: noSDD sociality lambdaM lambdaS noM noS MCMC parametrisation str
% LocalWords: mcmc eval rownames bisse multitrait MuSSE usr