From 1390cb8d8dc08321b2e3bc73ed7a69511dc0d301 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 18 Jul 2016 22:12:14 +0100 Subject: [PATCH 01/17] Hilde edits to +ch2. --- Chapter2.Rtex | 39 ++++++++++++++++++++------------------- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/Chapter2.Rtex b/Chapter2.Rtex index aa9d9e5..c87ce5b 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -176,7 +176,8 @@ In comparative studies, a number of host traits have been shown to correlate wit A further factor that may affect pathogen richness is population structure. In comparative studies it is often assumed that factors that promote fast disease spread should promote high pathogen richness; the faster a new pathogen spreads through a population, the more likely it is to persist \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. However, this assumption ignores competitive mechanisms such as cross-immunity and depletion of susceptible hosts. -If competitive mechanisms are strong, endemic pathogens in might be able to out0compete invading pathogens irrespective of population structure. +If competitive mechanisms are strong, endemic pathogens in populations with high $R_0$ will be able to easily out-compete invading pathogens. +Only if competitive mechanisms are weak will high $R_0$ enable the invasion of new pathogens and allow higher pathogen richness. Overall, the evidence from comparative studies indicates that increased population structure correlates with higher pathogen richness. This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter \ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). 
@@ -303,10 +304,10 @@ In particular, $\alpha = 0$ indicates no coinfections, $\alpha = 1$ indicates th \begin{figure}[t] {\centering -\subfloat[Fully connected\label{fig:fullyConnected}]{ +\subfloat[Minimally connected\label{fig:fullyConnected}]{ \includegraphics[width=0.45\textwidth]{imgs/minimallyConnected.pdf} } -\subfloat[Minimally connected +\subfloat[Fully connected \label{fig:minimallyConnected}]{ \includegraphics[width=0.45\textwidth]{imgs/fullyConnected.pdf} } @@ -314,7 +315,7 @@ In particular, $\alpha = 0$ indicates no coinfections, $\alpha = 1$ indicates th \caption[Network topologies used to compare network connectedness]{ The two network topologies used to test whether network connectedness influences a pathogen's ability to invade. A) Animals can only disperse to neighbouring colonies. -B) Dispersal can occur between any colony. +B) Dispersal can occur between any colonies. Blue circles are colonies of \SI{3000} individuals. Dispersal only occurs between colonies connected by an edge (black line). The dispersal rate is held constant between the two topologies. @@ -324,7 +325,7 @@ The dispersal rate is held constant between the two topologies. In the application of long term existence of pathogens it is necessary to include vital dynamics (births and deaths) as the SIR model without vital dynamics has no endemic state. -Birth and death rates ($\mu$ and $\Lambda$) are set as being equal meaning the population does not systematically increase or decrease. +Birth and death rates ($\Lambda$ and $\mu$) are set as being equal meaning the population does not systematically increase or decrease. The population size does however change as a random walk. New born individuals enter the susceptible class. Infection and coinfection were assumed to cause no extra mortality as for a number of viruses, bats show no clinical signs of infection \cite{halpin2011pteropid, deThoisy2016bioecological}. 
@@ -347,7 +348,7 @@ $x, y$ & Colony index &&\\ $p$ & Pathogen index i.e.\ $p\in\{1,2\}$ for pathogens 1 and 2 & &\\ $q$ & Disease class i.e.\ $q\in\{1,2,12\}$&\\ $S_x$ & Number of susceptible individuals in colony $x$ &&\\ -$I_qx$ & Number of individuals infected with disease(s) $q \in {1, 2, 12}$ in colony $x$ &&\\ +$I_{qx}$ & Number of individuals infected with disease(s) $q \in \{1, 2, 12\}$ in colony $x$ &&\\ $R_x$ & Number of individuals in colony $x$ in the recovered with immunity class &&\\ $N$ & Total Population size && 30,000\\ $m$ & Number of colonies&& 10\\ @@ -356,7 +357,7 @@ $a$ & Area & \si{\square\kilo\metre}& 10,000\\ $\beta$ & Transmission rate & & 0.1 -- 0.4\\ $\alpha$ & Coinfection adjustment factor & & 0.1\\ $\gamma$ & Recovery rate & year$^{-1}.$individual$^{-1}$ & 1\\ -$\xi$ & Dispersal & year$^{-1}.$individual$^{-1}$ & 0.001--0.1\\ +$\xi$ & Dispersal rate & year$^{-1}.$individual$^{-1}$ & 0.001--0.1\\ $\Lambda$ & Birth rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ $\mu$ & Death rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ $k_x$ & Degree of node $x$ (number of colonies that individuals from colony $x$ can disperse to). &&\\ @@ -373,7 +374,7 @@ $e_i$ & The rate at which event $i$ occurs & year$^{-1}$&\\ The population is modelled as a metapopulation, being divided into a number of subpopulations (colonies). This model is an intermediate level of complexity between fully-mixed populations and contact networks. There is ample evidence that bat populations are structured to some extent. -This evidence comes from the existence of subspecies, measurements of genetic dissimilarity and ecological studies provide \cite{kerth2011bats, mccracken1981social, burns2014correlates, wilson2005mammal}. +This evidence comes from the existence of subspecies, measurements of genetic dissimilarity and ecological studies \cite{kerth2011bats, mccracken1981social, burns2014correlates, wilson2005mammal}. 
Therefore a fully mixed population is a large oversimplification. However, trying to study the contact network relies on detailed knowledge of individual behaviour which is rarely available. @@ -381,9 +382,9 @@ The metapopulation is modelled as a network with colonies being nodes and disper Individuals within a colony interact randomly so that the colony is fully mixed. Dispersal between colonies occurs at a rate $\xi$. Individuals can only disperse to colonies connected to theirs by an edge in the network. -The rate of dispersal is not affected by the number of edges a colonies has (known as the degree of the colony and denoted $k$). +The rate of dispersal is not affected by the number of edges a colony has (known as the degree of the colony and denoted $k$). Therefore, the dispersal rate from a colony $y$ with degree $k_y$ to colony $x$ is $\xi / k_y$. -Note this rate is independent of the degree and size of colony $x$. +Note this rate is not affected by the degree and size of colony $x$. @@ -396,8 +397,8 @@ The Markov chain contains the random variables $((S_x)_{x = 1\ldots m}, (I_{x, q Here, $(S_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of susceptibles in each colony. $(I_{x, q})_{x =1\ldots m, q \in \{1, 2, 12\}}$ is a length $m \times 3$ vector describing the number of individuals of each disease class ($q \in \{1, 2, 12\}$) in each colony. Finally, $(R_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of individuals in the recovered class. -The model is a Markov chain where extinction of both pathogens species and extinction of the host species are absorbing states. -However, the expected time to reach this state is much larger than the duration of the simulations. +The model is a Markov chain where extinction of both pathogen species and extinction of the host species are absorbing states. +The expected time for either host to go extinct is much larger than the duration of the simulations.
At any time, suppose the system is in state $((s_x), (i_{x,q}), (r_x))$. At each step in the simulation we calculate the rate at which each possible event might occur. @@ -439,18 +440,18 @@ Infection of a susceptible with either Pathogen 1 or 2 is therefore given by while coinfection, given the coinfection adjustment factor $\alpha$, is given by \begin{align} i_{12,x} \rightarrow i_{12,x}+1,\;\;\; i_{1x} \rightarrow i_{1x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{1x}\left(i_{2x} + i_{12x}\right),\\ - i_{12,x} \rightarrow i_{12,x}+1,\;\;\; I_{2x} \rightarrow i_{2x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{2x}\left(i_{1x} + i_{12x}\right). + i_{12,x} \rightarrow i_{12,x}+1,\;\;\; i_{2x} \rightarrow i_{2x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{2x}\left(i_{1x} + i_{12x}\right). \end{align} Note that lower values of $\alpha$ give lower rates of infection as in \textcite{castillo1989epidemiological}. -The probability of migration from colony $y$ (with degree $k_y$) to colony $x$, given a dispersal rate $\xi$ is given by +The rate of migration from colony $y$ (with degree $k_y$) to colony $x$, given a dispersal rate $\xi$ is given by \begin{align} s_x \rightarrow s_x+1,\;\;\; s_y \rightarrow s_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi s_y}{k_y},\\ i_{qx} \rightarrow i_{qx}+1,\;\;\; i_{qy} \rightarrow i_{qy}-1 &\;\;\text{at a rate of}\;\; \frac{\xi i_{qy}}{k_y},\\ r_x \rightarrow r_x+1,\;\;\; r_y \rightarrow r_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi r_y}{k_y}. \end{align} -Not that the dispersal rate does not change with infection. +Note that the dispersal rate does not change with infection. As above, this is due to the low virulence of bat viruses. Finally, recovery from any infectious class occurs at a rate $\gamma$ \begin{align} @@ -1010,7 +1011,7 @@ The birth rate $\Lambda$ was set to be equal to $\mu$. This yields a population that does not systematically increase or decrease. However, the size of each colony changes as a random walk. 
Given the length of the simulations, colonies were very unlikely to go extinct (Figure~\ref{fig:plotsNoInvade2}). -The starting size of each colony was set to \SI{3000}. +The starting size of each colony was set to \si{3000}. This is appropriate for many bat species \cite{jones2009pantheria}, especially the large, frugivorous \emph{Pteropodidae} that have been particularly associated with recent zoonotic diseases. The recovery rate $\gamma$ was set to one, giving an average infection duration of one year. @@ -1051,7 +1052,7 @@ Again, visual inspection of preliminary simulations was used to determine that a The choice to use a fixed number of events, rather than a fixed number of years, was for computational convenience. However, this choice creates a risk of bias as simulations with a greater total rate of events $\sum_j e_j$ (e.g.,\ faster disease transmission) will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). -However, visual inspection of the dynamics of disease extinction (\ref{fig:plotsNoInvade1}), and examination of the typical time to extinction implies that this bias is very weak. +However, visual inspection of the dynamics of disease extinction (Figure~\ref{fig:plotsNoInvade1}), and examination of the typical time to extinction suggests that this bias is negligible. For example, of the simulations where extinction occurred, the extinction occurred more than 50 years before the end of the simulation in 90\% of cases. On a preliminary run of 106 simulations across all combinations of dispersal and transmission rates, examining the population after \SI{700000} events instead of \SI{\rinline{nEvent}} events gave exactly the same result with respect to the binary state of invasion or no invasion. @@ -1069,7 +1070,7 @@ I ran 100 simulations at each transmission rate. Two parameters control population structure in the model: dispersal rate and the topology of the metapopulation network. 
The values used for these parameters were chosen to highlight the effects of population structure. -I selected the dispersal rates $\xi = 0, 0.1, 0.01$ and $ 0.001$ dispersals per individual per year. +I selected the dispersal rates $\xi = 0, 0.001, 0.01$ and $0.1$ dispersals per individual per year. The probability that an individual disperses at least once in its lifetime is given by $\xi / \left(\xi + \mu\right)$. Therefore, $\xi = 0.1$ relates to 67\% of individuals dispersing between colonies at least once in their lifetime. Exclusively juvenile dispersal would have dispersal rates similar to this value. %todo cite @@ -1482,7 +1483,7 @@ However, the rate of recovery from pathogens in the presence of coinfections has In humans, the rate of recovery from respiratory syncytial virus was faster in individuals that had recently recovered from one of a number of co-circulating viruses \cite{munywoki2015influence}. However, currently coinfected individuals recovered more slowly than average \cite{munywoki2015influence}. -However, further work could relax this assumption using a model similar to \cite{poletto2015characterising} which contains additional classes for ``infected with Pathogen 1, immune to Pathogen 2'' and ``infected with Pathogen 1, immune to Pathogen 2''. +However, further work could relax this assumption using a model similar to \cite{poletto2015characterising} which contains additional classes for ``infected with Pathogen 1, immune to Pathogen 2'' and ``infected with Pathogen 2, immune to Pathogen 1''. The model here was formulated such that the study of systems with greater than two pathogens (an avenue for further study) is still computationally feasible. A model such as used in \cite{poletto2015characterising} contains $3^\rho$ classes for a system with $\rho$ pathogen species. This quickly becomes computationally restrictive. 
From c55bc2e85041144e74663a6a4edccd9089b691d4 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Thu, 21 Jul 2016 22:29:13 +0100 Subject: [PATCH 02/17] kj final edits to +abstract. --- Preamble.tex | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/Preamble.tex b/Preamble.tex index fa7fc09..d5d245b 100644 --- a/Preamble.tex +++ b/Preamble.tex @@ -3,18 +3,19 @@ \begin{abstract} % 300 word limit -\lettr{T}he huge number of pathogen species strongly affects human health and ecological systems. +\lettr{P}athogens acquired from animals make up the majority of emerging human diseases, are often highly virulent and can have large effects on public health and economic development. +Identifying species with high pathogen species richness enables efficient sampling and monitoring of potentially dangerous pathogens. I examine the role of host population structure and size in maintaining pathogen species richness in an important reservoir host for zoonotic viruses, bats (Order, Chiroptera). -Firstly I test whether population structure is associated with high viral richness across wild bat species within a comparative phylogenetic analysis. +Firstly I test whether population structure is associated with high viral richness across bat species within a comparative, phylogenetic analysis. I find evidence that bat species with more structured populations have more virus species. As this type of study cannot distinguish between specific mechanisms, I then formulate epidemiological models to test whether more structured host populations may allow invading pathogens to avoid competition. -These models show that population structure does not affect the rate of pathogen invasion by this mechanism. -Rather, in these models only the disease dynamics within the local group matter.
-As both global host population structure and local group size appear to be important for disease invasion, I use the same modelling framework to compare the importance of host group size and number of groups. -I find that host group size has a stronger affect than number of groups. -There are very few population size estimates for bats to directly test the importance of host population size on pathogen richness. -Therefore I develop a method for estimating bat population sizes from acoustic surveys to assist future research. -Overall in this thesis, I show that the structure and size of host bat populations can affect their ability to maintain many pathogen species and I provide a method to measure population sizes of bats. These findings increase our understanding of the ecological process of pathogen community construction and can help optimise host surveillance for zoonotic pathogens. +However, these models show that increasing population structure decreases the rate of pathogen invasion. +As both global host population structure and local group size appear to be important for disease invasion, I use the same modelling framework to compare the importance of host density, group size and number of groups. +I find that host group size has a stronger effect than density or number of groups. +There are few bat population size estimates to empirically test the importance of host population size on pathogen richness. +Therefore, to assist future research, I develop a method for estimating bat population sizes from acoustic surveys. +Overall in this thesis, I show that the structure and size of host bat populations can affect their ability to maintain many pathogen species and I provide a method to measure population sizes of bats. +These findings increase our understanding of the ecological process of pathogen community construction and can help optimise surveillance for zoonotic pathogens.
\end{abstract} From 9ab175e21f57970c9ee2c051df19e1d823912f95 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Thu, 21 Jul 2016 23:25:32 +0100 Subject: [PATCH 03/17] Some more +kj edits to +ch3 and +intro. --- Chapter3.Rtex | 21 ++++++++------------- Introduction.tex | 12 ++++++------ 2 files changed, 14 insertions(+), 19 deletions(-) diff --git a/Chapter3.Rtex b/Chapter3.Rtex index dc4f2ae..ae2c1f8 100644 --- a/Chapter3.Rtex +++ b/Chapter3.Rtex @@ -119,7 +119,7 @@ rangeUseable <- 0.20 %\tmpsection{One or two sentences providing a basic introduction to the field} % comprehensible to a scientist in any discipline. \lettr{Z}oonotic diseases make up the majority of human infectious diseases and are a major drain on healthcare resources and economies. -Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens. +Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens, all else being equal. However, the factors that influence pathogen richness in animal species are poorly understood. % % @@ -275,7 +275,7 @@ Furthermore, I found that the role of phylogeny is very weak both in the models To measure pathogen richness I used data from \textcite{luis2013comparison}. This data simply includes known infections of a bat species with a virus species. -I have used viral richness as a proxy for pathogen richness more generally, but the analysis could also be considered as representative of viral richness only. +I have used viral richness as a proxy for pathogen richness more generally. Rows with host species that were not identified to species level according to \textcite{wilson2005mammal} were removed. Many viruses were not identified to species level or their specified species names were not in the ICTV virus taxonomy \cite{ICTV}. 
Therefore, I counted a virus if it was the only virus, for that host species, in the lowest taxonomic level identified (present in the ICTV taxonomy). @@ -1827,7 +1827,8 @@ This variable can be used to benchmark how important other explanatory variables The whole analysis was run \rinline{nBoots} times, resampling the random variable each time. -To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} (Figure~\ref{fig:treePlot}) which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}. +To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}. +This tree was pruned to include only the species I had data for (Figure~\ref{fig:treePlot}). Phylogenetic manipulation was performed using the \emph{ape} package \cite{ape}. I also performed the analysis using the phylogeny from \textcite{jones2005bats} as this has some broad topological differences including the Rhinolophoidea being sister to the Pteropodidae rather than being related to the other insectivorous bats (Figure~\ref{fig:treePlot2}). @@ -1836,18 +1837,12 @@ I also performed the analysis using the phylogeny from \textcite{jones2005bats} %%begin.rcode treeCapt -treeCapt <- paste0(' +treeCapt <- ' The phylogenetic distribution of viral richness. -There is no clear association between phylogeny and virus richness (pgls: $\\lambda =$ ', -round(virusLambda$param['lambda'], 2), -', $p =$ ', -round(virusLambda$param.CI$lambda$bounds.p[1], 2), -').', -' -The phylogeny is from \\cite{bininda2007delayed} pruned to include all species used in either the number of subspecies or gene flow analysis. 
+The phylogeny is from \\cite{fritz2009geographical} pruned to include all species used in either the number of subspecies or gene flow analysis. Dot size shows the number of known viruses for that species and colour shows family. The red scale bar shows 25 million years.' -) + treeTitle <- 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.' @@ -2099,7 +2094,7 @@ Body mass and range size are also probably in the best model ($b = $ \rinline{va When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and \ref{t:variables2}). Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. -However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{bininda2007delayed}. +However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. %%begin.rcode ITCombPlotCapt diff --git a/Introduction.tex b/Introduction.tex index 43d21be..00ff9ef 100644 --- a/Introduction.tex +++ b/Introduction.tex @@ -14,7 +14,7 @@ \section{Pathogen richness and the impacts of zoonotic diseases} For example, both Liberia and Guinea experienced negative per capita growth rates of -2\% due to the Ebola epidemic in 2014 \cite{ebolaWorldbank, ebola2015worldbank}. More generally, death rates per 1,000 people living with AIDS are up to ten times higher in developing countries than in Europe and North America \cite{granich2015trends}. The global richness of pathogens is large but mostly unknown \cite{poulin2014parasite}. 
-Recent large studies suggest that the global number of mammalian virus species is of the order of hundreds of thousands \cite{anthony2013strategy} while only 3,000 virus species, across all host groups, are currently described \cite{ICTV}. +Recent studies suggest that the global number of mammalian virus species is of the order of hundreds of thousands \cite{anthony2013strategy} while only 3,000 virus species, across all taxonomic groups, are currently described \cite{ICTV}. This large pool of unknown pathogens presents a continuing risk of new pathogens spilling over into humans. @@ -47,7 +47,7 @@ \section{Pathogen richness and the impacts of zoonotic diseases} -\section{Influence of population structure and size on pathogen richness} +\section{Influence of population size and structure on pathogen richness} \tmpsection{Theoretical evidence that structure and density increase richness} @@ -57,19 +57,19 @@ \section{Influence of population structure and size on pathogen richness} The roles of population size and density in disease dynamics are well established \cite{may1979population, anderson1979population, heesterbeek2002brief, lloyd2005should}. Broadly, larger populations can maintain diseases more easily by having a larger pool of susceptible individuals (individuals without acquired immunity) and having a greater number of new susceptible individuals enter the population by birth or immigration \cite{may1979population, anderson1979population}. High density populations are expected to have a greater number of contacts between individuals and so promote disease spread. -However, there is much discussion on if and when the number of contacts might scale independently of density \cite{mccallum2001should}. +However, there is much discussion about if and when the number of contacts might scale independently of density \cite{mccallum2001should}. 
% Structure -There is a large literature on the role of population structure on disease dynamics, as reviewed by \textcite{pastor2015epidemic}, driven by applications to human health as well as computer viruses \cite{pastor2001epidemic} and the social spread of information \cite{goffman1964generalization}. +There is also a large literature on the role of population structure on disease dynamics, as reviewed by \textcite{pastor2015epidemic}, driven by applications to human health as well as computer viruses \cite{pastor2001epidemic} and the social spread of information \cite{goffman1964generalization}. In particular, work has concentrated on how population structure affects the basic reproduction number, $R_0$ \cite{colizza2007invasion, barthelemy2010fluctuation, wu2013threshold, may2001infection, pastor2001epidemic}. This value combines relevant parameters to yield a threshold above which a disease is expected to infect a significant proportion of the population \cite{may1979population, anderson1979population}. Below the threshold, only small outbreaks that quickly die out are expected. However, the majority of theoretical work considers single pathogens with models examining whether a pathogen can spread and persist in a population, ignoring all other pathogens. -Recent large studies have found tens \cite{anthony2013strategy} or even hundreds \cite{anthony2015non} of virus species in a single host species. -This suggests that the global number of mammalian virus species is of the order of hundreds of thousands \cite{anthony2013strategy} while recent large databases include nearly 2,000 pathogens from approximately 400 wild animal hosts \cite{wardeh2015database}. +Studies have found tens \cite{anthony2013strategy} or even hundreds \cite{anthony2015non} of virus species in a single host species. 
+%This suggests that the global number of mammalian virus species is of the order of hundreds of thousands \cite{anthony2013strategy} while recent large databases include nearly 2,000 pathogens from approximately 400 wild animal hosts \cite{wardeh2015database}. Therefore ignoring inter-pathogen competition is an oversimplification. A number of studies have considered the case where two pathogens spread concurrently and examine which pathogen infects more individuals. From 613ce484d8441c0b72e604f77745a6132cf0e562 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Fri, 22 Jul 2016 23:26:58 +0100 Subject: [PATCH 04/17] Fix fig references. --- Chapter5.Rtex | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Chapter5.Rtex b/Chapter5.Rtex index 7300141..76502a1 100644 --- a/Chapter5.Rtex +++ b/Chapter5.Rtex @@ -254,7 +254,7 @@ ggplot(reg, aes(x = x, y = y)) + I have identified the parameter space for the combinations of $\theta$ and $\alpha$ for which the derivation of the equations are the same (defined as sub-models in the gREM) (Figure~\ref{fig:equalRegions}). For example, the gas model becomes the simplest gREM sub-model (upper right in Figure~\ref{fig:equalRegions}) and the REM from \cite{rowcliffe2008estimating} is another gREM sub-model where $\theta<\pi/2$ and $\alpha = 2\pi$. -I derive one gREM sub-model SE2 as an example below, where $2 \pi - \alpha/2 < \theta < 2\pi ,\; 0 < \alpha <\pi$ (see Appendix \ref{gremAppendix} for derivations of all gREM sub-models). +I derive one gREM sub-model SE2 as an example below, where $2 \pi - \alpha/2 < \theta < 2\pi ,\; 0 < \alpha <\pi$ (see Appendix~\ref{gremAppendix} for derivations of all gREM sub-models). Any estimate of density would require prior knowledge of animal velocity, $v$ and animal signal width, $\alpha$ taken from other sources, for example, existing literature \cite{brinklov2011, carbone2005far}. 
Sensor width, $\theta$, and detection distance, $r$ would also need to be measured or obtained from manufacturer specifications \cite{holderied2003echolocation, adams2012you}. @@ -263,14 +263,14 @@ Sensor width, $\theta$, and detection distance, $r$ would also need to be measur In order to calculate $\bar{p}$, we have to integrate over the focal angle, $x_1$ (Figure~\ref{f:x1AndInt}A). This is the angle taken from the centre line of the sensor. -Other focal angles are possible ($x_2$, $x_3$, $x_4$) and are used in other gREM sub-models (see Appendix \ref{gremAppendix}). +Other focal angles are possible ($x_2$, $x_3$, $x_4$) and are used in other gREM sub-models (see Appendix~\ref{gremAppendix}). As the size of the profile depends on the approach angle, we present the derivation across all approach angles. When the sensor is directly approaching the animal $x_1 = \pi/2$. Starting from $x_1 = \pi/2$ until $\theta/2 + \pi/2 - \alpha/2$, the size of the profile is $2r\sin \alpha/2$ (Figure~\ref{f:x1AndInt}B). During this first interval, the size of $\alpha$ limits the width of the profile. -When the animal reaches $x_1$ = $\theta/2 + \pi/2 - \alpha/2$ (Figure~\ref{f:x1AndInt}C), the size of the profile is $r\sin( \alpha/2) + r\cos( x_1 - \theta/2)$ and the size of $\theta$ and $\alpha$ both limit the width of the profile (Figure~ \ref{f:x1AndInt}C). -Finally, at $x_1 = 5\pi/2 - \theta/2 - \alpha/2$ until $x_1 = 3\pi/2$, the width of the profile is again $2r\sin\alpha/2$ (Figure~ \ref{f:x1AndInt}D) and the size of $\alpha$ again limits the width of the profile. +When the animal reaches $x_1$ = $\theta/2 + \pi/2 - \alpha/2$ (Figure~\ref{f:x1AndInt}C), the size of the profile is $r\sin( \alpha/2) + r\cos( x_1 - \theta/2)$ and the size of $\theta$ and $\alpha$ both limit the width of the profile (Figure~\ref{f:x1AndInt}C). 
+Finally, at $x_1 = 5\pi/2 - \theta/2 - \alpha/2$ until $x_1 = 3\pi/2$, the width of the profile is again $2r\sin\alpha/2$ (Figure~\ref{f:x1AndInt}D) and the size of $\alpha$ again limits the width of the profile. \begin{figure}[t] @@ -301,7 +301,7 @@ The average profile $\bar{p}$ is the size of the profile averaged across all app \end{figure} -The profile width $p$ for $\pi$ radians of rotation (from directly towards the sensor to directly behind the sensor) is completely characterised by the three intervals (Figure \ref{f:x1AndInt}B -- D). +The profile width $p$ for $\pi$ radians of rotation (from directly towards the sensor to directly behind the sensor) is completely characterised by the three intervals (Figure~\ref{f:x1AndInt}B -- D). Average profile width $\bar{p}$ is calculated by integrating these profiles over their appropriate intervals of $x_1$ and dividing by $\pi$ which gives \begin{align} @@ -319,7 +319,7 @@ D = z/vt\bar{p}. Rather than having one equation that describes $\bar{p}$ globally, the gREM must be split into submodels due to discontinuous changes in $p$ as $\alpha$ and $\beta$ change. These discontinuities can occur for a number of reasons such as a profile switching between being limited by $\alpha$ and $\theta$, the difference between very small profiles and profiles of size zero, and the fact that the width of a sector stops increasing once the central angle reaches $\pi$ radians (i.e., a semi-circle is just as wide as a full circle). -As an example, if $\alpha$ is small, there is an interval between Figure \ref{f:x1AndInt}C and \ref{f:x1AndInt}D where the `blind spot' would prevent animals being detected giving $p=0$. +As an example, if $\alpha$ is small, there is an interval between Figure~\ref{f:x1AndInt}C and \ref{f:x1AndInt}D where the `blind spot' would prevent animals being detected giving $p=0$. 
This would require an extra integral in our equation, as simply putting our small value of $\alpha$ into \ref{e:SE2int} would not give us this integral of $p=0$. gREM submodel specifications were done by hand, and the integration was done using \emph{SymPy} \cite{sympy} in \emph{Python}. @@ -544,8 +544,8 @@ ggplot(captures, aes(x = count, y = percentageerror, \subsubsection{Movement models} Within the four gREM submodels tested (NW1, SW1, SE3, NE1), neither the accuracy or precision was affected by the average amount of time spent stationary. -The median difference between the estimated and true values was less than 2\% for each category of stationary time (0, 0.25, 0.5 and 0.75) (Figure~\ref{fig:movtFig}). -Altering the maximum change in direction in each step (0, $\pi/3$, $2\pi/3$, and $\pi$) did not affect the accuracy or precision of the four gREM submodels (Figure~\ref{fig:movtFig}). +The median difference between the estimated and true values was less than 2\% for each category of stationary time (0, 0.25, 0.5 and 0.75) (Figure~\ref{f:movtFig}). +Altering the maximum change in direction in each step (0, $\pi/3$, $2\pi/3$, and $\pi$) did not affect the accuracy or precision of the four gREM submodels (Figure~\ref{f:movtFig}). \subsubsection{Impact of parameter error} @@ -637,7 +637,7 @@ Simulation model results of the accuracy and precision of four gREM submodels (N The percentage error between estimated and true density within each gREM sub model for the different movement models is shown within each box plot, where the white line represents the median percentage error across all simulations, boxes represent the middle 50\% of the data, whiskers represent variability outside the upper and lower quartiles with outliers plotted as individual points. Notches in boxplots show the 95\% confidence for the median. The simple model is represented where time and maximum change in direction equals 0. 
-The colour of each box plot corresponds to the expressions for average profile width $\bar{p}$ given in Figure \ref{f:equalModelResults}. +The colour of each box plot corresponds to the expressions for average profile width $\bar{p}$ given in Figure~\ref{f:equalModelResults}. } \label{f:movtFig} \end{figure} From eb1c2b795300213dcdf11ca8df2bef73e274ed5f Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Fri, 22 Jul 2016 23:27:35 +0100 Subject: [PATCH 05/17] More final +kj edits. --- Chapter2.Rtex | 14 +- Chapter3.Rtex | 336 ++++++++++++++++++++++++----------------------- Chapter4.Rtex | 8 +- Introduction.tex | 35 ++--- 4 files changed, 199 insertions(+), 194 deletions(-) diff --git a/Chapter2.Rtex b/Chapter2.Rtex index c87ce5b..abbbfbe 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -180,7 +180,7 @@ If competitive mechanisms are strong, endemic pathogens in populations with high Only if competitive mechanisms are weak will high $R_0$ enable the invasion of new pathogens and allow higher pathogen richness. Overall, the evidence from comparative studies indicates that increased population structure correlates with higher pathogen richness. -This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter \ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). However, there are a number of studies that contradict this conclusion \cite{gay2014parasite, bordes2007rodent, ezenwa2006host}. 
Comparative studies are often contradictory due to small sample sizes, noisy data and because empirical relationships often do not extrapolate well to other taxa. Furthermore, multicollinearity between many traits also makes it hard to clearly distinguish which factors are important \cite{nunn2015infectious}. @@ -240,7 +240,7 @@ Furthermore, this close evolutionary relationship means that competition via cro \tmpsection{Why bats} The metapopulations were parameterised to broadly mimic wild bat populations. -Population structure has already been found to correlate with pathogen richness in bats (Chapter \ref{ch:empirical}, \cites{gay2014parasite, maganga2014bat, turmelle2009correlates}). +Population structure has already been found to correlate with pathogen richness in bats (Chapter~\ref{ch:empirical}, \cites{gay2014parasite, maganga2014bat, turmelle2009correlates}). Furthermore, bats have an unusually large variety of social structures. Colony sizes range from ten to 1 million individuals \cite{jones2009pantheria} and colonies can be very stable \cite{kerth2011bats, mccracken1981social}. This strong colony fidelity means they fit the assumptions of metapopulations well. @@ -336,8 +336,8 @@ Infection and coinfection were assumed to cause no extra mortality as for a numb \begin{table}[tb] \centering -\caption[All symbols used in Chapters \ref{ch:sims1} and \ref{ch:sims2}.]{A summary of all symbols used in Chapters \ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. -The justifications for parameter values are given in Section \ref{s:paramSelect}.} +\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}.]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. 
+The justifications for parameter values are given in Section~\ref{s:paramSelect}.} \begin{tabular}{@{}lp{6cm}p{2.9cm}r@{}} \toprule @@ -1446,17 +1446,17 @@ Increasing transmission rate quickly reaches a state where new pathogens always Decreasing the transmission rate quickly reaches a state where invasion is impossible. The result that increased population structure decreases pathogen richness supports many existing predictions that increasing $R_0$ should increase pathogen richness \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. -However, many comparative studies have found the opposite relationship, with increased population structure increasing pathogen richness (Chapter \ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +However, many comparative studies have found the opposite relationship, with increased population structure increasing pathogen richness (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). Furthermore, simple analytical models suggest that population structure should increase pathogen richness \cite{qiu2013vector, allen2004sis, nunes2006localized} and I find no evidence of this. \tmpsection{Link results to consequences} -These results suggest that if population structure does in fact affect pathogen richness, as observed in comparative studies (Chapter \ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), it must occur by a mechanism other than the one studied here. +These results suggest that if population structure does in fact affect pathogen richness, as observed in comparative studies (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), it must occur by a mechanism other than the one studied here. 
In this study the hypothesised mechanism for the relationship between population structure and pathogen richness, was that the spread and persistence of a newly evolved pathogen would be facilitated in highly structured populations as the lack of movement between colonies would stochastically create areas of low prevalence of the endemic pathogen. If the invading pathogen evolved (i.e.\ was seeded) in one of these areas of low prevalence, invasion would be more likely. Instead, reduced population structure allowed the new pathogen to quickly spread outside of the colony in which it evolved. -As the mechanism studied here cannot explain the relationship between population structure and pathogen richness seen in wild species (Chapter \ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), other mechanisms should be studied. +As the mechanism studied here cannot explain the relationship between population structure and pathogen richness seen in wild species (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), other mechanisms should be studied. Other mechanisms that should be examined include reduced competitive exclusion of already established pathogens or increased invasion of less closely related and less strongly competing pathogens, perhaps mediated by ecological competition of pathogens (i.e.\ reduction of the susceptible pool by disease induced mortality). Furthermore, single pathogen dynamics could have an important role such as population structure causing a much slower, asynchronous epidemic preventing acquired herd immunity \cite{plowright2011urban}. diff --git a/Chapter3.Rtex b/Chapter3.Rtex index ae2c1f8..1a55c20 100644 --- a/Chapter3.Rtex +++ b/Chapter3.Rtex @@ -128,17 +128,18 @@ However, the factors that influence pathogen richness in animal species are poor % Theory led. 
The pattern of contacts between individuals (i.e.\ population structure) can be influenced by habitat fragmentation, sociality and dispersal behaviour. Epidemiological theory suggests that increased population structure can promote pathogen richness by reducing competition between pathogen species. -Conversely, it is often assumed that as greater population structure slows the spread of a new pathogen, less structured populations should have greater pathogen richness. +Conversely, it is often assumed that as greater population structure slows the spread of a new pathogen (i.e.\ lowers $R_0$), less structured populations should have greater pathogen richness. % % %\tmpsection{One sentence clearly stating the general problem (the gap)} % being addressed by this particular study. -Previous studies have had contradictory results and different measures of population structure have been used, complicating the interpretation. +Previous comparative studies comparing pathogen richness and population structure measured population structure differently and have had contradictory results, complicating the interpretation. % % %\tmpsection{One sentence summarising the main result} % (with the words “here we show” or their equivalent). -Here I used comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny, to test whether increased population structure correlates with viral richness. +Here I test whether increased population structure correlates with viral richness using comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny. +This is an indirect test between the two competing hypotheses: does increased population structure allow pathogen coexistence by reducing competition, or does increased population structure decrease $R_0$ and therefore cause fewer new pathogens to enter the population. 
Bats, as a group, make a useful case study because they have been associated with a number of important, recent zoonotic outbreaks. Unlike previous studies, I used two measures of population structure: the number of subspecies and effective levels of gene flow. I find that both measures are positively associated with pathogen richness. @@ -148,7 +149,7 @@ I find that both measures are positively associated with pathogen richness. % or how the main result adds to previous knowledge My results add more robust support to the hypothesis that increased population structure promotes viral richness in bats. The results support the prediction that increased population structure allows greater pathogen richness by reducing competition between pathogens -The prediction that factors that increase $R_0$ should increase pathogen richness is not supported. +The prediction that factors that decrease $R_0$ should decrease pathogen richness is not supported. % % %\tmpsection{One or two sentences to put the results into a more general context.} @@ -1786,7 +1787,7 @@ These bat species with no known viruses were included to make the greatest use o After data cleaning there was data for \rinline{nrow(nSpecies)} bat species in \rinline{length(unique(nSpecies$Family))} families for the subspecies analysis. Due to the limited number of studies and the restrictive requirements imposed on study design, there was only data for \rinline{nrow(fstFinal)} bat species in \rinline{length(unique(fstFinal$Family))} families for the effective gene flow analysis. -The raw data are included in Table \ref{A-rawData}. +The raw data are included in Table~\ref{A-rawData}. @@ -1795,8 +1796,8 @@ The raw data are included in Table \ref{A-rawData}. -To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis} via the \emph{taxize} package \cite{chamberlain2013taxize}. 
-The counts were scraped using the \emph{rvest} package \cite{rvest}. +To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis}. +This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rvest}, with ITIS synonyms being accessed with the \emph{taxize} package \cite{chamberlain2013taxize}. I log transformed these variables as they were strongly right skewed. I tested for correlation between these two proxies for study effort using phylogenetic least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}). The log number of citations on PubMed and Google scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$). @@ -1817,7 +1818,7 @@ Distribution size was estimated by downloading range maps for all species from I \subsection{Statistical analysis} Statistical analysis for both response variables --- number of subspecies and effective level of gene flow --- was conducted using an information theoretical approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. -All analysis were performed in R \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex}. +All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex}. I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. 
In the analysis using the number of subspecies response variable I also modelled the interaction study effort and number of subspecies by including their product. This interaction was included as I believed \emph{a priori} that this interaction may be present as subspecies in well studied species are more likely to be identified. @@ -2033,150 +2034,12 @@ ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + The number of described virus species for a bat host ranged up to \rinline{max(nSpecies$virusSpecies)} viruses in \emph{\rinline{nSpecies$binomial[which.max(nSpecies$virusSpecies)]}}. There appears to be a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:boxplot}) though few species have more than five subspecies. -Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table \ref{A-modelWeights}). +Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table~\ref{A-modelWeights}). However these top seven models all contained study effort, number of subspecies and the interaction between these two variables. The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models. These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined. - -Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. 
-The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). -When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). -Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). - - - - -%%begin.rcode fstRawCapt - -fstRawDataCapt <- -paste( -'Relationship between viral richness and log effective gene flow per generation for', -nrow(fstFinal), -'bat species. -Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. -') - - - -fstRawDataTitle <- -paste( -'Relationship between viral richness and log effective gene flow per generation for', -nrow(fstFinal), -'bat species. 
-') - -%%end.rcode - - - -%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle - -# Plot raw fst data - -ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) + - geom_point(size = 2) + - scale_colour_poke(pokemon = 'oddish', spread = 3) + - scale_x_log10() + - geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) + - xlab('Gene Flow (per gen.)') + - ylab('Viral Richness') - -%%end.rcode - - - - -Study effort is very likely in the best model ($b = $ \rinline{varCoefMeans['beta.scholarRefs']}, $Pr > $ \rinline{substring(as.character(varWeights['scholarRefs']), 1, 4)}). -Body mass and range size are also probably in the best model ($b = $ \rinline{varCoefMeans['beta.mass']}, $Pr = $ \rinline{varWeights['mass']} and $b = $ \rinline{varCoefMeans['beta.distrSize']}, $Pr = $ \rinline{varWeights['distrSize']} respectively) with positive relationships of slightly lower strength than the number of subspecies in models without an interaction term ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). - - -When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and \ref{t:variables2}). -Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. -However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. - -%%begin.rcode ITCombPlotCapt - -ITPlotCapts <- " -The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness. 
-The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. -The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. -The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. -The red ``Random'' box is the uniformly random variable. -Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." - -ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness." - -%%end.rcode - - -%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', cache = FALSE - -# Reorder var levels to get structure at beginning. 
-fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) - -# Draw the fst model selection plot -fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + - geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + - scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + - scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + - ylim(0, 1) + - theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), - panel.grid.major.x = element_blank(), - axis.text.y = element_text(size = 8)) + - scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + - scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + - ylim(0, 1) + - ylab('P(in best model)') + - xlab('') - - -#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10) - - -# Combine and print the plots. -ggdraw() + - draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(ITPlot, 0, 0, 0.5, 1) + - draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) + - draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) - - -%%end.rcode - -\tmpsection{Model results} - - - - -\subsection{Gene Flow} - -\tmpsection{More Descriptive} - -%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. -The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). 
-Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table \ref{t:models} and Table \ref{A-modelWeights}). -The effective level of gene flow was likely in the best model ($Pr > 0.999$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). -On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). -Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. - -Study effort was very likely in the best model ($Pr > 0.999$) as was body mass ($Pr > 0.999$). -However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). % which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). 
-In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. -%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). -Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}). -The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice the size of that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}) - - - -When using the phylogeny from \textcite{jones2005bats} the analysis became very unstable (Figure~\ref{f:A-itplots}). -The support for each variable changed dramatically with each resampling of the random variable. -On average however, only the model containing mass and range size is supported (Tables~\ref{A-fstModelWeights} and \ref{t:variables2}). - - - \afterpage{ % use after page to make sure this whole table is at the end of a page. \begin{landscape} \begin{table}[t] @@ -2260,27 +2123,14 @@ log(Mass) & \end{landscape} } +Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. 
+The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). +When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). +Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). +Study effort is very likely in the best model ($b = $ \rinline{varCoefMeans['beta.scholarRefs']}, $Pr > $ \rinline{substring(as.character(varWeights['scholarRefs']), 1, 4)}). +Body mass and range size are also probably in the best model ($b = $ \rinline{varCoefMeans['beta.mass']}, $Pr = $ \rinline{varWeights['mass']} and $b = $ \rinline{varCoefMeans['beta.distrSize']}, $Pr = $ \rinline{varWeights['distrSize']} respectively) with positive relationships of slightly lower strength than the number of subspecies in models without an interaction term ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). -\subsection{Phylogenetic Analysis} - -\subsubsection{Number of subspecies} - -Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species. -The mean number of viruses across families is fairly constant with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean, (\rinline{min(familyMeans$mean)}). 
-The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species. -The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}). - - - -The small change in mean pathogen richness across families and the lack of clear pattern in Figure~\ref{fig:treePlot} implies that viral richness is not strongly phylogenetic. -This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}). -%This fact implies that other factors must control pathogen richness. -%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. - -Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.999$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). -Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))} which implied that the residuals from the models were very weakly phylogenetic. -A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. \begin{table}[t!] 
@@ -2319,6 +2169,158 @@ Random & \rinline{sprintf('%.2f', varWeights['rand'])} & \rinline{varCoefMeans +When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). +Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. +However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. + +%%begin.rcode ITCombPlotCapt + +ITPlotCapts <- " +The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness. +The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. +The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. +The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. +The red ``Random'' box is the uniformly random variable. +Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." + +ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness." + +%%end.rcode + + +%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', cache = FALSE + +# Reorder var levels to get structure at beginning. 
+fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) + +# Draw the fst model selection plot +fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + ylim(0, 1) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10) + + +# Combine and print the plots. +ggdraw() + + draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(ITPlot, 0, 0, 0.5, 1) + + draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) + + draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) + + +%%end.rcode + + + + + +\tmpsection{Model results} + + + + +\subsection{Gene Flow} + +\tmpsection{More Descriptive} + +%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. +The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). 
+Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +The effective level of gene flow was likely in the best model ($Pr > 0.999$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). +On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). +Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. + +Study effort was very likely in the best model ($Pr > 0.999$) as was body mass ($Pr > 0.999$). +However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). % which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). 
+In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. +%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). +Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}). +The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). + + + + +%%begin.rcode fstRawCapt + +fstRawDataCapt <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species. +Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. +') + + + +fstRawDataTitle <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species. 
+') + +%%end.rcode + + + +%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle + +# Plot raw fst data + +ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) + + geom_point(size = 2) + + scale_colour_poke(pokemon = 'oddish', spread = 3) + + scale_x_log10() + + geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) + + xlab('Gene Flow (per gen.)') + + ylab('Viral Richness') + +%%end.rcode + + + +When using the phylogeny from \textcite{jones2005bats} the analysis became very unstable (Figure~\ref{f:A-itplots}). +The support for each variable changed dramatically with each resampling of the random variable. +On average however, only the model containing mass and range size is supported (Tables~\ref{A-fstModelWeights} and~\ref{t:variables2}). + + + + +\subsection{Phylogenetic Analysis} + +\subsubsection{Number of subspecies} + +Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species. +The mean number of viruses across families is fairly constant with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean, (\rinline{min(familyMeans$mean)}). +The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species. +The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}). + + + +The small change in mean pathogen richness across families and the lack of clear pattern in Figure~\ref{fig:treePlot} implies that viral richness is not strongly phylogenetic. 
+This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}). +%This fact implies that other factors must control pathogen richness. +%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. + +Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.999$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). +Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))} which implied that the residuals from the models were very weakly phylogenetic. +A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. + + + \subsubsection{Effective gene flow} @@ -2352,8 +2354,8 @@ In this study I have used known viral richness in bats as a case study for the m In both analyses I found that a positive effect of increasing population structure (a positive effect of the number of subspecies and a negative effect of gene flow) is likely to be in the best model for explaining viral richness. Only the effective gene flow analysis, when performed using the phylogeny from \textcite{jones2005bats}, does not support this hypothesis. Therefore my study supports the broader hypothesis that increased population structure promotes pathogen richness. 
-Furthermore it contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness \cite{nunn2003comparative, morand2000wormy}. The positive relationship between increased population structure and pathogen richness implies that direct or indirect competitive mechanisms are acting such that increased population structure allows escape from competition which promotes pathogen richness. +Furthermore my study contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness by increasing the rate of spread of new pathogens entering into the population \cite{nunn2003comparative, morand2000wormy}. diff --git a/Chapter4.Rtex b/Chapter4.Rtex index 9330e7a..46427ab 100644 --- a/Chapter4.Rtex +++ b/Chapter4.Rtex @@ -172,7 +172,7 @@ Host density is commonly included in comparative studies and seems to promote hi In contrast, host population size has rarely been directly studied as a predictor of pathogen richness. Studies also often test for correlations between pathogen richness and range size \cite{lindenfors2007parasite, nunn2003comparative, turmelle2009correlates, huang2015parasite, kamiya2014determines}. Overall it seems that species with larger geographic range sizes have higher pathogen richness \cite{kamiya2014determines}. -While host population structure can be difficult to define and measure, a number of studies have found that increased population structure is associated with increased pathogen richness (Chapter \ref{ch:empirical}, \cites{maganga2014bat, turmelle2009correlates}). +While host population structure can be difficult to define and measure, a number of studies have found that increased population structure is associated with increased pathogen richness (Chapter~\ref{ch:empirical}, \cites{maganga2014bat, turmelle2009correlates}). 
Finally, many studies have tested for correlations between pathogen richness and group size, though results are equivocal \cite{vitone2004body, gay2014parasite, ezenwa2006host, rifkin2012animals, nunn2003comparative}. @@ -814,7 +814,7 @@ To ensure connected metapopulation networks I would have had to repeatedly resam However, this would bias $\bar{k}$. Therefore, it was considered preferential to keep the unconnected networks. The threshold of \SI{100}{\kilo\metre} was arbitrary but I aimed to maximise the range of $\bar{k}$ (Figure~\ref{fig:plotK}) while not having many simulations with networks that were unconnected. -Given this setup, populations with low densities had relatively unconnected metapopulation networks while high density populations had fully connected networks (Figure \ref{fig:plotK}). +Given this setup, populations with low densities had relatively unconnected metapopulation networks while high density populations had fully connected networks (Figure~\ref{fig:plotK}). @@ -837,12 +837,12 @@ The values of range size used were \SI{40000}, \SI{20000}, \SI{10000}, \SI{5000} In the second set of simulations, population size was varied by changing colony size while the number of colonies was kept constant. To keep host density constant, range size was reduced as population size increased. The values of colony size used were 100, 200, 400, 800 and \SI{1600} while range size was set to \SI{40000}, \SI{20000}, \SI{10000}, \SI{5000} and \SI{\rinline{dep3[5]}}{\square\kilo\metre}. -This gave population size values of \SI{2000}, \SI{4000}, \SI{8000}, \SI{16000} and \SI{32000} while host density remained at 0.8 hosts{\si{\per\square\kilo\metre}}. +This gave population size values of \SI{2000}, \SI{4000}, \SI{8000}, \SI{16000} and \SI{32000} while host density remained at 0.8 hosts per {\si{\square\kilo\metre}}. In the third set of simulations, population size was varied by changing the number of colonies while colony size was kept constant. 
Again, to keep host density constant, range size was reduced as population size increased. The numbers of colonies used were 5, 10, 20, 40 and 80 while range size was set to \SI{40000}, \SI{20000}, \SI{10000}, \SI{5000} and \SI{\rinline{dep3[5]}}{\square\kilo\metre}. -Again, this gave population size values of \SI{2000}, \SI{4000}, \SI{8000}, \SI{16000} and \SI{32000} while host density remained at 0.8 hosts{\si{\per\square\kilo\metre}}. +Again, this gave population size values of \SI{2000}, \SI{4000}, \SI{8000}, \SI{16000} and \SI{32000} while host density remained at 0.8 hosts per {\si{\square\kilo\metre}}. \subsubsection{Colony size and the number of colonies} diff --git a/Introduction.tex b/Introduction.tex index 00ff9ef..fa5afb7 100644 --- a/Introduction.tex +++ b/Introduction.tex @@ -52,26 +52,28 @@ \section{Influence of population size and structure on pathogen richness} \tmpsection{Theoretical evidence that structure and density increase richness} -% Density +\subsection{Single-pathogen models} +% Structure + single path theory -The roles of population size and density in disease dynamics are well established \cite{may1979population, anderson1979population, heesterbeek2002brief, lloyd2005should}. -Broadly, larger populations can maintain diseases more easily by having a larger pool of susceptible individuals (individuals without acquired immunity) and having a greater number of new susceptible individuals enter the population by birth or immigration \cite{may1979population, anderson1979population}. -High density populations are expected to have a greater number of contacts between individuals and so promote disease spread. -However, there is much discussion about if and when the number of contacts might scale independently of density \cite{mccallum2001should}. 
- -% Structure - -There is also a large literature on the role of population structure on disease dynamics, as reviewed by \textcite{pastor2015epidemic}, driven by applications to human health as well as computer viruses \cite{pastor2001epidemic} and the social spread of information \cite{goffman1964generalization}. +There is a large literature on the role of population structure on single-disease dynamics, as reviewed by \textcite{pastor2015epidemic}, driven by applications to human health as well as computer viruses \cite{pastor2001epidemic} and the social spread of information \cite{goffman1964generalization}. In particular, work has concentrated on how population structure affects the basic reproduction number, $R_0$ \cite{colizza2007invasion, barthelemy2010fluctuation, wu2013threshold, may2001infection, pastor2001epidemic}. This value combines relevant parameters to yield a threshold above which a disease is expected to infect a significant proportion of the population \cite{may1979population, anderson1979population}. Below the threshold, only small outbreaks that quickly die out are expected. +% Density + single path theory +The roles of population size and density in the dynamics of single pathogens are also well established \cite{may1979population, anderson1979population, heesterbeek2002brief, lloyd2005should}. +Broadly, larger populations can maintain a disease more easily by having a larger pool of susceptible individuals (individuals without acquired immunity) and having a greater number of new susceptible individuals enter the population by birth or immigration \cite{may1979population, anderson1979population}. +High density populations are expected to have a greater number of contacts between individuals and so promote the spread of a pathogen. +However, there is much discussion about if and when the number of contacts might scale independently of density \cite{mccallum2001should}. 
-However, the majority of theoretical work considers single pathogens with models examining whether a pathogen can spread and persist in a population, ignoring all other pathogens. +\subsection{Multi-pathogen models} +% multi path is important +While the majority of theoretical work considers single pathogens, with models examining whether a pathogen can spread and persist in a population, much less work has been done on multiple pathogen systems. Studies have found tens \cite{anthony2013strategy} or even hundreds \cite{anthony2015non} of virus species in a single host species. %This suggests that the global number of mammalian virus species is of the order of hundreds of thousands \cite{anthony2013strategy} while recent large databases include nearly 2,000 pathogens from approximately 400 wild animal hosts \cite{wardeh2015database}. Therefore ignoring inter-pathogen competition is an oversimplification. +% structure and multi theory A number of studies have considered the case where two pathogens spread concurrently and examine which pathogen infects more individuals. These studies have found that increased population structure reduces dominance of the more competitive strain \cite{van2014domination, poletto2013host, poletto2015characterising}. However, this again reveals little about how pathogen communities form and what factors control total pathogen richness. @@ -79,6 +81,7 @@ \section{Influence of population size and structure on pathogen richness} Those that do commonly find that competitive exclusion is likely \cite{castillo1995dynamics, bremermann1989competitive, martcheva2013competitive, ackleh2003competitive, ackleh2014robust, turner2002impact}. Mechanisms that have been shown to allow pathogen coexistence include superinfection \cite{may1994superinfection, li2010age}, density-dependent deaths \cite{ackleh2003competitive, kirupaharan2004coexistence} and differing transmission routes \cite{allen2003dynamics}. 
+% density + multi theory The specific role of density on the ability of pathogens to coexist has not been theoretically studied though it is commonly found to promote pathogen richness in comparative empirical studies \cite{kamiya2014determines, nunn2003comparative, arneberg2002host}. The few papers that have directly studied how coexistence of pathogens responds to population structure have found that population structure can allow pathogens to coexist even though competitive exclusion would occur in a fully mixed population \cite{qiu2013vector, allen2004sis, nunes2006localized}. Furthermore, genetic diversity has been shown to be maximised at intermediate levels of population structure \cite{campos2006pathogen}. @@ -129,7 +132,7 @@ \section{Thesis overview} \tmpsection{Chapter 2} -First, in Chapter \ref{ch:empirical}, I empirically tested the hypothesis that population structure is associated with pathogen richness (measured as known viral richness) in wild bat populations. +First, in Chapter~\ref{ch:empirical}, I empirically tested the hypothesis that population structure is associated with pathogen richness (measured as known viral richness) in wild bat populations. To ensure robust results I used two measures of population structure --- the number of subspecies and gene flow --- and a larger data set than previous studies. For both measures I found that bat species with more structured populations have more known viruses. This relationship is still present after controlling for study bias and phylogenetic nonindependence. @@ -137,7 +140,7 @@ \section{Thesis overview} \tmpsection{Chapter 3} -In Chapter \ref{ch:sims1}, I examined one specific mechanism by which population structure may promote increased pathogen richness. +In Chapter~\ref{ch:sims1}, I examined one specific mechanism by which population structure may promote increased pathogen richness. 
I tested whether increased population structure can allow newly evolved pathogen strains to invade and persist more easily. I modelled bat populations as individual-based, stochastic metapopulations and examined the competition dynamics of two identical pathogen strains. I tested two factors related to host population structure: dispersal rate and the number of links between colonies. @@ -147,10 +150,10 @@ \section{Thesis overview} \tmpsection{Chapter 4} -Next, I examined the relationships between a number of elements of population structure (Chapter \ref{ch:sims2}). +Next, I examined the relationships between a number of elements of population structure (Chapter~\ref{ch:sims2}). I clarified the interdependence between range size, population size and density. I also noted that population size can be decomposed into colony size and the number of colonies. -Using the same model as in Chapter \ref{ch:sims1}, I then tested which of these factors are most important in promoting pathogen richness. +Using the same model as in Chapter~\ref{ch:sims1}, I then tested which of these factors are most important in promoting pathogen richness. Specifically I tested which factor most strongly promotes the invasion and establishment of newly evolved pathogens. I found that population size is more important than population density and that colony size is the important component of population size. @@ -159,7 +162,7 @@ \section{Thesis overview} Given the importance of host population size and density on pathogen richness it is important to have good population estimates for wild bat populations. However, there are currently very few measurements of bat population size due to their small size, nocturnal habit and difficulties in identification. Therefore I aimed to develop a method for estimating bat population size from acoustic data, specifically data collected by the iBats project \cite{jones2011indicator}. 
-In Chapter \ref{ch:grem} I developed a generally applicable method --- based on random encounter models \cite{rowcliffe2008estimating, yapp1956theory} --- for estimating population sizes of animal populations using camera traps or acoustic detectors. +In Chapter~\ref{ch:grem} I developed a generally applicable method --- based on random encounter models \cite{rowcliffe2008estimating, yapp1956theory} --- for estimating population sizes of animal populations using camera traps or acoustic detectors. I used spatial simulations to test the method for biases and to assess its precision. I found that the method is unbiased and precise as long as a reasonable amount of data is collected. @@ -167,7 +170,7 @@ \section{Thesis overview} \tmpsection{Chapter 6: Conclusions} %to do Conclusions chapter -Finally, in Chapter \ref{ch:discussion}, I discuss broader conclusions, applications and implications of my results. +Finally, in Chapter~\ref{ch:discussion}, I discuss broader conclusions, applications and implications of my results. I also discuss potential future directions for research. From 81bbe07450168206b8bfd1bc474577a985d16b5f Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Sun, 24 Jul 2016 17:31:28 +0100 Subject: [PATCH 06/17] Change +ch3 plot regression lines to bivariate only. --- Chapter3.Rtex | 98 ++++++++++++++++++--------------------------------- 1 file changed, 35 insertions(+), 63 deletions(-) diff --git a/Chapter3.Rtex b/Chapter3.Rtex index 1a55c20..6c73090 100644 --- a/Chapter3.Rtex +++ b/Chapter3.Rtex @@ -1930,6 +1930,7 @@ $\kappa$ and $\delta$ parameters were constrained to one as they are more concer Further, fitting multiple parameters makes interpretation difficult. + To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely). 
This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined. I calculated AICc for each model. @@ -1946,8 +1947,6 @@ To aid interpretation, the mean coefficient for the number of subspecies was cal - - %%begin.rcode boxplotCapt # Caption for the main boxplot of subspecies vs virus @@ -1958,8 +1957,7 @@ nrow(nSpecies), 'bat species. The area of the circle shows the number of bat species at each discrete value. 48 bat species have one subspecies and one known virus species. -The red line represents a phylogenetic multiple regression including all the explanatory variables but no interaction term. -The line shows the slope from the multiple regression with the intercept being calculated by setting other explanatory variables to their median values. +The red line represents a phylogenetic simple regression between the two variables. ' ) @@ -1973,36 +1971,6 @@ nrow(nSpecies), %%begin.rcode boxplot, fig.cap = boxplotCapt, fig.scap = boxplotTitle, fig.height = 2.3 -# Make a boxplot of subspecies vs virus -# Add model lines for pgls models. - -# Make predictions of model w/out interactions using median values of other variables. - -predData <- data.frame(scholarRefs = quantile(nSpecies$scholarRefs, 0.5), - mass = quantile(nSpecies$mass, 0.5), - NumberOfSubspecies = - seq(min(nSpecies$NumberOfSubspecies), max(nSpecies$NumberOfSubspecies), - length.out = 200) - ) - - -lines <- data.frame(NumberOfSubspecies = predData$NumberOfSubspecies, - virusSpecies = predict(subspeciesJointUnlog, newdata = predData)) - -## Make predictions of model w/ interactions using median values of other variables. 
- -predDataInter <- data.frame(scholarRefs = quantile(nSpecies$scholarRefs, 0.5), - mass = quantile(nSpecies$mass, 0.5), - NumberOfSubspecies = - seq(min(nSpecies$NumberOfSubspecies), max(nSpecies$NumberOfSubspecies), - length.out = 200) - ) - - - -linesInter <- data.frame(NumberOfSubspecies = predData$NumberOfSubspecies, - virusSpecies = predict(subspeciesInter, newdata = predDataInter)) - nSpeciesCounts <- nSpecies %>% group_by(NumberOfSubspecies, virusSpecies) %>% @@ -2015,12 +1983,13 @@ ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + scale_x_continuous(breaks = c(1, 4, 8, 12, 16)) + xlab('Number of Subspecies') + ylab('Viral Richness') + - geom_line(data = lines, aes(x = NumberOfSubspecies, y = virusSpecies, group = 1), - colour = pokepal('nidorina')[10], lwd = 0.7) + geom_abline(slope = sspUni$coef[2, 1], intercept = sspUni$coef[1,1], lwd = 0.7, colour = pokepal('nidorina')[10]) %%end.rcode + + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Results} @@ -2039,10 +2008,16 @@ However these top seven models all contained study effort, number of subspecies The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models. These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined. +Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. 
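The summation described in the preceding sentence can be sketched as follows, assuming the standard definition of Akaike weights computed from AICc differences. The symbols $\Delta_i$, $w_i$, $M$ and $v$ are introduced here for illustration only and are not notation taken from the original text.

\begin{align*}
\Delta_i &= \text{AICc}_i - \min_{j} \text{AICc}_j, \\
w_i &= \frac{\exp(-\Delta_i / 2)}{\sum_{j = 1}^{M} \exp(-\Delta_j / 2)}, \\
Pr(v) &= \sum_{i \,:\, v \in \text{model } i} w_i.
\end{align*}

Here $M$ is the number of models examined and $v$ is an explanatory variable. Under these assumptions the weights $w_i$ sum to one, so a variable that appears in every model has $Pr(v) = 1$, while a variable that appears only in poorly supported models has $Pr(v)$ close to zero.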
+The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). +When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). +Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). + + \afterpage{ % use after page to make sure this whole table is at the end of a page. \begin{landscape} -\begin{table}[t] +\begin{table}[p!] \centering %\rowcolors{2}{gray!25}{white} \caption[Model selection results]{ @@ -2123,16 +2098,18 @@ log(Mass) & \end{landscape} } -Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. -The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). 
-When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). -Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). -Study effort is very likely in the best model ($b = $ \rinline{varCoefMeans['beta.scholarRefs']}, $Pr > $ \rinline{substring(as.character(varWeights['scholarRefs']), 1, 4)}). -Body mass and range size are also probably in the best model ($b = $ \rinline{varCoefMeans['beta.mass']}, $Pr = $ \rinline{varWeights['mass']} and $b = $ \rinline{varCoefMeans['beta.distrSize']}, $Pr = $ \rinline{varWeights['distrSize']} respectively) with positive relationships of slightly lower strength than the number of subspecies in models without an interaction term ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). + + +When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). +Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. +However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. +\tmpsection{Model results} + + \begin{table}[t!] 
\centering \caption[Estimated variable weights and coefficients for number of subspecies and gene flow analyses]{ @@ -2169,9 +2146,20 @@ Random & \rinline{sprintf('%.2f', varWeights['rand'])} & \rinline{varCoefMeans -When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). -Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. -However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. +\subsection{Gene Flow} + +\tmpsection{More Descriptive} + +%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. +The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). +Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +The effective level of gene flow was likely in the best model ($Pr > 0.999$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). 
+On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). +Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. + + + + %%begin.rcode ITCombPlotCapt @@ -2226,23 +2214,6 @@ ggdraw() + - -\tmpsection{Model results} - - - - -\subsection{Gene Flow} - -\tmpsection{More Descriptive} - -%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. -The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). -Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). -The effective level of gene flow was likely in the best model ($Pr > 0.999$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). 
-On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). -Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. - Study effort was very likely in the best model ($Pr > 0.999$) as was body mass ($Pr > 0.999$). However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). % which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. @@ -2261,6 +2232,7 @@ paste( nrow(fstFinal), 'bat species. Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. +The red line represents a phylogenetic simple regression between the two variables. 
') From 10d319d0cfabfd8c1f2fc7ee8b297de3f7a052cf Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Sun, 24 Jul 2016 22:40:05 +0100 Subject: [PATCH 07/17] New abstract fo +ch2. --- Chapter2.Rtex | 51 +++++++++++++-------------------------------------- 1 file changed, 13 insertions(+), 38 deletions(-) diff --git a/Chapter2.Rtex b/Chapter2.Rtex index abbbfbe..606425b 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -74,53 +74,28 @@ library(cowplot) \section{Abstract} - %\tmpsection{One or two sentences providing a basic introduction to the field} % comprehensible to a scientist in any discipline. -\lettr{A}n increasingly large proportion of emerging human diseases comes from animals. -These diseases have a huge impact on human health, healthcare systems and economic development. +%\lettr{A}n increasingly large proportion of emerging human diseases comes from animals. +%These diseases have a huge impact on human health, healthcare systems and economic development. The chance that a new zoonosis will come from any particular wild host species increases with the number of pathogen species occurring in that host species. -However, the factors that control pathogen richness in wild animal species remain unclear. +Comparative, phylogenetic studies have shown that host-species traits such as population density and population structure correlate with pathogen richness. +However, the mechanisms by which these factors control pathogen richness in wild animal species remain unclear. % % %\tmpsection{Two to three sentences of more detailed background} % comprehensible to scientists in related disciplines. % Add mechanistic vs empirical -Comparative, phylogenetic studies have shown that host-species traits such as population density, longevity and body size correlate with pathogen richness. -Further comparative studies have found correlations between population structure and pathogen richness. 
-Typically it is assumed that well-connected, unstructured populations (that therefore have a high basic reproductive number, $R_0$) promote the invasion of new pathogens and therefore increase pathogen richness. % Where or how to define well-connected? -However, this assumption is largely untested in the multipathogen context. +Typically it is assumed that well-connected, unstructured populations (that therefore have a high basic reproductive number, $R_0$) promote the invasion of new pathogens and therefore increase pathogen richness. However, this assumption is largely untested in the multipathogen context. In the presence of inter-pathogen competition, the opposite effect might occur; increased population structure may increase pathogen richness by reducing the effects of competition. A more mechanistic understanding of how population structure affects pathogen richness could discriminate between these two broad hypotheses. -% -%\tmpsection{One sentence clearly stating the general problem (the gap)} -% being addressed by this particular study. -%It is unknown whether greater population structure allows invading pathogens to escape from competition by stochastically creating areas of low pathogen prevalence. -I hypothesised that both low dispersal rates and a low number of connections in a metapopulation network would allow invading pathogens to establish more easily, thus increasing pathogen richness. -I tested this hypothesis using metapopulation networks parameterised to mimic wild bat populations as bats have highly varied social structures and have recently been implicated in a number of high profile diseases such as Ebola, SARS, Hendra and Nipah. -% -% -%\tmpsection{One sentence summarising the main result} -% (with the words “here we show” or their equivalent). -I simulated the process of a new pathogen invading into a metapopulation already occupied by an identical pathogen. 
-I varied the dispersal rate, topology of the metapopulation and transmission rate. -I found significant evidence that increased dispersal rate increased the probability that a new pathogen would invade into a population. -I found marginal evidence that network topology affected the probability that a new pathogen would invade. -%\paragraph{Two or three sentences explaining what the main result reveals in direct comparison to what was thought to be the case previously} -% or how the main result adds to previous knowledge -Therefore, the assumption that factors causing high $R_0$ allow new pathogens to invade and therefore increase pathogen richness was supported. -However, my results contradict theoretical studies that predict that increased population structure should promote coexistence of pathogens. -My results also contradict empirical patterns of pathogen richness with respect to population structure. -Therefore, it is likely that population structure affects pathogen richness via a different mechanism to the one modelled here. -% -% -\tmpsection{One or two sentences to put the results into a more general context.} - - -%\tmpsection{Two or three sentences to provide a broader perspective, } -% readily comprehensible to a scientist in any discipline. - - +Here I have examined one mechanism by which increased population structure may cause greater pathogen richness. +I used simulations to test whether increased population structure could increase the probability that a newly evolved pathogen would invade into a population already infected with an identical, endemic pathogen. +I tested this hypothesis using individual-based, metapopulation networks parameterised to mimic wild bat populations as bats have highly varied social structures and have recently been implicated in a number of high profile diseases such as Ebola, SARS, Hendra and Nipah. +In a metapopulation, dispersal rate and the number of links between colonies can both affect population structure. 
+I tested whether either of these factors could increase the probability that a pathogen would invade and persist in the population. +I found that, at intermediate transmission rates, increasing dispersal rate significantly increased the probability of a newly evolved pathogen invading into the metapopulation. +However, there was very limited evidence that the number of links between colonies affected pathogen invasion probability. @@ -152,7 +127,7 @@ Therefore, it is likely that population structure affects pathogen richness via %1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown \tmpsection{Why is pathogen richness? important?} -Over 50\% of emerging infectious diseases have an animal source \cite{jones2008global, smith2014global}. +Over 60\% of emerging infectious diseases have an animal source \cite{jones2008global, smith2014global}. Zoonotic pathogens can be highly virulent \cite{luby2009recurrent, lefebvre2014case} and can have huge public health impacts \cite{granich2015trends}, economic costs \cite{knobler2004learning} and slow down international development \cite{ebolaWorldbank}. Therefore understanding and predicting changes in the process of zoonotic spillover is a global health priority \cite{taylor2001risk}. The number of pathogen species hosted by a wild animal species affects the chance that a disease from that species will infect humans \cite{wolfe2000deforestation}. From ca66977549d8810be5672d6dafea2b7e366c29ab Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 00:23:14 +0100 Subject: [PATCH 08/17] Lots of edits. Perhaps last, last edits!. 
--- Appendix2.Rtex | 15 ++--- Appendix3.Rtex | 48 ++++++++------- Chapter2.Rtex | 6 +- Chapter3.Rtex | 60 +++++++++---------- Chapter4.Rtex | 34 +++++------ Chapter5.Rtex | 21 ++++--- imgs/movtFig-Edit.svg | 24 ++++---- ...t_al_supplementarymaterial_2015-01-20.Rtex | 8 +-- 8 files changed, 114 insertions(+), 102 deletions(-) diff --git a/Appendix2.Rtex b/Appendix2.Rtex index 416e11d..75681ca 100644 --- a/Appendix2.Rtex +++ b/Appendix2.Rtex @@ -40,7 +40,7 @@ saveData <- TRUE %%end.rcode -%%begin.rcode libs +%%begin.rcode libs, cache = FALSE library(MetapopEpi) library(cowplot) @@ -322,8 +322,9 @@ noInvade1 <- pSIR(p1) + guides(colour=guide_legend(title = '')) + theme(legend.text.align = 0) + xlab('Time (years)') + - xlim(0, maxt) -pop1 <- pPop(p1) + theme_tcdl + xlab('Time (years)') + xlim(0, maxt) + scale_x_continuous(breaks = c(0, 20, 40, 60, 80), limits = c(0, maxt)) + +pop1 <- pPop(p1) + theme_tcdl + xlab('Time (years)') + scale_x_continuous(breaks = c(0, 20, 40, 60, 80), limits = c(0, maxt)) @@ -337,8 +338,8 @@ noInvade2 <- pSIR(p2) + guides(colour=guide_legend(title = '')) + theme(legend.text.align = 0) + xlab('Time (years)') + - xlim(0, maxt) -pop2 <- pPop(p2) + theme_tcdl + xlab('Time (years)') + xlim(0, maxt) + scale_x_continuous(breaks = c(0, 20, 40, 60, 80), limits = c(0, maxt)) +pop2 <- pPop(p2) + theme_tcdl + xlab('Time (years)') + scale_x_continuous(breaks = c(0, 20, 40, 60, 80), limits = c(0, maxt)) # Combine and print the plots. @@ -456,7 +457,7 @@ fullSim <- function(x){ } %%end.rcode -%%begin.rcode runDispSim, eval = TRUE, cache = TRUE +%%begin.rcode runDispSim, eval = runSims, cache = TRUE # Create and set seed (seed object is used to set seed in each seperate simulation.' seed <- 33355 @@ -571,7 +572,7 @@ fullSim <- function(x){ } %%end.rcode -%%begin.rcode runTopoSim, eval = TRUE, cache = TRUE +%%begin.rcode runTopoSim, eval = runSims, cache = TRUE # Create and set seed (seed object is used to set seed in each seperate simulation.' 
seed <- 1230202 diff --git a/Appendix3.Rtex b/Appendix3.Rtex index d65a3b3..9d63ef7 100644 --- a/Appendix3.Rtex +++ b/Appendix3.Rtex @@ -153,7 +153,7 @@ Dmax is the distance between furthest apart $F_{ST}$ sampling locations. The references are for the $F_{ST}$ data only. ' rawDataTitle <- ' -Raw data for both analyses. +Raw data for both analyses ' @@ -258,7 +258,7 @@ virusLambda <- summary(pgls(virusSpecies ~ 1, data = compSubspecies, lambda = 'M refsCapt <- paste0(" - Logged number of references on Scholar and PubMed, with a fitted phylogenetic linear model. + Logged number of references on Google Scholar and PubMed, with a fitted phylogenetic linear model. Colours indicate family. (pgls: $t$ = ", round(citeCor2$coefficients['log(pubmedRefs + 1)', 't value'], 2), @@ -268,7 +268,7 @@ refsCapt <- paste0(" %%end.rcode -%%begin.rcode scholarvspubmedPlot, fig.show = TRUE, fig.height = 3.5, out.width = '0.9\\textwidth', fig.cap = refsCapt, cache = FALSE +%%begin.rcode scholarvspubmedPlot, fig.show = TRUE, fig.height = 5, out.width = '0.9\\textwidth', fig.cap = refsCapt, cache = FALSE pp <- c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(2, 4, 13, 12, 9, 1)]) @@ -433,7 +433,7 @@ colnames(fstModelWeights) <- c("Model", "$\\bar{\\text{AICc}}$", "$\\Delta$AICc" modelSelectCapt <- " Model selection results for number of subspecies analysis. - $\\bar{\\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. + $\\bar{\\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\\Delta$AICc is the model's $\\bar{\\text{AICc}}$ score minus $\\text{min}(\\bar{\\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\\sum w$ is the cumulative sum of the Akaike weights. @@ -441,7 +441,7 @@ modelSelectCapt <- " " modelSelectTitle <- " - Full model selection results for number of subspecies analysis. 
+ Full model selection results for number of subspecies analysis " # floating.environment = 'sidewaystable') ? possible to do upright caption. @@ -471,14 +471,14 @@ print(xtable(modelWeights, fstSelectCapt <- " Model selection results for effective gene flow analysis. - $\\bar{\\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. + $\\bar{\\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\\Delta$AICc is the model's $\\bar{\\text{AICc}}$ score minus $\\text{min}(\\bar{\\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\\sum w$ is the cumulative sum of the Akaike weights. " fstSelectTitle <- " - Full model selection results for effective gene flow analysis. + Full model selection results for effective gene flow analysis " print(xtable(fstModelWeights, @@ -592,7 +592,7 @@ Dot size shows the number of known viruses for that species and colour shows fam The red scale bar shows 25 million years.' -treeTitle <- 'Pruned alternative phylogeny with dot size showing number of pathogens and colour showing family.' +treeTitle <- 'Pruned alternative phylogeny showing number of pathogens and family' %%end.rcode @@ -659,11 +659,13 @@ p %<+% nSpecies + \begin{figure}[t] \centering \includegraphics[width=1\textwidth]{figure/fstITPlots2-1.pdf} - \caption[Akaike variable weights for analysis using alternative phylogeny]{ -Akaike variable weights for both analyses using the phylogeny from \cite{jones2005bats}. -The probability that each variable is in the best model (amongst the models test) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. -The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. 
-The red ``Random'' box is the uniformly random variable. + \caption[The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness using alternative phylogeny]{ +The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness using the phylogeny from \cite{jones2005bats}. +The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. +The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. +The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. +The red ``Random'' box is the uniformly random variable. +Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses. } \label{f:A-itplots} \end{figure} @@ -817,7 +819,7 @@ colnames(fstModelWeightsFull2) <- c("Model", "$\\bar{\\text{AICc}}$", "$\\Delta$ modelSelectCapt <- " Model selection results for number of subspecies analysis using phylogeny from \\cite{jones2005bats}. - $\\bar{\\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. + $\\bar{\\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\\Delta$AICc is the model's $\\bar{\\text{AICc}}$ score minus $\\text{min}(\\bar{\\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\\sum w$ is the cumulative sum of the Akaike weights. @@ -825,7 +827,7 @@ modelSelectCapt <- " " modelSelectTitle <- " - Full model selection results for number of subspecies analysis using alternative phylogeny. 
+ Full model selection results for number of subspecies analysis using alternative phylogeny " # floating.environment = 'sidewaystable') ? possible to do upright caption. @@ -856,14 +858,14 @@ print(xtable(modelWeightsFull2, fstSelectCapt <- " Model selection results for effective gene flow analysis using phylogeny from \\cite{jones2005bats}. - $\\bar{\\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. + $\\bar{\\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\\Delta$AICc is the model's $\\bar{\\text{AICc}}$ score minus $\\text{min}(\\bar{\\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\\sum w$ is the cumulative sum of the Akaike weights. " fstSelectTitle <- " - Full model selection results for effective gene flow analysis with alternative phylogeny. + Full model selection results for effective gene flow analysis with alternative phylogeny " print(xtable(fstModelWeightsFull2, @@ -908,15 +910,15 @@ Number of subspecies &&&&\\ \hspace{3mm}Models without interaction term && \rinline{nSpeciesCoefMean2} &&\\ \hspace{3mm}Models with interaction term && \rinline{nSpeciesCoefMeanI2} &&\\ Number of subspecies*log(Scholar) & \rinline{varWeights2['scholarRefs.NumberOfSubspecies']} & \rinline{sprintf('%.2f', varCoefMeans2['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm] -Gene flow & & & \rinline{fstVarWeights2['Nm']} & \rinline{fstCoefMeans2['beta.Nm']}\\[2.5mm] +Gene flow & & & \rinline{sprintf('%.2f', fstVarWeights2['Nm'])} & \rinline{fstCoefMeans2['beta.Nm']}\\[2.5mm] log(Scholar) & \rinline{sprintf('%.2f', varWeights2['scholarRefs'])} & \rinline{varCoefMeans2['beta.scholarRefs']} & - \rinline{fstVarWeights2['scholarRefs']} & \rinline{fstCoefMeans2['beta.scholarRefs']}\\ + \rinline{sprintf('%.2f', fstVarWeights2['scholarRefs'])} & \rinline{fstCoefMeans2['beta.scholarRefs']}\\ 
log(Mass) & \rinline{sprintf('%.2f', varWeights2['mass'])} & \rinline{varCoefMeans2['beta.mass']} & - \rinline{fstVarWeights2['mass']} & \rinline{fstCoefMeans2['beta.mass']}\\ + \rinline{sprintf('%.2f', fstVarWeights2['mass'])} & \rinline{fstCoefMeans2['beta.mass']}\\ log(Range size) & \rinline{sprintf('%.2f', varWeights2['distrSize'])} & \rinline{varCoefMeans2['beta.distrSize']}& - \rinline{fstVarWeights2['distrSize']} & \rinline{fstCoefMeans2['beta.distrSize']}\\ -Random & \rinline{sprintf('%.2f', varWeights2['rand'])} & \rinline{varCoefMeans2['beta.rand']}& - \rinline{fstVarWeights2['rand']} & \rinline{fstCoefMeans2['beta.rand']}\\ + \rinline{sprintf('%.2f', fstVarWeights2['distrSize'])} & \rinline{fstCoefMeans2['beta.distrSize']}\\ +Random & \rinline{sprintf('%.2f', varWeights2['rand'])} & \rinline{sprintf('%.2f', varCoefMeans2['beta.rand'])}& + \rinline{sprintf('%.2f', fstVarWeights2['rand'])} & \rinline{fstCoefMeans2['beta.rand']}\\ \bottomrule \end{tabular} diff --git a/Chapter2.Rtex b/Chapter2.Rtex index 606425b..b36297a 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -311,7 +311,7 @@ Infection and coinfection were assumed to cause no extra mortality as for a numb \begin{table}[tb] \centering -\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}.]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. +\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. 
The justifications for parameter values are given in Section~\ref{s:paramSelect}.} \begin{tabular}{@{}lp{6cm}p{2.9cm}r@{}} @@ -1026,7 +1026,7 @@ Therefore, if at least one individual was in class $I_2$ or $I_{12}$ at the end Again, visual inspection of preliminary simulations was used to determine that after \SI{\rinline{nEvent - invadeT}} events, if an invading pathogen was still present, it was well established (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). The choice to use a fixed number of events, rather than a fixed number of years, was for computational convenience. -However, this choice creates a risk of bias as simulations with a greater total rate of events $\sum_j e_j$ (e.g.,\ faster disease transmission) will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). +However, this choice creates a risk of bias as simulations with a greater total rate of events, $\sum_j e_j$ (e.g.,\ faster disease transmission) will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). However, visual inspection of the dynamics of disease extinction (Figure~\ref{fig:plotsNoInvade1}), and examination of the typical time to extinction suggests that this bias is negligible. For example, of the simulations where extinction occurred, the extinction occurred more than 50 years before the end of the simulation in 90\% of cases. On a preliminary run of 106 simulations across all combinations of dispersal and transmission rates, examining the population after \SI{700000} events instead of \SI{\rinline{nEvent}} events gave exactly the same result with respect to the binary state of invasion or no invasion. @@ -1298,7 +1298,7 @@ invasionPropCaption <- sprintf(" When network topology is varied, $\\xi = 0.01$.", as.integer(each)) -invasionPropShort <- "The probability of invasion across different dispersal rates and network topologies." 
+invasionPropShort <- "The probability of invasion across different dispersal rates and network topologies" %%end.rcode diff --git a/Chapter3.Rtex b/Chapter3.Rtex index 6c73090..456cc37 100644 --- a/Chapter3.Rtex +++ b/Chapter3.Rtex @@ -183,7 +183,7 @@ Although my analysis implies that increased population structure does promote pa %#1. Zoonotic disease is bad (as you have written it already) Zoonotic pathogens make up the majority of newly emerging diseases and have profound consequences for public health, economics and international development \cite{jones2008global, smith2014global, ebolaWorldbank}. Better statistical models for predicting which wild host species are potential reservoirs of zoonotic diseases would allow us to optimise zoonotic disease surveillance and anticipate how the risks of disease spillover might change with global change. -The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity and interactions with humans, and the prevalence and the number of pathogen species it carries \cite{wolfe2000deforestation}. +The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity and interactions with humans, the prevalence of its pathogens and the number of pathogen species it carries \cite{wolfe2000deforestation}. However, the factors that control the number of pathogen species a host species carries remain poorly understood. @@ -196,7 +196,7 @@ However, the factors that control the number of pathogen species a host species A number of species traits that might control pathogen richness have been studied. These traits can be at the level of the individual (e.g., body mass and longevity) or the level of the population (e.g., population density, sociality and species range size). 
Large bodied animals have been shown to have high pathogen richness with large bodies providing more resources for pathogens \cite{kamiya2014determines, arneberg2002host, poulin1995phylogeny, bordes2008bat, luis2013comparison}. -Long lived species are expected to have high pathogen richness, because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}. +Long lived species are expected to have high pathogen richness because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}. Animal density \cite{kamiya2014determines, nunn2003comparative, arneberg2002host} and sociality \cite{bordes2007rodent, vitone2004body, altizer2003social, ezenwa2006host} are both predicted to increase pathogen richness by increasing the rate of spread, $R_0$, of a new pathogen. Finally, widely distributed species have high pathogen richness, potentially because they experience a wider range of environments or because they are sympatric with more species \cite{kamiya2014determines, nunn2003comparative, luis2013comparison}. @@ -224,7 +224,7 @@ However, in primates only a weak positive association between sociality and path Furthermore, a negative association was found in rodents \cite{bordes2007rodent} and in even and odd-toed hoofed mammals \cite{ezenwa2006host}. Finally, two studies tested for an association between group size and parasite richness in bats \cite{bordes2008bat, gay2014parasite}. Amongst 138 bat species, \textcite{bordes2008bat} found no relationship between group size (coded into four classes) and bat fly species richness. -\textcite{gay2014parasite} a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness. 
+\textcite{gay2014parasite} found a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness. While sociality is an important component of population structure it does not capture fully how connected the population is globally. @@ -433,7 +433,7 @@ Number of viruses against number of subspecies. Points are coloured by family, with families with less than 10 species being grouped into "other". Contours show the 2D density of points and suggest a positive correlation. ' -subvsvirusTitle <- 'Number of viruses against number of subspecies.' +subvsvirusTitle <- 'Number of viruses against number of subspecies' %%end.rcode %%begin.rcode subsDataFrame, fig.show = extraFigs, fig.cap = subvsvirus, fig.scap = subvsvirusTitle, out.width = '\\textwidth' @@ -1801,13 +1801,13 @@ This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rve I log transformed these variables as they were strongly right skewed. I tested for correlation between these two proxies for study effort using phylogenetic least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}). The log number of citations on PubMed and Google scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$). -As the correlation between citation counts is strong, I only used Google Scholar reference counts in subsequent analyses. +As the correlation between citation counts was strong, I only used Google Scholar reference counts in subsequent analyses. %See the appendix for analyses run using PubMed citations. 
-A number of other factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat}, range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}. +Two factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat} and range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}. These other factors were included to avoid spurious positive results occurring simply due to correlations between pathogen richness and a different, causal factor. -Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density is not included in the analysis as there is very little data for bat densities. -Measures of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. +Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density was not included in the analysis as there is very little data for bat densities. +Measurements of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. \emph{Pipistrellus pygmaeus} was assigned the same mass as \emph{P. 
pipistrellus} as they are indistinguishable by mass. Body mass measurements were log transformed as they were strongly right skewed. Distribution size was estimated by downloading range maps for all species from IUCN \cite{iucn} and were also log transformed due to right skew. @@ -1820,8 +1820,8 @@ Distribution size was estimated by downloading range maps for all species from I Statistical analysis for both response variables --- number of subspecies and effective level of gene flow --- was conducted using an information theoretical approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex}. I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. -In the analysis using the number of subspecies response variable I also modelled the interaction study effort and number of subspecies by including their product. -This interaction was included as I believed \emph{a priori} that this interaction may be present as subspecies in well studied species are more likely to be identified. +In the analysis using the number of subspecies response variable I also modelled the interaction between study effort and number of subspecies by including their product. +This interaction was included as I believed \emph{a priori} that this interaction may be important as subspecies in well studied species are more likely to be identified. The interaction was only included in models with both study effort and number of subspecies as individual terms. Following \textcite{whittingham2005habitat} I included a uniformly distributed random variable. This variable can be used to benchmark how important other explanatory variables are. @@ -1846,7 +1846,7 @@ The red scale bar shows 25 million years.' 
-treeTitle <- 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.' +treeTitle <- 'Pruned phylogeny showing number of pathogens and family' %%end.rcode @@ -1926,7 +1926,7 @@ The explanatory variables were centred and scaled to allow direct comparison of For each regression model I simultaneously fitted the $\lambda$ parameter as this avoids misspecifying the model \cite{revell2010phylogenetic}. Unlike the \emph{pgls} function, \emph{gls} does not constrain $\lambda$ to be in the range $\lbrack 0, 1\rbrack$. $\lambda < 0$ indicates that residuals from the fitted model are distributed on the phylogeny more uniformly than expected by chance. -$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when along a branch evolution occurs than the importance of the phylogeny. +$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when evolution occurs along a branch than the importance of the phylogeny. Further, fitting multiple parameters makes interpretation difficult. @@ -1934,7 +1934,7 @@ Further, fitting multiple parameters makes interpretation difficult. To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely). This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined. I calculated AICc for each model. -I calculated the average AICc, $\bar{\text{AICc}}$, by averaging AICc scores within models. +As each model was fitted 50 times, I calculated the average AICc, $\bar{\text{AICc}}$, by averaging AICc scores for each model. 
$\Delta\text{AICc}$ was calculated as $\text{min}(\bar{\text{AICc}}) - \bar{\text{AICc}}$, not the mean of the individual $\Delta\text{AICc}$ scores, to guarantee that the best model has $\Delta\text{AICc} = 0$. From these $\Delta\text{AICc}$ values I calculated Akaike weights, $w$. This value can be interpreted as the probability that a model is the best model, given the data, amongst those examined. @@ -1964,7 +1964,7 @@ The red line represents a phylogenetic simple regression between the two variabl boxplotTitle <- paste( 'The relationship between number of subspecies and viral richness for', nrow(nSpecies), -'bat species.' +'bat species' ) %%end.rcode @@ -2128,11 +2128,11 @@ Number of subspecies &&&&\\ \hspace{3mm}Models without interaction term && \rinline{nSpeciesCoefMean} &&\\ \hspace{3mm}Models with interaction term && \rinline{nSpeciesCoefMeanI} &&\\ Number of subspecies*log(Scholar) & \rinline{varWeights['scholarRefs.NumberOfSubspecies']} & \rinline{sprintf('%.2f', varCoefMeans['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm] -Gene flow & & & \rinline{fstVarWeights['Nm']} & \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm] +Gene flow & & & \rinline{sprintf('%.2f', fstVarWeights['Nm'])} & \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm] log(Scholar) & \rinline{sprintf('%.2f', varWeights['scholarRefs'])} & \rinline{varCoefMeans['beta.scholarRefs']} & - \rinline{fstVarWeights['scholarRefs']} & \rinline{fstCoefMeans['beta.scholarRefs']}\\ + \rinline{sprintf('%.2f', fstVarWeights['scholarRefs'])} & \rinline{fstCoefMeans['beta.scholarRefs']}\\ log(Mass) & \rinline{sprintf('%.2f', varWeights['mass'])} & \rinline{varCoefMeans['beta.mass']} & - \rinline{fstVarWeights['mass']} & \rinline{fstCoefMeans['beta.mass']}\\ + \rinline{sprintf('%.2f', fstVarWeights['mass'])} & \rinline{fstCoefMeans['beta.mass']}\\ log(Range size) & \rinline{sprintf('%.2f', varWeights['distrSize'])} & \rinline{varCoefMeans['beta.distrSize']}& \rinline{fstVarWeights['distrSize']} & 
\rinline{fstCoefMeans['beta.distrSize']}\\ Random & \rinline{sprintf('%.2f', varWeights['rand'])} & \rinline{varCoefMeans['beta.rand']}& @@ -2153,7 +2153,7 @@ Random & \rinline{sprintf('%.2f', varWeights['rand'])} & \rinline{varCoefMeans %Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). -The effective level of gene flow was likely in the best model ($Pr > 0.999$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). +The effective level of gene flow was likely in the best model ($Pr > 0.99$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. 
@@ -2171,12 +2171,12 @@ The thick bar of the boxplot shows the median value, the interquartile range is The red ``Random'' box is the uniformly random variable. Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." -ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness." +ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness" %%end.rcode -%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', cache = FALSE +%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = ITPlotTitle, out.width = '\\textwidth', cache = FALSE # Reorder var levels to get structure at beginning. fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) @@ -2214,7 +2214,7 @@ ggdraw() + -Study effort was very likely in the best model ($Pr > 0.999$) as was body mass ($Pr > 0.999$). +Study effort was very likely in the best model ($Pr > 0.99$) as was body mass ($Pr > 0.99$). However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). % which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. 
%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). @@ -2241,7 +2241,7 @@ fstRawDataTitle <- paste( 'Relationship between viral richness and log effective gene flow per generation for', nrow(fstFinal), -'bat species. +'bat species ') %%end.rcode @@ -2287,7 +2287,7 @@ This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinl %This fact implies that other factors must control pathogen richness. %It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. -Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.999$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). 
+Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.99$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))} which implied that the residuals from the models were very weakly phylogenetic. A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. @@ -2296,8 +2296,8 @@ A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) \subsubsection{Effective gene flow} -There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.999$). -Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.999$). +There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.99$). +Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.99$). Due to the limited sample size, significance tests are unlikely to have much power. There is little evidence of phylogenetic autocorrelation in study effort ($\lambda$ = \rinline{scholarFstLambda$param['lambda']}, $p$ = \rinline{scholarFstLambda$param.CI$lambda$bounds.p[1]}). 
However, there is some weak evidence of phylogenetic signal in range size as the estimated size of $\lambda$ is large while $p$ is also large, potentially due to a lack of statistical power ($\lambda$ = \rinline{distrFstLambda$param['lambda']}, $p$ = \rinline{distrFstLambda$param.CI$lambda$bounds.p[1]}). @@ -2344,7 +2344,7 @@ In contrast, one study \textcite{gay2014parasite} found the opposite relationshi Furthermore, \textcite{bordes2008bat} found no relationship between increased colony size and pathogen richness while \textcite{gay2014parasite} found relationships in opposite directions for virus and ectoparasite richness. However, the study by \textcite{gay2014parasite} uses relatively few species while the study by \textcite{bordes2008bat} uses group size which is a measure of local rather than global population structure. The overall weight of evidence suggests that population structure and pathogen richness are associated. -fur + @@ -2391,8 +2391,8 @@ This result is probably due to correlations with other variables in the analysis The relationship between increased population structure and pathogen richness suggests that population structure has at least some potential as being predictive of high pathogen richness and therefore of a species' likelihood of being a reservoir of a potentially zoonotic pathogen. However, given that it is difficult to measure population structure and given that the relationship appears to be weak at best, this trait on its own is unlikely to be useful in predicting zoonotic risk. -However, as number of other factors are also associated with pathogen richness such as body mass and to a lesser extent range size as shown here as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}. -Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential to be used in prioritising zoonotic disease surveillance. 
+However, a number of other factors are also associated with pathogen richness, such as body mass and, to a lesser extent, range size as shown here, as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}. +Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential for use in prioritising zoonotic disease surveillance. The main hurdle in this approach is finding a way to validate models; due to the study effort bias in current data, predictive models will also be biased. As unbiased pathogen surveys such as \textcite{anthony2013strategy} become more common good validation may become possible. Alternatively, predictive models could be trained on all available --- and therefore biased --- data and validated by predicting smaller, unbiased data sets such as the data collected in \textcite{maganga2014bat}. @@ -2400,9 +2400,9 @@ Alternatively, predictive models could be trained on all available --- and there The relationship between increased population structure and pathogen richness also has implications for habitat fragmentation and range shifts due to global change. In short, habitat fragmentation and range shifts that reduce movement between populations would be predicted to increase pathogen richness. However, depending on the mechanisms by which increased population structure increases pathogen richness this may not be a cause for concern. -If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over any appreciable timescale. +If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over short to medium-term timescales. 
If, however, increased population structure actively promotes the evolution of new pathogen strains or allows the persistence of more virulent strains \cite{blackwood2013resolving, pons2014insights, plowright2011urban} this could have important public health implications. -Therefore further studies on the exact mechanisms by which increased population structure affects pathogen richness is needed. +Therefore further studies on the exact mechanisms by which increased population structure affects pathogen richness are needed. \subsection{Study limitations} diff --git a/Chapter4.Rtex b/Chapter4.Rtex index 46427ab..d73701e 100644 --- a/Chapter4.Rtex +++ b/Chapter4.Rtex @@ -99,7 +99,7 @@ As these factors are all completely interdependent, it is impossible to identify % being addressed by this particular study. %\tmpsection{One sentence summarising the main result} % (with the words ``here we show'' or their equivalent). -Here I use metapopulation susceptible-infected-recovered (SIR) models to test whether the ability of a newly evolved pathogen to invade and persist in a population is controlled specifically by host density as opposed to host population size +Here I use metapopulation susceptible-infected-recovered (SIR) models to test whether the ability of a newly evolved pathogen to invade and persist in a population is controlled specifically by host density as opposed to host population size. I also test whether pathogens invade more readily into a population with large groups or many groups. I parameterised these metapopulations to mimic bat populations, as bats exhibit a large range in group (colony) size and geographic range size as well as being associated with a number of important zoonotic pathogens. %\tmpsection{Two or three sentences explaining what the main result reveals in direct comparison to what was thought to be the case previously} @@ -192,8 +192,8 @@ The amount of movement between groups is at least partially dependent on the dis %4. 
then talk about the limitations of taking a comparative approach and how mechanistic models can help Collinearity between explanatory variables is a common problem in correlative studies. -However, this issue is exacerbated when there are clear, causal relationships between explanatory variables (e.g.\ an increase in host density will directly cause an increase in host population size). -Therefore, correlative comparative studies will be especially poor at identifying which of factors are closely correlated with pathogen richness. +However, this issue is exacerbated when there are clear, causal relationships between explanatory variables (e.g.,\ an increase in host density will directly cause an increase in host population size). +Therefore, correlative comparative studies will be especially poor at identifying which of these factors are closely correlated with pathogen richness. If the aim of correlative studies is to create predictive models for estimating pathogen richness of wild animal species, these relationships are not an issue. In each of the above relationships ($d = N / a$ and $N = mn$), as long as two of three variables are included in a statistical model, all the variance in the third variable will also be captured. @@ -237,7 +237,7 @@ I used bats as a case study as the size of bat groups (colonies) is very variabl Furthermore, bats are particularly relevant in the context of zoonotic disease as they are thought to be reservoirs for a number of important, recent outbreaks \cite{calisher2006bats, li2005bats}. I examined how the interrelated population factors affect the ability of a newly evolved pathogen to invade and persist in a population in the presence of strong competition from an endemic pathogen strain. I used these simulations to test two specific hypotheses. -First, I tested whether host density or population size more strongly promotes the invasion of a new pathogen. 
+First, I tested whether host population size or density more strongly promotes the invasion of a new pathogen. Secondly, I tested whether the invasion of a new pathogen is more strongly promoted by colony size or the number of colonies. I found that population size has a much stronger affect on the invasion of a new pathogen than host density and that increasing population size by increasing group size promotes pathogen invasion much more than increasing population size by increasing the number of groups. @@ -697,7 +697,7 @@ popModel <- pop %>% plotKcapt <- ' Change in average metapopulation network degree ($\\bar{k}$) with increasing range size. Bars show the median, boxes show the interquartile range, vertical lines show the range and grey dots indicate outlier values. -Notches indicate the 95\\% confidence interval of the mean. +Notches indicate the 95\\% confidence interval of the median. All simulations had 20 colonies, meaning 19 is the maximum value of $\\bar{k}$. ' @@ -784,7 +784,7 @@ ggplot(dens1, aes(x = factor(area), y = meanK, colour = 'a', fill = 'a')) + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% I used a two-pathogen, metapopulation SIR model to compare the roles of population parameters on pathogen species richness. -The multipathogen SIR model was identical to that in Chapter~\ref{ch:sims1} using the same \emph{R} implementation \cite{R} \cite{metapopepi}. +The multipathogen SIR model was identical to that in Chapter~\ref{ch:sims1} using the same \emph{R} implementation \cite{metapopepi}. Specifically, I let two identical pathogens compete: an endemic pathogen (Pathogen 1) and an invading pathogen (Pathogen 2). I used persistence (coded as 1) or extinction (coded as 0) of Pathogen 2 as a binomial response variable. I tested whether host population size is more important than host density. 
@@ -804,7 +804,7 @@ The metapopulation network was created for each simulation by randomly placing c This square space was considered to be the species geographic range, the size of which was varied. Range size was varied between \SI{\rinline{min(area)}} and \SI{\rinline{max(area)}}{\square\kilo\metre}. This corresponds to square areas with sides of \rinline{sqrt(min(area))} to \SI{\rinline{sqrt(max(area))}}{\kilo\metre}. -Dispersal was only allowed to occur between two colonies if they are within \SI{100}{\kilo\metre} of each other i.e.\ they were then counted as connected nodes in the metapopulation network. +Dispersal was only allowed to occur between two colonies if they were within \SI{100}{\kilo\metre} of each other i.e.\ they were then counted as connected nodes in the metapopulation network. The number of connections each colony has is called its degree, $k$. The mean degree, $\bar{k}$ is a measure of how well connected the metapopulation network is overall. @@ -848,7 +848,7 @@ Again, this gave population size values of \SI{2000}, \SI{4000}, \SI{8000}, \SI{ \subsubsection{Colony size and the number of colonies} To compare colony size and the number of colonies, only the second and third set of simulations above were used. -However, colony size and the number of colonies were directly used as independent variables instead of using the derived values for host density or population size. +However, colony size and the number of colonies were directly used as independent variables instead of using the derived values for host population size or density. It can be seen that population density and range size are equivalent in the two sets of simulations. Therefore, the only difference between these two sets of simulations is the factor used to increase population size: this factor being either colony size or the number of colonies. 
@@ -910,7 +910,7 @@ densPreds <- rbind(dens1Model, dens2Model) %>% valueChangeMeansCapt <- ' Comparison of the effect of colony size, number of colonies and host density on probability of invasion. -The $x$-axis shows the relative change in the default values of each of these factors ($\\times$0.25, 0.5, 1, 2 and $4$). +The $x$-axis shows the change ($\\times$0.25, 0.5, 1, 2 and $4$) in each of these factors relative to the default value. Default values are: colony number = 20, colony size = 400 and density = 0.8 animals per \\si{\\square\\kilo\\metre}. Red lines: population size is altered by changing colony number. Blue lines: population size is altered by changing colony size. @@ -920,7 +920,7 @@ Curves are simple logistic regression fits for each independent variable. Relationships are shown separately for each transmission value, $\\beta$. ' -valueChangeMeansTitle <- 'Comparison of the effect of colony size, number of colonies and host density on probability of invasion.' +valueChangeMeansTitle <- 'Comparison of the effect of colony size, number of colonies and host density on probability of invasion' %%end.rcode @@ -1048,7 +1048,7 @@ It can be seen that changes in colony size give a much greater increase in invas Note that this is the same data as Figure~\\ref{fig:plotValueChangeMeans} but with the $x$-axis scaled by population size, rather than relative parameter change. ' -transMeansTitle <- 'Comparison of the probability of invasion when host population size is altered by changing colony size or colony number.' +transMeansTitle <- 'Comparison of the effect of colony size and number of colonies on probability of invasion' %%end.rcode @@ -1132,18 +1132,18 @@ fewAllExtinct <- rbind(dens1, dens2, pop) %>% %%%%%%%%%%%%%%%%%%%%%%%%%%%%% At the default parameter settings, the probability of invasion and establishment of the second pathogen, $P(I)$, was rare (Figure~\ref{fig:plotValueChangeMeans} and Tables~\ref{C-pop} -- \ref{C-dens2}). 
-These proportions significantly increase with transmission rate (GLM: $b$ = \rinline{transTest$estimate[2]}, $p$ = \rinline{transTest$p.value[2]}). +These proportions significantly increased with transmission rate (GLM: $b$ = \rinline{transTest$estimate[2]}, $p$ = \rinline{transTest$p.value[2]}). In \rinline{nAllExtinct} simulations, both of the pathogens went extinct. The number of simulations where both pathogens went extinct did not depend on transmission rate (GLM: $b$ = \rinline{transExtinctTest$estimate[2]}, $p$ = \rinline{transExtinctTest$p.value[2]}). -However all of the simulations with extinction of both pathogens had either the smallest colony size (colony size = 100, \rinline{smallAllExtinct} simulations) or with the fewest colonies (5 colonies, \rinline{fewAllExtinct} simulations). +However, all of the simulations with extinction of both pathogens had either the smallest colony size (colony size = 100, \rinline{smallAllExtinct} simulations) or the smallest number of colonies (5 colonies, \rinline{fewAllExtinct} simulations). Results from these simulations were removed before further analyses. -\subsection{Host density or population size} +\subsection{Host population size or density} -To test whether host density or population size had a stronger effect on invasion probability I compared the regression coefficients of the multiple regressions fitted to simulation results (Figure~\ref{fig:plotValueChangeMeans}). +To test whether host population size or density had a stronger effect on invasion probability I compared the regression coefficients of the multiple regressions fitted to simulation results (Figure~\ref{fig:plotValueChangeMeans}). Increasing host population size, either by increasing colony size or number of colonies, increased the probability of invasion (Table~\ref{t:regrCoefs}). 
The relationship between colony size and invasion is strong and significant at all transmission rates, while the relationship between colony number and invasion is weaker and more marginally significant. In contrast, varying host density does not alter invasion probability. @@ -1236,7 +1236,7 @@ However, it should be noted that these conclusions apply only to the specific me \subsection{Comparative studies} Many comparative studies measure some aspect of a species population size or structure, yet it is rarely discussed how these are related. -Instead most studies use the data that are available, without considering \emph{a priori} how the explanatory variables are causally related (though statistical correlations between independent variables is usually considered and dealt with using PCA or by removing collinear variables). +Instead most studies use the data that are available, without considering \emph{a priori} how the explanatory variables are causally related (though statistical correlations between independent variables are usually considered and dealt with using PCA or by removing collinear variables). Host density is often measured \cite{morand1998density, lindenfors2007parasite, nunn2003comparative, arneberg2002host} yet density is directly associated with population size. My results suggest that it is in fact population size that is important (in the context of social species as studied here). This leads to the suggestion that the density measures in these comparative studies are in fact proxies for population size rather than the true causal factor. @@ -1247,7 +1247,7 @@ Range size has been suggested to affect pathogen richness by a number of mechani The studies that have specifically tested the effect of group size have in fact found both positive \cite{vitone2004body} and negative associations \cite{gay2014parasite} or no relationship \cite{ezenwa2006host}. 
Meta-analyses suggest that the relationship between social group size and pathogen richness is weak \cite{rifkin2012animals}. However, I have found that group size is the most important population factor. -This suggests that the mechanism studied here --- invasion of recently evolved pathogens --- may not the major mechanism by which pathogen richness is created in wild populations. +This suggests that the mechanism studied here --- invasion of recently evolved pathogens --- may not be the major mechanism by which pathogen richness is created in wild populations. diff --git a/Chapter5.Rtex b/Chapter5.Rtex index 76502a1..7cedd3d 100644 --- a/Chapter5.Rtex +++ b/Chapter5.Rtex @@ -197,7 +197,6 @@ polys <- list( SW9=list(c(0, 0, pi/4),c(0, pi, pi/2)) ) - reg <- do.call(rbind, lapply(1:length(polys), function(x) data.frame(do.call(cbind, polys[[x]]), names(polys[x])))) names(reg) <- c('x', 'y', 'model') @@ -368,7 +367,7 @@ The equation for $\bar{p}$ has been newly derived for each submodel in the gREM, However, many models, although derived separately, have the same expression for $\bar{p}$. Figure~\ref{f:equalModelResults} shows the expression for $\bar{p}$ in each case. The general equation for density, \ref{e:gas}, is used with the correct value of $\bar{p}$ substituted. -Although more thorough checks are performed in the additional Python script, it can be seen that all adjacent expressions in Figure~\ref{f:equalModelResults} are equal when expressions for the boundaries between them are substituted in. +Although more thorough checks are performed in the additional \emph{Python} script, it can be seen that all adjacent expressions in Figure~\ref{f:equalModelResults} are equal when expressions for the boundaries between them are substituted in. 
%%begin.rcode equalRegionsResultCapt equalRegionsResultCapt <- ' @@ -472,7 +471,7 @@ ggplot(allMods, aes(x = model, y = percentageerror, colour = expression, fill = %\begin{figure}[t] % \centering % \includegraphics[width=7cm]{imgs/lucas_et_al_figure5.pdf} -% \caption[Simulation model results of the accuracy and precision for gREM submodels]{Simulation model results of the accuracy and precision for gREM submodels. +% \caption[Accuracy and precision for gREM submodels given ]{Simulation model results of the accuracy and precision for gREM submodels. %The percentage error between estimated and true density for each gREM sub model is shown within each box plot, where the black line represents the median percentage error across all simulations, boxes represent the middle 50\% of the data, whiskers represent variability outside the upper and lower quartiles with outliers plotted as individual points. %Box colours correspond to the expressions for average profile width $\bar{p}$ given in Figure 4. %} @@ -497,7 +496,7 @@ The numbers beneath each plot represent the coefficient of variation. The colour of each box plot corresponds to the expressions for average profile width $\\bar{p}$ given in Figure \\ref{f:equalModelResults}. ' -CapturesTitle <- 'Simulation model results of the accuracy and precision of four gREM submodels' +CapturesTitle <- 'Accuracy and precision of four gREM submodels given different numbers of captures' %%end.rcode @@ -562,7 +561,7 @@ The simple model is represented where time and maximum change in direction equal The colour of each box plot corresponds to the expressions for average profile width $\\bar{p}$ given in Figure \\ref{f:equalModelResults}. 
%todo ' -movtTitle <- 'Simulation model results of the accuracy and precision of four gREM submodels' +movtTitle <- 'Accuracy and precision of four gREM submodels given different movement models' %%end.rcode @@ -621,8 +620,16 @@ tortPlot <- ggplot(tort, aes(x = maxAngle, y = percentageerror, colour = express expression(phantom(over(0, 0))*pi*phantom(0)))) -plot_grid(waitPlot, tortPlot, labels = c("A", "B"), align = 'h', label_size = 10, ncol = 1) +# plot_grid(waitPlot, tortPlot, labels = c("A", "B"), align = 'h', label_size = 10, ncol = 1, fontfamily = 'lato light') + +# Combine and print the plots. +ggdraw() + + draw_label("A)", 0.02, 0.98, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(tortPlot, 0, 0, 1, 0.5) + + draw_label("B)", 0.02, 0.48, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(waitPlot, 0, 0.5, 1, 0.5) + %%end.rcode @@ -670,7 +677,7 @@ Both of these problems may cause biases in the gREM, as animals can move through As the gREM assumes constant surveillance, the error created by switching the sensor on and off quickly will become more important if the sensor is only on for short periods of time. I recommend that the gREM is applied to constantly sampled data, and the impacts of breaking these assumptions on the gREM should be further explored. -\subsection{Accuracy, Precision and Recommendations for Best Practice} +\subsection{Accuracy, precision and recommendations for best practice} Based on our simulations, I believe that the gREM has the potential to produce accurate estimates for many different species, using either camera traps or acoustic detectors. However, the precision of the gREM differed between submodels. For example, when the sensor and signal width were small, the precision of the model was reduced. 
diff --git a/imgs/movtFig-Edit.svg b/imgs/movtFig-Edit.svg index c99a7aa..3bfe375 100644 --- a/imgs/movtFig-Edit.svg +++ b/imgs/movtFig-Edit.svg @@ -15,10 +15,10 @@ width="463.75" height="450" xml:space="preserve" - sodipodi:docname="movtFig-Edit.pdf">image/svg+xmlPercentage Error A + id="tspan478">A) Percent Error B + id="tspan1176">B) Date: Mon, 25 Jul 2016 10:55:24 +0100 Subject: [PATCH 09/17] Add +ch2 data readme re #38. --- data/Chapter2/README.md | 68 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 68 insertions(+) create mode 100644 data/Chapter2/README.md diff --git a/data/Chapter2/README.md b/data/Chapter2/README.md new file mode 100644 index 0000000..b0d6192 --- /dev/null +++ b/data/Chapter2/README.md @@ -0,0 +1,68 @@ +Data for Chapter 2 +=================== + +Understanding how population structure affects pathogen richness in a mechanistic model of bat populations +----------------------------------------------------------------------------------------------------------- + +The data in this folder are all outputs from simulations. +Running the simulations in [Chapter2.Rtex](Chapter2.Rtex) will reproduce these files. + + +Main study +----------- + +[DispSims.csv](data/Chapter2/DispSims.csv) and [extraMidBeta.csv](data/Chapter2/extraMidBeta.csv) are the results for the dispersal simulations. + +[TopoSims.csv](data/Chapter2/TopoSims.csv) contains the results for the network topology simulations. + +[unstructuredSims.csv](data/Chapter2/unstructuredSims.csv) contains the results for the unstructured simulations (i.e. one colony with population size 30,000). + +[noDispSims.csv](data/Chapter2/noDispSims.csv) contains the results for the zero dispersal simulations which are then also used for the completely unconnected metapopulaiton network as well. + +These files are all flat csv files, with missing data indicated by 'NA'. +They contain a column of row names, which in this case are just integers 1 - number of rows. 
+ + +All these files contain columns: +* transmission - the transmission rate (beta) +* dispersal - the dispersal rate (delta) +* nExtantDis - the number of disease extant at the end of the simulation +* singleInf - the number of individuals infected with one pathogen +* doubleInf - the number of individuals infected with two pathogens +* nColonies - the number of colonies in the metapopulation +* meanK - the mean degree of the metapopulation network +* maxDistance - The threshold distance within which two colonies are joined in the metapopulation network (always 100 in these simulations) +* nEvents - the total number of events in the simulation (8e5) + +[extraMidBeta.csv](data/Chapter2/extraMidBeta.csv) and [noDispSims.csv](data/Chapter2/noDispSims.csv) additionally contains columns relating to the absolute length of time the simulation covers (simulation time, not computational time). + +* extinctionTime - at what time did the invading pathogen go extinct (NA if it didn't go extinct). +* totalTime - total length of time simulated +* survivalTime - how long did the invading pathogen survive (NA if it didn't go extinct). +* pathInv - the time that the second pathogen was seeded into the simulation. Pathogen 2 is seeded after a certain number events, not a strict amount of time, after the beginning of the simulation. + + + + +Appendices +------------ + +[Appen1.RData](data/Chapter2/Appen1.RData), [Appen22.RData](data/Chapter2/Appen22.RData), [Appen24.RData](data/Chapter2/Appen24.RData) and [Appen25.RData](data/Chapter2/Appen25.RData) contain the full simulation objects for the four example simulations plotted in the appendix. +The objects are named either p1 or p2. +They are a large list containing all the information used to define the simulations and the full output of the simulations (see [help files](https://github.com/timcdlucas/MetapopEpi/blob/master/man/makePop.Rd) in the [MetapopEpi](https://github.com/timcdlucas/metapopEpi) package for more details). 
+See [Appendix2.Rtex](Appendix2.Rtex) for code for generating and analysing the data. + + + + +Length of simulations +--------------------- + +The files in [DispSims](data/Chapter2/DispSims) and [TopoSims](data/Chapter2/TopoSims) are generated and analysed in [Appendix2.Rtex](Appendix2.Rtex) and are used to check whether using a fixed number of events rather than a fixed simulation time introduced bias (see methods in [Chapter2.Rtex](Chapter2.Rtex)). +They are all `.RData` files containing a single simulation object as described above. + + + + + + From df2370fb8fb94a9310549b1c756fe08c106b009c Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 11:04:51 +0100 Subject: [PATCH 10/17] Fix relative links in data README. --- data/Chapter2/README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/data/Chapter2/README.md b/data/Chapter2/README.md index b0d6192..6f690af 100644 --- a/data/Chapter2/README.md +++ b/data/Chapter2/README.md @@ -5,19 +5,19 @@ Understanding how population structure affects pathogen richness in a mechanisti ----------------------------------------------------------------------------------------------------------- The data in this folder are all outputs from simulations. -Running the simulations in [Chapter2.Rtex](Chapter2.Rtex) will reproduce these files. +Running the simulations in [Chapter2.Rtex](../../Chapter2.Rtex) will reproduce these files. Main study ----------- -[DispSims.csv](data/Chapter2/DispSims.csv) and [extraMidBeta.csv](data/Chapter2/extraMidBeta.csv) are the results for the dispersal simulations. +[DispSims.csv](DispSims.csv) and [extraMidBeta.csv](extraMidBeta.csv) are the results for the dispersal simulations. -[TopoSims.csv](data/Chapter2/TopoSims.csv) contains the results for the network topology simulations. +[TopoSims.csv](TopoSims.csv) contains the results for the network topology simulations. 
-[unstructuredSims.csv](data/Chapter2/unstructuredSims.csv) contains the results for the unstructured simulations (i.e. one colony with population size 30,000). +[unstructuredSims.csv](unstructuredSims.csv) contains the results for the unstructured simulations (i.e. one colony with population size 30,000). -[noDispSims.csv](data/Chapter2/noDispSims.csv) contains the results for the zero dispersal simulations which are then also used for the completely unconnected metapopulaiton network as well. +[noDispSims.csv](noDispSims.csv) contains the results for the zero dispersal simulations which are then also used for the completely unconnected metapopulaiton network as well. These files are all flat csv files, with missing data indicated by 'NA'. They contain a column of row names, which in this case are just integers 1 - number of rows. @@ -34,7 +34,7 @@ All these files contain columns: * maxDistance - The threshold distance within which two colonies are joined in the metapopulation network (always 100 in these simulations) * nEvents - the total number of events in the simulation (8e5) -[extraMidBeta.csv](data/Chapter2/extraMidBeta.csv) and [noDispSims.csv](data/Chapter2/noDispSims.csv) additionally contains columns relating to the absolute length of time the simulation covers (simulation time, not computational time). +[extraMidBeta.csv](extraMidBeta.csv) and [noDispSims.csv](noDispSims.csv) additionally contains columns relating to the absolute length of time the simulation covers (simulation time, not computational time). * extinctionTime - at what time did the invading pathogen go extinct (NA if it didn't go extinct). 
* totalTime - total length of time simulated @@ -47,10 +47,10 @@ All these files contain columns: Appendices ------------ -[Appen1.RData](data/Chapter2/Appen1.RData), [Appen22.RData](data/Chapter2/Appen22.RData), [Appen24.RData](data/Chapter2/Appen24.RData) and [Appen25.RData](data/Chapter2/Appen25.RData) contain the full simulation objects for the four example simulations plotted in the appendix. +[Appen1.RData](Appen1.RData), [Appen22.RData](Appen22.RData), [Appen24.RData](Appen24.RData) and [Appen25.RData](Appen25.RData) contain the full simulation objects for the four example simulations plotted in the appendix. The objects are named either p1 or p2. They are a large list containing all the information used to define the simulations and the full output of the simulations (see [help files](https://github.com/timcdlucas/MetapopEpi/blob/master/man/makePop.Rd) in the [MetapopEpi](https://github.com/timcdlucas/metapopEpi) package for more details). -See [Appendix2.Rtex](Appendix2.Rtex) for code for generating and analysing the data. +See [Appendix2.Rtex](../../Appendix2.Rtex) for code for generating and analysing the data. @@ -58,7 +58,7 @@ See [Appendix2.Rtex](Appendix2.Rtex) for code for generating and analysing the d Length of simulations --------------------- -The files in [DispSims](data/Chapter2/DispSims) and [TopoSims](data/Chapter2/TopoSims) are generated and analysed in [Appendix2.Rtex](Appendix2.Rtex) and are used to check whether using a fixed number of events rather than a fixed simulation time introduced bias (see methods in [Chapter2.Rtex](Chapter2.Rtex)). +The files in [DispSims](DispSims) and [TopoSims](TopoSims) are generated and analysed in [Appendix2.Rtex](../../Appendix2.Rtex) and are used to check whether using a fixed number of events rather than a fixed simulation time introduced bias (see methods in [Chapter2.Rtex](../../Chapter2.Rtex)). They are all `.RData` files containing a single simulation object as described above. 
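As a rough illustration of the column layout described in the Chapter 2 README above, the R sketch below reads a small made-up table shaped like `DispSims.csv` (a leading row-name column, `NA` for missing values) and summarises it by dispersal rate. The rows are invented, and treating `nExtantDis == 2` as a successful invasion is an assumption made for this example rather than something the README states.

```r
# Made-up rows mimicking the layout of DispSims.csv (row-name column first,
# missing values written as NA). None of these values are real results.
csv_text <- paste(
  '"","transmission","dispersal","nExtantDis"',
  '"1",0.1,0.001,1',
  '"2",0.1,0.001,2',
  '"3",0.1,0.1,2',
  '"4",0.1,0.1,NA',
  sep = '\n'
)

# With the real file, replace `text = csv_text` with the path to DispSims.csv.
sims <- read.csv(text = csv_text, row.names = 1, na.strings = 'NA')

# Assumption: both pathogens extant at the end (nExtantDis == 2) means the
# second pathogen invaded successfully.
sims$invaded <- sims$nExtantDis == 2

# Proportion of successful invasions per dispersal rate (NA rows are dropped
# by the default na.action of the formula interface).
invasion_by_dispersal <- aggregate(invaded ~ dispersal, data = sims, FUN = mean)
print(invasion_by_dispersal)
```

The same pattern should carry over to the other flat csv files in the folder, since they share these columns.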
From 2b4c567b62b3cf8b10d2ff53dd9c03d345ba36f9 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 11:39:50 +0100 Subject: [PATCH 11/17] Add data readme for +ch5. re #38. --- data/Chapter5/README.md | 49 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) create mode 100644 data/Chapter5/README.md diff --git a/data/Chapter5/README.md b/data/Chapter5/README.md new file mode 100644 index 0000000..dc4c2b5 --- /dev/null +++ b/data/Chapter5/README.md @@ -0,0 +1,49 @@ +Chapter 5: A generalised random encounter model for estimating animal density with remote sensor data +======================================================================================================= + +This directory contains the data generated for Chapter 5. + +The data is all generated by simulations. +The code for the simulations was run by Elizabeth Moorcroft. +However, I have failed to get the code off her yet. +Sorry about that. + + + + + +[Allmodels_percenterror.csv](Allmodels_percenterror.csv) contains the raw data for the error of each simulation for each model over a fixed length of simulation time. +Code to reformat this data is in [Chapter5DataReformat.R](../../Chater5DataReformat.R) and the tidy data is in [all_models_tidy.csv](all_models_tidy.csv). + + +[AllmodelBias.csv](AllmodelBias.csv) contains a summary of the [Allmodels_percenterror.csv](Allmodels_percenterror.csv) data. +It contains columns: +* Model - which gREM submodel +* Mean - Mean error between gREM estimated and true density +* StErr - standard error of the error between gREM estimated and true density + + + +[Allmodels_fixedcaps_variabletime_percenterror.csv](Allmodels_fixedcaps_variabletime_percenterror.csv) contains the raw data for the error of each simulation for four of the submodels for a given number of captures. +Code to reformat this data is in [Chapter5DataReformat.R](../../Chater5DataReformat.R) and the tidy data is in [captures_tidy.csv](captures_tidy.csv). 
+This data is used to create [Figure 5.6](../../figure/Captures-1.pdf). + + + + +[Prop_time_still_percentageerror.csv](Prop_time_still_percentageerror.csv) contains the raw data for the error of each simulation for four of the submodels with individuals remaining stationary for different proportions of events. +Code to reformat this data is in [Chapter5DataReformat.R](../../Chater5DataReformat.R) and the tidy data is in [prop_time_still_tidy.csv](prop_time_still_tidy.csv). +This data is used to create [Figure 5.7A](../../figure/movtFig-1.pdf). + + +[max_angle_change_percentageerror.csv](max_angle_change_percentageerror.csv) contains the raw data for the error of each simulation for four of the submodels for different movement angles in the random walk. +Code to reformat this data is in [Chapter5DataReformat.R](../../Chater5DataReformat.R) and the tidy data is in [max_angle_tidy.csv](max_angle_tidy.csv). +This data is used to create [Figure 5.7B](../../figure/movtFig-1.pdf). +[AngleCorrWalkData.csv](AngleCorrWalkData.csv) contains a summary of this data. + + + +[Sensitivity_percentageerror.csv](Sensitivity_percentageerror.csv) contains data on the error of estimated population densities when different parameters (e.g. call width, detector width) are input incorrectly. +That is, this data is for a sensitivity analysis on the fixed parameters. +This data was used to create Figure D.2 though this figure was created by Elizabeth Moorcroft and I don't have the code for it. +Figure D.2 is a panel figure including [AverageModelBias_callerror.pdf](../../imgs/AverageModelBias_callerror.pdf), [AverageModelBias_cameraerror.pdf](../../imgs/AverageModelBias_cameraerror.pdf), [AverageModelBias_radiuserror.pdf](../../imgs/AverageModelBias_radiuserror.pdf) and [AverageModelBias_speederror.pdf](../../imgs/AverageModelBias_speederror.pdf). 
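The Chapter 5 README above describes `AllmodelBias.csv` as a per-submodel summary (`Model`, `Mean`, `StErr`) of the raw percentage errors. A minimal R sketch of how such a summary might be computed from tidy data is given below. The input column names `model` and `percentageerror` are borrowed from the plotting code in the chapter; the model labels and values are placeholders, and computing `StErr` as the standard deviation divided by the square root of the number of simulations is an assumption, since the original simulation code is not available.

```r
# Made-up tidy data: one row per simulation, with a submodel label and the
# percentage error of the density estimate. Labels and values are invented.
tidy <- data.frame(
  model = c('modelA', 'modelA', 'modelA', 'modelB', 'modelB', 'modelB'),
  percentageerror = c(-2, 0, 2, 1, 3, 5)
)

# Assumed definition of the standard error of the mean.
std_err <- function(x) sd(x) / sqrt(length(x))

# tapply() returns groups in sorted order, matching sort(unique(...)).
bias <- data.frame(
  Model = sort(unique(tidy$model)),
  Mean = as.numeric(tapply(tidy$percentageerror, tidy$model, mean)),
  StErr = as.numeric(tapply(tidy$percentageerror, tidy$model, std_err))
)
print(bias)
```

If the real summary was produced differently (for example with a different error measure), only `std_err` and the column choice would need to change.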
From f98999f1c0f7942a9bc6eb7853eb66ce50b5fa9f Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 11:51:02 +0100 Subject: [PATCH 12/17] +ch4 data readme. re #38. --- data/Chapter4/README.md | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 data/Chapter4/README.md diff --git a/data/Chapter4/README.md b/data/Chapter4/README.md new file mode 100644 index 0000000..daa89f5 --- /dev/null +++ b/data/Chapter4/README.md @@ -0,0 +1,32 @@ +Chapter 4: A mechanistic model to compare the importance of interrelated population measures: host population size, density and colony size +=================================================================================================================================================================== + +The four csv files in this directory contain the data needed for Chapter 4. +As with Chapter 2, the raw simulation data was not saved (it was pretty large). +Only a summary was recorded. +However, the full, raw data, or this summary can be recreated by running [Chapter4.Rtex](../../Chapter4.Rtex). + +[PopSims.csv](PopSims.csv) contains the results for the simulations where population size is kept constant while density is altered by changing area. + +[DensSims.csv](DensSims.csv) contains the results for the simulations where population size is altered by changing colony size and population density is kept constant by altering area to match the changing population size. + + +[Dens2Sims.csv](Dens2Sims.csv) contains the results for the simulations where population size is altered by changing the number of colonies and population density is kept constant by altering area to match the changing population size. 
+ + + + +All these files contain columns: +* transmission - the transmission rate (beta) +* dispersal - the dispersal rate (delta) +* nExtantDis - the number of diseases extant at the end of the simulation +* nPathogens - The number of pathogens put into the simulations (always 2) +* meanK - the mean degree of the metapopulation network +* maxDistance - The threshold distance within which two colonies are joined in the metapopulation network (always 100 in these simulations) +* nEvents - the total number of events in the simulation (8e5) +* colonySize - the starting number of individuals in each colony +* colonyNumber - the number of colonies in the metapopulation +* pop - the total population size +* area - the spatial area for each simulation +* dens - population per area. + From 35fae81b3080d9b78c40308445d1df937a05d5b8 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 16:47:47 +0100 Subject: [PATCH 13/17] Add copies of +ch2 and +ch3 so that github links make sense. *added a note in each copy directing to original file. *I think for now it is easiest to still use Chapter*.Rtex as the master files and update the copies. 
--- Chapter2.Rtex | 2 +- Chapter3.Rtex | 2 +- ...t of the role of population structure.Rtex | 2707 +++++++++++++++++ comparative-test-of-pop-structure.Rtex | 2707 +++++++++++++++++ ...s-pathogen-richness-mechanistic-model.Rtex | 1503 +++++++++ 5 files changed, 6919 insertions(+), 2 deletions(-) create mode 100644 comparative test of the role of population structure.Rtex create mode 100644 comparative-test-of-pop-structure.Rtex create mode 100644 population-structure-affects-pathogen-richness-mechanistic-model.Rtex diff --git a/Chapter2.Rtex b/Chapter2.Rtex index b36297a..9424734 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -1083,7 +1083,7 @@ As with the $\xi = 0$ results, these tests were performed both with and without Finally, I also used binomial GLMs to test the hypothesis that the probability of invasion increased with transmission rate. Separate GLMs were fitted for each dispersal rate and network topology. All statistical analyses were performed using the \emph{stats} package in \emph{R}. -The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter2.Rtex}. +The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/population-structure-affects-pathogen-richness-mechanistic-model.Rtex}. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% diff --git a/Chapter3.Rtex b/Chapter3.Rtex index 456cc37..004ea2a 100644 --- a/Chapter3.Rtex +++ b/Chapter3.Rtex @@ -1818,7 +1818,7 @@ Distribution size was estimated by downloading range maps for all species from I \subsection{Statistical analysis} Statistical analysis for both response variables --- number of subspecies and effective level of gene flow --- was conducted using an information theoretical approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. -All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex}. +All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/comparative-test-of-pop-structure.Rtex}. I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. In the analysis using the number of subspecies response variable I also modelled the interaction between study effort and number of subspecies by including their product. This interaction was included as I believed \emph{a priori} that this interaction may be important as subspecies in well studied species are more likely to be identified. 
diff --git a/comparative test of the role of population structure.Rtex b/comparative test of the role of population structure.Rtex new file mode 100644 index 0000000..e77ad2c --- /dev/null +++ b/comparative test of the role of population structure.Rtex @@ -0,0 +1,2707 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code and text for "A comparative test of the role of population structure in determining pathogen richness" +% Chapter 2 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +% NB This file is a copy due to the mess up with chapter numbers. +% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex +% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + +%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide', eval = TRUE + + +################################## +### Run web scraping? ### +################################## + +# There's some slow webscrapping functions. Run them? +runPubmedScrape <- FALSE +runScholarScrape <- FALSE +runFstScrape <- FALSE + + +# Run slow bootstrapping? +subBoots <- FALSE +fstBoots <- FALSE +batclocksBoots <- FALSE + +# Run slow fst data wrangling as some is slow. +fstComb <- FALSE +runIucn <- FALSE + +# There are figures created in the data analysis which are not in the final chapter document. +# If TRUE, they will be included in the output. +# Use 'hide' to remove them. +extraFigs <- 'hide' + +#knitr options +opts_chunk$set(cache.path = '.Ch3Cache/') +source('misc/KnitrOptions.R') + +# ggplot2 theme. 
+source('misc/theme_tcdl.R') +theme_set(theme_grey() + theme_tcdl) + + +# Choose the number of cores to use +nCores <- 4 + +%%end.rcode + + +%%begin.rcode libs, cache = FALSE, result = FALSE + +# Data handling +library(dplyr) +library(broom) +library(readxl) +library(sqldf) +library(reshape2) + +# phylogenetic regression +library(ape) +library(caper) +library(phytools) +library(nlme) +library(qpcR) +library(car) + +# weighted means + var +library(Hmisc) + +# Plotting +library(ggplot2) +library(ggtree) +library(palettetown) +library(ggthemes) +library(GGally) +library(cowplot) + + +# Web scraping. +library(rvest) + +# For synonym list +library(taxize) + +# Spatial analysis +library(maptools) +library(geosphere) + +# Parallel computation +library(parallel) + +%%end.rcode + + + +%%begin.rcode parameters + + +# Define some parameters. +# This is useful at the top so that it can go in text. + +# How many bootstraps for model selection NULL variable +nBoots <- 50 + +# What proportion of a species range should be covered for an Fst study to count as valid. +rangeUseable <- 0.20 + +%%end.rcode + +\section{Abstract} + + +%\tmpsection{One or two sentences providing a basic introduction to the field} +% comprehensible to a scientist in any discipline. +\lettr{Z}oonotic diseases make up the majority of human infectious diseases and are a major drain on healthcare resources and economies. +Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens, all else being equal. +However, the factors that influence pathogen richness in animal species are poorly understood. +% +% +%\tmpsection{Two to three sentences of more detailed background} +% comprehensible to scientists in related disciplines. +% Theory led. +The pattern of contacts between individuals (i.e.\ population structure) can be influenced by habitat fragmentation, sociality and dispersal behaviour. 
+Epidemiological theory suggests that increased population structure can promote pathogen richness by reducing competition between pathogen species. +Conversely, it is often assumed that as greater population structure slows the spread of a new pathogen (i.e.\ lowers $R_0$), less structured populations should have greater pathogen richness. +% +% +%\tmpsection{One sentence clearly stating the general problem (the gap)} +% being addressed by this particular study. +Previous comparative studies of pathogen richness and population structure measured population structure differently and have had contradictory results, complicating the interpretation. +% +% +%\tmpsection{One sentence summarising the main result} +% (with the words “here we show” or their equivalent). +Here I test whether increased population structure correlates with viral richness using comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny. +This is an indirect test between the two competing hypotheses: does increased population structure allow pathogen coexistence by reducing competition, or does increased population structure decrease $R_0$ and therefore cause fewer new pathogens to enter the population? +Bats, as a group, make a useful case study because they have been associated with a number of important, recent zoonotic outbreaks. +Unlike previous studies, I used two measures of population structure: the number of subspecies and effective levels of gene flow. +I find that both measures are positively associated with pathogen richness. +% +% +%\tmpsection{Two or three sentences explaining what the main result reveals in direct comparison to what was thought to be the case previously} +% or how the main result adds to previous knowledge +My results add more robust support to the hypothesis that increased population structure promotes viral richness in bats. 
+The results support the prediction that increased population structure allows greater pathogen richness by reducing competition between pathogens. +The prediction that factors that decrease $R_0$ should decrease pathogen richness is not supported. +% +% +%\tmpsection{One or two sentences to put the results into a more general context.} +Although my analysis implies that increased population structure does promote pathogen richness in bats, the weakness of the relationship and the difficulty in obtaining some measurements mean that this is probably not a useful, predictive factor on its own for optimising zoonotic surveillance. +%However, the relationship has implications for global change, implying that increased habitat fragmentation might promote greater viral richness in bats. + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Introduction} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%#the introduction is not bad and starts very well but i think you need a bit more from studies of other mammals (not bats) to put the study into context as well as explaining why particularly you focus on pop structure, some justification of why bats, and less detail about the specific Fst measures (move to methods) and more stuff on your actual methods and approach you use in this study. + +%#Structure could be: +%#1. Zoonotic disease is bad (as you have written it already) +%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. However, pop structure (explain what this means) is of particular interest because of blah blah. +%#3.
Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition +%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). +%#5. Bats are very interesting in this regard because of blah +%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny. +%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness_ and two measures of population structure. +%#8. I found ... + +\tmpsection{General Intro} + +%#1. Zoonotic disease is bad (as you have written it already) +Zoonotic pathogens make up the majority of newly emerging diseases and have profound consequences for public health, economics and international development \cite{jones2008global, smith2014global, ebolaWorldbank}. +Better statistical models for predicting which wild host species are potential reservoirs of zoonotic diseases would allow us to optimise zoonotic disease surveillance and anticipate how the risks of disease spillover might change with global change. +The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity and interactions with humans, the prevalence of its pathogens and the number of pathogen species it carries \cite{wolfe2000deforestation}. 
+However, the factors that control the number of pathogen species a host species carries remain poorly understood. + + +\tmpsection{Specific Intro} + +%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. +\tmpsection{Theoretical background} + + +A number of species traits that might control pathogen richness have been studied. +These traits can be at the level of the individual (e.g., body mass and longevity) or the level of the population (e.g., population density, sociality and species range size). +Large bodied animals have been shown to have high pathogen richness with large bodies providing more resources for pathogens \cite{kamiya2014determines, arneberg2002host, poulin1995phylogeny, bordes2008bat, luis2013comparison}. +Long lived species are expected to have high pathogen richness because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}. +Animal density \cite{kamiya2014determines, nunn2003comparative, arneberg2002host} and sociality \cite{bordes2007rodent, vitone2004body, altizer2003social, ezenwa2006host} are both predicted to increase pathogen richness by increasing the rate of spread, $R_0$, of a new pathogen. +Finally, widely distributed species have high pathogen richness, potentially because they experience a wider range of environments or because they are sympatric with more species \cite{kamiya2014determines, nunn2003comparative, luis2013comparison}. + +%# However, pop structure (explain what this means) is of particular interest because of blah blah. + +%#3. 
Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition + + +A further population-level factor that may affect pathogen richness is population structure. +Population structure can be defined as the extent to which interactions between individuals in a population are non-random. +The role of population structure in human epidemics has been studied in depth and it has been shown that decreased population structure increases the speed of pathogen spread and makes establishment of a new pathogen more likely \cite{colizza2007invasion, vespignani2008reaction}. +In comparative studies of pathogen richness in wild animals, this effect on pathogen spread, and hence on $R_0$, is often taken as a prediction that host species with less structured populations will have higher pathogen richness than other host species \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. +However, epidemiological models of highly virulent pathogens have shown that increased population structure can allow persistence of a pathogen where a well-mixed population would experience a single, large epidemic followed by pathogen extinction \cite{blackwood2013resolving, plowright2011urban}. +Furthermore, the assumption that high $R_0$ leads to high pathogen richness ignores inter-pathogen competition. +Simple epidemiological models of competition between multiple pathogens show that, in completely unstructured populations, a competitive exclusion process occurs but that adding population structure makes coexistence possible \cite{qiu2013vector, allen2004sis, nunes2006localized}. + + +\tmpsection{Previous Studies} + +%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). 
+ +There is a lack of large, comparative studies of the role of population structure in pathogen richness. +Sociality, which is one component of population structure, has been well studied. +However, in primates only a weak positive association between sociality and pathogen richness was found \cite{vitone2004body}. +Furthermore, a negative association was found in rodents \cite{bordes2007rodent} and in even- and odd-toed hoofed mammals \cite{ezenwa2006host}. +Finally, two studies tested for an association between group size and parasite richness in bats \cite{bordes2008bat, gay2014parasite}. +Amongst 138 bat species, \textcite{bordes2008bat} found no relationship between group size (coded into four classes) and bat fly species richness. +\textcite{gay2014parasite} found a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness. +While sociality is an important component of population structure, it does not fully capture how connected the population is globally. + + +%#5. Bats are very interesting in this regard because of blah + +%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny. + + +Three studies have used comparative data to test for an association between global population structure and viral richness in bats. +A study on 15 African bat species found a positive relationship between the extent of distribution fragmentation and viral richness \cite{maganga2014bat}. +Conversely, a study on 20 South-East Asian bat species found a negative relationship \cite{gay2014parasite}. +These studies used the ratio between the perimeter and area of the species' geographic range as their measure of population structure. 
+However, range maps are very coarse for many species. +Furthermore, range maps are likely to be more detailed (and therefore have a greater perimeter) in well-studied species. + +A global study on 33 bat species found a positive relationship between $F_{ST}$ --- a measure of genetic structure --- and viral richness \cite{turmelle2009correlates}. +However, this study included $F_{ST}$ estimates based on mtDNA, which only reflects female dispersal. This may have biased the results, as many bat species show female philopatry \cite{kerth2002extreme, hulva2010mechanisms}. +Furthermore, this study used measures of $F_{ST}$ irrespective of the spatial scale of the source study, which ranged from tens \cite{mccracken1981social} to thousands \cite{petit1999male} of kilometres. +As isolation by distance has been shown in a number of bat species \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}, this could bias results further. +Finally, when a global $F_{ST}$ value was not given, \textcite{turmelle2009correlates} used the mean of all pairwise $F_{ST}$ values between sites. +This is not correct as pairwise and global $F_{ST}$ values have different relationships with effective migration rates. + + + +\tmpsection{The gap} +\tmpsection{What I did/found} + +%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure. +%#8. I found ... + +Here I used a phylogenetic comparative approach to test for a relationship between population structure and pathogen richness in the largest study of bats to date. 
+I used phylogenetic linear models, controlling for the other life history characteristics known to impact pathogen richness, to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure: the number of subspecies and effective gene flow. +I used two measures of population structure to increase the robustness of the analysis; this is particularly important as previous studies have had contradictory results \cite{maganga2014bat, gay2014parasite, turmelle2009correlates}. + +I found that both measures of population structure are positively associated with viral richness and are included as explanatory variables in the best models for describing viral richness. +Furthermore, I found that the role of phylogeny is very weak both in the models and in the distribution of viral richness amongst taxa. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Methods} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\subsection{Data Collection} + +\subsubsection{Pathogen richness} + +To measure pathogen richness I used data from \textcite{luis2013comparison}. +These data simply record known infections of a bat species with a virus species. +I have used viral richness as a proxy for pathogen richness more generally. +Rows with host species that were not identified to species level according to \textcite{wilson2005mammal} were removed. +Many viruses were not identified to species level or their specified species names were not in the ICTV virus taxonomy \cite{ICTV}. +Therefore, I counted such a virus only if it was the only virus recorded for that host species within the lowest taxonomic level to which it was identified (and which is present in the ICTV taxonomy). 
+For example, if a host is recorded as harbouring an unknown Paramyxoviridae virus, then it is logical to assume that the host carries at least one Paramyxoviridae virus. +If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, it is hard to confirm that the unknown virus is not another record of the known virus. +In this case, the host would be counted as having one virus species. + + +%$F_{ST}$ studies are conducted at a range of spatial scales, but $F_{ST}$ often increases with distance studied \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}. +%To minimise the effects of this I only used data from studies that cover \rinline{rangeUseable * 100}\% of the diameter of the species range. +%This is a largely arbitrary value that could be considered to reflect a ``global'' estimate of $F_{ST}$ while keeping a reasonable number of data points available. +%I calculated the diameter of the species range by finding the furthest apart points in the IUCN species range \cite{iucn} even if the range is split into multiple polygons. +%The width covered by each study was the distance between the most distant sampling sites. +%When this was not explicit in the paper, the centre of the lowest level of geographic area was used. + + + + +%%begin.rcode luis2013virusRead + +#read in luis2013virus data +virus2 <- read.csv('data/Chapter3/luis2013comparison.csv', stringsAsFactors = FALSE) + + +virus2$binomial <- paste(virus2$host.genus, virus2$host.species) + + +# From methods +#Many viruses were not identified to species level or their identified species was not in the ICTV virus taxonomy \cite{ICTV}. +#I counted a virus if it was the only virus, for that host species, in the lowest taxonomic level identified in the ICTV taxonomy. +#That is, if a host carries an unknown Paramyxoviridae virus, then it must carry at least one Paramyxoviridae virus. 
+#If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, then it is hard to confirm that the unknown virus is not another record of the known virus. +#In this case, this would be counted as one virus species. + +# This has been implemented manually and indicated in the column `remove` + +virus2 <- virus2[!virus2$remove, ] + +%%end.rcode + + +%%begin.rcode wilsonReaderTaxonomyRead, fig.show = extraFigs, fig.cap = 'Histogram of number of subspecies' + +################################################################## +### Subspecies vs Viruses analysis. ### +################################################################## + + +# Read in the wilson Reader Taxonomy and use it to calculate the number of subspecies each bat species has. + +tax <- read.csv('data/Chapter3/msw3-all.csv', stringsAsFactors = FALSE) + +chir <- tax %>% + filter(Order == 'CHIROPTERA') + +# Save some memory. +rm(tax) + +# Count the number of subspecies each bat species has. +subs <- sqldf(' + SELECT Family, Genus, Species, COUNT(Subspecies) + AS NumberOfSubspecies + FROM chir + Where Species <> "" + GROUP BY Genus, Species + ') + + + +# I think each species has 1 row for species and extra rows for subspecies +# Check this is true. +# If that is correct, then Species with >1 NumberOfSubspecies should be one less. + +SpeciesRows <- sqldf(' + SELECT Genus, Species, COUNT(Subspecies) + AS SpeciesRows + FROM chir + WHERE Subspecies == "" AND Species <> "" + GROUP BY Genus, Species + ') + +# +(SpeciesRows$SpeciesRows != 1) %>% sum +all(SpeciesRows$SpeciesRows == 1) + +# Species with >1 NumberOfSubspecies should be one less +subs$NumberOfSubspecies <- ifelse(subs$NumberOfSubspecies > 1, + subs$NumberOfSubspecies - 1, + subs$NumberOfSubspecies) + +# Quick look at species with highest number of subspecies. +subs[order(subs$NumberOfSubspecies, decreasing = TRUE ),] %>% head + +# Megaderma spasma is top. It's widespread across south east asia islands. 
+# So this makes sense. + +# Quick look at the number of subspecies. +ggplot(subs, aes(x = NumberOfSubspecies)) + + geom_histogram(binwidth = 2) + + xlab('Number of Subspecies') + + ylab('Count') + + +# Create a combined binomial name column +subs$binomial <- paste(subs$Genus, subs$Species) + + + + +# Check overlap of datasets. +sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) + +notInTax <- (virus2$binomial[virus2$host.species != ''])[!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)] + +# Run this to find synonyms of names not in Wilson and Reeder +# Doesn't find much of use. +# syns <- synonyms(notInTax, db = 'itis') + +# Clean some names +# As taxize::synonyms didn't find most of them, I am using IUCN. +# And checking that the IUCN name is then in The Wilson & Reeder taxonomy + +virus2$binomial[virus2$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' +virus2$binomial[virus2$binomial == 'Tadarida pumila'] <- 'Chaerephon pumilus' +virus2$binomial[virus2$binomial == 'Tadarida condylura'] <- 'Mops condylurus' +virus2$binomial[virus2$binomial == 'Rhinolophus hildebrandti'] <- 'Rhinolophus hildebrandtii' +# Rhinolophus horsfeldi: I can't find this species anywhere. Will exclude. +# Possibly Megaderma spasma according to http://www.fao.org/3/a-i2407e.pdf +virus2$binomial[virus2$binomial == 'Tadarida plicata'] <- 'Chaerephon plicatus' +virus2$binomial[virus2$binomial == 'Artibeus planirostris'] <- 'Artibeus jamaicensis' + +sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) + +%%end.rcode + +%%begin.rcode subsHistsByFam, fig.show = extraFigs, fig.height = 3, fig.cap = 'Histograms of number of subspecies for the families with many species.' + +# Compare the histograms of numbers of subspecies over the families with many species. +subs %>% + filter(Family %in% names(which(table(subs$Family) > 99))) %>% + ggplot(., aes(x = NumberOfSubspecies, y = ..density..)) + + geom_histogram() + + facet_grid(. 
~ Family) + + xlab('Number of Subspecies') + + ylab('Density') + +%%end.rcode + +%%begin.rcode, subvsvirusCaption + +# Caption for subspecies vs n. viruses plot. +subvsvirus <- ' +Number of viruses against number of subspecies. +Points are coloured by family, with families with less than 10 species being grouped into "other". +Contours show the 2D density of points and suggest a positive correlation. +' +subvsvirusTitle <- 'Number of viruses against number of subspecies' +%%end.rcode + +%%begin.rcode subsDataFrame, fig.show = extraFigs, fig.cap = subvsvirus, fig.scap = subvsvirusTitle, out.width = '\\textwidth' +# create combined dataframe + +# Join dataframes +species <- sqldf(" + SELECT subs.binomial, virus2.[virus.species] + FROM subs + INNER JOIN virus2 + ON subs.binomial=virus2.binomial; + ") + +# Count number of virus species for each bat species +nSpecies <- species %>% + unique %>% + group_by(binomial) %>% + summarise(virusSpecies = n()) + +# Add other Subspecies data. +nSpecies <- sqldf(" + SELECT nSpecies.binomial, virusSpecies, NumberOfSubspecies, Genus, Family + FROM nSpecies + LEFT JOIN subs + ON nSpecies.binomial=subs.binomial + ") + +# Create another column to make plotting easier. 
+# Group families with few rows into 'other' + +nSpecies$familyPlotCol <- nSpecies$Family +nSpecies$familyPlotCol[ + nSpecies$Family %in% names(which(table(nSpecies$Family) < 10))] <- 'Other' + +table(nSpecies$familyPlotCol) + +ggplot(nSpecies, aes(x = log(NumberOfSubspecies), y = log(virusSpecies))) + + # geom_smooth(method = 'lm') + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, alpha = 0.8, + position = position_jitter(width = .1, height = .1)) + + scale_colour_hc() + + geom_density2d() + + labs(colour = 'Family') + +%%end.rcode + +%%begin.rcode virusHist, fig.show = extraFigs, fig.cap = 'Histogram of known viruses per species' + +ggplot(nSpecies, aes(x = virusSpecies)) + + geom_histogram() + +%%end.rcode + + + + +%%begin.rcode euthRead + +# Read in pantheria data base +pantheria <- read.table(file = 'data/Chapter3/PanTHERIA_1-0_WR05_Aug2008.txt', + header = TRUE, sep = "\t", na.strings = c("-999", "-999.00")) + +mass <- sqldf(" + SELECT [X5.1_AdultBodyMass_g] + FROM nSpecies + LEFT JOIN pantheria + ON nSpecies.binomial=pantheria.MSW05_Binomial + ") + +nSpecies$mass <- mass[, 1] + +# Now add additional mass estimates. + +additionalMass <- read.csv('data/Chapter3/AdditionalBodyMass.csv', stringsAsFactors = FALSE) +meanAdditionalMass <- additionalMass %>% + group_by(binomial) %>% + summarise(mass = mean(Body.Mass.grams)) + +nSpecies$mass[ + sapply(meanAdditionalMass$binomial, function(x) which(nSpecies$binomial == x)) + ] <- meanAdditionalMass$mass + + +%%end.rcode + + + +%%begin.rcode IUCNranges, eval = runIucn + +# Read in iucn ranges and calculate range sizes for each species. 
+ranges <- readShapePoly('data/Chapter3/TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS.shp') + +ranges <- ranges[ranges$order_name == 'CHIROPTERA', ] + +levels(ranges$binomial) <- c(levels(ranges$binomial), 'Myotis ricketti') +ranges$binomial[ranges$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' + + + + +nSpecies$binomial[!(nSpecies$binomial %in% ranges$binomial)] + +findArea <- function(name){ + #cat(name) + A <- areaPolygon(ranges[ranges$binomial == name, ]) + sum(A) +} + +iucnDistr <- sapply(nSpecies$binomial, findArea) + +write.csv(iucnDistr, 'data/Chapter3/iucnDistr.csv') + +%%end.rcode + +%%begin.rcode readIucnIn + +iucnDistr <- read.csv('data/Chapter3/iucnDistr.csv', row.names = 1) + +nSpecies$distrSize <- iucnDistr$x + +%%end.rcode + + + +%%begin.rcode pubmedScrapeFunc + +# Scrape from pubmed + +scrapePub <- function(sp){ + + Sys.sleep(2) + + # Initialise refs + refs <- NA + + # Find synonyms from taxize + syns <- synonyms(sp, db = 'itis') + if(NROW(syns[[1]]) == 1){ + spString <- tolower(gsub(' ', '%20', sp)) + } else { + spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22') + } + + + url <- paste0('http://www.ncbi.nlm.nih.gov/pubmed/?term=%22', spString, '%22') + + + page <- html(url) + + # Test if exact phrase was found. 
+ phraseFound <- try(page %>% + html_node('.icon') %>% + html_text() %>% + grepl("The following term was not found in PubMed:", .), silent = TRUE) + + if (class(phraseFound) == "logical") { + if(phraseFound){ + if(phraseFound) refs <- NA + } + } + if (class(phraseFound) != "logical") { + try({ + refs <- page %>% + html_node('.result_count') %>% + html_text() %>% + strsplit(' ') %>% + .[[1]] %>% + .[length(.)] %>% + as.numeric() + }) + } + + return(refs) +} + + +%%end.rcode + + +%%begin.rcode pubmedScrape, eval = runPubmedScrape + +# Create empty vector +pubmedRefs <- rep(NA, nrow(nSpecies)) + +for(i in 1:NROW(nSpecies)){ + pubmedRefs[i] <- scrapePub(nSpecies$binomial[i]) +} + +pubmedScrapeDate <- Sys.Date() + +pubmedRefs <- cbind(binomial = nSpecies$binomial, pubmedRefs = pubmedRefs) + +# Write out. +write.csv(pubmedRefs, file = 'data/Chapter3/pubmedRefs.csv') + +%%end.rcode + + + + +%%begin.rcode pubmedRead + + +pubmedRefs <- read.csv('data/Chapter3/pubmedRefs.csv', stringsAsFactors = FALSE, row.names = 1) + +# Function returns NA for none found. Change that to a zero. +pubmedRefs$pubmedRefs[is.na(pubmedRefs$pubmedRefs)] <- 0 +nSpecies$pubmedRefs <- pubmedRefs$pubmedRefs + +%%end.rcode + +%%begin.rcode scholarScrapeFunc + +scrapeScholar <- function(sp){ + + wait <- rnorm(1, 120, 2) + Sys.sleep(wait) + + + syns <- synonyms(sp, db = 'itis') + if(NROW(syns[[1]]) == 1){ + spString <- tolower(gsub(' ', '%20', sp)) + } else { + spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22') + } + + url <- paste0('https://scholar.google.co.uk/scholar?hl=en&q=%22', + spString, '%22&btnG=&as_sdt=1%2C5&as_sdtp=') + + + page <- html(url) + + try({ + refs <- page %>% + html_node('#gs_ab_md') %>% + html_text() %>% + gsub('About\\s(.*)\\sresults.*', '\\1', .) %>% + gsub(',', '', .) 
%>% + as.numeric + }) + return(refs) +} + +%%end.rcode + +%%begin.rcode scholarScrape, eval = runScholarScrape + +# Create empty vector +scholarRefs <- rep(NA, nrow(nSpecies)) + +for(i in 1:NROW(nSpecies)){ + scholarRefs[i] <- scrapeScholar(nSpecies$binomial[i]) +} + +scholarScrapeDate <- Sys.Date() + +scholarRefs <- cbind(binomial = nSpecies$binomial, scholarRefs = scholarRefs) + +# Write out. +write.csv(scholarRefs, file = 'data/Chapter3/scholarRefs.csv') + +%%end.rcode + + + + +%%begin.rcode scholarRead + + +scholarRefs <- read.csv('data/Chapter3/scholarRefs.csv', stringsAsFactors = FALSE, row.names = 1) + +# Function returns NA for none found. Change that to a zero. +scholarRefs$scholarRefs[is.na(scholarRefs$scholarRefs)] <- 0 + +nSpecies$scholarRefs <- sqldf(' + SELECT scholarRefs + FROM nSpecies + INNER JOIN scholarRefs + ON scholarRefs.binomial=nSpecies.binomial + ' + ) %>% + .$scholarRefs + +%%end.rcode + + + + + + + +%%begin.rcode subsRemoveNAs + +# Remove missing data and sort out the data frame a little. + +nSpecies <- nSpecies[complete.cases(nSpecies), ] + +# Add number of subspecies as a factor. Might help plotting. 
+nSpecies$SubspeciesFactor <- factor(nSpecies$NumberOfSubspecies, + levels = as.character(1:max(nSpecies$NumberOfSubspecies))) + +# Rownames to species names +rownames(nSpecies) <- nSpecies$binomial + +%%end.rcode + + + +%%begin.rcode savenSpecies +######################################################## +### At this point, nSpecies should be in final form ### +######################################################## + +write.csv(nSpecies, file = 'data/Chapter3/nSpecies.csv') + +%%end.rcode + + + +%%begin.rcode treeRead + +# Read in trees +t <- read.nexus('data/Chapter3/fritz2009geographical.tre') + +# Select best supported tree +tr1 <- t[[1]] + +# Make names match previous names +tr1$tip.label <- gsub('_', ' ', tr1$tip.label) + +# Which tips are not needed +unneededTips <- tr1$tip.label[!(tr1$tip.label %in% nSpecies$binomial)] + +# Prune tree down to only needed tips. +pruneTree <- drop.tip(tr1, unneededTips) + +rm(t) + +%%end.rcode + +%%begin.rcode nSpeciesTreePlot, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.show = extraFigs + +# Plot tree +p <- ggtree(pruneTree, layout = 'fan') + +p %<+% nSpecies[, 1:6] + + geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + + scale_size(range = c(0.2, 2)) + + scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + + theme_tcdl + + theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + + theme(legend.position = 'right') + + labs(size = 'Virus Richness') + + theme(legend.key.size = unit(0.6, "lines"), + legend.text = element_text(size = 6), + legend.title = element_text(size = 8)) + + +%%end.rcode + + + +%%begin.rcode scholarvspubmed, fig.show = extraFigs, fig.cap = 'Logged number of references on scholar and pubmed, with a fitted (unphylogenetic) linear model. Colours indicate family.' + +# Check how correlated pubmed and scholar are. 
+ + +compSubspecies <- comparative.data(data = nSpecies, phy = pruneTree, names.col = 'binomial') + +citeCor <- pgls(log(scholarRefs) ~ log(pubmedRefs + 1), data = compSubspecies, lambda = 'ML') + +studyEffortCor <- summary(citeCor) +# And plot +ggplot(nSpecies, aes(x = scholarRefs, y = pubmedRefs + 1)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm') + + scale_x_log10() + + scale_y_log10() + + scale_colour_hc() + +%%end.rcode + +%%begin.rcode subsDataCapts +subsDataCapts <- c( +'Unlogged number of virus species against log mass with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', +'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', +'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added.', +'Virus species against study effort (log pubmed references +1)') +%%end.rcode + +%%begin.rcode subsDataviz, fig.show = extraFigs, fig.cap = subsDataCapts + +# A number of exploratory plots + +# Mass against viruses +ggplot(nSpecies, aes(log(mass), virusSpecies)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + + +# N Subspecies and against viruses +ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .3, height = .3)) + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + +# Log(N Subspecies) and against viruses + +ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .05, height 
= .2)) + + scale_x_log10() + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + +# N. Subspecies against viruses as a boxplot to deal with overplotting. +ggplot(nSpecies, aes(SubspeciesFactor, virusSpecies)) + + geom_boxplot() + + scale_x_discrete(limits = levels(nSpecies$SubspeciesFactor), drop=FALSE) + + geom_smooth(method = 'lm', aes(group = 1)) + + xlab('# subspecies') + + +# Study effort against virusSpecies +ggplot(nSpecies, aes(log(pubmedRefs + 1), virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .1, height = .1)) + + geom_smooth(method = 'lm') + + labs(colour = 'Family')+ + scale_colour_hc() + + +# Distribution size against virus + + +ggplot(nSpecies, aes(distrSize, virusSpecies)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm') + + labs(colour = 'Family') + + scale_colour_hc() + + scale_x_log10() + + +# Correlation plot +nSpecies %>% + dplyr::select(virusSpecies, NumberOfSubspecies, mass, distrSize, pubmedRefs, scholarRefs) %>% + mutate(mass = log(mass), distrSize = log(distrSize), pubmedRefs = log(pubmedRefs + 1), scholarRefs = log(scholarRefs)) %>% + ggpairs(.) + +%%end.rcode + + + +%%begin.rcode, subsAnalysis, fig.show = extraFigs + +################################################################################## +## N Virus ~ log(cites) + subs + log(mass) + +subspeciesJointUnlog <- pgls( + virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass), + data = compSubspecies, lambda = 'ML') + + + +## N Virus ~ log(mass) + subs + log(cites) + subs:log(cites) + +subspeciesInter <- pgls( + virusSpecies ~ log(mass) + + NumberOfSubspecies*log(scholarRefs), + data = compSubspecies, lambda = 'ML') + +#subInter.summary <- summary(subspeciesInter) + + + + +## Look at Variance inflation factors. +## Couple of help messages imply lm vif is fine. 
+ +#sqrt(vif(lm(virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass) + log(distrSize), data = nSpecies))) + +%%end.rcode + + + + + + + + + +%%begin.rcode ITanalysis + +varList <- c('scholarRefs', 'NumberOfSubspecies', 'mass', 'distrSize', 'rand') + +findCombs <- function(k, vars, longest){ + x <- t(combn(vars, k)) + nas <- matrix(NA, ncol = longest - NCOL(x), nrow = nrow(x)) + mat <- cbind(x, nas) + return(mat) +} + +modelList <- lapply(0:5, function(k) findCombs(k, varList, 6)) +modelMat <- do.call(rbind, modelList) + +interMat <- modelMat[apply(modelMat, 1, function(x) "scholarRefs" %in% x & "NumberOfSubspecies" %in% x), ] +interMat[, 2:5] <- interMat[, 1:4] +interMat[, 1] <- "scholarRefs:NumberOfSubspecies" + +allModelMat <- rbind(modelMat, interMat) + + +allFormulae <- apply(allModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) + +allFormulae <- c(as.formula('virusSpecies ~ 1'), allFormulae) + + + +modelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ + + set.seed(paste0('123', boot)) + bootData <- cbind(data, rand = runif(nrow(data))) + + # log some predictors + bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) + + # scale + bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')]) + + coefs <- matrix(NA, ncol = length(varList) + 2, nrow = nrow(allModelMat), + dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList, 'scholarRefs:NumberOfSubspecies')))) + + results <- apply(allModelMat, 1, function(x) sapply(c(varList, "scholarRefs:NumberOfSubspecies"), function(y) y %in% x)) %>% + t %>% + data.frame %>% + cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) + + # Fit each model + # I'm having problems with convergence so sometimes have to try different starting values. 
+ for(m in 1:length(allForm)){ + if(exists('model')){ + rm(model) + } + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 1 + }) + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 2 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 3 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 4 + }) + } + if(!exists('model')){ + try({ + model <- lm(allForm[[m]], data = bootData) + results$attempt[m] <- 5 + message('Running lm') + }) + } + #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') + results$AIC[m] <- AICc(model) + + if(inherits(model, 'gls')){ + results$lambda[m] <- model$modelStruct$corStruct[1] + } + + results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] + + + results[m, paste0('beta.', names(coef(model)))] <- coef(model) + + message(paste('Boot:', boot, ', m:', m, '\n')) + } + + results$dAIC <- results$AIC - min(results$AIC) + results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) + + + return(results) + +} + + + + +%%end.rcode + +%%begin.rcode modelSelectBoots, eval = subBoots + +fitModelsBootStrap <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies, pruneTree, b, allModelMat, varList), mc.cores = nCores) + +allResults <- do.call(rbind, fitModelsBootStrap) + +write.csv(allResults, file = 'data/Chapter3/modelSelectSubspecies.csv') + + +%%end.rcode + +%%begin.rcode analyseModelSelect, fig.show = extraFigs + +allResults <- read.csv('data/Chapter3/modelSelectSubspecies.csv', row.names = 1) + +#varWeights <- sapply(names(allResults)[1:6], function(x) 
sum(allResults$weight[allResults[, x]])/nBoots) + +sepVarWeights <- lapply(1:nBoots, function(b) + sapply(names(allResults)[1:6], + function(x) + sum(allResults[allResults$boot == b, 'weight'][allResults[allResults$boot == b, x]]) + ) + ) + +sepVarWeights <- do.call(rbind, sepVarWeights) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +sepVarWeights$col <- 'Other Variables' +sepVarWeights$col[grep('NumberOf', sepVarWeights$variable)] <- 'Population Structure' +sepVarWeights$col[sepVarWeights$variable == 'rand'] <- 'Null' + + + +modelWeights <- allResults %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) %>% + mutate(string = predictors) + + +# Calculate variable weights based on mean(AIC) rather than raw AIC. +varWeights <- sapply(names(allResults)[1:6], + function(x) sum(modelWeights$modelWeight[grep(x, as.character(modelWeights$predictors))])) + + + +allResults %>% + filter(rand, !`scholarRefs.NumberOfSubspecies`, NumberOfSubspecies) %>% +ggplot(., aes(x = lambda, colour = predictors)) + + geom_density() + + scale_colour_hc() + +ggplot(allResults, aes(x = lambda)) + + geom_density() + +allResults %>% + filter(boot == 1) %>% + dplyr::select(predictors, lambda) + +%%end.rcode + + + +%%begin.rcode ITPlots + +# reorder factors to get structure vars at beginning. 
+sepVarWeights$variable <- factor(sepVarWeights$variable, levels(sepVarWeights$variable)[c(2, 6, 1, 3, 4, 5)]) + +ITPlot <- ggplot(sepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +%%end.rcode + +%%begin.rcode nSpeciesCoef, fig.show = extraFigs + +ggplot(allResults, aes(x = beta.NumberOfSubspecies, colour = scholarRefs)) + + geom_density() + + + +mean(allResults$NumberOfSubspecies, na.rm = TRUE) + + +varCoefMeans <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.mean(x, allResults$weight, na.rm = TRUE)) +varCoefVars <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.var(x, allResults$weight, na.rm = TRUE)) + +nSpeciesCoefMean <- wtd.mean(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesCoefMeanI <- wtd.mean(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesInterMean <- wtd.mean(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE) + + +nSpeciesCoefVar <- 
wtd.var(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesCoefVarI <- wtd.var(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesInterVar <- wtd.var(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE) + + + +# Direction of interaction models + +min(nSpecies$NumberOfSubspecies) + +max(nSpecies$NumberOfSubspecies) + +# At minimum study effort +nSpeciesInterMean*log(min(nSpecies$scholarRefs)) + nSpeciesCoefMeanI +nSpeciesInterMean*log(max(nSpecies$scholarRefs)) + nSpeciesCoefMeanI +nSpeciesInterMean*log(median(nSpecies$scholarRefs)) + nSpeciesCoefMeanI + +mean(nSpeciesInterMean*log(nSpecies$scholarRefs) + nSpeciesCoefMeanI > 0) + + + +%%end.rcode + + + +%%begin.rcode familyMeans + +familyMeans <- nSpecies %>% + group_by(Family) %>% + summarise(mean = mean(virusSpecies), n = n()) + +%%end.rcode + + +%%begin.rcode univariatePGLS + +#orderedNSpecies <- nSpecies[sapply(pruneTree$tip.label, function(x) which(nSpecies$binomial == x)),] + + +sspLambda <- summary(pgls(NumberOfSubspecies ~ 1, data = compSubspecies, lambda = 'ML')) +massLambda <- summary(pgls(log(mass) ~ 1, data = compSubspecies, lambda = 'ML')) +scholarLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compSubspecies, lambda = 'ML')) +virusLambda <- summary(pgls(virusSpecies ~ 1, data = compSubspecies, lambda = 'ML')) +distrLambda <- summary(pgls(log(distrSize) ~ 1, data = compSubspecies, lambda = 'ML')) + +sspUni <- summary(pgls(virusSpecies ~ NumberOfSubspecies, data = compSubspecies, lambda = 'ML')) + + +%%end.rcode + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%% FST ANALYSIS %%%%%%%% 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%%begin.rcode fstRead, eval = fstComb + +# Read in Fst data. +# Then add extra columns needed. + +fst <- read.csv('data/Chapter3/FstDataCompData.csv') + +# Check overlap of datasets. +sum(!(fst$binomial %in% virus2$binomial[virus2$host.species != ''])) + +notInFst <- fst$binomial[!(fst$binomial %in% virus2$binomial)] +# lots of sp not in virus2. MAybe will include 0 virus species. Kinda makes sense. + + + +######################################################################################### +#### Get distribution size and width #### +######################################################################################### + + + + +fst$binomial[!(fst$binomial %in% ranges$binomial)] + +fst <- fst[(fst$binomial %in% ranges$binomial), ] + +unique(fst$binomial) %>% length + + + + +findAreaFst <- function(name){ + #cat(name) + A <- areaPolygon(ranges[ranges$binomial == as.character(name), ]) + sum(A) +} + +fstIucnDistr <- sapply(fst$binomial, findAreaFst) + + +fst$distrSize <- fstIucnDistr + + +#### Now get distribution width + +findWidth <- function(name){ + #print(name) + distr <- ranges[ranges$binomial == as.character(name), ] + + coords <- list() + # Get coordinates from all polygons into one matrix. + for(i in 1:length(distr@polygons)){ + coords[[i]] <- distr@polygons[[i]]@Polygons[[1]]@coords + } + coords <- do.call(rbind, coords) + + # Take the convex hull of coordinates to speed up last step. + hullCoords <- coords[chull(coords), ] + + maxDist <- max(apply(hullCoords, 1, function(x) distGeo(coords, x)))/1000 + return(maxDist) + +} + +# Calculate widest part of all species distributions. +# This is slow but also RAM heavy. +# 3 cores doesn't crash my computer with 16GB RAM. +rangeWidth <- mclapply(fst$binomial, findWidth, mc.cores = 3) %>% do.call(c, .) 
+ +#rangeWidth <- sapply(fst$binomial, findWidth) + +fst$rangeWidth <- rangeWidth +fst$rangeCoverage <- fst$Dmax..km. / fst$rangeWidth + + + +fst$Useable <- fst$rangeCoverage > rangeUseable +sum(fst$Useable, na.rm = TRUE) +fst$binomial[fst$Useable] %>% unique %>% .[!is.na(.)] %>% length + +# Need to go back and check data but for now if fst$Useable is na, then it's FALSE (i.e.\ it's not a useable row) +fst$Useable[is.na(fst$Useable)] <- FALSE + + +%%end.rcode + + + +%%begin.rcode fstStudyEffort, eval = fstComb + +# First take what data we can from nSpecies analysis. +fstStudy <- sqldf(" + SELECT fst.binomial, nSpecies.scholarRefs, nSpecies.pubmedRefs + FROM fst + LEFT JOIN nSpecies + ON nSpecies.binomial=fst.binomial + ") + +%%end.rcode + +%%begin.rcode fstScrape, eval = runFstScrape + +######################################################## +#### Sloow bit that might get you blocked by google #### +######################################################## + +fstNewStudy <- fstStudy[is.na(fstStudy[,2]),1] %>% + lapply(., function(x) c(x, scrapeScholar(x), scrapePub(x))) %>% + do.call(rbind, .) + +names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') + +write.csv(fstNewStudy, file = 'data/Chapter3/fstScrape.csv') + +%%end.rcode + + +%%begin.rcode fstCombine, eval = fstComb + +fstNewStudy <- read.csv('data/Chapter3/fstScrape.csv', row.names = 1) +names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') + +# NAs are from searches with 0 references. 
+fstNewStudy$pubmedRefs[is.na(fstNewStudy$pubmedRefs)] <- 0 + +whichRows <- lapply(fstNewStudy$binomial, function(x) which(fstStudy$binomial == x)) +for(i in seq_along(whichRows)){ + fstStudy[whichRows[[i]], 2:3] <- fstNewStudy[i, 2:3] +} + + +fst <- cbind(fst, fstStudy[, 2:3]) + +# Remove rows whose scale is too small +fst <- fst[fst$Useable, ] + + +# Don't want rows using mtDNA due to female-biased dispersal +fst <- fst[fst$Marker != 'mtDNA', ] + +%%end.rcode + +%%begin.rcode convertFst, eval = fstComb + +calcNm <- function(Fst){ (1 - Fst)/(4 * Fst) } + + +fst$Nm <- calcNm(fst$Value) + + + +fst <- fst[!is.na(fst$Nm) & !(fst$Nm == Inf), ] + +fstFinal <- fst + +# Take means of species with multiple measurements + +fstFinal <- fstFinal[!duplicated(fstFinal$binomial), ] +fstFinal$Nm <- sapply(fstFinal$binomial, function(x) mean(fst$Nm[fst$binomial == x])) + +# Add number of viruses to fst dataset +# Includes zeros for species with no known viruses. + +fstFinal$virusSpecies <- sapply(fstFinal$binomial, function(x) sum(virus2$binomial == x)) + + + + +# Add mass data. 
+ + +mass <- sqldf(" + SELECT [X5.1_AdultBodyMass_g] + FROM fstFinal + LEFT JOIN pantheria + ON fstFinal.binomial=pantheria.MSW05_Binomial + ") + +# Don't need pantheria data anymore +rm(pantheria) + +fstFinal$mass <- mass[, 1] + +fstFinal$mass[fstFinal$binomial == 'Myotis ricketti'] <- meanAdditionalMass$mass[meanAdditionalMass$binomial == 'Myotis ricketti'] + +fstFinal$mass[fstFinal$binomial == 'Myotis macropus'] <- 9.8 + +fstFinal <- fstFinal[!is.na(fstFinal$mass), ] + + +############################# +### fst data is finished ### +############################# + +write.csv(fstFinal, 'data/Chapter3/fstFinal.csv') +%%end.rcode + + +%%begin.rcode + +#### Read is full fstFinal dataframe + +fstFinal <- read.csv('data/Chapter3/fstFinal.csv', row.names = 1) + +%%end.rcode + +%%begin.rcode fstCors, fig.show = extraFigs + +fstFinal[, c('mass', 'scholarRefs', 'rangeWidth', 'Nm')] %>% + log %>% + cbind(virusSpecies = fstFinal$virusSpecies) %>% + ggpairs(.) + + +%%end.rcode + + + +%%begin.rcode compareNm, fig.show = extraFigs + +ggplot(fstFinal, aes(x = Marker, y = Nm)) + + geom_point() + + scale_y_log10() + +lm(fstFinal$Nm ~ fstFinal$Marker) %>% aov %>% summary + + +%%end.rcode + + +%%begin.rcode fstTree + +# Prune the tree for the fst data. + +# Which tips are not needed +fstUnneededTips <- tr1$tip.label[!(tr1$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree <- drop.tip(tr1, fstUnneededTips) + + + +%%end.rcode + + +%%begin.rcode fstTreePlot, fig.show = extraFigs, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.height = 3.6 + +# Plot tree +p <- ggtree(fstTree) + + +fstFinal$lengthNames <- fstFinal$binomial %>% + as.character %>% + paste0(' ', .) 
+ + +p %<+% fstFinal[, c('binomial', 'virusSpecies')] + + #geom_tiplab(family = 'lato light', align = FALSE) + + geom_text2(aes(x = x + 15, label = as.character(label), subset = isTip), + family = 'Lato light', hjust = 0, size = 3.3) + + #geom_text(aes(x = x + 15, label = as.character(label)), subset=.(isTip), + # family = 'Lato light', hjust = 0, size = 3.3) + + ggplot2::xlim(0, 210) + + theme_tcdl + + geom_point2(aes(x = x + 8, size = virusSpecies, subset = isTip)) + + scale_size(range = c(0, 4)) + + theme(legend.key.size = unit(0.8, "lines"), + legend.text = element_text(size = 9), + legend.title = element_text(size = 8), + legend.position = "right", + text = element_text(colour = 'darkgrey'), + legend.key = element_blank()) + + labs(size = 'Virus Richness') + + + +%%end.rcode + + + +%%begin.rcode fstITanalysis + +fstVarList <- c('scholarRefs', 'Nm', 'mass', 'distrSize', 'rand') + + +fstModelList <- lapply(0:5, function(k) findCombs(k, fstVarList, 5)) +fstModelMat <- do.call(rbind, fstModelList) + +fstAllFormulae <- apply(fstModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) + +fstAllFormulae <- c(as.formula('virusSpecies ~ 1'), fstAllFormulae) + +%%end.rcode + +%%begin.rcode fstModelSelectFun + + +fstModelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ + + set.seed(paste0('2388', boot)) + bootData <- cbind(data, rand = runif(nrow(data))) + row.names(bootData) <- bootData$binomial + + + # log some predictors + bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) + + # scale + bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')]) + + + coefs <- matrix(NA, ncol = length(varList) + 1, nrow = nrow(allModelMat), + dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList)))) + + results <- apply(allModelMat, 1, function(x) sapply(varList, 
function(y) y %in% x)) %>% + t %>% + data.frame %>% + cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) + + # Fit each model + # I'm having problems with convergence so sometimes have to try different starting values. + for(m in 1:length(allForm)){ + if(exists('model')){ + rm(model) + } + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 1 + }) + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 2 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 3 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 4 + }) + } + if(!exists('model')){ + try({ + model <- lm(allForm[[m]], data = bootData) + results$attempt[m] <- 5 + message('Running lm') + }) + } + #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') + results$AIC[m] <- AICc(model) + + if(inherits(model, 'gls')){ + results$lambda[m] <- model$modelStruct$corStruct[1] + } + + results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] + + + results[m, paste0('beta.', names(coef(model)))] <- coef(model) + + message(paste('Boot:', boot, ', m:', m, '\n')) + } + + results$dAIC <- results$AIC - min(results$AIC) + results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) + + + return(results) + +} + +%%end.rcode + +%%begin.rcode fstModelSelectBoots, eval = fstBoots + + + +fstModelsBootStrap <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree, b, fstModelMat, fstVarList), mc.cores = nCores) + +fstAllResults <- do.call(rbind, fstModelsBootStrap) + +write.csv(fstAllResults, file = 
'data/Chapter3/fstModelSelectSubspecies.csv') + + +%%end.rcode + +%%begin.rcode fstAnalyseModelSelect, fig.show = extraFigs + +fstAllResults <- read.csv('data/Chapter3/fstModelSelectSubspecies.csv', row.names = 1) + +fstSepVarWeights <- lapply(1:nBoots, function(b) + sapply(names(fstAllResults)[1:5], + function(x) + sum(fstAllResults[fstAllResults$boot == b, 'weight'][fstAllResults[fstAllResults$boot == b, x]]) + ) + ) + +fstSepVarWeights <- do.call(rbind, fstSepVarWeights) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +fstSepVarWeights$col <- 'Other Variables' +fstSepVarWeights$col[fstSepVarWeights$variable == 'Nm'] <- 'Population Structure' +fstSepVarWeights$col[fstSepVarWeights$variable == 'rand'] <- 'Null' + + + + + +fstModelWeights <- fstAllResults %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) + +# Calculate variable weights based on mean(AIC) rather than raw AIC. +fstVarWeights <- sapply(names(fstAllResults)[1:5], + function(x) sum(fstModelWeights$modelWeight[grep(x, as.character(fstModelWeights$predictors))])) + +%%end.rcode + + + + +%%begin.rcode fstITlambda, fig.show = extraFigs, fig.cap = 'Values of $\\lambda$ found in $F_{ST}$ analysis.', fig.height = 3 + +ggplot(fstAllResults, aes(x = lambda)) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + +%%end.rcode + + +%%begin.rcode fstITlambdaFacets, fig.show = extraFigs, fig.height = 4 + + +transform(fstAllResults, mass = c('Other', 'Mass' )[factor(mass)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. 
~ mass) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, Nm = c('Other', 'Nm' )[factor(Nm)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ Nm) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, distrSize = c('Other', 'distrSize' )[factor(distrSize)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ distrSize) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, scholarRefs = factor(c('Scholar Refs', 'Other')[factor(!scholarRefs)], levels = c('Scholar Refs', 'Other'))) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ scholarRefs) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + +transform(fstAllResults, rand = c('Other', 'Rand' )[factor(rand)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ rand) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +%%end.rcode + +%%begin.rcode lookAtLambda, fig.show = extraFigs + +fstComp <- comparative.data(fstTree, fstFinal, 'binomial') + +fullFst <- pgls(virusSpecies ~ log(Nm) + log(mass) + log(distrSize) + log(distrSize) + log(scholarRefs), fstComp, lambda = 'ML') + +fst.lambda.profile <- pgls.profile(fullFst, "lambda") +plot(fst.lambda.profile) + +data.frame(x = fst.lambda.profile$x, L = fst.lambda.profile$logLik) %>% +ggplot(aes(x, L)) + + geom_line() + + geom_vline(xintercept = fst.lambda.profile$ci$ci.val, col = 'steelblue') + + +%%end.rcode + + +%%begin.rcode fstCoef, fig.show = extraFigs + +ggplot(fstAllResults, aes(x = beta.Nm)) + + geom_histogram() + + +ggplot(fstAllResults, aes(x = beta.Nm, colour = scholarRefs)) + + geom_density() + + + +ggplot(fstAllResults, aes(x = beta.Nm, colour = distrSize)) + + geom_density() + + +fstCoefMeans <- apply(fstAllResults[, grep('beta', 
names(fstAllResults))], 2, function(x) wtd.mean(x, fstAllResults$weight, na.rm = TRUE)) +fstCoefVars <- apply(fstAllResults[, grep('beta', names(fstAllResults))], 2, function(x) wtd.var(x, fstAllResults$weight, na.rm = TRUE)) + +pcCoefLzero <- 100*sum(na.omit(fstAllResults$beta.Nm) < 0) / length(na.omit(fstAllResults$beta.Nm)) + +%%end.rcode + + + +%%begin.rcode univariateFstPGLS + +#orderedFst <- fstFinal[sapply(fstTree$tip.label, function(x) which(fstFinal$binomial == x)),] + +compFst <- comparative.data(data = fstFinal, phy = fstTree, names.col = 'binomial') + +nmFstLambda <- summary(pgls(log(Nm) ~ 1, data = compFst, lambda = 'ML')) +massFstLambda <- summary(pgls(log(mass) ~ 1, data = compFst, lambda = 'ML')) +scholarFstLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compFst, lambda = 'ML')) +virusFstLambda <- summary(pgls(virusSpecies ~ 1, data = compFst, lambda = 'ML')) +distrFstLambda <- summary(pgls(distrSize ~ 1, data = compFst, lambda = 'ML')) + +nmFstUni <- summary(pgls(virusSpecies ~ log(Nm), data = compFst, lambda = 'ML')) + +massFstUni <- summary(pgls(virusSpecies ~ log(mass), data = compFst, lambda = 'ML')) +fstDistrStudyEffort <- summary(pgls(log(scholarRefs) ~ log(distrSize), data = compFst, lambda = 'ML')) + +fstMassStudyEffort <- summary(pgls(log(scholarRefs) ~ log(mass), data = compFst, lambda = 'ML')) + +%%end.rcode + + + + + + + + +\subsubsection{Population structure data} + +I used two measures of population structure: the number of subspecies and the effective level of gene flow. +The number of subspecies was counted using the taxonomy from \textcite{wilson2005mammal}. +The effective level of gene flow was calculated from estimates of $F_{ST}$ collated from the literature. +The studies were from a wide range of spatial scales, from local ($\sim\SI{10}{\kilo\metre}$) to continental. 
+As $F_{ST}$ often increases with spatial scale \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range} I controlled for this by only using data from studies where a large proportion of the species range was studied. +I used the ratio of the furthest distance between $F_{ST}$ samples (taken from the paper or measured with \url{http://www.distancefromto.net/} if not stated) to the length of the IUCN species range \cite{iucn} and only used studies if this ratio was greater than \rinline{rangeUseable}. +This is an arbitrary value that was a compromise between retaining a reasonable number of data points and controlling for the bias in spatial scale. +I only used global $F_{ST}$ estimates as the mean of pairwise $F_{ST}$ values is not necessarily equal to the global $F_{ST}$ value. +I converted all $F_{ST}$ values to effective migration rates using $M = (1-F_{ST})/4F_{ST}$. +This transforms the data from being bound by $(0, 1)$ to being in the range $\lbrack 0, \infty)$ and is easier to interpret. + +The two measures of population structure were analysed separately because the number of subspecies data set had \rinline{nrow(nSpecies)} data points but there was only $F_{ST}$ data for \rinline{nrow(fstFinal)} bat species. +For the subspecies analysis, all bat species in \textcite{luis2013comparison} were used (i.e.\ all species with at least one known virus species). +This was to avoid using the very large number of bat species that have simply never been sampled for viruses. +However, for the gene flow analysis, all bat species with suitable $F_{ST}$ estimates were used. +As some bat species had suitable $F_{ST}$ estimates but were not present in \textcite{luis2013comparison}, some bat species with zero known virus species were included. 
+These bat species with no known viruses were included to make the greatest use of the $F_{ST}$ data available and because the number of species with no known virus species was not unduly large (\rinline{sum(fstFinal$virusSpecies == 0)} species). + +After data cleaning there were data for \rinline{nrow(nSpecies)} bat species in \rinline{length(unique(nSpecies$Family))} families for the subspecies analysis. +Due to the limited number of studies and the restrictive requirements imposed on study design, there were only data for \rinline{nrow(fstFinal)} bat species in \rinline{length(unique(fstFinal$Family))} families for the effective gene flow analysis. +The raw data are included in Table~\ref{A-rawData}. + + + + +\subsubsection{Other explanatory variables} + + + +To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis}. +This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rvest}, with ITIS synonyms being accessed with the \emph{taxize} package \cite{chamberlain2013taxize}. +I log transformed these variables as they were strongly right skewed. +I tested for correlation between these two proxies for study effort using phylogenetic generalised least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}). +The log numbers of citations on PubMed and Google Scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$). +As the correlation between citation counts was strong, I only used Google Scholar reference counts in subsequent analyses. +%See the appendix for analyses run using PubMed citations. 
+ +Two factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat} and range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}. +These other factors were included to avoid spurious positive results occurring simply due to correlations between pathogen richness and a different, causal factor. +Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density was not included in the analysis as there is very little data for bat densities. +Measurements of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. +\emph{Pipistrellus pygmaeus} was assigned the same mass as \emph{P. pipistrellus} as they are indistinguishable by mass. +Body mass measurements were log transformed as they were strongly right skewed. +Distribution size was estimated from range maps downloaded for all species from IUCN \cite{iucn} and was also log transformed due to right skew. + + + + +\subsection{Statistical analysis} + +Statistical analysis for both measures of population structure (number of subspecies and effective level of gene flow) was conducted using an information-theoretic approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. +All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/comparative-test-of-pop-structure.Rtex}. +I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. 
+In the analysis using the number of subspecies as the measure of population structure, I also modelled the interaction between study effort and number of subspecies by including their product. +I included this interaction because I believed \emph{a priori} that it may be important: subspecies are more likely to be identified in well-studied species. +The interaction was only included in models with both study effort and number of subspecies as individual terms. +Following \textcite{whittingham2005habitat} I included a uniformly distributed random variable. +This variable can be used to benchmark how important other explanatory variables are. +The whole analysis was run \rinline{nBoots} times, resampling the random variable each time. + + +To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}. +This tree was pruned to include only the species I had data for (Figure~\ref{fig:treePlot}). +Phylogenetic manipulation was performed using the \emph{ape} package \cite{ape}. +I also performed the analysis using the phylogeny from \textcite{jones2005bats} as this has some broad topological differences including the Rhinolophoidea being sister to the Pteropodidae rather than being related to the other insectivorous bats (Figure~\ref{fig:treePlot2}). + + + + +%%begin.rcode treeCapt + +treeCapt <- ' +The phylogenetic distribution of viral richness. +The phylogeny is from \\cite{fritz2009geographical} pruned to include all species used in either the number of subspecies or gene flow analysis. +Dot size shows the number of known viruses for that species and colour shows family. +The red scale bar shows 25 million years.' 
+ + + +treeTitle <- 'Pruned phylogeny showing number of pathogens and family' + +%%end.rcode + +%%begin.rcode treePlot, out.width = '1\\textwidth', out.extra = 'trim = 0cm 0cm 0cm 0cm', fig.height = 5, fig.height = 5.5, fig.cap = treeCapt, fig.scap = treeTitle + +combUneeded <- tr1$tip.label[!(tr1$tip.label %in% c(as.character(fstFinal$binomial), nSpecies$binomial))] + +# Prune tree down to only needed tips. +combTree <- drop.tip(tr1, combUneeded) + +combdf <- nSpecies %>% + dplyr::select(binomial, virusSpecies, Family) %>% + rbind(fstFinal %>% dplyr::select(binomial, virusSpecies, Family)) %>% + distinct(binomial) + +# Plot tree +p <- ggtree(combTree, layout = 'fan') + +p %<+% combdf + + geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + + scale_size(range = c(0.1, 3)) + + scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + + theme_tcdl + + theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + + theme(legend.position = c(0.5, -0.04)) + + geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + + labs(size = 'Virus Richness') + +# guides(size = guide_legend(override.aes = list(shape = 1))) + + theme(legend.key.size = unit(0.8, "lines"), + legend.text = element_text(size = 10), + legend.margin = unit(c(0.05), "cm"), + legend.title = element_text(size = 12), + legend.direction = "horizontal") + + guides(colour = guide_legend(ncol=3)) + + +# Attempt at concentric circle time bar. 
+#scale <- data.frame(x = c(0, 0), y = c(0, 0), l = c(1200, 2400)) + +#p %<+% combdf + +# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + +# scale_size(range = c(0.1, 100), breaks = c(1, 5, 10)) + +# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + +# theme_tcdl + +# theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + +# theme(legend.position = c(0.5, -0.04)) + +# geom_point(data = scale, aes(x = x, y = y, size = l), alpha = 0.2) + +# geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + +# labs(size = 'Virus Richness') + +## guides(size = guide_legend(override.aes = list(shape = 1)), alpha = 0.9) + +# theme(legend.key.size = unit(0.8, "lines"), +# legend.text = element_text(size = 10), +# legend.margin = unit(c(0.05), "cm"), +# legend.title = element_text(size = 12), +# legend.direction = "horizontal") + +# guides(colour = guide_legend(ncol=3)) + +# Or using bars + +#scale2 <- data.frame(x = c(1, 1), y = c(10, 200), w = c(1, 1)) + +#p %<+% combdf + +# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + +# geom_bar(data = scale2, aes(x = x, y = y, size = w), alpha = 0.3, stat = 'identity', position = 'identity') + +%%end.rcode + + + +The importance of the phylogeny on each variable separately was examined by estimating the $\lambda$ parameter when regressing the variable against an intercept using the \emph{pgls} function in \emph{caper} \cite{caper}. +The parameter $\lambda$ usually takes values between zero and one and \emph{pgls} constrains $\lambda$ within these bounds. +$\lambda = 0$ implies no autocorrelation while a trait evolving by Brownian motion along the tree would have $\lambda = 1$. +I tested fitted $\lambda$ values against the null hypothesis of $\lambda = 0$ (no correlation between species) with log-likelihood ratio tests using \emph{caper} \cite{caper}. 
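+
+As a sketch of what $\lambda$ does (the symbols $C$, $i$ and $j$ are illustrative notation, not notation defined elsewhere in this chapter): if $C$ is the among-species covariance matrix expected under Brownian motion on the tree, the $\lambda$ transformation is usually described as rescaling only the off-diagonal elements,
+\begin{equation}
+C_{ij}(\lambda) =
+\begin{cases}
+\lambda \, C_{ij} & \text{if } i \neq j, \\
+C_{ii} & \text{if } i = j,
+\end{cases}
+\end{equation}
+so that $\lambda = 0$ leaves species independent and $\lambda = 1$ recovers the Brownian motion expectation.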
+ +I fitted phylogenetic regressions for all models in the credible set using the function \emph{gls} in the package \emph{nlme} \cite{nlme}. +The explanatory variables were centred and scaled to allow direct comparison of the coefficients \cite{schielzeth2010simple}. +For each regression model I simultaneously fitted the $\lambda$ parameter as this avoids misspecifying the model \cite{revell2010phylogenetic}. +Unlike the \emph{pgls} function, \emph{gls} does not constrain $\lambda$ to be in the range $\lbrack 0, 1\rbrack$. +$\lambda < 0$ indicates that residuals from the fitted model are distributed on the phylogeny more uniformly than expected by chance. +$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when evolution occurs along a branch than the importance of the phylogeny. +Further, fitting multiple parameters makes interpretation difficult. + + + +To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely). +This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined. +I calculated AICc for each model. +As each model was fitted 50 times, I calculated the mean AICc, $\bar{\text{AICc}}$, for each model across these fits. +$\Delta\text{AICc}$ was calculated as $\bar{\text{AICc}} - \text{min}(\bar{\text{AICc}})$, not the mean of the individual $\Delta\text{AICc}$ scores, to guarantee that the best model has $\Delta\text{AICc} = 0$. +From these $\Delta\text{AICc}$ values I calculated Akaike weights, $w$. +This value can be interpreted as the probability that a model is the best model, given the data, amongst those examined. +For each variable, the Akaike weights of all models containing that variable were summed to give $Pr$. 
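+Written out as a sketch, with $i$ and $j$ indexing models in the credible set and $x$ denoting an explanatory variable (illustrative notation, not defined elsewhere in this chapter), the standard Akaike weight is
+\begin{equation}
+w_i = \frac{\exp(-\frac{1}{2}\Delta\text{AICc}_i)}{\sum_j \exp(-\frac{1}{2}\Delta\text{AICc}_j)},
+\end{equation}
+and the summed weight for variable $x$ is
+\begin{equation}
+Pr(x) = \sum_{i \,:\, x \in \text{model } i} w_i .
+\end{equation}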
+This value can be interpreted as the probability that the given variable is in the best model. + +To determine the direction and strength of the effect of each variable, I also calculated the mean of its regression coefficient, $b$, across all models that contained that variable, weighted by each model's Akaike weight. +In the subspecies analysis the inclusion of an interaction term between number of subspecies and study effort makes interpretation of this mean coefficient more difficult, particularly because the interaction term greatly affects the estimated value of $b$. +To aid interpretation, the mean coefficient for the number of subspecies was calculated for: \emph{i}) all models containing the number of subspecies, \emph{ii}) only models with the interaction term and \emph{iii}) only models with the number of subspecies but not the interaction term. + + + +%%begin.rcode boxplotCapt + +# Caption for the main boxplot of subspecies vs virus + +boxplotCapt <- paste( +'The relationship between number of subspecies and viral richness for', +nrow(nSpecies), +'bat species. +The area of the circle shows the number of bat species at each discrete value. +48 bat species have one subspecies and one known virus species. +The red line represents a phylogenetic simple regression between the two variables.
+' +) + +boxplotTitle <- paste( +'The relationship between number of subspecies and viral richness for', +nrow(nSpecies), +'bat species' +) + +%%end.rcode + +%%begin.rcode boxplot, fig.cap = boxplotCapt, fig.scap = boxplotTitle, fig.height = 2.3 + + +nSpeciesCounts <- nSpecies %>% + group_by(NumberOfSubspecies, virusSpecies) %>% + dplyr::summarize(n = n()) + +ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + + geom_point() + + scale_size(range = c(0.5, 4.3), breaks = c(1, 20, 40)) + + scale_y_continuous(breaks = c(1, 5, 10, max(nSpecies$virusSpecies))) + + scale_x_continuous(breaks = c(1, 4, 8, 12, 16)) + + xlab('Number of Subspecies') + + ylab('Viral Richness') + + geom_abline(slope = sspUni$coef[2, 1], intercept = sspUni$coef[1,1], lwd = 0.7, colour = pokepal('nidorina')[10]) + +%%end.rcode + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Results} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + + +\subsection{Number of Subspecies} +\tmpsection{More descriptive} + +The number of described virus species for a bat host ranged up to \rinline{max(nSpecies$virusSpecies)} viruses in \emph{\rinline{nSpecies$binomial[which.max(nSpecies$virusSpecies)]}}. +There appears to be a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:boxplot}) though few species have more than five subspecies. +Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +However these top seven models all contained study effort, number of subspecies and the interaction between these two variables. 
+The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models. +These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined. + +Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. +The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). +When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). +Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). + + + +\afterpage{ % use after page to make sure this whole table is at the end of a page. +\begin{landscape} +\begin{table}[p!] +\centering +%\rowcolors{2}{gray!25}{white} +\caption[Model selection results]{ +Model selection results for number of subspecies and effective level of gene flow analysis. +Models are ranked according to $\bar{\text{AICc}}$ and only the best nine and three models are shown respectively. 
+Models were fitted to all combinations of variables (in total \rinline{nrow(modelWeights)} number of subspecies models and \rinline{nrow(fstModelWeights)} effective gene flow models). +$\bar{\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. +$\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. +$w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). +$\sum w$ is the cumulative sum of the Akaike weights. +log(Scholar)*NSubspecies indicates the interaction term between study effort and number of subspecies. +%In the number of subspecies analysis there are many models with low $\Delta$AICc scores suggesting there there is no single `best model'. +%In the gene flow analysis, only the top model is supported. +} + + +\begin{tabular}{@{}>{\footnotesize}lrrrr@{}} + +\toprule +\normalsize{Model} & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$\\ +\midrule +&&&&\\[-3mm] +\textit{\small{Number of Subspecies}} &&&&\\ +%1 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) + log(RangeSize) & +\rinline{round(modelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[1, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[1, 5], 2))}\\ +%2 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) & +\rinline{round(modelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[2, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[2, 5], 2))}\\ +%3 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(Mass) & +\rinline{round(modelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[3, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[3, 5], 2))}\\ +%4 +log(Scholar) 
+ NSubspecies + log(Scholar)*NSubspecies & +\rinline{round(modelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[4, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[4, 5], 2))}\\ +%5 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(RangeSize) & +\rinline{round(modelWeights[5 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[5, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[5, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[5, 5], 2))}\\ +%6 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(RangeSize) & +\rinline{round(modelWeights[6 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[6, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[6, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[6, 5], 2))}\\ +%7 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random & +\rinline{round(modelWeights[7 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[7, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[7, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))}\\ +%8 +log(Scholar) + NSubspecies + log(Mass) + Random & +\rinline{round(modelWeights[8 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[8, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[8, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[8, 5], 2))}\\ +%9 +log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) + Random & +\rinline{round(modelWeights[9 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[9, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[9, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[9, 5], 2))}\\[5mm] +\textit{\small{Gene flow}} &&&&\\ +log(Scholar) + log(Gene flow) + log(Mass) & +\rinline{round(fstModelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 5], 2))}\\
+log(Range size) & +\rinline{round(fstModelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 5], 2))}\\ +log(Mass) & +\rinline{round(fstModelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 5], 2))}\\ +%log(Scholar) + log(Gene flow) + log(Mass) + Random & +%\rinline{round(fstModelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 3], 2))} & +%\rinline{sprintf("%.2f", round(fstModelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 5], 2))}\\ +\bottomrule +\end{tabular} + +\label{t:models} +\end{table} +\end{landscape} +} + + + + +When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). +Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. +However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. + + + +\tmpsection{Model results} + + +\begin{table}[t!] +\centering +\caption[Estimated variable weights and coefficients for number of subspecies and gene flow analyses]{ +Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses. 
+The coefficients for the number of subspecies variable are given for models with and without the interaction term because this term strongly changes the coefficient and because the coefficient can only be usefully interpreted when estimated without the interaction. +However, there are no weights for these separated terms as they are not directly compared in the model selection framework. +} +%\rowcolors{2}{gray!25}{white} +\begin{tabular}{@{}>{\small}l rrrr@{}} +\toprule +& \multicolumn{2}{c}{\textit{Number of Subspecies}} & \multicolumn{2}{c}{\textit{Gene flow}}\\\cmidrule(rl){2-3}\cmidrule(rl){4-5} +\normalsize{Variable} & $Pr$ & Coefficient & $Pr$ & Coefficient\\ +\midrule +Number of subspecies &&&&\\ +\hspace{3mm}Total & \rinline{sprintf('%.2f', varWeights['NumberOfSubspecies'])} & \rinline{varCoefMeans['beta.NumberOfSubspecies']} &&\\ +\hspace{3mm}Models without interaction term && \rinline{nSpeciesCoefMean} &&\\ +\hspace{3mm}Models with interaction term && \rinline{nSpeciesCoefMeanI} &&\\ +Number of subspecies*log(Scholar) & \rinline{varWeights['scholarRefs.NumberOfSubspecies']} & \rinline{sprintf('%.2f', varCoefMeans['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm] +Gene flow & & & \rinline{sprintf('%.2f', fstVarWeights['Nm'])} & \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm] +log(Scholar) & \rinline{sprintf('%.2f', varWeights['scholarRefs'])} & \rinline{varCoefMeans['beta.scholarRefs']} & + \rinline{sprintf('%.2f', fstVarWeights['scholarRefs'])} & \rinline{fstCoefMeans['beta.scholarRefs']}\\ +log(Mass) & \rinline{sprintf('%.2f', varWeights['mass'])} & \rinline{varCoefMeans['beta.mass']} & + \rinline{sprintf('%.2f', fstVarWeights['mass'])} & \rinline{fstCoefMeans['beta.mass']}\\ +log(Range size) & \rinline{sprintf('%.2f', varWeights['distrSize'])} & \rinline{varCoefMeans['beta.distrSize']}& + \rinline{fstVarWeights['distrSize']} & \rinline{fstCoefMeans['beta.distrSize']}\\ +Random & \rinline{sprintf('%.2f', varWeights['rand'])} & 
\rinline{varCoefMeans['beta.rand']}& + \rinline{fstVarWeights['rand']} & \rinline{fstCoefMeans['beta.rand']}\\ +\bottomrule +\end{tabular} + +\label{t:variables} +\end{table} + + + + +\subsection{Gene Flow} + +\tmpsection{More Descriptive} + +%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. +The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). +Only the model with study effort, gene flow and body mass was well supported, with the second model having a $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +The effective level of gene flow was likely in the best model ($Pr > 0.99$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). +On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the non-significant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). +Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more variable than the number of subspecies coefficient, with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. + + + + + +%%begin.rcode ITCombPlotCapt + +ITPlotCapts <- " +The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness.
+The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. +The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. +The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. +The red ``Random'' box is the uniformly random variable. +Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." + +ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness" + +%%end.rcode + + +%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = ITPlotTitle, out.width = '\\textwidth', cache = FALSE + +# Reorder var levels to get structure at beginning. 
+fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) + +# Draw the fst model selection plot +fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + ylim(0, 1) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10) + + +# Combine and print the plots. +ggdraw() + + draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(ITPlot, 0, 0, 0.5, 1) + + draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) + + draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) + + +%%end.rcode + + + + +Study effort was very likely in the best model ($Pr > 0.99$) as was body mass ($Pr > 0.99$). +However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). 
% which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). +In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. +%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). +Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}). +The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). + + + + +%%begin.rcode fstRawCapt + +fstRawDataCapt <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species. +Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. +The red line represents a phylogenetic simple regression between the two variables.
+') + + + +fstRawDataTitle <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species +') + +%%end.rcode + + + +%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle + +# Plot raw fst data + +ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) + + geom_point(size = 2) + + scale_colour_poke(pokemon = 'oddish', spread = 3) + + scale_x_log10() + + geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) + + xlab('Gene Flow (per gen.)') + + ylab('Viral Richness') + +%%end.rcode + + + +When using the phylogeny from \textcite{jones2005bats} the analysis became very unstable (Figure~\ref{f:A-itplots}). +The support for each variable changed dramatically with each resampling of the random variable. +On average, however, only the model containing mass and range size is supported (Tables~\ref{A-fstModelWeights} and~\ref{t:variables2}). + + + + +\subsection{Phylogenetic Analysis} + +\subsubsection{Number of subspecies} + +Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species. +The mean number of viruses across families is fairly constant with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean (\rinline{min(familyMeans$mean)}). +The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species. +The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}).
+ + + +The small change in mean pathogen richness across families and the lack of a clear pattern in Figure~\ref{fig:treePlot} imply that viral richness is not strongly phylogenetic. +This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}). +%This fact implies that other factors must control pathogen richness. +%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. + +Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.99$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). +Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))}, which implied that the residuals from the models were very weakly phylogenetic. +A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. + + + + +\subsubsection{Effective gene flow} + +There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.99$). +Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.99$). +Due to the limited sample size, significance tests are unlikely to have much power.
+There is little evidence of phylogenetic autocorrelation in study effort ($\lambda$ = \rinline{scholarFstLambda$param['lambda']}, $p$ = \rinline{scholarFstLambda$param.CI$lambda$bounds.p[1]}). +However, there is some weak evidence of phylogenetic signal in range size as the estimated size of $\lambda$ is large while $p$ is also large, potentially due to a lack of statistical power ($\lambda$ = \rinline{distrFstLambda$param['lambda']}, $p$ = \rinline{distrFstLambda$param.CI$lambda$bounds.p[1]}). +Body mass showed significant phylogenetic autocorrelation ($\lambda$ = \rinline{massFstLambda$param['lambda']}, $p$ = \rinline{massFstLambda$param.CI$lambda$bounds.p[1]}). + + +Across all multiple regression models the mean value of $\lambda$ is \rinline{mean(na.omit(fstAllResults$lambda))} and a large number of individual models (\rinline{round(mean(na.omit(fstAllResults$lambda < 0))*100)}\%) had negatively phylogenetically distributed residuals implying the residuals from the model are spread more uniformly on the phylogeny than expected by chance. +Due to the small sample size this was probably due to a small number of data points with large residuals being distant on the tree. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Discussion} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\tmpsection{Discuss results in more detail} + + +\tmpsection{Pop structure relates to pathogen richness} + +% It does so here +% I hope this study is more robust. +In this study I have used known viral richness in bats as a case study for the more general hypothesis that increased population structure promotes pathogen richness. 
+In both analyses I found that a positive effect of increasing population structure (a positive effect of the number of subspecies and a negative effect of gene flow) is likely to be in the best model for explaining viral richness. +Only the effective gene flow analysis, when performed using the phylogeny from \textcite{jones2005bats}, does not support this hypothesis. +Therefore, my study supports the broader hypothesis that increased population structure promotes pathogen richness. +The positive relationship between increased population structure and pathogen richness implies that direct or indirect competitive mechanisms are acting such that increased population structure allows escape from competition, which promotes pathogen richness. +Furthermore, my study contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness by increasing the rate of spread of new pathogens entering into the population \cite{nunn2003comparative, morand2000wormy}. + + + +% It does so in some lit +This analysis is in agreement with two studies that have specifically tested this same hypothesis \cite{turmelle2009correlates, maganga2014bat}. +These two studies used $F_{ST}$ \cite{turmelle2009correlates} and fragmentation of species distributions \cite{maganga2014bat}. +Combined with the analysis here using the number of subspecies, three different measures of population structure have been shown to correlate with pathogen richness in bats. +By analysing data on two measures of population structure, and using larger data sets than previous studies, it is hoped that the results here may be more robust than in previous analyses \cite{gay2014parasite, turmelle2009correlates, maganga2014bat}. + + + +% The pattern is reversed in other lit +In contrast, \textcite{gay2014parasite} found the opposite relationship using fragmentation of species distribution.
+Furthermore, \textcite{bordes2008bat} found no relationship between increased colony size and pathogen richness while \textcite{gay2014parasite} found relationships in opposite directions for virus and ectoparasite richness. +However, the study by \textcite{gay2014parasite} uses relatively few species while the study by \textcite{bordes2008bat} uses group size which is a measure of local rather than global population structure. +The overall weight of evidence suggests that population structure and pathogen richness are associated. + + + + +\tmpsection{There is an interaction between study effort and number of subspecies} + +% interpretations +% Biases are known in the lit. gippoliti2007problem % maybe should add to methods? + +There was strong support for a positive interaction between the number of subspecies and study effort. +The support for this interaction implies that increased population structure has a stronger relationship with known pathogen richness in the presence of study effort. +One interpretation of this is that increased population structure alone does not predict high known viral richness; reasonable study effort is also needed to turn the expected high viral richness into known and recorded viral richness. +Biases in identification of subspecies have been noted before \cite{gippoliti2007problem}. +The number of subspecies is more commonly used as a variable in comparative analyses of birds than mammals but the fact that it is associated with study effort is often not taken into account \cite{phillimore2007biogeographical, belliure2000dispersal}. + +\tmpsection{Other explanatory vars} + + +% study effort is important. Never forget. +% body mass behaves wierdly. +% Range size is very marginal + +Of the other explanatory variables considered, study effort and body mass were selected as being in the best model while there was marginal evidence for range size being associated with viral richness. 
+Study effort positively predicted pathogen richness, confirming the expectation that additional study of a bat species yields more known viruses infecting that host species. +Therefore, this bias cannot be ignored in studies using known pathogen richness as a proxy for total pathogen richness \cite{nunn2003comparative, gregory1990parasites}. +While body mass is selected as being in the best model in both the number of subspecies analysis and the effective gene flow analysis the estimated coefficients have opposite signs in the two analyses. +In the number of subspecies analysis, body mass has a positive relationship with pathogen richness which is in agreement with previous studies \cite{kamiya2014determines, bordes2008bat, turmelle2009correlates, gay2014parasite, maganga2014bat}. +However, in the effective gene flow analysis, body mass has a negative estimated coefficient. +This is in contrast to the number of subspecies analysis, previous studies in the literature and the single-predictor model. +This result is probably due to correlations with other variables in the analysis and exacerbated by the small sample size in this analysis. + + +\tmpsection{phylogeny} +% Phylogeny is not very important +% phylogeny is weird in Fst study? + + + +%Another interpretation is that having few subspecies does not predict low viral richness unless the species has been adequately studied as otherwise the low number of subspecies is probably due to a lack of study rather than an accurate measurement. + +%Another potential mechanism by which structure might be promoting increased richness is by slowing the spread of highly virulent viruses such as rabies and preventing them from having short, intense epidemics followed by extinction. +%This mechanism has interesting parallel to metapopulation theory in ecology in which a metapopulation structure can allow persistence of species that would otherwise go extinct. 
+ +\subsection{Broader implications} + +The relationship between increased population structure and pathogen richness suggests that population structure has at least some potential to predict high pathogen richness and therefore a species' likelihood of being a reservoir of a potentially zoonotic pathogen. +However, given that it is difficult to measure population structure and given that the relationship appears to be weak at best, this trait on its own is unlikely to be useful in predicting zoonotic risk. +Nevertheless, a number of other factors are also associated with pathogen richness, such as body mass and, to a lesser extent, range size, as shown here, as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}. +Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential for use in prioritising zoonotic disease surveillance. +The main hurdle in this approach is finding a way to validate models; due to the study effort bias in current data, predictive models will also be biased. +As unbiased pathogen surveys such as \textcite{anthony2013strategy} become more common, good validation may become possible. +Alternatively, predictive models could be trained on all available, and therefore biased, data and validated by predicting smaller, unbiased data sets such as the data collected in \textcite{maganga2014bat}. + +The relationship between increased population structure and pathogen richness also has implications for habitat fragmentation and range shifts due to global change. +In short, habitat fragmentation and range shifts that reduce movement between populations would be predicted to increase pathogen richness. +However, depending on the mechanisms by which increased population structure increases pathogen richness, this may not be a cause for concern.
+If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over any short to medium-term timescales. +If, however, increased population structure actively promotes the evolution of new pathogen strains or allows the persistence of more virulent strains \cite{blackwood2013resolving, pons2014insights, plowright2011urban} this could have important public health implications. +Therefore further studies on the exact mechanisms by which increased population structure affects pathogen richness are needed. + + +\subsection{Study limitations} + +Although I have used measures of study effort to try to control for biases in the viral richness data, this bias could still make the results here unreliable --- this is especially true as study effort is by far the strongest predictor of viral richness in both data sets. +It is hoped that as untargeted sequencing of viral genetic material becomes cheaper and more common this bias can be reduced \cite{anthony2013strategy}. +The strength of the relationship between study effort and known viral richness also highlights the number of bat-virus host-pathogen relationships yet to be discovered and the number of virus species that are yet to be described. + +I have included a number of explanatory variables to avoid spurious correlations. +However, there is little data on bat density or population size. +Given that studies in other mammalian groups have found relationships between host density and pathogen richness this would be a useful variable to include in further analyses \cite{kamiya2014determines, nunn2003comparative, arneberg2002host}. +Acoustic monitoring is becoming cheaper and less labour intensive and may provide suitable data for estimating population densities or population sizes for more bat species. 
+However, it is not clear whether host population density or host population size is the more appropriate measure with respect to disease dynamics \cite{begon2002clarification}. +Given the importance of geographic range size found here and elsewhere \cite{lindenfors2007parasite, nunn2003comparative, turmelle2009correlates, huang2015parasite, kamiya2014determines}, comparative studies may struggle to select between these three related factors: host population size, population density and geographic range size. + +I have used two measures of population structure and the number of subspecies data set is larger than those used in previous studies. +However, it is clear that the gene flow data set is small ($n$ = \rinline{nrow(fstFinal)}). +This may explain some unexpected results. +While the model averaging approach has given a negative model-averaged coefficient for gene flow, the single-predictor model of gene flow against viral richness gave a positive coefficient. +Furthermore, body mass has a negative model-averaged coefficient. +This is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model. +It is not easy to interpret these contradictions, but it is clear that the results from the gene flow analysis alone should not be considered strong evidence for a relationship between increased population structure and pathogen richness. +These contradictions also reiterate the need to use large data sets where possible and the need to use multiple measures of population structure to promote robust conclusions. + +Finally, while comparative studies are a useful tool for examining broad trends of pathogen richness across large taxonomic groups, they cannot examine the specific mechanisms that may be underpinning the correlations found.
+Therefore, further work is needed to test which mechanisms are actually causing the relationship between increased population structure and pathogen richness that I have identified here. +A number of mechanisms might be involved. +A reduced rate of pathogen extinction might be caused by a reduction in competition due to the slow dispersal of competing pathogens. +Alternatively, increased population structure may promote the invasion of new pathogens, by creating localised areas of low competition or host immunity. +One method for testing these mechanisms would be through mechanistic epidemiological models. + +\subsection{Conclusions} + + +I have used phylogenetic linear models to identify positive relationships between two measures of population structure (the number of subspecies and effective levels of gene flow) and viral richness in bats. +This study adds to the evidence that increased population structure may promote pathogen richness. +It does not support the view that factors that increase $R_0$ will increase pathogen richness. +Using larger data sets and multiple measurements makes the weight of the evidence here stronger than in previous studies. +However, caution must still be taken in interpreting these results as the data is biased and particularly sparse in one of the analyses. + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%% Repeat analysis with bat clocks and rocks %%%% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%\section{Appendix} + + + +%%begin.rcode treeRead2 + +# Read in trees +t2 <- read.nexus('data/Chapter3/BatST2BL.nex') + +# Make names match previous names +t2$tip.label <- gsub('_', ' ', t2$tip.label) + +#missing <- nSpecies$binomial[!nSpecies$binomial %in% pruneTree2$tip.label ] + +## Copy binomial column. binomial will be changed to fit t2. 
+#nSpecies$oldBinomial <- nSpecies$binomial + +## Replace with agrep where possible +#closeMatch <- sapply(missing, function(i) t2$tip.label[agrep(i, t2$tip.label, max.distance = 0.11)]) + +#closeMatch <- closeMatch[sapply(closeMatch, function(i) length(i) > 0)] + + + + +unneededTips2 <- t2$tip.label[!(t2$tip.label %in% nSpecies$binomial)] + +# Prune tree down to only needed tips. +pruneTree2 <- drop.tip(t2, unneededTips2) + + +nSpecies2 <- sapply(pruneTree2$tip.label, function(x) which(nSpecies$binomial == x)) %>% + nSpecies[., ] + + +################ +## Fst tree ## +################ + + +# Which tips are not needed +fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree2 <- drop.tip(t2, fstUnneededTips2) + +# Which tips in the Fst analysis are not in the bat clocks tree. +fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] + + +# Hacky way of placing the missing tips into the tree. Should end up with genus-level polytomies in the trimmed tree. +# Just replacing some of the unneeded tips with the ones I need. + +t2$tip.label[t2$tip.label == 'Miniopterus pusillus'] <- 'Miniopterus natalensis' +t2$tip.label[t2$tip.label == 'Miniopterus schreibersi'] <- 'Miniopterus schreibersii' +t2$tip.label[t2$tip.label == 'Rousettus celebensis'] <- 'Rousettus leschenaultii' +t2$tip.label[t2$tip.label == 'Myotis oxyotus'] <- 'Myotis macropus' +t2$tip.label[t2$tip.label == 'Myotis leibii'] <- 'Myotis ciliolabrum' + +#Re prune tree +# Which tips are not needed +fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree2 <- drop.tip(t2, fstUnneededTips2) + +# Check we now have all the tips.
+fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] + +rm(t2) + + + + +%%end.rcode + + +%%begin.rcode treePlot2, show.figs = 'hide', out.width = '\\textwidth', fig.cap = 'Pruned phylogeny \\cite{jones2005bats} with dot size showing number of pathogens and colour showing family.' + + + +## Plot tree +#p2 <- ggtree(pruneTree2, layout = 'fan') + +#p2 %<+% nSpecies2[, 1:6] + +# geom_point(aes(size = virusSpecies, colour = Family), subset=.(isTip)) + +# scale_size(range = c(0.8, 3)) + +# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + +# theme_tcdl + +# theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + +# theme(legend.position = 'right') + +# labs(size = 'Virus Richness') + +# theme(legend.key.size = unit(0.6, "lines"), +# legend.text = element_text(size = 6), +# legend.title = element_text(size = 8)) + + + +%%end.rcode + + + +%%begin.rcode runBatClocks, eval = TRUE + + +fitModelsBootStrap2 <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies2, pruneTree2, b, allModelMat, varList), mc.cores = nCores) + +allResults2 <- do.call(rbind, fitModelsBootStrap2) + +write.csv(allResults2, file = 'data/Chapter3/modelSelectSubspeciesBatClocks.csv') + + +## FST analysis + +fstModelsBootStrap2 <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree2, b, fstModelMat, fstVarList), mc.cores = nCores) + +fstAllResults2 <- do.call(rbind, fstModelsBootStrap2) + +write.csv(fstAllResults2, file = 'data/Chapter3/fstModelSelectSubspeciesBatClocks.csv') + + +%%end.rcode + + +%%begin.rcode batClocksAnalyse + +allResults2 <- read.csv('data/Chapter3/modelSelectSubspeciesBatClocks.csv', row.names = 1) + +varWeights2 <- sapply(names(allResults2)[1:6], function(x) sum(allResults2$weight[allResults2[, x]])/nBoots) + + +sepVarWeights2 <- lapply(1:nBoots, function(b) + sapply(names(allResults2)[1:6], + function(x) + sum(allResults2[allResults2$boot == b, 
'weight'][allResults2[allResults2$boot == b, x]]) + ) + ) + +sepVarWeights2 <- do.call(rbind, sepVarWeights2) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +sepVarWeights2$col <- 'Other Variables' +sepVarWeights2$col[grep('NumberOf', sepVarWeights2$variable)] <- 'Population Structure' +sepVarWeights2$col[sepVarWeights2$variable == 'rand'] <- 'Null' + + + +modelWeights2 <- allResults2 %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) %>% + mutate(string = levels(predictors)[predictors]) + + +#### FST + + +fstAllResults2 <- read.csv('data/Chapter3/fstModelSelectSubspeciesBatClocks.csv', row.names = 1) + +fstSepVarWeights2 <- lapply(1:nBoots, function(b) + sapply(names(fstAllResults2)[1:5], + function(x) + sum(fstAllResults2[fstAllResults2$boot == b, 'weight'][fstAllResults2[fstAllResults2$boot == b, x]]) + ) + ) + +fstSepVarWeights2 <- do.call(rbind, fstSepVarWeights2) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +fstSepVarWeights2$col <- 'Other Variables' +fstSepVarWeights2$col[fstSepVarWeights2$variable == 'Nm'] <- 'Population Structure' +fstSepVarWeights2$col[fstSepVarWeights2$variable == 'rand'] <- 'Null' + + +fstVarWeights2 <- sapply(names(fstAllResults2)[1:5], function(x) sum(fstAllResults2$weight[fstAllResults2[, x]])/nBoots) + + +fstModelWeights2 <- fstAllResults2 %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) + + +%%end.rcode + + +%% ------------------------------------------- %% +%% plot bat clocks rocks +%% ------------------------------------------- %% + 
+ +%%begin.rcode ITPlots2 + +# reorder factors to get structure vars at beginning. +sepVarWeights2$variable <- factor(sepVarWeights2$variable, levels(sepVarWeights2$variable)[c(2, 6, 1, 3, 4, 5)]) + +ITPlot2 <- ggplot(sepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +%%end.rcode + + +%%begin.rcode fstITPlots2, fig.show = extraFigs, fig.cap = "Akaike variable weights for both analyses using the phylogeny from \\textcite{jones2005bats}. The probability that each variable is in the best model (amongst the models tested) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. The red ``Random'' box is the uniformly random variable.", fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', out.extra = 'trim = 0 1cm 0 0' + + +# Reorder var levels to get structure at beginning.
+fstSepVarWeights2$variable <- factor(fstSepVarWeights2$variable, levels(fstSepVarWeights2$variable)[c(2, 1, 3, 4, 5)]) + +# Draw the fst model selection plot +fstIT2 <- ggplot(fstSepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + ylim(0, 1) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +# Combine and print. +ggdraw() + + draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(ITPlot2, 0, 0, 0.5, 1) + + draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(fstIT2, 0.5, 0.164, 0.5, 0.855) + + draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) + + +%%end.rcode + diff --git a/comparative-test-of-pop-structure.Rtex b/comparative-test-of-pop-structure.Rtex new file mode 100644 index 0000000..e8c7b18 --- /dev/null +++ b/comparative-test-of-pop-structure.Rtex @@ -0,0 +1,2707 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code and text for "A comparative test of the role of population structure in determining pathogen richness" +% Chapter 2 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +% NB This file is a copy due to 
the mess up with chapter numbers. +% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex +% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + +%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide', eval = TRUE + + +################################## +### Run web scraping? ### +################################## + +# There are some slow web scraping functions. Run them? +runPubmedScrape <- FALSE +runScholarScrape <- FALSE +runFstScrape <- FALSE + + +# Run slow bootstrapping? +subBoots <- FALSE +fstBoots <- FALSE +batclocksBoots <- FALSE + +# Run the slow parts of the fst data wrangling? +fstComb <- FALSE +runIucn <- FALSE + +# There are figures created in the data analysis which are not in the final chapter document. +# If TRUE, they will be included in the output. +# Use 'hide' to remove them. +extraFigs <- 'hide' + +#knitr options +opts_chunk$set(cache.path = '.Ch3Cache/') +source('misc/KnitrOptions.R') + +# ggplot2 theme. +source('misc/theme_tcdl.R') +theme_set(theme_grey() + theme_tcdl) + + +# Choose the number of cores to use +nCores <- 4 + +%%end.rcode + + +%%begin.rcode libs, cache = FALSE, result = FALSE + +# Data handling +library(dplyr) +library(broom) +library(readxl) +library(sqldf) +library(reshape2) + +# phylogenetic regression +library(ape) +library(caper) +library(phytools) +library(nlme) +library(qpcR) +library(car) + +# weighted means + var +library(Hmisc) + +# Plotting +library(ggplot2) +library(ggtree) +library(palettetown) +library(ggthemes) +library(GGally) +library(cowplot) + + +# Web scraping. +library(rvest) + +# For synonym list +library(taxize) + +# Spatial analysis +library(maptools) +library(geosphere) + +# Parallel computation +library(parallel) + +%%end.rcode + + + +%%begin.rcode parameters + + +# Define some parameters.
+# This is useful at the top so that it can go in text. + +# How many bootstraps for model selection NULL variable +nBoots <- 50 + +# What proportion of a species range should be covered for an Fst study to count as valid. +rangeUseable <- 0.20 + +%%end.rcode + +\section{Abstract} + + +%\tmpsection{One or two sentences providing a basic introduction to the field} +% comprehensible to a scientist in any discipline. +\lettr{Z}oonotic diseases make up the majority of human infectious diseases and are a major drain on healthcare resources and economies. +Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens, all else being equal. +However, the factors that influence pathogen richness in animal species are poorly understood. +% +% +%\tmpsection{Two to three sentences of more detailed background} +% comprehensible to scientists in related disciplines. +% Theory led. +The pattern of contacts between individuals (i.e.\ population structure) can be influenced by habitat fragmentation, sociality and dispersal behaviour. +Epidemiological theory suggests that increased population structure can promote pathogen richness by reducing competition between pathogen species. +Conversely, it is often assumed that as greater population structure slows the spread of a new pathogen (i.e.\ lowers $R_0$), less structured populations should have greater pathogen richness. +% +% +%\tmpsection{One sentence clearly stating the general problem (the gap)} +% being addressed by this particular study. +Previous comparative studies comparing pathogen richness and population structure measured population structure differently and have had contradictory results, complicating the interpretation. +% +% +%\tmpsection{One sentence summarising the main result} +% (with the words “here we show” or their equivalent). 
+Here I test whether increased population structure correlates with viral richness using comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny. +This is an indirect test between the two competing hypotheses: does increased population structure allow pathogen coexistence by reducing competition, or does increased population structure decrease $R_0$ and therefore cause fewer new pathogens to enter the population? +Bats, as a group, make a useful case study because they have been associated with a number of important, recent zoonotic outbreaks. +Unlike previous studies, I use two measures of population structure: the number of subspecies and effective levels of gene flow. +I find that both measures are positively associated with pathogen richness. +% +% +%\tmpsection{Two or three sentences explaining what the main result reveals in direct comparison to what was thought to be the case previously} +% or how the main result adds to previous knowledge +My results add more robust support to the hypothesis that increased population structure promotes viral richness in bats. +The results support the prediction that increased population structure allows greater pathogen richness by reducing competition between pathogens. +The prediction that factors that decrease $R_0$ should decrease pathogen richness is not supported. +% +% +%\tmpsection{One or two sentences to put the results into a more general context.} +Although my analysis implies that increased population structure does promote pathogen richness in bats, the weakness of the relationship and the difficulty in obtaining some measurements mean that this is probably not a useful, predictive factor on its own for optimising zoonotic surveillance. +%However, the relationship has implications for global change, implying that increased habitat fragmentation might promote greater viral richness in bats.
+ + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Introduction} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%#the introduction is not bad and starts very well but i think you need a bit more from studies of other mammals (not bats) to put the study into context as well as explaining why particularly you focus on pop structure, some justification of why bats, and less detail about the specific Fst measures (move to methods) and more stuff on your actual methods and approach you use in this study. + +%#Structure could be: +%#1. Zoonotic disease is bad (as you have written it already) +%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. However, pop structure (explain what this means) is of particular interest because of blah blah. +%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition +%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). +%#5. Bats are very interesting in this regard because of blah +%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny. +%#7. 
Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness_ and two measures of population structure. +%#8. I found ... + +\tmpsection{General Intro} + +%#1. Zoonotic disease is bad (as you have written it already) +Zoonotic pathogens make up the majority of newly emerging diseases and have profound consequences for public health, economics and international development \cite{jones2008global, smith2014global, ebolaWorldbank}. +Better statistical models for predicting which wild host species are potential reservoirs of zoonotic diseases would allow us to optimise zoonotic disease surveillance and anticipate how the risks of disease spillover might change with global change. +The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity and interactions with humans, the prevalence of its pathogens and the number of pathogen species it carries \cite{wolfe2000deforestation}. +However, the factors that control the number of pathogen species a host species carries remain poorly understood. + + +\tmpsection{Specific Intro} + +%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. +\tmpsection{Theoretical background} + + +A number of species traits that might control pathogen richness have been studied. +These traits can be at the level of the individual (e.g., body mass and longevity) or the level of the population (e.g., population density, sociality and species range size). 
+Large-bodied animals have been shown to have high pathogen richness, as large bodies provide more resources for pathogens \cite{kamiya2014determines, arneberg2002host, poulin1995phylogeny, bordes2008bat, luis2013comparison}. +Long-lived species are expected to have high pathogen richness because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}. +Animal density \cite{kamiya2014determines, nunn2003comparative, arneberg2002host} and sociality \cite{bordes2007rodent, vitone2004body, altizer2003social, ezenwa2006host} are both predicted to increase pathogen richness by increasing the rate of spread, $R_0$, of a new pathogen. +Finally, widely distributed species have high pathogen richness, potentially because they experience a wider range of environments or because they are sympatric with more species \cite{kamiya2014determines, nunn2003comparative, luis2013comparison}. + +%# However, pop structure (explain what this means) is of particular interest because of blah blah. + +%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition + + +A further population-level factor that may affect pathogen richness is population structure. +Population structure can be defined as the extent to which interactions between individuals in a population are non-random. +The role of population structure in human epidemics has been studied in depth and it has been shown that decreased population structure increases the speed of pathogen spread and makes establishment of a new pathogen more likely \cite{colizza2007invasion, vespignani2008reaction}.
+In comparative studies of pathogen richness in wild animals, this relationship with $R_0$ is often taken as a prediction that decreased population structure will increase pathogen richness relative to other host species \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. +However, epidemiological models of highly virulent pathogens have shown that increased population structure can allow persistence of a pathogen where a well-mixed population would experience a single, large epidemic followed by pathogen extinction \cite{blackwood2013resolving, plowright2011urban}. +Furthermore, the assumption that high $R_0$ leads to high pathogen richness ignores inter-pathogen competition. +Simple epidemiological models of competition between multiple pathogens show that, in completely unstructured populations, a competitive exclusion process occurs but that adding population structure makes coexistence possible \cite{qiu2013vector, allen2004sis, nunes2006localized}. + + +\tmpsection{Previous Studies} + +%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). + +There is a lack of large, comparative studies of the role of population structure on pathogen richness. +Sociality, which is one constituent part of population structure, has been well studied. +However, in primates only a weak positive association between sociality and pathogen richness was found \cite{vitone2004body}. +Furthermore, a negative association was found in rodents \cite{bordes2007rodent} and in even and odd-toed hoofed mammals \cite{ezenwa2006host}. +Finally, two studies tested for an association between group size and parasite richness in bats \cite{bordes2008bat, gay2014parasite}. +Amongst 138 bat species, \textcite{bordes2008bat} found no relationship between group size (coded into four classes) and bat fly species richness. 
+\textcite{gay2014parasite} found a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness. +While sociality is an important component of population structure, it does not fully capture how connected the population is globally. + + +%#5. Bats are very interesting in this regard because of blah + +%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny. + + +Three studies have used comparative data to test for an association between global population structure and viral richness in bats. +A study on 15 African bat species found a positive relationship between the extent of distribution fragmentation and viral richness \cite{maganga2014bat}. +Conversely, a study on 20 South-East Asian bat species found the opposite relationship \cite{gay2014parasite}. +These studies used the ratio between the perimeter and area of the species' geographic range as their measure of population structure. +However, range maps are very coarse for many species. +Furthermore, range maps are likely to be more detailed (and therefore have a greater perimeter) in well-studied species. + +A global study on 33 bat species found a positive relationship between $F_{ST}$ --- a measure of genetic structure --- and viral richness \cite{turmelle2009correlates}. +However, this study included measures based on mtDNA, which reflects only female dispersal; this may have biased the results as many bat species show female philopatry \cite{kerth2002extreme, hulva2010mechanisms}. +Furthermore, this study used measures of $F_{ST}$ irrespective of the spatial scale of the study, including studies covering from tens \cite{mccracken1981social} to thousands \cite{petit1999male} of kilometres.
+As isolation by distance has been shown in a number of bat species \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}, this could bias results further. +Finally, when a global $F_{ST}$ value was not given, \textcite{turmelle2009correlates} used the mean of all pairwise $F_{ST}$ values between sites. +This is not correct because pairwise and global $F_{ST}$ values have different relationships with effective migration rates. + + + +\tmpsection{The gap} +\tmpsection{What I did/found} + +%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure. +%#8. I found ... + +Here I used a phylogenetic comparative approach to test for a relationship between increased population structure and pathogen richness in the largest study of bats to date. +I used phylogenetic linear models, controlling for the other life history characteristics known to impact pathogen richness, to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure: the number of subspecies and effective gene flow. +I used two measures of population structure to increase the robustness of the analysis; this is particularly important as previous studies have had contradictory results \cite{maganga2014bat, gay2014parasite, turmelle2009correlates}. + +I found that increases in both measures of population structure are positively associated with viral richness and are included as explanatory variables in the best models for describing viral richness.
+Furthermore, I found that the role of phylogeny is very weak both in the models and in the distribution of viral richness amongst taxa.
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Methods}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+\subsection{Data Collection}
+
+\subsubsection{Pathogen richness}
+
+To measure pathogen richness I used data from \textcite{luis2013comparison}.
+These data record known infections of a bat species with a virus species.
+I used viral richness as a proxy for pathogen richness more generally.
+Rows in which the host was not identified to species level according to \textcite{wilson2005mammal} were removed.
+Many viruses were not identified to species level or their specified species names were not in the ICTV virus taxonomy \cite{ICTV}.
+Therefore, I counted such a virus only if it was the only virus recorded for that host species within the lowest taxonomic level to which it was identified (among levels present in the ICTV taxonomy).
+For example, if a host is recorded as harbouring an unknown Paramyxoviridae virus, then it is logical to assume that the host carries at least one Paramyxoviridae virus.
+If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, it is hard to confirm that the unknown virus is not another record of the known virus.
+In this case, the host was counted as having one virus species.
+
+
+%$F_{ST}$ studies are conducted at a range of spatial scales, but $F_{ST}$ often increases with distance studied \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}.
+%To minimise the effects of this I only used data from studies that cover \rinline{rangeUseable * 100}\% of the diameter of the species range.
+%This is a largely arbitrary value that could be considered to reflect a ``global'' estimate of $F_{ST}$ while keeping a reasonable number of data points available. +%I calculated the diameter of the species range by finding the furthest apart points in the IUCN species range \cite{iucn} even if the range is split into multiple polygons. +%The width covered by each study was the distance between the most distant sampling sites. +%When this was not explicit in the paper, the centre of the lowest level of geographic area was used. + + + + +%%begin.rcode luis2013virusRead + +#read in luis2013virus data +virus2 <- read.csv('data/Chapter3/luis2013comparison.csv', stringsAsFactors = FALSE) + + +virus2$binomial <- paste(virus2$host.genus, virus2$host.species) + + +# From methods +#Many viruses were not identified to species level or their identified species was not in the ICTV virus taxonomy \cite{ICTV}. +#I counted a virus if it was the only virus, for that host species, in the lowest taxonomic level identified in the ICTV taxonomy. +#That is, if a host carries an unknown Paramyxoviridae virus, then it must carry at least one Paramyxoviridae virus. +#If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, then it is hard to confirm that the unknown virus is not another record of the known virus. +#In this case, this would be counted as one virus species. + +# This has been implemented manually and indicated in the column `remove` + +virus2 <- virus2[!virus2$remove, ] + +%%end.rcode + + +%%begin.rcode wilsonReaderTaxonomyRead, fig.show = extraFigs, fig.cap = 'Histogram of number of subspecies' + +################################################################## +### Subspecies vs Viruses analysis. ### +################################################################## + + +# Read in the wilson Reader Taxonomy and use it to calculate the number of subspecies each bat species has. 
+ +tax <- read.csv('data/Chapter3/msw3-all.csv', stringsAsFactors = FALSE) + +chir <- tax %>% + filter(Order == 'CHIROPTERA') + +# Save some memory. +rm(tax) + +# Count the number of subspecies each bat species has. +subs <- sqldf(' + SELECT Family, Genus, Species, COUNT(Subspecies) + AS NumberOfSubspecies + FROM chir + Where Species <> "" + GROUP BY Genus, Species + ') + + + +# I think each species has 1 row for species and extra rows for subspecies +# Check this is true. +# If that is correct, then Species with >1 NumberOfSubspecies should be one less. + +SpeciesRows <- sqldf(' + SELECT Genus, Species, COUNT(Subspecies) + AS SpeciesRows + FROM chir + WHERE Subspecies == "" AND Species <> "" + GROUP BY Genus, Species + ') + +# +(SpeciesRows$SpeciesRows != 1) %>% sum +all(SpeciesRows$SpeciesRows == 1) + +# Species with >1 NumberOfSubspecies should be one less +subs$NumberOfSubspecies <- ifelse(subs$NumberOfSubspecies > 1, + subs$NumberOfSubspecies - 1, + subs$NumberOfSubspecies) + +# Quick look at species with highest number of subspecies. +subs[order(subs$NumberOfSubspecies, decreasing = TRUE ),] %>% head + +# Megaderma spasma is top. It's widespread across south east asia islands. +# So this makes sense. + +# Quick look at the number of subspecies. +ggplot(subs, aes(x = NumberOfSubspecies)) + + geom_histogram(binwidth = 2) + + xlab('Number of Subspecies') + + ylab('Count') + + +# Create a combined binomial name column +subs$binomial <- paste(subs$Genus, subs$Species) + + + + +# Check overlap of datasets. +sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) + +notInTax <- (virus2$binomial[virus2$host.species != ''])[!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)] + +# Run this to find synonyms of names not in Wilson and Reeder +# Doesn't find much of use. +# syns <- synonyms(notInTax, db = 'itis') + +# Clean some names +# As taxize::synonyms didn't find most of them, I am using IUCN. 
+# And checking that the IUCN name is then in The Wilson & Reeder taxonomy + +virus2$binomial[virus2$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' +virus2$binomial[virus2$binomial == 'Tadarida pumila'] <- 'Chaerephon pumilus' +virus2$binomial[virus2$binomial == 'Tadarida condylura'] <- 'Mops condylurus' +virus2$binomial[virus2$binomial == 'Rhinolophus hildebrandti'] <- 'Rhinolophus hildebrandtii' +# Rhinolophus horsfeldi: I can't find this species anywhere. Will exclude. +# Possibly Megaderma spasma according to http://www.fao.org/3/a-i2407e.pdf +virus2$binomial[virus2$binomial == 'Tadarida plicata'] <- 'Chaerephon plicatus' +virus2$binomial[virus2$binomial == 'Artibeus planirostris'] <- 'Artibeus jamaicensis' + +sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) + +%%end.rcode + +%%begin.rcode subsHistsByFam, fig.show = extraFigs, fig.height = 3, fig.cap = 'Histograms of number of subspecies for the families with many species.' + +# Compare the histograms of numbers of subspecies over the families with many species. +subs %>% + filter(Family %in% names(which(table(subs$Family) > 99))) %>% + ggplot(., aes(x = NumberOfSubspecies, y = ..density..)) + + geom_histogram() + + facet_grid(. ~ Family) + + xlab('Number of Subspecies') + + ylab('Density') + +%%end.rcode + +%%begin.rcode, subvsvirusCaption + +# Caption for subspecies vs n. viruses plot. +subvsvirus <- ' +Number of viruses against number of subspecies. +Points are coloured by family, with families with less than 10 species being grouped into "other". +Contours show the 2D density of points and suggest a positive correlation. 
+' +subvsvirusTitle <- 'Number of viruses against number of subspecies' +%%end.rcode + +%%begin.rcode subsDataFrame, fig.show = extraFigs, fig.cap = subvsvirus, fig.scap = subvsvirusTitle, out.width = '\\textwidth' +# create combined dataframe + +# Join dataframes +species <- sqldf(" + SELECT subs.binomial, virus2.[virus.species] + FROM subs + INNER JOIN virus2 + ON subs.binomial=virus2.binomial; + ") + +# Count number of virus species for each bat species +nSpecies <- species %>% + unique %>% + group_by(binomial) %>% + summarise(virusSpecies = n()) + +# Add other Subspecies data. +nSpecies <- sqldf(" + SELECT nSpecies.binomial, virusSpecies, NumberOfSubspecies, Genus, Family + FROM nSpecies + LEFT JOIN subs + ON nSpecies.binomial=subs.binomial + ") + +# Create another column to make plotting easier. +# Group families with few rows into 'other' + +nSpecies$familyPlotCol <- nSpecies$Family +nSpecies$familyPlotCol[ + nSpecies$Family %in% names(which(table(nSpecies$Family) < 10))] <- 'Other' + +table(nSpecies$familyPlotCol) + +ggplot(nSpecies, aes(x = log(NumberOfSubspecies), y = log(virusSpecies))) + + # geom_smooth(method = 'lm') + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, alpha = 0.8, + position = position_jitter(width = .1, height = .1)) + + scale_colour_hc() + + geom_density2d() + + labs(colour = 'Family') + +%%end.rcode + +%%begin.rcode virusHist, fig.show = extraFigs, fig.cap = 'Histogram of known viruses per species' + +ggplot(nSpecies, aes(x = virusSpecies)) + + geom_histogram() + +%%end.rcode + + + + +%%begin.rcode euthRead + +# Read in pantheria data base +pantheria <- read.table(file = 'data/Chapter3/PanTHERIA_1-0_WR05_Aug2008.txt', + header = TRUE, sep = "\t", na.strings = c("-999", "-999.00")) + +mass <- sqldf(" + SELECT [X5.1_AdultBodyMass_g] + FROM nSpecies + LEFT JOIN pantheria + ON nSpecies.binomial=pantheria.MSW05_Binomial + ") + +nSpecies$mass <- mass[, 1] + +# Now add additional mass estimates. 
+ +additionalMass <- read.csv('data/Chapter3/AdditionalBodyMass.csv', stringsAsFactors = FALSE) +meanAdditionalMass <- additionalMass %>% + group_by(binomial) %>% + summarise(mass = mean(Body.Mass.grams)) + +nSpecies$mass[ + sapply(meanAdditionalMass$binomial, function(x) which(nSpecies$binomial == x)) + ] <- meanAdditionalMass$mass + + +%%end.rcode + + + +%%begin.rcode IUCNranges, eval = runIucn + +# Read in iucn ranges and calculate range sizes for each species. +ranges <- readShapePoly('data/Chapter3/TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS.shp') + +ranges <- ranges[ranges$order_name == 'CHIROPTERA', ] + +levels(ranges$binomial) <- c(levels(ranges$binomial), 'Myotis ricketti') +ranges$binomial[ranges$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' + + + + +nSpecies$binomial[!(nSpecies$binomial %in% ranges$binomial)] + +findArea <- function(name){ + #cat(name) + A <- areaPolygon(ranges[ranges$binomial == name, ]) + sum(A) +} + +iucnDistr <- sapply(nSpecies$binomial, findArea) + +write.csv(iucnDistr, 'data/Chapter3/iucnDistr.csv') + +%%end.rcode + +%%begin.rcode readIucnIn + +iucnDistr <- read.csv('data/Chapter3/iucnDistr.csv', row.names = 1) + +nSpecies$distrSize <- iucnDistr$x + +%%end.rcode + + + +%%begin.rcode pubmedScrapeFunc + +# Scrape from pubmed + +scrapePub <- function(sp){ + + Sys.sleep(2) + + # Initialise refs + refs <- NA + + # Find synonyms from taxize + syns <- synonyms(sp, db = 'itis') + if(NROW(syns[[1]]) == 1){ + spString <- tolower(gsub(' ', '%20', sp)) + } else { + spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22') + } + + + url <- paste0('http://www.ncbi.nlm.nih.gov/pubmed/?term=%22', spString, '%22') + + + page <- html(url) + + # Test if exact phrase was found. 
+ phraseFound <- try(page %>% + html_node('.icon') %>% + html_text() %>% + grepl("The following term was not found in PubMed:", .), silent = TRUE) + + if (class(phraseFound) == "logical") { + if(phraseFound){ + if(phraseFound) refs <- NA + } + } + if (class(phraseFound) != "logical") { + try({ + refs <- page %>% + html_node('.result_count') %>% + html_text() %>% + strsplit(' ') %>% + .[[1]] %>% + .[length(.)] %>% + as.numeric() + }) + } + + return(refs) +} + + +%%end.rcode + + +%%begin.rcode pubmedScrape, eval = runPubmedScrape + +# Create empty vector +pubmedRefs <- rep(NA, nrow(nSpecies)) + +for(i in 1:NROW(nSpecies)){ + pubmedRefs[i] <- scrapePub(nSpecies$binomial[i]) +} + +pubmedScrapeDate <- Sys.Date() + +pubmedRefs <- cbind(binomial = nSpecies$binomial, pubmedRefs = pubmedRefs) + +# Write out. +write.csv(pubmedRefs, file = 'data/Chapter3/pubmedRefs.csv') + +%%end.rcode + + + + +%%begin.rcode pubmedRead + + +pubmedRefs <- read.csv('data/Chapter3/pubmedRefs.csv', stringsAsFactors = FALSE, row.names = 1) + +# Function returns NA for none found. Change that to a zero. +pubmedRefs$pubmedRefs[is.na(pubmedRefs$pubmedRefs)] <- 0 +nSpecies$pubmedRefs <- pubmedRefs$pubmedRefs + +%%end.rcode + +%%begin.rcode scholarScrapeFunc + +scrapeScholar <- function(sp){ + + wait <- rnorm(1, 120, 2) + Sys.sleep(wait) + + + syns <- synonyms(sp, db = 'itis') + if(NROW(syns[[1]]) == 1){ + spString <- tolower(gsub(' ', '%20', sp)) + } else { + spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22') + } + + url <- paste0('https://scholar.google.co.uk/scholar?hl=en&q=%22', + spString, '%22&btnG=&as_sdt=1%2C5&as_sdtp=') + + + page <- html(url) + + try({ + refs <- page %>% + html_node('#gs_ab_md') %>% + html_text() %>% + gsub('About\\s(.*)\\sresults.*', '\\1', .) %>% + gsub(',', '', .) 
%>% + as.numeric + }) + return(refs) +} + +%%end.rcode + +%%begin.rcode scholarScrape, eval = runScholarScrape + +# Create empty vector +scholarRefs <- rep(NA, nrow(nSpecies)) + +for(i in 1:NROW(nSpecies)){ + scholarRefs[i] <- scrapeScholar(nSpecies$binomial[i]) +} + +scholarScrapeDate <- Sys.Date() + +scholarRefs <- cbind(binomial = nSpecies$binomial, scholarRefs = scholarRefs) + +# Write out. +write.csv(scholarRefs, file = 'data/Chapter3/scholarRefs.csv') + +%%end.rcode + + + + +%%begin.rcode scholarRead + + +scholarRefs <- read.csv('data/Chapter3/scholarRefs.csv', stringsAsFactors = FALSE, row.names = 1) + +# Function returns NA for none found. Change that to a zero. +scholarRefs$scholarRefs[is.na(scholarRefs$scholarRefs)] <- 0 + +nSpecies$scholarRefs <- sqldf(' + SELECT scholarRefs + FROM nSpecies + INNER JOIN scholarRefs + ON scholarRefs.binomial=nSpecies.binomial + ' + ) %>% + .$scholarRefs + +%%end.rcode + + + + + + + +%%begin.rcode subsRemoveNAs + +# Remove missing data and sort out the data frame a little. + +nSpecies <- nSpecies[complete.cases(nSpecies), ] + +# Add number of subspecies as a factor. Might help plotting. 
+nSpecies$SubspeciesFactor <- factor(nSpecies$NumberOfSubspecies, + levels = as.character(1:max(nSpecies$NumberOfSubspecies))) + +# Rownames to species names +rownames(nSpecies) <- nSpecies$binomial + +%%end.rcode + + + +%%begin.rcode savenSpecies +######################################################## +### At this point, nSpecies should be in final form ### +######################################################## + +write.csv(nSpecies, file = 'data/Chapter3/nSpecies.csv') + +%%end.rcode + + + +%%begin.rcode treeRead + +# Read in trees +t <- read.nexus('data/Chapter3/fritz2009geographical.tre') + +# Select best supported tree +tr1 <- t[[1]] + +# Make names match previous names +tr1$tip.label <- gsub('_', ' ', tr1$tip.label) + +# Which tips are not needed +unneededTips <- tr1$tip.label[!(tr1$tip.label %in% nSpecies$binomial)] + +# Prune tree down to only needed tips. +pruneTree <- drop.tip(tr1, unneededTips) + +rm(t) + +%%end.rcode + +%%begin.rcode nSpeciesTreePlot, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.show = extraFigs + +# Plot tree +p <- ggtree(pruneTree, layout = 'fan') + +p %<+% nSpecies[, 1:6] + + geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + + scale_size(range = c(0.2, 2)) + + scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + + theme_tcdl + + theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + + theme(legend.position = 'right') + + labs(size = 'Virus Richness') + + theme(legend.key.size = unit(0.6, "lines"), + legend.text = element_text(size = 6), + legend.title = element_text(size = 8)) + + +%%end.rcode + + + +%%begin.rcode scholarvspubmed, fig.show = extraFigs, fig.cap = 'Logged number of references on scholar and pubmed, with a fitted (unphylogenetic) linear model. Colours indicate family.' + +# Check how correlated pubmed and scholar are. 
+ + +compSubspecies <- comparative.data(data = nSpecies, phy = pruneTree, names.col = 'binomial') + +citeCor <- pgls(log(scholarRefs) ~ log(pubmedRefs + 1), data = compSubspecies, lambda = 'ML') + +studyEffortCor <- summary(citeCor) +# And plot +ggplot(nSpecies, aes(x = scholarRefs, y = pubmedRefs + 1)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm') + + scale_x_log10() + + scale_y_log10() + + scale_colour_hc() + +%%end.rcode + +%%begin.rcode subsDataCapts +subsDataCapts <- c( +'Unlogged number of virus species against log mass with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', +'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', +'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added.', +'Virus species against study effort (log pubmed references +1)') +%%end.rcode + +%%begin.rcode subsDataviz, fig.show = extraFigs, fig.cap = subsDataCapts + +# A number of exploratory plots + +# Mass against viruses +ggplot(nSpecies, aes(log(mass), virusSpecies)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + + +# N Subspecies and against viruses +ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .3, height = .3)) + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + +# Log(N Subspecies) and against viruses + +ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .05, height 
= .2)) + + scale_x_log10() + + geom_smooth(method = 'lm')+ + labs(colour = 'Family') + + scale_colour_hc() + + +# N. Subspecies against viruses as a boxplot to deal with overplotting. +ggplot(nSpecies, aes(SubspeciesFactor, virusSpecies)) + + geom_boxplot() + + scale_x_discrete(limits = levels(nSpecies$SubspeciesFactor), drop=FALSE) + + geom_smooth(method = 'lm', aes(group = 1)) + + xlab('# subspecies') + + +# Study effort against virusSpecies +ggplot(nSpecies, aes(log(pubmedRefs + 1), virusSpecies)) + + geom_jitter(aes(colour = familyPlotCol), size = 2.5, + position = position_jitter(width = .1, height = .1)) + + geom_smooth(method = 'lm') + + labs(colour = 'Family')+ + scale_colour_hc() + + +# Distribution size aginst virus + + +ggplot(nSpecies, aes(distrSize, virusSpecies)) + + geom_point(aes(colour = familyPlotCol), size = 2.5) + + geom_smooth(method = 'lm') + + labs(colour = 'Family') + + scale_colour_hc() + + scale_x_log10() + + +# Correlation plot +nSpecies %>% + dplyr::select(virusSpecies, NumberOfSubspecies, mass, distrSize, pubmedRefs, scholarRefs) %>% + mutate(mass = log(mass), distrSize = log(distrSize), pubmedRefs = log(pubmedRefs + 1), scholarRefs = log(scholarRefs)) %>% + ggpairs(.) + +%%end.rcode + + + +%%begin.rcode, subsAnalysis, fig.show = extraFigs + +################################################################################## +## N Virus ~ subs + log(cites + mass) + +subspeciesJointUnlog <- pgls( + virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass), + data = compSubspecies, lambda = 'ML') + + + +## N Virus ~ subs + log(cites + mass) + subs*log(cites) + +subspeciesInter <- pgls( + virusSpecies ~ log(mass) + + NumberOfSubspecies*log(scholarRefs), + data = compSubspecies, lambda = 'ML') + +#subInter.summary <- summary(subspeciesInter) + + + + +## Look at Variance inflation factors. +## Couple of help messages imply lm vif is fine. 
+ +#sqrt(vif(lm(virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass) + log(distrSize), data = nSpecies))) + +%%end.rcode + + + + + + + + + +%%begin.rcode ITanalysis + +varList <- c('scholarRefs', 'NumberOfSubspecies', 'mass', 'distrSize', 'rand') + +findCombs <- function(k, vars, longest){ + x <- t(combn(vars, k)) + nas <- matrix(NA, ncol = longest - NCOL(x), nrow = nrow(x)) + mat <- cbind(x, nas) + return(mat) +} + +modelList <- lapply(0:5, function(k) findCombs(k, varList, 6)) +modelMat <- do.call(rbind, modelList) + +interMat <- modelMat[apply(modelMat, 1, function(x) "scholarRefs" %in% x & "NumberOfSubspecies" %in% x), ] +interMat[, 2:5] <- interMat[, 1:4] +interMat[, 1] <- "scholarRefs:NumberOfSubspecies" + +allModelMat <- rbind(modelMat, interMat) + + +allFormulae <- apply(allModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) + +allFormulae <- c(as.formula('virusSpecies ~ 1'), allFormulae) + + + +modelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ + + set.seed(paste0('123', boot)) + bootData <- cbind(data, rand = runif(nrow(data))) + + # log some predictors + bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) + + # scale + bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')]) + + coefs <- matrix(NA, ncol = length(varList) + 2, nrow = nrow(allModelMat), + dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList, 'scholarRefs:NumberOfSubspecies')))) + + results <- apply(allModelMat, 1, function(x) sapply(c(varList, "scholarRefs:NumberOfSubspecies"), function(y) y %in% x)) %>% + t %>% + data.frame %>% + cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) + + # Fit each model + # I'm having problems with convergence so sometimes have to try different starting values. 
+ for(m in 1:length(allForm)){ + if(exists('model')){ + rm(model) + } + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 1 + }) + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 2 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 3 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 4 + }) + } + if(!exists('model')){ + try({ + model <- lm(allForm[[m]], data = bootData) + results$attempt[m] <- 5 + message('Running lm') + }) + } + #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') + results$AIC[m] <- AICc(model) + + if(inherits(model, 'gls')){ + results$lambda[m] <- model$modelStruct$corStruct[1] + } + + results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] + + + results[m, paste0('beta.', names(coef(model)))] <- coef(model) + + message(paste('Boot:', boot, ', m:', m, '\n')) + } + + results$dAIC <- results$AIC - min(results$AIC) + results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) + + + return(results) + +} + + + + +%%end.rcode + +%%begin.rcode modelSelectBoots, eval = subBoots + +fitModelsBootStrap <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies, pruneTree, b, allModelMat, varList), mc.cores = nCores) + +allResults <- do.call(rbind, fitModelsBootStrap) + +write.csv(allResults, file = 'data/Chapter3/modelSelectSubspecies.csv') + + +%%end.rcode + +%%begin.rcode analyseModelSelect, fig.show = extraFigs + +allResults <- read.csv('data/Chapter3/modelSelectSubspecies.csv', row.names = 1) + +#varWeights <- sapply(names(allResults)[1:6], function(x) 
sum(allResults$weight[allResults[, x]])/nBoots) + +sepVarWeights <- lapply(1:nBoots, function(b) + sapply(names(allResults)[1:6], + function(x) + sum(allResults[allResults$boot == b, 'weight'][allResults[allResults$boot == b, x]]) + ) + ) + +sepVarWeights <- do.call(rbind, sepVarWeights) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +sepVarWeights$col <- 'Other Variables' +sepVarWeights$col[grep('NumberOf', sepVarWeights$variable)] <- 'Population Structure' +sepVarWeights$col[sepVarWeights$variable == 'rand'] <- 'Null' + + + +modelWeights <- allResults %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) %>% + mutate(string = predictors) + + +# Calculate variable weights based on mean(AIC) rather than raw AIC. +varWeights <- sapply(names(allResults)[1:6], + function(x) sum(modelWeights$modelWeight[grep(x, as.character(modelWeights$predictors))])) + + + +allResults %>% + filter(rand, !`scholarRefs.NumberOfSubspecies`, NumberOfSubspecies) %>% +ggplot(., aes(x = lambda, colour = predictors)) + + geom_density() + + scale_colour_hc() + +ggplot(allResults, aes(x = lambda)) + + geom_density() + +allResults %>% + filter(boot == 1) %>% + dplyr::select(predictors, lambda) + +%%end.rcode + + + +%%begin.rcode ITPlots + +# reorder factors to get structure vars at beginning. 
+sepVarWeights$variable <- factor(sepVarWeights$variable, levels(sepVarWeights$variable)[c(2, 6, 1, 3, 4, 5)])
+
+ITPlot <- ggplot(sepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) +
+  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) +
+  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
+  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
+  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
+    panel.grid.major.x = element_blank(),
+    axis.text.y = element_text(size = 8)) +
+  scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) +
+  # Set the limits here: a separate ylim() call would replace this scale and drop the custom breaks and labels.
+  scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1),
+    limits = c(0, 1)) +
+  ylab('P(in best model)') +
+  xlab('')
+
+
+%%end.rcode
+
+%%begin.rcode nSpeciesCoef, fig.show = extraFigs
+
+# Map the coefficient column itself; a quoted name would be treated as a constant string.
+ggplot(allResults, aes(x = beta.NumberOfSubspecies, colour = scholarRefs)) +
+  geom_density()
+
+
+
+mean(allResults$NumberOfSubspecies, na.rm = TRUE)
+
+
+varCoefMeans <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.mean(x, allResults$weight, na.rm = TRUE))
+varCoefVars <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.var(x, allResults$weight, na.rm = TRUE))
+
+nSpeciesCoefMean <- wtd.mean(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies],
+  allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
+nSpeciesCoefMeanI <- wtd.mean(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies],
+  allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
+nSpeciesInterMean <- wtd.mean(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE)
+
+
+nSpeciesCoefVar <-
wtd.var(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesCoefVarI <- wtd.var(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], + allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) +nSpeciesInterVar <- wtd.var(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE) + + + +# Direction of interaction models + +min(nSpecies$NumberOfSubspecies) + +max(nSpecies$NumberOfSubspecies) + +# At minimum study effort +nSpeciesInterMean*log(min(nSpecies$scholarRefs)) + nSpeciesCoefMeanI +nSpeciesInterMean*log(max(nSpecies$scholarRefs)) + nSpeciesCoefMeanI +nSpeciesInterMean*log(median(nSpecies$scholarRefs)) + nSpeciesCoefMeanI + +mean(nSpeciesInterMean*log(nSpecies$scholarRefs) + nSpeciesCoefMeanI > 0) + + + +%%end.rcode + + + +%%begin.rcode familyMeans + +familyMeans <- nSpecies %>% + group_by(Family) %>% + summarise(mean = mean(virusSpecies), n = n()) + +%%end.rcode + + +%%begin.rcode univariatePGLS + +#orderedNSpecies <- nSpecies[sapply(pruneTree$tip.label, function(x) which(nSpecies$binomial == x)),] + + +sspLambda <- summary(pgls(NumberOfSubspecies ~ 1, data = compSubspecies, lambda = 'ML')) +massLambda <- summary(pgls(log(mass) ~ 1, data = compSubspecies, lambda = 'ML')) +scholarLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compSubspecies, lambda = 'ML')) +virusLambda <- summary(pgls(virusSpecies ~ 1, data = compSubspecies, lambda = 'ML')) +distrLambda <- summary(pgls(log(distrSize) ~ 1, data = compSubspecies, lambda = 'ML')) + +sspUni <- summary(pgls(virusSpecies ~ NumberOfSubspecies, data = compSubspecies, lambda = 'ML')) + + +%%end.rcode + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%% FST ANALYSIS %%%%%%%% 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%%begin.rcode fstRead, eval = fstComb + +# Read in Fst data. +# Then add extra columns needed. + +fst <- read.csv('data/Chapter3/FstDataCompData.csv') + +# Check overlap of datasets. +sum(!(fst$binomial %in% virus2$binomial[virus2$host.species != ''])) + +notInFst <- fst$binomial[!(fst$binomial %in% virus2$binomial)] +# lots of sp not in virus2. MAybe will include 0 virus species. Kinda makes sense. + + + +######################################################################################### +#### Get distribution size and width #### +######################################################################################### + + + + +fst$binomial[!(fst$binomial %in% ranges$binomial)] + +fst <- fst[(fst$binomial %in% ranges$binomial), ] + +unique(fst$binomial) %>% length + + + + +findAreaFst <- function(name){ + #cat(name) + A <- areaPolygon(ranges[ranges$binomial == as.character(name), ]) + sum(A) +} + +fstIucnDistr <- sapply(fst$binomial, findAreaFst) + + +fst$distrSize <- fstIucnDistr + + +#### Now get distribution width + +findWidth <- function(name){ + #print(name) + distr <- ranges[ranges$binomial == as.character(name), ] + + coords <- list() + # Get coordinates from all polygons into one matrix. + for(i in 1:length(distr@polygons)){ + coords[[i]] <- distr@polygons[[i]]@Polygons[[1]]@coords + } + coords <- do.call(rbind, coords) + + # Take the convex hull of coordinates to speed up last step. + hullCoords <- coords[chull(coords), ] + + maxDist <- max(apply(hullCoords, 1, function(x) distGeo(coords, x)))/1000 + return(maxDist) + +} + +# Calculate widest part of all species distributions. +# This is slow but also RAM heavy. +# 3 cores doesn't crash my computer with 16GB RAM. +rangeWidth <- mclapply(fst$binomial, findWidth, mc.cores = 3) %>% do.call(c, .) 
+ +#rangeWidth <- sapply(fst$binomial, findWidth) + +fst$rangeWidth <- rangeWidth +fst$rangeCoverage <- fst$Dmax..km. / fst$rangeWidth + + + +fst$Useable <- fst$rangeCoverage > rangeUseable +sum(fst$Useable, na.rm = TRUE) +fst$binomial[fst$Useable] %>% unique %>% .[!is.na(.)] %>% length + +# Need to go back and check data but for now if fst$Useable is na, then it's FALSE (i.e.\ it's not a useable row) +fst$Useable[is.na(fst$Useable)] <- FALSE + + +%%end.rcode + + + +%%begin.rcode fstStudyEffort, eval = fstComb + +# First take what data we can from nSpecies analysis. +fstStudy <- sqldf(" + SELECT fst.binomial, nSpecies.scholarRefs, nSpecies.pubmedRefs + FROM fst + LEFT JOIN nSpecies + ON nSpecies.binomial=fst.binomial + ") + +%%end.rcode + +%%begin.rcode fstScrape, eval = runFstScrape + +######################################################## +#### Sloow bit that might get you blocked by google #### +######################################################## + +fstNewStudy <- fstStudy[is.na(fstStudy[,2]),1] %>% + lapply(., function(x) c(x, scrapeScholar(x), scrapePub(x))) %>% + do.call(rbind, .) + +names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') + +write.csv(fstNewStudy, file = 'data/Chapter3/fstScrape.csv') + +%%end.rcode + + +%%begin.rcode fstCombine, eval = fstComb + +fstNewStudy <- read.csv('data/Chapter3/fstScrape.csv', row.names = 1) +names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') + +# NAs are from searches with 0 references. 
+fstNewStudy$pubmedRefs[is.na(fstNewStudy$pubmedRefs)] <- 0 + +whichRows <- lapply(fstNewStudy$binomial, function(x) which(fstStudy$binomial == x)) +for(i in seq_along(whichRows)){ + fstStudy[whichRows[[i]], 2:3] <- fstNewStudy[i, 2:3] +} + + +fst <- cbind(fst, fstStudy[, 2:3]) + +# Remove rows whose scale is too small +fst <- fst[fst$Useable, ] + + +# Don't want rows using mtDNA due to female biased dispersal +fst <- fst[fst$Marker != 'mtDNA', ] + +%%end.rcode + +%%begin.rcode convertFst, eval = fstComb + +calcNm <- function(Fst){ (1 - Fst)/(4 * Fst) } + + +fst$Nm <- calcNm(fst$Value) + + + +fst <- fst[!is.na(fst$Nm) & !(fst$Nm == Inf), ] + +fstFinal <- fst + +# Take means of species with multiple measurements + +fstFinal <- fstFinal[!duplicated(fstFinal$binomial), ] +fstFinal$Nm <- sapply(fstFinal$binomial, function(x) mean(fst$Nm[fst$binomial == x])) + +# Add number of viruses to fst dataset +# Includes zeros for species with no known viruses. + +fstFinal$virusSpecies <- sapply(fstFinal$binomial, function(x) sum(virus2$binomial == x)) + + + + +# Add mass data. 
+ + +mass <- sqldf(" + SELECT [X5.1_AdultBodyMass_g] + FROM fstFinal + LEFT JOIN pantheria + ON fstFinal.binomial=pantheria.MSW05_Binomial + ") + +# Don't need pantheria data anymore +rm(pantheria) + +fstFinal$mass <- mass[, 1] + +fstFinal$mass[fstFinal$binomial == 'Myotis ricketti'] <- meanAdditionalMass$mass[meanAdditionalMass$binomial == 'Myotis ricketti'] + +fstFinal$mass[fstFinal$binomial == 'Myotis macropus'] <- 9.8 + +fstFinal <- fstFinal[!is.na(fstFinal$mass), ] + + +############################# +### fst data is finished ### +############################# + +write.csv(fstFinal, 'data/Chapter3/fstFinal.csv') +%%end.rcode + + +%%begin.rcode + +#### Read in full fstFinal dataframe + +fstFinal <- read.csv('data/Chapter3/fstFinal.csv', row.names = 1) + +%%end.rcode + +%%begin.rcode fstCors, fig.show = extraFigs + +fstFinal[, c('mass', 'scholarRefs', 'rangeWidth', 'Nm')] %>% + log %>% + cbind(virusSpecies = fstFinal$virusSpecies) %>% + ggpairs(.) + + +%%end.rcode + + + +%%begin.rcode compareNm, fig.show = extraFigs + +ggplot(fstFinal, aes(x = Marker, y = Nm)) + + geom_point() + + scale_y_log10() + +lm(fstFinal$Nm ~ fstFinal$Marker) %>% aov %>% summary + + +%%end.rcode + + +%%begin.rcode fstTree + +# Prune the tree for the fst data. + +# Which tips are not needed +fstUnneededTips <- tr1$tip.label[!(tr1$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree <- drop.tip(tr1, fstUnneededTips) + + + +%%end.rcode + + +%%begin.rcode fstTreePlot, fig.show = extraFigs, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.height = 3.6 + +# Plot tree +p <- ggtree(fstTree) + + +fstFinal$lengthNames <- fstFinal$binomial %>% + as.character %>% + paste0(' ', .) 
+ + +p %<+% fstFinal[, c('binomial', 'virusSpecies')] + + #geom_tiplab(family = 'lato light', align = FALSE) + + geom_text2(aes(x = x + 15, label = as.character(label), subset = isTip), + family = 'Lato light', hjust = 0, size = 3.3) + + #geom_text(aes(x = x + 15, label = as.character(label)), subset=.(isTip), + # family = 'Lato light', hjust = 0, size = 3.3) + + ggplot2::xlim(0, 210) + + theme_tcdl + + geom_point2(aes(x = x + 8, size = virusSpecies, subset = isTip)) + + scale_size(range = c(0, 4)) + + theme(legend.key.size = unit(0.8, "lines"), + legend.text = element_text(size = 9), + legend.title = element_text(size = 8), + legend.position = "right", + text = element_text(colour = 'darkgrey'), + legend.key = element_blank()) + + labs(size = 'Virus Richness') + + + +%%end.rcode + + + +%%begin.rcode fstITanalysis + +fstVarList <- c('scholarRefs', 'Nm', 'mass', 'distrSize', 'rand') + + +fstModelList <- lapply(0:5, function(k) findCombs(k, fstVarList, 5)) +fstModelMat <- do.call(rbind, fstModelList) + +fstAllFormulae <- apply(fstModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) + +fstAllFormulae <- c(as.formula('virusSpecies ~ 1'), fstAllFormulae) + +%%end.rcode + +%%begin.rcode fstModelSelectFun + + +fstModelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ + + set.seed(paste0('2388', boot)) + bootData <- cbind(data, rand = runif(nrow(data))) + row.names(bootData) <- bootData$binomial + + + # log some predictors + bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) + + # scale + bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')]) + + + coefs <- matrix(NA, ncol = length(varList) + 1, nrow = nrow(allModelMat), + dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList)))) + + results <- apply(allModelMat, 1, function(x) sapply(varList, 
function(y) y %in% x)) %>% + t %>% + data.frame %>% + cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) + + # Fit each model + # I'm having problems with convergence so sometimes have to try different starting values. + for(m in 1:length(allForm)){ + if(exists('model')){ + rm(model) + } + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 1 + }) + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 2 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 3 + }) + } + if(!exists('model')){ + try({ + model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') + results$attempt[m] <- 4 + }) + } + if(!exists('model')){ + try({ + model <- lm(allForm[[m]], data = bootData) + results$attempt[m] <- 5 + message('Running lm') + }) + } + #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') + results$AIC[m] <- AICc(model) + + if(inherits(model, 'gls')){ + results$lambda[m] <- model$modelStruct$corStruct[1] + } + + results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] + + + results[m, paste0('beta.', names(coef(model)))] <- coef(model) + + message(paste('Boot:', boot, ', m:', m, '\n')) + } + + results$dAIC <- results$AIC - min(results$AIC) + results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) + + + return(results) + +} + +%%end.rcode + +%%begin.rcode fstModelSelectBoots, eval = fstBoots + + + +fstModelsBootStrap <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree, b, fstModelMat, fstVarList), mc.cores = nCores) + +fstAllResults <- do.call(rbind, fstModelsBootStrap) + +write.csv(fstAllResults, file = 
'data/Chapter3/fstModelSelectSubspecies.csv') + + +%%end.rcode + +%%begin.rcode fstAnalyseModelSelect, fig.show = extraFigs + +fstAllResults <- read.csv('data/Chapter3/fstModelSelectSubspecies.csv', row.names = 1) + +fstSepVarWeights <- lapply(1:nBoots, function(b) + sapply(names(fstAllResults)[1:5], + function(x) + sum(fstAllResults[fstAllResults$boot == b, 'weight'][fstAllResults[fstAllResults$boot == b, x]]) + ) + ) + +fstSepVarWeights <- do.call(rbind, fstSepVarWeights) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +fstSepVarWeights$col <- 'Other Variables' +fstSepVarWeights$col[fstSepVarWeights$variable == 'Nm'] <- 'Population Structure' +fstSepVarWeights$col[fstSepVarWeights$variable == 'rand'] <- 'Null' + + + + + +fstModelWeights <- fstAllResults %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) + +# Calculate variable weights based on mean(AIC) rather than raw AIC. +fstVarWeights <- sapply(names(fstAllResults)[1:5], + function(x) sum(fstModelWeights$modelWeight[grep(x, as.character(fstModelWeights$predictors))])) + +%%end.rcode + + + + +%%begin.rcode fstITlambda, fig.show = extraFigs, fig.cap = 'Values of $\\lambda$ found in $F_{ST}$ analysis.', fig.height = 3 + +ggplot(fstAllResults, aes(x = lambda)) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + +%%end.rcode + + +%%begin.rcode fstITlambdaFacets, fig.show = extraFigs, fig.height = 4 + + +transform(fstAllResults, mass = c('Other', 'Mass' )[factor(mass)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. 
~ mass) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, Nm = c('Other', 'Nm' )[factor(Nm)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ Nm) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, distrSize = c('Other', 'distrSize' )[factor(distrSize)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ distrSize) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +transform(fstAllResults, scholarRefs = factor(c('Scholar Refs', 'Other')[factor(!scholarRefs)], levels = c('Scholar Refs', 'Other'))) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ scholarRefs) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + +transform(fstAllResults, rand = c('Other', 'Rand' )[factor(rand)]) %>% +ggplot(aes(x = lambda)) + + facet_grid(. ~ rand) + + geom_histogram() + + ylab('Count') + + xlab(expression(paste('Phylogenetic Signal, ', lambda))) + + +%%end.rcode + +%%begin.rcode lookAtLambda, fig.show = extraFigs + +fstComp <- comparative.data(fstTree, fstFinal, 'binomial') + +# Full model: each predictor appears once (a duplicated log(distrSize) term has been removed; R drops repeated terms, so the fit is unchanged). +fullFst <- pgls(virusSpecies ~ log(Nm) + log(mass) + log(distrSize) + log(scholarRefs), fstComp, lambda = 'ML') + +fst.lambda.profile <- pgls.profile(fullFst, "lambda") +plot(fst.lambda.profile) + +data.frame(x = fst.lambda.profile$x, L = fst.lambda.profile$logLik) %>% +ggplot(aes(x, L)) + + geom_line() + + geom_vline(xintercept = fst.lambda.profile$ci$ci.val, col = 'steelblue') + + +%%end.rcode + + +%%begin.rcode fstCoef, fig.show = extraFigs + +ggplot(fstAllResults, aes(x = beta.Nm)) + + geom_histogram() + + +ggplot(fstAllResults, aes(x = beta.Nm, colour = scholarRefs)) + + geom_density() + + + +ggplot(fstAllResults, aes(x = beta.Nm, colour = distrSize)) + + geom_density() + + +fstCoefMeans <- apply(fstAllResults[, grep('beta', 
names(fstAllResults))], 2, function(x) wtd.mean(x, fstAllResults$weight, na.rm = TRUE)) +fstCoefVars <- apply(fstAllResults[, grep('beta', names(fstAllResults))], 2, function(x) wtd.var(x, fstAllResults$weight, na.rm = TRUE)) + +pcCoefLzero <- 100*sum(na.omit(fstAllResults$beta.Nm) < 0) / length(na.omit(fstAllResults$beta.Nm)) + +%%end.rcode + + + +%%begin.rcode univariateFstPGLS + +#orderedFst <- fstFinal[sapply(fstTree$tip.label, function(x) which(fstFinal$binomial == x)),] + +compFst <- comparative.data(data = fstFinal, phy = fstTree, names.col = 'binomial') + +nmFstLambda <- summary(pgls(log(Nm) ~ 1, data = compFst, lambda = 'ML')) +massFstLambda <- summary(pgls(log(mass) ~ 1, data = compFst, lambda = 'ML')) +scholarFstLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compFst, lambda = 'ML')) +virusFstLambda <- summary(pgls(virusSpecies ~ 1, data = compFst, lambda = 'ML')) +distrFstLambda <- summary(pgls(distrSize ~ 1, data = compFst, lambda = 'ML')) + +nmFstUni <- summary(pgls(virusSpecies ~ log(Nm), data = compFst, lambda = 'ML')) + +massFstUni <- summary(pgls(virusSpecies ~ log(mass), data = compFst, lambda = 'ML')) +fstDistrStudyEffort <- summary(pgls(log(scholarRefs) ~ log(distrSize), data = compFst, lambda = 'ML')) + +fstMassStudyEffort <- summary(pgls(log(scholarRefs) ~ log(mass), data = compFst, lambda = 'ML')) + +%%end.rcode + + + + + + + + +\subsubsection{Population structure data} + +I used two measures of population structure: the number of subspecies and the effective level of gene flow. +The number of subspecies was counted using the taxonomy from \textcite{wilson2005mammal}. +The effective level of gene flow was calculated from estimates of $F_{ST}$ collated from the literature. +The studies were from a wide range of spatial scales, from local ($\sim\SI{10}{\kilo\metre}$) to continental. 
+As $F_{ST}$ often increases with spatial scale \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}, I controlled for this by only using data from studies where a large proportion of the species range was studied. +I used the ratio of the furthest distance between $F_{ST}$ samples (taken from the paper or measured with \url{http://www.distancefromto.net/} if not stated) to the length of the IUCN species range \cite{iucn} and only used studies if this ratio was greater than \rinline{rangeUseable}. +This is an arbitrary value that was a compromise between retaining a reasonable number of data points and controlling for the bias in spatial scale. +I only used global $F_{ST}$ estimates as the mean of pairwise $F_{ST}$ values is not necessarily equal to the global $F_{ST}$ value. +I converted all $F_{ST}$ values to effective migration rates using $M = (1-F_{ST})/(4F_{ST})$. +This transforms the data from being bounded by $(0, 1\rbrack$ to being in the range $\lbrack 0, \infty)$ and is easier to interpret. + +The two measures of population structure were analysed separately because the number of subspecies data set had \rinline{nrow(nSpecies)} data points but there were only $F_{ST}$ data for \rinline{nrow(fstFinal)} bat species. +For the subspecies analysis, all bat species in \textcite{luis2013comparison} were used (i.e.\ all species with at least one known virus species). +This was to avoid using the very large number of bat species that have simply never been sampled for viruses. +However, for the gene flow analysis, all bat species with suitable $F_{ST}$ estimates were used. +As some bat species had suitable $F_{ST}$ estimates but were not present in \textcite{luis2013comparison}, some bat species with zero known virus species were included. 
+These bat species with no known viruses were included to make the greatest use of the $F_{ST}$ data available and because the number of species with no known virus species was not unduly large (\rinline{sum(fstFinal$virusSpecies == 0)} species). + +After data cleaning there were data for \rinline{nrow(nSpecies)} bat species in \rinline{length(unique(nSpecies$Family))} families for the subspecies analysis. +Due to the limited number of studies and the restrictive requirements imposed on study design, there were only data for \rinline{nrow(fstFinal)} bat species in \rinline{length(unique(fstFinal$Family))} families for the effective gene flow analysis. +The raw data are included in Table~\ref{A-rawData}. + + + + +\subsubsection{Other explanatory variables} + + + +To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis}. +This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rvest}, with ITIS synonyms being accessed with the \emph{taxize} package \cite{chamberlain2013taxize}. +I log transformed these variables as they were strongly right skewed. +I tested for correlation between these two proxies for study effort using phylogenetic generalised least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}). +The log numbers of citations on PubMed and Google Scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$). +As the correlation between citation counts was strong, I only used Google Scholar reference counts in subsequent analyses. +%See the appendix for analyses run using PubMed citations. 
+ +Two factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat} and range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}. +These other factors were included to avoid spurious positive results occurring simply due to correlations between pathogen richness and a different, causal factor. +Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density was not included in the analysis as there are very few data on bat densities. +Measurements of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. +\emph{Pipistrellus pygmaeus} was assigned the same mass as \emph{P. pipistrellus} as they are indistinguishable by mass. +Body mass measurements were log transformed as they were strongly right skewed. +Distribution size was estimated from range maps downloaded for all species from IUCN \cite{iucn} and was also log transformed due to right skew. + + + + +\subsection{Statistical analysis} + +Statistical analysis for both measures of population structure (number of subspecies and effective level of gene flow) was conducted using an information theoretical approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. +All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex}. +I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. 
+In the number of subspecies analysis I also modelled the interaction between study effort and number of subspecies by including their product. +This interaction was included as I believed \emph{a priori} that it may be important, because subspecies are more likely to be identified in well studied species. +The interaction was only included in models with both study effort and number of subspecies as individual terms. +Following \textcite{whittingham2005habitat} I included a uniformly distributed random variable. +This variable can be used to benchmark how important other explanatory variables are. +The whole analysis was run \rinline{nBoots} times, resampling the random variable each time. + + +To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}. +This tree was pruned to include only the species I had data for (Figure~\ref{fig:treePlot}). +Phylogenetic manipulation was performed using the \emph{ape} package \cite{ape}. +I also performed the analysis using the phylogeny from \textcite{jones2005bats} as this has some broad topological differences including the Rhinolophoidea being sister to the Pteropodidae rather than being related to the other insectivorous bats (Figure~\ref{fig:treePlot2}). + + + + +%%begin.rcode treeCapt + +treeCapt <- ' +The phylogenetic distribution of viral richness. +The phylogeny is from \\cite{fritz2009geographical} pruned to include all species used in either the number of subspecies or gene flow analysis. +Dot size shows the number of known viruses for that species and colour shows family. +The red scale bar shows 25 million years.' 
+ + + +treeTitle <- 'Pruned phylogeny showing number of pathogens and family' + +%%end.rcode + +%%begin.rcode treePlot, out.width = '1\\textwidth', out.extra = 'trim = 0cm 0cm 0cm 0cm', fig.height = 5, fig.height = 5.5, fig.cap = treeCapt, fig.scap = treeTitle + +combUneeded <- tr1$tip.label[!(tr1$tip.label %in% c(as.character(fstFinal$binomial), nSpecies$binomial))] + +# Prune tree down to only needed tips. +combTree <- drop.tip(tr1, combUneeded) + +combdf <- nSpecies %>% + dplyr::select(binomial, virusSpecies, Family) %>% + rbind(fstFinal %>% dplyr::select(binomial, virusSpecies, Family)) %>% + distinct(binomial) + +# Plot tree +p <- ggtree(combTree, layout = 'fan') + +p %<+% combdf + + geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + + scale_size(range = c(0.1, 3)) + + scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + + theme_tcdl + + theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + + theme(legend.position = c(0.5, -0.04)) + + geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + + labs(size = 'Virus Richness') + +# guides(size = guide_legend(override.aes = list(shape = 1))) + + theme(legend.key.size = unit(0.8, "lines"), + legend.text = element_text(size = 10), + legend.margin = unit(c(0.05), "cm"), + legend.title = element_text(size = 12), + legend.direction = "horizontal") + + guides(colour = guide_legend(ncol=3)) + + +# Attempt at concentric circle time bar. 
+#scale <- data.frame(x = c(0, 0), y = c(0, 0), l = c(1200, 2400)) + +#p %<+% combdf + +# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + +# scale_size(range = c(0.1, 100), breaks = c(1, 5, 10)) + +# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + +# theme_tcdl + +# theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + +# theme(legend.position = c(0.5, -0.04)) + +# geom_point(data = scale, aes(x = x, y = y, size = l), alpha = 0.2) + +# geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + +# labs(size = 'Virus Richness') + +## guides(size = guide_legend(override.aes = list(shape = 1)), alpha = 0.9) + +# theme(legend.key.size = unit(0.8, "lines"), +# legend.text = element_text(size = 10), +# legend.margin = unit(c(0.05), "cm"), +# legend.title = element_text(size = 12), +# legend.direction = "horizontal") + +# guides(colour = guide_legend(ncol=3)) + +# Or using bars + +#scale2 <- data.frame(x = c(1, 1), y = c(10, 200), w = c(1, 1)) + +#p %<+% combdf + +# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + +# geom_bar(data = scale2, aes(x = x, y = y, size = w), alpha = 0.3, stat = 'identity', position = 'identity') + +%%end.rcode + + + +The importance of the phylogeny on each variable separately was examined by estimating the $\lambda$ parameter when regressing the variable against an intercept using the \emph{pgls} function in \emph{caper} \cite{caper}. +The parameter $\lambda$ usually takes values between zero and one and \emph{pgls} constrains $\lambda$ within these bounds. +$\lambda = 0$ implies no autocorrelation while a trait evolving by Brownian motion along the tree would have $\lambda = 1$. +I tested fitted $\lambda$ values against the null hypothesis of $\lambda = 0$ (no correlation between species) with log-likelihood ratio tests using \emph{caper} \cite{caper}. 
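+For reference, the quantities involved can be sketched as follows, in notation that is introduced here for illustration and is not taken from \emph{caper}: $v_{ij}$ is the shared branch length between species $i$ and $j$ (the elements of the phylogenetic covariance matrix), $\hat{\lambda}$ is the maximum likelihood estimate and $\ell(\cdot)$ is the log-likelihood.
+The $\lambda$ transformation rescales only the off-diagonal elements, and the likelihood ratio statistic $D$ takes the usual form,
+\begin{equation*}
+(\mathbf{V}_{\lambda})_{ij} =
+\begin{cases}
+v_{ij} & \text{if } i = j, \\
+\lambda v_{ij} & \text{if } i \neq j,
+\end{cases}
+\qquad
+D = 2\left[\ell(\hat{\lambda}) - \ell(\lambda = 0)\right],
+\end{equation*}
+where $D$ is conventionally compared to a $\chi^2$ distribution with one degree of freedom.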
+ +I fitted phylogenetic regressions for all models in the credible set using the function \emph{gls} in the package \emph{nlme} \cite{nlme}. +The explanatory variables were centred and scaled to allow direct comparison of the coefficients \cite{schielzeth2010simple}. +For each regression model I simultaneously fitted the $\lambda$ parameter as this avoids misspecifying the model \cite{revell2010phylogenetic}. +Unlike the \emph{pgls} function, \emph{gls} does not constrain $\lambda$ to be in the range $\lbrack 0, 1\rbrack$. +$\lambda < 0$ indicates that residuals from the fitted model are distributed on the phylogeny more uniformly than expected by chance. +$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when evolution occurs along a branch than the importance of the phylogeny. +Further, fitting multiple parameters makes interpretation difficult. + + + +To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely). +This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined. +I calculated AICc for each model. +As each model was fitted 50 times, I calculated the mean AICc, $\bar{\text{AICc}}$, across these fits for each model. +$\Delta\text{AICc}$ was calculated as $\bar{\text{AICc}} - \text{min}(\bar{\text{AICc}})$, not the mean of the individual $\Delta\text{AICc}$ scores, to guarantee that the best model has $\Delta\text{AICc} = 0$. +From these $\Delta\text{AICc}$ values I calculated Akaike weights, $w$. +This value can be interpreted as the probability that a model is the best model, given the data, amongst those examined. +For each variable, the Akaike weights of all models containing that variable are summed to give $Pr$. 
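+Written out as a sketch, with $\Delta\text{AICc}$ defined as in Table~\ref{t:models} and with notation introduced here purely for illustration (model index $i$, and $\mathcal{M}_v$ for the set of models containing variable $v$), these quantities are
+\begin{align*}
+\Delta_i &= \bar{\text{AICc}}_i - \text{min}(\bar{\text{AICc}}), \\
+w_i &= \frac{\exp(-\Delta_i / 2)}{\sum_{j} \exp(-\Delta_j / 2)}, \\
+Pr_v &= \sum_{i \in \mathcal{M}_v} w_i .
+\end{align*}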
+This value can be interpreted as the probability that the given variable is in the best model. + +To determine the direction and strength of the effect of each variable, I also calculated the mean of its regression coefficient, $b$, across all models that contained that variable, weighted by each model's Akaike weight. +In the subspecies analysis the inclusion of an interaction term between number of subspecies and study effort makes interpretation of this mean coefficient more difficult, particularly because the interaction term greatly affects the estimated value of $b$. +To aid interpretation, the mean coefficient for the number of subspecies was calculated for: \emph{i}) all models containing the number of subspecies, \emph{ii}) only models with the interaction term and \emph{iii}) only models with the number of subspecies but not the interaction term. + + + +%%begin.rcode boxplotCapt + +# Caption for the main boxplot of subspecies vs virus + +boxplotCapt <- paste( +'The relationship between number of subspecies and viral richness for', +nrow(nSpecies), +'bat species. +The area of the circle shows the number of bat species at each discrete value. +48 bat species have one subspecies and one known virus species. +The red line represents a phylogenetic simple regression between the two variables. 
+' +) + +boxplotTitle <- paste( +'The relationship between number of subspecies and viral richness for', +nrow(nSpecies), +'bat species' +) + +%%end.rcode + +%%begin.rcode boxplot, fig.cap = boxplotCapt, fig.scap = boxplotTitle, fig.height = 2.3 + + +nSpeciesCounts <- nSpecies %>% + group_by(NumberOfSubspecies, virusSpecies) %>% + dplyr::summarize(n = n()) + +ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + + geom_point() + + scale_size(range = c(0.5, 4.3), breaks = c(1, 20, 40)) + + scale_y_continuous(breaks = c(1, 5, 10, max(nSpecies$virusSpecies))) + + scale_x_continuous(breaks = c(1, 4, 8, 12, 16)) + + xlab('Number of Subspecies') + + ylab('Viral Richness') + + geom_abline(slope = sspUni$coef[2, 1], intercept = sspUni$coef[1,1], lwd = 0.7, colour = pokepal('nidorina')[10]) + +%%end.rcode + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Results} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + + +\subsection{Number of Subspecies} +\tmpsection{More descriptive} + +The number of described virus species for a bat host ranged up to \rinline{max(nSpecies$virusSpecies)} viruses in \emph{\rinline{nSpecies$binomial[which.max(nSpecies$virusSpecies)]}}. +There appears to be a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:boxplot}) though few species have more than five subspecies. +Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +However these top seven models all contained study effort, number of subspecies and the interaction between these two variables. 
+The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models. +These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined. + +Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. +The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). +When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). +Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). + + + +\afterpage{ % use after page to make sure this whole table is at the end of a page. +\begin{landscape} +\begin{table}[p!] +\centering +%\rowcolors{2}{gray!25}{white} +\caption[Model selection results]{ +Model selection results for number of subspecies and effective level of gene flow analysis. +Models are ranked according to $\bar{\text{AICc}}$ and only the best nine and three models are shown respectively. 
+Models were fitted to all combinations of variables (in total \rinline{nrow(modelWeights)} number of subspecies models and \rinline{nrow(fstModelWeights)} effective gene flow models). +$\bar{\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. +$\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. +$w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). +$\sum w$ is the cumulative sum of the Akaike weights. +log(Scholar)*NSubspecies indicates the interaction term between study effort and number of subspecies. +%In the number of subspecies analysis there are many models with low $\Delta$AICc scores suggesting there there is no single `best model'. +%In the gene flow analysis, only the top model is supported. +} + + +\begin{tabular}{@{}>{\footnotesize}lrrrr@{}} + +\toprule +\normalsize{Model} & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$\\ +\midrule +&&&&\\[-3mm] +\textit{\small{Number of Subspecies}} &&&&\\ +%1 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) + log(RangeSize) & +\rinline{round(modelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[1, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[1, 5], 2))}\\ +%2 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) & +\rinline{round(modelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[2, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[2, 5], 2))}\\ +%3 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(Mass) & +\rinline{round(modelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[3, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[3, 5], 2))}\\ +%4 +log(Scholar) 
+ NSubspecies + log(Scholar)*NSubspecies & +\rinline{round(modelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[4, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[4, 5], 2))}\\ +%5 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(RangeSize) & +\rinline{round(modelWeights[5 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[5, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[5, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[5, 5], 2))}\\ +%6 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(RangeSize) & +\rinline{round(modelWeights[6 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[6, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[6, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[6, 5], 2))}\\ +%7 +log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random & +\rinline{round(modelWeights[7 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[7, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[7, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))}\\ +%8 +log(Scholar) + NSubspecies + log(Mass) + Random & +\rinline{round(modelWeights[8 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[8, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[8, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[8, 5], 2))}\\ +%9 +log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) + Random & +\rinline{round(modelWeights[9 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[9, 3], 2))} & +\rinline{sprintf("%.2f", round(modelWeights[9, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[9, 5], 2))}\\[5mm] +\textit{\small{Gene flow}} &&&&\\ +log(Scholar) + log(Gene flow) + log(Mass) & +\rinline{round(fstModelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 5], 2))}\\ 
+log(Range size) & +\rinline{round(fstModelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 5], 2))}\\ +log(Mass) & +\rinline{round(fstModelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 3], 2))} & +\rinline{sprintf("%.2f", round(fstModelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 5], 2))}\\ +%log(Scholar) + log(Gene flow) + log(Mass) + Random & +%\rinline{round(fstModelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 3], 2))} & +%\rinline{sprintf("%.2f", round(fstModelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 5], 2))}\\ +\bottomrule +\end{tabular} + +\label{t:models} +\end{table} +\end{landscape} +} + + + + +When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). +Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. +However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. + + + +\tmpsection{Model results} + + +\begin{table}[t!] +\centering +\caption[Estimated variable weights and coefficients for number of subspecies and gene flow analyses]{ +Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses. 
+The coefficients for the number of subspecies variable are given for models with and without the interaction term because this term strongly changes the coefficient and because the coefficient can only be usefully interpreted when estimated without the interaction. +However, there are no weights for these separated terms as they are not directly compared in the model selection framework. +} +%\rowcolors{2}{gray!25}{white} +\begin{tabular}{@{}>{\small}l rrrr@{}} +\toprule +& \multicolumn{2}{c}{\textit{Number of Subspecies}} & \multicolumn{2}{c}{\textit{Gene flow}}\\\cmidrule(rl){2-3}\cmidrule(rl){4-5} +\normalsize{Variable} & $Pr$ & Coefficient & $Pr$ & Coefficient\\ +\midrule +Number of subspecies &&&&\\ +\hspace{3mm}Total & \rinline{sprintf('%.2f', varWeights['NumberOfSubspecies'])} & \rinline{varCoefMeans['beta.NumberOfSubspecies']} &&\\ +\hspace{3mm}Models without interaction term && \rinline{nSpeciesCoefMean} &&\\ +\hspace{3mm}Models with interaction term && \rinline{nSpeciesCoefMeanI} &&\\ +Number of subspecies*log(Scholar) & \rinline{varWeights['scholarRefs.NumberOfSubspecies']} & \rinline{sprintf('%.2f', varCoefMeans['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm] +Gene flow & & & \rinline{sprintf('%.2f', fstVarWeights['Nm'])} & \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm] +log(Scholar) & \rinline{sprintf('%.2f', varWeights['scholarRefs'])} & \rinline{varCoefMeans['beta.scholarRefs']} & + \rinline{sprintf('%.2f', fstVarWeights['scholarRefs'])} & \rinline{fstCoefMeans['beta.scholarRefs']}\\ +log(Mass) & \rinline{sprintf('%.2f', varWeights['mass'])} & \rinline{varCoefMeans['beta.mass']} & + \rinline{sprintf('%.2f', fstVarWeights['mass'])} & \rinline{fstCoefMeans['beta.mass']}\\ +log(Range size) & \rinline{sprintf('%.2f', varWeights['distrSize'])} & \rinline{varCoefMeans['beta.distrSize']}& + \rinline{fstVarWeights['distrSize']} & \rinline{fstCoefMeans['beta.distrSize']}\\ +Random & \rinline{sprintf('%.2f', varWeights['rand'])} & 
\rinline{varCoefMeans['beta.rand']}& + \rinline{fstVarWeights['rand']} & \rinline{fstCoefMeans['beta.rand']}\\ +\bottomrule +\end{tabular} + +\label{t:variables} +\end{table} + + + + +\subsection{Gene Flow} + +\tmpsection{More Descriptive} + +%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. +The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). +Only the model with study effort, gene flow and body mass was well supported, with the second model having a $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). +The effective level of gene flow was very likely to be in the best model ($Pr > 0.99$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). +On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the non-significant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). +Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more variable than the number of subspecies coefficient, with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. + + + + + +%%begin.rcode ITCombPlotCapt + +ITPlotCapts <- " +The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness. 
+The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. +The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. +The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. +The red ``Random'' box is the uniformly random variable. +Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." + +ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness" + +%%end.rcode + + +%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = ITPlotTitle, out.width = '\\textwidth', cache = FALSE + +# Reorder var levels to get structure at beginning. 
+fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) + +# Draw the fst model selection plot +fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + ylim(0, 1) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10) + + +# Combine and print the plots. +ggdraw() + + draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(ITPlot, 0, 0, 0.5, 1) + + draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) + + draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) + + +%%end.rcode + + + + +Study effort was very likely in the best model ($Pr > 0.99$) as was body mass ($Pr > 0.99$). +However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). 
% which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). +In contrast to the number of subspecies analysis, range size was almost certainly not in the best model, with $Pr = $ \rinline{fstVarWeights['distrSize']}. +%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). +Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}). +The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice the size of that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). + + + + +%%begin.rcode fstRawCapt + +fstRawDataCapt <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species. +Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. +The red line represents a phylogenetic simple regression between the two variables. 
+') + + + +fstRawDataTitle <- +paste( +'Relationship between viral richness and log effective gene flow per generation for', +nrow(fstFinal), +'bat species +') + +%%end.rcode + + + +%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle + +# Plot raw fst data + +ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) + + geom_point(size = 2) + + scale_colour_poke(pokemon = 'oddish', spread = 3) + + scale_x_log10() + + geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) + + xlab('Gene Flow (per gen.)') + + ylab('Viral Richness') + +%%end.rcode + + + +When using the phylogeny from \textcite{jones2005bats}, the analysis became very unstable (Figure~\ref{f:A-itplots}). +The support for each variable changed dramatically with each resampling of the random variable. +On average, however, only the model containing mass and range size is supported (Tables~\ref{A-fstModelWeights} and~\ref{t:variables2}). + + + + +\subsection{Phylogenetic Analysis} + +\subsubsection{Number of subspecies} + +Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species. +The mean number of viruses across families is fairly constant, with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean (\rinline{min(familyMeans$mean)}). +The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species. +The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}). 
+ + + +The small change in mean pathogen richness across families and the lack of a clear pattern in Figure~\ref{fig:treePlot} imply that viral richness shows little phylogenetic signal. +This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}). +%This fact implies that other factors must control pathogen richness. +%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. + +Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.99$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). +Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))}, which implied that the residuals from the models were very weakly phylogenetic. +A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. + + + + +\subsubsection{Effective gene flow} + +There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.99$). +Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.99$). +Due to the limited sample size, significance tests are unlikely to have much power. 
+There is little evidence of phylogenetic autocorrelation in study effort ($\lambda$ = \rinline{scholarFstLambda$param['lambda']}, $p$ = \rinline{scholarFstLambda$param.CI$lambda$bounds.p[1]}). +However, there is some weak evidence of phylogenetic signal in range size as the estimated size of $\lambda$ is large while $p$ is also large, potentially due to a lack of statistical power ($\lambda$ = \rinline{distrFstLambda$param['lambda']}, $p$ = \rinline{distrFstLambda$param.CI$lambda$bounds.p[1]}). +Body mass showed significant phylogenetic autocorrelation ($\lambda$ = \rinline{massFstLambda$param['lambda']}, $p$ = \rinline{massFstLambda$param.CI$lambda$bounds.p[1]}). + + +Across all multiple regression models the mean value of $\lambda$ is \rinline{mean(na.omit(fstAllResults$lambda))} and a large number of individual models (\rinline{round(mean(na.omit(fstAllResults$lambda < 0))*100)}\%) had negatively phylogenetically distributed residuals, implying the residuals from the model are spread more uniformly on the phylogeny than expected by chance. +Given the small sample size, this was probably caused by a small number of data points with large residuals being distant on the tree. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Discussion} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\tmpsection{Discuss results in more detail} + + +\tmpsection{Pop structure relates to pathogen richness} + +% It does so here +% I hope this study is more robust. +In this study I have used known viral richness in bats as a case study for the more general hypothesis that increased population structure promotes pathogen richness. 
+In both analyses I found that a positive effect of increasing population structure (a positive effect of the number of subspecies and a negative effect of gene flow) is likely to be in the best model for explaining viral richness. +Only the effective gene flow analysis, when performed using the phylogeny from \textcite{jones2005bats}, does not support this hypothesis. +Therefore, my study supports the broader hypothesis that increased population structure promotes pathogen richness. +The positive relationship between increased population structure and pathogen richness implies that direct or indirect competitive mechanisms are acting such that increased population structure allows escape from competition, which promotes pathogen richness. +Furthermore, my study contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness by increasing the rate of spread of new pathogens entering into the population \cite{nunn2003comparative, morand2000wormy}. + + + +% It does so in some lit +This analysis is in agreement with two studies that have specifically tested this same hypothesis \cite{turmelle2009correlates, maganga2014bat}. +These two studies used $F_{ST}$ \cite{turmelle2009correlates} and fragmentation of species distributions \cite{maganga2014bat}. +Combined with the analysis here using the number of subspecies, three different measures of population structure have been shown to correlate with pathogen richness in bats. +By analysing data on two measures of population structure, and using larger data sets than previous studies, it is hoped that the results here may be more robust than in previous analyses \cite{gay2014parasite, turmelle2009correlates, maganga2014bat}. + + + +% The pattern is reversed in other lit +In contrast, \textcite{gay2014parasite} found the opposite relationship using fragmentation of species distributions. 
+Furthermore, \textcite{bordes2008bat} found no relationship between increased colony size and pathogen richness while \textcite{gay2014parasite} found relationships in opposite directions for virus and ectoparasite richness. +However, the study by \textcite{gay2014parasite} uses relatively few species while the study by \textcite{bordes2008bat} uses group size, which is a measure of local rather than global population structure. +The overall weight of evidence suggests that population structure and pathogen richness are associated. + + + + +\tmpsection{There is an interaction between study effort and number of subspecies} + +% interpretations +% Biases are known in the lit. gippoliti2007problem % maybe should add to methods? + +There was strong support for a positive interaction between the number of subspecies and study effort. +The support for this interaction implies that increased population structure has a stronger relationship with known pathogen richness when study effort is high. +One interpretation of this is that increased population structure alone does not predict high known viral richness; reasonable study effort is also needed to turn the expected high viral richness into known and recorded viral richness. +Biases in identification of subspecies have been noted before \cite{gippoliti2007problem}. +The number of subspecies is more commonly used as a variable in comparative analyses of birds than of mammals, but the fact that it is associated with study effort is often not taken into account \cite{phillimore2007biogeographical, belliure2000dispersal}. + +\tmpsection{Other explanatory vars} + + +% study effort is important. Never forget. +% body mass behaves weirdly. +% Range size is very marginal + +Of the other explanatory variables considered, study effort and body mass were selected as being in the best model while there was marginal evidence for range size being associated with viral richness. 
+Study effort positively predicted pathogen richness, confirming the expectation that additional study of a bat species yields more known viruses infecting that host species. +Therefore, this bias cannot be ignored in studies using known pathogen richness as a proxy for total pathogen richness \cite{nunn2003comparative, gregory1990parasites}. +While body mass is selected as being in the best model in both the number of subspecies analysis and the effective gene flow analysis, the estimated coefficients have opposite signs in the two analyses. +In the number of subspecies analysis, body mass has a positive relationship with pathogen richness, which is in agreement with previous studies \cite{kamiya2014determines, bordes2008bat, turmelle2009correlates, gay2014parasite, maganga2014bat}. +However, in the effective gene flow analysis, body mass has a negative estimated coefficient. +This is in contrast to the number of subspecies analysis, previous studies in the literature and the single-predictor model. +This result is probably due to correlations with other variables in the analysis and exacerbated by the small sample size in this analysis. + + +\tmpsection{phylogeny} +% Phylogeny is not very important +% phylogeny is weird in Fst study? + + + +%Another interpretation is that having few subspecies does not predict low viral richness unless the species has been adequately studied as otherwise the low number of subspecies is probably due to a lack of study rather than an accurate measurement. + +%Another potential mechanism by which structure might be promoting increased richness is by slowing the spread of highly virulent viruses such as rabies and preventing them from having short, intense epidemics followed by extinction. +%This mechanism has an interesting parallel to metapopulation theory in ecology in which a metapopulation structure can allow persistence of species that would otherwise go extinct. 
+ +\subsection{Broader implications} + +The relationship between increased population structure and pathogen richness suggests that population structure has at least some potential to predict high pathogen richness and therefore a species' likelihood of being a reservoir of a potentially zoonotic pathogen. +However, given that it is difficult to measure population structure and given that the relationship appears to be weak at best, this trait on its own is unlikely to be useful in predicting zoonotic risk. +A number of other factors are also associated with pathogen richness, such as body mass and, to a lesser extent, range size (as shown here), as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}. +Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential for use in prioritising zoonotic disease surveillance. +The main hurdle in this approach is finding a way to validate models; due to the study effort bias in current data, predictive models will also be biased. +As unbiased pathogen surveys such as \textcite{anthony2013strategy} become more common, good validation may become possible. +Alternatively, predictive models could be trained on all available (and therefore biased) data and validated by predicting smaller, unbiased data sets such as the data collected in \textcite{maganga2014bat}. + +The relationship between increased population structure and pathogen richness also has implications for habitat fragmentation and range shifts due to global change. +In short, habitat fragmentation and range shifts that reduce movement between populations would be predicted to increase pathogen richness. +However, depending on the mechanisms by which increased population structure increases pathogen richness, this may not be a cause for concern. 
+If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over short to medium-term timescales. +If, however, increased population structure actively promotes the evolution of new pathogen strains or allows the persistence of more virulent strains \cite{blackwood2013resolving, pons2014insights, plowright2011urban}, this could have important public health implications. +Therefore, further studies on the exact mechanisms by which increased population structure affects pathogen richness are needed. + + +\subsection{Study limitations} + +Although I have used measures of study effort to try to control for biases in the viral richness data, this bias could still make the results here unreliable; this is especially true as study effort is by far the strongest predictor of viral richness in both data sets. +It is hoped that as untargeted sequencing of viral genetic material becomes cheaper and more common, this bias can be reduced \cite{anthony2013strategy}. +The strength of the relationship between study effort and known viral richness also highlights the number of bat-virus host-pathogen relationships yet to be discovered and the number of virus species that are yet to be described. + +I have included a number of explanatory variables to avoid spurious correlations. +However, there is little data on bat density or population size. +Given that studies in other mammalian groups have found relationships between host density and pathogen richness, this would be a useful variable to include in further analyses \cite{kamiya2014determines, nunn2003comparative, arneberg2002host}. +Acoustic monitoring is becoming cheaper and less labour intensive and may provide suitable data for estimating population densities or population sizes for more bat species. 
+However, it is not clear whether host population density or host population size is the more appropriate measure with respect to disease dynamics \cite{begon2002clarification}. +Given the importance of geographic range size found here and elsewhere \cite{lindenfors2007parasite, nunn2003comparative, turmelle2009correlates, huang2015parasite, kamiya2014determines}, comparative studies may struggle to distinguish between these three related factors: host population size, population density and geographic range size. + +I have used two measures of population structure and the number of subspecies data set is larger than those used in previous studies. +However, it is clear that the gene flow data set is small ($n$ = \rinline{nrow(fstFinal)}). +This may explain some unexpected results. +While the model averaging approach has given a negative model averaged coefficient for gene flow, the single-predictor model of gene flow against viral richness gave a positive coefficient. +Furthermore, body mass has a negative average coefficient. +This is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model. +It is not easy to interpret these contradictions, but it is clear that the results from the gene flow analysis alone should not be considered strong evidence for a relationship between increased population structure and pathogen richness. +These contradictions also reiterate the need to use large data sets where possible and the need to use multiple measures of population structure to promote robust conclusions. + +Finally, while comparative studies are a useful tool for examining broad trends of pathogen richness across large taxonomic groups, they cannot examine the specific mechanisms that may be underpinning the correlations found. 
+Therefore, further work is needed to test which mechanisms are actually causing the relationship between increased population structure and pathogen richness that I have identified here. +A number of mechanisms might be involved. +A reduced rate of pathogen extinction might be caused by a reduction in competition due to the slow dispersal of competing pathogens. +Alternatively, increased population structure may promote the invasion of new pathogens by creating localised areas of low competition or host immunity. +One method for testing these mechanisms would be through mechanistic epidemiological models. + +\subsection{Conclusions} + + +I have used phylogenetic linear models to show that increased population structure, as measured by a greater number of subspecies and by lower effective levels of gene flow, is associated with higher viral richness in bats. +This study adds to the evidence that increased population structure may promote pathogen richness. +It does not support the view that factors that increase $R_0$ will increase pathogen richness. +Using larger data sets and multiple measurements makes the weight of the evidence here stronger than in previous studies. +However, caution must still be taken in interpreting these results as the data is biased and particularly sparse in one of the analyses. + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%% Repeat analysis with bat clocks and rocks %%%% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%\section{Appendix} + + + +%%begin.rcode treeRead2 + +# Read in trees +t2 <- read.nexus('data/Chapter3/BatST2BL.nex') + +# Make names match previous names +t2$tip.label <- gsub('_', ' ', t2$tip.label) + +#missing <- nSpecies$binomial[!nSpecies$binomial %in% pruneTree2$tip.label ] + +## Copy binomial column. binomial will be changed to fit t2. 
+#nSpecies$oldBinomial <- nSpecies$binomial + +## Replace with agrep where possible +#closeMatch <- sapply(missing, function(i) t2$tip.label[agrep(i, t2$tip.label, max.distance = 0.11)]) + +#closeMatch <- closeMatch[sapply(closeMatch, function(i) length(i) > 0)] + + + + +unneededTips2 <- t2$tip.label[!(t2$tip.label %in% nSpecies$binomial)] + +# Prune tree down to only needed tips. +pruneTree2 <- drop.tip(t2, unneededTips2) + + +nSpecies2 <- sapply(pruneTree2$tip.label, function(x) which(nSpecies$binomial == x)) %>% + nSpecies[., ] + + +################ +## Fst tree ## +################ + + +# Which tips are not needed +fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree2 <- drop.tip(t2, fstUnneededTips2) + +# Which tips in Fst analysis are not in bats clocks tree. +fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] + + +# Hacky cruddy way of placing the missing tips into the tree. Should end up with genus level polytomies in trimmed tree. +# Just replacing some of the uneeded tips with the ones I need. + +t2$tip.label[t2$tip.label == 'Miniopterus pusillus'] <- 'Miniopterus natalensis' +t2$tip.label[t2$tip.label == 'Miniopterus schreibersi'] <- 'Miniopterus schreibersii' +t2$tip.label[t2$tip.label == 'Rousettus celebensis'] <- 'Rousettus leschenaultii' +t2$tip.label[t2$tip.label == 'Myotis oxyotus'] <- 'Myotis macropus' +t2$tip.label[t2$tip.label == 'Myotis leibii'] <- 'Myotis ciliolabrum' + +#Re prune tree +# Which tips are not needed +fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] + +# Prune tree down to only needed tips. +fstTree2 <- drop.tip(t2, fstUnneededTips2) + +# Check we now have all the tips. 
+fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] + +rm(t2) + + + + +%%end.rcode + + +%%begin.rcode treePlot2, show.figs = 'hide', out.width = '\\textwidth', fig.cap = 'Pruned phylogeny \\cite{jones2005bats} with dot size showing number of pathogens and colour showing family.' + + + +## Plot tree +#p2 <- ggtree(pruneTree2, layout = 'fan') + +#p2 %<+% nSpecies2[, 1:6] + +# geom_point(aes(size = virusSpecies, colour = Family), subset=.(isTip)) + +# scale_size(range = c(0.8, 3)) + +# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + +# theme_tcdl + +# theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + +# theme(legend.position = 'right') + +# labs(size = 'Virus Richness') + +# theme(legend.key.size = unit(0.6, "lines"), +# legend.text = element_text(size = 6), +# legend.title = element_text(size = 8)) + + + +%%end.rcode + + + +%%begin.rcode runBatClocks, eval = TRUE + + +fitModelsBootStrap2 <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies2, pruneTree2, b, allModelMat, varList), mc.cores = nCores) + +allResults2 <- do.call(rbind, fitModelsBootStrap2) + +write.csv(allResults2, file = 'data/Chapter3/modelSelectSubspeciesBatClocks.csv') + + +## FST analysis + +fstModelsBootStrap2 <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree2, b, fstModelMat, fstVarList), mc.cores = nCores) + +fstAllResults2 <- do.call(rbind, fstModelsBootStrap2) + +write.csv(fstAllResults2, file = 'data/Chapter3/fstModelSelectSubspeciesBatClocks.csv') + + +%%end.rcode + + +%%begin.rcode batClocksAnalyse + +allResults2 <- read.csv('data/Chapter3/modelSelectSubspeciesBatClocks.csv', row.names = 1) + +varWeights2 <- sapply(names(allResults2)[1:6], function(x) sum(allResults2$weight[allResults2[, x]])/nBoots) + + +sepVarWeights2 <- lapply(1:nBoots, function(b) + sapply(names(allResults2)[1:6], + function(x) + sum(allResults2[allResults2$boot == b, 
'weight'][allResults2[allResults2$boot == b, x]]) + ) + ) + +sepVarWeights2 <- do.call(rbind, sepVarWeights2) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +sepVarWeights2$col <- 'Other Variables' +sepVarWeights2$col[grep('NumberOf', sepVarWeights2$variable)] <- 'Population Structure' +sepVarWeights2$col[sepVarWeights2$variable == 'rand'] <- 'Null' + + + +modelWeights2 <- allResults2 %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) %>% + mutate(string = levels(predictors)[predictors]) + + +#### FST + + +fstAllResults2 <- read.csv('data/Chapter3/fstModelSelectSubspeciesBatClocks.csv', row.names = 1) + +fstSepVarWeights2 <- lapply(1:nBoots, function(b) + sapply(names(fstAllResults2)[1:5], + function(x) + sum(fstAllResults2[fstAllResults2$boot == b, 'weight'][fstAllResults2[fstAllResults2$boot == b, x]]) + ) + ) + +fstSepVarWeights2 <- do.call(rbind, fstSepVarWeights2) %>% + data.frame(., boot = 1:nBoots) %>% + reshape2::melt(., value.name = 'estimate', id.vars = 'boot') + +fstSepVarWeights2$col <- 'Other Variables' +fstSepVarWeights2$col[fstSepVarWeights2$variable == 'Nm'] <- 'Population Structure' +fstSepVarWeights2$col[fstSepVarWeights2$variable == 'rand'] <- 'Null' + + +fstVarWeights2 <- sapply(names(fstAllResults2)[1:5], function(x) sum(fstAllResults2$weight[fstAllResults2[, x]])/nBoots) + + +fstModelWeights2 <- fstAllResults2 %>% + group_by(predictors) %>% + summarise(AICc = mean(AIC)) %>% + mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% + arrange(desc(modelWeight)) %>% + mutate(cumulativeWeight = cumsum(modelWeight)) + + +%%end.rcode + + +%% ------------------------------------------- %% +%% plot bat clocks rocks +%% ------------------------------------------- %% + 
+ +%%begin.rcode ITPlots2 + +# reorder factors to get structure vars at beginning. +sepVarWeights2$variable <- factor(sepVarWeights2$variable, levels(sepVarWeights2$variable)[c(2, 6, 1, 3, 4, 5)]) + +ITPlot2 <- ggplot(sepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +%%end.rcode + + +%%begin.rcode fstITPlots2, fig.show = extraFigs, fig.cap = "Akaike variable weights for both analyses using the phylogeny from \\textcite{jones2005bats}. The probability that each variable is in the best model (amongst the models tested) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. The red ``Random'' box is the uniformly random variable.", fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', out.extra = 'trim = 0 1cm 0 0' + + +# Reorder var levels to get structure at beginning. 
+fstSepVarWeights2$variable <- factor(fstSepVarWeights2$variable, levels(fstSepVarWeights2$variable)[c(2, 1, 3, 4, 5)]) + +# Draw the fst model selection plot +fstIT2 <- ggplot(fstSepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + + geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + + scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + + scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + + ylim(0, 1) + + theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), + panel.grid.major.x = element_blank(), + axis.text.y = element_text(size = 8)) + + scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + + scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + + ylim(0, 1) + + ylab('P(in best model)') + + xlab('') + + +# Combine and print. +ggdraw() + + draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(ITPlot2, 0, 0, 0.5, 1) + + draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(fstIT2, 0.5, 0.164, 0.5, 0.855) + + draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) + + +%%end.rcode + diff --git a/population-structure-affects-pathogen-richness-mechanistic-model.Rtex b/population-structure-affects-pathogen-richness-mechanistic-model.Rtex new file mode 100644 index 0000000..48d450b --- /dev/null +++ b/population-structure-affects-pathogen-richness-mechanistic-model.Rtex @@ -0,0 +1,1503 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code and text for "Understanding how population structure affects pathogen richness in a mechanistic model of bat populations" +% Chapter 3 of thesis "The role of population 
structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +% NB This file is a copy due to the mess up with chapter numbers. +% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter2.Rtex +% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + +%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide' + +#################################### +### Important simulation options ### +#################################### + +# Compilation options +# Run simulations? This will take many hours +runAllSims <- FALSE + +# Save raw simulation output +# This will take ~10GB or so. +# If false, summary statistics of each simulation are saved instead. +saveData <- FALSE + + +# How many cores do you want to use to run simulations? +nCores <- 7 + +########################## +### End options ### +########################## + + +opts_chunk$set(cache.path = '.Ch2Cache/') + +source('misc/theme_tcdl.R') +source('misc/KnitrOptions.R') +theme_set(theme_grey() + theme_tcdl) + +%%end.rcode + + +%%begin.rcode libs, cache = FALSE, result = FALSE + +# My package. For running and analysing Epidemiological sims. +# https://github.com/timcdlucas/metapopepi +library(MetapopEpi) + +# Data manipulations +library(reshape2) +library(dplyr) + +# Calc confidence intervals (could probably do with broom instead now.) +library(binom) + +# To tidy up stats models/tests +library(broom) + + +# Run simulations in parallel +library(parallel) + +# Plotting +library(ggplot2) +library(palettetown) +library(cowplot) + +%%end.rcode + + +\section{Abstract} + +%\tmpsection{One or two sentences providing a basic introduction to the field} +% comprehensible to a scientist in any discipline. +%\lettr{A}n increasingly large proportion of emerging human diseases comes from animals. 
+%These diseases have a huge impact on human health, healthcare systems and economic development. +The chance that a new zoonosis will come from any particular wild host species increases with the number of pathogen species occurring in that host species. +Comparative, phylogenetic studies have shown that host-species traits such as population density and population structure correlate with pathogen richness. +However, the mechanisms by which these factors control pathogen richness in wild animal species remain unclear. +% +% +%\tmpsection{Two to three sentences of more detailed background} +% comprehensible to scientists in related disciplines. +% Add mechanistic vs empirical +Typically it is assumed that well-connected, unstructured populations (that therefore have a high basic reproductive number, $R_0$) promote the invasion of new pathogens and therefore increase pathogen richness. However, this assumption is largely untested in the multipathogen context. +In the presence of inter-pathogen competition, the opposite effect might occur; increased population structure may increase pathogen richness by reducing the effects of competition. +A more mechanistic understanding of how population structure affects pathogen richness could discriminate between these two broad hypotheses. +Here I have examined one mechanism by which increased population structure may cause greater pathogen richness. +I used simulations to test whether increased population structure could increase the probability that a newly evolved pathogen would invade into a population already infected with an identical, endemic pathogen. +I tested this hypothesis using individual-based, metapopulation networks parameterised to mimic wild bat populations as bats have highly varied social structures and have recently been implicated in a number of high profile diseases such as Ebola, SARS, Hendra and Nipah. 
+In a metapopulation, dispersal rate and the number of links between colonies can both affect population structure. +I tested whether either of these factors could increase the probability that a pathogen would invade and persist in the population. +I found that, at intermediate transmission rates, increasing dispersal rate significantly increased the probability of a newly evolved pathogen invading into the metapopulation. +However, there was very limited evidence that the number of links between colonies affected pathogen invasion probability. + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Introduction} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%Possible structure (each number a sep. paragraph): +%1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown +%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important +%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach +%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms +%5. results from analytical models so far and limitations of the approach +%6. what is needed now +%7. what your focus is (including a bit about bat focus) +%8. 'here I show..' what you found briefly to lead into methods + + + + +\tmpsection{General Intro} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% A basic introduction to the field, +% comprehensible to a scientist in any discipline. + +%1. 
zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown +\tmpsection{Why is pathogen richness important?} +Over 60\% of emerging infectious diseases have an animal source \cite{jones2008global, smith2014global}. +Zoonotic pathogens can be highly virulent \cite{luby2009recurrent, lefebvre2014case} and can have huge public health impacts \cite{granich2015trends} and economic costs \cite{knobler2004learning} and can slow international development \cite{ebolaWorldbank}. +Therefore understanding and predicting changes in the process of zoonotic spillover is a global health priority \cite{taylor2001risk}. +The number of pathogen species hosted by a wild animal species affects the chance that a disease from that species will infect humans \cite{wolfe2000deforestation}. +However, the factors that control the number of pathogen species in a wild animal population are still unclear \cite{metcalf2015five}; in particular our mechanistic understanding of how population processes inhibit or promote pathogen richness is poor. + + + +\tmpsection{Specific Intro} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% more detailed background} +% comprehensible to scientists in related disciplines. + + +\tmpsection{We know some factors that correlate with pathogen richness} +%population density, longevity, body size and population structure + +%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important +%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach + +In comparative studies, a number of host traits have been shown to correlate with pathogen richness including body size \cite{kamiya2014determines, arneberg2002host}, population density \cite{nunn2003comparative, arneberg2002host} and range size \cite{bordes2011impact, kamiya2014determines}. 
+A further factor that may affect pathogen richness is population structure. +In comparative studies it is often assumed that factors that promote fast disease spread should promote high pathogen richness; the faster a new pathogen spreads through a population, the more likely it is to persist \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. +However, this assumption ignores competitive mechanisms such as cross-immunity and depletion of susceptible hosts. +If competitive mechanisms are strong, endemic pathogens in populations with high $R_0$ will be able to easily out-compete invading pathogens. +Only if competitive mechanisms are weak will high $R_0$ enable the invasion of new pathogens and allow higher pathogen richness. + +Overall, the evidence from comparative studies indicates that increased population structure correlates with higher pathogen richness. +This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +However, there are a number of studies that contradict this conclusion \cite{gay2014parasite, bordes2007rodent, ezenwa2006host}. +Comparative studies are often contradictory due to small sample sizes, noisy data and because empirical relationships often do not extrapolate well to other taxa. +Furthermore, multicollinearity between many traits also makes it hard to clearly distinguish which factors are important \cite{nunn2015infectious}. +However, meta-analyses can be used to combine studies to help generalise conclusions \cite{kamiya2014determines}. + + +%3. 
limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach + +Furthermore, knowing which factors correlate with pathogen richness does not tell us if, or how, they causally control pathogen richness. +This lack of a solid mechanistic understanding of these processes prevents predictions of how wild populations will respond to perturbations such as increased human pressure and global change. +As habitats fragment we expect wild populations to change in a number of ways including becoming smaller and less well connected \cite{andren1994effects, cushman2012separating}. +As multiple population-level factors are likely to change simultaneously due to global change, the correlative relationships examined in comparative studies are unlikely to effectively predict future changes in pathogen richness. +Mechanistic models are needed to project how these highly non-linear disease systems will respond to the multiple, simultaneous stressors affecting them. + + + +\tmpsection{Network structure has been studied} +%5. results from analytical models so far and limitations of the approach +%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms + +There are a number of mechanisms by which population structure could increase pathogen richness. +Firstly, population structure may reduce competition between pathogens. +In analytical models of well-mixed populations competitive exclusion has been predicted \cite{ackleh2003competitive, bremermann1989competitive, martcheva2013competitive, qiu2013vector, allen2004sis}. +In models where competitive exclusion occurs in well-mixed populations, population structure has sometimes been shown to allow coexistence \cite{qiu2013vector, allen2004sis, nunes2006localized, garmer2016multistrain}. 
+Alternatively, population structure may promote the evolution of new strains within a species \cite{buckee2004effects}, reduce the rate of pathogen extinction \cite{rand1995invasion} or increase the probability of pathogen invasion from other host species \cite{nunes2006localized}. +These separate mechanisms have not been examined and it is difficult to see how they could be distinguished through comparative methods. + +%Competing epidemics, or two pathogens spreading at the same time in a population, is a well studied area \cite{poletto2013host, poletto2015characterising, karrer2011competing}. +%This area is related to the study of pathogen richness in that they indicate that dynamics of multiple pathogens in a population do depend on population %structure. +%However, the results for short term epidemic competition do not directly transfer to the study of long term disease persistence. + + +%6. what is needed now +\tmpsection{The gap} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% One sentence clearly stating the general problem +% being addressed by this particular study. +% By this stage, must have defined/introduced all terms used within. + +Currently, the literature contains very abstract, simplified models \cite{qiu2013vector, allen2004sis, garmer2016multistrain, may1994superinfection}. +These cannot be easily applied to real data. +They also do not easily give quantitative predictions of pathogen richness; typically they predict either no pathogen coexistence \cite{bremermann1989competitive, martcheva2013competitive} or infinite pathogen richness \cite{may1994superinfection}. +Models that can give quantitative predictions of pathogen richness in wild populations are more applicable to real-world issues such as zoonotic disease surveillance. 
+While predicting an absolute value of pathogen richness for a wild species is likely to be impossible, models that attempt to rank species from highest to lowest pathogen richness are still useful for prioritising species for surveillance. +This requires a middle ground of model complexity. + +\tmpsection{What I did} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%7. what your focus is (including a bit about bat focus) + +In order to capture this middle ground, I have used metapopulation models. +Unlike two-patch models that are used to add population structure while keeping model complexity to a minimum \cite{qiu2013vector, allen2004sis, garmer2016multistrain}, the metapopulations used here split the population into multiple subpopulations. +I have used two independent variables that alter population structure: dispersal rate and metapopulation network topology. +I have studied the invasion of new pathogens as a mechanism for increasing pathogen richness. +In particular I have focused on studying the invasion of a newly evolved pathogen that is therefore identical in epidemiological parameters to the endemic pathogen. +Furthermore, this close evolutionary relationship means that competition via cross-immunity is strong. + +\tmpsection{Why bats} +The metapopulations were parameterised to broadly mimic wild bat populations. +Population structure has already been found to correlate with pathogen richness in bats (Chapter~\ref{ch:empirical}, \cites{gay2014parasite, maganga2014bat, turmelle2009correlates}). +Furthermore, bats have an unusually large variety of social structures. +Colony sizes range from ten to 1 million individuals \cite{jones2009pantheria} and colonies can be very stable \cite{kerth2011bats, mccracken1981social}. +This strong colony fidelity means they fit the assumptions of metapopulations well. +Bats have also, over the last decade, become a focus for disease research \cite{calisher2006bats, hughes2007emerging}. 
+The reason for this focus is that they have been implicated in a number of high profile diseases including Ebola, SARS, Hendra and Nipah \cite{calisher2006bats, li2005bats}. + +%8. 'here I show..' what you found briefly to lead into methods + +\tmpsection{What I found} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% One sentence summarising the main result +% (with the words “here we show” or their equivalent). + +Here I found that, given the assumptions of a metapopulation, increased dispersal significantly increased the probability of invasion of new pathogens. +Furthermore, structured populations nearly always had a lower probability of pathogen invasion than fully-mixed populations of equal size. +The topology of the network did not strongly affect the probability of pathogen invasion as long as the population was not completely unconnected. +Overall, I found significant evidence that reduced population structure increases the probability of invasion of a new pathogen, implying a role for the generation of pathogen richness more generally. + +\begin{figure}[t] +\centering + \includegraphics[width=0.5\textwidth]{imgs/SIRoption1.pdf} + \caption[Schematic of the SIR model used]{ + Schematic of the SIR model used. + Individuals are in one of five classes, susceptible (orange, $S$), infected with Pathogen 1, Pathogen 2 or both (blue, $I_1, I_2, I_{12}$) or recovered and immune from further infection (green, $R$). + Transitions between epidemiological classes occur as indicated by solid arrows. + Vital dynamics (births and deaths) are indicated by dashed arrows. + Parameter symbols for transitions are indicated. + Note that individuals in $I_{12}$ move into $R$, not back to $I_1$ or $I_2$. + That is, recovery from one pathogen causes immediate recovery from the other pathogen. 
+ } +\label{f:sir} +\end{figure} + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Methods} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%% + + +\subsection{Two pathogen SIR model} + +I developed a multipathogen, SIR compartment model with individuals being classed as susceptible, infected or recovered with immunity (Figure~\ref{f:sir}). +Susceptible individuals are counted in class $S$ (see Table~\ref{t:params} for a list of symbols and values used). +There are three infected classes, $I_1$, $I_2$ and $I_{12}$, being individuals infected with Pathogen 1, Pathogen 2 or both respectively. +Recovered individuals, $R$, are immune to both pathogens, even if they have only been infected with one (i.e.\ there is complete cross-immunity). +Furthermore, recovery from one pathogen moves an individual straight into the recovered class, even if the individual is infected with both pathogens (Figure~\ref{f:sir}). +This modelling choice allows the model to be easily expanded to include more than two pathogens, though this study is restricted to two pathogens. +The assumption of immediate recovery from all other diseases is likely to be reasonable. +Any up-regulation of innate immune response will affect both pathogens equally. +Furthermore, as the pathogens are identical, any acquired immunity would also affect both pathogens equally. + +The coinfection rate (the rate at which an infected individual is infected with a second pathogen) is adjusted compared to the infection rate by a factor $\alpha$. +As in \textcite{castillo1989epidemiological}, low values of $\alpha$ imply lower rates of coinfection. 
+In particular, $\alpha = 0$ indicates no coinfections, $\alpha = 1$ indicates that coinfections happen at the same rate as first infections while $\alpha > 1$ indicates that coinfections occur more readily than first infections. + + +\begin{figure}[t] +{\centering +\subfloat[Minimally connected\label{fig:fullyConnected}]{ + \includegraphics[width=0.45\textwidth]{imgs/minimallyConnected.pdf} +} +\subfloat[Fully connected +\label{fig:minimallyConnected}]{ + \includegraphics[width=0.45\textwidth]{imgs/fullyConnected.pdf} +} +} +\caption[Network topologies used to compare network connectedness]{ +The two network topologies used to test whether network connectedness influences a pathogen's ability to invade. +A) Animals can only disperse to neighbouring colonies. +B) Dispersal can occur between any colonies. +Blue circles are colonies of \num{3000} individuals. +Dispersal only occurs between colonies connected by an edge (black line). +The dispersal rate is held constant between the two topologies. +} +\label{f:net} +\end{figure} + + +To model the long-term persistence of pathogens it is necessary to include vital dynamics (births and deaths) as the SIR model without vital dynamics has no endemic state. +Birth and death rates ($\Lambda$ and $\mu$) are set to be equal, meaning the population does not systematically increase or decrease. +The population size does, however, change as a random walk. +Newborn individuals enter the susceptible class. +Infection and coinfection were assumed to cause no extra mortality as for a number of viruses, bats show no clinical signs of infection \cite{halpin2011pteropid, deThoisy2016bioecological}. 
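+
+As a rough worked illustration of these vital dynamics (a sketch using the default values in Table~\ref{t:params}, not an additional result), a population of $N = 30{,}000$ individuals with $\Lambda = \mu = 0.05$ has equal total birth and death rates,
+\begin{align}
+  \Lambda N = \mu N = 0.05 \times 30{,}000 = 1{,}500 \text{ events per year},
+\end{align}
+so the expected change in population size is $(\Lambda - \mu)N = 0$, while each individual birth or death still moves $N$ up or down by one.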
+%In humans, coinfection generally worsens health \cite{griffiths2011nature} but as there are + +\tmpsection{Metapopulation} + + +\begin{table}[tb] +\centering +\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. +The justifications for parameter values are given in Section~\ref{s:paramSelect}.} + +\begin{tabular}{@{}lp{6cm}p{2.9cm}r@{}} +\toprule +Symbol & Explanation & Units & Value\\ +\midrule +$\rho$ & Number of pathogens && 2\\ +$x, y$ & Colony index &&\\ +$p$ & Pathogen index i.e.\ $p\in\{1,2\}$ for pathogens 1 and 2 & &\\ +$q$ & Disease class i.e.\ $q\in\{1,2,12\}$&\\ +$S_x$ & Number of susceptible individuals in colony $x$ &&\\ +$I_{qx}$ & Number of individuals infected with disease(s) $q \in \{1, 2, 12\}$ in colony $x$ &&\\ +$R_x$ & Number of individuals in colony $x$ in the recovered with immunity class &&\\ +$N$ & Total Population size && 30,000\\ +$m$ & Number of colonies&& 10\\ +$n$ & Colony size && 3,000\\ +$a$ & Area & \si{\square\kilo\metre}& 10,000\\ +$\beta$ & Transmission rate & & 0.1 -- 0.4\\ +$\alpha$ & Coinfection adjustment factor & & 0.1\\ +$\gamma$ & Recovery rate & year$^{-1}.$individual$^{-1}$ & 1\\ +$\xi$ & Dispersal rate & year$^{-1}.$individual$^{-1}$ & 0.001--0.1\\ +$\Lambda$ & Birth rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ +$\mu$ & Death rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ +$k_x$ & Degree of node $x$ (number of colonies that individuals from colony $x$ can disperse to). &&\\ +$\delta$ & Waiting time until next event & years &\\ + +$e_i$ & The rate at which event $i$ occurs & year$^{-1}$&\\ +\bottomrule +\end{tabular} + +\label{t:params} +\end{table} + + +The population is modelled as a metapopulation, being divided into a number of subpopulations (colonies). +This model is an intermediate level of complexity between fully-mixed populations and contact networks. 
+There is ample evidence that bat populations are structured to some extent. +This evidence comes from the existence of subspecies, measurements of genetic dissimilarity and ecological studies \cite{kerth2011bats, mccracken1981social, burns2014correlates, wilson2005mammal}. +Therefore a fully mixed population is a large oversimplification. +However, trying to study the contact network relies on detailed knowledge of individual behaviour which is rarely available. + +The metapopulation is modelled as a network with colonies being nodes and dispersal between colonies being indicated by edges (Figure~\ref{f:net}). +Individuals within a colony interact randomly so that the colony is fully mixed. +Dispersal between colonies occurs at a rate $\xi$. +Individuals can only disperse to colonies connected to theirs by an edge in the network. +The rate of dispersal is not affected by the number of edges a colony has (known as the degree of the colony and denoted $k$). +Therefore, the dispersal rate from a colony $y$ with degree $k_y$ to colony $x$ is $\xi / k_y$. +Note that this rate is not affected by the degree or size of colony $x$. + + + +\tmpsection{Stochastic simulations} + +I examined this model using stochastic, continuous-time simulations implemented in \emph{R} \cite{R}. +The implementation is available as an \emph{R} package on GitHub \cite{metapopepi}. +The model can be written as a continuous-time Markov chain. +The Markov chain contains the random variables $((S_x)_{x = 1\ldots m}, (I_{qx})_{x =1\ldots m,\:q \in \{1, 2, 12\}}, (R_x)_{x = 1\ldots m})$. +Here, $(S_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of susceptibles in each colony. +$(I_{qx})_{x =1\ldots m, q \in \{1, 2, 12\}}$ is a length $m \times 3$ vector describing the number of individuals of each disease class ($q \in \{1, 2, 12\}$) in each colony. +Finally, $(R_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of individuals in the recovered class. 
+The model is a Markov chain where extinction of both pathogen species and extinction of the host species are absorbing states. +The expected time for either host to go extinct is much larger than the duration of the simulations. + +At any time, suppose the system is in state $((s_x), (i_{x,q}), (r_x))$. +At each step in the simulation we calculate the rate at which each possible event might occur. +One event is then randomly chosen, weighted by its rate +\begin{align} + p(\text{event } i) = \frac{e_i}{\sum_j e_j}, +\end{align} +where $e_i$ is the rate at which event $i$ occurs and $\sum_j e_j$ is the sum of the rates of all possible events. +Finally, the length of the time step, $\delta$, is drawn from an exponential distribution +\begin{align} + \delta \sim \operatorname{Exp}\left(\sum_j e_j \right). +\end{align} + + +We can now write down the rates of all events. +%I defined $I^+_p$ to be the sum of all classes that are infectious with pathogen $p$, for example $I^+_1 = I_1 + I_{12}$. +Assuming asexual reproduction, that all classes reproduce at the same rate and that individuals are born into the susceptible class we get +\begin{align} + s_x \rightarrow s_x + 1 \;\;\;\text{at a rate of}\;\; \Lambda\left( s_{x}+\sum_q i_{qx} + r_{x}\right) +\end{align} +where $s_x \rightarrow s_x + 1$ is the event that the number of susceptibles in colony $x$ will increase by 1 (a single birth) and $\sum_q i_{qx}$ is the sum of all infection classes $q~\in~\{1, 2, 12\}$. +The rates of death, given a death rate $\mu$, and no increased mortality due to infection, are given by +\begin{align} + s_x \rightarrow s_x-1 &\;\;\;\text{at a rate of}\;\; \mu s_x, \\ + i_{qx} \rightarrow i_{qx}-1 &\;\;\text{at a rate of}\;\; \mu i_{qx},\\ + r_x \rightarrow r_x-1 &\;\;\;\text{at a rate of}\;\; \mu r_x. +\end{align} + + + +I modelled transmission as being density-dependent. 
+This assumption was more suitable than frequency-dependent transmission as I was modelling a disease transmitted by saliva or urine in highly dense populations confined to caves, buildings or potentially a small number of tree roosts. +I was notably not modelling a sexually transmitted disease (STD) as spillover of STDs from bats to humans is likely to be rare. +Infection of a susceptible with either Pathogen 1 or 2 is therefore given by +\begin{align} + i_{1x} \rightarrow i_{1x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{1x} + i_{12x}\right),\\ + i_{2x} \rightarrow i_{2x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{2x} + i_{12x}\right), +\end{align} +while coinfection, given the coinfection adjustment factor $\alpha$, is given by +\begin{align} + i_{12,x} \rightarrow i_{12,x}+1,\;\;\; i_{1x} \rightarrow i_{1x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{1x}\left(i_{2x} + i_{12x}\right),\\ + i_{12,x} \rightarrow i_{12,x}+1,\;\;\; i_{2x} \rightarrow i_{2x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{2x}\left(i_{1x} + i_{12x}\right). +\end{align} +Note that lower values of $\alpha$ give lower rates of infection as in \textcite{castillo1989epidemiological}. + + +The rate of migration from colony $y$ (with degree $k_y$) to colony $x$, given a dispersal rate $\xi$ is given by +\begin{align} + s_x \rightarrow s_x+1,\;\;\; s_y \rightarrow s_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi s_y}{k_y},\\ + i_{qx} \rightarrow i_{qx}+1,\;\;\; i_{qy} \rightarrow i_{qy}-1 &\;\;\text{at a rate of}\;\; \frac{\xi i_{qy}}{k_y},\\ + r_x \rightarrow r_x+1,\;\;\; r_y \rightarrow r_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi r_y}{k_y}. +\end{align} +Note that the dispersal rate does not change with infection. +As above, this is due to the low virulence of bat viruses. 
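+
+As a worked illustration of the migration rates above, take the default values in Table~\ref{t:params} ($n = 3000$, $\xi = 0.01$) and assume, for simplicity, that colony $y$ is at its starting size.
+Summing over all classes, the rate at which individuals move from colony $y$ to one particular neighbour $x$ is then
+\begin{align}
+  \frac{\xi n}{k_y} &= \frac{0.01 \times 3000}{9} \approx 3.3 \;\;\text{per year for}\;\; k_y = 9,\\
+  \frac{\xi n}{k_y} &= \frac{0.01 \times 3000}{2} = 15 \;\;\text{per year for}\;\; k_y = 2.
+\end{align}
+Here $k_y = 9$ and $k_y = 2$ are taken to be the degrees in the fully and minimally connected networks of ten colonies.
+In both cases the total rate of emigration from colony $y$ is $\xi n = 30$ individuals per year; only the number of destinations that this flow is divided between changes.
+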
+Finally, recovery from any infectious class occurs at a rate $\gamma$ +\begin{align} + i_{qx} \rightarrow i_{qx}-1,\;\; r_x \rightarrow r_x+1 \;\;\text{at a rate of}\;\; \gamma i_{qx}. +\end{align} + + +%%begin.rcode SimLengths + + # These apply to both topo and disp sims. And probably should apply to extinction sims if I include them. + # How long should the simulation last? + nEvent <- 8e5 + + # When should the invading pathogen be added. + invadeT <- 3e5 + +%%end.rcode + + + + + + +% ------------------------------------------------------------------ % +% Dispersal Sims +% ------------------------------------------------------------------ % + + +%%begin.rcode DispSimsFuncs + + ################################# + # Dispersal sim definitions # + ################################# + + + # How often should the population be sampled. Only sampled populations are saved. + sample <- 1000 + + # How many simulations to run? + each <- 100 + nDispSims <- 12 * each + + +# Define our simulation function. +fullSim <- function(x){ + + dispVec <- rep(c(0.001, 0.01, 0.1), each = nDispSims/3 +1) + disp <- dispVec[x] + + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4 ), nDispSims/3 + 1) + tran <- tranVec[x] + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. 
+ p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/DispSims/DispSim_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runDispSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 33355 +set.seed(seed) + +# If we want to save the data, make a directory for it. +if(saveData){ + dir.create('data/Chapter2/DispSims/') +} + +# Run sims. +z <- mclapply(1:nDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/DispSims.csv') + +%%end.rcode + + + + + + +%%begin.rcode extraMidBeta, eval = runAllSims + + +nExtraSims <- 150 + +# Define our simulation function. +fullSim <- function(x){ + + dispVec <- rep(c(0.001, 0.01, 0.1), each = nExtraSims/3 + 1) + disp <- dispVec[x] + + tranVec <- rep(c(0.2, 0.3), nExtraSims/2 + 1) + tran <- tranVec[x] + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. 
+ p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + + # Time until extinction + invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] + d$totalTime <- sum(p$sampleWaiting) + d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] + + d$pathInv <- sum(p$sample[c(1, 4), , dim(p$sample)[3]]) + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + rm(p) + + return(d) + +} + + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 787 +set.seed(seed) + + +# Run sims. 
+z <- mclapply(1:600, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/extraMidBeta.csv') + + +%%end.rcode + + + +% ------------------------------------------------------------------ % +% Topology Sims +% ------------------------------------------------------------------ % + + + +%%begin.rcode TopoSimsFuncs + + ################################# + # Topology sim definitions # + ################################# + + # How many simulations to run? + nTopoSims <- 8 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Chose maxdistance so that we get either fully connected or circle networks. + mxDis <- rep(c(40, 200), each = nTopoSims/2 + 1)[x] + + # Chose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nTopoSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.01, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = mxDis, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? 
+ invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/TopoSims/TopoSim_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runTopoSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 1230202 +set.seed(seed) + +# If we want to save the data, make a directory for it. +if(saveData){ + dir.create('data/Chapter2/TopoSims/') +} + +# Run sims. +z <- mclapply(1:nTopoSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/TopoSims.csv') + +%%end.rcode + + + + + + +% ------------------------------------------------------------------ % +% Unstructured Sims +% ------------------------------------------------------------------ % + +%%begin.rcode unstructuredSimsFuncs + + ################################# + # Topology sim definitions # + ################################# + + # How many simulations to run? + nUnstructuredSims <- 4 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Chose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nUnstructuredSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). 
+ simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 29800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + p$I[2, 2, 1] <- 200 + p$I[1, 1, 1] <- 0 + + # Recalculate rates of each event after seeding. + p <- transRates(p, 1) + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[3, 2, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/UnstructuredSims/UnstructuredSims_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runUnstructuredSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 13 +set.seed(seed) + +# If we want to save the data, make a directory for it. 
+if(saveData){ + dir.create('data/Chapter2/UnstructuredSims/') +} + +# Run sims. +z <- mclapply(1:nUnstructuredSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/unstructuredSims.csv') + +%%end.rcode + + + + + +%%begin.rcode noDispSimsFuncs + + + ################################# + # Topology sim definitions # + ################################# + + # How many simulations to run? + nNoDispSims <- 4 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Chose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nNoDispSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 2800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + p$I[2, 2, 1] <- 200 + p$I[1, 1, 1] <- 0 + + # Recalculate rates of each event after seeding. + p <- transRates(p, 1) + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[3, 2, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? 
+ invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + d$path2 <- sum(p$sample[c(2, 4), , dim(p$sample)[3]]) + + # Time until extinction + invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] + d$totalTime <- sum(p$sampleWaiting) + d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] + + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + + rm(p) + + return(d) + +} + + + +%%end.rcode + +%%begin.rcode runNoDispSims, eval = runAllSims, cache = FALSE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 19 +set.seed(seed) + +# Run sims. +z <- mclapply(1:nNoDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + + + +# Save summary data. +write.csv(z, file = 'data/Chapter2/noDispSims.csv') + +%%end.rcode + + + +\subsection{Parameter selection} +\label{s:paramSelect} + +The fixed parameters were chosen to roughly reflect realistic wild bat populations. +The death rate $\mu$ was set as 0.05 per year giving a generation time of 20 years. +The birth rate $\Lambda$ was set to be equal to $\mu$. +This yields a population that does not systematically increase or decrease. +However, the size of each colony changes as a random walk. 
+Given the length of the simulations, colonies were very unlikely to go extinct (Figure~\ref{fig:plotsNoInvade2}). +The starting size of each colony was set to \si{3000}. +This is appropriate for many bat species \cite{jones2009pantheria}, especially the large, frugivorous \emph{Pteropodidae} that have been particularly associated with recent zoonotic diseases. + +The recovery rate $\gamma$ was set to one, giving an average infection duration of one year. +This is therefore a long lasting infection but not a chronic infection. +It is very difficult to directly estimate infection durations in wild populations but it seems that these infections might sometimes be long lasting \cite{peel2012henipavirus, plowright2015ecological}. +However, other studies have found much shorter infectious periods \cite{amengual2007temporal}. +These shorter infections are not studied further here. %todo consider readding this. " as preliminary simulations found that they could not persist in the relatively small populations being modelled here." + +Four values of the transmission rate $\beta$ were used, 0.1, 0.2, 0.3 and 0.4. +These values were chosen to cover the range of behaviours, from very high probabilities of invasion of the second pathogen, to very low probabilities. +All simulations were run under all four transmission rates as this is such a fundamental parameter. +The coinfection adjustment parameter, $\alpha$, was set to 0.1 so that an individual infected with one pathogen is 90\% less likely to be infected with another. +This is a rather arbitrary value. +However, the rationale of the model was that the invading species might be a newly speciated strain of the endemic species. +Furthermore, the model assumes complete cross-immunity after recovery from infection. +Therefore cross-immunity to coinfection is likely to be very strong as well. 
+Some pairs of closely related bat viruses have been found to coinfect individual bats less than would be expected by chance \cite{anthony2013strategy}. +This indicates a level of cross-immunity between these pairs of viruses. %todo I'm sure there was a marburg ebola paper... + + + + + +\subsection{Experimental setup} + +The metapopulation was made up of ten colonies. +Ten colonies was selected as a trade-off between computation time and a network complex enough that any effects of population structure could be detected. +This value is artificially small compared to wildlife populations. +In each simulation, the na{\"i}ve population was seeded with ten sets of 200 individuals infected with Pathogen 1. +These groups were seeded into randomly selected colonies with replacement. +For each 200 infected individuals added, 200 susceptible individuals were removed to keep starting colony sizes constant. +Pathogen 1 was then allowed to spread until the initial, large epidemic had ended. +Visual inspection of preliminary simulations was used to decide on \SI{\rinline{invadeT}} events as being long enough for the epidemic to end and the pathogen to be in an endemic state (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). +After \SI{\rinline{invadeT}} events, five individuals infected with Pathogen 2 were added to one randomly selected colony. +After another \SI{\rinline{nEvent - invadeT}} events the invasion of Pathogen 2 was considered successful if any individuals were still infected with Pathogen 2. +Therefore, if at least one individual was in class $I_2$ or $I_{12}$ at the end of the simulation, this was considered an invasion. +Again, visual inspection of preliminary simulations was used to determine that after \SI{\rinline{nEvent - invadeT}} events, if an invading pathogen was still present, it was well established (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). 
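+
+As a rough guide to how these event counts translate into simulated time, recall that the waiting time before each event has expectation $1 / \sum_j e_j$.
+The following is an illustrative bound only; it assumes that the total population stays close to its starting size of $N = 30{,}000$, that $\xi = 0.01$, and that the simulation settings are $8 \times 10^5$ events in total with the invasion after $3 \times 10^5$ events.
+Counting only births, deaths and dispersal,
+\begin{align}
+  \sum_j e_j \geq \left(\Lambda + \mu + \xi\right) N = \left(0.05 + 0.05 + 0.01\right) \times 30{,}000 = 3{,}300 \;\;\text{events per year}.
+\end{align}
+Infection and recovery events only add to this total, so under these assumptions $8 \times 10^5$ events correspond to at most roughly $8 \times 10^5 / 3{,}300 \approx 240$ years, of which at most roughly $5 \times 10^5 / 3{,}300 \approx 150$ years follow the introduction of Pathogen 2.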
+ +The choice to use a fixed number of events, rather than a fixed number of years, was for computational convenience. +However, this choice creates a risk of bias as simulations with a greater total rate of events, $\sum_j e_j$ (e.g.,\ faster disease transmission) will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). +However, visual inspection of the dynamics of disease extinction (Figure~\ref{fig:plotsNoInvade1}), and examination of the typical time to extinction suggests that this bias is negligible. +For example, of the simulations where extinction occurred, the extinction occurred more than 50 years before the end of the simulation in 90\% of cases. +On a preliminary run of 106 simulations across all combinations of dispersal and transmission rates, examining the population after \SI{700000} events instead of \SI{\rinline{nEvent}} events gave exactly the same result with respect to the binary state of invasion or no invasion. + + +\subsection{Population structure} + +As a baseline for comparison, I ran simulations of a fully unstructured population. +These simulations were run with a population of \SI{30000} so that the total population size was equal to that of the total metapopulation size in the structured simulations. +I ran 100 simulations at each transmission rate. + + +\tmpsection{Dispersal} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +Two parameters control population structure in the model: dispersal rate and the topology of the metapopulation network. +The values used for these parameters were chosen to highlight the effects of population structure. +I selected the dispersal rates $\xi = 0, 0.001, 0.01$ and $0.1$ dispersals per individual per year. +The probability that an individual disperses at least once in its lifetime is given by $\xi / \left(\xi + \mu\right)$. +Therefore, $\xi = 0.1$ relates to 67\% of individuals dispersing between colonies at least once in their lifetime. 
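+This expression follows from treating dispersal and death as competing events with constant rates $\xi$ and $\mu$, as in the simulation.
+The first dispersal occurs at time $t$ with density $\xi e^{-\xi t}$ and the individual is still alive at time $t$ with probability $e^{-\mu t}$, so that
+\begin{align}
+  P(\text{disperse at least once}) = \int_0^\infty \xi e^{-\xi t} \, e^{-\mu t} \, dt = \frac{\xi}{\xi + \mu}.
+\end{align}
+With $\mu = 0.05$, the highest dispersal rate, $\xi = 0.1$, gives $0.1 / 0.15 \approx 0.67$.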
+Exclusively juvenile dispersal would have dispersal rates similar to this value. %todo cite +$\xi = 0.01$ relates to 17\% of individuals dispersing at least once in their lifetime. +This value is relatively close to male-biased dispersal, with female philopatry. %todo cite +$\xi = 0.001$ relates to 2\% of individuals dispersing during their lifetime. +This therefore relates to a species that does not habitually disperse. +Finally, I ran simulations with no dispersal. +Given zero dispersal, only the colony seeded with Pathogen 2 could ever receive infections of the invading pathogen. +Therefore, only one colony was simulated for $\xi = 0$. +While altering the dispersal rate I used a fully connected network topology. +I ran 100 simulations for most parameter sets. +I ran 150 simulations for $\xi = 0.1, 0.01$ and $0.001$ with $\beta = 0.2$ and $0.3$. +Preliminary simulations indicated that any effects of population structure would most likely be seen at these values, so the extra simulations were run to increase statistical power. + + + +\tmpsection{Network structure} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +I also altered the topology of the metapopulation network. +The network topology was created to be either fully or minimally connected (Figure~\ref{f:net}). +To model a completely unconnected population the $\xi = 0$ simulations from above were used. +While altering network topology, the intermediate dispersal rate, $\xi = 0.01$, was used. +I again ran 100 simulations for each parameter set. + + + +\subsection{Statistical analysis} + +I used generalised linear models (GLMs) with a binary response variable, invasion or not, to test the hypothesis that the probability of invasion increased with dispersal. +I examined $p$ values and the regression coefficient, $b$, from each model. +Separate GLMs were fitted for each transmission rate.
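+Written out, and assuming the default logit link of the binomial family in \emph{R}, each of these models has the form
+\begin{align}
+  \log\left(\frac{P(\text{invasion})}{1 - P(\text{invasion})}\right) = a + b\,\xi,
+\end{align}
+where $a$ is an intercept (a symbol introduced here purely for illustration) and a positive value of $b$ indicates that the probability of invasion increases with the dispersal rate $\xi$.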
+These tests were performed both with and without the $\xi = 0$ results as the complete lack of dispersal makes these simulations qualitatively different to the other simulations. +To test whether the different topologies had different probabilities of invasion, I used Fisher's exact tests because topology is best described as a categorical variable. +As with the $\xi = 0$ results, these tests were performed both with and without the completely unconnected topology results. +Finally, I also used binomial GLMs to test the hypothesis that the probability of invasion increased with transmission rate. +Separate GLMs were fitted for each dispersal rate and network topology. +All statistical analyses were performed using the \emph{stats} package in \emph{R}. +The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/population-structure-affects-pathogen-richness-mechanistic-model.Rtex}. + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Results} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%%begin.rcode loadNoDispData +noDisp <- read.csv('data/Chapter2/noDispSims.csv', stringsAsFactors = FALSE) + +%%end.rcode + + +%%begin.rcode loadDispData + +# Read in the data. +# Later I'll add in the option to simulate the whole dataset. 
+dDisp <- read.csv('data/Chapter2/DispSims.csv', stringsAsFactors = FALSE) +dim(dDisp) +head(dDisp) + +extraMidBeta <- read.csv('data/Chapter2/extraMidBeta.csv', stringsAsFactors = FALSE) + +%%end.rcode + + + + + +%%begin.rcode DispDataOrganise + +dDisp <- rbind(dDisp, extraMidBeta[, 1:11], noDisp[, 1:11]) + + +# Which simulations have an extinction + +dDisp$invasion <- dDisp$nPathogens - dDisp$nExtantDis == 0 + + + +# Number of simulations of each treatment +nDisp <- dDisp %>% + group_by(transmission, dispersal) %>% + dplyr::select(invasion) %>% + summarise(n()) + +nDisp + + +# Number of extinctions by treatment +invsDisp <- dDisp %>% + group_by(transmission, dispersal) %>% + dplyr::select(invasion) %>% + filter(invasion == TRUE) %>% + summarise(n()) + +invsDisp + + +propsDisp <- left_join(nDisp, invsDisp, by = c('dispersal', 'transmission')) + +names(propsDisp) <- c( 'transmission', 'dispersal', 'n', 'invasions') + +propsDisp$invasions[is.na(propsDisp$invasions)] <- 0 + +# Proportion of invasions in totals. 
+propsDisp$props <- propsDisp$invasions / propsDisp$n + +propsDisp + +%%end.rcode + + + + +%%begin.rcode DispPropTests + +# Then run proportion tests between population structures + + +DispGLM1 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.1, ], family = binomial)) +DispGLM2 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.2, ], family = binomial)) +DispGLM3 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.3, ], family = binomial)) +DispGLM4 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.4, ], family = binomial)) + +dDispNoZero <- dDisp %>% dplyr::filter(dispersal != 0) +Disp2GLM1 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.1, ], family = binomial)) +Disp2GLM2 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.2, ], family = binomial)) +Disp2GLM3 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.3, ], family = binomial)) +Disp2GLM4 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.4, ], family = binomial)) + + +%%end.rcode + + + +%%begin.rcode DispTransPropTests +##Finally run proportion tests between transmission rates + +#Finally run proportion tests between transmission rates + +DispTransGLM1 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.001, ], family = binomial)) +DispTransGLM2 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.01, ], family = binomial)) +DispTransGLM3 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.1, ], family = binomial)) +DispTransGLM4 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0, ], family = binomial)) + +%%end.rcode + + + + +%%begin.rcode loadTopoData + +# Read in the data. +# Later I'll add in the option to simulate the whole dataset. 
+dTopo <- read.csv('data/Chapter2/TopoSims.csv', stringsAsFactors = FALSE) +dim(dTopo) +head(dTopo) + +dTopo <- rbind(dTopo, noDisp[, 1:11]) + +%%end.rcode + + +%%begin.rcode TopoDataOrganise + +# Which simulations have an extinction + +dTopo$invasion <- dTopo$nPathogens - dTopo$nExtantDis == 0 + +# Number of extinctions by treatment +invsTopo <- dTopo %>% + group_by(transmission, meanK) %>% + dplyr::select(invasion) %>% + filter(invasion == TRUE) %>% + summarise(n()) %>% + rbind(c(0.1, 1, 0), .) + +invsTopo + +# Number of simulations of each treatment +nTopo <- dTopo %>% + group_by(transmission, meanK) %>% + dplyr::select(invasion) %>% + summarise(n()) + +nTopo + + + +propsTopo <- left_join(nTopo, invsTopo, by = c('meanK', 'transmission')) + +names(propsTopo) <- c( 'transmission', 'meanK', 'n', 'invasions') + +propsTopo$invasions[is.na(propsTopo$invasions)] <- 0 + +# Proportion of invasions in totals. +propsTopo$props <- propsTopo$invasions / propsTopo$n + +propsTopo + +%%end.rcode + + + + +%%begin.rcode TopoPropTests + +# Then run proportion tests between population structures + +TopoTest1 <- fisher.test(cbind(propsTopo$invasions[1:3], propsTopo$n[1:3] - propsTopo$invasions[1:3])) +TopoTest2 <- fisher.test(cbind(propsTopo$invasions[4:6], propsTopo$n[4:6] - propsTopo$invasions[4:6])) +TopoTest3 <- fisher.test(cbind(propsTopo$invasions[7:9], propsTopo$n[7:9] - propsTopo$invasions[7:9])) +TopoTest4 <- fisher.test(cbind(propsTopo$invasions[10:12], propsTopo$n[10:12] - propsTopo$invasions[10:12])) + + + +Topo2Test1 <- fisher.test(cbind(propsTopo$invasions[2:3], propsTopo$n[2:3] - propsTopo$invasions[2:3])) +Topo2Test2 <- fisher.test(cbind(propsTopo$invasions[5:6], propsTopo$n[5:6] - propsTopo$invasions[5:6])) +Topo2Test3 <- fisher.test(cbind(propsTopo$invasions[8:9], propsTopo$n[8:9] - propsTopo$invasions[8:9])) +Topo2Test4 <- fisher.test(cbind(propsTopo$invasions[11:12], propsTopo$n[11:12] - propsTopo$invasions[11:12])) + +%%end.rcode + + + + +%%begin.rcode 
TopoTransPropTests +# Finally fit binomial GLMs of invasion against transmission rate, one per network topology + +#TopoTransTest1 <- fisher.test(cbind(propsTopo$invasions[c(1, 3, 5, 7)], propsTopo$n[c(1, 3, 5, 7)] - propsTopo$invasions[c(1, 3, 5, 7)])) +#TopoTransTest2 <- fisher.test(cbind(propsTopo$invasions[c(2, 4, 6, 8)], propsTopo$n[c(2, 4, 6, 8)] - propsTopo$invasions[c(2, 4, 6, 8)])) + + +TopoTransGLM1 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 9, ], family = binomial)) +TopoTransGLM2 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 2, ], family = binomial)) + +%%end.rcode + + +%%begin.rcode caption1String +# Just defining my caption label here to avoid the long string in chunk options below. + +invasionPropCaption <- sprintf(" + The probability of successful invasion for different A) dispersal rates and B) network topologies (with network topologies ``unconnected'', ``minimally connected'' and ``fully connected'' as in Figure~\\ref{f:net}). + Error bars are 95\\%% confidence intervals of probability of invasion. + %i simulations were run for each treatment except $\\beta = 0.2$ and $0.3$ in A), which have 150 per treatment. + Other parameters were kept constant at: $m = 10,\\, \\, \\mu = \\Lambda = 0.05,\\, \\gamma = 1,\\, \\alpha = 0.1$. + When dispersal is varied, the population structure is fully connected. 
+ When network topology is varied, $\\xi = 0.01$.", + as.integer(each)) + +invasionPropShort <- "The probability of invasion across different dispersal rates and network topologies" + + +%%end.rcode + + +%%begin.rcode invasionPropPlots, fig.lp = 'f:', fig.height = 2.6, out.width = "\\textwidth", fig.cap = invasionPropCaption, cache = FALSE, fig.scap = invasionPropShort + +propsDispCI <- data.frame(propsDisp[, 1:2], binom.confint(propsDisp$invasions, propsDisp$n, conf.level = 0.95, methods = "exact")) +propsDispCI <- propsDispCI %>% mutate(dispersal = replace(dispersal, dispersal == 0, 1e-4)) +propsDispCI <- propsDispCI %>% mutate(transFactor = factor(transmission)) + +dispPlot <- ggplot(propsDispCI, aes(x = dispersal, y = mean, colour = transFactor)) + + geom_point() + + geom_line() + + scale_x_log10(breaks = c(1e-4, 1e-3, 1e-2, 1e-1), labels = c('0', '0.001', '0.01', '0.1')) + + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + + scale_colour_poke(name = expression(beta), + pokemon = 'illumise', + spread = 4) + + xlab('Dispersal') + + ylab('Prop. Invasions') + + theme(legend.position = "none", panel.grid.major.x = element_blank()) + + +propsTopoCI <- data.frame(propsTopo[, 1:2], binom.confint(propsTopo$invasions, propsTopo$n, conf.level = 0.95, methods = "exact")) + +propsTopoCI$topo <- factor(propsTopoCI$meanK, levels = c(1, 2, 9)) +propsTopoCI$topoCont <- as.numeric(propsTopoCI$topo) +propsTopoCI <- propsTopoCI %>% mutate(transFactor = factor(transmission)) + + +topoPlot <- ggplot(propsTopoCI, aes(x = topoCont, y = mean, colour = transFactor)) + + geom_point() + + geom_line() + + scale_x_continuous(breaks = c(1, 2, 3), + labels = c('Unconn.','Min.','Full.'), + limits = c(0.9, 3.1)) + + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + + scale_colour_poke(name = expression(beta), + pokemon = 'illumise', + spread = 4) + + + xlab('Network Topology') + + ylab('Prop. 
Invasions') + + theme(legend.position = "none", panel.grid.major.x = element_blank()) + +# Extract the legend +grobs <- ggplotGrob(topoPlot + theme(legend.position="bottom"))$grobs +legend_b <- grobs[[which(sapply(grobs, function(x) x$name) == "guide-box")]] + + +ggdraw() + + draw_label("A)", 0.03, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(dispPlot, 0, 0.06, 0.5, 0.94) + + draw_label("B)", 0.53, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(topoPlot, 0.5, 0.06, 0.5, 0.94) + + draw_grob(legend_b, 0.32, 0.01, 0.4, 0.1) + + + + +%%end.rcode + + +\subsection{Dispersal} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +In the unstructured population, Pathogen 2 invaded in 100 out of 100 simulations. +This was true at all four transmission rates. + +%todo check formating of pvalues +When the $\xi = 0$ simulations were included, there was a positive relationship between dispersal rate and invasion probability for $\beta = 0.2, 0.3$ and $0.4$ (Figure~\ref{f:invasionPropPlots}A, Table~\ref{B-disp}). +These positive relationships were all significant (GLM. $\beta = 0.2$: $b$ = \rinline{DispGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\beta = 0.3$: $b$ = \rinline{DispGLM3$coefficients[2, 1]}, $p$ = \rinline{DispGLM3$coefficients[2, 4]}. $\beta = 0.4$: $b$ = \rinline{DispGLM4$coefficients[2, 1]}, $p$ = \rinline{DispGLM4$coefficients[2, 4]}.) +At $\beta = 0.1$ there was no significant relationship as invasion probability was very close to zero at all dispersal rates (GLM. $b$ = \rinline{DispGLM1$coefficients[2, 1]}, $p$ = \rinline{DispGLM1$coefficients[2, 4]}). + +However, when the $\xi = 0$ simulations were removed, this significant, positive relationship largely disappeared. +At $\beta = 0.2$, the significant positive relationship remained (GLM: $b~=~$\rinline{Disp2GLM2$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM2$coefficients[2, 4]}). 
+At all other transmission rates, the probability of invasion did not significantly change with dispersal rate (GLM. $\beta = 0.1$: $b$ = \rinline{Disp2GLM1$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM1$coefficients[2, 4]}. $\beta = 0.3$: $b$ = \rinline{Disp2GLM3$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM3$coefficients[2, 4])}. $\beta = 0.4$: $b$ = \rinline{Disp2GLM4$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM4$coefficients[2, 4])}.) + + +\subsection{Network topology} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +When the completely unconnected topology simulations were included, the probability of invasion was different across topologies for $\beta = 0.2, 0.3$ and $0.4$ (Fisher's exact test. $\beta = 0.2$: $p < 10^{-5}$. $\beta = 0.3$: $p < 10^{-5}$. $\beta = 0.4$: $p < 10^{-5}$). +In each case, the fully unconnected population had a lower probability of invasion than the minimally and completely connected topologies (Figure~\ref{f:invasionPropPlots}B, Table~\ref{B-topo}). +At $\beta = 0.1$ there was no significant difference ($p = \rinline{p(TopoTest1$p.value)}$) and the probability of invasion was close to zero for all topologies (Figure~\ref{f:invasionPropPlots}B). + +When the completely unconnected topology simulations were removed, there were no significant differences between topologies i.e.\ between the minimally and fully connected topologies (Figure~\ref{f:invasionPropPlots}B). +This was true at all transmission rates (Fisher's exact test. $\beta = 0.1$, $p = \rinline{sprintf('%.2f', Topo2Test1$p.value)}$. $\beta = 0.2$, $p = \rinline{p(Topo2Test2$p.value)}$. $\beta = 0.3$, $p = \rinline{p(Topo2Test3$p.value)}$. $\beta = 0.4$, $p = \rinline{p(Topo2Test4$p.value)}$). + + + +\subsection{Transmission} +%%%%%%%%%%%%%%%%%%%%%%%%%% + +Increasing the transmission rate increased the probability of invasion (Figure~\ref{f:invasionPropPlots}). +This was true for all four dispersal values (GLM. 
$\xi = 0$: $b$ = \rinline{DispTransGLM4$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.001$: $b$ = \rinline{DispTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.01$: $b$ = \rinline{DispTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.1$: $b$ = \rinline{DispTransGLM3$coefficients[2, 1]}, $p < 10^{-5}$) and both network structures (GLM. Minimally connected: $b$ = \rinline{TopoTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. Fully connected: $b$ = \rinline{TopoTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$). + + + + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Discussion}\label{s:sims1Disc} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\tmpsection{Restate the gap and the main result} + +I have used mechanistic, metapopulation models to test whether increased population structure can promote pathogen richness by facilitating invasion of new pathogens. +I found that dispersal does affect the ability of a new pathogen to invade and persist in a population. +I also found evidence that pathogen invasion was less likely in completely isolated colonies. +However, apart from the completely unconnected network, the topology of the metapopulation network did not affect invasion probability. +As the transmission rate increases, the system quickly reaches a state in which new pathogens always invade, as long as the metapopulation is not completely unconnected. +As the transmission rate decreases, the system quickly reaches a state in which invasion almost never succeeds. + +The result that increased population structure decreases pathogen richness supports many existing predictions that increasing $R_0$ should increase pathogen richness \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. 
+However, many comparative studies have found the opposite relationship, with increased population structure increasing pathogen richness (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +Furthermore, simple analytical models suggest that population structure should increase pathogen richness \cite{qiu2013vector, allen2004sis, nunes2006localized}, but I found no evidence of this. + + +\tmpsection{Link results to consequences} + +These results suggest that if population structure does in fact affect pathogen richness, as observed in comparative studies (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), it must occur by a mechanism other than the one studied here. +In this study the hypothesised mechanism for the relationship between population structure and pathogen richness was that the spread and persistence of a newly evolved pathogen would be facilitated in highly structured populations, as the lack of movement between colonies would stochastically create areas of low prevalence of the endemic pathogen. +If the invading pathogen evolved (i.e.\ was seeded) in one of these areas of low prevalence, invasion would be more likely. +Instead, reduced population structure allowed the new pathogen to quickly spread outside of the colony in which it evolved. +As the mechanism studied here cannot explain the relationship between population structure and pathogen richness seen in wild species (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), other mechanisms should be studied. +These include reduced competitive exclusion of already established pathogens or increased invasion of less closely related and less strongly competing pathogens, perhaps mediated by ecological competition of pathogens (i.e.\ reduction of the susceptible pool by disease-induced mortality). 
+Furthermore, single-pathogen dynamics could have an important role; for example, population structure could cause a much slower, asynchronous epidemic that prevents the build-up of herd immunity \cite{plowright2011urban}. + +I ran simulations of a completely unstructured population as a baseline comparison of pathogen invasion probability. +However, this unstructured population could also be considered one very large subpopulation or colony. +The fact that invasion occurred 100\% of the time in these simulations suggests that colony size has an important role in pathogen richness. +Therefore the interplay between population structure and colony size should be studied further, especially as colony size in bats varies widely, from ten to one million individuals \cite{jones2009pantheria}. + +My simulations also highlighted the importance of competition for the spread of a new pathogen. +All parameters used corresponded to pathogens with $R_0>1$ (as seen by the consistent spread of Pathogen 1). +However, the competition with the endemic pathogen meant that for some transmission rates the chance of epidemic spread and persistence of Pathogen 2 was close to zero. +This has implications for human epidemics as well: if there is strong competition between a newly evolved strain and an endemic strain, we are unlikely to see the new strain spread, regardless of population structure. + + + +\subsection{Model assumptions} + +\subsubsection{Complete cross-immunity} + +I have assumed that once recovered, individuals are immune to both pathogens. +Furthermore, when a coinfected individual recovers from one pathogen, it immediately recovers from the other as well. +This is probably a reasonable assumption given that I am modelling a newly evolved strain. +However, the rate of recovery from pathogens in the presence of coinfections has not been well studied. 
+In humans, the rate of recovery from respiratory syncytial virus was faster in individuals that had recently recovered from one of a number of co-circulating viruses \cite{munywoki2015influence}. +However, currently coinfected individuals recovered more slowly than average \cite{munywoki2015influence}. + +Further work could relax this assumption using a model similar to that of \cite{poletto2015characterising}, which contains additional classes for ``infected with Pathogen 1, immune to Pathogen 2'' and ``infected with Pathogen 2, immune to Pathogen 1''. +The model here was formulated such that the study of systems with more than two pathogens (an avenue for further study) is still computationally feasible. +A model such as that used in \cite{poletto2015characterising} contains $3^\rho$ classes for a system with $\rho$ pathogen species. +This quickly becomes computationally restrictive: ten pathogen species would already require $3^{10} = 59\,049$ classes. +It might be expected that there is an upper limit to the total number of pathogen species that can coexist in a population. +In particular, it is possible that once a certain number of species are endemic in a population, no more pathogens can invade into the population. +This has not been studied in the context of metapopulations. + +\subsubsection{Identical strains} + +Many papers on pathogen richness have focused on the evolution of pathogen traits and have considered a trade-off between transmission rate and virulence \cite{nowak1994superinfection} or infectious period \cite{poletto2013host}. +However, here I am interested in host traits. +Therefore I have assumed that pathogen strains are identical. +It is clear, however, that there are a number of factors that affect pathogen richness, and my focus on host population structure does not imply that pathogen traits are not important. 
+ +\subsubsection{Complex social structure and behaviour} + +With the models here I have aimed to tread a middle ground between the overly simplistic models employed in analytical studies \cite{allen2004sis} and the full complexity and variety of true bat social systems \cite{kerth2008causes}. +The factors that have not been modelled here include seasonal migration, maternity roosts, hibernation roosts and swarming sites \cite{kerth2008causes, fleming2003ecology, richter2008first, cryan2014continental}. +While future models might aim to model this complexity more fully, the number of parameters that must be estimated and varied becomes very large. +Furthermore, not all of these social complexities exist in all bat species, so in limiting my analysis to the simpler end of bat social systems it is hoped that the results are more broadly representative of the order. + +In addition, I have considered a single host species in isolation. +It seems likely that sympatry in bats and other mammals is epidemiologically important \cite{brierley2016quantifying, luis2013comparison, pilosof2015potential} but this was beyond the scope of this study. +There is potential for this to be effectively modelled as a multi-layered network \cite{wang2016structural, funk2010interacting} and this would be expected to act to reduce population structure. +Conversely, the case of interspecies roost sharing could be modelled as an additional layer of within-colony population structure, which would tend to increase population structure. + +Finally, many species of bat exhibit strong seasonal birth pulses which are known to affect disease dynamics \cite{hayman2015biannual,peel2014effect,amman2012seasonal}. +This would be expected to facilitate the invasion of new pathogen species; if a new strain evolved or entered the population by migration during a period of low population immunity, it would have a higher chance of invading and establishing in the population. 
+Again this was beyond the scope of this study, but birth pulses and their interactions with seasonally varying transmission rates are a useful area for further research. + +\subsection{Conclusions} + +In conclusion, I have found evidence that reduced population structure facilitates the invasion and establishment of newly evolved pathogen species. +However, the direction of this relationship is opposite to that found in wild species. +This suggests that if population structure does have a role in shaping pathogen communities, it is unlikely to be by this specific mechanism. + + + + + + From 2b883068deb0e991b486143564cb20223cd1e18e Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 16:59:27 +0100 Subject: [PATCH 14/17] Fix long url. --- Chapter2.Rtex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Chapter2.Rtex b/Chapter2.Rtex index 9424734..4603a82 100644 --- a/Chapter2.Rtex +++ b/Chapter2.Rtex @@ -1073,7 +1073,7 @@ I again ran 100 simulations for each parameter set. \subsection{Statistical analysis} - +\sloppy I used generalised linear models (GLMs) with a binary response variable, invasion or not, to test the hypothesis that probability of invasion increased with dispersal. I examined $p$ values and the regression coefficient, $b$, from each model. Separate GLMs were fitted for each transmission rate. @@ -1083,8 +1083,8 @@ As with the $\xi = 0$ results, these tests were performed both with and without Finally, I also used binomial GLMs to test the hypothesis that the probability of invasion increased with transmission rate. Separate GLMs were fitted for each dispersal rate and network topology. All statistical analyses were performed using the \emph{stats} package in \emph{R}. -The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/population-structure-affects-pathogen-richness-mechanistic-model.Rtex}. 
- +The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/pop-structure-path-richness-mechanistic.Rtex}. +\fussy %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% From 535e692e628866eb1c0dfc2f7cd2d79b4576f829 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 17:01:28 +0100 Subject: [PATCH 15/17] More messing with these file copies. --- ...t of the role of population structure.Rtex | 2707 ----------------- pop-structure-path-richness-mechanistic.Rtex | 1503 +++++++++ 2 files changed, 1503 insertions(+), 2707 deletions(-) delete mode 100644 comparative test of the role of population structure.Rtex create mode 100644 pop-structure-path-richness-mechanistic.Rtex diff --git a/comparative test of the role of population structure.Rtex b/comparative test of the role of population structure.Rtex deleted file mode 100644 index e77ad2c..0000000 --- a/comparative test of the role of population structure.Rtex +++ /dev/null @@ -1,2707 +0,0 @@ -%--------------------------------------------------------------------------------------------------------------------------------% -% Code and text for "A comparative test of the role of population structure in determining pathogen richness" -% Chapter 2 of thesis "The role of population structure and size in determining bat pathogen richness" -% by Tim CD Lucas -% -% NB This file is a copy due to the mess up with chapter numbers. -% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter3.Rtex -% -%---------------------------------------------------------------------------------------------------------------------------------% - - - - - -%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide', eval = TRUE - - -################################## -### Run web scraping? 
### -################################## - -# There's some slow webscrapping functions. Run them? -runPubmedScrape <- FALSE -runScholarScrape <- FALSE -runFstScrape <- FALSE - - -# Run slow bootstrapping? -subBoots <- FALSE -fstBoots <- FALSE -batclocksBoots <- FALSE - -# Run slow fst data wrangling as some is slow. -fstComb <- FALSE -runIucn <- FALSE - -# There are figures created in the data analysis which are not in the final chapter document. -# If TRUE, they will be included in the output. -# Use 'hide' to remove them. -extraFigs <- 'hide' - -#knitr options -opts_chunk$set(cache.path = '.Ch3Cache/') -source('misc/KnitrOptions.R') - -# ggplot2 theme. -source('misc/theme_tcdl.R') -theme_set(theme_grey() + theme_tcdl) - - -# Choose the number of cores to use -nCores <- 4 - -%%end.rcode - - -%%begin.rcode libs, cache = FALSE, result = FALSE - -# Data handling -library(dplyr) -library(broom) -library(readxl) -library(sqldf) -library(reshape2) - -# phylogenetic regression -library(ape) -library(caper) -library(phytools) -library(nlme) -library(qpcR) -library(car) - -# weighted means + var -library(Hmisc) - -# Plotting -library(ggplot2) -library(ggtree) -library(palettetown) -library(ggthemes) -library(GGally) -library(cowplot) - - -# Web scraping. -library(rvest) - -# For synonym list -library(taxize) - -# Spatial analysis -library(maptools) -library(geosphere) - -# Parllel computation -library(parallel) - -%%end.rcode - - - -%%begin.rcode parameters - - -# Define some parameters. -# This is useful at the top so that it can go in text. - -# How many bootstraps for model selection NULL variable -nBoots <- 50 - -# What proportion of a species range should be covered for an Fst study to count as valid. -rangeUseable <- 0.20 - -%%end.rcode - -\section{Abstract} - - -%\tmpsection{One or two sentences providing a basic introduction to the field} -% comprehensible to a scientist in any discipline. 
-\lettr{Z}oonotic diseases make up the majority of human infectious diseases and are a major drain on healthcare resources and economies. -Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens, all else being equal. -However, the factors that influence pathogen richness in animal species are poorly understood. -% -% -%\tmpsection{Two to three sentences of more detailed background} -% comprehensible to scientists in related disciplines. -% Theory led. -The pattern of contacts between individuals (i.e.\ population structure) can be influenced by habitat fragmentation, sociality and dispersal behaviour. -Epidemiological theory suggests that increased population structure can promote pathogen richness by reducing competition between pathogen species. -Conversely, it is often assumed that as greater population structure slows the spread of a new pathogen (i.e.\ lowers $R_0$), less structured populations should have greater pathogen richness. -% -% -%\tmpsection{One sentence clearly stating the general problem (the gap)} -% being addressed by this particular study. -Previous comparative studies comparing pathogen richness and population structure measured population structure differently and have had contradictory results, complicating the interpretation. -% -% -%\tmpsection{One sentence summarising the main result} -% (with the words “here we show” or their equivalent). -Here I test whether increased population structure correlates with viral richness using comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny. -This is an indirect test between the two competing hypotheses: does increased population structure allow pathogen coexistence by reducing competition, or does increased population structure decrease $R_0$ and therefore cause fewer new pathogens to enter the population. 
-Bats, as a group, make a useful case study because they have been associated with a number of important, recent zoonotic outbreaks. -Unlike previous studies, I used two measures of population structure: the number of subspecies and effective levels of gene flow. -I find that both measures are positively associated with pathogen richness. -% -% -%\tmpsection{Two or three sentences explaining what the main result reveals in direct comparison to what was thoughts to be the case previously} -% or how the main result adds to previous knowledge -My results add more robust support to the hypothesis that increased population structure promotes viral richness in bats. -The results support the prediction that increased population structure allows greater pathogen richness by reducing competition between pathogens -The prediction that factors that decrease $R_0$ should decrease pathogen richness is not supported. -% -% -%\tmpsection{One or two sentences to put the results into a more general context.} -Although my analysis implies that increased population structure does promote pathogen richness in bats, the weakness of the relationship and the difficulty in obtaining some measurements means that this is probably not a useful, predictive factor on its own for optimising zoonotic surveillance. -%However, the relationship has implications for global change, implying that increased habitat fragmentation might promote greater viral richness in bats. 
- - - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\section{Introduction} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -%#the introduction is not bad and starts very well but i think you need a bit more from studies of other mammals (not bats) to put the study into context as well as explaining why particularly you focus on pop structure, some justification of why bats, and less detail about the specific Fst measures (move to methods) and more stuff on your actual methods and approach you use in this study. - -%#Structure could be: -%#1. Zoonotic disease is bad (as you have written it already) -%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. However, pop structure (explain what this means) is of particular interest because of blah blah. -%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition -%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). -%#5. Bats are very interesting in this regard because of blah -%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny. -%#7. 
Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness_ and two measures of population structure. -%#8. I found ... - -\tmpsection{General Intro} - -%#1. Zoonotic disease is bad (as you have written it already) -Zoonotic pathogens make up the majority of newly emerging diseases and have profound consequences for public health, economics and international development \cite{jones2008global, smith2014global, ebolaWorldbank}. -Better statistical models for predicting which wild host species are potential reservoirs of zoonotic diseases would allow us to optimise zoonotic disease surveillance and anticipate how the risks of disease spillover might change with global change. -The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity and interactions with humans, the prevalence of its pathogens and the number of pathogen species it carries \cite{wolfe2000deforestation}. -However, the factors that control the number of pathogen species a host species carries remain poorly understood. - - -\tmpsection{Specific Intro} - -%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. -\tmpsection{Theoretical background} - - -A number of species traits that might control pathogen richness have been studied. -These traits can be at the level of the individual (e.g., body mass and longevity) or the level of the population (e.g., population density, sociality and species range size). 
-Large bodied animals have been shown to have high pathogen richness with large bodies providing more resources for pathogens \cite{kamiya2014determines, arneberg2002host, poulin1995phylogeny, bordes2008bat, luis2013comparison}. -Long lived species are expected to have high pathogen richness because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}. -Animal density \cite{kamiya2014determines, nunn2003comparative, arneberg2002host} and sociality \cite{bordes2007rodent, vitone2004body, altizer2003social, ezenwa2006host} are both predicted to increase pathogen richness by increasing the rate of spread, $R_0$, of a new pathogen. -Finally, widely distributed species have high pathogen richness, potentially because they experience a wider range of environments or because they are sympatric with more species \cite{kamiya2014determines, nunn2003comparative, luis2013comparison}. - -%# However, pop structure (explain what this means) is of particular interest because of blah blah. - -%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition - - -A further population level factor that may affect pathogen richness is population structure. -Population structure can be defined as the extent to which interactions between individuals in a population are non-random. -The role of population structure on human epidemics has been studied in depth and it has been shown that decreased population structure increases the speed of pathogen spread and makes establishment of a new pathogen more likely \cite{colizza2007invasion, vespignani2008reaction}. 
-In comparative studies of pathogen richness in wild animals, this relationship with $R_0$ is often taken as a prediction that decreased population structure will increase pathogen richness relative to other host species \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. -However, epidemiological models of highly virulent pathogens have shown that increased population structure can allow persistence of a pathogen where a well-mixed population would experience a single, large epidemic followed by pathogen extinction \cite{blackwood2013resolving, plowright2011urban}. -Furthermore, the assumption that high $R_0$ leads to high pathogen richness ignores inter-pathogen competition. -Simple epidemiological models of competition between multiple pathogens show that, in completely unstructured populations, a competitive exclusion process occurs but that adding population structure makes coexistence possible \cite{qiu2013vector, allen2004sis, nunes2006localized}. - - -\tmpsection{Previous Studies} - -%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa). - -There is a lack of large, comparative studies of the role of population structure on pathogen richness. -Sociality, which is one constituent part of population structure, has been well studied. -However, in primates only a weak positive association between sociality and pathogen richness was found \cite{vitone2004body}. -Furthermore, a negative association was found in rodents \cite{bordes2007rodent} and in even and odd-toed hoofed mammals \cite{ezenwa2006host}. -Finally, two studies tested for an association between group size and parasite richness in bats \cite{bordes2008bat, gay2014parasite}. -Amongst 138 bat species, \textcite{bordes2008bat} found no relationship between group size (coded into four classes) and bat fly species richness. 
-\textcite{gay2014parasite} found a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness.
-While sociality is an important component of population structure, it does not fully capture how connected the population is globally.
-
-
-%#5. Bats are very interesting in this regard because of blah
-
-%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny.
-
-
-Three studies have used comparative data to test for an association between global population structure and viral richness in bats.
-A study on 15 African bat species found a positive relationship between the extent of distribution fragmentation and viral richness \cite{maganga2014bat}.
-Conversely, a study on 20 South-East Asian bat species found the opposite relationship \cite{gay2014parasite}.
-These studies used the ratio between the perimeter and area of the species' geographic range as their measure of population structure.
-However, range maps are very coarse for many species.
-Furthermore, range maps are likely to be more detailed (and therefore have a greater perimeter) in well-studied species.
-
-A global study on 33 bat species found a positive relationship between $F_{ST}$, a measure of genetic structure, and viral richness \cite{turmelle2009correlates}.
-However, this study included measures based on mtDNA, which reflects only female dispersal; this may have biased the results, as many bat species show female philopatry \cite{kerth2002extreme, hulva2010mechanisms}.
-Furthermore, this study used measures of $F_{ST}$ irrespective of the spatial scale of the study, including studies covering distances from tens \cite{mccracken1981social} to thousands \cite{petit1999male} of kilometres.
-As isolation by distance has been shown in a number of bat species \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}, this could bias results further.
-Finally, when a global $F_{ST}$ value was not given, \textcite{turmelle2009correlates} used the mean of all pairwise $F_{ST}$ values between sites.
-This is not correct, as pairwise and global $F_{ST}$ values have different relationships with effective migration rates.
-
-
-
-\tmpsection{The gap}
-\tmpsection{What I did/found}
-
-%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure.
-%#8. I found ...
-
-Here I used a phylogenetic comparative approach to test for a relationship between increased population structure and pathogen richness in the largest study of bats to date.
-I used phylogenetic linear models, controlling for the other life history characteristics known to impact pathogen richness, to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure: the number of subspecies and effective gene flow.
-I used two measures of population structure to increase the robustness of the analysis; this is particularly important as previous studies have had contradictory results \cite{maganga2014bat, gay2014parasite, turmelle2009correlates}.
-
-I found that both measures of population structure are positively associated with viral richness and are included as explanatory variables in the best models for describing viral richness.
-Furthermore, I found that the role of phylogeny is very weak both in the models and in the distribution of viral richness amongst taxa.
-
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Methods}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-
-\subsection{Data Collection}
-
-\subsubsection{Pathogen richness}
-
-To measure pathogen richness I used data from \textcite{luis2013comparison}.
-These data simply record known infections of a bat species with a virus species.
-I have used viral richness as a proxy for pathogen richness more generally.
-Rows in which the host was not identified to species level according to \textcite{wilson2005mammal} were removed.
-Many viruses were not identified to species level or their specified species names were not in the ICTV virus taxonomy \cite{ICTV}.
-Therefore, I counted a virus only if it was the sole virus recorded for that host species within the lowest taxonomic level identified that is present in the ICTV taxonomy.
-For example, if a host is recorded as harbouring an unknown Paramyxoviridae virus, then it is logical to assume that the host carries at least one Paramyxoviridae virus.
-If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, it is hard to confirm that the unknown virus is not another record of the known virus.
-In this case, the host would be counted as having one virus species.
-
-
-%$F_{ST}$ studies are conducted at a range of spatial scales, but $F_{ST}$ often increases with distance studied \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}.
-%To minimise the effects of this I only used data from studies that cover \rinline{rangeUseable * 100}\% of the diameter of the species range.
-%This is a largely arbitrary value that could be considered to reflect a ``global'' estimate of $F_{ST}$ while keeping a reasonable number of data points available. -%I calculated the diameter of the species range by finding the furthest apart points in the IUCN species range \cite{iucn} even if the range is split into multiple polygons. -%The width covered by each study was the distance between the most distant sampling sites. -%When this was not explicit in the paper, the centre of the lowest level of geographic area was used. - - - - -%%begin.rcode luis2013virusRead - -#read in luis2013virus data -virus2 <- read.csv('data/Chapter3/luis2013comparison.csv', stringsAsFactors = FALSE) - - -virus2$binomial <- paste(virus2$host.genus, virus2$host.species) - - -# From methods -#Many viruses were not identified to species level or their identified species was not in the ICTV virus taxonomy \cite{ICTV}. -#I counted a virus if it was the only virus, for that host species, in the lowest taxonomic level identified in the ICTV taxonomy. -#That is, if a host carries an unknown Paramyxoviridae virus, then it must carry at least one Paramyxoviridae virus. -#If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, then it is hard to confirm that the unknown virus is not another record of the known virus. -#In this case, this would be counted as one virus species. - -# This has been implemented manually and indicated in the column `remove` - -virus2 <- virus2[!virus2$remove, ] - -%%end.rcode - - -%%begin.rcode wilsonReaderTaxonomyRead, fig.show = extraFigs, fig.cap = 'Histogram of number of subspecies' - -################################################################## -### Subspecies vs Viruses analysis. ### -################################################################## - - -# Read in the wilson Reader Taxonomy and use it to calculate the number of subspecies each bat species has. 
- -tax <- read.csv('data/Chapter3/msw3-all.csv', stringsAsFactors = FALSE) - -chir <- tax %>% - filter(Order == 'CHIROPTERA') - -# Save some memory. -rm(tax) - -# Count the number of subspecies each bat species has. -subs <- sqldf(' - SELECT Family, Genus, Species, COUNT(Subspecies) - AS NumberOfSubspecies - FROM chir - Where Species <> "" - GROUP BY Genus, Species - ') - - - -# I think each species has 1 row for species and extra rows for subspecies -# Check this is true. -# If that is correct, then Species with >1 NumberOfSubspecies should be one less. - -SpeciesRows <- sqldf(' - SELECT Genus, Species, COUNT(Subspecies) - AS SpeciesRows - FROM chir - WHERE Subspecies == "" AND Species <> "" - GROUP BY Genus, Species - ') - -# -(SpeciesRows$SpeciesRows != 1) %>% sum -all(SpeciesRows$SpeciesRows == 1) - -# Species with >1 NumberOfSubspecies should be one less -subs$NumberOfSubspecies <- ifelse(subs$NumberOfSubspecies > 1, - subs$NumberOfSubspecies - 1, - subs$NumberOfSubspecies) - -# Quick look at species with highest number of subspecies. -subs[order(subs$NumberOfSubspecies, decreasing = TRUE ),] %>% head - -# Megaderma spasma is top. It's widespread across south east asia islands. -# So this makes sense. - -# Quick look at the number of subspecies. -ggplot(subs, aes(x = NumberOfSubspecies)) + - geom_histogram(binwidth = 2) + - xlab('Number of Subspecies') + - ylab('Count') - - -# Create a combined binomial name column -subs$binomial <- paste(subs$Genus, subs$Species) - - - - -# Check overlap of datasets. -sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) - -notInTax <- (virus2$binomial[virus2$host.species != ''])[!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)] - -# Run this to find synonyms of names not in Wilson and Reeder -# Doesn't find much of use. -# syns <- synonyms(notInTax, db = 'itis') - -# Clean some names -# As taxize::synonyms didn't find most of them, I am using IUCN. 
-# And checking that the IUCN name is then in The Wilson & Reeder taxonomy - -virus2$binomial[virus2$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' -virus2$binomial[virus2$binomial == 'Tadarida pumila'] <- 'Chaerephon pumilus' -virus2$binomial[virus2$binomial == 'Tadarida condylura'] <- 'Mops condylurus' -virus2$binomial[virus2$binomial == 'Rhinolophus hildebrandti'] <- 'Rhinolophus hildebrandtii' -# Rhinolophus horsfeldi: I can't find this species anywhere. Will exclude. -# Possibly Megaderma spasma according to http://www.fao.org/3/a-i2407e.pdf -virus2$binomial[virus2$binomial == 'Tadarida plicata'] <- 'Chaerephon plicatus' -virus2$binomial[virus2$binomial == 'Artibeus planirostris'] <- 'Artibeus jamaicensis' - -sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)) - -%%end.rcode - -%%begin.rcode subsHistsByFam, fig.show = extraFigs, fig.height = 3, fig.cap = 'Histograms of number of subspecies for the families with many species.' - -# Compare the histograms of numbers of subspecies over the families with many species. -subs %>% - filter(Family %in% names(which(table(subs$Family) > 99))) %>% - ggplot(., aes(x = NumberOfSubspecies, y = ..density..)) + - geom_histogram() + - facet_grid(. ~ Family) + - xlab('Number of Subspecies') + - ylab('Density') - -%%end.rcode - -%%begin.rcode, subvsvirusCaption - -# Caption for subspecies vs n. viruses plot. -subvsvirus <- ' -Number of viruses against number of subspecies. -Points are coloured by family, with families with less than 10 species being grouped into "other". -Contours show the 2D density of points and suggest a positive correlation. 
-' -subvsvirusTitle <- 'Number of viruses against number of subspecies' -%%end.rcode - -%%begin.rcode subsDataFrame, fig.show = extraFigs, fig.cap = subvsvirus, fig.scap = subvsvirusTitle, out.width = '\\textwidth' -# create combined dataframe - -# Join dataframes -species <- sqldf(" - SELECT subs.binomial, virus2.[virus.species] - FROM subs - INNER JOIN virus2 - ON subs.binomial=virus2.binomial; - ") - -# Count number of virus species for each bat species -nSpecies <- species %>% - unique %>% - group_by(binomial) %>% - summarise(virusSpecies = n()) - -# Add other Subspecies data. -nSpecies <- sqldf(" - SELECT nSpecies.binomial, virusSpecies, NumberOfSubspecies, Genus, Family - FROM nSpecies - LEFT JOIN subs - ON nSpecies.binomial=subs.binomial - ") - -# Create another column to make plotting easier. -# Group families with few rows into 'other' - -nSpecies$familyPlotCol <- nSpecies$Family -nSpecies$familyPlotCol[ - nSpecies$Family %in% names(which(table(nSpecies$Family) < 10))] <- 'Other' - -table(nSpecies$familyPlotCol) - -ggplot(nSpecies, aes(x = log(NumberOfSubspecies), y = log(virusSpecies))) + - # geom_smooth(method = 'lm') + - geom_jitter(aes(colour = familyPlotCol), size = 2.5, alpha = 0.8, - position = position_jitter(width = .1, height = .1)) + - scale_colour_hc() + - geom_density2d() + - labs(colour = 'Family') - -%%end.rcode - -%%begin.rcode virusHist, fig.show = extraFigs, fig.cap = 'Histogram of known viruses per species' - -ggplot(nSpecies, aes(x = virusSpecies)) + - geom_histogram() - -%%end.rcode - - - - -%%begin.rcode euthRead - -# Read in pantheria data base -pantheria <- read.table(file = 'data/Chapter3/PanTHERIA_1-0_WR05_Aug2008.txt', - header = TRUE, sep = "\t", na.strings = c("-999", "-999.00")) - -mass <- sqldf(" - SELECT [X5.1_AdultBodyMass_g] - FROM nSpecies - LEFT JOIN pantheria - ON nSpecies.binomial=pantheria.MSW05_Binomial - ") - -nSpecies$mass <- mass[, 1] - -# Now add additional mass estimates. 
- -additionalMass <- read.csv('data/Chapter3/AdditionalBodyMass.csv', stringsAsFactors = FALSE) -meanAdditionalMass <- additionalMass %>% - group_by(binomial) %>% - summarise(mass = mean(Body.Mass.grams)) - -nSpecies$mass[ - sapply(meanAdditionalMass$binomial, function(x) which(nSpecies$binomial == x)) - ] <- meanAdditionalMass$mass - - -%%end.rcode - - - -%%begin.rcode IUCNranges, eval = runIucn - -# Read in iucn ranges and calculate range sizes for each species. -ranges <- readShapePoly('data/Chapter3/TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS.shp') - -ranges <- ranges[ranges$order_name == 'CHIROPTERA', ] - -levels(ranges$binomial) <- c(levels(ranges$binomial), 'Myotis ricketti') -ranges$binomial[ranges$binomial == 'Myotis pilosus'] <- 'Myotis ricketti' - - - - -nSpecies$binomial[!(nSpecies$binomial %in% ranges$binomial)] - -findArea <- function(name){ - #cat(name) - A <- areaPolygon(ranges[ranges$binomial == name, ]) - sum(A) -} - -iucnDistr <- sapply(nSpecies$binomial, findArea) - -write.csv(iucnDistr, 'data/Chapter3/iucnDistr.csv') - -%%end.rcode - -%%begin.rcode readIucnIn - -iucnDistr <- read.csv('data/Chapter3/iucnDistr.csv', row.names = 1) - -nSpecies$distrSize <- iucnDistr$x - -%%end.rcode - - - -%%begin.rcode pubmedScrapeFunc - -# Scrape from pubmed - -scrapePub <- function(sp){ - - Sys.sleep(2) - - # Initialise refs - refs <- NA - - # Find synonyms from taxize - syns <- synonyms(sp, db = 'itis') - if(NROW(syns[[1]]) == 1){ - spString <- tolower(gsub(' ', '%20', sp)) - } else { - spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22') - } - - - url <- paste0('http://www.ncbi.nlm.nih.gov/pubmed/?term=%22', spString, '%22') - - - page <- html(url) - - # Test if exact phrase was found. 
-  phraseFound <- try(page %>%
-    html_node('.icon') %>%
-    html_text() %>%
-    grepl("The following term was not found in PubMed:", .), silent = TRUE)
-
-  # If PubMed reports that the phrase was not found, leave refs as NA.
-  if (is.logical(phraseFound)) {
-    if (phraseFound) refs <- NA
-  }
-  # If the check itself failed, try to read the result count instead.
-  if (!is.logical(phraseFound)) {
-    try({
-      refs <- page %>%
-        html_node('.result_count') %>%
-        html_text() %>%
-        strsplit(' ') %>%
-        .[[1]] %>%
-        .[length(.)] %>%
-        as.numeric()
-    })
-  }
-
-  return(refs)
-}
-
-
-%%end.rcode
-
-
-%%begin.rcode pubmedScrape, eval = runPubmedScrape
-
-# Create empty vector
-pubmedRefs <- rep(NA, nrow(nSpecies))
-
-for(i in 1:NROW(nSpecies)){
-  pubmedRefs[i] <- scrapePub(nSpecies$binomial[i])
-}
-
-pubmedScrapeDate <- Sys.Date()
-
-pubmedRefs <- cbind(binomial = nSpecies$binomial, pubmedRefs = pubmedRefs)
-
-# Write out.
-write.csv(pubmedRefs, file = 'data/Chapter3/pubmedRefs.csv')
-
-%%end.rcode
-
-
-
-
-%%begin.rcode pubmedRead
-
-
-pubmedRefs <- read.csv('data/Chapter3/pubmedRefs.csv', stringsAsFactors = FALSE, row.names = 1)
-
-# Function returns NA for none found. Change that to a zero.
-pubmedRefs$pubmedRefs[is.na(pubmedRefs$pubmedRefs)] <- 0
-nSpecies$pubmedRefs <- pubmedRefs$pubmedRefs
-
-%%end.rcode
-
-%%begin.rcode scholarScrapeFunc
-
-scrapeScholar <- function(sp){
-
-  wait <- rnorm(1, 120, 2)
-  Sys.sleep(wait)
-
-  # Initialise refs so that a failed scrape returns NA instead of raising
-  # "object 'refs' not found" at return().
-  refs <- NA
-
-  syns <- synonyms(sp, db = 'itis')
-  if(NROW(syns[[1]]) == 1){
-    spString <- tolower(gsub(' ', '%20', sp))
-  } else {
-    spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22')
-  }
-
-  url <- paste0('https://scholar.google.co.uk/scholar?hl=en&q=%22',
-    spString, '%22&btnG=&as_sdt=1%2C5&as_sdtp=')
-
-
-  page <- html(url)
-
-  try({
-    refs <- page %>%
-      html_node('#gs_ab_md') %>%
-      html_text() %>%
-      gsub('About\\s(.*)\\sresults.*', '\\1', .) %>%
-      gsub(',', '', .)
%>% - as.numeric - }) - return(refs) -} - -%%end.rcode - -%%begin.rcode scholarScrape, eval = runScholarScrape - -# Create empty vector -scholarRefs <- rep(NA, nrow(nSpecies)) - -for(i in 1:NROW(nSpecies)){ - scholarRefs[i] <- scrapeScholar(nSpecies$binomial[i]) -} - -scholarScrapeDate <- Sys.Date() - -scholarRefs <- cbind(binomial = nSpecies$binomial, scholarRefs = scholarRefs) - -# Write out. -write.csv(scholarRefs, file = 'data/Chapter3/scholarRefs.csv') - -%%end.rcode - - - - -%%begin.rcode scholarRead - - -scholarRefs <- read.csv('data/Chapter3/scholarRefs.csv', stringsAsFactors = FALSE, row.names = 1) - -# Function returns NA for none found. Change that to a zero. -scholarRefs$scholarRefs[is.na(scholarRefs$scholarRefs)] <- 0 - -nSpecies$scholarRefs <- sqldf(' - SELECT scholarRefs - FROM nSpecies - INNER JOIN scholarRefs - ON scholarRefs.binomial=nSpecies.binomial - ' - ) %>% - .$scholarRefs - -%%end.rcode - - - - - - - -%%begin.rcode subsRemoveNAs - -# Remove missing data and sort out the data frame a little. - -nSpecies <- nSpecies[complete.cases(nSpecies), ] - -# Add number of subspecies as a factor. Might help plotting. 
-nSpecies$SubspeciesFactor <- factor(nSpecies$NumberOfSubspecies, - levels = as.character(1:max(nSpecies$NumberOfSubspecies))) - -# Rownames to species names -rownames(nSpecies) <- nSpecies$binomial - -%%end.rcode - - - -%%begin.rcode savenSpecies -######################################################## -### At this point, nSpecies should be in final form ### -######################################################## - -write.csv(nSpecies, file = 'data/Chapter3/nSpecies.csv') - -%%end.rcode - - - -%%begin.rcode treeRead - -# Read in trees -t <- read.nexus('data/Chapter3/fritz2009geographical.tre') - -# Select best supported tree -tr1 <- t[[1]] - -# Make names match previous names -tr1$tip.label <- gsub('_', ' ', tr1$tip.label) - -# Which tips are not needed -unneededTips <- tr1$tip.label[!(tr1$tip.label %in% nSpecies$binomial)] - -# Prune tree down to only needed tips. -pruneTree <- drop.tip(tr1, unneededTips) - -rm(t) - -%%end.rcode - -%%begin.rcode nSpeciesTreePlot, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.show = extraFigs - -# Plot tree -p <- ggtree(pruneTree, layout = 'fan') - -p %<+% nSpecies[, 1:6] + - geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + - scale_size(range = c(0.2, 2)) + - scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + - theme_tcdl + - theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + - theme(legend.position = 'right') + - labs(size = 'Virus Richness') + - theme(legend.key.size = unit(0.6, "lines"), - legend.text = element_text(size = 6), - legend.title = element_text(size = 8)) - - -%%end.rcode - - - -%%begin.rcode scholarvspubmed, fig.show = extraFigs, fig.cap = 'Logged number of references on scholar and pubmed, with a fitted (unphylogenetic) linear model. Colours indicate family.' - -# Check how correlated pubmed and scholar are. 
- - -compSubspecies <- comparative.data(data = nSpecies, phy = pruneTree, names.col = 'binomial') - -citeCor <- pgls(log(scholarRefs) ~ log(pubmedRefs + 1), data = compSubspecies, lambda = 'ML') - -studyEffortCor <- summary(citeCor) -# And plot -ggplot(nSpecies, aes(x = scholarRefs, y = pubmedRefs + 1)) + - geom_point(aes(colour = familyPlotCol), size = 2.5) + - geom_smooth(method = 'lm') + - scale_x_log10() + - scale_y_log10() + - scale_colour_hc() - -%%end.rcode - -%%begin.rcode subsDataCapts -subsDataCapts <- c( -'Unlogged number of virus species against log mass with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', -'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', -'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added.', -'Virus species against study effort (log pubmed references +1)') -%%end.rcode - -%%begin.rcode subsDataviz, fig.show = extraFigs, fig.cap = subsDataCapts - -# A number of exploratory plots - -# Mass against viruses -ggplot(nSpecies, aes(log(mass), virusSpecies)) + - geom_point(aes(colour = familyPlotCol), size = 2.5) + - geom_smooth(method = 'lm')+ - labs(colour = 'Family') + - scale_colour_hc() - - - -# N Subspecies and against viruses -ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + - geom_jitter(aes(colour = familyPlotCol), size = 2.5, - position = position_jitter(width = .3, height = .3)) + - geom_smooth(method = 'lm')+ - labs(colour = 'Family') + - scale_colour_hc() - - -# Log(N Subspecies) and against viruses - -ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) + - geom_jitter(aes(colour = familyPlotCol), size = 2.5, - position = position_jitter(width = .05, height 
= .2)) + - scale_x_log10() + - geom_smooth(method = 'lm')+ - labs(colour = 'Family') + - scale_colour_hc() - - -# N. Subspecies against viruses as a boxplot to deal with overplotting. -ggplot(nSpecies, aes(SubspeciesFactor, virusSpecies)) + - geom_boxplot() + - scale_x_discrete(limits = levels(nSpecies$SubspeciesFactor), drop=FALSE) + - geom_smooth(method = 'lm', aes(group = 1)) + - xlab('# subspecies') - - -# Study effort against virusSpecies -ggplot(nSpecies, aes(log(pubmedRefs + 1), virusSpecies)) + - geom_jitter(aes(colour = familyPlotCol), size = 2.5, - position = position_jitter(width = .1, height = .1)) + - geom_smooth(method = 'lm') + - labs(colour = 'Family')+ - scale_colour_hc() - - -# Distribution size aginst virus - - -ggplot(nSpecies, aes(distrSize, virusSpecies)) + - geom_point(aes(colour = familyPlotCol), size = 2.5) + - geom_smooth(method = 'lm') + - labs(colour = 'Family') + - scale_colour_hc() + - scale_x_log10() - - -# Correlation plot -nSpecies %>% - dplyr::select(virusSpecies, NumberOfSubspecies, mass, distrSize, pubmedRefs, scholarRefs) %>% - mutate(mass = log(mass), distrSize = log(distrSize), pubmedRefs = log(pubmedRefs + 1), scholarRefs = log(scholarRefs)) %>% - ggpairs(.) - -%%end.rcode - - - -%%begin.rcode, subsAnalysis, fig.show = extraFigs - -################################################################################## -## N Virus ~ subs + log(cites + mass) - -subspeciesJointUnlog <- pgls( - virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass), - data = compSubspecies, lambda = 'ML') - - - -## N Virus ~ subs + log(cites + mass) + subs*log(cites) - -subspeciesInter <- pgls( - virusSpecies ~ log(mass) + - NumberOfSubspecies*log(scholarRefs), - data = compSubspecies, lambda = 'ML') - -#subInter.summary <- summary(subspeciesInter) - - - - -## Look at Variance inflation factors. -## Couple of help messages imply lm vif is fine. 
- -#sqrt(vif(lm(virusSpecies ~ log(scholarRefs) + NumberOfSubspecies + log(mass) + log(distrSize), data = nSpecies))) - -%%end.rcode - - - - - - - - - -%%begin.rcode ITanalysis - -varList <- c('scholarRefs', 'NumberOfSubspecies', 'mass', 'distrSize', 'rand') - -findCombs <- function(k, vars, longest){ - x <- t(combn(vars, k)) - nas <- matrix(NA, ncol = longest - NCOL(x), nrow = nrow(x)) - mat <- cbind(x, nas) - return(mat) -} - -modelList <- lapply(0:5, function(k) findCombs(k, varList, 6)) -modelMat <- do.call(rbind, modelList) - -interMat <- modelMat[apply(modelMat, 1, function(x) "scholarRefs" %in% x & "NumberOfSubspecies" %in% x), ] -interMat[, 2:5] <- interMat[, 1:4] -interMat[, 1] <- "scholarRefs:NumberOfSubspecies" - -allModelMat <- rbind(modelMat, interMat) - - -allFormulae <- apply(allModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) - -allFormulae <- c(as.formula('virusSpecies ~ 1'), allFormulae) - - - -modelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ - - set.seed(paste0('123', boot)) - bootData <- cbind(data, rand = runif(nrow(data))) - - # log some predictors - bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) - - # scale - bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')]) - - coefs <- matrix(NA, ncol = length(varList) + 2, nrow = nrow(allModelMat), - dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList, 'scholarRefs:NumberOfSubspecies')))) - - results <- apply(allModelMat, 1, function(x) sapply(c(varList, "scholarRefs:NumberOfSubspecies"), function(y) y %in% x)) %>% - t %>% - data.frame %>% - cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) - - # Fit each model - # I'm having problems with convergence so sometimes have to try different starting values. 
- for(m in 1:length(allForm)){ - if(exists('model')){ - rm(model) - } - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 1 - }) - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 2 - }) - } - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 3 - }) - } - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 4 - }) - } - if(!exists('model')){ - try({ - model <- lm(allForm[[m]], data = bootData) - results$attempt[m] <- 5 - message('Running lm') - }) - } - #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') - results$AIC[m] <- AICc(model) - - if(inherits(model, 'gls')){ - results$lambda[m] <- model$modelStruct$corStruct[1] - } - - results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] - - - results[m, paste0('beta.', names(coef(model)))] <- coef(model) - - message(paste('Boot:', boot, ', m:', m, '\n')) - } - - results$dAIC <- results$AIC - min(results$AIC) - results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) - - - return(results) - -} - - - - -%%end.rcode - -%%begin.rcode modelSelectBoots, eval = subBoots - -fitModelsBootStrap <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies, pruneTree, b, allModelMat, varList), mc.cores = nCores) - -allResults <- do.call(rbind, fitModelsBootStrap) - -write.csv(allResults, file = 'data/Chapter3/modelSelectSubspecies.csv') - - -%%end.rcode - -%%begin.rcode analyseModelSelect, fig.show = extraFigs - -allResults <- read.csv('data/Chapter3/modelSelectSubspecies.csv', row.names = 1) - -#varWeights <- sapply(names(allResults)[1:6], function(x) 
sum(allResults$weight[allResults[, x]])/nBoots) - -sepVarWeights <- lapply(1:nBoots, function(b) - sapply(names(allResults)[1:6], - function(x) - sum(allResults[allResults$boot == b, 'weight'][allResults[allResults$boot == b, x]]) - ) - ) - -sepVarWeights <- do.call(rbind, sepVarWeights) %>% - data.frame(., boot = 1:nBoots) %>% - reshape2::melt(., value.name = 'estimate', id.vars = 'boot') - -sepVarWeights$col <- 'Other Variables' -sepVarWeights$col[grep('NumberOf', sepVarWeights$variable)] <- 'Population Structure' -sepVarWeights$col[sepVarWeights$variable == 'rand'] <- 'Null' - - - -modelWeights <- allResults %>% - group_by(predictors) %>% - summarise(AICc = mean(AIC)) %>% - mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% - arrange(desc(modelWeight)) %>% - mutate(cumulativeWeight = cumsum(modelWeight)) %>% - mutate(string = predictors) - - -# Calculate variable weights based on mean(AIC) rather than raw AIC. -varWeights <- sapply(names(allResults)[1:6], - function(x) sum(modelWeights$modelWeight[grep(x, as.character(modelWeights$predictors))])) - - - -allResults %>% - filter(rand, !`scholarRefs.NumberOfSubspecies`, NumberOfSubspecies) %>% -ggplot(., aes(x = lambda, colour = predictors)) + - geom_density() + - scale_colour_hc() - -ggplot(allResults, aes(x = lambda)) + - geom_density() - -allResults %>% - filter(boot == 1) %>% - dplyr::select(predictors, lambda) - -%%end.rcode - - - -%%begin.rcode ITPlots - -# reorder factors to get structure vars at beginning. 
-sepVarWeights$variable <- factor(sepVarWeights$variable, levels(sepVarWeights$variable)[c(2, 6, 1, 3, 4, 5)])
-
-ITPlot <- ggplot(sepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) +
-  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) +
-  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
-  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
-  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
-        panel.grid.major.x = element_blank(),
-        axis.text.y = element_text(size = 8)) +
-  scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) +
-  # Limits are set here: a separate ylim() call would replace this scale and discard the breaks and labels.
-  scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1), limits = c(0, 1)) +
-  ylab('P(in best model)') +
-  xlab('')
-
-
-%%end.rcode
-
-%%begin.rcode nSpeciesCoef, fig.show = extraFigs
-
-# Map the coefficient column itself; a quoted name would be treated as a constant string.
-ggplot(allResults, aes(x = beta.NumberOfSubspecies, colour = scholarRefs)) +
-  geom_density()
-
-
-
-mean(allResults$NumberOfSubspecies, na.rm = TRUE)
-
-
-varCoefMeans <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.mean(x, allResults$weight, na.rm = TRUE))
-varCoefVars <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.var(x, allResults$weight, na.rm = TRUE))
-
-nSpeciesCoefMean <- wtd.mean(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies],
-  allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
-nSpeciesCoefMeanI <- wtd.mean(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies],
-  allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
-nSpeciesInterMean <- wtd.mean(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE)
-
-
-nSpeciesCoefVar <-
wtd.var(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], - allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) -nSpeciesCoefVarI <- wtd.var(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], - allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE) -nSpeciesInterVar <- wtd.var(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE) - - - -# Direction of interaction models - -min(nSpecies$NumberOfSubspecies) - -max(nSpecies$NumberOfSubspecies) - -# At minimum study effort -nSpeciesInterMean*log(min(nSpecies$scholarRefs)) + nSpeciesCoefMeanI -nSpeciesInterMean*log(max(nSpecies$scholarRefs)) + nSpeciesCoefMeanI -nSpeciesInterMean*log(median(nSpecies$scholarRefs)) + nSpeciesCoefMeanI - -mean(nSpeciesInterMean*log(nSpecies$scholarRefs) + nSpeciesCoefMeanI > 0) - - - -%%end.rcode - - - -%%begin.rcode familyMeans - -familyMeans <- nSpecies %>% - group_by(Family) %>% - summarise(mean = mean(virusSpecies), n = n()) - -%%end.rcode - - -%%begin.rcode univariatePGLS - -#orderedNSpecies <- nSpecies[sapply(pruneTree$tip.label, function(x) which(nSpecies$binomial == x)),] - - -sspLambda <- summary(pgls(NumberOfSubspecies ~ 1, data = compSubspecies, lambda = 'ML')) -massLambda <- summary(pgls(log(mass) ~ 1, data = compSubspecies, lambda = 'ML')) -scholarLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compSubspecies, lambda = 'ML')) -virusLambda <- summary(pgls(virusSpecies ~ 1, data = compSubspecies, lambda = 'ML')) -distrLambda <- summary(pgls(log(distrSize) ~ 1, data = compSubspecies, lambda = 'ML')) - -sspUni <- summary(pgls(virusSpecies ~ NumberOfSubspecies, data = compSubspecies, lambda = 'ML')) - - -%%end.rcode - - - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%%%% FST ANALYSIS %%%%%%%% 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -%%begin.rcode fstRead, eval = fstComb - -# Read in Fst data. -# Then add extra columns needed. - -fst <- read.csv('data/Chapter3/FstDataCompData.csv') - -# Check overlap of datasets. -sum(!(fst$binomial %in% virus2$binomial[virus2$host.species != ''])) - -notInFst <- fst$binomial[!(fst$binomial %in% virus2$binomial)] -# lots of sp not in virus2. MAybe will include 0 virus species. Kinda makes sense. - - - -######################################################################################### -#### Get distribution size and width #### -######################################################################################### - - - - -fst$binomial[!(fst$binomial %in% ranges$binomial)] - -fst <- fst[(fst$binomial %in% ranges$binomial), ] - -unique(fst$binomial) %>% length - - - - -findAreaFst <- function(name){ - #cat(name) - A <- areaPolygon(ranges[ranges$binomial == as.character(name), ]) - sum(A) -} - -fstIucnDistr <- sapply(fst$binomial, findAreaFst) - - -fst$distrSize <- fstIucnDistr - - -#### Now get distribution width - -findWidth <- function(name){ - #print(name) - distr <- ranges[ranges$binomial == as.character(name), ] - - coords <- list() - # Get coordinates from all polygons into one matrix. - for(i in 1:length(distr@polygons)){ - coords[[i]] <- distr@polygons[[i]]@Polygons[[1]]@coords - } - coords <- do.call(rbind, coords) - - # Take the convex hull of coordinates to speed up last step. - hullCoords <- coords[chull(coords), ] - - maxDist <- max(apply(hullCoords, 1, function(x) distGeo(coords, x)))/1000 - return(maxDist) - -} - -# Calculate widest part of all species distributions. -# This is slow but also RAM heavy. -# 3 cores doesn't crash my computer with 16GB RAM. -rangeWidth <- mclapply(fst$binomial, findWidth, mc.cores = 3) %>% do.call(c, .) 
- -#rangeWidth <- sapply(fst$binomial, findWidth) - -fst$rangeWidth <- rangeWidth -fst$rangeCoverage <- fst$Dmax..km. / fst$rangeWidth - - - -fst$Useable <- fst$rangeCoverage > rangeUseable -sum(fst$Useable, na.rm = TRUE) -fst$binomial[fst$Useable] %>% unique %>% .[!is.na(.)] %>% length - -# Need to go back and check data but for now if fst$Useable is na, then it's FALSE (i.e.\ it's not a useable row) -fst$Useable[is.na(fst$Useable)] <- FALSE - - -%%end.rcode - - - -%%begin.rcode fstStudyEffort, eval = fstComb - -# First take what data we can from nSpecies analysis. -fstStudy <- sqldf(" - SELECT fst.binomial, nSpecies.scholarRefs, nSpecies.pubmedRefs - FROM fst - LEFT JOIN nSpecies - ON nSpecies.binomial=fst.binomial - ") - -%%end.rcode - -%%begin.rcode fstScrape, eval = runFstScrape - -######################################################## -#### Sloow bit that might get you blocked by google #### -######################################################## - -fstNewStudy <- fstStudy[is.na(fstStudy[,2]),1] %>% - lapply(., function(x) c(x, scrapeScholar(x), scrapePub(x))) %>% - do.call(rbind, .) - -names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') - -write.csv(fstNewStudy, file = 'data/Chapter3/fstScrape.csv') - -%%end.rcode - - -%%begin.rcode fstCombine, eval = fstComb - -fstNewStudy <- read.csv('data/Chapter3/fstScrape.csv', row.names = 1) -names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs') - -# NAs are from searches with 0 references. 
-fstNewStudy$pubmedRefs[is.na(fstNewStudy$pubmedRefs)] <- 0 - -whichRows <- lapply(fstNewStudy$binomial, function(x) which(fstStudy$binomial == x)) -for(i in 1:length(whichRows)){ - fstStudy[whichRows[[i]], 2:3] <- fstNewStudy[i, 2:3] -} - - -fst <- cbind(fst, fstStudy[, 2:3]) - -# Remove rows whose scale is too small -fst <- fst[fst$Useable, ] - - -# Don't want rows using mtDNA due to female biased dispersal -fst <- fst[fst$Marker != 'mtDNA', ] - -%%end.rcode - -%%begin.rcode convertFst, eval = fstComb - -calcNm <- function(Fst){ (1 - Fst)/(4 * Fst) } - - -fst$Nm <- calcNm(fst$Value) - - - -fst <- fst[!is.na(fst$Nm) & !(fst$Nm == Inf), ] - -fstFinal <- fst - -# Take means of species with multiple measurements - -fstFinal <- fstFinal[!duplicated(fstFinal$binomial), ] -fstFinal$Nm <- sapply(fstFinal$binomial, function(x) mean(fst$Nm[fst$binomial == x])) - -# Add number of viruses to fst dataset -# Includes zeros for species with no known viruses. - -fstFinal$virusSpecies <- sapply(fstFinal$binomial, function(x) sum(virus2$binomial == x)) - - - - -# Add mass data. 
- - -mass <- sqldf(" - SELECT [X5.1_AdultBodyMass_g] - FROM fstFinal - LEFT JOIN pantheria - ON fstFinal.binomial=pantheria.MSW05_Binomial - ") - -# Don't need pantheria data anymore -rm(pantheria) - -fstFinal$mass <- mass[, 1] - -fstFinal$mass[fstFinal$binomial == 'Myotis ricketti'] <- meanAdditionalMass$mass[meanAdditionalMass$binomial == 'Myotis ricketti'] - -fstFinal$mass[fstFinal$binomial == 'Myotis macropus'] <- 9.8 - -fstFinal <- fstFinal[!is.na(fstFinal$mass), ] - - -############################# -### fst data is finished ### -############################# - -write.csv(fstFinal, 'data/Chapter3/fstFinal.csv') -%%end.rcode - - -%%begin.rcode - -#### Read is full fstFinal dataframe - -fstFinal <- read.csv('data/Chapter3/fstFinal.csv', row.names = 1) - -%%end.rcode - -%%begin.rcode fstCors, fig.show = extraFigs - -fstFinal[, c('mass', 'scholarRefs', 'rangeWidth', 'Nm')] %>% - log %>% - cbind(virusSpecies = fstFinal$virusSpecies) %>% - ggpairs(.) - - -%%end.rcode - - - -%%begin.rcode compareNm, fig.show = extraFigs - -ggplot(fstFinal, aes(x = Marker, y = Nm)) + - geom_point() + - scale_y_log10() - -lm(fstFinal$Nm ~ fstFinal$Marker) %>% aov %>% summary - - -%%end.rcode - - -%%begin.rcode fstTree - -# Prune the tree for the fst data. - -# Which tips are not needed -fstUnneededTips <- tr1$tip.label[!(tr1$tip.label %in% fstFinal$binomial)] - -# Prune tree down to only needed tips. -fstTree <- drop.tip(tr1, fstUnneededTips) - - - -%%end.rcode - - -%%begin.rcode fstTreePlot, fig.show = extraFigs, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.height = 3.6 - -# Plot tree -p <- ggtree(fstTree) - - -fstFinal$lengthNames <- fstFinal$binomial %>% - as.character %>% - paste0(' ', .) 
- - -p %<+% fstFinal[, c('binomial', 'virusSpecies')] + - #geom_tiplab(family = 'lato light', align = FALSE) + - geom_text2(aes(x = x + 15, label = as.character(label), subset = isTip), - family = 'Lato light', hjust = 0, size = 3.3) + - #geom_text(aes(x = x + 15, label = as.character(label)), subset=.(isTip), - # family = 'Lato light', hjust = 0, size = 3.3) + - ggplot2::xlim(0, 210) + - theme_tcdl + - geom_point2(aes(x = x + 8, size = virusSpecies, subset = isTip)) + - scale_size(range = c(0, 4)) + - theme(legend.key.size = unit(0.8, "lines"), - legend.text = element_text(size = 9), - legend.title = element_text(size = 8), - legend.position = "right", - text = element_text(colour = 'darkgrey'), - legend.key = element_blank()) + - labs(size = 'Virus Richness') - - - -%%end.rcode - - - -%%begin.rcode fstITanalysis - -fstVarList <- c('scholarRefs', 'Nm', 'mass', 'distrSize', 'rand') - - -fstModelList <- lapply(0:5, function(k) findCombs(k, fstVarList, 5)) -fstModelMat <- do.call(rbind, fstModelList) - -fstAllFormulae <- apply(fstModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + ')))) - -fstAllFormulae <- c(as.formula('virusSpecies ~ 1'), fstAllFormulae) - -%%end.rcode - -%%begin.rcode fstModelSelectFun - - -fstModelSelect <- function(allForm, data, phy, boot, allModelMat, varList){ - - set.seed(paste0('2388', boot)) - bootData <- cbind(data, rand = runif(nrow(data))) - row.names(bootData) <- bootData$binomial - - - # log some predictors - bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')]) - - # scale - bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')]) - - - coefs <- matrix(NA, ncol = length(varList) + 1, nrow = nrow(allModelMat), - dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList)))) - - results <- apply(allModelMat, 1, function(x) sapply(varList, 
function(y) y %in% x)) %>% - t %>% - data.frame %>% - cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs) - - # Fit each model - # I'm having problems with convergence so sometimes have to try different starting values. - for(m in 1:length(allForm)){ - if(exists('model')){ - rm(model) - } - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.4, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 1 - }) - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.3, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 2 - }) - } - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.2, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 3 - }) - } - if(!exists('model')){ - try({ - model <- gls(allForm[[m]], correlation = corPagel(value = 0.1, phy = phy), data = bootData, method = 'ML') - results$attempt[m] <- 4 - }) - } - if(!exists('model')){ - try({ - model <- lm(allForm[[m]], data = bootData) - results$attempt[m] <- 5 - message('Running lm') - }) - } - #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML') - results$AIC[m] <- AICc(model) - - if(inherits(model, 'gls')){ - results$lambda[m] <- model$modelStruct$corStruct[1] - } - - results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3] - - - results[m, paste0('beta.', names(coef(model)))] <- coef(model) - - message(paste('Boot:', boot, ', m:', m, '\n')) - } - - results$dAIC <- results$AIC - min(results$AIC) - results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC)) - - - return(results) - -} - -%%end.rcode - -%%begin.rcode fstModelSelectBoots, eval = fstBoots - - - -fstModelsBootStrap <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree, b, fstModelMat, fstVarList), mc.cores = nCores) - -fstAllResults <- do.call(rbind, fstModelsBootStrap) - -write.csv(fstAllResults, file = 
'data/Chapter3/fstModelSelectSubspecies.csv') - - -%%end.rcode - -%%begin.rcode fstAnalyseModelSelect, fig.show = extraFigs - -fstAllResults <- read.csv('data/Chapter3/fstModelSelectSubspecies.csv', row.names = 1) - -fstSepVarWeights <- lapply(1:nBoots, function(b) - sapply(names(fstAllResults)[1:5], - function(x) - sum(fstAllResults[fstAllResults$boot == b, 'weight'][fstAllResults[fstAllResults$boot == b, x]]) - ) - ) - -fstSepVarWeights <- do.call(rbind, fstSepVarWeights) %>% - data.frame(., boot = 1:nBoots) %>% - reshape2::melt(., value.name = 'estimate', id.vars = 'boot') - -fstSepVarWeights$col <- 'Other Variables' -fstSepVarWeights$col[fstSepVarWeights$variable == 'Nm'] <- 'Population Structure' -fstSepVarWeights$col[fstSepVarWeights$variable == 'rand'] <- 'Null' - - - - - -fstModelWeights <- fstAllResults %>% - group_by(predictors) %>% - summarise(AICc = mean(AIC)) %>% - mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% - arrange(desc(modelWeight)) %>% - mutate(cumulativeWeight = cumsum(modelWeight)) - -# Calculate variable weights based on mean(AIC) rather than raw AIC. -fstVarWeights <- sapply(names(fstAllResults)[1:5], - function(x) sum(fstModelWeights$modelWeight[grep(x, as.character(fstModelWeights$predictors))])) - -%%end.rcode - - - - -%%begin.rcode fstITlambda, fig.show = extraFigs, fig.cap = 'Values of $\\lambda$ found in $F_{ST}$ analysis.', fig.height = 3 - -ggplot(fstAllResults, aes(x = lambda)) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - -%%end.rcode - - -%%begin.rcode fstITlambdaFacets, fig.show = extraFigs, fig.height = 4 - - -transform(fstAllResults, mass = c('Other', 'Mass' )[factor(mass)]) %>% -ggplot(aes(x = lambda)) + - facet_grid(. 
~ mass) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - - -transform(fstAllResults, Nm = c('Other', 'Nm' )[factor(Nm)]) %>% -ggplot(aes(x = lambda)) + - facet_grid(. ~ Nm) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - - -transform(fstAllResults, distrSize = c('Other', 'distrSize' )[factor(distrSize)]) %>% -ggplot(aes(x = lambda)) + - facet_grid(. ~ distrSize) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - - -transform(fstAllResults, scholarRefs = factor(c('Scholar Refs', 'Other')[factor(!scholarRefs)], levels = c('Scholar Refs', 'Other'))) %>% -ggplot(aes(x = lambda)) + - facet_grid(. ~ scholarRefs) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - -transform(fstAllResults, rand = c('Other', 'Rand' )[factor(rand)]) %>% -ggplot(aes(x = lambda)) + - facet_grid(. ~ rand) + - geom_histogram() + - ylab('Count') + - xlab(expression(paste('Phylogenetic Signal, ', lambda))) - - -%%end.rcode - -%%begin.rcode lookAtLambda, fig.show = extraFigs - -fstComp <- comparative.data(fstTree, fstFinal, 'binomial') - -fullFst <- pgls(virusSpecies ~ log(Nm) + log(mass) + log(distrSize) + log(distrSize) + log(scholarRefs), fstComp, lambda = 'ML') - -fst.lambda.profile <- pgls.profile(fullFst, "lambda") -plot(fst.lambda.profile) - -data.frame(x = fst.lambda.profile$x, L = fst.lambda.profile$logLik) %>% -ggplot(aes(x, L)) + - geom_line() + - geom_vline(xintercept = fst.lambda.profile$ci$ci.val, col = 'steelblue') - - -%%end.rcode - - -%%begin.rcode fstCoef, fig.show = extraFigs - -ggplot(fstAllResults, aes(x = beta.Nm)) + - geom_histogram() - - -ggplot(fstAllResults, aes(x = beta.Nm, colour = scholarRefs)) + - geom_density() - - - -ggplot(fstAllResults, aes(x = beta.Nm, colour = distrSize)) + - geom_density() - - -fstCoefMeans <- apply(fstAllResults[, grep('beta', 
names(fstAllResults))], 2, function(x) wtd.mean(x, fstAllResults$weight, na.rm = TRUE)) -fstCoefVars <- apply(fstAllResults[, grep('beta', names(fstAllResults))], 2, function(x) wtd.var(x, fstAllResults$weight, na.rm = TRUE)) - -pcCoefLzero <- 100*sum(na.omit(fstAllResults$beta.Nm) < 0) / length(na.omit(fstAllResults$beta.Nm)) - -%%end.rcode - - - -%%begin.rcode univariateFstPGLS - -#orderedFst <- fstFinal[sapply(fstTree$tip.label, function(x) which(fstFinal$binomial == x)),] - -compFst <- comparative.data(data = fstFinal, phy = fstTree, names.col = 'binomial') - -nmFstLambda <- summary(pgls(log(Nm) ~ 1, data = compFst, lambda = 'ML')) -massFstLambda <- summary(pgls(log(mass) ~ 1, data = compFst, lambda = 'ML')) -scholarFstLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compFst, lambda = 'ML')) -virusFstLambda <- summary(pgls(virusSpecies ~ 1, data = compFst, lambda = 'ML')) -distrFstLambda <- summary(pgls(distrSize ~ 1, data = compFst, lambda = 'ML')) - -nmFstUni <- summary(pgls(virusSpecies ~ log(Nm), data = compFst, lambda = 'ML')) - -massFstUni <- summary(pgls(virusSpecies ~ log(mass), data = compFst, lambda = 'ML')) -fstDistrStudyEffort <- summary(pgls(log(scholarRefs) ~ log(distrSize), data = compFst, lambda = 'ML')) - -fstMassStudyEffort <- summary(pgls(log(scholarRefs) ~ log(mass), data = compFst, lambda = 'ML')) - -%%end.rcode - - - - - - - - -\subsubsection{Population structure data} - -I used two measures of population structure: the number of subspecies and the effective level of gene flow. -The number of subspecies was counted using the taxonomy from \textcite{wilson2005mammal}. -The effective level of gene flow was calculated from estimates of $F_{ST}$ collated from the literature. -The studies were from a wide range of spatial scales, from local ($\sim\SI{10}{\kilo\metre}$) to continental. 
-As $F_{ST}$ often increases with spatial scale \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range} I controlled for this by only using data from studies where a large proportion of the species range was studied. -I used the ratio of the furthest distance between $F_{ST}$ samples (taken from the paper or measured with \url{http://www.distancefromto.net/} if not stated) to the length of the IUCN species range \cite{iucn} and only used studies if this ratio was greater than \rinline{rangeUseable}. -This is an arbitrary value that was a compromise between retaining a reasonable number of data points and controlling for the bias in spatial scale. -I only used global $F_{ST}$ estimates as the mean of pairwise $F_{ST}$ values is not necessarily equal to the global $F_{ST}$ value. -I converted all $F_{ST}$ values to effective migration rates using $M = (1-F_{ST})/4F_{ST}$. -This transforms the data from being bound by $(0, 1)$ to being in the range $\lbrack 0, \infty)$ and is easier to interpret. - -The two measures of population structure were analysed separately because the number of subspecies data set had \rinline{nrow(nSpecies)} data points but there was only $F_{ST}$ data for \rinline{nrow(fstFinal)} bat species. -For the subspecies analysis, all bat species in \textcite{luis2013comparison} were used (i.e.\ all species with at least one known virus species). -This was to avoid using the very large number of bat species that have simply never been sampled for viruses. -However, for the gene flow analysis, all bat species with suitable $F_{ST}$ estimates were used. -As some bat species had suitable $F_{ST}$ estimates but were not present in \textcite{luis2013comparison}, some bat species with zero known virus species were included. 
-These bat species with no known viruses were included to make the greatest use of the $F_{ST}$ data available and because the number of species with no known virus species was not unduly large (\rinline{sum(fstFinal$virusSpecies == 0)} species). - -After data cleaning there was data for \rinline{nrow(nSpecies)} bat species in \rinline{length(unique(nSpecies$Family))} families for the subspecies analysis. -Due to the limited number of studies and the restrictive requirements imposed on study design, there was only data for \rinline{nrow(fstFinal)} bat species in \rinline{length(unique(fstFinal$Family))} families for the effective gene flow analysis. -The raw data are included in Table~\ref{A-rawData}. - - - - -\subsubsection{Other explanatory variables} - - - -To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis}. -This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rvest}, with ITIS synonyms being accessed with the \emph{taxize} package \cite{chamberlain2013taxize}. -I log transformed these variables as they were strongly right skewed. -I tested for correlation between these two proxies for study effort using phylogenetic generalised least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}). -The log numbers of citations on PubMed and Google Scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$). -As the correlation between citation counts was strong, I only used Google Scholar reference counts in subsequent analyses. -%See the appendix for analyses run using PubMed citations. 
- -Two factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat} and range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}. -These other factors were included to avoid spurious positive results occurring simply due to correlations between pathogen richness and a different, causal factor. -Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density was not included in the analysis as there is very little data for bat densities. -Measurements of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. -\emph{Pipistrellus pygmaeus} was assigned the same mass as \emph{P. pipistrellus} as they are indistinguishable by mass. -Body mass measurements were log transformed as they were strongly right skewed. -Distribution size was estimated from range maps downloaded for all species from IUCN \cite{iucn} and was also log transformed due to right skew. - - - - -\subsection{Statistical analysis} - -Statistical analysis for both measures of population structure (number of subspecies and effective level of gene flow) was conducted using an information theoretical approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}. -All analyses were performed in \emph{R} \cite{R} and all code is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/comparative-test-of-pop-structure.Rtex}. -I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept. 
-In the analysis using the number of subspecies as the measure of population structure I also modelled the interaction between study effort and number of subspecies by including their product. -I included this interaction because I believed \emph{a priori} that it may be important: subspecies are more likely to be identified in well studied species. -The interaction was only included in models with both study effort and number of subspecies as individual terms. -Following \textcite{whittingham2005habitat} I included a uniformly distributed random variable. -This variable can be used to benchmark how important other explanatory variables are. -The whole analysis was run \rinline{nBoots} times, resampling the random variable each time. - - -To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}. -This tree was pruned to include only the species I had data for (Figure~\ref{fig:treePlot}). -Phylogenetic manipulation was performed using the \emph{ape} package \cite{ape}. -I also performed the analysis using the phylogeny from \textcite{jones2005bats} as this has some broad topological differences including the Rhinolophoidea being sister to the Pteropodidae rather than being related to the other insectivorous bats (Figure~\ref{fig:treePlot2}). - - - - -%%begin.rcode treeCapt - -treeCapt <- ' -The phylogenetic distribution of viral richness. -The phylogeny is from \\cite{fritz2009geographical} pruned to include all species used in either the number of subspecies or gene flow analysis. -Dot size shows the number of known viruses for that species and colour shows family. -The red scale bar shows 25 million years.' 
- - - -treeTitle <- 'Pruned phylogeny showing number of pathogens and family' - -%%end.rcode - -%%begin.rcode treePlot, out.width = '1\\textwidth', out.extra = 'trim = 0cm 0cm 0cm 0cm', fig.height = 5, fig.height = 5.5, fig.cap = treeCapt, fig.scap = treeTitle - -combUneeded <- tr1$tip.label[!(tr1$tip.label %in% c(as.character(fstFinal$binomial), nSpecies$binomial))] - -# Prune tree down to only needed tips. -combTree <- drop.tip(tr1, combUneeded) - -combdf <- nSpecies %>% - dplyr::select(binomial, virusSpecies, Family) %>% - rbind(fstFinal %>% dplyr::select(binomial, virusSpecies, Family)) %>% - distinct(binomial) - -# Plot tree -p <- ggtree(combTree, layout = 'fan') - -p %<+% combdf + - geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + - scale_size(range = c(0.1, 3)) + - scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + - theme_tcdl + - theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + - theme(legend.position = c(0.5, -0.04)) + - geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + - labs(size = 'Virus Richness') + -# guides(size = guide_legend(override.aes = list(shape = 1))) + - theme(legend.key.size = unit(0.8, "lines"), - legend.text = element_text(size = 10), - legend.margin = unit(c(0.05), "cm"), - legend.title = element_text(size = 12), - legend.direction = "horizontal") + - guides(colour = guide_legend(ncol=3)) - - -# Attempt at concentric circle time bar. 
-#scale <- data.frame(x = c(0, 0), y = c(0, 0), l = c(1200, 2400)) - -#p %<+% combdf + -# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + -# scale_size(range = c(0.1, 100), breaks = c(1, 5, 10)) + -# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) + -# theme_tcdl + -# theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) + -# theme(legend.position = c(0.5, -0.04)) + -# geom_point(data = scale, aes(x = x, y = y, size = l), alpha = 0.2) + -# geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) + -# labs(size = 'Virus Richness') + -## guides(size = guide_legend(override.aes = list(shape = 1)), alpha = 0.9) + -# theme(legend.key.size = unit(0.8, "lines"), -# legend.text = element_text(size = 10), -# legend.margin = unit(c(0.05), "cm"), -# legend.title = element_text(size = 12), -# legend.direction = "horizontal") + -# guides(colour = guide_legend(ncol=3)) - -# Or using bars - -#scale2 <- data.frame(x = c(1, 1), y = c(10, 200), w = c(1, 1)) - -#p %<+% combdf + -# geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) + -# geom_bar(data = scale2, aes(x = x, y = y, size = w), alpha = 0.3, stat = 'identity', position = 'identity') - -%%end.rcode - - - -The importance of the phylogeny on each variable separately was examined by estimating the $\lambda$ parameter when regressing the variable against an intercept using the \emph{pgls} function in \emph{caper} \cite{caper}. -The parameter $\lambda$ usually takes values between zero and one and \emph{pgls} constrains $\lambda$ within these bounds. -$\lambda = 0$ implies no autocorrelation while a trait evolving by Brownian motion along the tree would have $\lambda = 1$. -I tested fitted $\lambda$ values against the null hypothesis of $\lambda = 0$ (no correlation between species) with log-likelihood ratio tests using \emph{caper} \cite{caper}. 
- -I fitted phylogenetic regressions for all models in the credible set using the function \emph{gls} in the package \emph{nlme} \cite{nlme}. -The explanatory variables were centred and scaled to allow direct comparison of the coefficients \cite{schielzeth2010simple}. -For each regression model I simultaneously fitted the $\lambda$ parameter as this avoids misspecifying the model \cite{revell2010phylogenetic}. -Unlike the \emph{pgls} function, \emph{gls} does not constrain $\lambda$ to be in the range $\lbrack 0, 1\rbrack$. -$\lambda < 0$ indicates that residuals from the fitted model are distributed on the phylogeny more uniformly than expected by chance. -$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when evolution occurs along a branch than the importance of the phylogeny. -Further, fitting multiple parameters makes interpretation difficult. - - - -To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely). -This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined. -I calculated AICc for each model. -As each model was fitted 50 times, I calculated the average AICc, $\bar{\text{AICc}}$, by averaging AICc scores for each model. -$\Delta\text{AICc}$ was calculated as $\bar{\text{AICc}} - \text{min}(\bar{\text{AICc}})$, not the mean of the individual $\Delta\text{AICc}$ scores, to guarantee that the best model has $\Delta\text{AICc} = 0$. -From these $\Delta\text{AICc}$ values I calculated Akaike weights, $w$. -This value can be interpreted as the probability that a model is the best model, given the data, amongst those examined. -For each variable, the Akaike weights of all models containing that variable were summed to give $Pr$. 
-This value can be interpreted as the probability that the given variable is in the best model. - -To determine the direction and strength of the effect of each variable I also calculated the mean of its regression coefficient, $b$, across all models that contained that variable, weighted by each model's Akaike weight. -In the subspecies analysis the inclusion of an interaction term between number of subspecies and study effort makes interpretation of this mean coefficient more difficult, particularly because the interaction term greatly affects the estimated value of $b$. -To aid interpretation, the mean coefficient for the number of subspecies was calculated for: \emph{i}) all models containing the number of subspecies, \emph{ii}) only models with the interaction term and \emph{iii}) only models with the number of subspecies but not the interaction term. - - - -%%begin.rcode boxplotCapt - -# Caption for the main boxplot of subspecies vs virus - -boxplotCapt <- paste( -'The relationship between number of subspecies and viral richness for', -nrow(nSpecies), -'bat species. -The area of the circle shows the number of bat species at each discrete value. -48 bat species have one subspecies and one known virus species. -The red line represents a phylogenetic simple regression between the two variables. 
-' -) - -boxplotTitle <- paste( -'The relationship between number of subspecies and viral richness for', -nrow(nSpecies), -'bat species' -) - -%%end.rcode - -%%begin.rcode boxplot, fig.cap = boxplotCapt, fig.scap = boxplotTitle, fig.height = 2.3 - - -nSpeciesCounts <- nSpecies %>% - group_by(NumberOfSubspecies, virusSpecies) %>% - dplyr::summarize(n = n()) - -ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + - geom_point() + - scale_size(range = c(0.5, 4.3), breaks = c(1, 20, 40)) + - scale_y_continuous(breaks = c(1, 5, 10, max(nSpecies$virusSpecies))) + - scale_x_continuous(breaks = c(1, 4, 8, 12, 16)) + - xlab('Number of Subspecies') + - ylab('Viral Richness') + - geom_abline(slope = sspUni$coef[2, 1], intercept = sspUni$coef[1,1], lwd = 0.7, colour = pokepal('nidorina')[10]) - -%%end.rcode - - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\section{Results} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - - -\subsection{Number of Subspecies} -\tmpsection{More descriptive} - -The number of described virus species for a bat host ranged up to \rinline{max(nSpecies$virusSpecies)} viruses in \emph{\rinline{nSpecies$binomial[which.max(nSpecies$virusSpecies)]}}. -There appears to be a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:boxplot}) though few species have more than five subspecies. -Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table~\ref{A-modelWeights}). -However these top seven models all contained study effort, number of subspecies and the interaction between these two variables. 
-The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models. -These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined. - -Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}. -The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}). -When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}). -Models with an interaction term between the number of subspecies and study effort have a positive interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}). - - - -\afterpage{ % use after page to make sure this whole table is at the end of a page. -\begin{landscape} -\begin{table}[p!] -\centering -%\rowcolors{2}{gray!25}{white} -\caption[Model selection results]{ -Model selection results for number of subspecies and effective level of gene flow analysis. -Models are ranked according to $\bar{\text{AICc}}$ and only the best nine and three models are shown respectively. 
-Models were fitted to all combinations of variables (in total \rinline{nrow(modelWeights)} number of subspecies models and \rinline{nrow(fstModelWeights)} effective gene flow models). -$\bar{\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. -$\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. -$w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). -$\sum w$ is the cumulative sum of the Akaike weights. -log(Scholar)*NSubspecies indicates the interaction term between study effort and number of subspecies. -%In the number of subspecies analysis there are many models with low $\Delta$AICc scores suggesting there there is no single `best model'. -%In the gene flow analysis, only the top model is supported. -} - - -\begin{tabular}{@{}>{\footnotesize}lrrrr@{}} - -\toprule -\normalsize{Model} & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$\\ -\midrule -&&&&\\[-3mm] -\textit{\small{Number of Subspecies}} &&&&\\ -%1 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) + log(RangeSize) & -\rinline{round(modelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[1, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[1, 5], 2))}\\ -%2 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(Mass) & -\rinline{round(modelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[2, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[2, 5], 2))}\\ -%3 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(Mass) & -\rinline{round(modelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[3, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[3, 5], 2))}\\ -%4 -log(Scholar) 
+ NSubspecies + log(Scholar)*NSubspecies & -\rinline{round(modelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[4, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[4, 5], 2))}\\ -%5 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + log(RangeSize) & -\rinline{round(modelWeights[5 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[5, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[5, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[5, 5], 2))}\\ -%6 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(RangeSize) & -\rinline{round(modelWeights[6 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[6, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[6, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[6, 5], 2))}\\ -%7 -log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random & -\rinline{round(modelWeights[7 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[7, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[7, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))}\\ -%8 -log(Scholar) + NSubspecies + log(Mass) + Random & -\rinline{round(modelWeights[8 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[8, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[8, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[8, 5], 2))}\\ -%9 -log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) + rand& -\rinline{round(modelWeights[9 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[9, 3], 2))} & -\rinline{sprintf("%.2f", round(modelWeights[9, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[9, 5], 2))}\\[5mm] -\textit{\small{Gene flow}} &&&&\\ -log(Scholar) + log(Gene flow) + log(Mass) & -\rinline{round(fstModelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 3], 2))} & -\rinline{sprintf("%.2f", round(fstModelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 5], 2))}\\ 
-log(Range size) & -\rinline{round(fstModelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 3], 2))} & -\rinline{sprintf("%.2f", round(fstModelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 5], 2))}\\ -log(Mass) & -\rinline{round(fstModelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 3], 2))} & -\rinline{sprintf("%.2f", round(fstModelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 5], 2))}\\ -%log(Scholar) + log(Gene flow) + log(Mass) + Random & -%\rinline{round(fstModelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 3], 2))} & -%\rinline{sprintf("%.2f", round(fstModelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 5], 2))}\\ -\bottomrule -\end{tabular} - -\label{t:models} -\end{table} -\end{landscape} -} - - - - -When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}). -Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support. -However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}. - - - -\tmpsection{Model results} - - -\begin{table}[t!] -\centering -\caption[Estimated variable weights and coefficients for number of subspecies and gene flow analyses]{ -Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses. 
-The coefficients for the number of subspecies variable are given for models with and without the interaction term because this term strongly changes the coefficient and because the coefficient can only be usefully interpreted when estimated without the interaction. -However, there are no weights for these separated terms as they are not directly compared in the model selection framework. -} -%\rowcolors{2}{gray!25}{white} -\begin{tabular}{@{}>{\small}l rrrr@{}} -\toprule -& \multicolumn{2}{c}{\textit{Number of Subspecies}} & \multicolumn{2}{c}{\textit{Gene flow}}\\\cmidrule(rl){2-3}\cmidrule(rl){4-5} -\normalsize{Variable} & $Pr$ & Coefficient & $Pr$ & Coefficient\\ -\midrule -Number of subspecies &&&&\\ -\hspace{3mm}Total & \rinline{sprintf('%.2f', varWeights['NumberOfSubspecies'])} & \rinline{varCoefMeans['beta.NumberOfSubspecies']} &&\\ -\hspace{3mm}Models without interaction term && \rinline{nSpeciesCoefMean} &&\\ -\hspace{3mm}Models with interaction term && \rinline{nSpeciesCoefMeanI} &&\\ -Number of subspecies*log(Scholar) & \rinline{varWeights['scholarRefs.NumberOfSubspecies']} & \rinline{sprintf('%.2f', varCoefMeans['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm] -Gene flow & & & \rinline{sprintf('%.2f', fstVarWeights['Nm'])} & \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm] -log(Scholar) & \rinline{sprintf('%.2f', varWeights['scholarRefs'])} & \rinline{varCoefMeans['beta.scholarRefs']} & - \rinline{sprintf('%.2f', fstVarWeights['scholarRefs'])} & \rinline{fstCoefMeans['beta.scholarRefs']}\\ -log(Mass) & \rinline{sprintf('%.2f', varWeights['mass'])} & \rinline{varCoefMeans['beta.mass']} & - \rinline{sprintf('%.2f', fstVarWeights['mass'])} & \rinline{fstCoefMeans['beta.mass']}\\ -log(Range size) & \rinline{sprintf('%.2f', varWeights['distrSize'])} & \rinline{varCoefMeans['beta.distrSize']}& - \rinline{fstVarWeights['distrSize']} & \rinline{fstCoefMeans['beta.distrSize']}\\ -Random & \rinline{sprintf('%.2f', varWeights['rand'])} & 
\rinline{varCoefMeans['beta.rand']}& - \rinline{fstVarWeights['rand']} & \rinline{fstCoefMeans['beta.rand']}\\ -\bottomrule -\end{tabular} - -\label{t:variables} -\end{table} - - - - -\subsection{Gene Flow} - -\tmpsection{More Descriptive} - -%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species. -The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}). -Only the model with study effort, gene flow and body mass was well supported with the second model having an $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}). -The effective level of gene flow was likely in the best model ($Pr > 0.99$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}). -On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the insignificant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}). -Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more varied than the number of subspecies coefficient with \rinline{round(pcCoefLzero)}\% of multiple-regression models estimating a positive relationship. - - - - - -%%begin.rcode ITCombPlotCapt - -ITPlotCapts <- " -The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness. 
-The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. -The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. -The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. -The red ``Random'' box is the uniformly random variable. -Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses." - -ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness" - -%%end.rcode - - -%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = ITPlotTitle, out.width = '\\textwidth', cache = FALSE - -# Reorder var levels to get structure at beginning. 
-fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)]) - -# Draw the fst model selection plot -fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) + - geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + - scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + - scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + - ylim(0, 1) + - theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), - panel.grid.major.x = element_blank(), - axis.text.y = element_text(size = 8)) + - scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + - scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + - ylim(0, 1) + - ylab('P(in best model)') + - xlab('') - - -#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10) - - -# Combine and print the plots. -ggdraw() + - draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(ITPlot, 0, 0, 0.5, 1) + - draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) + - draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) - - -%%end.rcode - - - - -Study effort was very likely in the best model ($Pr > 0.99$) as was body mass ($Pr > 0.99$). -However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). 
% which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}). -In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}. -%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}). -Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}). -The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). - - - - -%%begin.rcode fstRawCapt - -fstRawDataCapt <- -paste( -'Relationship between viral richness and log effective gene flow per generation for', -nrow(fstFinal), -'bat species. -Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites. -The red line represents a phylogenetic simple regression between the two variables. 
-') - - - -fstRawDataTitle <- -paste( -'Relationship between viral richness and log effective gene flow per generation for', -nrow(fstFinal), -'bat species -') - -%%end.rcode - - - -%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle - -# Plot raw fst data - -ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) + - geom_point(size = 2) + - scale_colour_poke(pokemon = 'oddish', spread = 3) + - scale_x_log10() + - geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) + - xlab('Gene Flow (per gen.)') + - ylab('Viral Richness') - -%%end.rcode - - - -When using the phylogeny from \textcite{jones2005bats} the analysis became very unstable (Figure~\ref{f:A-itplots}). -The support for each variable changed dramatically with each resampling of the random variable. -On average however, only the model containing mass and range size is supported (Tables~\ref{A-fstModelWeights} and~\ref{t:variables2}). - - - - -\subsection{Phylogenetic Analysis} - -\subsubsection{Number of subspecies} - -Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species. -The mean number of viruses across families is fairly constant with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean, (\rinline{min(familyMeans$mean)}). -The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species. -The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}). 
- - - -The small change in mean pathogen richness across families and the lack of a clear pattern in Figure~\ref{fig:treePlot} imply that viral richness is not strongly phylogenetic. -This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}). -%This fact implies that other factors must control pathogen richness. -%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses. - -Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.99$), study effort and distribution size had weak but significant autocorrelation (Study Effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}, Distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$) and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$). -Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))}, which implied that the residuals from the models were very weakly phylogenetic. -A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negatively phylogenetically distributed residuals. - - - - -\subsubsection{Effective gene flow} - -There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.99$). -Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.99$). -Due to the limited sample size, significance tests are unlikely to have much power. 
-There is little evidence of phylogenetic autocorrelation in study effort ($\lambda$ = \rinline{scholarFstLambda$param['lambda']}, $p$ = \rinline{scholarFstLambda$param.CI$lambda$bounds.p[1]}). -However, there is some weak evidence of phylogenetic signal in range size as the estimated size of $\lambda$ is large while $p$ is also large, potentially due to a lack of statistical power ($\lambda$ = \rinline{distrFstLambda$param['lambda']}, $p$ = \rinline{distrFstLambda$param.CI$lambda$bounds.p[1]}). -Body mass showed significant phylogenetic autocorrelation ($\lambda$ = \rinline{massFstLambda$param['lambda']}, $p$ = \rinline{massFstLambda$param.CI$lambda$bounds.p[1]}). - - -Across all multiple regression models the mean value of $\lambda$ is \rinline{mean(na.omit(fstAllResults$lambda))} and a large number of individual models (\rinline{round(mean(na.omit(fstAllResults$lambda < 0))*100)}\%) had negatively phylogenetically distributed residuals implying the residuals from the model are spread more uniformly on the phylogeny than expected by chance. -Due to the small sample size this was probably due to a small number of data points with large residuals being distant on the tree. - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\section{Discussion} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -\tmpsection{Discuss results in more detail} - - -\tmpsection{Pop structure relates to pathogen richness} - -% It does so here -% I hope this study is more robust. -In this study I have used known viral richness in bats as a case study for the more general hypothesis that increased population structure promotes pathogen richness. 
-In both analyses I found that a positive effect of increasing population structure (a positive effect of the number of subspecies and a negative effect of gene flow) is likely to be in the best model for explaining viral richness. -Only the effective gene flow analysis, when performed using the phylogeny from \textcite{jones2005bats}, does not support this hypothesis. -Therefore, my study supports the broader hypothesis that increased population structure promotes pathogen richness. -The positive relationship between increased population structure and pathogen richness implies that direct or indirect competitive mechanisms are acting such that increased population structure allows escape from competition, which promotes pathogen richness. -Furthermore, my study contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness by increasing the rate of spread of new pathogens entering into the population \cite{nunn2003comparative, morand2000wormy}. - - - -% It does so in some lit -This analysis is in agreement with two studies that have specifically tested this same hypothesis \cite{turmelle2009correlates, maganga2014bat}. -These two studies used $F_{ST}$ \cite{turmelle2009correlates} and fragmentation of species distributions \cite{maganga2014bat}. -Combined with the analysis here using the number of subspecies, three different measures of population structure have been shown to correlate with pathogen richness in bats. -By analysing data on two measures of population structure, and using larger data sets than previous studies, it is hoped that the results here may be more robust than in previous analyses \cite{gay2014parasite, turmelle2009correlates, maganga2014bat}. - - - -% The pattern is reversed in other lit -In contrast, \textcite{gay2014parasite} found the opposite relationship using fragmentation of species distribution. 
-Furthermore, \textcite{bordes2008bat} found no relationship between increased colony size and pathogen richness while \textcite{gay2014parasite} found relationships in opposite directions for virus and ectoparasite richness. -However, the study by \textcite{gay2014parasite} uses relatively few species while the study by \textcite{bordes2008bat} uses group size which is a measure of local rather than global population structure. -The overall weight of evidence suggests that population structure and pathogen richness are associated. - - - - -\tmpsection{There is an interaction between study effort and number of subspecies} - -% interpretations -% Biases are known in the lit. gippoliti2007problem % maybe should add to methods? - -There was strong support for a positive interaction between the number of subspecies and study effort. -The support for this interaction implies that increased population structure has a stronger relationship with known pathogen richness in the presence of study effort. -One interpretation of this is that increased population structure alone does not predict high known viral richness; reasonable study effort is also needed to turn the expected high viral richness into known and recorded viral richness. -Biases in identification of subspecies have been noted before \cite{gippoliti2007problem}. -The number of subspecies is more commonly used as a variable in comparative analyses of birds than mammals but the fact that it is associated with study effort is often not taken into account \cite{phillimore2007biogeographical, belliure2000dispersal}. - -\tmpsection{Other explanatory vars} - - -% study effort is important. Never forget. -% body mass behaves wierdly. -% Range size is very marginal - -Of the other explanatory variables considered, study effort and body mass were selected as being in the best model while there was marginal evidence for range size being associated with viral richness. 
-Study effort positively predicted pathogen richness, confirming the expectation that additional study of a bat species yields more known viruses infecting that host species. -Therefore, this bias cannot be ignored in studies using known pathogen richness as a proxy for total pathogen richness \cite{nunn2003comparative, gregory1990parasites}. -While body mass is selected as being in the best model in both the number of subspecies analysis and the effective gene flow analysis the estimated coefficients have opposite signs in the two analyses. -In the number of subspecies analysis, body mass has a positive relationship with pathogen richness which is in agreement with previous studies \cite{kamiya2014determines, bordes2008bat, turmelle2009correlates, gay2014parasite, maganga2014bat}. -However, in the effective gene flow analysis, body mass has a negative estimated coefficient. -This is in contrast to the number of subspecies analysis, previous studies in the literature and the single-predictor model. -This result is probably due to correlations with other variables in the analysis and exacerbated by the small sample size in this analysis. - - -\tmpsection{phylogeny} -% Phylogeny is not very important -% phylogeny is weird in Fst study? - - - -%Another interpretation is that having few subspecies does not predict low viral richness unless the species has been adequately studied as otherwise the low number of subspecies is probably due to a lack of study rather than an accurate measurement. - -%Another potential mechanism by which structure might be promoting increased richness is by slowing the spread of highly virulent viruses such as rabies and preventing them from having short, intense epidemics followed by extinction. -%This mechanism has interesting parallel to metapopulation theory in ecology in which a metapopulation structure can allow persistence of species that would otherwise go extinct. 
- -\subsection{Broader implications} - -The relationship between increased population structure and pathogen richness suggests that population structure has at least some potential to predict high pathogen richness and therefore a species' likelihood of being a reservoir of a potentially zoonotic pathogen. -However, given that it is difficult to measure population structure and given that the relationship appears to be weak at best, this trait on its own is unlikely to be useful in predicting zoonotic risk. -However, a number of other factors are also associated with pathogen richness, such as body mass and, to a lesser extent, range size, as shown here, as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}. -Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential for use in prioritising zoonotic disease surveillance. -The main hurdle in this approach is finding a way to validate models; due to the study effort bias in current data, predictive models will also be biased. -As unbiased pathogen surveys such as \textcite{anthony2013strategy} become more common, good validation may become possible. -Alternatively, predictive models could be trained on all available --- and therefore biased --- data and validated by predicting smaller, unbiased data sets such as the data collected in \textcite{maganga2014bat}. - -The relationship between increased population structure and pathogen richness also has implications for habitat fragmentation and range shifts due to global change. -In short, habitat fragmentation and range shifts that reduce movement between populations would be predicted to increase pathogen richness. -However, depending on the mechanisms by which increased population structure increases pathogen richness, this may not be a cause for concern. 
-If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over any short to medium-term timescales. -If, however, increased population structure actively promotes the evolution of new pathogen strains or allows the persistence of more virulent strains \cite{blackwood2013resolving, pons2014insights, plowright2011urban} this could have important public health implications. -Therefore further studies on the exact mechanisms by which increased population structure affects pathogen richness are needed. - - -\subsection{Study limitations} - -Although I have used measures of study effort to try to control for biases in the viral richness data, this bias could still make the results here unreliable --- this is especially true as study effort is by far the strongest predictor of viral richness in both data sets. -It is hoped that as untargeted sequencing of viral genetic material becomes cheaper and more common this bias can be reduced \cite{anthony2013strategy}. -The strength of the relationship between study effort and known viral richness also highlights the number of bat-virus host-pathogen relationships yet to be discovered and the number of virus species that are yet to be described. - -I have included a number of explanatory variables to avoid spurious correlations. -However, there is little data on bat density or population size. -Given that studies in other mammalian groups have found relationships between host density and pathogen richness this would be a useful variable to include in further analyses \cite{kamiya2014determines, nunn2003comparative, arneberg2002host}. -Acoustic monitoring is becoming cheaper and less labour intensive and may provide suitable data for estimating population densities or population sizes for more bat species. 
-However, it is not clear whether host population density or host population size is the more appropriate measure with respect to disease dynamics \cite{begon2002clarification}. -Given the importance of geographic range size found here and elsewhere \cite{lindenfors2007parasite, nunn2003comparative, turmelle2009correlates, huang2015parasite, kamiya2014determines} comparative studies may struggle to select between these three related factors: host population size, population density and geographic range size. - -I have used two measures of population structure and the number of subspecies data set is larger than those used in previous studies. -However it is clear that the gene flow data set is small ($n$ = \rinline{nrow(fstFinal)}). -This may explain some unexpected results. -While the model averaging approach has given a negative model averaged coefficient for gene flow, the single-predictor model of gene flow against viral richness gave a positive coefficient. -Furthermore body mass has a negative average coefficient. -This is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model. -It is not easy to interpret these contradictions but it is clear that the results from the gene flow analysis alone should not be considered strong evidence for a relationship between increased population structure and pathogen richness. -These contradictions also reiterate the need to use large data sets where possible and the need to use multiple measures of population structure to promote robust conclusions. - -Finally, while comparative studies are a useful tool for examining broad trends of pathogen richness across large taxonomic groups, they cannot examine the specific mechanisms that may be underpinning the correlations found. 
-Therefore, further work is needed to test which mechanisms are actually causing the relationship between increased population structure and pathogen richness that I have identified here. -A number of mechanisms might be involved. -A reduced rate of pathogen extinction might be caused by a reduction in competition due to the slow dispersal of competing pathogens. -Alternatively, increased population structure may promote the invasion of new pathogens, by creating localised areas of low competition or host immunity. -One method for testing these mechanisms would be through mechanistic epidemiological models. - -\subsection{Conclusions} - - -I have used phylogenetic linear models to identify positive relationships between two measures of population structure (the number of subspecies and effective levels of gene flow) and viral richness in bats. -This study adds to the evidence that increased population structure may promote pathogen richness. -It does not support the view that factors that increase $R_0$ will increase pathogen richness. -Using larger data sets and multiple measurements makes the weight of the evidence here stronger than in previous studies. -However, caution must still be taken in interpreting these results as the data is biased and particularly sparse in one of the analyses. - - - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%%%% Repeat analysis with bat clocks and rocks %%%% -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -%\section{Appendix} - - - -%%begin.rcode treeRead2 - -# Read in trees -t2 <- read.nexus('data/Chapter3/BatST2BL.nex') - -# Make names match previous names -t2$tip.label <- gsub('_', ' ', t2$tip.label) - -#missing <- nSpecies$binomial[!nSpecies$binomial %in% pruneTree2$tip.label ] - -## Copy binomial column. binomial will be changed to fit t2. 
-#nSpecies$oldBinomial <- nSpecies$binomial - -## Replace with agrep where possible -#closeMatch <- sapply(missing, function(i) t2$tip.label[agrep(i, t2$tip.label, max.distance = 0.11)]) - -#closeMatch <- closeMatch[sapply(closeMatch, function(i) length(i) > 0)] - - - - -unneededTips2 <- t2$tip.label[!(t2$tip.label %in% nSpecies$binomial)] - -# Prune tree down to only needed tips. -pruneTree2 <- drop.tip(t2, unneededTips2) - - -nSpecies2 <- sapply(pruneTree2$tip.label, function(x) which(nSpecies$binomial == x)) %>% - nSpecies[., ] - - -################ -## Fst tree ## -################ - - -# Which tips are not needed -fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] - -# Prune tree down to only needed tips. -fstTree2 <- drop.tip(t2, fstUnneededTips2) - -# Which tips in Fst analysis are not in bats clocks tree. -fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] - - -# Hacky cruddy way of placing the missing tips into the tree. Should end up with genus level polytomies in trimmed tree. -# Just replacing some of the uneeded tips with the ones I need. - -t2$tip.label[t2$tip.label == 'Miniopterus pusillus'] <- 'Miniopterus natalensis' -t2$tip.label[t2$tip.label == 'Miniopterus schreibersi'] <- 'Miniopterus schreibersii' -t2$tip.label[t2$tip.label == 'Rousettus celebensis'] <- 'Rousettus leschenaultii' -t2$tip.label[t2$tip.label == 'Myotis oxyotus'] <- 'Myotis macropus' -t2$tip.label[t2$tip.label == 'Myotis leibii'] <- 'Myotis ciliolabrum' - -#Re prune tree -# Which tips are not needed -fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)] - -# Prune tree down to only needed tips. -fstTree2 <- drop.tip(t2, fstUnneededTips2) - -# Check we now have all the tips. 
-fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)] - -rm(t2) - - - - -%%end.rcode - - -%%begin.rcode treePlot2, show.figs = 'hide', out.width = '\\textwidth', fig.cap = 'Pruned phylogeny \\cite{jones2005bats} with dot size showing number of pathogens and colour showing family.' - - - -## Plot tree -#p2 <- ggtree(pruneTree2, layout = 'fan') - -#p2 %<+% nSpecies2[, 1:6] + -# geom_point(aes(size = virusSpecies, colour = Family), subset=.(isTip)) + -# scale_size(range = c(0.8, 3)) + -# scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) + -# theme_tcdl + -# theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) + -# theme(legend.position = 'right') + -# labs(size = 'Virus Richness') + -# theme(legend.key.size = unit(0.6, "lines"), -# legend.text = element_text(size = 6), -# legend.title = element_text(size = 8)) - - - -%%end.rcode - - - -%%begin.rcode runBatClocks, eval = TRUE - - -fitModelsBootStrap2 <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies2, pruneTree2, b, allModelMat, varList), mc.cores = nCores) - -allResults2 <- do.call(rbind, fitModelsBootStrap2) - -write.csv(allResults2, file = 'data/Chapter3/modelSelectSubspeciesBatClocks.csv') - - -## FST analysis - -fstModelsBootStrap2 <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree2, b, fstModelMat, fstVarList), mc.cores = nCores) - -fstAllResults2 <- do.call(rbind, fstModelsBootStrap2) - -write.csv(fstAllResults2, file = 'data/Chapter3/fstModelSelectSubspeciesBatClocks.csv') - - -%%end.rcode - - -%%begin.rcode batClocksAnalyse - -allResults2 <- read.csv('data/Chapter3/modelSelectSubspeciesBatClocks.csv', row.names = 1) - -varWeights2 <- sapply(names(allResults2)[1:6], function(x) sum(allResults2$weight[allResults2[, x]])/nBoots) - - -sepVarWeights2 <- lapply(1:nBoots, function(b) - sapply(names(allResults2)[1:6], - function(x) - sum(allResults2[allResults2$boot == b, 
'weight'][allResults2[allResults2$boot == b, x]]) - ) - ) - -sepVarWeights2 <- do.call(rbind, sepVarWeights2) %>% - data.frame(., boot = 1:nBoots) %>% - reshape2::melt(., value.name = 'estimate', id.vars = 'boot') - -sepVarWeights2$col <- 'Other Variables' -sepVarWeights2$col[grep('NumberOf', sepVarWeights2$variable)] <- 'Population Structure' -sepVarWeights2$col[sepVarWeights2$variable == 'rand'] <- 'Null' - - - -modelWeights2 <- allResults2 %>% - group_by(predictors) %>% - summarise(AICc = mean(AIC)) %>% - mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% - arrange(desc(modelWeight)) %>% - mutate(cumulativeWeight = cumsum(modelWeight)) %>% - mutate(string = levels(predictors)[predictors]) - - -#### FST - - -fstAllResults2 <- read.csv('data/Chapter3/fstModelSelectSubspeciesBatClocks.csv', row.names = 1) - -fstSepVarWeights2 <- lapply(1:nBoots, function(b) - sapply(names(fstAllResults2)[1:5], - function(x) - sum(fstAllResults2[fstAllResults2$boot == b, 'weight'][fstAllResults2[fstAllResults2$boot == b, x]]) - ) - ) - -fstSepVarWeights2 <- do.call(rbind, fstSepVarWeights2) %>% - data.frame(., boot = 1:nBoots) %>% - reshape2::melt(., value.name = 'estimate', id.vars = 'boot') - -fstSepVarWeights2$col <- 'Other Variables' -fstSepVarWeights2$col[fstSepVarWeights2$variable == 'Nm'] <- 'Population Structure' -fstSepVarWeights2$col[fstSepVarWeights2$variable == 'rand'] <- 'Null' - - -fstVarWeights2 <- sapply(names(fstAllResults2)[1:5], function(x) sum(fstAllResults2$weight[fstAllResults2[, x]])/nBoots) - - -fstModelWeights2 <- fstAllResults2 %>% - group_by(predictors) %>% - summarise(AICc = mean(AIC)) %>% - mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>% - arrange(desc(modelWeight)) %>% - mutate(cumulativeWeight = cumsum(modelWeight)) - - -%%end.rcode - - -%% ------------------------------------------- %% -%% plot bat clocks rocks -%% ------------------------------------------- %% - 
- -%%begin.rcode ITPlots2 - -# reorder factors to get structure vars at beginning. -sepVarWeights2$variable <- factor(sepVarWeights2$variable, levels(sepVarWeights2$variable)[c(2, 6, 1, 3, 4, 5)]) - -ITPlot2 <- ggplot(sepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + - geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) + - scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + - scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + - theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), - panel.grid.major.x = element_blank(), - axis.text.y = element_text(size = 8)) + - scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) + - scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + - ylim(0, 1) + - ylab('P(in best model)') + - xlab('') - - -%%end.rcode - - -%%begin.rcode fstITPlots2, fig.show = extraFigs, fig.cap = "Akaike variable weights for both analyses using the phylogeny from \\textcite{jones2005bats}. The probability that each variable is in the best model (amongst the models test) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. The red ``Random'' box is the uniformly random variable.", fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', out.extra = 'trim = 0 1cm 0 0' - - -# Reorder var levels to get structure at beginning. 
-fstSepVarWeights2$variable <- factor(fstSepVarWeights2$variable, levels(fstSepVarWeights2$variable)[c(2, 1, 3, 4, 5)]) - -# Draw the fst model selection plot -fstIT2 <- ggplot(fstSepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) + - geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) + - scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) + - scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) + - ylim(0, 1) + - theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'), - panel.grid.major.x = element_blank(), - axis.text.y = element_text(size = 8)) + - scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) + - scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1)) + - ylim(0, 1) + - ylab('P(in best model)') + - xlab('') - - -# Combine and print. -ggdraw() + - draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(ITPlot2, 0, 0, 0.5, 1) + - draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(fstIT2, 0.5, 0.164, 0.5, 0.855) + - draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12) - - -%%end.rcode - diff --git a/pop-structure-path-richness-mechanistic.Rtex b/pop-structure-path-richness-mechanistic.Rtex new file mode 100644 index 0000000..b48b0d1 --- /dev/null +++ b/pop-structure-path-richness-mechanistic.Rtex @@ -0,0 +1,1503 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code and text for "Understanding how population structure affects pathogen richness in a mechanistic model of bat populations" +% Chapter 3 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas 
+% +% NB This file is a copy due to the mess up with chapter numbers. +% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter2.Rtex +% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + +%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide' + +#################################### +### Important simulation options ### +#################################### + +# Compilation options +# Run simulations? This will take many hours +runAllSims <- FALSE + +# Save raw simulation output +# This will take ~10GB or so. +# If false, summary statistics of each simulation are saved instead. +saveData <- FALSE + + +# How many cores do you want to use to run simulations? +nCores <- 7 + +########################## +### End options ### +########################## + + +opts_chunk$set(cache.path = '.Ch2Cache/') + +source('misc/theme_tcdl.R') +source('misc/KnitrOptions.R') +theme_set(theme_grey() + theme_tcdl) + +%%end.rcode + + +%%begin.rcode libs, cache = FALSE, result = FALSE + +# My package. For running and analysing Epidemiological sims. +# https://github.com/timcdlucas/metapopepi +library(MetapopEpi) + +# Data manipulations +library(reshape2) +library(dplyr) + +# Calc confidence intervals (could probably do with broom instead now.) +library(binom) + +# To tidy up stats models/tests +library(broom) + + +# Run simulations in parallel +library(parallel) + +# Plotting +library(ggplot2) +library(palettetown) +library(cowplot) + +%%end.rcode + + +\section{Abstract} + +%\tmpsection{One or two sentences providing a basic introduction to the field} +% comprehensible to a scientist in any discipline. +%\lettr{A}n increasingly large proportion of emerging human diseases comes from animals. +%These diseases have a huge impact on human health, healthcare systems and economic development. 
+The chance that a new zoonosis will come from any particular wild host species increases with the number of pathogen species occurring in that host species. +Comparative, phylogenetic studies have shown that host-species traits such as population density and population structure correlate with pathogen richness. +However, the mechanisms by which these factors control pathogen richness in wild animal species remain unclear. +% +% +%\tmpsection{Two to three sentences of more detailed background} +% comprehensible to scientists in related disciplines. +% Add mechanistic vs empirical +Typically it is assumed that well-connected, unstructured populations (that therefore have a high basic reproductive number, $R_0$) promote the invasion of new pathogens and therefore increase pathogen richness. However, this assumption is largely untested in the multipathogen context. +In the presence of inter-pathogen competition, the opposite effect might occur; increased population structure may increase pathogen richness by reducing the effects of competition. +A more mechanistic understanding of how population structure affects pathogen richness could discriminate between these two broad hypotheses. +Here I have examined one mechanism by which increased population structure may cause greater pathogen richness. +I used simulations to test whether increased population structure could increase the probability that a newly evolved pathogen would invade into a population already infected with an identical, endemic pathogen. +I tested this hypothesis using individual-based, metapopulation networks parameterised to mimic wild bat populations as bats have highly varied social structures and have recently been implicated in a number of high-profile diseases such as Ebola, SARS, Hendra and Nipah. +In a metapopulation, dispersal rate and the number of links between colonies can both affect population structure. 
+I tested whether either of these factors could increase the probability that a pathogen would invade and persist in the population. +I found that, at intermediate transmission rates, increasing dispersal rate significantly increased the probability of a newly evolved pathogen invading into the metapopulation. +However, there was very limited evidence that the number of links between colonies affected pathogen invasion probability. + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Introduction} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +%Possible structure (each number a sep. paragraph): +%1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown +%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important +%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach +%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms +%5. results from analytical models so far and limitations of the approach +%6. what is needed now +%7. what your focus is (including a bit about bat focus) +%8. 'here I show..' what you found briefly to lead into methods + + + + +\tmpsection{General Intro} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% A basic introduction to the field, +% comprehensible to a scientist in any discipline. + +%1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown +\tmpsection{Why is pathogen richness important?} 
+Over 60\% of emerging infectious diseases have an animal source \cite{jones2008global, smith2014global}. +Zoonotic pathogens can be highly virulent \cite{luby2009recurrent, lefebvre2014case}, can have huge public health impacts \cite{granich2015trends} and economic costs \cite{knobler2004learning}, and can slow down international development \cite{ebolaWorldbank}. +Therefore understanding and predicting changes in the process of zoonotic spillover is a global health priority \cite{taylor2001risk}. +The number of pathogen species hosted by a wild animal species affects the chance that a disease from that species will infect humans \cite{wolfe2000deforestation}. +However, the factors that control the number of pathogen species in a wild animal population are still unclear \cite{metcalf2015five}; in particular our mechanistic understanding of how population processes inhibit or promote pathogen richness is poor. + + + +\tmpsection{Specific Intro} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% more detailed background} +% comprehensible to scientists in related disciplines. + + +\tmpsection{We know some factors that correlate with pathogen richness} +%population density, longevity, body size and population structure + +%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important +%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach + +In comparative studies, a number of host traits have been shown to correlate with pathogen richness including body size \cite{kamiya2014determines, arneberg2002host}, population density \cite{nunn2003comparative, arneberg2002host} and range size \cite{bordes2011impact, kamiya2014determines}. +A further factor that may affect pathogen richness is population structure. 
+In comparative studies it is often assumed that factors that promote fast disease spread should promote high pathogen richness; the faster a new pathogen spreads through a population, the more likely it is to persist \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. +However, this assumption ignores competitive mechanisms such as cross-immunity and depletion of susceptible hosts. +If competitive mechanisms are strong, endemic pathogens in populations with high $R_0$ will be able to easily out-compete invading pathogens. +Only if competitive mechanisms are weak will high $R_0$ enable the invasion of new pathogens and allow higher pathogen richness. + +Overall, the evidence from comparative studies indicates that increased population structure correlates with higher pathogen richness. +This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +However, there are a number of studies that contradict this conclusion \cite{gay2014parasite, bordes2007rodent, ezenwa2006host}. +Comparative studies are often contradictory due to small sample sizes, noisy data and because empirical relationships often do not extrapolate well to other taxa. +Furthermore, multicollinearity between many traits also makes it hard to clearly distinguish which factors are important \cite{nunn2015infectious}. +However, meta-analyses can be used to combine studies to help generalise conclusions \cite{kamiya2014determines}. + + +%3. 
limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach + +Furthermore, knowing which factors correlate with pathogen richness does not tell us if, or how, they causally control pathogen richness. +This lack of a solid mechanistic understanding of these processes prevents predictions of how wild populations will respond to perturbations such as increased human pressure and global change. +As habitats fragment we expect wild populations to change in a number of ways including becoming smaller and less well connected \cite{andren1994effects, cushman2012separating}. +As multiple population-level factors are likely to change simultaneously due to global change, the correlative relationships examined in comparative studies are unlikely to effectively predict future changes in pathogen richness. +Mechanistic models are needed to project how these highly non-linear disease systems will respond to the multiple, simultaneous stressors affecting them. + + + +\tmpsection{Network structure has been studied} +%5. results from analytical models so far and limitations of the approach +%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms + +There are a number of mechanisms by which population structure could increase pathogen richness. +Firstly, population structure may reduce competition between pathogens. +In analytical models of well-mixed populations competitive exclusion has been predicted \cite{ackleh2003competitive, bremermann1989competitive, martcheva2013competitive, qiu2013vector, allen2004sis}. +In models where competitive exclusion occurs in well-mixed populations, population structure has sometimes been shown to allow coexistence \cite{qiu2013vector, allen2004sis, nunes2006localized, garmer2016multistrain}. 
+Alternatively, population structure may promote the evolution of new strains within a species \cite{buckee2004effects}, reduce the rate of pathogen extinction \cite{rand1995invasion} or increase the probability of pathogen invasion from other host species \cite{nunes2006localized}. +These separate mechanisms have not been examined and it is difficult to see how they could be distinguished through comparative methods. + +%Competing epidemics, or two pathogens spreading at the same time in a population, is a well studied area \cite{poletto2013host, poletto2015characterising, karrer2011competing}. +%This area is related to the study of pathogen richness in that they indicate that dynamics of multiple pathogens in a population do depend on population %structure. +%However, the results for short term epidemic competition do not directly transfer to the study of long term disease persistence. + + +%6. what is needed now +\tmpsection{The gap} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% One sentence clearly stating the general problem +% being addressed by this particular study. +% By this stage, must have defined/introduced all terms used within. + +Currently, the literature contains very abstract, simplified models \cite{qiu2013vector, allen2004sis, garmer2016multistrain, may1994superinfection}. +These cannot be easily applied to real data. +They also do not easily give quantitative predictions of pathogen richness; typically they predict either no pathogen coexistence \cite{bremermann1989competitive, martcheva2013competitive} or infinite pathogen richness \cite{may1994superinfection}. +Models that can give quantitative predictions of pathogen richness in wild populations are more applicable to real-world issues such as zoonotic disease surveillance. 
+While predicting an absolute value of pathogen richness for a wild species is likely to be impossible, models that attempt to rank species from highest to lowest pathogen richness are still useful for prioritising species for surveillance. +This requires a middle ground of model complexity. + +\tmpsection{What I did} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%7. what your focus is (including a bit about bat focus) + +In order to capture this middle ground, I have used metapopulation models. +Unlike two-patch models that are used to add population structure while keeping model complexity to a minimum \cite{qiu2013vector, allen2004sis, garmer2016multistrain}, the metapopulations used here split the population into multiple subpopulations. +I have used two independent variables that alter population structure: dispersal rate and metapopulation network topology. +I have studied the invasion of new pathogens as a mechanism for increasing pathogen richness. +In particular I have focused on studying the invasion of a newly evolved pathogen that is therefore identical in epidemiological parameters to the endemic pathogen. +Furthermore, this close evolutionary relationship means that competition via cross-immunity is strong. + +\tmpsection{Why bats} +The metapopulations were parameterised to broadly mimic wild bat populations. +Population structure has already been found to correlate with pathogen richness in bats (Chapter~\ref{ch:empirical}, \cites{gay2014parasite, maganga2014bat, turmelle2009correlates}). +Furthermore, bats have an unusually large variety of social structures. +Colony sizes range from ten to 1 million individuals \cite{jones2009pantheria} and colonies can be very stable \cite{kerth2011bats, mccracken1981social}. +This strong colony fidelity means they fit the assumptions of metapopulations well. +Bats have also, over the last decade, become a focus for disease research \cite{calisher2006bats, hughes2007emerging}. 
+The reason for this focus is that they have been implicated in a number of high profile diseases including Ebola, SARS, Hendra and Nipah \cite{calisher2006bats, li2005bats}. + +%8. 'here I show..' what you found briefly to lead into methods + +\tmpsection{What I found} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% One sentence summarising the main result +% (with the words “here we show” or their equivalent). + +Here I found that, given the assumptions of a metapopulation, increased dispersal significantly increased the probability of invasion of new pathogens. +Furthermore, structured populations nearly always had a lower probability of pathogen invasion than fully-mixed populations of equal size. +The topology of the network did not strongly affect the probability of pathogen invasion as long as the population was not completely unconnected. +Overall, I found significant evidence that reduced population structure increases the probability of invasion of a new pathogen, implying a role for the generation of pathogen richness more generally. + +\begin{figure}[t] +\centering + \includegraphics[width=0.5\textwidth]{imgs/SIRoption1.pdf} + \caption[Schematic of the SIR model used]{ + Schematic of the SIR model used. + Individuals are in one of five classes, susceptible (orange, $S$), infected with Pathogen 1, Pathogen 2 or both (blue, $I_1, I_2, I_{12}$) or recovered and immune from further infection (green, $R$). + Transitions between epidemiological classes occur as indicated by solid arrows. + Vital dynamics (births and deaths) are indicated by dashed arrows. + Parameter symbols for transitions are indicated. + Note that individuals in $I_{12}$ move into $R$, not back to $I_1$ or $I_2$. + That is, recovery from one pathogen causes immediate recovery from the other pathogen. 
+ } +\label{f:sir} +\end{figure} + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Methods} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%% + + +\subsection{Two pathogen SIR model} + +I developed a multipathogen, SIR compartment model with individuals being classed as susceptible, infected or recovered with immunity (Figure~\ref{f:sir}). +Susceptible individuals are counted in class $S$ (see Table~\ref{t:params} for a list of symbols and values used). +There are three infected classes, $I_1$, $I_2$ and $I_{12}$, being individuals infected with Pathogen 1, Pathogen 2 or both respectively. +Recovered individuals, $R$, are immune to both pathogens, even if they have only been infected with one (i.e.\ there is complete cross-immunity). +Furthermore, recovery from one pathogen moves an individual straight into the recovered class, even if the individual is infected with both pathogens (Figure~\ref{f:sir}). +This modelling choice allows the model to be easily expanded to include more than two pathogens, though this study is restricted to two pathogens. +The assumption of immediate recovery from all other diseases is likely to be reasonable. +Any up-regulation of innate immune response will affect both pathogens equally. +Furthermore, as the pathogens are identical, any acquired immunity would also affect both pathogens equally. + +The coinfection rate (the rate at which an infected individual is infected with a second pathogen) is adjusted compared to the infection rate by a factor $\alpha$. +As in \textcite{castillo1989epidemiological}, low values of $\alpha$ imply lower rates of coinfection. 
+In particular, $\alpha = 0$ indicates no coinfections, $\alpha = 1$ indicates that coinfections happen at the same rate as first infections while $\alpha > 1$ indicates that coinfections occur more readily than first infections. + + +\begin{figure}[t] +{\centering +\subfloat[Minimally connected\label{fig:fullyConnected}]{ + \includegraphics[width=0.45\textwidth]{imgs/minimallyConnected.pdf} +} +\subfloat[Fully connected +\label{fig:minimallyConnected}]{ + \includegraphics[width=0.45\textwidth]{imgs/fullyConnected.pdf} +} +} +\caption[Network topologies used to compare network connectedness]{ +The two network topologies used to test whether network connectedness influences a pathogen's ability to invade. +A) Animals can only disperse to neighbouring colonies. +B) Dispersal can occur between any colonies. +Blue circles are colonies of \num{3000} individuals. +Dispersal only occurs between colonies connected by an edge (black line). +The dispersal rate is held constant between the two topologies. +} +\label{f:net} +\end{figure} + + +To study the long-term persistence of pathogens it is necessary to include vital dynamics (births and deaths) as the SIR model without vital dynamics has no endemic state. +Birth and death rates ($\Lambda$ and $\mu$) are set as being equal meaning the population does not systematically increase or decrease. +The population size does however change as a random walk. +Newborn individuals enter the susceptible class. +Infection and coinfection were assumed to cause no extra mortality as, for a number of viruses, bats show no clinical signs of infection \cite{halpin2011pteropid, deThoisy2016bioecological}. 
+%In humans, coinfection generally worsens health \cite{griffiths2011nature} but as there are + +\tmpsection{Metapopulation} + + +\begin{table}[tb] +\centering +\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. +The justifications for parameter values are given in Section~\ref{s:paramSelect}.} + +\begin{tabular}{@{}lp{6cm}p{2.9cm}r@{}} +\toprule +Symbol & Explanation & Units & Value\\ +\midrule +$\rho$ & Number of pathogens && 2\\ +$x, y$ & Colony index &&\\ +$p$ & Pathogen index i.e.\ $p\in\{1,2\}$ for pathogens 1 and 2 &&\\ +$q$ & Disease class i.e.\ $q\in\{1,2,12\}$ &&\\ +$S_x$ & Number of susceptible individuals in colony $x$ &&\\ +$I_{qx}$ & Number of individuals infected with disease(s) $q \in \{1, 2, 12\}$ in colony $x$ &&\\ +$R_x$ & Number of individuals in colony $x$ in the recovered with immunity class &&\\ +$N$ & Total population size && 30,000\\ +$m$ & Number of colonies && 10\\ +$n$ & Colony size && 3,000\\ +$a$ & Area & \si{\square\kilo\metre} & 10,000\\ +$\beta$ & Transmission rate && 0.1--0.4\\ +$\alpha$ & Coinfection adjustment factor && 0.1\\ +$\gamma$ & Recovery rate & year$^{-1}.$individual$^{-1}$ & 1\\ +$\xi$ & Dispersal rate & year$^{-1}.$individual$^{-1}$ & 0.001--0.1\\ +$\Lambda$ & Birth rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ +$\mu$ & Death rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ +$k_x$ & Degree of node $x$ (number of colonies that individuals from colony $x$ can disperse to). &&\\ +$\delta$ & Waiting time until next event & years &\\ +$e_i$ & The rate at which event $i$ occurs & year$^{-1}$ &\\ +\bottomrule +\end{tabular} + +\label{t:params} +\end{table} + + +The population is modelled as a metapopulation, being divided into a number of subpopulations (colonies). +This model is an intermediate level of complexity between fully-mixed populations and contact networks. 
+There is ample evidence that bat populations are structured to some extent. +This evidence comes from the existence of subspecies, measurements of genetic dissimilarity and ecological studies \cite{kerth2011bats, mccracken1981social, burns2014correlates, wilson2005mammal}. +Therefore, a fully mixed population is a large oversimplification. +However, modelling the full contact network relies on detailed knowledge of individual behaviour, which is rarely available. + +The metapopulation is modelled as a network with colonies being nodes and dispersal between colonies being indicated by edges (Figure~\ref{f:net}). +Individuals within a colony interact randomly so that the colony is fully mixed. +Dispersal between colonies occurs at a rate $\xi$. +Individuals can only disperse to colonies connected to theirs by an edge in the network. +The per-individual rate of dispersal is not affected by the number of edges a colony has (known as the degree of the colony and denoted $k$). +Therefore, the per-individual dispersal rate from a colony $y$ with degree $k_y$ to colony $x$ is $\xi / k_y$. +Note that this rate is not affected by the degree or size of colony $x$. + + + +\tmpsection{Stochastic simulations} + +I examined this model using stochastic, continuous-time simulations implemented in \emph{R} \cite{R}. +The implementation is available as an \emph{R} package on GitHub \cite{metapopepi}. +The model can be written as a continuous-time Markov chain. +The Markov chain contains the random variables $((S_x)_{x = 1\ldots m}, (I_{qx})_{x =1\ldots m,\:q \in \{1, 2, 12\}}, (R_x)_{x = 1\ldots m})$. +Here, $(S_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of susceptibles in each colony. +$(I_{qx})_{x =1\ldots m, q \in \{1, 2, 12\}}$ is a vector of length $3m$ describing the number of individuals of each disease class ($q \in \{1, 2, 12\}$) in each colony. +Finally, $(R_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of individuals in the recovered class. 
+The model is a Markov chain where extinction of both pathogen species and extinction of the host species are absorbing states. +The expected time for the host to go extinct is much larger than the duration of the simulations. + +At any time, suppose the system is in state $((s_x), (i_{qx}), (r_x))$. +At each step in the simulation we calculate the rate at which each possible event might occur. +One event is then randomly chosen with probability proportional to its rate +\begin{align} + p(\text{event } i) = \frac{e_i}{\sum_j e_j}, +\end{align} +where $e_i$ is the rate at which event $i$ occurs and $\sum_j e_j$ is the sum of the rates of all possible events. +Finally, the length of the time step, $\delta$, is drawn from an exponential distribution whose rate is the total event rate +\begin{align} + \delta \sim \operatorname{Exp}\left(\sum_j e_j \right). +\end{align} + + +We can now write down the rates of all events. +%I defined $I^+_p$ to be the sum of all classes that are infectious with pathogen $p$, for example $I^+_1 = I_1 + I_{12}$. +Assuming asexual reproduction, that all classes reproduce at the same rate and that individuals are born into the susceptible class, we get +\begin{align} + s_x \rightarrow s_x + 1 \;\;\;\text{at a rate of}\;\; \Lambda\left( s_{x}+\sum_q i_{qx} + r_{x}\right), +\end{align} +where $s_x \rightarrow s_x + 1$ is the event that the number of susceptibles in colony $x$ will increase by 1 (a single birth) and $\sum_q i_{qx}$ is the sum of all infection classes $q~\in~\{1, 2, 12\}$. +The rates of death, given a death rate $\mu$ and no increased mortality due to infection, are given by +\begin{align} + s_x \rightarrow s_x-1 &\;\;\;\text{at a rate of}\;\; \mu s_x, \\ + i_{qx} \rightarrow i_{qx}-1 &\;\;\text{at a rate of}\;\; \mu i_{qx},\\ + r_x \rightarrow r_x-1 &\;\;\;\text{at a rate of}\;\; \mu r_x. +\end{align} + + + +I modelled transmission as being density-dependent. 
+This assumption was more suitable than frequency-dependent transmission as I was modelling a disease transmitted by saliva or urine in dense populations confined to caves, buildings or potentially a small number of tree roosts. +I was notably not modelling a sexually transmitted disease (STD) as spillover of STDs from bats to humans is likely to be rare. +Infection of a susceptible with either Pathogen 1 or 2 is therefore given by +\begin{align} + i_{1x} \rightarrow i_{1x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{1x} + i_{12x}\right),\\ + i_{2x} \rightarrow i_{2x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{2x} + i_{12x}\right), +\end{align} +while coinfection, given the coinfection adjustment factor $\alpha$, is given by +\begin{align} + i_{12x} \rightarrow i_{12x}+1,\;\;\; i_{1x} \rightarrow i_{1x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{1x}\left(i_{2x} + i_{12x}\right),\\ + i_{12x} \rightarrow i_{12x}+1,\;\;\; i_{2x} \rightarrow i_{2x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{2x}\left(i_{1x} + i_{12x}\right). +\end{align} +Note that lower values of $\alpha$ give lower rates of coinfection, as in \textcite{castillo1989epidemiological}. + + +The rate of migration from colony $y$ (with degree $k_y$) to colony $x$, given a dispersal rate $\xi$, is +\begin{align} + s_x \rightarrow s_x+1,\;\;\; s_y \rightarrow s_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi s_y}{k_y},\\ + i_{qx} \rightarrow i_{qx}+1,\;\;\; i_{qy} \rightarrow i_{qy}-1 &\;\;\text{at a rate of}\;\; \frac{\xi i_{qy}}{k_y},\\ + r_x \rightarrow r_x+1,\;\;\; r_y \rightarrow r_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi r_y}{k_y}. +\end{align} +Note that the dispersal rate does not change with infection. +As above, this is due to the low virulence of bat viruses. 
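+
+As a worked example of the migration rates (illustrative numbers only, assuming a colony $y$ that happens to contain $s_y = 3000$ susceptible individuals and a dispersal rate of $\xi = 0.01$), a colony in the fully connected network of ten colonies has $k_y = 9$, whereas a colony in the minimally connected network, where each colony has two neighbours, has $k_y = 2$.
+The rate at which susceptibles move from $y$ to one particular neighbour $x$ is then
+\begin{align}
+  \frac{\xi s_y}{k_y} = \frac{0.01 \times 3000}{9} \approx 3.3 \;\;\text{per year}
+  \qquad\text{or}\qquad
+  \frac{\xi s_y}{k_y} = \frac{0.01 \times 3000}{2} = 15 \;\;\text{per year},
+\end{align}
+respectively.
+Summed over all $k_y$ neighbours, the total rate of emigration from colony $y$ is $\xi s_y = 30$ per year in both cases, so topology changes where dispersers go but not how many leave.
+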
+Finally, recovery from any infectious class occurs at a rate $\gamma$ +\begin{align} + i_{qx} \rightarrow i_{qx}-1,\;\; r_x \rightarrow r_x+1 \;\;\text{at a rate of}\;\; \gamma i_{qx}. +\end{align} + + +%%begin.rcode SimLengths + + # These apply to both topo and disp sims. And probably should apply to extinction sims if I include them. + # How long should the simulation last? + nEvent <- 8e5 + + # When should the invading pathogen be added. + invadeT <- 3e5 + +%%end.rcode + + + + + + +% ------------------------------------------------------------------ % +% Dispersal Sims +% ------------------------------------------------------------------ % + + +%%begin.rcode DispSimsFuncs + + ################################# + # Dispersal sim definitions # + ################################# + + + # How often should the population be sampled. Only sampled populations are saved. + sample <- 1000 + + # How many simulations to run? + each <- 100 + nDispSims <- 12 * each + + +# Define our simulation function. +fullSim <- function(x){ + + dispVec <- rep(c(0.001, 0.01, 0.1), each = nDispSims/3 +1) + disp <- dispVec[x] + + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4 ), nDispSims/3 + 1) + tran <- tranVec[x] + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. 
+ p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/DispSims/DispSim_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runDispSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 33355 +set.seed(seed) + +# If we want to save the data, make a directory for it. +if(saveData){ + dir.create('data/Chapter2/DispSims/') +} + +# Run sims. +z <- mclapply(1:nDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/DispSims.csv') + +%%end.rcode + + + + + + +%%begin.rcode extraMidBeta, eval = runAllSims + + +nExtraSims <- 150 + +# Define our simulation function. +fullSim <- function(x){ + + dispVec <- rep(c(0.001, 0.01, 0.1), each = nExtraSims/3 + 1) + disp <- dispVec[x] + + tranVec <- rep(c(0.2, 0.3), nExtraSims/2 + 1) + tran <- tranVec[x] + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. 
+ p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + + # Time until extinction + invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] + d$totalTime <- sum(p$sampleWaiting) + d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] + + d$pathInv <- sum(p$sample[c(1, 4), , dim(p$sample)[3]]) + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + rm(p) + + return(d) + +} + + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 787 +set.seed(seed) + + +# Run sims. 
+z <- mclapply(1:600, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/extraMidBeta.csv') + + +%%end.rcode + + + +% ------------------------------------------------------------------ % +% Topology Sims +% ------------------------------------------------------------------ % + + + +%%begin.rcode TopoSimsFuncs + + ################################# + # Topology sim definitions # + ################################# + + # How many simulations to run? + nTopoSims <- 8 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Chose maxdistance so that we get either fully connected or circle networks. + mxDis <- rep(c(40, 200), each = nTopoSims/2 + 1)[x] + + # Chose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nTopoSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.01, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = mxDis, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + for(i in 1:10){ + p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) + } + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[2, 1, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? 
+ invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/TopoSims/TopoSim_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runTopoSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 1230202 +set.seed(seed) + +# If we want to save the data, make a directory for it. +if(saveData){ + dir.create('data/Chapter2/TopoSims/') +} + +# Run sims. +z <- mclapply(1:nTopoSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/TopoSims.csv') + +%%end.rcode + + + + + + +% ------------------------------------------------------------------ % +% Unstructured Sims +% ------------------------------------------------------------------ % + +%%begin.rcode unstructuredSimsFuncs + + ################################# + # Topology sim definitions # + ################################# + + # How many simulations to run? + nUnstructuredSims <- 4 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Chose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nUnstructuredSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). 
+ simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 29800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + p$I[2, 2, 1] <- 200 + p$I[1, 1, 1] <- 0 + + # Recalculate rates of each event after seeding. + p <- transRates(p, 1) + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[3, 2, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion succesful? + invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + if(saveData){ + file <- paste0('data/Chapter2/UnstructuredSims/UnstructuredSims_', formatC(x, width = 4, flag = '0'), '.RData') + save(p, file = file) + } + + rm(p) + + return(d) + +} +%%end.rcode + +%%begin.rcode runUnstructuredSim, eval = runAllSims, cache = TRUE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 13 +set.seed(seed) + +# If we want to save the data, make a directory for it. 
+if(saveData){ + dir.create('data/Chapter2/UnstructuredSims/') +} + +# Run sims. +z <- mclapply(1:nUnstructuredSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + +# Save summary data. +write.csv(z, file = 'data/Chapter2/unstructuredSims.csv') + +%%end.rcode + + + + + +%%begin.rcode noDispSimsFuncs + + + ################################# + # No-dispersal sim definitions # + ################################# + + # How many simulations to run? + nNoDispSims <- 4 * each + +# Define our simulation function. +fullSim <- function(x){ + + + # Choose transmission rates. + tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nNoDispSims/4 + 1) + tran <- tranVec[x] + + + # Set seed (this is set within each parallel simulation to prevent reusing random numbers). + simSeed <- paste0(seed, x) + set.seed(simSeed) + + # Make the population. + p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 2800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') + + # Seed endemic pathogen. + p$I[2, 2, 1] <- 200 + p$I[1, 1, 1] <- 0 + + # Recalculate rates of each event after seeding. + p <- transRates(p, 1) + + # Burn in simulation + p <- runSim(p, end = invadeT) + + # Seed invading pathogen. + p$I[3, 2, (invadeT + 1) %% sample] <- 5 + + # Recalculate rates of each event after seeding. + p <- transRates(p, (invadeT + 1) %% sample) + + # Continue simulation + p <- runSim(p, start = invadeT + 1, end = 'end') + + # Was the invasion successful? 
+ invasion <- findDisDistr(p, 2)[1] > 0 + + # Save summary stats + d <- data.frame(transmission = NA) + + d$transmission <- p$parameters['transmission'] + d$dispersal <- p$parameters['dispersal'] + d$nExtantDis <- sum(findDisDistr(p, 2) > 0) + d$singleInf <- findCoinfDistr(p, 2)[2] + d$doubleInf <- findCoinfDistr(p, 2)[3] + d$nColonies <- p$parameters['nColonies'] + d$nPathogens <- p$parameters['nPathogens'] + d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] + d$maxDistance <- p$parameters['maxDistance'] + d$nEvents <- p$parameters['events'] + d$path2 <- sum(p$sample[c(2, 4), , dim(p$sample)[3]]) + + # Time until extinction + invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) + + d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] + d$totalTime <- sum(p$sampleWaiting) + d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] + + + + message(paste0("finished ", x, ". Invasion: ", invasion )) + + + rm(p) + + return(d) + +} + + + +%%end.rcode + +%%begin.rcode runNoDispSims, eval = runAllSims, cache = FALSE + +# Create and set seed (seed object is used to set seed in each seperate simulation.' +seed <- 19 +set.seed(seed) + +# Run sims. +z <- mclapply(1:nNoDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) + +z <- do.call(rbind, z) + + + +# Save summary data. +write.csv(z, file = 'data/Chapter2/noDispSims.csv') + +%%end.rcode + + + +\subsection{Parameter selection} +\label{s:paramSelect} + +The fixed parameters were chosen to roughly reflect realistic wild bat populations. +The death rate $\mu$ was set as 0.05 per year giving a generation time of 20 years. +The birth rate $\Lambda$ was set to be equal to $\mu$. +This yields a population that does not systematically increase or decrease. +However, the size of each colony changes as a random walk. 
+Given the length of the simulations, colonies were very unlikely to go extinct (Figure~\ref{fig:plotsNoInvade2}). +The starting size of each colony was set to \num{3000}. +This is appropriate for many bat species \cite{jones2009pantheria}, especially the large, frugivorous \emph{Pteropodidae} that have been particularly associated with recent zoonotic diseases. + +The recovery rate $\gamma$ was set to one, giving an average infection duration of approximately one year. +This is therefore a long-lasting infection but not a chronic infection. +It is very difficult to directly estimate infection durations in wild populations but it seems that these infections might sometimes be long-lasting \cite{peel2012henipavirus, plowright2015ecological}. +However, other studies have found much shorter infectious periods \cite{amengual2007temporal}. +These shorter infections are not studied further here. %todo consider readding this. " as preliminary simulations found that they could not persist in the relatively small populations being modelled here." + +Four values of the transmission rate $\beta$ were used: 0.1, 0.2, 0.3 and 0.4. +These values were chosen to cover the range of behaviours, from very high to very low probabilities of invasion of the second pathogen. +All simulations were run under all four transmission rates as this is such a fundamental parameter. +The coinfection adjustment factor, $\alpha$, was set to 0.1 so that an individual infected with one pathogen is 90\% less likely to be infected with another than a susceptible individual is. +This is a rather arbitrary value. +However, the rationale of the model was that the invading species might be a newly speciated strain of the endemic species. +Furthermore, the model assumes complete cross-immunity after recovery from infection. +Therefore, cross-immunity to coinfection is likely to be very strong as well. 
+Some pairs of closely related bat viruses have been found to coinfect individual bats less often than would be expected by chance \cite{anthony2013strategy}. +This indicates a level of cross-immunity between these pairs of viruses. %todo I'm sure there was a marburg ebola paper... + + + + + +\subsection{Experimental setup} + +The metapopulation was made up of ten colonies. +This number was selected as a trade-off between computation time and a network complex enough that any effects of population structure could be detected. +This value is artificially small compared to wildlife populations. +In each simulation, the na{\"i}ve population was seeded with ten sets of 200 individuals infected with Pathogen 1. +These groups were seeded into randomly selected colonies with replacement. +For every 200 infected individuals added, 200 susceptible individuals were removed to keep starting colony sizes constant. +Pathogen 1 was then allowed to spread until the initial, large epidemic had ended. +Visual inspection of preliminary simulations was used to decide on \num{\rinline{invadeT}} events as being long enough for the epidemic to end and the pathogen to be in an endemic state (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). +After \num{\rinline{invadeT}} events, five individuals infected with Pathogen 2 were added to one randomly selected colony. +After another \num{\rinline{nEvent - invadeT}} events, the invasion of Pathogen 2 was considered successful if any individuals were still infected with Pathogen 2. +That is, if at least one individual was in class $I_2$ or $I_{12}$ at the end of the simulation, this was considered an invasion. +Again, visual inspection of preliminary simulations was used to determine that after \num{\rinline{nEvent - invadeT}} events, if an invading pathogen was still present, it was well established (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). 
+ +The choice to use a fixed number of events, rather than a fixed number of years, was for computational convenience. +However, this choice creates a risk of bias as simulations with a greater total rate of events, $\sum_j e_j$ (e.g.\ faster disease transmission), will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). +Nevertheless, visual inspection of the dynamics of disease extinction (Figure~\ref{fig:plotsNoInvade1}) and examination of the typical time to extinction suggest that this bias is negligible. +For example, of the simulations where extinction occurred, it occurred more than 50 years before the end of the simulation in 90\% of cases. +On a preliminary run of 106 simulations across all combinations of dispersal and transmission rates, examining the population after \num{700000} events instead of \num{\rinline{nEvent}} events gave exactly the same result with respect to the binary state of invasion or no invasion. + + +\subsection{Population structure} + +As a baseline for comparison, I ran simulations of a fully unstructured population. +These simulations were run with a population of \num{30000} so that the total population size was equal to the total metapopulation size in the structured simulations. +I ran 100 simulations at each transmission rate. + + +\tmpsection{Dispersal} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +Two parameters control population structure in the model: dispersal rate and the topology of the metapopulation network. +The values used for these parameters were chosen to highlight the effects of population structure. +I selected the dispersal rates $\xi = 0, 0.001, 0.01$ and $0.1$ dispersals per individual per year. +Because dispersal (at rate $\xi$) and death (at rate $\mu$) are competing events with constant rates, the probability that an individual disperses at least once in its lifetime is given by $\xi / \left(\xi + \mu\right)$. +Therefore, $\xi = 0.1$ corresponds to 67\% of individuals dispersing between colonies at least once in their lifetime. 
+Exclusively juvenile dispersal would have dispersal rates similar to this value. %todo cite +$\xi = 0.01$ corresponds to 17\% of individuals dispersing at least once in their lifetime. +This value is relatively close to that expected under male-biased dispersal with female philopatry. %todo cite +$\xi = 0.001$ corresponds to 2\% of individuals dispersing during their lifetime. +This therefore represents a species that does not habitually disperse. +Finally, I ran simulations with no dispersal. +Given zero dispersal, only the colony seeded with Pathogen 2 could ever receive infections of the invading pathogen. +Therefore, only one colony was simulated for $\xi = 0$. +While altering the dispersal rate I used a fully connected network topology. +I ran 100 simulations for most parameter sets. +Preliminary simulations indicated that any effects of population structure would most likely be seen at $\beta = 0.2$ and $0.3$. +I therefore ran 150 simulations for $\xi = 0.1, 0.01$ and $0.001$ at these two transmission rates to increase statistical power. + + + +\tmpsection{Network structure} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +I also altered the topology of the metapopulation network. +The network topology was created to be either fully or minimally connected (Figure~\ref{f:net}). +To model a completely unconnected population, the $\xi = 0$ simulations from above were used. +While altering network topology, the intermediate dispersal rate, $\xi = 0.01$, was used. +I again ran 100 simulations for each parameter set. + + + +\subsection{Statistical analysis} +\sloppy +I used binomial generalised linear models (GLMs) with a binary response variable, invasion or not, to test the hypothesis that the probability of invasion increased with dispersal. +I examined $p$ values and the regression coefficient, $b$, from each model. +Separate GLMs were fitted for each transmission rate. 
+These tests were performed both with and without the $\xi = 0$ results as the complete lack of dispersal makes these simulations qualitatively different to the other simulations. +To test whether the different topologies had different probabilities of invasion, I used Fisher's exact tests because topology is best described as a categorical variable. +As with the $\xi = 0$ results, these tests were performed both with and without the completely unconnected topology results. +Finally, I also used binomial GLMs to test the hypothesis that the probability of invasion increased with transmission rate. +Separate GLMs were fitted for each dispersal rate and network topology. +All statistical analyses were performed using the \emph{stats} package in \emph{R}. +The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/pop-structure-path-richness-mechanistic.Rtex}. +\fussy + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Results} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%%begin.rcode loadNoDispData +noDisp <- read.csv('data/Chapter2/noDispSims.csv', stringsAsFactors = FALSE) + +%%end.rcode + + +%%begin.rcode loadDispData + +# Read in the data. +# Later I'll add in the option to simulate the whole dataset. 
+dDisp <- read.csv('data/Chapter2/DispSims.csv', stringsAsFactors = FALSE) +dim(dDisp) +head(dDisp) + +extraMidBeta <- read.csv('data/Chapter2/extraMidBeta.csv', stringsAsFactors = FALSE) + +%%end.rcode + + + + + +%%begin.rcode DispDataOrganise + +# Combine the main, extra mid-beta and zero-dispersal runs (only the first 11 columns are shared). +dDisp <- rbind(dDisp, extraMidBeta[, 1:11], noDisp[, 1:11]) + + +# Which simulations ended with a successful invasion (all pathogens still extant)? + +dDisp$invasion <- dDisp$nPathogens - dDisp$nExtantDis == 0 + + + +# Number of simulations of each treatment +nDisp <- dDisp %>% + group_by(transmission, dispersal) %>% + dplyr::select(invasion) %>% + summarise(n()) + +nDisp + + +# Number of invasions by treatment +invsDisp <- dDisp %>% + group_by(transmission, dispersal) %>% + dplyr::select(invasion) %>% + filter(invasion == TRUE) %>% + summarise(n()) + +invsDisp + + +propsDisp <- left_join(nDisp, invsDisp, by = c('dispersal', 'transmission')) + +names(propsDisp) <- c( 'transmission', 'dispersal', 'n', 'invasions') + +# Treatments with no invasions are missing from invsDisp, so set their count to zero. +propsDisp$invasions[is.na(propsDisp$invasions)] <- 0 + +# Proportion of invasions in totals. 
+propsDisp$props <- propsDisp$invasions / propsDisp$n + +propsDisp + +%%end.rcode + + + + +%%begin.rcode DispPropTests + +# Fit binomial GLMs of invasion against dispersal rate, one for each transmission rate. + + +DispGLM1 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.1, ], family = binomial)) +DispGLM2 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.2, ], family = binomial)) +DispGLM3 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.3, ], family = binomial)) +DispGLM4 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.4, ], family = binomial)) + +# Repeat the GLMs without the zero-dispersal simulations. +dDispNoZero <- dDisp %>% dplyr::filter(dispersal != 0) +Disp2GLM1 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.1, ], family = binomial)) +Disp2GLM2 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.2, ], family = binomial)) +Disp2GLM3 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.3, ], family = binomial)) +Disp2GLM4 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.4, ], family = binomial)) + + +%%end.rcode + + + +%%begin.rcode DispTransPropTests + +# Finally, fit binomial GLMs of invasion against transmission rate, one for each dispersal rate. + +DispTransGLM1 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.001, ], family = binomial)) +DispTransGLM2 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.01, ], family = binomial)) +DispTransGLM3 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.1, ], family = binomial)) +DispTransGLM4 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0, ], family = binomial)) + +%%end.rcode + + + + +%%begin.rcode loadTopoData + +# Read in the data. +# Later I'll add in the option to simulate the whole dataset. 
+dTopo <- read.csv('data/Chapter2/TopoSims.csv', stringsAsFactors = FALSE) +dim(dTopo) +head(dTopo) + +dTopo <- rbind(dTopo, noDisp[, 1:11]) + +%%end.rcode + + +%%begin.rcode TopoDataOrganise + +# Which simulations have a successful invasion (no pathogen went extinct) + +dTopo$invasion <- dTopo$nPathogens - dTopo$nExtantDis == 0 + +# Number of invasions by treatment +invsTopo <- dTopo %>% + group_by(transmission, meanK) %>% + dplyr::select(invasion) %>% + filter(invasion == TRUE) %>% + summarise(n()) %>% + rbind(c(0.1, 1, 0), .) + +invsTopo + +# Number of simulations of each treatment +nTopo <- dTopo %>% + group_by(transmission, meanK) %>% + dplyr::select(invasion) %>% + summarise(n()) + +nTopo + + + +propsTopo <- left_join(nTopo, invsTopo, by = c('meanK', 'transmission')) + +names(propsTopo) <- c( 'transmission', 'meanK', 'n', 'invasions') + +propsTopo$invasions[is.na(propsTopo$invasions)] <- 0 + +# Proportion of simulations in each treatment with a successful invasion. +propsTopo$props <- propsTopo$invasions / propsTopo$n + +propsTopo + +%%end.rcode + + + + +%%begin.rcode TopoPropTests + +# Then run proportion tests between population structures + +TopoTest1 <- fisher.test(cbind(propsTopo$invasions[1:3], propsTopo$n[1:3] - propsTopo$invasions[1:3])) +TopoTest2 <- fisher.test(cbind(propsTopo$invasions[4:6], propsTopo$n[4:6] - propsTopo$invasions[4:6])) +TopoTest3 <- fisher.test(cbind(propsTopo$invasions[7:9], propsTopo$n[7:9] - propsTopo$invasions[7:9])) +TopoTest4 <- fisher.test(cbind(propsTopo$invasions[10:12], propsTopo$n[10:12] - propsTopo$invasions[10:12])) + + + +Topo2Test1 <- fisher.test(cbind(propsTopo$invasions[2:3], propsTopo$n[2:3] - propsTopo$invasions[2:3])) +Topo2Test2 <- fisher.test(cbind(propsTopo$invasions[5:6], propsTopo$n[5:6] - propsTopo$invasions[5:6])) +Topo2Test3 <- fisher.test(cbind(propsTopo$invasions[8:9], propsTopo$n[8:9] - propsTopo$invasions[8:9])) +Topo2Test4 <- fisher.test(cbind(propsTopo$invasions[11:12], propsTopo$n[11:12] - propsTopo$invasions[11:12])) + +%%end.rcode + + + + +%%begin.rcode
 TopoTransPropTests +#Finally run proportion tests between transmission rates + +#TopoTransTest1 <- fisher.test(cbind(propsTopo$invasions[c(1, 3, 5, 7)], propsTopo$n[c(1, 3, 5, 7)] - propsTopo$invasions[c(1, 3, 5, 7)])) +#TopoTransTest2 <- fisher.test(cbind(propsTopo$invasions[c(2, 4, 6, 8)], propsTopo$n[c(2, 4, 6, 8)] - propsTopo$invasions[c(2, 4, 6, 8)])) + + +TopoTransGLM1 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 9, ], family = binomial)) +TopoTransGLM2 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 2, ], family = binomial)) + +%%end.rcode + + +%%begin.rcode caption1String +# Just defining my caption label here to avoid the long string in chunk options below. + +invasionPropCaption <- sprintf(" + The probability of successful invasion for different A) dispersal rates and B) network topologies (with network topologies ``unconnected'', ``minimally connected'' and ``fully connected'' as in Figure~\\ref{f:net}). + Error bars are 95\\%% confidence intervals of the probability of invasion. + %i simulations were run for each treatment except $\\beta = 0.2$ and $0.3$ in A), which have 150 per treatment. + Other parameters were kept constant at: $m = 10,\\, \\, \\mu = \\Lambda = 0.05,\\, \\gamma = 1,\\, \\alpha = 0.1$. + When dispersal is varied, the population structure is fully connected.
+ When network topology is varied, $\\xi = 0.01$.", + as.integer(each)) + +invasionPropShort <- "The probability of invasion across different dispersal rates and network topologies" + + +%%end.rcode + + +%%begin.rcode invasionPropPlots, fig.lp = 'f:', fig.height = 2.6, out.width = "\\textwidth", fig.cap = invasionPropCaption, cache = FALSE, fig.scap = invasionPropShort + +propsDispCI <- data.frame(propsDisp[, 1:2], binom.confint(propsDisp$invasions, propsDisp$n, conf.level = 0.95, methods = "exact")) +# Plot zero dispersal at 1e-4 so it can be shown on the log-scaled x axis (labelled '0' below). +propsDispCI <- propsDispCI %>% mutate(dispersal = replace(dispersal, dispersal == 0, 1e-4)) +propsDispCI <- propsDispCI %>% mutate(transFactor = factor(transmission)) + +dispPlot <- ggplot(propsDispCI, aes(x = dispersal, y = mean, colour = transFactor)) + + geom_point() + + geom_line() + + scale_x_log10(breaks = c(1e-4, 1e-3, 1e-2, 1e-1), labels = c('0', '0.001', '0.01', '0.1')) + + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + + scale_colour_poke(name = expression(beta), + pokemon = 'illumise', + spread = 4) + + xlab('Dispersal') + + ylab('Prop. Invasions') + + theme(legend.position = "none", panel.grid.major.x = element_blank()) + + +propsTopoCI <- data.frame(propsTopo[, 1:2], binom.confint(propsTopo$invasions, propsTopo$n, conf.level = 0.95, methods = "exact")) + +propsTopoCI$topo <- factor(propsTopoCI$meanK, levels = c(1, 2, 9)) +propsTopoCI$topoCont <- as.numeric(propsTopoCI$topo) +propsTopoCI <- propsTopoCI %>% mutate(transFactor = factor(transmission)) + + +topoPlot <- ggplot(propsTopoCI, aes(x = topoCont, y = mean, colour = transFactor)) + + geom_point() + + geom_line() + + scale_x_continuous(breaks = c(1, 2, 3), + labels = c('Unconn.','Min.','Full.'), + limits = c(0.9, 3.1)) + + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + + scale_colour_poke(name = expression(beta), + pokemon = 'illumise', + spread = 4) + + + xlab('Network Topology') + + ylab('Prop.
Invasions') + + theme(legend.position = "none", panel.grid.major.x = element_blank()) + +# Extract the legend +grobs <- ggplotGrob(topoPlot + theme(legend.position="bottom"))$grobs +legend_b <- grobs[[which(sapply(grobs, function(x) x$name) == "guide-box")]] + + +ggdraw() + + draw_label("A)", 0.03, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(dispPlot, 0, 0.06, 0.5, 0.94) + + draw_label("B)", 0.53, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + + draw_plot(topoPlot, 0.5, 0.06, 0.5, 0.94) + + draw_grob(legend_b, 0.32, 0.01, 0.4, 0.1) + + + + +%%end.rcode + + +\subsection{Dispersal} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +In the unstructured population, Pathogen 2 invaded in 100 out of 100 simulations. +This was true at all four transmission rates. + +%todo check formating of pvalues +When the $\xi = 0$ simulations were included, there was a positive relationship between dispersal rate and invasion probability for $\beta = 0.2, 0.3$ and $0.4$ (Figure~\ref{f:invasionPropPlots}A, Table~\ref{B-disp}). +These positive relationships were all significant (GLM. $\beta = 0.2$: $b$ = \rinline{DispGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\beta = 0.3$: $b$ = \rinline{DispGLM3$coefficients[2, 1]}, $p$ = \rinline{DispGLM3$coefficients[2, 4]}. $\beta = 0.4$: $b$ = \rinline{DispGLM4$coefficients[2, 1]}, $p$ = \rinline{DispGLM4$coefficients[2, 4]}.) +At $\beta = 0.1$ there was no significant relationship as invasion probability was very close to zero at all dispersal rates (GLM. $b$ = \rinline{DispGLM1$coefficients[2, 1]}, $p$ = \rinline{DispGLM1$coefficients[2, 4]}). + +However, when the $\xi = 0$ simulations were removed, this significant, positive relationship largely disappeared. +At $\beta = 0.2$, the significant positive relationship remained (GLM: $b~=~$\rinline{Disp2GLM2$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM2$coefficients[2, 4]}). 
+At all other transmission rates, the probability of invasion did not significantly change with dispersal rate (GLM. $\beta = 0.1$: $b$ = \rinline{Disp2GLM1$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM1$coefficients[2, 4]}. $\beta = 0.3$: $b$ = \rinline{Disp2GLM3$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM3$coefficients[2, 4])}. $\beta = 0.4$: $b$ = \rinline{Disp2GLM4$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM4$coefficients[2, 4])}). + + +\subsection{Network topology} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +When the completely unconnected topology simulations were included, the probability of invasion differed across topologies for $\beta = 0.2, 0.3$ and $0.4$ (Fisher's exact test. $\beta = 0.2$: $p < 10^{-5}$. $\beta = 0.3$: $p < 10^{-5}$. $\beta = 0.4$: $p < 10^{-5}$). +In each case, the completely unconnected population had a lower probability of invasion than the minimally and fully connected topologies (Figure~\ref{f:invasionPropPlots}B, Table~\ref{B-topo}). +At $\beta = 0.1$ there was no significant difference ($p = \rinline{p(TopoTest1$p.value)}$) and the probability of invasion was close to zero for all topologies (Figure~\ref{f:invasionPropPlots}B). + +When the completely unconnected topology simulations were removed, there was no significant difference between the two remaining topologies, i.e.\ the minimally and fully connected topologies (Figure~\ref{f:invasionPropPlots}B). +This was true at all transmission rates (Fisher's exact test. $\beta = 0.1$, $p = \rinline{sprintf('%.2f', Topo2Test1$p.value)}$. $\beta = 0.2$, $p = \rinline{p(Topo2Test2$p.value)}$. $\beta = 0.3$, $p = \rinline{p(Topo2Test3$p.value)}$. $\beta = 0.4$, $p = \rinline{p(Topo2Test4$p.value)}$). + + + +\subsection{Transmission} +%%%%%%%%%%%%%%%%%%%%%%%%%% + +Increasing the transmission rate increased the probability of invasion (Figure~\ref{f:invasionPropPlots}). +This was true for all four dispersal values (GLM. 
$\xi = 0$: $b$ = \rinline{DispTransGLM4$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.001$: $b$ = \rinline{DispTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.01$: $b$ = \rinline{DispTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.1$: $b$ = \rinline{DispTransGLM3$coefficients[2, 1]}, $p < 10^{-5}$) and both network structures (GLM. Minimally connected: $b$ = \rinline{TopoTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. Fully connected: $b$ = \rinline{TopoTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$). + + + + + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\section{Discussion}\label{s:sims1Disc} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\tmpsection{Restate the gap and the main result} + +I have used mechanistic, metapopulation models to test whether increased population structure can promote pathogen richness by facilitating invasion of new pathogens. +I found that dispersal rate does affect the ability of a new pathogen to invade and persist in a population. +I also found evidence that pathogen invasion was less likely in completely isolated colonies. +However, apart from the completely unconnected network, the topology of the metapopulation network did not affect invasion probability. +As the transmission rate increases, the system quickly reaches a state where new pathogens always invade, as long as the metapopulation is not completely unconnected. +As the transmission rate decreases, it quickly reaches a state where invasion is impossible. + +The result that increased population structure decreases the probability of pathogen invasion, and therefore potentially pathogen richness, supports many existing predictions that increasing $R_0$ should increase pathogen richness \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. 
+However, many comparative studies have found the opposite relationship, with increased population structure increasing pathogen richness (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). +Furthermore, simple analytical models suggest that population structure should increase pathogen richness \cite{qiu2013vector, allen2004sis, nunes2006localized}, but I found no evidence of this. + + +\tmpsection{Link results to consequences} + +These results suggest that if population structure does in fact affect pathogen richness, as observed in comparative studies (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), the effect must occur by a mechanism other than the one studied here. +In this study the hypothesised mechanism for the relationship between population structure and pathogen richness was that the spread and persistence of a newly evolved pathogen would be facilitated in highly structured populations, as the lack of movement between colonies would stochastically create areas of low prevalence of the endemic pathogen. +If the invading pathogen evolved (i.e.\ was seeded) in one of these areas of low prevalence, invasion would be more likely. +Instead, reduced population structure allowed the new pathogen to quickly spread beyond the colony in which it evolved. +As the mechanism studied here cannot explain the relationship between population structure and pathogen richness seen in wild species (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), other mechanisms should be studied. +These include reduced competitive exclusion of already established pathogens or increased invasion of less closely related and less strongly competing pathogens, perhaps mediated by ecological competition between pathogens (i.e.\ reduction of the susceptible pool by disease-induced mortality). 
+Furthermore, single-pathogen dynamics could have an important role; for example, population structure could cause a much slower, asynchronous epidemic that prevents acquired herd immunity \cite{plowright2011urban}. + +I ran simulations of a completely unstructured population as a baseline comparison of pathogen invasion probability. +However, this unstructured population could also be considered a single, very large subpopulation or colony. +The fact that invasion occurred 100\% of the time in these simulations suggests that colony size has an important role in pathogen richness. +Therefore the interplay between population structure and colony size should be studied further, especially as colony size in bats varies widely, from ten to 1 million individuals \cite{jones2009pantheria}. + +My simulations also highlighted the importance of competition for the spread of a new pathogen. +All parameters used corresponded to pathogens with $R_0>1$ (as shown by the consistent spread of Pathogen 1). +However, the competition with the endemic pathogen meant that for some transmission rates the chance of epidemic spread and persistence of Pathogen 2 was close to zero. +This has implications for human epidemics as well: if there is strong competition between a newly evolved strain and an endemic strain, we are unlikely to see the new strain spread, regardless of population structure. + + + +\subsection{Model assumptions} + +\subsubsection{Complete cross-immunity} + +I have assumed that once recovered, individuals are immune to both pathogens. +Furthermore, when a coinfected individual recovers from one pathogen, it immediately recovers from the other as well. +This is probably a reasonable assumption given that I am modelling a newly evolved strain. +However, the rate of recovery from pathogens in the presence of coinfections has not been well studied. 
+In humans, the rate of recovery from respiratory syncytial virus was faster in individuals that had recently recovered from one of a number of co-circulating viruses \cite{munywoki2015influence}. +However, currently coinfected individuals recovered more slowly than average \cite{munywoki2015influence}. + +Further work could relax this assumption using a model similar to that of \cite{poletto2015characterising}, which contains additional classes for ``infected with Pathogen 1, immune to Pathogen 2'' and ``infected with Pathogen 2, immune to Pathogen 1''. +The model here was formulated such that the study of systems with more than two pathogens (an avenue for further study) is still computationally feasible. +A model such as that used in \cite{poletto2015characterising} contains $3^\rho$ classes for a system with $\rho$ pathogen species. +The number of classes therefore grows exponentially with $\rho$ and quickly becomes computationally restrictive. +It might be expected that there is an upper limit to the total number of pathogen species that can coexist in a population. +In particular, it is possible that once a certain number of species are endemic in a population, no more pathogens can invade. +This has not been studied in the context of metapopulations. + +\subsubsection{Identical strains} + +Many papers on pathogen richness have focused on the evolution of pathogen traits and have considered a trade-off between transmission rate and virulence \cite{nowak1994superinfection} or infectious period \cite{poletto2013host}. +However, here I am interested in host traits. +Therefore I have assumed that pathogen strains are identical. +It is clear, however, that there are a number of factors that affect pathogen richness and my focus on host population structure does not imply that pathogen traits are not important. 
+ +\subsubsection{Complex social structure and behaviour} + +With the models here I have aimed to find a middle ground between the overly simplistic models employed in analytical studies \cite{allen2004sis} and the full complexity and variety of true bat social systems \cite{kerth2008causes}. +The factors that have not been modelled here include seasonal migration, maternity roosts, hibernation roosts and swarming sites \cite{kerth2008causes, fleming2003ecology, richter2008first, cryan2014continental}. +While future models might aim to model this complexity more fully, the number of parameters that must be estimated and varied becomes very large. +Furthermore, not all of these social complexities exist in all bat species, so in limiting my analysis to the simpler end of bat social systems I hope that the results are more broadly representative of the order. + +In addition, I have considered a single host species in isolation. +It seems likely that sympatry in bats and other mammals is epidemiologically important \cite{brierley2016quantifying, luis2013comparison, pilosof2015potential} but this was beyond the scope of this study. +There is potential for this to be effectively modelled as a multi-layered network \cite{wang2016structural, funk2010interacting} and this would be expected to reduce population structure. +Conversely, the case of interspecies roost sharing could be modelled as an additional layer of within-colony population structure, which would tend to increase population structure. + +Finally, many species of bat exhibit strong seasonal birth pulses which are known to affect disease dynamics \cite{hayman2015biannual,peel2014effect,amman2012seasonal}. +This would be expected to facilitate the invasion of new pathogen species; if a new strain evolved or entered the population by migration during a period of low population immunity, it would have a higher chance of invading and establishing in the population. 
+Again this was beyond the scope of this study, but birth pulses and their interactions with seasonally varying transmission rates are a useful area for further research. + +\subsection{Conclusions} + +In conclusion, I have found evidence that reduced population structure facilitates the invasion and establishment of newly evolved pathogen species. +However, the direction of the relationship contradicts that found in wild species. +This suggests that if population structure does have a role in shaping pathogen communities, it is unlikely to do so by this specific mechanism. + + + + + + From 3a808fbfafdab0091425a331b167c4710311cbfe Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 17:02:19 +0100 Subject: [PATCH 16/17] Remove poorly named copy file. --- ...s-pathogen-richness-mechanistic-model.Rtex | 1503 ----------------- 1 file changed, 1503 deletions(-) delete mode 100644 population-structure-affects-pathogen-richness-mechanistic-model.Rtex diff --git a/population-structure-affects-pathogen-richness-mechanistic-model.Rtex b/population-structure-affects-pathogen-richness-mechanistic-model.Rtex deleted file mode 100644 index 48d450b..0000000 --- a/population-structure-affects-pathogen-richness-mechanistic-model.Rtex +++ /dev/null @@ -1,1503 +0,0 @@ -%--------------------------------------------------------------------------------------------------------------------------------% -% Code and text for "Understanding how population structure affects pathogen richness in a mechanistic model of bat populations" -% Chapter 3 of thesis "The role of population structure and size in determining bat pathogen richness" -% by Tim CD Lucas -% -% NB This file is a copy due to the mess up with chapter numbers. 
-% To see the full commit history see https://github.com/timcdlucas/PhDThesis/blob/master/Chapter2.Rtex -% -%---------------------------------------------------------------------------------------------------------------------------------% - - - - - -%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide' - -#################################### -### Important simulation options ### -#################################### - -# Compilation options -# Run simulations? This will take many hours -runAllSims <- FALSE - -# Save raw simulation output -# This will take ~10GB or so. -# If false, summary statistics of each simulation are saved instead. -saveData <- FALSE - - -# How many cores do you want to use to run simulations? -nCores <- 7 - -########################## -### End options ### -########################## - - -opts_chunk$set(cache.path = '.Ch2Cache/') - -source('misc/theme_tcdl.R') -source('misc/KnitrOptions.R') -theme_set(theme_grey() + theme_tcdl) - -%%end.rcode - - -%%begin.rcode libs, cache = FALSE, result = FALSE - -# My package. For running and analysing Epidemiological sims. -# https://github.com/timcdlucas/metapopepi -library(MetapopEpi) - -# Data manipulations -library(reshape2) -library(dplyr) - -# Calc confidence intervals (could probably do with broom instead now.) -library(binom) - -# To tidy up stats models/tests -library(broom) - - -# Run simulations in parallel -library(parallel) - -# Plotting -library(ggplot2) -library(palettetown) -library(cowplot) - -%%end.rcode - - -\section{Abstract} - -%\tmpsection{One or two sentences providing a basic introduction to the field} -% comprehensible to a scientist in any discipline. -%\lettr{A}n increasingly large proportion of emerging human diseases comes from animals. -%These diseases have a huge impact on human health, healthcare systems and economic development. 
-The chance that a new zoonosis will come from any particular wild host species increases with the number of pathogen species occurring in that host species. -Comparative, phylogenetic studies have shown that host-species traits such as population density and population structure correlate with pathogen richness -However, the mechanisms by which these factors control pathogen richness in wild animal species remain unclear. -% -% -%\tmpsection{Two to three sentences of more detailed background} -% comprehensible to scientists in related disciplines. -% Add mechanistic vs empirical -Typically it is assumed that well-connected, unstructured populations (that therefore have a high basic reproductive number, $R_0$) promote the invasion of new pathogens and therefore increase pathogen richness. However, this assumption is largely untested in the multipathogen context. -In the presence of inter-pathogen competition, the opposite effect might occur; increased population structure may increase pathogen richness by reducing the effects of competition. -A more mechanistic understanding of how population structure affects pathogen richness could discriminate between these two broad hypotheses. -Here I have examined one mechanism by which increased population structure may cause greater pathogen richness. -I used simulations to test whether increased population structure could increase the probability that a newly evolved pathogen would invade into a population already infected with an identical, endemic pathogen. -I tested this hypothesis using individual-based, metapopulation networks parameterised to mimic wild bat populations as bats have highly varied social structures and have recently been implicated in a number of high profile diseases such as Ebola, SARS, Hendra and Nipah. -In a metapopulation, dispersal rate and the number of links between colonies can both affect population structure. 
-I tested whether either of these factors could increase the probability that a pathogen would invade and persist in the population. -I found that, at intermediate transmission rates, increasing dispersal rate significantly increased the probability of a newly evolved pathogen invading into the metapopulation. -However, there was very limited evidence that the number of links between colonies affected pathogen invasion probability. - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -\section{Introduction} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -%Possible structure (each number a sep. paragraph): -%1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown -%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important -%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach -%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms -%5. results from analytical models so far and limitations of the approach -%6. what is needed now -%7. what your focus is (including a bit about bat focus) -%8. 'here I show..' what you found briefly to lead into methods - - - - -\tmpsection{General Intro} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% A basic introduction to the field, -% comprehensible to a scientist in any discipline. - -%1. zoonotics bad, need to predict spillover, factors controlling pathogen richness unknown -\tmpsection{Why is pathogen richness? 
important?} -Over 60\% of emerging infectious diseases have an animal source \cite{jones2008global, smith2014global}. -Zoonotic pathogens can be highly virulent \cite{luby2009recurrent, lefebvre2014case} and can have huge public health impacts \cite{granich2015trends}, economic costs \cite{knobler2004learning} and slow down international development \cite{ebolaWorldbank}. -Therefore understanding and predicting changes in the process of zoonotic spillover is a global health priority \cite{taylor2001risk}. -The number of pathogen species hosted by a wild animal species affects the chance that a disease from that species will infect humans \cite{wolfe2000deforestation}. -However, the factors that control the number of pathogen species in a wild animal population are still unclear \cite{metcalf2015five}; in particular our mechanistic understanding of how population processes inhibit or promote pathogen richness is poor. - - - -\tmpsection{Specific Intro} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% more detailed background} -% comprehensible to scientists in related disciplines. - - -\tmpsection{We know some factors that correlate with pathogen richness} -%population density, longevity, body size and population structure - -%2. results from comparative studies (including mammal and bat ones), explaining why population structure is important -%3. limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach - -In comparative studies, a number of host traits have been shown to correlate with pathogen richness including body size \cite{kamiya2014determines, arneberg2002host}, population density \cite{nunn2003comparative, arneberg2002host} and range size \cite{bordes2011impact, kamiya2014determines}. -A further factor that may affect pathogen richness is population structure. 
-In comparative studies it is often assumed that factors that promote fast disease spread should promote high pathogen richness; the faster a new pathogen spreads through a population, the more likely it is to persist \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. -However, this assumption ignores competitive mechanisms such as cross-immunity and depletion of susceptible hosts. -If competitive mechanisms are strong, endemic pathogens in populations with high $R_0$ will be able to easily out-compete invading pathogens. -Only if competitive mechanisms are weak will high $R_0$ enable the invasion of new pathogens and allow higher pathogen richness. - -Overall, the evidence from comparative studies indicates that increased population structure correlates with higher pathogen richness. -This conclusion is based on studies using a number of measures of population structure: genetic measures, the number of subspecies, the shape of species distributions and social group size (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). -However, there are a number of studies that contradict this conclusion \cite{gay2014parasite, bordes2007rodent, ezenwa2006host}. -Comparative studies are often contradictory due to small sample sizes, noisy data and because empirical relationships often do not extrapolate well to other taxa. -Furthermore, multicollinearity between many traits also makes it hard to clearly distinguish which factors are important \cite{nunn2015infectious}. -However, meta-analyses can be used to combine studies to help generalise conclusions \cite{kamiya2014determines}. - - -%3. 
limitations of comparative studies (including highlighting that empirical and mechanistic approaches would give different predictions) that and need a more mechanistic approach - -Furthermore, knowing which factors correlate with pathogen richness does not tell us if, or how, they causally control pathogen richness. -This lack of a solid mechanistic understanding of these processes prevents predictions of how wild populations will respond to perturbations such as increased human pressure and global change. -As habitats fragment we expect wild populations to change in a number of ways including becoming smaller and less well connected \cite{andren1994effects, cushman2012separating}. -As multiple population-level factors are likely to change simultaneously due to global change, the correlative relationships examined in comparative studies are unlikely to effectively predict future changes in pathogen richness. -Mechanistic models are needed to project how these highly non-linear disease systems will respond to the multiple, simultaneous stressors affecting them. - - - -\tmpsection{Network structure has been studied} -%5. results from analytical models so far and limitations of the approach -%4. description of the possible mechanisms for population structure - explaining why focusing on reduction of competition mechanisms - -There are a number of mechanisms by which population structure could increase pathogen richness. -Firstly, population structure may reduce competition between pathogens. -In analytical models of well-mixed populations competitive exclusion has been predicted \cite{ackleh2003competitive, bremermann1989competitive, martcheva2013competitive, qiu2013vector, allen2004sis}. -In models where competitive exclusion occurs in well-mixed populations, population structure has sometimes been shown to allow coexistence \cite{qiu2013vector, allen2004sis, nunes2006localized, garmer2016multistrain}. 
-Alternatively, population structure may promote the evolution of new strains within a species \cite{buckee2004effects}, reduce the rate of pathogen extinction \cite{rand1995invasion} or increase the probability of pathogen invasion from other host species \cite{nunes2006localized}. -These separate mechanisms have not been examined and it is difficult to see how they could be distinguished through comparative methods. - -%Competing epidemics, or two pathogens spreading at the same time in a population, is a well studied area \cite{poletto2013host, poletto2015characterising, karrer2011competing}. -%This area is related to the study of pathogen richness in that they indicate that dynamics of multiple pathogens in a population do depend on population %structure. -%However, the results for short term epidemic competition do not directly transfer to the study of long term disease persistence. - - -%6. what is needed now -\tmpsection{The gap} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% One sentence clearly stating the general problem -% being addressed by this particular study. -% By this stage, must have defined/introduced all terms used within. - -Currently, the literature contains very abstract, simplified models \cite{qiu2013vector, allen2004sis, garmer2016multistrain, may1994superinfection}. -These cannot be easily applied to real data. -They also do not easily give quantitative predictions of pathogen richness; typically they predict either no pathogen coexistence \cite{bremermann1989competitive, martcheva2013competitive} or infinite pathogen richness \cite{may1994superinfection}. -Models that can give quantitative predictions of pathogen richness in wild populations are more applicable to real-world issues such as zoonotic disease surveillance. 
-While predicting an absolute value of pathogen richness for a wild species is likely to be impossible, models that attempt to rank species from highest to lowest pathogen richness are still useful for prioritising species for surveillance. -This requires a middle ground of model complexity. - -\tmpsection{What I did} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%7. what your focus is (including a bit about bat focus) - -In order to capture this middle ground, I have used metapopulation models. -Unlike two-patch models that are used to add population structure while keeping model complexity to a minimum \cite{qiu2013vector, allen2004sis, garmer2016multistrain}, the metapopulations used here split the population into multiple subpopulations. -I have used two independent variables that alter population structure: dispersal rate and metapopulation network topology. -I have studied the invasion of new pathogens as a mechanism for increasing pathogen richness. -In particular I have focused on studying the invasion of a newly evolved pathogen that is therefore identical in epidemiological parameters to the endemic pathogen. -Furthermore, this close evolutionary relationship means that competition via cross-immunity is strong. - -\tmpsection{Why bats} -The metapopulations were parameterised to broadly mimic wild bat populations. -Population structure has already been found to correlate with pathogen richness in bats (Chapter~\ref{ch:empirical}, \cites{gay2014parasite, maganga2014bat, turmelle2009correlates}). -Furthermore, bats have an unusually large variety of social structures. -Colony sizes range from ten to 1 million individuals \cite{jones2009pantheria} and colonies can be very stable \cite{kerth2011bats, mccracken1981social}. -This strong colony fidelity means they fit the assumptions of metapopulations well. -Bats have also, over the last decade, become a focus for disease research \cite{calisher2006bats, hughes2007emerging}. 
-The reason for this focus is that they have been implicated in a number of high profile diseases including Ebola, SARS, Hendra and Nipah \cite{calisher2006bats, li2005bats}. - -%8. 'here I show..' what you found briefly to lead into methods - -\tmpsection{What I found} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% One sentence summarising the main result -% (with the words “here we show” or their equivalent). - -Here I found that, given the assumptions of a metapopulation, increased dispersal significantly increased the probability of invasion of new pathogens. -Furthermore, structured populations nearly always had a lower probability of pathogen invasion than fully-mixed populations of equal size. -The topology of the network did not strongly affect the probability of pathogen invasion as long as the population was not completely unconnected. -Overall, I found significant evidence that reduced population structure increases the probability of invasion of a new pathogen, implying a role for the generation of pathogen richness more generally. - -\begin{figure}[t] -\centering - \includegraphics[width=0.5\textwidth]{imgs/SIRoption1.pdf} - \caption[Schematic of the SIR model used]{ - Schematic of the SIR model used. - Individuals are in one of five classes, susceptible (orange, $S$), infected with Pathogen 1, Pathogen 2 or both (blue, $I_1, I_2, I_{12}$) or recovered and immune from further infection (green, $R$). - Transitions between epidemiological classes occur as indicated by solid arrows. - Vital dynamics (births and deaths) are indicated by dashed arrows. - Parameter symbols for transitions are indicated. - Note that individuals in $I_{12}$ move into $R$, not back to $I_1$ or $I_2$. - That is, recovery from one pathogen causes immediate recovery from the other pathogen. 
- } -\label{f:sir} -\end{figure} - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\section{Methods} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -%% - - -\subsection{Two pathogen SIR model} - -I developed a multipathogen, SIR compartment model with individuals being classed as susceptible, infected or recovered with immunity (Figure~\ref{f:sir}). -Susceptible individuals are counted in class $S$ (see Table~\ref{t:params} for a list of symbols and values used). -There are three infected classes, $I_1$, $I_2$ and $I_{12}$, being individuals infected with Pathogen 1, Pathogen 2 or both respectively. -Recovered individuals, $R$, are immune to both pathogens, even if they have only been infected with one (i.e.\ there is complete cross-immunity). -Furthermore, recovery from one pathogen moves an individual straight into the recovered class, even if the individual is infected with both pathogens (Figure~\ref{f:sir}). -This modelling choice allows the model to be easily expanded to include more than two pathogens, though this study is restricted to two pathogens. -The assumption of immediate recovery from all other diseases is likely to be reasonable. -Any up-regulation of innate immune response will affect both pathogens equally. -Furthermore, as the pathogens are identical, any acquired immunity would also affect both pathogens equally. - -The coinfection rate (the rate at which an infected individual is infected with a second pathogen) is adjusted compared to the infection rate by a factor $\alpha$. -As in \textcite{castillo1989epidemiological}, low values of $\alpha$ imply lower rates of coinfection. 
-In particular, $\alpha = 0$ indicates no coinfections, $\alpha = 1$ indicates that coinfections happen at the same rate as first infections while $\alpha > 1$ indicates that coinfections occur more readily than first infections. - - -\begin{figure}[t] -{\centering -\subfloat[Minimally connected\label{fig:fullyConnected}]{ - \includegraphics[width=0.45\textwidth]{imgs/minimallyConnected.pdf} -} -\subfloat[Fully connected -\label{fig:minimallyConnected}]{ - \includegraphics[width=0.45\textwidth]{imgs/fullyConnected.pdf} -} -} -\caption[Network topologies used to compare network connectedness]{ -The two network topologies used to test whether network connectedness influences a pathogen's ability to invade. -A) Animals can only disperse to neighbouring colonies. -B) Dispersal can occur between any colonies. -Blue circles are colonies of \SI{3000} individuals. -Dispersal only occurs between colonies connected by an edge (black line). -The dispersal rate is held constant between the two topologies. -} -\label{f:net} -\end{figure} - - -In the application of long term existence of pathogens it is necessary to include vital dynamics (births and deaths) as the SIR model without vital dynamics has no endemic state. -Birth and death rates ($\Lambda$ and $\mu$) are set as being equal meaning the population does not systematically increase or decrease. -The population size does however change as a random walk. -New born individuals enter the susceptible class. -Infection and coinfection were assumed to cause no extra mortality as for a number of viruses, bats show no clinical signs of infection \cite{halpin2011pteropid, deThoisy2016bioecological}. 
-%In humans, coinfection generally worsens health \cite{griffiths2011nature} but as there are - -\tmpsection{Metapopulation} - - -\begin{table}[tb] -\centering -\caption[All symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2}]{A summary of all symbols used in Chapters~\ref{ch:sims1} and \ref{ch:sims2} along with their units and default values. -The justifications for parameter values are given in Section~\ref{s:paramSelect}.} - -\begin{tabular}{@{}lp{6cm}p{2.9cm}r@{}} -\toprule -Symbol & Explanation & Units & Value\\ -\midrule -$\rho$ & Number of pathogens && 2\\ -$x, y$ & Colony index &&\\ -$p$ & Pathogen index i.e.\ $p\in\{1,2\}$ for pathogens 1 and 2 & &\\ -$q$ & Disease class i.e.\ $q\in\{1,2,12\}$&\\ -$S_x$ & Number of susceptible individuals in colony $x$ &&\\ -$I_{qx}$ & Number of individuals infected with disease(s) $q \in \{1, 2, 12\}$ in colony $x$ &&\\ -$R_x$ & Number of individuals in colony $x$ in the recovered with immunity class &&\\ -$N$ & Total Population size && 30,000\\ -$m$ & Number of colonies&& 10\\ -$n$ & Colony size && 3,000\\ -$a$ & Area & \si{\square\kilo\metre}& 10,000\\ -$\beta$ & Transmission rate & & 0.1 -- 0.4\\ -$\alpha$ & Coinfection adjustment factor & & 0.1\\ -$\gamma$ & Recovery rate & year$^{-1}.$individual$^{-1}$ & 1\\ -$\xi$ & Dispersal rate & year$^{-1}.$individual$^{-1}$ & 0.001--0.1\\ -$\Lambda$ & Birth rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ -$\mu$ & Death rate & year$^{-1}.$individual$^{-1}$ & 0.05\\ -$k_x$ & Degree of node $x$ (number of colonies that individuals from colony $x$ can disperse to). &&\\ -$\delta$ & Waiting time until next event & years &\\ - -$e_i$ & The rate at which event $i$ occurs & year$^{-1}$&\\ -\bottomrule -\end{tabular} - -\label{t:params} -\end{table} - - -The population is modelled as a metapopulation, being divided into a number of subpopulations (colonies). -This model is an intermediate level of complexity between fully-mixed populations and contact networks. 
-There is ample evidence that bat populations are structured to some extent. -This evidence comes from the existence of subspecies, measurements of genetic dissimilarity and ecological studies \cite{kerth2011bats, mccracken1981social, burns2014correlates, wilson2005mammal}. -Therefore, a fully mixed population is a large oversimplification. -However, trying to study the contact network relies on detailed knowledge of individual behaviour, which is rarely available. - -The metapopulation is modelled as a network with colonies being nodes and dispersal between colonies being indicated by edges (Figure~\ref{f:net}). -Individuals within a colony interact randomly so that the colony is fully mixed. -Dispersal between colonies occurs at a rate $\xi$. -Individuals can only disperse to colonies connected to theirs by an edge in the network. -The rate of dispersal is not affected by the number of edges a colony has (known as the degree of the colony and denoted $k$). -Therefore, the dispersal rate from a colony $y$ with degree $k_y$ to colony $x$ is $\xi / k_y$. -Note that this rate is not affected by the degree or size of colony $x$. - - - -\tmpsection{Stochastic simulations} - -I examined this model using stochastic, continuous-time simulations implemented in \emph{R} \cite{R}. -The implementation is available as an \emph{R} package on GitHub \cite{metapopepi}. -The model can be written as a continuous-time Markov chain. -The Markov chain contains the random variables $((S_x)_{x = 1\ldots m}, (I_{x, q})_{x =1\ldots m,\:q \in \{1, 2, 12\}}, (R_x)_{x = 1\ldots m})$. -Here, $(S_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of susceptibles in each colony. -$(I_{x, q})_{x =1\ldots m, q \in \{1, 2, 12\}}$ is a length $m \times 3$ vector describing the number of individuals of each disease class ($q \in \{1, 2, 12\}$) in each colony. -Finally, $(R_x)_{x = 1\ldots m}$ is a length $m$ vector of the number of individuals in the recovered class in each colony.
-The model is a Markov chain where extinction of both pathogen species and extinction of the host species are absorbing states. -The expected time for either host to go extinct is much larger than the duration of the simulations. - -At any time, suppose the system is in state $((s_x), (i_{x,q}), (r_x))$. -At each step in the simulation we calculate the rate at which each possible event might occur. -One event is then randomly chosen, weighted by its rate -\begin{align} - p(\text{event } i) = \frac{e_i}{\sum_j e_j}, -\end{align} -where $e_i$ is the rate at which event $i$ occurs and $\sum_j e_j$ is the sum of the rates of all possible events. -Finally, the length of the time step, $\delta$, is drawn from an exponential distribution -\begin{align} - \delta \sim \operatorname{Exp}\left(\sum_j e_j \right). -\end{align} - - -We can now write down the rates of all events. -%I defined $I^+_p$ to be the sum of all classes that are infectious with pathogen $p$, for example $I^+_1 = I_1 + I_{12}$. -Assuming asexual reproduction, that all classes reproduce at the same rate and that individuals are born into the susceptible class we get -\begin{align} - s_x \rightarrow s_x + 1 \;\;\;\text{at a rate of}\;\; \Lambda\left( s_{x}+\sum_q i_{qx} + r_{x}\right) -\end{align} -where $s_x \rightarrow s_x + 1$ is the event that the number of susceptibles in colony $x$ will increase by 1 (a single birth) and $\sum_q i_{qx}$ is the sum of all infection classes $q~\in~\{1, 2, 12\}$. -The rates of death, given a death rate $\mu$, and no increased mortality due to infection, are given by -\begin{align} - s_x \rightarrow s_x-1 &\;\;\;\text{at a rate of}\;\; \mu s_x, \\ - i_{qx} \rightarrow i_{qx}-1 &\;\;\text{at a rate of}\;\; \mu i_{qx},\\ - r_x \rightarrow r_x-1 &\;\;\;\text{at a rate of}\;\; \mu r_x. -\end{align} - - - -I modelled transmission as being density-dependent. 
-This assumption was more suitable than frequency-dependent transmission as I was modelling a disease transmitted by saliva or urine in highly dense populations confined to caves, buildings or potentially a small number of tree roosts. -I was notably not modelling a sexually transmitted disease (STD) as spillover of STDs from bats to humans is likely to be rare. -Infection of a susceptible with either Pathogen 1 or 2 is therefore given by -\begin{align} - i_{1x} \rightarrow i_{1x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{1x} + i_{12x}\right),\\ - i_{2x} \rightarrow i_{2x}+1,\;\;\; s_x \rightarrow s_x-1 &\;\;\text{at a rate of}\;\; \beta s_x\left(i_{2x} + i_{12x}\right), -\end{align} -while coinfection, given the coinfection adjustment factor $\alpha$, is given by -\begin{align} - i_{12x} \rightarrow i_{12x}+1,\;\;\; i_{1x} \rightarrow i_{1x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{1x}\left(i_{2x} + i_{12x}\right),\\ - i_{12x} \rightarrow i_{12x}+1,\;\;\; i_{2x} \rightarrow i_{2x}-1 &\;\;\text{at a rate of}\;\; \alpha\beta i_{2x}\left(i_{1x} + i_{12x}\right). -\end{align} -Note that lower values of $\alpha$ give lower rates of coinfection, as in \textcite{castillo1989epidemiological}. - - -The rate of migration from colony $y$ (with degree $k_y$) to colony $x$, given a dispersal rate $\xi$, is -\begin{align} - s_x \rightarrow s_x+1,\;\;\; s_y \rightarrow s_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi s_y}{k_y},\\ - i_{qx} \rightarrow i_{qx}+1,\;\;\; i_{qy} \rightarrow i_{qy}-1 &\;\;\text{at a rate of}\;\; \frac{\xi i_{qy}}{k_y},\\ - r_x \rightarrow r_x+1,\;\;\; r_y \rightarrow r_y-1 &\;\;\text{at a rate of}\;\; \frac{\xi r_y}{k_y}. -\end{align} -Note that the dispersal rate does not change with infection. -As above, this is due to the low virulence of bat viruses.
-Finally, recovery from any infectious class occurs at a rate $\gamma$ -\begin{align} - i_{qx} \rightarrow i_{qx}-1,\;\; r_x \rightarrow r_x+1 \;\;\text{at a rate of}\;\; \gamma i_{qx}. -\end{align} - - -%%begin.rcode SimLengths - - # These apply to both topo and disp sims. And probably should apply to extinction sims if I include them. - # How long should the simulation last? - nEvent <- 8e5 - - # When should the invading pathogen be added. - invadeT <- 3e5 - -%%end.rcode - - - - - - -% ------------------------------------------------------------------ % -% Dispersal Sims -% ------------------------------------------------------------------ % - - -%%begin.rcode DispSimsFuncs - - ################################# - # Dispersal sim definitions # - ################################# - - - # How often should the population be sampled. Only sampled populations are saved. - sample <- 1000 - - # How many simulations to run? - each <- 100 - nDispSims <- 12 * each - - -# Define our simulation function. -fullSim <- function(x){ - - dispVec <- rep(c(0.001, 0.01, 0.1), each = nDispSims/3 +1) - disp <- dispVec[x] - - tranVec <- rep(c(0.1, 0.2, 0.3, 0.4 ), nDispSims/3 + 1) - tran <- tranVec[x] - - # Set seed (this is set within each parallel simulation to prevent reusing random numbers). - simSeed <- paste0(seed, x) - set.seed(simSeed) - - # Make the population. - p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') - - # Seed endemic pathogen. - for(i in 1:10){ - p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) - } - - # Burn in simulation - p <- runSim(p, end = invadeT) - - # Seed invading pathogen. - p$I[2, 1, (invadeT + 1) %% sample] <- 5 - - # Recalculate rates of each event after seeding. 
- p <- transRates(p, (invadeT + 1) %% sample) - - # Continue simulation - p <- runSim(p, start = invadeT + 1, end = 'end') - - # Was the invasion succesful? - invasion <- findDisDistr(p, 2)[1] > 0 - - # Save summary stats - d <- data.frame(transmission = NA) - - d$transmission <- p$parameters['transmission'] - d$dispersal <- p$parameters['dispersal'] - d$nExtantDis <- sum(findDisDistr(p, 2) > 0) - d$singleInf <- findCoinfDistr(p, 2)[2] - d$doubleInf <- findCoinfDistr(p, 2)[3] - d$nColonies <- p$parameters['nColonies'] - d$nPathogens <- p$parameters['nPathogens'] - d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] - d$maxDistance <- p$parameters['maxDistance'] - d$nEvents <- p$parameters['events'] - - - message(paste0("finished ", x, ". Invasion: ", invasion )) - - if(saveData){ - file <- paste0('data/Chapter2/DispSims/DispSim_', formatC(x, width = 4, flag = '0'), '.RData') - save(p, file = file) - } - - rm(p) - - return(d) - -} -%%end.rcode - -%%begin.rcode runDispSim, eval = runAllSims, cache = TRUE - -# Create and set seed (seed object is used to set seed in each seperate simulation.' -seed <- 33355 -set.seed(seed) - -# If we want to save the data, make a directory for it. -if(saveData){ - dir.create('data/Chapter2/DispSims/') -} - -# Run sims. -z <- mclapply(1:nDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) - -z <- do.call(rbind, z) - -# Save summary data. -write.csv(z, file = 'data/Chapter2/DispSims.csv') - -%%end.rcode - - - - - - -%%begin.rcode extraMidBeta, eval = runAllSims - - -nExtraSims <- 150 - -# Define our simulation function. -fullSim <- function(x){ - - dispVec <- rep(c(0.001, 0.01, 0.1), each = nExtraSims/3 + 1) - disp <- dispVec[x] - - tranVec <- rep(c(0.2, 0.3), nExtraSims/2 + 1) - tran <- tranVec[x] - - # Set seed (this is set within each parallel simulation to prevent reusing random numbers). - simSeed <- paste0(seed, x) - set.seed(simSeed) - - # Make the population. 
- p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = disp, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = 100, colonySpatialDistr = 'circle') - - # Seed endemic pathogen. - for(i in 1:10){ - p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) - } - - # Burn in simulation - p <- runSim(p, end = invadeT) - - # Seed invading pathogen. - p$I[2, 1, (invadeT + 1) %% sample] <- 5 - - # Recalculate rates of each event after seeding. - p <- transRates(p, (invadeT + 1) %% sample) - - # Continue simulation - p <- runSim(p, start = invadeT + 1, end = 'end') - - # Was the invasion succesful? - invasion <- findDisDistr(p, 2)[1] > 0 - - # Save summary stats - d <- data.frame(transmission = NA) - - d$transmission <- p$parameters['transmission'] - d$dispersal <- p$parameters['dispersal'] - d$nExtantDis <- sum(findDisDistr(p, 2) > 0) - d$singleInf <- findCoinfDistr(p, 2)[2] - d$doubleInf <- findCoinfDistr(p, 2)[3] - d$nColonies <- p$parameters['nColonies'] - d$nPathogens <- p$parameters['nPathogens'] - d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] - d$maxDistance <- p$parameters['maxDistance'] - d$nEvents <- p$parameters['events'] - - - - # Time until extinction - invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + - colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) - - d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] - d$totalTime <- sum(p$sampleWaiting) - d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] - - d$pathInv <- sum(p$sample[c(1, 4), , dim(p$sample)[3]]) - - message(paste0("finished ", x, ". Invasion: ", invasion )) - - rm(p) - - return(d) - -} - - -# Create and set seed (seed object is used to set seed in each seperate simulation.' -seed <- 787 -set.seed(seed) - - -# Run sims. 
-z <- mclapply(1:600, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) - -z <- do.call(rbind, z) - -# Save summary data. -write.csv(z, file = 'data/Chapter2/extraMidBeta.csv') - - -%%end.rcode - - - -% ------------------------------------------------------------------ % -% Topology Sims -% ------------------------------------------------------------------ % - - - -%%begin.rcode TopoSimsFuncs - - ################################# - # Topology sim definitions # - ################################# - - # How many simulations to run? - nTopoSims <- 8 * each - -# Define our simulation function. -fullSim <- function(x){ - - - # Chose maxdistance so that we get either fully connected or circle networks. - mxDis <- rep(c(40, 200), each = nTopoSims/2 + 1)[x] - - # Chose transmission rates. - tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nTopoSims/4 + 1) - tran <- tranVec[x] - - - # Set seed (this is set within each parallel simulation to prevent reusing random numbers). - simSeed <- paste0(seed, x) - set.seed(simSeed) - - # Make the population. - p <- makePop(model = 'SIR', events = nEvent, nColonies = 10, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.01, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 3000, infectDeath = 0, transmission = tran, maxDistance = mxDis, colonySpatialDistr = 'circle') - - # Seed endemic pathogen. - for(i in 1:10){ - p <- seedPathogen(p, 2, n = 200, diffCols = FALSE) - } - - # Burn in simulation - p <- runSim(p, end = invadeT) - - # Seed invading pathogen. - p$I[2, 1, (invadeT + 1) %% sample] <- 5 - - # Recalculate rates of each event after seeding. - p <- transRates(p, (invadeT + 1) %% sample) - - # Continue simulation - p <- runSim(p, start = invadeT + 1, end = 'end') - - # Was the invasion succesful? 
- invasion <- findDisDistr(p, 2)[1] > 0 - - # Save summary stats - d <- data.frame(transmission = NA) - - - d$transmission <- p$parameters['transmission'] - d$dispersal <- p$parameters['dispersal'] - d$nExtantDis <- sum(findDisDistr(p, 2) > 0) - d$singleInf <- findCoinfDistr(p, 2)[2] - d$doubleInf <- findCoinfDistr(p, 2)[3] - d$nColonies <- p$parameters['nColonies'] - d$nPathogens <- p$parameters['nPathogens'] - d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] - d$maxDistance <- p$parameters['maxDistance'] - d$nEvents <- p$parameters['events'] - - - message(paste0("finished ", x, ". Invasion: ", invasion )) - - if(saveData){ - file <- paste0('data/Chapter2/TopoSims/TopoSim_', formatC(x, width = 4, flag = '0'), '.RData') - save(p, file = file) - } - - rm(p) - - return(d) - -} -%%end.rcode - -%%begin.rcode runTopoSim, eval = runAllSims, cache = TRUE - -# Create and set seed (seed object is used to set seed in each seperate simulation.' -seed <- 1230202 -set.seed(seed) - -# If we want to save the data, make a directory for it. -if(saveData){ - dir.create('data/Chapter2/TopoSims/') -} - -# Run sims. -z <- mclapply(1:nTopoSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) - -z <- do.call(rbind, z) - -# Save summary data. -write.csv(z, file = 'data/Chapter2/TopoSims.csv') - -%%end.rcode - - - - - - -% ------------------------------------------------------------------ % -% Unstructured Sims -% ------------------------------------------------------------------ % - -%%begin.rcode unstructuredSimsFuncs - - ################################# - # Topology sim definitions # - ################################# - - # How many simulations to run? - nUnstructuredSims <- 4 * each - -# Define our simulation function. -fullSim <- function(x){ - - - # Chose transmission rates. - tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nUnstructuredSims/4 + 1) - tran <- tranVec[x] - - - # Set seed (this is set within each parallel simulation to prevent reusing random numbers). 
- simSeed <- paste0(seed, x) - set.seed(simSeed) - - # Make the population. - p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 29800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') - - # Seed endemic pathogen. - p$I[2, 2, 1] <- 200 - p$I[1, 1, 1] <- 0 - - # Recalculate rates of each event after seeding. - p <- transRates(p, 1) - - # Burn in simulation - p <- runSim(p, end = invadeT) - - # Seed invading pathogen. - p$I[3, 2, (invadeT + 1) %% sample] <- 5 - - # Recalculate rates of each event after seeding. - p <- transRates(p, (invadeT + 1) %% sample) - - # Continue simulation - p <- runSim(p, start = invadeT + 1, end = 'end') - - # Was the invasion succesful? - invasion <- findDisDistr(p, 2)[1] > 0 - - # Save summary stats - d <- data.frame(transmission = NA) - - d$transmission <- p$parameters['transmission'] - d$dispersal <- p$parameters['dispersal'] - d$nExtantDis <- sum(findDisDistr(p, 2) > 0) - d$singleInf <- findCoinfDistr(p, 2)[2] - d$doubleInf <- findCoinfDistr(p, 2)[3] - d$nColonies <- p$parameters['nColonies'] - d$nPathogens <- p$parameters['nPathogens'] - d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] - d$maxDistance <- p$parameters['maxDistance'] - d$nEvents <- p$parameters['events'] - - - - message(paste0("finished ", x, ". Invasion: ", invasion )) - - if(saveData){ - file <- paste0('data/Chapter2/UnstructuredSims/UnstructuredSims_', formatC(x, width = 4, flag = '0'), '.RData') - save(p, file = file) - } - - rm(p) - - return(d) - -} -%%end.rcode - -%%begin.rcode runUnstructuredSim, eval = runAllSims, cache = TRUE - -# Create and set seed (seed object is used to set seed in each seperate simulation.' -seed <- 13 -set.seed(seed) - -# If we want to save the data, make a directory for it. 
-if(saveData){ - dir.create('data/Chapter2/UnstructuredSims/') -} - -# Run sims. -z <- mclapply(1:nUnstructuredSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) - -z <- do.call(rbind, z) - -# Save summary data. -write.csv(z, file = 'data/Chapter2/unstructuredSims.csv') - -%%end.rcode - - - - - -%%begin.rcode noDispSimsFuncs - - - ################################# - # Topology sim definitions # - ################################# - - # How many simulations to run? - nNoDispSims <- 4 * each - -# Define our simulation function. -fullSim <- function(x){ - - - # Chose transmission rates. - tranVec <- rep(c(0.1, 0.2, 0.3, 0.4), nNoDispSims/4 + 1) - tran <- tranVec[x] - - - # Set seed (this is set within each parallel simulation to prevent reusing random numbers). - simSeed <- paste0(seed, x) - set.seed(simSeed) - - # Make the population. - p <- makePop(model = 'SIR', events = nEvent, nColonies = 2, nPathogens = 2, recovery = 1, sample = sample, dispersal = 0.0, birth = 0.05, death = 0.05, crossImmunity = 0.1, meanColonySize = 2800, infectDeath = 0, transmission = tran, maxDistance = 120, colonySpatialDistr = 'circle') - - # Seed endemic pathogen. - p$I[2, 2, 1] <- 200 - p$I[1, 1, 1] <- 0 - - # Recalculate rates of each event after seeding. - p <- transRates(p, 1) - - # Burn in simulation - p <- runSim(p, end = invadeT) - - # Seed invading pathogen. - p$I[3, 2, (invadeT + 1) %% sample] <- 5 - - # Recalculate rates of each event after seeding. - p <- transRates(p, (invadeT + 1) %% sample) - - # Continue simulation - p <- runSim(p, start = invadeT + 1, end = 'end') - - # Was the invasion succesful? 
- invasion <- findDisDistr(p, 2)[1] > 0 - - # Save summary stats - d <- data.frame(transmission = NA) - - d$transmission <- p$parameters['transmission'] - d$dispersal <- p$parameters['dispersal'] - d$nExtantDis <- sum(findDisDistr(p, 2) > 0) - d$singleInf <- findCoinfDistr(p, 2)[2] - d$doubleInf <- findCoinfDistr(p, 2)[3] - d$nColonies <- p$parameters['nColonies'] - d$nPathogens <- p$parameters['nPathogens'] - d$meanK <- sum(p$adjacency != 0 )/p$parameters['nColonies'] - d$maxDistance <- p$parameters['maxDistance'] - d$nEvents <- p$parameters['events'] - d$path2 <- sum(p$sample[c(2, 4), , dim(p$sample)[3]]) - - # Time until extinction - invadePath <- colSums(p$sample[2, , (2 + invadeT / sample):(dim(p$sample)[3])]) + - colSums(p$sample[4, , (2 + invadeT / sample):(dim(p$sample)[3])]) - - d$extinctionTime <- cumsum(p$sampleWaiting)[min(which(invadePath == 0)) + (2 + invadeT / sample)] - d$totalTime <- sum(p$sampleWaiting) - d$survivalTime <- d$extinctionTime - cumsum(p$sampleWaiting)[(2 + invadeT / sample)] - - - - message(paste0("finished ", x, ". Invasion: ", invasion )) - - - rm(p) - - return(d) - -} - - - -%%end.rcode - -%%begin.rcode runNoDispSims, eval = runAllSims, cache = FALSE - -# Create and set seed (seed object is used to set seed in each seperate simulation.' -seed <- 19 -set.seed(seed) - -# Run sims. -z <- mclapply(1:nNoDispSims, . %>% fullSim, mc.preschedule = FALSE, mc.cores = nCores) - -z <- do.call(rbind, z) - - - -# Save summary data. -write.csv(z, file = 'data/Chapter2/noDispSims.csv') - -%%end.rcode - - - -\subsection{Parameter selection} -\label{s:paramSelect} - -The fixed parameters were chosen to roughly reflect realistic wild bat populations. -The death rate $\mu$ was set as 0.05 per year giving a generation time of 20 years. -The birth rate $\Lambda$ was set to be equal to $\mu$. -This yields a population that does not systematically increase or decrease. -However, the size of each colony changes as a random walk. 
-Given the length of the simulations, colonies were very unlikely to go extinct (Figure~\ref{fig:plotsNoInvade2}). -The starting size of each colony was set to \si{3000}. -This is appropriate for many bat species \cite{jones2009pantheria}, especially the large, frugivorous \emph{Pteropodidae} that have been particularly associated with recent zoonotic diseases. - -The recovery rate $\gamma$ was set to one, giving an average infection duration of one year. -This is therefore a long lasting infection but not a chronic infection. -It is very difficult to directly estimate infection durations in wild populations but it seems that these infections might sometimes be long lasting \cite{peel2012henipavirus, plowright2015ecological}. -However, other studies have found much shorter infectious periods \cite{amengual2007temporal}. -These shorter infections are not studied further here. %todo consider readding this. " as preliminary simulations found that they could not persist in the relatively small populations being modelled here." - -Four values of the transmission rate $\beta$ were used, 0.1, 0.2, 0.3 and 0.4. -These values were chosen to cover the range of behaviours, from very high probabilities of invasion of the second pathogen, to very low probabilities. -All simulations were run under all four transmission rates as this is such a fundamental parameter. -The coinfection adjustment parameter, $\alpha$, was set to 0.1 so that an individual infected with one pathogen is 90\% less likely to be infected with another. -This is a rather arbitrary value. -However, the rationale of the model was that the invading species might be a newly speciated strain of the endemic species. -Furthermore, the model assumes complete cross-immunity after recovery from infection. -Therefore cross-immunity to coinfection is likely to be very strong as well. 
-Some pairs of closely related bat viruses have been found to coinfect individual bats less than would be expected by chance \cite{anthony2013strategy}. -This indicates a level of cross-immunity between these pairs of viruses. %todo I'm sure there was a marburg ebola paper... - - - - - -\subsection{Experimental setup} - -The metapopulation was made up of ten colonies. -Ten colonies was selected as a trade-off between computation time and a network complex enough that any effects of population structure could be detected. -This value is artificially small compared to wildlife populations. -In each simulation, the na{\"i}ve population was seeded with ten sets of 200 individuals infected with Pathogen 1. -These groups were seeded into randomly selected colonies with replacement. -For each 200 infected individuals added, 200 susceptible individuals were removed to keep starting colony sizes constant. -Pathogen 1 was then allowed to spread until the initial, large epidemic had ended. -Visual inspection of preliminary simulations was used to decide on \SI{\rinline{invadeT}} events as being long enough for the epidemic to end and the pathogen to be in an endemic state (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). -After \SI{\rinline{invadeT}} events, five individuals infected with Pathogen 2 were added to one randomly selected colony. -After another \SI{\rinline{nEvent - invadeT}} events the invasion of Pathogen 2 was considered successful if any individuals were still infected with Pathogen 2. -Therefore, if at least one individual was in class $I_2$ or $I_{12}$ at the end of the simulation, this was considered an invasion. -Again, visual inspection of preliminary simulations was used to determine that after \SI{\rinline{nEvent - invadeT}} events, if an invading pathogen was still present, it was well established (Figures~\ref{fig:plotsInvade} and \ref{fig:plotsNoInvade1}). 
- -The choice to use a fixed number of events, rather than a fixed number of years, was for computational convenience. -However, this choice creates a risk of bias as simulations with a greater total rate of events, $\sum_j e_j$ (e.g.,\ faster disease transmission) will last for a shorter time overall (i.e.\ a smaller $\sum \delta$ over all events). -However, visual inspection of the dynamics of disease extinction (Figure~\ref{fig:plotsNoInvade1}), and examination of the typical time to extinction suggests that this bias is negligible. -For example, of the simulations where extinction occurred, the extinction occurred more than 50 years before the end of the simulation in 90\% of cases. -On a preliminary run of 106 simulations across all combinations of dispersal and transmission rates, examining the population after \SI{700000} events instead of \SI{\rinline{nEvent}} events gave exactly the same result with respect to the binary state of invasion or no invasion. - - -\subsection{Population structure} - -As a baseline for comparison, I ran simulations of a fully unstructured population. -These simulations were run with a population of \SI{30000} so that the total population size was equal to that of the total metapopulation size in the structured simulations. -I ran 100 simulations at each transmission rate. - - -\tmpsection{Dispersal} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -Two parameters control population structure in the model: dispersal rate and the topology of the metapopulation network. -The values used for these parameters were chosen to highlight the effects of population structure. -I selected the dispersal rates $\xi = 0, 0.001, 0.01$ and $0.1$ dispersals per individual per year. -The probability that an individual disperses at least once in its lifetime is given by $\xi / \left(\xi + \mu\right)$. -Therefore, $\xi = 0.1$ relates to 67\% of individuals dispersing between colonies at least once in their lifetime. 
-Exclusively juvenile dispersal would have dispersal rates similar to this value. %todo cite -$\xi = 0.01$ relates to 17\% of individuals dispersing at least once in their lifetime. -This value is relatively close to male-biased dispersal, with female philopatry. %todo cite -$\xi = 0.001$ relates to 2\% of individuals dispersing during their lifetime. -This therefore relates to a species that does not habitually disperse. -Finally, I ran simulations with no dispersal. -Given zero dispersal, only the colony seeded with Pathogen 2 could ever receive infections of the invading pathogen. -Therefore, only one colony was simulated for $\xi = 0$. -While altering the dispersal rate I used a fully connected network topology. -I ran 100 simulations for most parameter sets. -I ran 150 simulations for $\xi = 0.1, 0.01$ and $0.001$ with $\beta = 0.2$ and $0.3$ as preliminary simulations indicated that any effects of population structure would most likely be seen at these values so extra simulations were run to increase statistical power. - - - -\tmpsection{Network structure} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -I also altered the topology of the metapopulation network. -The network topology was created to be either fully or minimally connected (Figure~\ref{f:net}). -To model a completely unconnected population the $\xi = 0$ simulations from above were used. -While altering network topology, the intermediate dispersal rate, $\xi = 0.01$, was used. -I again ran 100 simulations for each parameter set. - - - -\subsection{Statistical analysis} - -I used generalised linear models (GLMs) with a binary response variable, invasion or not, to test the hypothesis that probability of invasion increased with dispersal. -I examined $p$ values and the regression coefficient, $b$, from each model. -Separate GLMs were fitted for each transmission rate. 
-These tests were performed both with and without the $\xi = 0$ results as the complete lack of dispersal makes these simulations qualitatively different to the other simulations. -To test whether the different topologies had different probabilities of invasion, I used Fisher's exact tests because topology is best described as a categorical variable. -As with the $\xi = 0$ results, these tests were performed both with and without the completely unconnected topology results. -Finally, I also used binomial GLMs to test the hypothesis that the probability of invasion increased with transmission rate. -Separate GLMs were fitted for each dispersal rate and network topology. -All statistical analyses were performed using the \emph{stats} package in \emph{R}. -The code used for running the simulations and analysing the results is available at \url{https://github.com/timcdlucas/PhDThesis/blob/master/population-structure-affects-pathogen-richness-mechanistic-model.Rtex}. - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -\section{Results} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -%%begin.rcode loadNoDispData -noDisp <- read.csv('data/Chapter2/noDispSims.csv', stringsAsFactors = FALSE) - -%%end.rcode - - -%%begin.rcode loadDispData - -# Read in the data. -# Later I'll add in the option to simulate the whole dataset. 
-dDisp <- read.csv('data/Chapter2/DispSims.csv', stringsAsFactors = FALSE) -dim(dDisp) -head(dDisp) - -extraMidBeta <- read.csv('data/Chapter2/extraMidBeta.csv', stringsAsFactors = FALSE) - -%%end.rcode - - - - - -%%begin.rcode DispDataOrganise - -dDisp <- rbind(dDisp, extraMidBeta[, 1:11], noDisp[, 1:11]) - - -# Which simulations ended with every pathogen still extant (i.e. a successful invasion) - -dDisp$invasion <- dDisp$nPathogens - dDisp$nExtantDis == 0 - - - -# Number of simulations of each treatment -nDisp <- dDisp %>% - group_by(transmission, dispersal) %>% - dplyr::select(invasion) %>% - summarise(n()) - -nDisp - - -# Number of invasions by treatment -invsDisp <- dDisp %>% - group_by(transmission, dispersal) %>% - dplyr::select(invasion) %>% - filter(invasion == TRUE) %>% - summarise(n()) - -invsDisp - - -propsDisp <- left_join(nDisp, invsDisp, by = c('dispersal', 'transmission')) - -names(propsDisp) <- c( 'transmission', 'dispersal', 'n', 'invasions') - -propsDisp$invasions[is.na(propsDisp$invasions)] <- 0 - -# Proportion of invasions in totals. 
-propsDisp$props <- propsDisp$invasions / propsDisp$n - -propsDisp - -%%end.rcode - - - - -%%begin.rcode DispPropTests - -# Then fit binomial GLMs of invasion against dispersal rate, one per transmission rate - - -DispGLM1 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.1, ], family = binomial)) -DispGLM2 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.2, ], family = binomial)) -DispGLM3 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.3, ], family = binomial)) -DispGLM4 <- summary(glm(invasion ~ dispersal, data = dDisp[dDisp$transmission == 0.4, ], family = binomial)) - -dDispNoZero <- dDisp %>% dplyr::filter(dispersal != 0) -Disp2GLM1 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.1, ], family = binomial)) -Disp2GLM2 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.2, ], family = binomial)) -Disp2GLM3 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.3, ], family = binomial)) -Disp2GLM4 <- summary(glm(invasion ~ dispersal, data = dDispNoZero[dDispNoZero$transmission == 0.4, ], family = binomial)) - - -%%end.rcode - - - -%%begin.rcode DispTransPropTests -# Finally fit binomial GLMs of invasion against transmission rate, one per dispersal rate - -DispTransGLM1 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.001, ], family = binomial)) -DispTransGLM2 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.01, ], family = binomial)) -DispTransGLM3 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0.1, ], family = binomial)) -DispTransGLM4 <- summary(glm(invasion ~ transmission, data = dDisp[dDisp$dispersal == 0, ], family = binomial)) - -%%end.rcode - - - - -%%begin.rcode loadTopoData - -# Read in the data. -# Later I'll add in the option to simulate the whole dataset. 
-dTopo <- read.csv('data/Chapter2/TopoSims.csv', stringsAsFactors = FALSE) -dim(dTopo) -head(dTopo) - -dTopo <- rbind(dTopo, noDisp[, 1:11]) - -%%end.rcode - - -%%begin.rcode TopoDataOrganise - -# Which simulations ended with every pathogen still extant (i.e. a successful invasion) - -dTopo$invasion <- dTopo$nPathogens - dTopo$nExtantDis == 0 - -# Number of invasions by treatment -invsTopo <- dTopo %>% - group_by(transmission, meanK) %>% - dplyr::select(invasion) %>% - filter(invasion == TRUE) %>% - summarise(n()) %>% - rbind(c(0.1, 1, 0), .) - -invsTopo - -# Number of simulations of each treatment -nTopo <- dTopo %>% - group_by(transmission, meanK) %>% - dplyr::select(invasion) %>% - summarise(n()) - -nTopo - - - -propsTopo <- left_join(nTopo, invsTopo, by = c('meanK', 'transmission')) - -names(propsTopo) <- c( 'transmission', 'meanK', 'n', 'invasions') - -propsTopo$invasions[is.na(propsTopo$invasions)] <- 0 - -# Proportion of invasions in totals. -propsTopo$props <- propsTopo$invasions / propsTopo$n - -propsTopo - -%%end.rcode - - - - -%%begin.rcode TopoPropTests - -# Then run Fisher's exact tests between network topologies, one per transmission rate - -TopoTest1 <- fisher.test(cbind(propsTopo$invasions[1:3], propsTopo$n[1:3] - propsTopo$invasions[1:3])) -TopoTest2 <- fisher.test(cbind(propsTopo$invasions[4:6], propsTopo$n[4:6] - propsTopo$invasions[4:6])) -TopoTest3 <- fisher.test(cbind(propsTopo$invasions[7:9], propsTopo$n[7:9] - propsTopo$invasions[7:9])) -TopoTest4 <- fisher.test(cbind(propsTopo$invasions[10:12], propsTopo$n[10:12] - propsTopo$invasions[10:12])) - - - -Topo2Test1 <- fisher.test(cbind(propsTopo$invasions[2:3], propsTopo$n[2:3] - propsTopo$invasions[2:3])) -Topo2Test2 <- fisher.test(cbind(propsTopo$invasions[5:6], propsTopo$n[5:6] - propsTopo$invasions[5:6])) -Topo2Test3 <- fisher.test(cbind(propsTopo$invasions[8:9], propsTopo$n[8:9] - propsTopo$invasions[8:9])) -Topo2Test4 <- fisher.test(cbind(propsTopo$invasions[11:12], propsTopo$n[11:12] - propsTopo$invasions[11:12])) - -%%end.rcode - - - - -%%begin.rcode
TopoTransPropTests -#Finally run proportion tests between transmission rates - -#TopoTransTest1 <- fisher.test(cbind(propsTopo$invasions[c(1, 3, 5, 7)], propsTopo$n[c(1, 3, 5, 7)] - propsTopo$invasions[c(1, 3, 5, 7)])) -#TopoTransTest2 <- fisher.test(cbind(propsTopo$invasions[c(2, 4, 6, 8)], propsTopo$n[c(2, 4, 6, 8)] - propsTopo$invasions[c(2, 4, 6, 8)])) - - -TopoTransGLM1 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 9, ], family = binomial)) -TopoTransGLM2 <- summary(glm(invasion ~ transmission, data = dTopo[dTopo$meanK == 2, ], family = binomial)) - -%%end.rcode - - -%%begin.rcode caption1String -# Just defining my caption label here to avoid the long string in chunk options below. - -invasionPropCaption <- sprintf(" - The probability of successful invasion for different A) dispersal rates and B) network topologies (with network topologies ``unconnected'', ``minimally connected'' and ``fully connected'' as in Figure~\\ref{f:net}). - Error bars are 95\\%% confidence intervals of probability of invasion. - %i simulations were run for each treatment except $\\beta = 0.2$ and $0.3$ in A) which has 150 per treatment. - Other parameters were kept constant at: $m = 10,\\, \\, \\mu = \\Lambda = 0.05,\\, \\gamma = 1,\\, \\alpha = 0.1$. - When dispersal is varied, the population structure is fully connected. 
- When network topology is varied, $\\xi = 0.01$.", - as.integer(each)) - -invasionPropShort <- "The probability of invasion across different dispersal rates and network topologies" - - -%%end.rcode - - -%%begin.rcode invasionPropPlots, fig.lp = 'f:', fig.height = 2.6, out.width = "\\textwidth", fig.cap = invasionPropCaption, cache = FALSE, fig.scap = invasionPropShort - -propsDispCI <- data.frame(propsDisp[, 1:2], binom.confint(propsDisp$invasions, propsDisp$n, conf.level = 0.95, methods = "exact")) -propsDispCI <- propsDispCI %>% mutate(dispersal = replace(dispersal, dispersal == 0, 1e-4)) -propsDispCI <- propsDispCI %>% mutate(transFactor = factor(transmission)) - -dispPlot <- ggplot(propsDispCI, aes(x = dispersal, y = mean, colour = transFactor)) + - geom_point() + - geom_line() + - scale_x_log10(breaks = c(1e-4, 1e-3, 1e-2, 1e-1), labels = c('0', '0.001', '0.01', '0.1')) + - geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + - scale_colour_poke(name = expression(beta), - pokemon = 'illumise', - spread = 4) + - xlab('Dispersal') + - ylab('Prop. Invasions') + - theme(legend.position = "none", panel.grid.major.x = element_blank()) - - -propsTopoCI <- data.frame(propsTopo[, 1:2], binom.confint(propsTopo$invasions, nTopo$n, conf.level = 0.95, methods = "exact")) - -propsTopoCI$topo <- factor(propsTopoCI$meanK, levels = c(1, 2, 9)) -propsTopoCI$topoCont <- as.numeric(propsTopoCI$topo) -propsTopoCI <- propsTopoCI %>% mutate(transFactor = factor(transmission)) - - -topoPlot <- ggplot(propsTopoCI, aes(x = topoCont, y = mean, colour = transFactor)) + - geom_point() + - geom_line() + - scale_x_continuous(breaks = c(1, 2, 3), - labels = c('Unconn.','Min.','Full.'), - limits = c(0.9, 3.1)) + - geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.04) + - scale_colour_poke(name = expression(beta), - pokemon = 'illumise', - spread = 4) + - - xlab('Network Topology') + - ylab('Prop. 
Invasions') + - theme(legend.position = "none", panel.grid.major.x = element_blank()) - -# Extract the legend -grobs <- ggplotGrob(topoPlot + theme(legend.position="bottom"))$grobs -legend_b <- grobs[[which(sapply(grobs, function(x) x$name) == "guide-box")]] - - -ggdraw() + - draw_label("A)", 0.03, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(dispPlot, 0, 0.06, 0.5, 0.94) + - draw_label("B)", 0.53, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + - draw_plot(topoPlot, 0.5, 0.06, 0.5, 0.94) + - draw_grob(legend_b, 0.32, 0.01, 0.4, 0.1) - - - - -%%end.rcode - - -\subsection{Dispersal} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -In the unstructured population, Pathogen 2 invaded in 100 out of 100 simulations. -This was true at all four transmission rates. - -%todo check formating of pvalues -When the $\xi = 0$ simulations were included, there was a positive relationship between dispersal rate and invasion probability for $\beta = 0.2, 0.3$ and $0.4$ (Figure~\ref{f:invasionPropPlots}A, Table~\ref{B-disp}). -These positive relationships were all significant (GLM. $\beta = 0.2$: $b$ = \rinline{DispGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\beta = 0.3$: $b$ = \rinline{DispGLM3$coefficients[2, 1]}, $p$ = \rinline{DispGLM3$coefficients[2, 4]}. $\beta = 0.4$: $b$ = \rinline{DispGLM4$coefficients[2, 1]}, $p$ = \rinline{DispGLM4$coefficients[2, 4]}.) -At $\beta = 0.1$ there was no significant relationship as invasion probability was very close to zero at all dispersal rates (GLM. $b$ = \rinline{DispGLM1$coefficients[2, 1]}, $p$ = \rinline{DispGLM1$coefficients[2, 4]}). - -However, when the $\xi = 0$ simulations were removed, this significant, positive relationship largely disappeared. -At $\beta = 0.2$, the significant positive relationship remained (GLM: $b~=~$\rinline{Disp2GLM2$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM2$coefficients[2, 4]}). 
-At all other transmission rates, the probability of invasion did not significantly change with dispersal rate (GLM. $\beta = 0.1$: $b$ = \rinline{Disp2GLM1$coefficients[2, 1]}, $p$ = \rinline{Disp2GLM1$coefficients[2, 4]}. $\beta = 0.3$: $b$ = \rinline{Disp2GLM3$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM3$coefficients[2, 4])}. $\beta = 0.4$: $b$ = \rinline{Disp2GLM4$coefficients[2, 1]}, $p$ = \rinline{sprintf('%.2f', Disp2GLM4$coefficients[2, 4])}.) - - -\subsection{Network topology} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -When the completely unconnected topology simulations were included, the probability of invasion was different across topologies for $\beta = 0.2, 0.3$ and $0.4$ (Fisher's exact test. $\beta = 0.2$: $p < 10^{-5}$. $\beta = 0.3$: $p < 10^{-5}$. $\beta = 0.4$: $p < 10^{-5}$). -In each case, the fully unconnected population had a lower probability of invasion than the minimally and completely connected topologies (Figure~\ref{f:invasionPropPlots}B, Table~\ref{B-topo}). -At $\beta = 0.1$ there was no significant difference ($p = \rinline{p(TopoTest1$p.value)}$) and the probability of invasion was close to zero for all topologies (Figure~\ref{f:invasionPropPlots}B). - -When the completely unconnected topology simulations were removed, there were no significant differences between topologies i.e.\ between the minimally and fully connected topologies (Figure~\ref{f:invasionPropPlots}B). -This was true at all transmission rates (Fisher's exact test. $\beta = 0.1$, $p = \rinline{sprintf('%.2f', Topo2Test1$p.value)}$. $\beta = 0.2$, $p = \rinline{p(Topo2Test2$p.value)}$. $\beta = 0.3$, $p = \rinline{p(Topo2Test3$p.value)}$. $\beta = 0.4$, $p = \rinline{p(Topo2Test4$p.value)}$). - - - -\subsection{Transmission} -%%%%%%%%%%%%%%%%%%%%%%%%%% - -Increasing the transmission rate increased the probability of invasion (Figure~\ref{f:invasionPropPlots}). -This was true for all four dispersal values (GLM. 
$\xi = 0$: $b$ = \rinline{DispTransGLM4$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.001$: $b$ = \rinline{DispTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.01$: $b$ = \rinline{DispTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. $\xi = 0.1$: $b$ = \rinline{DispTransGLM3$coefficients[2, 1]}, $p < 10^{-5}$.) and both network structures (GLM. Minimally connected: $b$ = \rinline{TopoTransGLM2$coefficients[2, 1]}, $p < 10^{-5}$. Fully connected: $b$ = \rinline{TopoTransGLM1$coefficients[2, 1]}, $p < 10^{-5}$). - - - - - - - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - -\section{Discussion}\label{s:sims1Disc} - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\tmpsection{Restate the gap and the main result} - -I have used mechanistic, metapopulation models to test whether increased population structure can promote pathogen richness by facilitating invasion of new pathogens. -I found that dispersal does affect the ability of a new pathogen to invade and persist in a population. -I also found evidence that pathogen invasion was less likely in completely isolated colonies. -However, apart from the completely unconnected network, the topology of the metapopulation network did not affect invasion probability. -Increasing transmission rate quickly reaches a state where new pathogens always invade as long as the metapopulation is not completely unconnected. -Decreasing the transmission rate quickly reaches a state where invasion is impossible. - -The result that increased population structure decreases pathogen richness supports many existing predictions that increasing $R_0$ should increase pathogen richness \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. 
-However, many comparative studies have found the opposite relationship, with increased population structure increasing pathogen richness (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}). -Furthermore, simple analytical models suggest that population structure should increase pathogen richness \cite{qiu2013vector, allen2004sis, nunes2006localized} but I find no evidence of this. - - -\tmpsection{Link results to consequences} - -These results suggest that if population structure does in fact affect pathogen richness, as observed in comparative studies (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), it must occur by a mechanism other than the one studied here. -In this study the hypothesised mechanism for the relationship between population structure and pathogen richness was that the spread and persistence of a newly evolved pathogen would be facilitated in highly structured populations as the lack of movement between colonies would stochastically create areas of low prevalence of the endemic pathogen. -If the invading pathogen evolved (i.e.\ was seeded) in one of these areas of low prevalence, invasion would be more likely. -Instead, reduced population structure allowed the new pathogen to quickly spread outside of the colony in which it evolved. -As the mechanism studied here cannot explain the relationship between population structure and pathogen richness seen in wild species (Chapter~\ref{ch:empirical}, \cites{vitone2004body, maganga2014bat, turmelle2009correlates}), other mechanisms should be studied. -Other mechanisms that should be examined include reduced competitive exclusion of already established pathogens or increased invasion of less closely related and less strongly competing pathogens, perhaps mediated by ecological competition of pathogens (i.e.\ reduction of the susceptible pool by disease-induced mortality). 
-Furthermore, single pathogen dynamics could have an important role; for example, population structure might cause a much slower, asynchronous epidemic that prevents acquired herd immunity \cite{plowright2011urban}. - -I ran simulations of a completely unstructured population as a baseline comparison of pathogen invasion probability. -However, this unstructured population could also be considered one, very large, subpopulation or colony. -The fact that invasion occurred 100\% of the time in these simulations suggests that colony size has an important role in pathogen richness. -Therefore, the interplay between population structure and colony size should be studied further, especially as the range of colony size in bats is large, ranging from ten to one million individuals \cite{jones2009pantheria}. - -My simulations also highlighted the importance of competition for the spread of a new pathogen. -All parameters used corresponded to pathogens with $R_0>1$ (as seen by the consistent spread of Pathogen 1). -However, the competition with the endemic pathogen meant that for some transmission rates the chance of epidemic spread and persistence of Pathogen 2 was close to zero. -This has implications for human epidemics as well: if there is strong competition between a newly evolved strain and an endemic strain, we are unlikely to see the new strain spread, regardless of population structure. - - - -\subsection{Model assumptions} - -\subsubsection{Complete cross-immunity} - -I have assumed that once recovered, individuals are immune to both pathogens. -Furthermore, when a coinfected individual recovers from one pathogen, it immediately recovers from the other as well. -This is probably a reasonable assumption given that I am modelling a newly evolved strain. -However, the rate of recovery from pathogens in the presence of coinfections has not been well studied. 
-In humans, the rate of recovery from respiratory syncytial virus was faster in individuals that had recently recovered from one of a number of co-circulating viruses \cite{munywoki2015influence}. -However, currently coinfected individuals recovered more slowly than average \cite{munywoki2015influence}. - -Further work could relax this assumption using a model similar to \cite{poletto2015characterising} which contains additional classes for ``infected with Pathogen 1, immune to Pathogen 2'' and ``infected with Pathogen 2, immune to Pathogen 1''. -The model here was formulated such that the study of systems with greater than two pathogens (an avenue for further study) is still computationally feasible. -A model such as that used in \cite{poletto2015characterising} contains $3^\rho$ classes for a system with $\rho$ pathogen species. -This quickly becomes computationally restrictive. -It might be expected that there is an upper limit to the total number of pathogen species that can coexist in a population. -In particular, it is possible that once a certain number of species are endemic in a population, no more pathogens can invade into the population. -This has not been studied in the context of metapopulations. - -\subsubsection{Identical strains} - -Many papers on pathogen richness have focused on the evolution of pathogen traits and have considered a trade-off between transmission rate and virulence \cite{nowak1994superinfection} or infectious period \cite{poletto2013host}. -However, here I am interested in host traits. -Therefore I have assumed that pathogen strains are identical. -It is clear however that there are a number of factors that affect pathogen richness and my focus on host population structure does not imply that pathogen traits are not important. 
- -\subsubsection{Complex social structure and behaviour} - -With the models here I have aimed to tread a middle ground between the overly simplistic models employed in analytical studies \cite{allen2004sis} and the full complexity and variety of true bat social systems \cite{kerth2008causes}. -The factors that have not been modelled here include seasonal migration, maternity roosts, hibernation roosts and swarming sites \cite{kerth2008causes, fleming2003ecology, richter2008first, cryan2014continental}. -While future models might aim to model this complexity more fully, the number of parameters that are required to be estimated and varied becomes very large. -Furthermore, not all of these social complexities exist in all bat species, so in limiting my analysis to the simpler end of bat social systems it is hoped that the results are more broadly representative of the order. - -Furthermore, I have considered a single host species in isolation. -It seems likely that sympatry in bats and other mammals is epidemiologically important \cite{brierley2016quantifying, luis2013comparison, pilosof2015potential} but this was beyond the scope of this study. -There is potential for this to be effectively modelled as a multi-layered network \cite{wang2016structural, funk2010interacting} and this would be expected to act to reduce population structure. -Conversely, the case of interspecies roost sharing could be modelled as an additional layer of within-colony, population structure which would tend to increase population structure. - -Finally, many species of bat exhibit strong seasonal birth pulses which are known to affect disease dynamics \cite{hayman2015biannual,peel2014effect,amman2012seasonal}. -This would be expected to facilitate the invasion of new pathogen species; if a new strain evolved or entered the population by migration during a period of low population immunity, it would have a higher chance of invading and establishing in the population. 
-Again this was beyond the scope of this study, but birth pulses and their interactions with seasonally varying transmission rates are a useful area for further research. - -\subsection{Conclusions} - -In conclusion, I have found evidence that reduced population structure facilitates the invasion and establishment of newly evolved pathogen species. -However, the direction of the relationship contradicts that found in wild species. -This suggests that if population structure does have a role in shaping pathogen communities, it is unlikely to be by this specific mechanism. - - - - - - From ae87ba09020c97bdbd691c27dd95e602db782757 Mon Sep 17 00:00:00 2001 From: Tim Lucas Date: Mon, 25 Jul 2016 17:04:18 +0100 Subject: [PATCH 17/17] Various edits that should have been committed earlier. --- Appendix2.tex | 266 ++++++++++++++++++ Appendix3.tex | 218 +++++++------- Appendix4.tex | 151 ++++++++++ Discussion.tex | 18 +- Introduction.tex | 2 +- LinksAndMetadata.tex | 4 +- Preamble.tex | 4 +- epilit.bib | 16 ++ ...et_al_supplementarymaterial_2015-01-20.tex | 18 +- tim-lucas-thesis.tex | 8 +- 10 files changed, 576 insertions(+), 129 deletions(-) create mode 100644 Appendix2.tex create mode 100644 Appendix4.tex diff --git a/Appendix2.tex b/Appendix2.tex new file mode 100644 index 0000000..d1b1075 --- /dev/null +++ b/Appendix2.tex @@ -0,0 +1,266 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code for "Appendix C: Understanding how population structure affects pathogen richness in a mechanistic model of bat populations" +% Appendix for Chapter 3 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +% NB The file is numbered Appendix2 as Chapter 3 was previously Chapter 2 in the thesis. 
+% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + + + + + + + + + + + + + + +% --------------------------------------------------------------------------- % +% Invading pathogen plots +% --------------------------------------------------------------------------- % + + + + +\begin{knitrout}\footnotesize +\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t] + +{\centering \includegraphics[width=\textwidth]{figure/A-plotsInvade-1} + +} + +\caption[ +Examples of simulated SIR dynamics with successful invasions +]{ +Two examples (A and B) of a successful invasion plotted on a logged $y$-axis. +The lines are coloured such that blue represents susceptibles, brown represents individuals infected with one pathogen (the two separate brown lines are Pathogen 1 and 2), black represents co-infected individuals and yellow represents recovered and immune individuals. +Pathogen 2 is seeded after \SI{300000} events (approximately 30 years). +Simulations are run on a fully-connected network. +Parameter values are: dispersal rate = 0.1, transmission rate = 0.2. +All other parameters are as stated in Table~\ref{t:params}. +}\label{fig:plotsInvade} +\end{figure} + + +\end{knitrout} + + + +% --------------------------------------------------------------------------- % +% No invasion plots +% --------------------------------------------------------------------------- % + + + + + + + +\begin{knitrout}\footnotesize +\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t] + +{\centering \includegraphics[width=\textwidth]{figure/A-plotsNoInvade-1} + +} + +\caption[ +Examples of simulated SIR dynamics with unsuccessful invasions +]{ +Two examples (A and B) of an unsuccessful invasion plotted on a logged $y$-axis. 
+The lines are coloured such that blue represents susceptibles, brown represents individuals infected with one pathogen (the two separate brown lines are Pathogen 1 and 2), black represents co-infected individuals and yellow represents recovered and immune individuals. +Pathogen 2 is seeded after \SI{300000} events (approximately 30 years). +It can be seen that after seeding Pathogen 2, there is a very short period before extinction as opposed to a long fade out of disease. +Simulations are run on a fully-connected network. +Parameter values are: dispersal rate = 0.1, transmission rate = 0.2. +All other parameters are as stated in Table~\ref{t:params}. +}\label{fig:plotsNoInvade1} +\end{figure} + +\begin{figure}[t] + +{\centering \includegraphics[width=0.8\textwidth]{figure/A-plotsNoInvade-2} + +} + +\caption[ +Examples of colony size dynamics +]{ +Two examples (A and B) of the change in colony sizes throughout a simulation (note the truncated $y$-axis). +The size of each colony changes as a random walk. +However, given the length of the simulations, there is little risk of colonies going extinct or becoming very large. +Birth and death rate are equal and set to 0.05, giving a generation time of 20 years. +The metapopulation network is fully-connected and the dispersal rate is 0.1 per year. +The starting colony size is \SI{3000}. 
+}\label{fig:plotsNoInvade2} +\end{figure} + + +\end{knitrout} + + + + + + + + + + + + + + +% ------------------------------------------------------------------ % +% Topology Sims +% ------------------------------------------------------------------ % + + + + + + + + + + + + + + + + + +% ------------------------------------------------ % +% Raw data tables +% ------------------------------------------------ % + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +\clearpage +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Mon Jul 25 00:04:16 2016 +\begin{table}[ht] +\centering +\caption[ +Raw data for dispersal simulations + ]{ +Raw data for dispersal simulations. +The relevant parameters are shown along with the number of invasions and the number of simulations. +$\beta$ is the transmission rate. +} +\label{B-disp} +\begingroup\small +\begin{tabular}{@{}rrrr@{}} + \toprule +$\beta$ & Dispersal & Invasions & Sims \\ + \midrule +0.1 & 0.000 & 0 & 100 \\ + 0.1 & 0.001 & 1 & 101 \\ + 0.1 & 0.010 & 0 & 100 \\ + 0.1 & 0.100 & 0 & 99 \\ + 0.2 & 0.000 & 4 & 100 \\ + 0.2 & 0.001 & 42 & 126 \\ + 0.2 & 0.010 & 41 & 126 \\ + 0.2 & 0.100 & 63 & 123 \\ + 0.3 & 0.000 & 47 & 100 \\ + 0.3 & 0.001 & 113 & 125 \\ + 0.3 & 0.010 & 113 & 126 \\ + 0.3 & 0.100 & 112 & 124 \\ + 0.4 & 0.000 & 75 & 100 \\ + 0.4 & 0.001 & 96 & 100 \\ + 0.4 & 0.010 & 98 & 100 \\ + 0.4 & 0.100 & 96 & 100 \\ + \bottomrule +\end{tabular} +\endgroup +\end{table} + + + + + +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Mon Jul 25 00:04:16 2016 +\begin{table}[ht] +\centering +\caption[ +Raw data for topology simulations + ]{ +Raw data for topology simulations. +The relevant parameters are shown along with the number of invasions and the number of simulations. +$\beta$ is the transmission rate. 
+} +\label{B-topo} +\begingroup\small +\begin{tabular}{@{}rllr@{}} + \toprule +$\beta$ & Topology & Invasions & Sims \\ + \midrule +0.1 & Unconnected & 0 & 100 \\ + 0.1 & Minimally & 1 & 101 \\ + 0.1 & Fully & 1 & 99 \\ + 0.2 & Unconnected & 4 & 100 \\ + 0.2 & Minimally & 30 & 100 \\ + 0.2 & Fully & 28 & 100 \\ + 0.3 & Unconnected & 47 & 100 \\ + 0.3 & Minimally & 94 & 100 \\ + 0.3 & Fully & 88 & 100 \\ + 0.4 & Unconnected & 75 & 100 \\ + 0.4 & Minimally & 97 & 100 \\ + 0.4 & Fully & 99 & 100 \\ + \bottomrule +\end{tabular} +\endgroup +\end{table} + + + + + + diff --git a/Appendix3.tex b/Appendix3.tex index 60c5ae4..b8b6f57 100644 --- a/Appendix3.tex +++ b/Appendix3.tex @@ -1,3 +1,13 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code for "Appendix B: A comparative test of the role of population structure in determining pathogen richness" +% Appendix for Chapter 3 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +% NB The file is numbered Appendix3 as Chapter 2 was previously Chapter 3 in the thesis. +% +%---------------------------------------------------------------------------------------------------------------------------------% + + @@ -24,12 +34,12 @@ \begin{landscape} -% latex table generated in R 3.3.0 by xtable 1.8-2 package -% Sat May 21 15:45:20 2016 +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Mon Jul 25 00:11:19 2016 \begingroup\tiny \begin{longtable}{@{}llrrrrrrrrrl@{}} \caption[ -Raw data for both analyses. +Raw data for both analyses ]{ Raw data for both analyses. Range Length is the distance between furthest apart points in the species range. @@ -298,8 +308,8 @@ } \caption[ - Logged number of references on Scholar and PubMed, with a fitted phylogenetic linear model]{ - Logged number of references on Scholar and PubMed, with a fitted phylogenetic linear model. 
+ Logged number of references on Google Scholar and PubMed, with a fitted phylogenetic linear model]{ + Logged number of references on Google Scholar and PubMed, with a fitted phylogenetic linear model. Colours indicate family. (pgls: $t$ = 19.32, df = 194, $p < 10^{-5}$).}\label{fig:scholarvspubmedPlot} \end{figure} @@ -337,15 +347,15 @@ -% latex table generated in R 3.3.0 by xtable 1.8-2 package -% Sat May 21 15:12:11 2016 +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Sun Jul 24 23:57:44 2016 \begin{table}[ht] \centering \caption[ - Full model selection results for number of subspecies analysis. + Full model selection results for number of subspecies analysis ]{ Model selection results for number of subspecies analysis. - $\bar{\text{AICc}}$ is the mean AICc score across inline{nBoots} resamplings of the null random variable. + $\bar{\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\sum w$ is the cumulative sum of the Akaike weights. @@ -404,15 +414,15 @@ -% latex table generated in R 3.3.0 by xtable 1.8-2 package -% Sat May 21 15:12:11 2016 +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Sun Jul 24 23:57:44 2016 \begin{table}[ht] \centering \caption[ - Full model selection results for effective gene flow analysis. + Full model selection results for effective gene flow analysis ]{ Model selection results for effective gene flow analysis. - $\bar{\text{AICc}}$ is the mean AICc score across inline{nBoots} resamplings of the null random variable. + $\bar{\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. 
$w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\sum w$ is the cumulative sum of the Akaike weights. @@ -486,7 +496,7 @@ } -\caption[Pruned alternative phylogeny with dot size showing number of pathogens and colour showing family.]{ +\caption[Pruned alternative phylogeny showing number of pathogens and family]{ The distribution of viral richness on the alternate phylogeny. The phylogeny is from \textcite{jones2005bats} (version 2) pruned to include all species used in either the number of subspecies or gene flow analysis. Dot size shows the number of known viruses for that species and colour shows family. @@ -520,11 +530,13 @@ \begin{figure}[t] \centering \includegraphics[width=1\textwidth]{figure/fstITPlots2-1.pdf} - \caption[Akaike variable weights for analysis using alternative phylogeny]{ -Akaike variable weights for both analyses using the phylogeny from \cite{jones2005bats}. -The probability that each variable is in the best model (amongst the models test) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. -The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. -The purple ``Random'' box is the uniformly random variable. + \caption[The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness using alternative phylogeny]{ +The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness using the phylogeny from \cite{jones2005bats}. +The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis. 
+The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. +The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles. +The red ``Random'' box is the uniformly random variable. +Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses. } \label{f:A-itplots} \end{figure} @@ -547,15 +559,15 @@ -% latex table generated in R 3.3.0 by xtable 1.8-2 package -% Sat May 21 15:14:27 2016 +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Sun Jul 24 23:57:45 2016 \begin{table}[ht] \centering \caption[ - Full model selection results for number of subspecies analysis using alternative phylogeny. + Full model selection results for number of subspecies analysis using alternative phylogeny ]{ Model selection results for number of subspecies analysis using phylogeny from \cite{jones2005bats}. - $\bar{\text{AICc}}$ is the mean AICc score across inline{nBoots} resamplings of the null random variable. + $\bar{\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\sum w$ is the cumulative sum of the Akaike weights. 
@@ -567,45 +579,45 @@ \toprule Model & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$ \\ \midrule -log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + log(log(Mass)) + log(log(RangeSize)) & 756.44 & 0.00 & 0.21 & 0.21 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + log(log(Mass)) & 756.90 & 0.46 & 0.17 & 0.38 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies & 757.64 & 1.19 & 0.12 & 0.50 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + log(log(Mass)) + rand & 758.07 & 1.62 & 0.09 & 0.59 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + log(log(RangeSize)) & 758.35 & 1.90 & 0.08 & 0.68 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + rand & 758.81 & 2.37 & 0.07 & 0.74 \\ - log(log(Scholar)) + NSubspecies + log(log(Mass)) + log(log(RangeSize)) & 759.28 & 2.83 & 0.05 & 0.79 \\ - log(log(Scholar))*NSubspecies + log(log(Scholar)) + NSubspecies + log(log(RangeSize)) + rand & 759.56 & 3.12 & 0.04 & 0.84 \\ - log(log(Scholar)) + NSubspecies + log(log(Mass)) & 759.92 & 3.47 & 0.04 & 0.87 \\ - log(log(Scholar)) + NSubspecies + log(log(Mass)) + log(log(RangeSize)) + rand & 760.55 & 4.10 & 0.03 & 0.90 \\ - log(log(Scholar)) + NSubspecies & 760.76 & 4.31 & 0.02 & 0.93 \\ - log(log(Scholar)) + NSubspecies + log(log(Mass)) + rand & 761.16 & 4.71 & 0.02 & 0.95 \\ - log(log(Scholar)) + NSubspecies + log(log(RangeSize)) & 761.34 & 4.90 & 0.02 & 0.96 \\ - log(log(Scholar)) + NSubspecies + rand & 761.99 & 5.55 & 0.01 & 0.98 \\ - log(log(Scholar)) + NSubspecies + log(log(RangeSize)) + rand & 762.62 & 6.17 & 0.01 & 0.99 \\ - log(log(Scholar)) + log(log(Mass)) + log(log(RangeSize)) & 765.17 & 8.73 & 0.00 & 0.99 \\ - log(log(Scholar)) + log(log(Mass)) & 765.74 & 9.29 & 0.00 & 0.99 \\ - log(log(Scholar)) & 766.00 & 9.55 & 0.00 & 0.99 \\ - log(log(Scholar)) + log(log(Mass)) + log(log(RangeSize)) + rand & 766.37 & 9.92 & 0.00 & 1.00 \\ - log(log(Scholar)) + 
log(log(RangeSize)) & 766.49 & 10.05 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(Mass)) + rand & 766.89 & 10.44 & 0.00 & 1.00 \\ - log(log(Scholar)) + rand & 767.15 & 10.70 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(RangeSize)) + rand & 767.68 & 11.24 & 0.00 & 1.00 \\ - NSubspecies + log(log(Mass)) + log(log(RangeSize)) & 778.19 & 21.74 & 0.00 & 1.00 \\ - NSubspecies + log(log(Mass)) + log(log(RangeSize)) + rand & 779.40 & 22.96 & 0.00 & 1.00 \\ - NSubspecies + log(log(RangeSize)) & 784.68 & 28.23 & 0.00 & 1.00 \\ - NSubspecies + log(log(RangeSize)) + rand & 785.92 & 29.47 & 0.00 & 1.00 \\ - NSubspecies + log(log(Mass)) & 789.77 & 33.32 & 0.00 & 1.00 \\ - log(log(Mass)) + log(log(RangeSize)) & 790.49 & 34.04 & 0.00 & 1.00 \\ - NSubspecies + log(log(Mass)) + rand & 791.00 & 34.55 & 0.00 & 1.00 \\ +log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) & 756.44 & 0.00 & 0.21 & 0.21 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + log(Mass) & 756.90 & 0.46 & 0.17 & 0.38 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies & 757.64 & 1.19 & 0.12 & 0.49 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + log(Mass) + rand & 758.04 & 1.59 & 0.09 & 0.59 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + log(RangeSize) & 758.35 & 1.90 & 0.08 & 0.67 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + rand & 758.79 & 2.34 & 0.07 & 0.73 \\ + log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) & 759.28 & 2.83 & 0.05 & 0.79 \\ + log(Scholar)*NSubspecies + log(Scholar) + NSubspecies + log(RangeSize) + rand & 759.50 & 3.06 & 0.05 & 0.83 \\ + log(Scholar) + NSubspecies + log(Mass) & 759.92 & 3.47 & 0.04 & 0.87 \\ + log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) + rand & 760.33 & 3.89 & 0.03 & 0.90 \\ + log(Scholar) + NSubspecies & 760.76 & 4.31 & 0.02 & 0.92 \\ + log(Scholar) + NSubspecies + log(Mass) + rand & 760.99 & 4.54 & 0.02 & 0.94 \\ + log(Scholar) + NSubspecies + log(RangeSize) & 
761.34 & 4.90 & 0.02 & 0.96 \\ + log(Scholar) + NSubspecies + rand & 761.83 & 5.39 & 0.01 & 0.98 \\ + log(Scholar) + NSubspecies + log(RangeSize) + rand & 762.42 & 5.98 & 0.01 & 0.99 \\ + log(Scholar) + log(Mass) + log(RangeSize) & 765.17 & 8.73 & 0.00 & 0.99 \\ + log(Scholar) + log(Mass) & 765.74 & 9.29 & 0.00 & 0.99 \\ + log(Scholar) & 766.00 & 9.55 & 0.00 & 0.99 \\ + log(Scholar) + log(Mass) + log(RangeSize) + rand & 766.21 & 9.76 & 0.00 & 1.00 \\ + log(Scholar) + log(RangeSize) & 766.49 & 10.05 & 0.00 & 1.00 \\ + log(Scholar) + log(Mass) + rand & 766.78 & 10.33 & 0.00 & 1.00 \\ + log(Scholar) + rand & 767.04 & 10.60 & 0.00 & 1.00 \\ + log(Scholar) + log(RangeSize) + rand & 767.54 & 11.10 & 0.00 & 1.00 \\ + NSubspecies + log(Mass) + log(RangeSize) & 778.19 & 21.74 & 0.00 & 1.00 \\ + NSubspecies + log(Mass) + log(RangeSize) + rand & 779.22 & 22.78 & 0.00 & 1.00 \\ + NSubspecies + log(RangeSize) & 784.68 & 28.23 & 0.00 & 1.00 \\ + NSubspecies + log(RangeSize) + rand & 785.76 & 29.31 & 0.00 & 1.00 \\ + NSubspecies + log(Mass) & 789.77 & 33.32 & 0.00 & 1.00 \\ + log(Mass) + log(RangeSize) + rand & 790.42 & 33.98 & 0.00 & 1.00 \\ + log(Mass) + log(RangeSize) & 790.49 & 34.04 & 0.00 & 1.00 \\ + NSubspecies + log(Mass) + rand & 790.85 & 34.41 & 0.00 & 1.00 \\ NSubspecies & 792.53 & 36.09 & 0.00 & 1.00 \\ - NSubspecies + rand & 793.76 & 37.31 & 0.00 & 1.00 \\ - log(log(Mass)) + log(log(RangeSize)) + rand & 796.23 & 39.79 & 0.00 & 1.00 \\ - log(log(RangeSize)) & 796.89 & 40.44 & 0.00 & 1.00 \\ - log(log(RangeSize)) + rand & 798.03 & 41.59 & 0.00 & 1.00 \\ - log(log(Mass)) & 804.51 & 48.06 & 0.00 & 1.00 \\ + NSubspecies + rand & 793.64 & 37.19 & 0.00 & 1.00 \\ + log(RangeSize) & 796.89 & 40.44 & 0.00 & 1.00 \\ + log(RangeSize) + rand & 797.96 & 41.52 & 0.00 & 1.00 \\ + log(Mass) & 804.51 & 48.06 & 0.00 & 1.00 \\ + log(Mass) + rand & 804.78 & 48.33 & 0.00 & 1.00 \\ Intercept only & 806.58 & 50.13 & 0.00 & 1.00 \\ - rand & 807.68 & 51.24 & 0.00 & 1.00 \\ - log(log(Mass)) + 
rand & 816.08 & 59.64 & 0.00 & 1.00 \\ + rand & 807.66 & 51.22 & 0.00 & 1.00 \\ \bottomrule \end{tabular} \endgroup @@ -614,15 +626,15 @@ -% latex table generated in R 3.3.0 by xtable 1.8-2 package -% Sat May 21 15:14:27 2016 +% latex table generated in R 3.3.1 by xtable 1.8-2 package +% Sun Jul 24 23:57:45 2016 \begin{table}[ht] \centering \caption[ - Full model selection results for effective gene flow analysis with alternative phylogeny. + Full model selection results for effective gene flow analysis with alternative phylogeny ]{ Model selection results for effective gene flow analysis using phylogeny from \cite{jones2005bats}. - $\bar{\text{AICc}}$ is the mean AICc score across inline{nBoots} resamplings of the null random variable. + $\bar{\text{AICc}}$ is the mean AICc score across 50 resamplings of the null random variable. $\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. $w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set). $\sum w$ is the cumulative sum of the Akaike weights. 
@@ -633,38 +645,38 @@ \toprule Model & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$ \\ \midrule -log(log(Mass)) + log(log(RangeSize)) & 74.45 & 0.00 & 1.00 & 1.00 \\ - log(Gene Flow) & 108.76 & 34.31 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(Mass)) & 111.34 & 36.89 & 0.00 & 1.00 \\ - log(log(Mass)) & 112.93 & 38.48 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(Mass)) & 119.91 & 45.46 & 0.00 & 1.00 \\ - log(log(Scholar)) & 121.81 & 47.36 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(Mass)) & 122.79 & 48.35 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(RangeSize)) & 123.39 & 48.94 & 0.00 & 1.00 \\ - Intercept only & 126.14 & 51.69 & 0.00 & 1.00 \\ - rand & 210.23 & 135.78 & 0.00 & 1.00 \\ - log(log(Mass)) + rand & 221.08 & 146.63 & 0.00 & 1.00 \\ - log(Gene Flow) + rand & 301.65 & 227.21 & 0.00 & 1.00 \\ - log(log(Scholar)) + rand & 307.56 & 233.11 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(RangeSize)) + rand & 309.97 & 235.52 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(Mass)) + rand & 412.19 & 337.75 & 0.00 & 1.00 \\ - log(log(RangeSize)) + rand & 426.62 & 352.17 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(RangeSize)) + rand & 503.58 & 429.13 & 0.00 & 1.00 \\ - log(log(Mass)) + log(log(RangeSize)) + rand & 512.05 & 437.60 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(Mass)) + rand & 536.48 & 462.03 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + rand & 553.89 & 479.44 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(Mass)) + rand & 673.04 & 598.59 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(RangeSize)) + rand & 694.02 & 619.57 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(Mass)) + log(log(RangeSize)) + rand & 819.54 & 745.09 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(Mass)) + log(log(RangeSize)) + rand & 849.53 & 775.08 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(Mass)) + log(log(RangeSize)) + rand & 884.51 & 810.06 & 0.00 & 1.00 \\ - log(log(RangeSize)) & 916.14 & 
841.69 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(RangeSize)) & 916.14 & 841.69 & 0.00 & 1.00 \\ - log(Gene Flow) + log(log(Mass)) + log(log(RangeSize)) & 916.14 & 841.69 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(log(Mass)) + log(log(RangeSize)) & 916.14 & 841.69 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) & 916.14 & 841.69 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(RangeSize)) & 916.14 & 841.69 & 0.00 & 1.00 \\ - log(log(Scholar)) + log(Gene Flow) + log(log(Mass)) + log(log(RangeSize)) & 916.14 & 841.69 & 0.00 & 1.00 \\ +log(Mass) + log(RangeSize) & 106.05 & 0.00 & 1.00 & 1.00 \\ + log(Scholar) + log(Mass) + rand & 119.34 & 13.30 & 0.00 & 1.00 \\ + Gene Flow + log(Mass) & 120.15 & 14.11 & 0.00 & 1.00 \\ + log(Mass) + rand & 122.83 & 16.78 & 0.00 & 1.00 \\ + log(Mass) & 123.09 & 17.04 & 0.00 & 1.00 \\ + log(Scholar) + log(Mass) & 124.22 & 18.18 & 0.00 & 1.00 \\ + Gene Flow + log(Mass) + log(RangeSize) + rand & 124.52 & 18.48 & 0.00 & 1.00 \\ + log(Mass) + log(RangeSize) + rand & 124.68 & 18.64 & 0.00 & 1.00 \\ + rand & 125.42 & 19.37 & 0.00 & 1.00 \\ + log(Scholar) + log(Mass) + log(RangeSize) + rand & 125.69 & 19.64 & 0.00 & 1.00 \\ + log(Scholar) + rand & 126.04 & 19.99 & 0.00 & 1.00 \\ + Gene Flow + log(Mass) + rand & 126.52 & 20.48 & 0.00 & 1.00 \\ + log(Scholar) & 126.56 & 20.52 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(Mass) + rand & 126.90 & 20.86 & 0.00 & 1.00 \\ + log(Scholar) + log(RangeSize) + rand & 127.83 & 21.78 & 0.00 & 1.00 \\ + log(Scholar) + log(RangeSize) & 127.96 & 21.91 & 0.00 & 1.00 \\ + log(Scholar) + log(Mass) + log(RangeSize) & 128.01 & 21.96 & 0.00 & 1.00 \\ + Gene Flow + log(RangeSize) + rand & 128.07 & 22.02 & 0.00 & 1.00 \\ + Gene Flow + rand & 128.37 & 22.33 & 0.00 & 1.00 \\ + log(RangeSize) + rand & 129.02 & 22.98 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow & 129.09 & 23.04 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(Mass) & 129.18 & 23.13 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + 
rand & 130.46 & 24.42 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(Mass) + log(RangeSize) & 130.80 & 24.76 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(RangeSize) & 130.81 & 24.76 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(RangeSize) + rand & 130.89 & 24.84 & 0.00 & 1.00 \\ + log(RangeSize) & 131.22 & 25.17 & 0.00 & 1.00 \\ + Gene Flow + log(Mass) + log(RangeSize) & 131.85 & 25.80 & 0.00 & 1.00 \\ + log(Scholar) + Gene Flow + log(Mass) + log(RangeSize) + rand & 132.97 & 26.93 & 0.00 & 1.00 \\ + Gene Flow + log(RangeSize) & 133.17 & 27.12 & 0.00 & 1.00 \\ + Gene Flow & 135.91 & 29.86 & 0.00 & 1.00 \\ + Intercept only & 136.23 & 30.18 & 0.00 & 1.00 \\ \bottomrule \end{tabular} \endgroup @@ -681,7 +693,7 @@ \begin{table}[t] \centering \caption[Estimated variable weights and coefficients using alternative phylogeny]{ -Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses using phylogeny from \cite{jones2005bats}. . +Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses using phylogeny from \cite{jones2005bats}. The coefficients for the number of subspecies variable are also separated for models with and without the interaction term because this term strongly changes the coefficient and because the coefficient can only be usefully interpreted when estimated without the interaction. However, there are no weights for these separated terms as they are not directly compared in the model selection framework. 
} @@ -696,15 +708,15 @@ \hspace{3mm}Models without interaction term && 0.5 &&\\ \hspace{3mm}Models with interaction term && 0.38 &&\\ Number of subspecies*log(Scholar) & 0.78 & 0.50 && \\[2.5mm] -Gene flow & & & \ensuremath{1.1\times 10^{-3}} & \ensuremath{-0.9}\\[2.5mm] +Gene flow & & & 0.00 & \ensuremath{-0.9}\\[2.5mm] log(Scholar) & 1.00 & 1.01 & - \ensuremath{1.66\times 10^{-3}} & 3.17\\ + 0.00 & 3.17\\ log(Mass) & 0.62 & 0.47 & - 1 & \ensuremath{-0.4}\\ + 1.00 & \ensuremath{-0.4}\\ log(Range size) & 0.45 & 0.33& - 1 & 3.9\\ -Random & 0.29 & \ensuremath{-4.77\times 10^{-3}}& - \ensuremath{2\times 10^{-3}} & 0.2\\ + 1.00 & 3.9\\ +Random & 0.29 & -0.00& + 0.00 & 0.2\\ \bottomrule \end{tabular} diff --git a/Appendix4.tex b/Appendix4.tex new file mode 100644 index 0000000..fe540b1 --- /dev/null +++ b/Appendix4.tex @@ -0,0 +1,151 @@ +%--------------------------------------------------------------------------------------------------------------------------------% +% Code for "Appendix C: A mechanistic model to compare the importance of interrelated population measures: population size, population density and colony size" +% Appendix for Chapter 4 of thesis "The role of population structure and size in determining bat pathogen richness" +% by Tim CD Lucas +% +%---------------------------------------------------------------------------------------------------------------------------------% + + + + + + + + + + + + + + + +% ----------------------------------------------- % +% Print tables +% ----------------------------------------------- % + + + +% latex table generated in R 3.3.0 by xtable 1.8-2 package +% Sat Jun 11 16:59:36 2016 +\begin{table}[ht] +\centering +\caption[ +Raw data for range size simulations + ]{ +Raw data for range size simulations. +The population parameters are shown along with the number of invasions and the number of simulations. 
+Note that simulations where both pathogens went extinct have been removed (100 simulations were originally run for each parameter set). +$\beta$ is the transmission rate, $n$ is colony size, $m$ is the number of colonies and $N$ is the total population size. +} +\label{C-pop} +\begingroup\small +\begin{tabular}{@{}rrrrrrrr@{}} + \toprule +$\beta$ & $n$ & $m$ & Area \tiny{($\times 1000$ km$^2$)} & $N$ \tiny{($\times 1000$)} & Density \tiny{(km$^{-2}$)} & Invasions & Sims \\ + \midrule +0.1 & 400 & 20 & 2.5 & 8 & 3.2 & 2 & 100 \\ + 0.1 & 400 & 20 & 5.0 & 8 & 1.6 & 3 & 100 \\ + 0.1 & 400 & 20 & 10.0 & 8 & 0.8 & 2 & 100 \\ + 0.1 & 400 & 20 & 20.0 & 8 & 0.4 & 3 & 100 \\ + 0.1 & 400 & 20 & 40.0 & 8 & 0.2 & 2 & 100 \\ + 0.2 & 400 & 20 & 2.5 & 8 & 3.2 & 3 & 100 \\ + 0.2 & 400 & 20 & 5.0 & 8 & 1.6 & 3 & 100 \\ + 0.2 & 400 & 20 & 10.0 & 8 & 0.8 & 1 & 100 \\ + 0.2 & 400 & 20 & 20.0 & 8 & 0.4 & 4 & 100 \\ + 0.2 & 400 & 20 & 40.0 & 8 & 0.2 & 1 & 100 \\ + 0.3 & 400 & 20 & 2.5 & 8 & 3.2 & 3 & 100 \\ + 0.3 & 400 & 20 & 5.0 & 8 & 1.6 & 3 & 100 \\ + 0.3 & 400 & 20 & 10.0 & 8 & 0.8 & 3 & 100 \\ + 0.3 & 400 & 20 & 20.0 & 8 & 0.4 & 5 & 100 \\ + 0.3 & 400 & 20 & 40.0 & 8 & 0.2 & 9 & 100 \\ + \bottomrule +\end{tabular} +\endgroup +\end{table} + + + +% latex table generated in R 3.3.0 by xtable 1.8-2 package +% Sat Jun 11 16:59:36 2016 +\begin{table}[ht] +\centering +\caption[ +Raw data for colony size simulations + ]{ +Raw data for colony size simulations. +The population parameters are shown along with the number of invasions and the number of simulations. +Note that simulations where both pathogens went extinct have been removed (100 simulations were originally run for each parameter set). +$\beta$ is the transmission rate, $n$ is colony size, $m$ is the number of colonies and $N$ is the total population size. 
+} +\label{C-dens1} +\begingroup\small +\begin{tabular}{@{}rrrrrrrr@{}} + \toprule +$\beta$ & $n$ & $m$ & Area \tiny{($\times 1000$ km$^2$)} & $N$ \tiny{($\times 1000$)} & Density \tiny{(km$^{-2}$)} & Invasions & Sims \\ + \midrule +0.1 & 100 & 20 & 2.5 & 2 & 0.8 & 4 & 88 \\ + 0.1 & 200 & 20 & 5.0 & 4 & 0.8 & 5 & 100 \\ + 0.1 & 400 & 20 & 10.0 & 8 & 0.8 & 2 & 100 \\ + 0.1 & 800 & 20 & 20.0 & 16 & 0.8 & 0 & 100 \\ + 0.1 & 1600 & 20 & 40.0 & 32 & 0.8 & 55 & 100 \\ + 0.2 & 100 & 20 & 2.5 & 2 & 0.8 & 3 & 92 \\ + 0.2 & 200 & 20 & 5.0 & 4 & 0.8 & 6 & 100 \\ + 0.2 & 400 & 20 & 10.0 & 8 & 0.8 & 0 & 100 \\ + 0.2 & 800 & 20 & 20.0 & 16 & 0.8 & 39 & 100 \\ + 0.2 & 1600 & 20 & 40.0 & 32 & 0.8 & 95 & 100 \\ + 0.3 & 100 & 20 & 2.5 & 2 & 0.8 & 1 & 91 \\ + 0.3 & 200 & 20 & 5.0 & 4 & 0.8 & 4 & 100 \\ + 0.3 & 400 & 20 & 10.0 & 8 & 0.8 & 7 & 100 \\ + 0.3 & 800 & 20 & 20.0 & 16 & 0.8 & 67 & 100 \\ + 0.3 & 1600 & 20 & 40.0 & 32 & 0.8 & 100 & 100 \\ + \bottomrule +\end{tabular} +\endgroup +\end{table} + + + + + +% latex table generated in R 3.3.0 by xtable 1.8-2 package +% Sat Jun 11 16:59:36 2016 +\begin{table}[ht] +\centering +\caption[ +Raw data for number of colonies simulations + ]{ +Raw data for number of colonies simulations. +The population parameters are shown along with the number of invasions and the number of simulations. +Note that simulations where both pathogens went extinct have been removed (100 simulations were originally run for each parameter set). +$\beta$ is the transmission rate, $n$ is colony size, $m$ is the number of colonies and $N$ is the total population size. 
+} +\label{C-dens2} +\begingroup\small +\begin{tabular}{@{}rrrrrrrr@{}} + \toprule +$\beta$ & $n$ & $m$ & Area \tiny{($\times 1000$ km$^2$)} & $N$ \tiny{($\times 1000$)} & Density \tiny{(km$^{-2}$)} & Invasions & Sims \\ + \midrule +0.1 & 400 & 5 & 2.5 & 2 & 0.8 & 0 & 97 \\ + 0.1 & 400 & 10 & 5.0 & 4 & 0.8 & 0 & 100 \\ + 0.1 & 400 & 20 & 10.0 & 8 & 0.8 & 2 & 100 \\ + 0.1 & 400 & 40 & 20.0 & 16 & 0.8 & 2 & 100 \\ + 0.1 & 400 & 80 & 40.0 & 32 & 0.8 & 7 & 100 \\ + 0.2 & 400 & 5 & 2.5 & 2 & 0.8 & 2 & 99 \\ + 0.2 & 400 & 10 & 5.0 & 4 & 0.8 & 1 & 100 \\ + 0.2 & 400 & 20 & 10.0 & 8 & 0.8 & 0 & 100 \\ + 0.2 & 400 & 40 & 20.0 & 16 & 0.8 & 3 & 100 \\ + 0.2 & 400 & 80 & 40.0 & 32 & 0.8 & 11 & 100 \\ + 0.3 & 400 & 5 & 2.5 & 2 & 0.8 & 1 & 96 \\ + 0.3 & 400 & 10 & 5.0 & 4 & 0.8 & 2 & 100 \\ + 0.3 & 400 & 20 & 10.0 & 8 & 0.8 & 7 & 100 \\ + 0.3 & 400 & 40 & 20.0 & 16 & 0.8 & 15 & 100 \\ + 0.3 & 400 & 80 & 40.0 & 32 & 0.8 & 17 & 100 \\ + \bottomrule +\end{tabular} +\endgroup +\end{table} + + + + + diff --git a/Discussion.tex b/Discussion.tex index bc39005..129bde6 100644 --- a/Discussion.tex +++ b/Discussion.tex @@ -19,7 +19,7 @@ \section{Overview} However, I found the opposite relationship to Chapter~\ref{ch:empirical}; I found that decreasing host population structure increased the rate of pathogen invasion. In Chapter~\ref{ch:sims2} I used the same model as Chapter~\ref{ch:sims1} to test whether host population size or density more strongly promoted pathogen invasion and establishment and whether a pathogen invaded more easily into a population comprising many small colonies or fewer big colonies. I found that population size had a much stronger effect than density on the probability of pathogen invasion and that colony size had a much stronger effect than the number of colonies. 
-Theory \cite{may1979population, anderson1979population}, previous literature \cite{kamiya2014determines, nunn2003comparative, morand1998density} and Chapters \ref{ch:sims1} and \ref{ch:sims2} suggested that population size (either local group size or global population size) strongly influences the dynamics of disease and pathogen richness. +Theory \cite{may1979population, anderson1979population}, previous literature \cite{kamiya2014determines, nunn2003comparative, morand1998density} and Chapters~\ref{ch:sims1} and \ref{ch:sims2} suggested that population size (either local group size or global population size) strongly influences the dynamics of disease and pathogen richness. However, this variable was not included in the empirical study in Chapter~\ref{ch:empirical} as there are very few estimates of population size for bats and colony counts are time-consuming and costly \cite{kloepper2016estimating}. However, data from acoustic camera trap surveys are increasingly available \cite{jones2011indicator}. To make the estimation of population sizes easier, for bats and other acoustically detectable animals, I developed a general method for estimating population size and density from acoustic detectors or camera traps (Chapter~\ref{ch:grem}). @@ -31,7 +31,7 @@ \section{Comparison to the literature} However, my results imply a more nuanced relationship. In Chapter~\ref{ch:sims1} I found that reduced global population structure promoted the invasion of new pathogen species. In Chapter~\ref{ch:sims2}, I found that while global host population density --- which affected population structure --- had an effect on invasion rate, group size had a much stronger effect. -In contrast, in Chapter \ref{ch:empirical}, I found the opposite relationship; that in wild bat populations, increased host population structure promotes pathogen richness.
+In contrast, in Chapter~\ref{ch:empirical}, I found the opposite relationship, that in wild bat populations, increased host population structure promotes pathogen richness. One interpretation of this is that there are two distinct phases to pathogen competition. When a new pathogen first enters a population, many contacts (i.e. a highly connected population) allow the pathogen to spread and avoid stochastic extinction. However, after this initial spread, host population structure may enable the pathogen to persist for longer. @@ -53,10 +53,10 @@ \section{Comparison to the literature} \section{Other mechanisms controlling pathogen richness} Colony size has been found to have both a negative relationship \cite{gay2014parasite} and no relationship \cite{turmelle2009correlates} with parasite richness in previous comparative studies using relatively small data sets. -However, in Chapter \ref{ch:sims2} I found that colony size is particularly important for promoting pathogen richness. -I did not include colony size in my comparative analysis (Chapter \ref{ch:empirical}) for three reasons. +However, in Chapter~\ref{ch:sims2} I found that colony size is particularly important for promoting pathogen richness. +I did not include colony size in my comparative analysis (Chapter~\ref{ch:empirical}) for three reasons. Firstly, the focus of the chapter was broad-scale population structure. -Secondly, there was a lack of previously published, strong evidence of a relationship between colony size and pathogen richness.\cite{turmelle2009correlates}. +Secondly, there was a lack of previously published, strong evidence of a relationship between colony size and pathogen richness \cite{turmelle2009correlates}. Finally, there is a considerable lack of data on colony size and I was aiming for a large sample size. However, given the results of Chapter~\ref{ch:sims2}, filling these data gaps would be a useful avenue for further research.
In particular, testing the relative effects of population density and colony size would be a useful test of the model used in Chapter~\ref{ch:sims2}. @@ -109,8 +109,8 @@ \section{Predictive modelling} \section{Bat social structure} It is important to note that I have ignored much of the social complexity found in bats. -Information on these other social behaviours was not explicitly included in the empirical study in Chapter \ref{ch:empirical}. -Furthermore, in Chapters \ref{ch:sims1} and \ref{ch:sims2} I have modelled bat populations as a metapopulation where the only social structure is the grouping of individuals into subpopulations. +Information on these other social behaviours was not explicitly included in the empirical study in Chapter~\ref{ch:empirical}. +Furthermore, in Chapters~\ref{ch:sims1} and \ref{ch:sims2} I have modelled bat populations as a metapopulation where the only social structure is the grouping of individuals into subpopulations. There is dispersal between these subpopulations but otherwise they are static. Firstly, I have not modelled the creation of new colonies, or the disbanding of colonies \cite{metheny2008genetic}. Especially in the face of habitat destruction, it is likely that the number of colonies of a species will be decreasing. @@ -168,8 +168,8 @@ \section{Conclusions} %\end{itemize} Overall my studies suggest that population size and structure have an important role in controlling pathogen richness. -However, my two studies on this topic give contradictory results and so the exact mechanisms by which these effects occur are still not clear. -I have found that population size and colony size is particularly important for controlling pathogen richness in the case of closely related, strongly competing pathogens. +However, my two studies on population structure give contradictory results and so the exact mechanisms by which these effects occur are still not clear. 
+I have found that population size and colony size are particularly important for controlling pathogen richness in the case of closely related, strongly competing pathogens. I have also provided a tool to facilitate the estimation of population sizes in echolocating bats and other mammals. diff --git a/Introduction.tex b/Introduction.tex index fa5afb7..5db2bd1 100644 --- a/Introduction.tex +++ b/Introduction.tex @@ -64,7 +64,7 @@ \subsection{Single-pathogen models} The roles of population size and density in the dynamics of single pathogens are also well established \cite{may1979population, anderson1979population, heesterbeek2002brief, lloyd2005should}. Broadly, larger populations can maintain a disease more easily by having a larger pool of susceptible individuals (individuals without acquired immunity) and having a greater number of new susceptible individuals enter the population by birth or immigration \cite{may1979population, anderson1979population}. High density populations are expected to have a greater number of contacts between individuals and so promote the spread of a pathogen. -However, there is much discussion about if and when the number of contacts might scale independently of density \cite{mccallum2001should}. +However, there is much discussion about if, and when, the number of contacts might scale independently of density \cite{mccallum2001should}. 
 \subsection{Multi-pathogen models} % multi path is important diff --git a/LinksAndMetadata.tex b/LinksAndMetadata.tex index c7fb361..3a49eb4 100644 --- a/LinksAndMetadata.tex +++ b/LinksAndMetadata.tex @@ -12,4 +12,6 @@ \makeatletter \makeatother - + +\PassOptionsToPackage{hyphens}{url}\usepackage{hyperref} + diff --git a/Preamble.tex b/Preamble.tex index d5d245b..a2cd301 100644 --- a/Preamble.tex +++ b/Preamble.tex @@ -6,7 +6,7 @@ \lettr{P}athogens acquired from animals make up the majority of emerging human diseases, are often highly virulent and can have large effects on public health and economic development. Identifying species with high pathogen species richness enables efficient sampling and monitoring of potentially dangerous pathogens. I examine the role of host population structure and size in maintaining pathogen species richness in an important reservoir host for zoonotic viruses, bats (Order, Chiroptera). -Firstly I test whether population structure is associated with high viral richness across bat species within a comparative, phylogenetic analysis. +Firstly I test whether population structure is associated with high viral richness across bat species with a comparative, phylogenetic analysis. I find evidence that bat species with more structured populations have more virus species. As this type of study cannot distinguish between specific mechanisms, I then formulate epidemiological models to test whether more structured host populations may allow invading pathogens to avoid competition. However, these models show that increasing population structure decreases the rate of pathogen invasion. @@ -34,7 +34,7 @@ \tmpsection{Kat + dylan, mum and dad} Firstly and most importantly I would like to thank my wife, Katrina, for helping me beyond measure throughout my PhD. -Although he has contributed very little towards my thesis, I would also like to thank my son, Dylan, for making my life tiring and brilliant for the last two years. 
+Secondly, although he has contributed very little towards my thesis, I would also like to thank my son, Dylan, for making my life tiring and brilliant for the last two years. I would also like to thank my parents for the endless support they have given me both during and before my studies. diff --git a/epilit.bib b/epilit.bib index a57e6eb..44be9a8 100644 --- a/epilit.bib +++ b/epilit.bib @@ -12182,6 +12182,22 @@ @Manual{knitr Url = {http://CRAN.R-project.org/package=knitr} } + +% to read +@article{han2016undiscovered, + author = {Han, Barbara A. AND Schmidt, John Paul AND Alexander, Laura W. AND Bowden, Sarah E. AND Hayman, David T. S. AND Drake, John M.}, + journal = {PLoS Negl Trop Dis}, + publisher = {Public Library of Science}, + title = {Undiscovered Bat Hosts of Filoviruses}, + year = {2016}, + month = {07}, + volume = {10}, + %url = {http://dx.doi.org/10.1371%2Fjournal.pntd.0004815}, + pages = {1-10}, + number = {7}, + doi = {10.1371/journal.pntd.0004815} +} + @Article{yack2013passive, Title = {{P}assive acoustic monitoring using a towed hydrophone array results in identification of a previously unknown beaked whale habitat}, Author = {Yack, Tina M and Barlow, Jay and Calambokidis, John and Southall, Brandon and Coates, Shannon}, diff --git a/lucas_et_al_supplementarymaterial_2015-01-20.tex b/lucas_et_al_supplementarymaterial_2015-01-20.tex index 09b0e67..5d6915c 100644 --- a/lucas_et_al_supplementarymaterial_2015-01-20.tex +++ b/lucas_et_al_supplementarymaterial_2015-01-20.tex @@ -33,11 +33,11 @@ \section{Table of symbols} \clearpage -\section{Supplementary Methods} +\section{Supplementary methods} \subsection{Introduction} \lettr{T}hese supplementary methods derive all the models used. For continuity, the gas model derivation is included here as well as in the main text. -The calculation of all integrals used in the gREM is included in the Python script S3. +The calculation of all integrals used in the gREM is included in the \emph{Python} script S3. 
 \input{REM-methods.tex} @@ -48,7 +48,7 @@ \subsection{Introduction} \clearpage -\section{Supplementary Information: Simulation model results of the gREM precision} +\section{Supplementary information: Simulation model results of the gREM precision} \setcounter{figure}{0} @@ -66,7 +66,7 @@ \section{Supplementary Information: Simulation model results of the gREM precisi \clearpage -\section{Supplementary Information: Impact of parameter error} +\section{Supplementary information: Impact of parameter error} @@ -74,22 +74,22 @@ \section{Supplementary Information: Impact of parameter error} \begin{figure}[h!] \centering { - \subfloat[label{f:signal}]{ + \subfloat[\label{f:signal}]{ \includegraphics[width=0.4\textwidth]{imgs/AverageModelBias_callerror.pdf} } - \subfloat[label{f:sensor}]{ + \subfloat[\label{f:sensor}]{ \includegraphics[width=0.4\textwidth]{imgs/AverageModelBias_cameraerror.pdf} } - \subfloat[label{f:radius}]{ + \subfloat[\label{f:radius}]{ \includegraphics[width=0.4\textwidth]{imgs/AverageModelBias_radiuserror.pdf} }%% - \subfloat[label{f:speed}]{ + \subfloat[\label{f:speed}]{ \includegraphics[width=0.4\textwidth]{imgs/AverageModelBias_speederror.pdf} } }%% \caption[Model sensitivity to error in parameter estimates]{ -Model sensitivity (for all gREM submodels) to error in estimates of a) signal width $\alpha$, b) sensor width $\theta$, c) detection distance $r$ and d) animal movement speed $v$. +Model sensitivity (for all gREM submodels) to error in estimates of A) signal width $\alpha$, B) sensor width $\theta$, C) detection distance $r$ and D) animal movement speed $v$. Estimates are -10\% (red), -1\% (orange), 0\% (grey), +1\% (green) and +10\% (blue) of the true parameter value. The black dashed line indicates zero error in density estimates. The error bars show 95\% confidence intervals across all simulations. 
diff --git a/tim-lucas-thesis.tex b/tim-lucas-thesis.tex index 65f58bc..d6bdf61 100644 --- a/tim-lucas-thesis.tex +++ b/tim-lucas-thesis.tex @@ -125,8 +125,8 @@ \chapter{Understanding how population structure affects pathogen richness in a m \label{ch:sims1} \include{Chapter2} -\chapter{A mechanistic model to compare the importance of interrelated population measures: population size, population density and colony size}{This work was conducted in collaboration with Kate Jones and Hilde Wilkinson-Herbots.} -\chaptermark{A comparison of population size, population density and colony size} +\chapter{A mechanistic model to compare the importance of interrelated population measures: host population size, density and colony size}{This work was conducted in collaboration with Kate Jones and Hilde Wilkinson-Herbots.} +\chaptermark{A comparison of host population size, density and colony size} \label{ch:sims2} \include{Chapter4} @@ -135,8 +135,8 @@ \chapter{A generalised random encounter model for estimating animal density with This work was conducted in collaboration with Elizabeth Moorcroft, Robin Freeman, Marcus Rowcliffe and Kate Jones and is now published in Methods in Ecology and Evolution \cite{lucas2015generalised}. The text here is almost completely reproduced from \textcite{lucas2015generalised}. I formulated and analysed the analytical model. -Elizabeth Moorcroft wrote to code for and carried out the simulations. -I led the writing of the manuscript with contributions fro the other coauthors. +Elizabeth Moorcroft wrote the code for and carried out the simulations. +I led the writing of the manuscript with contributions from the other coauthors. } \chaptermark{A generalised random encounter model for estimating animal density} \label{ch:grem}