-
Notifications
You must be signed in to change notification settings - Fork 2
/
results.tex
239 lines (197 loc) · 21.4 KB
/
results.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
\section*{Results}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\arraystretch}{1.1}
\begin{table}[tb]
\begin{center}
\caption[]{$F_{ST}$ of synonymous and noncoding GBS SNPs}
\textbf{}\\[-2mm]
{\fontsize{6}{9}\sf
\begin{tabular}{llccccccl}
\hline
& & \\[-3mm]
& & \multicolumn{2}{c}{Mesoamerica} & \multicolumn{2}{c}{S. America} \\
& & Lowlands & Highlands & Lowlands & Highlands \\
\hline
& & \\[-3mm]
Mesoamerica & Lowlands & -- & & & \\
& Highlands & 0.0244 & -- & & \\
S. America & Lowlands & 0.0227 & 0.0343 & -- & \\
& Highlands & 0.0466 & 0.0534 & 0.0442 & -- \\ [1mm]
\hline
\end{tabular}
\label{FstP} % caption is needed to make this work
}
\end{center}
\end{table}
\renewcommand{\arraystretch}{1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\renewcommand{\arraystretch}{1.1}
\begin{table}[tb]
\begin{center}
\caption[]{Estimated parameters of population size model}
\textbf{}\\[-2mm]
{\fontsize{7}{11}\sf
\begin{tabular}{lcccccccl} \hline
& & \\[-3mm]
Mesoamerica & \multicolumn{2}{c}{Model IA} &\multicolumn{2}{c}{Model IB}\\[0.1cm]
\hline
& & \\[-3mm]
& Likelihood & $-$5592.80 & Likelihood & $-$4654.79 \\
&$N_C$ & 138,000 & $N_C$ & 225,000 \\
&$N_1$ & 52,440 & $N_1$ & 171,000\\
&$N_2$ & 85,560 & $N_2$ & 54,000\\
&$N_{2P}$ & 85,560 & $N_{2P}$ & 54,000\\
\hline
& & \\[-3mm]
S. America & \multicolumn{2}{c}{Model IA} &\multicolumn{2}{c}{Model II}\\[0.1cm]
\hline
& & \\[-3mm]
& Likelihood & $-$3855.28 & Likelihood & $-$8044.71 \\
&$N_C$ & 78,000 & $N_C$ & 150,000 \\
&$N_1$ & 75,660 & $N_1$ & 96,000\\
&$N_2$ & 2,340 & $N_2$ & 54,000\\
&$N_{2P}$ & 205,920 & $N_3$ & 51,300\\
& & & $N_4$ & 2,700\\
& & & $N_{4P}$ & 145,800\\ [1mm]
\hline
% \multicolumn{9}{l}{$^{a}$ The groups were based on phylogenetic analysis in fig.~\ref{tree}\emph{A}}\\
\end{tabular}
\label{param} % caption is needed to make this work
}
\end{center}
\end{table}
\renewcommand{\arraystretch}{1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection*{Samples and data}
We sampled 94 maize landraces from four distinct regions in the Americas (Table \ref{srkid}; Figure \ref{map}): the lowlands of Mesoamerica (Mexico/Guatemala; $n=24$) and northern S. America ($n=23$) and the highlands of Mesoamerica ($n=24$) and the Andes ($n=23$).
Samples were genotyped using the MaizeSNP50 Beadchip platform (``MaizeSNP50''; $n=94$) and genotyping-by-sequencing (``GBS''; $n=87$).
After filtering for Hardy-Weinberg genotype frequencies and minimum sample size at least 10 in each of the four populations (see Materials and Methods) 91,779 SNPs remained, including 67,828 and 23,951 SNPs from GBS and MaizeSNP50 respectively.
\subsection*{Population structure}
We performed a {\sf STRUCTURE} analysis \cite[]{Pritchard_2000_10835412,Falush_2003_12930761} of our landrace samples, varying the number of groups from $K$ = 2 to 6 (Figure~\ref{map}B, Figure~\ref{supp:struct}).
Most landraces were assigned to groups consistent with \emph{a priori} population definitions, but admixture between highland and lowland populations was evident at intermediate elevations ($\sim1700$m). Consistent with previously described scenarios for maize diffusion \cite[]{Piperno_2006_69}, we find evidence of shared ancestry between lowland Mesoamerican maize and both Mesoamerican highland and S. American lowland populations. Pairwise $F_{ST}$ among populations reveals low overall differentiation (Table \ref{FstP}), and the higher $F_{ST}$ values observed in S. America are consistent with the decreased admixture seen in {\sf STRUCTURE}. Archaeological evidence supports a more recent colonization of the highlands in S. America \cite[]{Piperno_2006_69,Perry_2006_16511492,Grobman_2012_22307642}, suggesting that the observed differentiation may be the result of a stronger bottleneck during colonization of the S. American highlands.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIGURE
\begin{figure}[tb]
\begin{center}
\vspace{-0mm}
%\includegraphics[width=0.23\textwidth]{figs/model}
\includegraphics[width=0.5\textwidth]{fig/Fig4}
\renewcommand{\baselinestretch}{0.9}
\vspace{-3mm}
\caption{Observed and expected joint distributions of minor allele frequencies in lowland and highland populations in (A) Mesoamerica and (B) S. America. Residuals are calculated as $(\mbox{model}-\mbox{data})/\sqrt[]{\mbox{model}}$.}
\vspace{-6mm}
\label{JFD}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIGURE
\subsection*{Population differentiation}
To provide a null expectation for allele frequency differentiation, we used the joint site frequency distribution (JFD) of lowland and highland populations to estimate parameters of two demographic models using the maximum likelihood method implemented in $\delta a \delta i$ \cite[]{Gutenkunst_2009_19851460}.
All models incorporate a domestication bottleneck and population differentiation between lowland and highland populations, but differ in their consideration of admixture and ascertainment bias (Figure~\ref{model}; see Materials and Methods for details). {We used published estimates of the strength of the domestication bottleneck \cite[]{Eyre-Walker_1998_9539756,Tenaillon_2004_15014173,Wright_2005_15919994}, but confirmed that changing the strength of the bottleneck had little influence on the null distributions of $F_{ST}$ values (not shown).}
Estimated parameter values are listed in Table~\ref{param}; while the observed and expected JFDs were quite similar for both models, residuals indicated an excess of rare variants in the observed JFDs in all cases (Figure~\ref{JFD}).
Under both models IA and IB, we found expansion in the highland population in Mesoamerica to be unlikely, but a strong bottleneck followed by population expansion is supported in S. American highland maize in both models IA and II.
In Mesoamerica, the likelihood value of model IB was higher than the likelihood of model IA by 850 units of log-likelihood (Table~\ref{param}), consistent with analyses suggesting a significant role for introgression from \textit{mexicana} during the spread of maize into the highlands \cite[]{Profford_2013}.
Comparisons of our empirical $F_{ST}$ values to the null expectation simulated under our demographic models allowed us to identify significantly differentiated SNPs between lowland and highland populations. In all cases, observed $F_{ST}$ values were quite similar to those generated under our null models (Figure~\ref{FstDist}), and model choice %-- including the parameterization of the domestication bottleneck --
had little impact on the distribution of estimated \emph{P}-values (Figure~\ref{fig:qq}).
We show results under Model IB for Mesoamerican populations and Model II for S. American populations.
We chose $P<0.01$ as the cut-off for significant differentiation between lowland and highland populations, and identified 687 SNPs in Mesoamerica (687/76,989=0.89\%) and 409 SNPs in S. America (409/63,160=0.65\%) as significant outliers (Figure~\ref{PvDist}). All results were qualitatively identical with different cutoff values (0.05 or 0.001; data not shown). SNPs with significant $F_{ST}$ $P$-values were enriched in intergenic regions rather than protein coding regions (60.0\% vs. 47.9\%, Fisher's Exact Test $P < 10^{-7}$ for Mesoamerica; 62.0\% vs. 47.8\%, FET $P<10^{-5}$ for S. America).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIGURE
\begin{figure}[tb]
\begin{center}
\vspace{-0mm}
\includegraphics[width=0.4\textwidth]{fig/Fig6}
\renewcommand{\baselinestretch}{0.9}
\vspace{-3mm}
\caption{Scatter plot of $-\log_{10} P$-values of observed $F_{ST}$ values based on simulation from estimated demographic models. $P$-values are shown for each SNP in both Mesoamerica (Model IB; $P_M$ on $x$-axis) and S. America (Model II; $P_S$ on $y$-axis).
Red, blue, orange and gray dots represents SNPs showing significance in both Mesoamerica and S. America, only in Mesoamerica, only in S. America, or in neither region, respectively.
The number of SNPs in each category is shown in the same color as the points.}
\vspace{-6mm}
\label{PvDist}
\end{center}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FIGURE
\subsection*{Patterns of adaptation}
Given the historical spread of maize from an origin in the lowlands, it is tempting to assume that the observation of significant population differentiation at a SNP should be primarily due to an increase in frequency of adaptive alleles in the highlands.
To test this hypothesis, we sought to identify the adaptive allele at each locus using comparisons between Mesoamerica and S. America as well as to \emph{parviglumis} (see Methods).
Consistent with predictions, we infer that differentiation at 72.3\% (264) and 76.7\% (230) of SNPs in Mesoamerica and S. America is due to adaptation in the highlands after excluding SNPs with ambiguous patterns likely due to recombination (Table \ref{supp:phs}).
{As further evidence of selection, we asked whether alleles showing excess differentiation also exhibit longer haplotypes than expected. We calculated the empirical quantile of the pairwise haplotype score from \cite{Toomajian_2006_16623598} for each putatively adaptive SNP as the proportion of all SNPs at a similar frequency with PHS scores greater than or equal to the PHS score observed at the focal SNP (Table~\ref{supp:phs}).
If $F_{ST}$ outliers have indeed been targeted by selection in a particular population, we expect this empirical quantile to be smaller (i.e., fewer random SNPs of similar frequency have as large a PHS score) than in other populations. Indeed, we find that SNPs identified as putatively adaptive in each of the four populations show smaller empirical PHS quantiles more often than the 50\% expected by chance (Table~\ref{supp:phs}).}
%
%We say that the PHS test is consistent if the empirical quantile in highlands is smaller than that in lowlands (haplotype length is longer as the empirical quantile is smaller).
%If we extract random SNPs, the probability that the haplotype length in highlands is longer than that in lowlands is expected to be 50\% by chance.
%However, we rejected the null hypothesis by a Fisher's exact test in all cases (Table~\ref{supp:phs}).}
%The majority of these SNPs show patterns of haplotype variation (by the PHS test) consistent with our inference of selection (Table~\ref{supp:phs} and Supporting Information, File S1). \jri{we could just move the S1 data here}
%Finally, we asked whether the PHS test supports highland and lowland adaptation scenario (summarized in Table~\ref{supp:phs}).
%Consider the case of highland adaptation.
%We assumed that the putative derived allele is adaptive in highlands and checked whether the haplotype length is longer in the highlands than in the lowlands.
%However, haplotype length cannot be compared directly because the derived allele frequency is different between highlands and lowlands.
%Thus, we compared the empirical {quantile} of PHS test as an indicator of haplotype length given allele frequency (Pr($PHS_{xA}\leq PHS_{null|p}$) in Materials and Methods).
%We say that the PHS test is consistent if the empirical quantile in highlands is smaller than that in lowlands (haplotype length is longer as the empirical quantile is smaller).
%If we extract random SNPs, the probability that the haplotype length in highlands is longer than that in lowlands is expected to be 50\% by chance.
%However, we rejected the null hypothesis by a Fisher's exact test in all cases (Table~\ref{supp:phs}).
Convergent evolution at the nucleotide level should be reflected in an excess of SNPs showing significant differentiation between lowland and highland populations in both Mesoamerica and S. America.
Although the 19 SNPs showing $F_{ST}$ \emph{P}-values $<0.01$ in both Mesoamerica ($P_M$) and S. America ($P_S$) is statistically greater than the $\approx 5$ expected ($48,370\times 0.01 \times 0.01 \approx 4.8$; $\chi^2$-test, $P\ll0.001$), it nonetheless represents a small fraction ($\approx 7-8\%$) of all SNPs showing evidence of selection.
This paucity of shared selected SNPs does not appear to be due to our demographic model:
a simple outlier approach based using the 1\% highest $F_{ST}$ values finds no shared adaptive SNPs between Mesoamerican and S. American highland populations.
For 13 of 19 SNPs showing putative evidence of shared selection we could use data from \textit{parviglumis} to infer whether these SNPs were likely selected in lowland or highland conditions (see Methods).
Surprisingly, SNPs identified as shared adaptive variants more frequently showed segregation patterns consistent with lowland (10 SNPs) rather than highland adaptation (2 SNPs). %(1 SNP with an ambiguous pattern).
We also investigated how often different SNPs in the same gene may have been targeted by selection.
To search for this pattern, we considered all SNPs within 10kb of a transcript as part of the same gene, excluding SNPs in an miRNA or second transcript.
We classified SNPs showing significant $F_{ST}$ in Mesoamerica, S. America or in both regions into 778 genes.
Of these, 485 and 277 genes showed Mesoamerica-specific and SA-specific significant SNPs, while 14 genes contained at least one SNP with a pattern of differentiation suggesting convergent evolution and 2 genes contained both Mesoamerica-specific and SA-specific significant SNPs.
Overall, however, fewer genes showed evidence of convergent evolution than expected by chance (permutation test; $P<10^{-5}$).
{
Finally, we tested whether genes showing evidence of selection in both highland populations were enriched for particular metabolic pathways using data on 481 metabolic pathways from the MaizeCyc database \cite[ver. 2.2;][]{maizecyc2013}.
We found 92 pathways that include a selected gene from only one of the highland populations, but no significant excess of shared pathways: only 32 pathways included a selected gene in both populations ($P=0.0961$; Table~\ref{supp:metabo}). Despite similar phenotypes and environments, we thus see little evidence for convergent evolution at the SNP, gene, and metabolic-pathway levels. }
\subsection*{Comparison to theory}
% # \mutrate computation:
% A <- 500; rho <- 5000; sb <- 10^(-(1:4)); xisq <- 30
% sapply( 10^c(-5,-8), function (mu) mu * (2 * rho * A * sb)/xisq )
Given the limited empirical evidence for convergent evolution at the molecular level, we took advantage of recent theoretical efforts \cite[]{ralph2014convergent} to assess the degree of convergence expected under a spatially explicit population genetic model (see Materials and Methods).
Using current estimates of maize cultivation in S. America, we find a $270,200\text{km}^2$ area in which maize is cultivated in $\ge 1\%$ of the land area, for a total area of cultivation of $\approx 600,000 \text{ha}$.
At a planting density of $\rho \approx 20,000$ plants per hectare, this gives a total maize population of $\approx 12$ billion.
Assuming an offspring variance of $\xi^2 = 30$, we can then compute the waiting time $\Tmut=1/\mutrate$ for a new beneficial mutation to appear and fix.
If we assume an average selection coefficient of $s_b=10^{-5}$ for each mutation, a single-base mutation with mutation rate $\mu=3\times 10^{-8}$ \citep{Clark_2005_16079248} would take an expected 4,162 generations to appear and fix.
% code for this:
% percent area p=1%.
% p=0.01;
% From matt's Area_Script on github we then do:
% sub<-subset(STD, alt > 1700 & maizeData > p);
% A=sum(sub$maizeData)*100;
%this gets us area in km^2. density is 20K plants per ha so:
% rho <- 100*20E3;
%range of selection coefficients
% sb <- 10^(-(1:5));
%peter's code:
% xisq <- 30; 1/sapply( 10^c(-8), function (mu) mu * (2 * rho * A * sb)/xisq );
Our estimate of the maize population size uses the land area currently under cultivation and is likely an overestimate; $\Tmut$ scales linearly with the population size and lower estimates of $A$ will thus increase $\Tmut$ proportionally.
However, because $\Tmut$ also scales approximately linearly with both the selection coefficient and the mutation rate, strong selection and the existence of multiple equivalent mutable sites could reduce this time.
For example, if any one of 10 sites within a gene were to have an equivalent selective benefit of $s_b=10^{-4}$, $\Tmut$ would be reduced to 42 generations assuming constant $A$ over time.
% # Tmig computation:
% A <- 500; rho <- 5000; xisq <- 30; sigma <- 3.5
% params <- expand.grid( R=1000*(1:4), sm=10^(-(1:4)) )
% params$char.length <- with(params, 1/(sqrt(2*sm)/sigma) )
% params$Tmig <- with( params, 1 / ( sm * (A * rho / 2) * exp(- sqrt(2*sm)*R/sigma ) ) )
% print(params)
Gene flow between highland regions could also generate patterns of shared adaptive SNPs.
{The coalescent calculations described above suggest that highland area today is unlikely to draw any ancestry from a region more than $6\sigma\sqrt{m}$ kilometers away from $m$ generations ago in any part of the genome that is neutral in the lowlands. Our estimated dispersal of $\sigma=3.5$km thus provides an estimate of 1,328km. The Mesoamerican and Andean highlands are approximately 4,000 km apart, and neutral alleles are therefore not expected to transit between the Mesoamerican and Andean highlands within 4,000 generations.
Changing the typical distance over which farmers share seed by a factor of 10 would change this conclusion,
but data from field surveys do not lend support to such high dispersal distances \citep{bellon2011}.}
{These results for neutral alleles put a lower bound on the time for deleterious alleles to transit as well, suggesting that we should not expect even weakly deleterious alleles (e.g., $s_m=10^{-5}$) to have moved between highlands.
We expect many of the alleles adaptive in the highlands to be deleterious in the lowlands,
and analyze this case in more detail in the Appendix.}
Taken together, these theoretical considerations suggest that any alleles beneficial in the highlands that are neutral or deleterious in the lowlands and shared by both the Mesoamerican and S. American highlands would have been present as standing variation in both populations, rather than passed between them.
\subsection*{Alternative routes of adaptation}
The lack of both empirical and theoretical support for convergent adaptation at SNPs or genes led us to investigate alternative patterns of adaptation.
We first sought to understand whether SNPs showing high differentiation between the lowlands and the highlands arose primarily via new mutations or were selected from standing genetic variation.
We found that putatively adaptive variants identified in both Mesoamerica and S. America tended to segregate in both the lowland population (85.3\% vs. 74.8\% in Mesoamerica (Fisher's exact test $P < 10^{-9}$ and 94.8\% vs 87.4\% in S. America, $P< 10^{-4}$) and \emph{parviglumis} (78.3\% vs. 72.2\% in Mesoamerica
(Fisher's exact test $P < 0.01$ and 80.2\% vs 72.8\% in S. America, $P< 0.01$) more often than other SNPs of similar mean allele frequency .
While maize in highland Mesoamerica grows in sympatry with the highland teosinte \textit{mexicana}, maize in S. America is outside the range of wild \textit{Zea} species, leading to a marked difference in the potential for adaptive introgression from wild relatives.
\citet{Pyhajarvi2013} recently investigated local adaptation in \textit{parviglumis} and \textit{mexicana} populations, characterizing differentiation between these subspecies using an outlier approach.
Genome-wide, only a small proportion (2--7\%) of our putatively adaptive SNPs were identified by \citet{Pyhajarvi2013}, though these numbers are still in excess of expectations (Fisher's exact test $P<10^{-3}$ for S. America and $P<10^{-8}$ for Mesoamerica; Table~\ref{tanja}).
The proportion of putatively adaptive SNPs shared with teosinte was twice as high in Mesoamerica, however,
leading us to evaluate the contribution of introgression from \textit{mexicana} \cite[]{Profford_2013} in patterning differences between S. American and Mesoamerican highlands.
The proportion of putatively adaptive SNPs in introgressed regions of the genome in highland maize in Mesoamerica was nearly four times higher than found in S. America (FET $P<10^{-11}$), while differences outside introgressed regions were much smaller (7.5\% vs. 6.2\%; Table \ref{supp:introgressed}).
Furthermore, of the 77 regions identified as introgressed in \cite{Profford_2013}, more than twice as many contain at least one $F_{ST}$ outlier in Mesoamerica as in S. America (23 compared to 9, one-tailed Z-test $P=0.0027$).
Excluding putatively adaptive SNPs, mean $F_{ST}$ between Mesoamerica and S. America is only slightly higher in introgressed regions (0.032) than across the rest of the genome (0.020), suggesting the enrichment of high $F_{ST}$ SNPs seen in Mesoamerica is not simply due to neutral introgression of a divergent teosinte haplotype.
%\ref{supp:introgressed}
%When focusing on GUs, we identified 99/586 (14.5\%) and 22/466 (4.7\%) GUs of Mesoamerica- and SA-specific significance in introgressed regions (Fisher's exact test, $P<10^{-6}$).