-
Notifications
You must be signed in to change notification settings - Fork 31
/
bootscanning.tex
39 lines (35 loc) · 2.1 KB
/
bootscanning.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
\section{Finding Recombination Breakpoints}
\label{sct_bootscanning}
Bootscanning is a technique for finding recombination breakpoints in a
sequence. The original procedure \cite{Salminen_1995} uses the bootstrap value
of competing topologies as a criterion, the procedure shown below uses the
patristic distance (\ie, distance along the tree) according to the most likely
tree, and is thus not strictly speaking a bootscan but a distance
plot.\footnote{Thanks to Heiko Schmidt for pointing this out.} It involves
aligning the sequence of interest (called \emph{query} or \emph{reference})
with related sequences (including the putative recombinant's parents) and
computing phylogenies locally over the alignment. Recombination is expected to
cause changes in topology. The tasks involved are shown below:
\begin{enumerate}
\item align the sequences $\rightarrow$ multiple alignment
\item slice the multiple alignment $\rightarrow$ slices
\item build a tree for each slice $\rightarrow$ trees
\item extract distance from query to other sequences (each tree) $\rightarrow$ tabular data
\item plot data $\rightarrow$ graphics
\end{enumerate}
The distribution contains a script, \texttt{src/bootscan.sh}, that performs the whole process. Here is an example run:
\begin{verbatim}
$ bootscan.sh HRV_3UTR.dna HRV-93 CL073908
\end{verbatim}
where \texttt{HRV\_3UTR.dna} is a FastA file of (unaligned) sequences, \texttt{HRV-93} is the outgroup, and \texttt{CL073908} is the query. Here is the result:
\begin{centering}
\includegraphics[width=\textwidth]{bootscan_1.pdf}
\end{centering}
\smallskip{}
\noindent{}until position 450 or so, the query sequence's nearest relatives (in
terms of substitutions/site) are \texttt{HRV-36} and \texttt{HRV-89}. After
that point, it is \texttt{HRV-67}. This suggests that there is a recombination
breakpoint near position 450.
The script uses \reroot{} to reroot the trees on the outgroup, \clade{} and
\labels{} to get the labels of the ingroup, \distance{} to extract the distance
between the query and the other sequences, as well as the usual \texttt{sed}, \texttt{grep}, etc. The plot is done with gnuplot.