Commit

paper edits, removing lost of materials to shift to sup materials
vsbuffalo committed Mar 8, 2012
1 parent 133fc91 commit fcc06c1
Showing 2 changed files with 78 additions and 40 deletions.
84 changes: 44 additions & 40 deletions paper/scythe-paper.org
@@ -4,21 +4,41 @@
#+date:
#+babel: :results output :exports both :session :comments org


* 3'-end Contamination
#+begin_abstract
*Motivation:* Modern sequencing technologies can leave artifactual
contaminant sequences at the 3'-end of reads. 3'-end regions also
have the lowest-quality bases, which are likely to be called
incorrectly; this makes identifying and removing 3'-end contaminants
difficult. Fixed-number mismatch approaches to removing contaminants
can fail in these low-quality regions. Failing to remove such
contaminants can seriously confound downstream analyses like assembly
and mapping.


*Results:* Scythe is a program designed specifically to remove 3'-end
contaminants. It searches for 3'-end contaminants and uses a Bayesian
model that considers individual base qualities to decide whether a
given match is a contaminant or background sequence. Even for a
variety of prior contamination rates, Scythe outperforms other
adapter removal software tools.

*Availability:* Scythe is freely available under the MIT license at
https://github.com/vsbuffalo/scythe.
#+end_abstract

* Introduction


Scythe focuses on 3'-end contaminants, specifically those due to
adapters or barcodes. It embraces the Unix Philosophy of "programs
that do one thing well"
(http://www.faqs.org/docs/artu/ch01s06.html). Many second-generation
sequencing technologies such as Illumina's Genome Analyzer II and
HiSeq have lower-quality 3'-end bases. These low-quality bases are
more likely to have nucleotides called incorrectly, making contaminant
identification more difficult. Furthermore, 3'-end quality
deterioration is not uniform across all reads (see figure 1 in the
Supplementary Materials); there is variation in the quality per base.


A common step in read quality improvement procedures is to remove
these low-quality 3'-end sequences from reads. This is thought to
@@ -33,40 +53,24 @@ can be incorporated into classification procedures.
Fixed-number mismatch approaches have the disadvantage that they
don't weight a mismatch on a low-quality base differently from a
mismatch on a high-quality base. Furthermore, the fixed number could
easily be exhausted in a run of bad bases (which are quite common in
the 3'-end), even though every good-quality base perfectly matches the
contaminant sequence.
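
The failure mode described above can be illustrated with a minimal Python sketch. This is not Scythe's implementation: the function names, the limit of two mismatches, the quality cutoff of 20, and the example sequences and qualities are all assumptions chosen for illustration.

```python
def mismatch_positions(read_tail, contaminant):
    """Return the indices where the read tail and the contaminant disagree."""
    return [i for i, (r, c) in enumerate(zip(read_tail, contaminant)) if r != c]


def fixed_mismatch_match(read_tail, contaminant, max_mismatches=2):
    """Fixed-number rule: accept only if at most max_mismatches bases differ.

    The default of 2 is a hypothetical threshold, not a value from the paper.
    """
    return len(mismatch_positions(read_tail, contaminant)) <= max_mismatches


def mismatches_on_good_bases(read_tail, contaminant, quals, min_qual=20):
    """Count only mismatches on bases with Phred quality >= min_qual.

    The cutoff of 20 is a hypothetical choice used to separate "good"
    bases from "bad" ones in this example.
    """
    return sum(1 for i in mismatch_positions(read_tail, contaminant)
               if quals[i] >= min_qual)


# Hypothetical data: the read tail matches the contaminant on every
# high-quality base, but four low-quality bases near the end were miscalled.
contaminant = "AGATCGGAAGAGC"
read_tail = "AGATCGGATCTTC"
quals = [38, 37, 36, 35, 34, 30, 28, 25, 2, 2, 2, 2, 2]

print(mismatch_positions(read_tail, contaminant))
print(fixed_mismatch_match(read_tail, contaminant))
print(mismatches_on_good_bases(read_tail, contaminant, quals))
```

In this example the fixed-number rule rejects the match because four bases differ, even though none of the mismatches falls on a base with quality of 20 or more.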




* Scythe's Methods


Scythe uses Bayesian methods to identify and remove a given set of
3'-end contaminants (most often sequencing adapters). Scythe only
checks for the adapters in the 3'-end; contaminants further towards
the middle and 5'-ends often have high-quality bases, so identifying
and removing them is much simpler. Scythe makes some assumptions: see
the Supplementary Materials. TODOREF




** String matching in Scythe


For each adapter in the file, Scythe looks for the best match in
terms of scores. A nucleotide match is scored as a 1, and a mismatch
is scored as a -1. Because Scythe doesn't address contaminants with
insertions or deletions, it doesn't use a standard alignment strategy
34 changes: 34 additions & 0 deletions paper/supplementary-materials.org
@@ -0,0 +1,34 @@
#+title: Scythe Supplementary Materials
#+author: Vince Buffalo
#+email: vsbuffalo@ucdavis.edu
#+date:
#+babel: :results output :exports both :session :comments org

* 3'-end Quality

#+caption: 3'-end quality deterioration.
#+label: fig:qual_plot
#+attr_latex: width=12cm
[[./qual_plot.png]]


* Method

** Assumptions

1. All contaminant sequences are known /a priori/ and are reliable.
2. A contaminant sequence with length $l_c$ in a read of length $l_r$
will only be found between $l_r - l_c$ and $l_r$.
3. If the read is contaminated, the number of bases contaminating the
read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
the 5'-end of the contaminant sequence. While this seems limiting,
it is the mode by which 3'-end adapter and barcode contamination
occurs.[fn:: We have encountered Illumina data in which adapters
contaminate the read and are present past their length in the
3'-end. The sequence after the adapter was all poly-A
sequence. These extreme cases can be removed by appending poly-A
sequence to the end of the adapters in the adapters file.]
4. Contaminants will only have mismatching bases; no contaminants with
insertions, deletions, or transpositions will be addressed by Scythe.
5. The quality line of the FASTQ file accurately estimates the
probability of bases being called correctly.
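
Assumptions 2, 3 and 5 can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not Scythe's code: the function names are hypothetical, the quality offset of 33 (Sanger-style encoding) is assumed, and the +1/-1 scoring follows the description in the paper's string matching section. The sketch only enumerates and scores candidate overlaps; it does not include the Bayesian classification step.

```python
def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality_line(qual_line, offset=33):
    """Decode a FASTQ quality line into Phred scores.

    offset=33 is an assumption; some older Illumina pipelines use 64.
    """
    return [ord(ch) - offset for ch in qual_line]


def candidate_overlaps(read, contaminant):
    """Yield (read_start, n_c) for every overlap allowed by assumptions 2-3.

    The first n_c bases of the contaminant (1 <= n_c <= l_c) are aligned
    to the last n_c bases of the read, so the overlap starts at l_r - n_c.
    """
    l_r, l_c = len(read), len(contaminant)
    for n_c in range(1, min(l_c, l_r) + 1):
        yield l_r - n_c, n_c


def overlap_score(read, contaminant, start, n_c):
    """Score +1 per matching base and -1 per mismatch; no indels (assumption 4)."""
    return sum(1 if read[start + i] == contaminant[i] else -1
               for i in range(n_c))


def best_overlap(read, contaminant):
    """Return (score, read_start, n_c) for the highest-scoring overlap.

    Ties are broken by tuple comparison, an arbitrary choice for this sketch.
    """
    return max((overlap_score(read, contaminant, s, n), s, n)
               for s, n in candidate_overlaps(read, contaminant))


# Hypothetical read whose last 7 bases are the start of the contaminant.
read = "ACGTACGTAGATCGG"
contaminant = "AGATCGGAAGAGC"
print(best_overlap(read, contaminant))
print([phred_to_error_prob(q) for q in decode_quality_line("I5#")])
```

Because only overlaps anchored at the read's 3'-end are considered, the number of candidates is at most $l_c$ per contaminant, which keeps the search simple compared with a general alignment.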
