paper edits, removing lots of material to shift to sup materials

1 parent 133fc91 commit fcc06c14d1803dc55d952f4db871758e78beccec @vsbuffalo committed Mar 8, 2012
Showing with 78 additions and 40 deletions.
  1. +44 −40 paper/scythe-paper.org
  2. +34 −0 paper/supplementary-materials.org
@@ -4,21 +4,41 @@
#+date:
#+babel: :results output :exports both :session :comments org
-* 3'-end Contamination
+
+#+begin_abstract
+*Motivation:* Modern sequencing technologies can leave artifactual
+ contaminant sequences at the 3'-end of reads. 3'-end regions also
+ have the lowest-quality bases, which are likely to be called
+ incorrectly; this makes identifying and removing 3'-end contaminants
+ difficult. Fixed-number mismatch approaches to remove contaminants
+ can fail in these low quality regions. Failing to remove such
+ contaminants can seriously confound downstream analyses like assembly
+ and mapping.
+
+
+*Results:* Scythe is a program designed specifically to remove 3'-end
+ contaminants. It searches for 3'-end contaminants and uses a Bayesian
+ model that considers individual base qualities to decide whether a
+ given match is a contaminant or background sequence. Even for a
+ variety of prior contamination rates, Scythe outperforms other
+ adapter removal software tools.
+
+*Availability:* Scythe is freely available under the MIT license at
+ https://github.com/vsbuffalo/scythe.
+#+end_abstract
+
+* Introduction
Scythe focuses on 3'-end contaminants, specifically those due to
-adapters or barcodes. Many second-generation sequencing technologies
-such as Illumina's Genome Analyzer II and HiSeq have lower-quality
-3'-end bases. These low-quality bases are more likely to have
-nucleotides called incorrectly, making contaminant identification more
-difficult. Furthermore, 3'-end quality deterioration is not uniform
-across all reads; as figure \ref{fig:qual_plot} shows, there is
-variation in the quality per base.
-
-#+caption: 3'-end quality deterioration.
-#+label: fig:qual_plot
-#+attr_latex: width=12cm
-[[./qual_plot.png]]
+adapters or barcodes. It embraces the Unix Philosophy of "programs
+that do one thing well"
+(http://www.faqs.org/docs/artu/ch01s06.html). Many second-generation
+sequencing technologies such as Illumina's Genome Analyzer II and
+HiSeq have lower-quality 3'-end bases. These low-quality bases are
+more likely to have nucleotides called incorrectly, making contaminant
+identification more difficult. Furthermore, 3'-end quality
+deterioration is not uniform across all reads (see figure 1 in
+Supplementary Materials); there is variation in the quality per base.
A common step in read quality improvement procedures is to remove
these low-quality 3'-end sequences from reads. This is thought to
@@ -33,40 +53,24 @@ can be incorporated into classification procedures.
Fixed-number of mismatch approaches have the disadvantage that they
don't weight a mismatch on a low-quality base differently from a
mismatch on a high-quality base. Furthermore, the fixed-number could
-easily be exhausted in a run of bad bases, even though every
-good-quality base perfectly matches the contaminant sequence.
+easily be exhausted in a run of bad bases (which are quite common in
+the 3'-end), even though every good-quality base perfectly matches the
+contaminant sequence.
* Scythe's Methods
-Scythe uses Bayesian methods to identify 3'-end contaminants. Scythe
-does not address the problem of contaminants found anywhere in the
-read; contaminants in higher quality regions are easier to identify
-and fixed-number of mismatch techniques are accurate and
-fastest. Scythe only addresses the challenging problem of contaminants
-in low-quality 3'-ends. In doing so, it makes some assumptions:
-
-1. All contaminant sequences are known /a priori/ and are reliable.
-2. A contaminant sequence with length $l_c$ in a read of length $l_r$
- will only be found between $l_r - l_c$ and $l_r$.
-3. If the read is contaminated, the number of bases contaminating the
- read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
- the 5'-end of the contaminant sequence. While this seems limiting,
- it is the mode by which 3'-end adapter and barcode contamination
- occurs.[fn:: We have encountered Illumina data in which adapters
- contaminate the read and are present past their length in the
- 3'-end. The sequence after the adapter was all poly-A
- sequence. These extreme cases can be removed by appending poly-A
- sequence to the end of the adapters in the adapters file.]
-4. Contaminants will only have mismatching bases; no contaminants with
- insertions, deletions, or transpositions will be addressed by Scythe.
-5. The quality line of the FASTQ file accurately estimates the
- probability of bases being called correctly.
+Scythe uses Bayesian methods to identify and remove a given set of
+3'-end contaminants (most often sequencing adapters). Scythe only
+checks for the adapters in the 3'-end; contaminants further towards
+the middle and 5'-ends often have high quality bases, so identifying
+and removing them is much simpler. Scythe makes some assumptions: see
+the Supplementary Materials. TODOREF
+
** String matching in Scythe
-Scythe begins by searching for a length of contaminant in the
-read. For each adapter in the file, Scythe looks for the best match in
+For each adapter in the file, Scythe looks for the best match in
terms of scores. A nucleotide match is scored as a 1, and a mismatch
is scored as a -1. Because Scythe doesn't address contaminants with
insertions or deletions, it doesn't use a standard alignment strategy
@@ -0,0 +1,34 @@
+#+title: Scythe Supplementary Materials
+#+author: Vince Buffalo
+#+email: vsbuffalo@ucdavis.edu
+#+date:
+#+babel: :results output :exports both :session :comments org
+
+* 3'-end Quality
+
+#+caption: 3'-end quality deterioration.
+#+label: fig:qual_plot
+#+attr_latex: width=12cm
+[[./qual_plot.png]]
+
+
+* Method
+
+** Assumptions
+
+1. All contaminant sequences are known /a priori/ and are reliable.
+2. A contaminant sequence with length $l_c$ in a read of length $l_r$
+ will only be found between $l_r - l_c$ and $l_r$.
+3. If the read is contaminated, the number of bases contaminating the
+ read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
+ the 5'-end of the contaminant sequence. While this seems limiting,
+ it is the mode by which 3'-end adapter and barcode contamination
+ occurs.[fn:: We have encountered Illumina data in which adapters
+ contaminate the read and are present past their length in the
+ 3'-end. The sequence after the adapter was all poly-A
+ sequence. These extreme cases can be removed by appending poly-A
+ sequence to the end of the adapters in the adapters file.]
+4. Contaminants will only have mismatching bases; no contaminants with
+ insertions, deletions, or transpositions will be addressed by Scythe.
+5. The quality line of the FASTQ file accurately estimates the
+ probability of bases being called correctly.
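
The scoring rule from the "String matching in Scythe" hunk (a match scores 1, a mismatch scores -1, no insertions or deletions) combines with assumptions 2 and 3 above into a very small search. The following is only an illustrative Python sketch, not Scythe's actual implementation: the function name `best_3prime_match` and the example sequences are hypothetical, and the Bayesian classification step that Scythe applies to the best match is not shown.

```python
def best_3prime_match(read, adapter):
    """Return (score, n_c) for the best-scoring overlap between the 3'-end
    of `read` and a 5'-prefix of `adapter`.

    Hypothetical helper for illustration only. Each matching base scores +1
    and each mismatching base scores -1; gaps are never considered.
    Returns (None, 0) when no overlap is possible.
    """
    best_score, best_n = None, 0
    # Assumption 3: between 1 and l_c adapter bases contaminate the read.
    max_n = min(len(read), len(adapter))
    for n_c in range(1, max_n + 1):
        tail = read[len(read) - n_c:]   # last n_c bases of the read (assumption 2)
        prefix = adapter[:n_c]          # first n_c bases of the adapter (assumption 3)
        score = sum(1 if r == a else -1 for r, a in zip(tail, prefix))
        # Strict comparison keeps the shortest overlap when scores tie.
        if best_score is None or score > best_score:
            best_score, best_n = score, n_c
    return best_score, best_n


# Illustrative sequences: the last 8 bases of the read equal the first
# 8 bases of the adapter.
print(best_3prime_match("ACGTACGTAGATCGGA", "AGATCGGAAGAGC"))  # → (8, 8)
```

Because only ungapped overlaps anchored at the read's 3'-end are tried, the search needs at most $l_c$ comparisons per adapter rather than a full alignment.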

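Assumption 5 relies on the FASTQ quality line encoding the probability that each base was called incorrectly. A minimal sketch of that conversion follows, assuming standard Phred scaling ($p = 10^{-Q/10}$) and an ASCII offset of 33; some Illumina pipelines use an offset of 64, so `offset` is left as a parameter. The function name `error_probabilities` is hypothetical, and how Scythe uses these probabilities in its likelihood is not shown here.

```python
def error_probabilities(quality_line, offset=33):
    """Convert a FASTQ quality string into per-base error probabilities.

    Hypothetical helper for illustration only. Assumes Phred-scaled
    qualities: Q = ord(char) - offset and p_error = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_line]


# 'I' encodes Q40, '5' encodes Q20 and '+' encodes Q10 under offset 33.
for ch, p in zip("I5+", error_probabilities("I5+")):
    print(ch, p)
```

The probability of a base being called correctly is then one minus the value returned for that position.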