paper edits, removing lots of materials to shift to sup materials

vsbuffalo · Mar 8, 2012 · fcc06c1
1 parent 133fc91

Showing 2 changed files with 78 additions and 40 deletions.
diff --git a/paper/scythe-paper.org b/paper/scythe-paper.org
@@ -4,21 +4,41 @@
 #+date: 
 #+babel: :results output :exports both :session :comments org
 
-* 3'-end Contamination
+
+#+begin_abstract
+*Motivation:* Modern sequencing technologies can leave artifactual
+ contaminant sequences at the 3'-end of reads. 3'-end regions also
+ have the lowest quality bases, which are likely to be called
+ incorrectly; this makes identifying and removing 3'-end contaminants
+ difficult. Fixed-number mismatch approaches to remove contaminants
+ can fail in these low quality regions. Failing to remove such
+ contaminants can seriously confound downstream analyses like assembly
+ and mapping.
+
+
+*Results:* Scythe is a program designed specifically to remove 3'-end
+ contaminants. It searches for 3'-end contaminants and uses a Bayesian
+ model that considers individual base qualities to decide whether a
+ given match is a contaminant or background sequence. Even for a
+ variety of prior contamination rates, Scythe outperforms other
+ adapter removal software tools.
+
+*Availability:* Scythe is freely available under the MIT license at
+ https://github.com/vsbuffalo/scythe.
+#+end_abstract
+
+* Introduction
 
 Scythe focuses on 3'-end contaminants, specifically those due to
-adapters or barcodes. Many second-generation sequencing technologies
-such as Illumina's Genome Analyzer II and HiSeq have lower-quality
-3'-end bases. These low-quality bases are more likely to have
-nucleotides called incorrectly, making contaminant identification more
-difficult. Futhermore, 3'-end quality deterioration is not uniform
-across all reads; as figure \ref{fig:qual_plot} shows, there is
-variation in the quality per base.
-
-#+caption: 3'-end quality deterioration.
-#+label: fig:qual_plot
-#+attr_latex: width=12cm
-[[./qual_plot.png]]
+adapters or barcodes. It embraces the Unix Philosophy of "programs
+that do one thing well"
+(http://www.faqs.org/docs/artu/ch01s06.html). Many second-generation
+sequencing technologies such as Illumina's Genome Analyzer II and
+HiSeq have lower-quality 3'-end bases. These low-quality bases are
+more likely to have nucleotides called incorrectly, making contaminant
+identification more difficult. Furthermore, 3'-end quality
+deterioration is not uniform across all reads (see figure 1 in
+Supplementary Materials); there is variation in the quality per base.
 
 A common step in read quality improvement procedures is to remove
 these low-quality 3'-end sequences from reads. This is thought to
@@ -33,40 +53,24 @@ can be incorporated into classification procedures.
 Fixed-number of mismatch approaches have the disadvantage that they
 don't differentially weight a mismatch on a low-quality base from a
 mismatch on a high-quality base. Futhermore, the fixed-number could
-easily be exhausted in a run of bad bases, even though every
-good-quality base perfectly matches the contaminant sequence.
+easily be exhausted in a run of bad bases (which are quite common in
+the 3'-end), even though every good-quality base perfectly matches the
+contaminant sequence.
 
 
 * Scythe's Methods
 
-Scythe uses Bayesian methods to identify 3'-end contaminants. Scythe
-does not address the problem of contaminants found anywhere in the
-read; contaminants in higher quality regions are easier to identify
-and fixed-number of mismatch techniques are accurate and
-fastest. Scythe only addresses the challenging problem of contaminants
-in low-quality 3-ends. In doing so, it makes some assumptions:
-
-1. All contaminants sequences are known /a priori/ and are reliable.
-2. A contaminant sequence with length $l_c$ in a read of length $l_r$
-   will only be found between $l_r - l_c$  and $l_r$.
-3. If the read is contaminated, the number of bases contaminating the
-   read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
-   the 5'-end of the contaminant sequence. While this seems limiting,
-   it is the mode by which 3'-end adapter and barcode contamination
-   occurs.[fn:: We have encountered Illumina data in which adapters
-   contaminate the read and are present past their length in the
-   3'-end. The sequence after the adapter was all poly-A
-   sequence. These extreme cases can be removed by appending poly-A
-   sequence to the end of the adapters in the adapters file.]
-4. Contaminants will only have mismatching bases; no contaminants with
-   insertions, deletions, or transpositions will be addressed by Scythe.
-5. The quality line of the FASTQ file accurately estimates the
-   probability of bases being called correctly.
+Scythe uses Bayesian methods to identify and remove a given set of
+3'-end contaminants (most often sequencing adapters). Scythe only
+checks for the adapters in the 3'-end; contaminants further towards
+the middle and 5'-ends often have high quality bases, so identifying
+and removing them is much simpler. Scythe makes some assumptions: see
+the Supplementary Materials. TODOREF
+
 
 ** String matching in Scythe
 
-Scythe begins by searching for a length of contaminant in the
-read. For each adapter in the file, Scythe looks for the best match in
+For each adapter in the file, Scythe looks for the best match in
 terms of scores. A nucleotide match is scored as a 1, and a mismatch
 is scored as a -1. Because Scythe doesn't address contaminants with
 insertions or deletions, it doesn't use a standard alignment strategy
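
The scoring described in this hunk (+1 for a nucleotide match, -1 for a mismatch, no insertions or deletions) can be sketched in Python. This is an illustrative sketch only, not Scythe's own code: the function names `score_overlap` and `best_3prime_match`, the example sequences, and the tie-breaking rule (keep the first best score) are all assumptions. The search window follows the paper's stated assumptions that a contaminant lies between $l_r - l_c$ and $l_r$ and begins at its own 5'-end.

```python
def score_overlap(read_tail, adapter_prefix):
    """Score two equal-length strings: +1 per matching base, -1 per mismatch."""
    return sum(1 if r == a else -1 for r, a in zip(read_tail, adapter_prefix))


def best_3prime_match(read, adapter):
    """Return (score, start) for the best ungapped match of an adapter prefix
    against the 3'-end of the read, or None for an empty read.

    Hypothetical helper: candidate start positions run from l_r - l_c to
    l_r - 1, and ties are resolved in favour of the earliest start.
    """
    best = None
    first_start = max(0, len(read) - len(adapter))
    for start in range(first_start, len(read)):
        tail = read[start:]
        # The contaminant is assumed to begin at its 5'-end, so compare the
        # read tail against the adapter prefix of the same length.
        score = score_overlap(tail, adapter[:len(tail)])
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    # Made-up read whose last 8 bases equal the first 8 bases of the adapter.
    print(best_3prime_match("ACGTACGTAGATCGGA", "AGATCGGAAGAGC"))
```

A score like this only ranks candidate matches; per the abstract, the decision about whether the best match is a contaminant is made by the Bayesian model using base qualities.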

diff --git a/paper/supplementary-materials.org b/paper/supplementary-materials.org
@@ -0,0 +1,34 @@
+#+title: Scythe Supplementary Materials
+#+author: Vince Buffalo
+#+email: vsbuffalo@ucdavis.edu
+#+date: 
+#+babel: :results output :exports both :session :comments org
+
+* 3'-end Quality
+
+#+caption: 3'-end quality deterioration.
+#+label: fig:qual_plot
+#+attr_latex: width=12cm
+[[./qual_plot.png]]
+
+
+* Method
+
+** Assumptions
+
+1. All contaminant sequences are known /a priori/ and are reliable.
+2. A contaminant sequence with length $l_c$ in a read of length $l_r$
+   will only be found between $l_r - l_c$  and $l_r$.
+3. If the read is contaminated, the number of bases contaminating the
+   read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
+   the 5'-end of the contaminant sequence. While this seems limiting,
+   it is the mode by which 3'-end adapter and barcode contamination
+   occurs.[fn:: We have encountered Illumina data in which adapters
+   contaminate the read and are present past their length in the
+   3'-end. The sequence after the adapter was all poly-A
+   sequence. These extreme cases can be removed by appending poly-A
+   sequence to the end of the adapters in the adapters file.]
+4. Contaminants will only have mismatching bases; no contaminants with
+   insertions, deletions, or transpositions will be addressed by Scythe.
+5. The quality line of the FASTQ file accurately estimates the
+   probability of bases being called correctly.
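
Assumption 5 relies on the FASTQ quality line encoding per-base error probabilities. A minimal Python sketch of the standard Phred conversion, $P_{error} = 10^{-Q/10}$, is given below. The function name `phred_to_p_correct` is hypothetical, and the ASCII offset of 33 is an assumption: some Illumina pipelines from this period used an offset of 64, so the offset should be checked against the data.

```python
def phred_to_p_correct(quality_line, offset=33):
    """Convert a FASTQ quality string into per-base probabilities that the
    base was called correctly.

    Assumes Phred-scaled qualities (Q = -10 * log10(P_error)) encoded as
    ASCII characters with the given offset (33 here; 64 for some older data).
    """
    probs = []
    for ch in quality_line:
        q = ord(ch) - offset          # Phred score for this base
        p_error = 10 ** (-q / 10)     # probability the call is wrong
        probs.append(1.0 - p_error)
    return probs


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under an offset of 33.
    print(phred_to_p_correct("I5"))
```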