paper edits, removing lots of material to shift to sup materials

commit fcc06c14d1803dc55d952f4db871758e78beccec 1 parent 133fc91
@vsbuffalo authored
Showing with 78 additions and 40 deletions.
  1. +44 −40 paper/scythe-paper.org
  2. +34 −0 paper/supplementary-materials.org
84 paper/scythe-paper.org
@@ -4,21 +4,41 @@
#+date:
#+babel: :results output :exports both :session :comments org
-* 3'-end Contamination
+
+#+begin_abstract
+*Motivation:* Modern sequencing technologies can leave artifactual
+ contaminant sequences at the 3'-end of reads. 3'-end regions also
+ have the lowest-quality bases, which are the most likely to be called
+ incorrectly; this makes identifying and removing 3'-end contaminants
+ difficult. Fixed-number mismatch approaches to remove contaminants
+ can fail in these low quality regions. Failing to remove such
+ contaminants can seriously confound downstream analyses like assembly
+ and mapping.
+
+
+*Results:* Scythe is a program designed specifically to remove 3'-end
+ contaminants. It searches for 3'-end contaminants and uses a Bayesian
+ model that considers individual base qualities to decide whether a
+ given match is a contaminant or background sequence. Across a
+ variety of prior contamination rates, Scythe outperforms other
+ adapter removal software tools.
+
+*Availability:* Scythe is freely available under the MIT license at
+ https://github.com/vsbuffalo/scythe.
+#+end_abstract
+
+* Introduction
Scythe focuses on 3'-end contaminants, specifically those due to
-adapters or barcodes. Many second-generation sequencing technologies
-such as Illumina's Genome Analyzer II and HiSeq have lower-quality
-3'-end bases. These low-quality bases are more likely to have
-nucleotides called incorrectly, making contaminant identification more
-difficult. Futhermore, 3'-end quality deterioration is not uniform
-across all reads; as figure \ref{fig:qual_plot} shows, there is
-variation in the quality per base.
-
-#+caption: 3'-end quality deterioration.
-#+label: fig:qual_plot
-#+attr_latex: width=12cm
-[[./qual_plot.png]]
+adapters or barcodes. It embraces the Unix Philosophy of "programs
+that do one thing well"
+(http://www.faqs.org/docs/artu/ch01s06.html). Many second-generation
+sequencing technologies such as Illumina's Genome Analyzer II and
+HiSeq have lower-quality 3'-end bases. These low-quality bases are
+more likely to have nucleotides called incorrectly, making contaminant
+identification more difficult. Furthermore, 3'-end quality
+deterioration is not uniform across all reads: there is variation in
+the quality per base (see figure 1 in Supplementary Materials).
A common step in read quality improvement procedures is to remove
these low-quality 3'-end sequences from reads. This is thought to
@@ -33,40 +53,24 @@ can be incorporated into classification procedures.
Fixed-number of mismatch approaches have the disadvantage that they
don't differentially weight a mismatch on a low-quality base from a
mismatch on a high-quality base. Furthermore, the fixed-number could
-easily be exhausted in a run of bad bases, even though every
-good-quality base perfectly matches the contaminant sequence.
+easily be exhausted in a run of bad bases (which are quite common in
+the 3'-end), even though every good-quality base perfectly matches the
+contaminant sequence.
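To illustrate the point about fixed-number mismatch approaches, here is a minimal Python sketch. It is not Scythe's code: the sequences, quality values, and the helper names `phred_to_error_prob` and `mismatches` are hypothetical, chosen only to show how a run of low-quality bases can use up a mismatch budget even when every high-quality base agrees with the contaminant. The only outside fact assumed is the standard Phred definition, P(error) = 10^(-Q/10).

```python
def phred_to_error_prob(q):
    """Standard Phred definition: probability that a base call is wrong."""
    return 10 ** (-q / 10)


def mismatches(read_tail, contaminant, quals, min_qual=0):
    """Count mismatching positions whose base quality is at least min_qual."""
    return sum(
        1
        for r, c, q in zip(read_tail, contaminant, quals)
        if r != c and q >= min_qual
    )


# Hypothetical example data: positions 7-10 of the read tail are miscalled,
# and those same positions carry very low Phred qualities.
contaminant = "AGATCGGAAGAGC"
read_tail = "AGATCGGTTCTGC"
quals = [35, 34, 33, 30, 30, 28, 25, 3, 2, 2, 2, 20, 20]

total = mismatches(read_tail, contaminant, quals)
high_quality_only = mismatches(read_tail, contaminant, quals, min_qual=20)

print("all mismatches:", total)
print("mismatches on bases with Q >= 20:", high_quality_only)
print("error probability at Q=2:", round(phred_to_error_prob(2), 3))
```

With a hypothetical budget of two mismatches, this tail would be rejected as a contaminant, although none of the mismatches falls on a base with quality 20 or higher.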
* Scythe's Methods
-Scythe uses Bayesian methods to identify 3'-end contaminants. Scythe
-does not address the problem of contaminants found anywhere in the
-read; contaminants in higher quality regions are easier to identify
-and fixed-number of mismatch techniques are accurate and
-fastest. Scythe only addresses the challenging problem of contaminants
-in low-quality 3-ends. In doing so, it makes some assumptions:
-
-1. All contaminants sequences are known /a priori/ and are reliable.
-2. A contaminant sequence with length $l_c$ in a read of length $l_r$
- will only be found between $l_r - l_c$ and $l_r$.
-3. If the read is contaminated, the number of bases contaminating the
- read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
- the 5'-end of the contaminant sequence. While this seems limiting,
- it is the mode by which 3'-end adapter and barcode contamination
- occurs.[fn:: We have encountered Illumina data in which adapters
- contaminate the read and are present past their length in the
- 3'-end. The sequence after the adapter was all poly-A
- sequence. These extreme cases can be removed by appending poly-A
- sequence to the end of the adapters in the adapters file.]
-4. Contaminants will only have mismatching bases; no contaminants with
- insertions, deletions, or transpositions will be addressed by Scythe.
-5. The quality line of the FASTQ file accurately estimates the
- probability of bases being called correctly.
+Scythe uses Bayesian methods to identify and remove a given set of
+3'-end contaminants (most often sequencing adapters). Scythe only
+checks for the adapters in the 3'-end; contaminants further towards
+the middle and 5'-ends often have high quality bases, so identifying
+and removing them is much simpler. Scythe makes some assumptions: see
+the Supplementary Materials. TODOREF
+
** String matching in Scythe
-Scythe begins by searching for a length of contaminant in the
-read. For each adapter in the file, Scythe looks for the best match in
+For each adapter in the file, Scythe looks for the best match in
terms of scores. A nucleotide match is scored as a 1, and a mismatch
is scored as a -1. Because Scythe doesn't address contaminants with
insertions or deletions, it doesn't use a standard alignment strategy
34 paper/supplementary-materials.org
@@ -0,0 +1,34 @@
+#+title: Scythe Supplementary Materials
+#+author: Vince Buffalo
+#+email: vsbuffalo@ucdavis.edu
+#+date:
+#+babel: :results output :exports both :session :comments org
+
+* 3'-end Quality
+
+#+caption: 3'-end quality deterioration.
+#+label: fig:qual_plot
+#+attr_latex: width=12cm
+[[./qual_plot.png]]
+
+
+* Method
+
+** Assumptions
+
+1. All contaminant sequences are known /a priori/ and are reliable.
+2. A contaminant sequence with length $l_c$ in a read of length $l_r$
+ will only be found between $l_r - l_c$ and $l_r$.
+3. If the read is contaminated, the number of bases contaminating the
+ read $n_c$ is limited to $1 \le n_c \le l_c$, and always beginning from
+ the 5'-end of the contaminant sequence. While this seems limiting,
+ it is the mode by which 3'-end adapter and barcode contamination
+ occurs.[fn:: We have encountered Illumina data in which adapters
+ contaminate the read and are present past their length in the
+ 3'-end. The sequence after the adapter was all poly-A
+ sequence. These extreme cases can be removed by appending poly-A
+ sequence to the end of the adapters in the adapters file.]
+4. Contaminants will only have mismatching bases; no contaminants with
+ insertions, deletions, or transpositions will be addressed by Scythe.
+5. The quality line of the FASTQ file accurately estimates the
+ probability of bases being called correctly.
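Assumptions 2 to 4, combined with the +1 / -1 scoring described in the paper, can be sketched as follows. This is an illustrative Python sketch under those assumptions, not Scythe's implementation: the function names `score_overlap` and `best_match`, the tie-breaking rule, and the example sequences are all hypothetical. It compares only the last $n_c$ bases of the read against the first $n_c$ bases of the contaminant, with no insertions or deletions.

```python
def score_overlap(read, adapter, n_c):
    """Score the last n_c read bases against the first n_c adapter bases.

    A match scores +1 and a mismatch scores -1; no gaps are considered.
    """
    tail = read[len(read) - n_c:]
    head = adapter[:n_c]
    return sum(1 if r == a else -1 for r, a in zip(tail, head))


def best_match(read, adapter):
    """Return (n_c, score) for the best-scoring 3'-end overlap.

    n_c ranges over 1 <= n_c <= l_c, capped at the read length.
    Ties are broken in favour of the longer overlap (an arbitrary choice
    made for this sketch).
    """
    max_n = min(len(read), len(adapter))
    candidates = [(score_overlap(read, adapter, n), n) for n in range(1, max_n + 1)]
    score, n_c = max(candidates)
    return n_c, score


# Hypothetical read whose last 7 bases are the first 7 bases of the adapter.
adapter = "AGATCGGAAGAGC"
read = "ACGTTGCAAGATCGG"

print(best_match(read, adapter))
```

A score alone does not decide whether the match is a contaminant; in the paper that decision is made by the Bayesian model using the base qualities, which this sketch does not attempt.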