added helper script to mine reads for (TruSeq) sample IDs
jfass committed Aug 21, 2012
1 parent 5dda414 commit b294cec
Showing 2 changed files with 68 additions and 3 deletions.
13 changes: 10 additions & 3 deletions README.md
@@ -138,9 +138,16 @@ Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
chemistry has the barcode separated from the fragment, so that it
appears as an entirely separate read and is used to demultiplex sample
reads by Illumina's CASAVA pipeline.
chemistry (TruSeq) has the barcode separated from the fragment, so
that it appears as an entirely separate read that is used to
demultiplex sample reads by Illumina's CASAVA pipeline. In case the 6
bp sample IDs are not readily available, we have included a helper
script to profile the IDs found in the forward reads of a TruSeq data
set (reverse reads are subject to contamination by a different adapter
that doesn't contain variable sequence). Use this script as follows
(e.g.):

    cat reads.fq | head -4000000 | perl profileTruSeqIDs.pl > IDsFound.txt

### Does Scythe work on 5'-end or other contaminants?

58 changes: 58 additions & 0 deletions profileTruSeqIDs.pl
@@ -0,0 +1,58 @@
#!/usr/bin/perl -w

# AUTHOR: Joseph Fass <joseph.fass@gmail.com>
# LAST REVISED: August 2012
# The Bioinformatics Core at UC Davis Genome Center
# http://bioinformatics.ucdavis.edu

# profileTruSeqIDs.pl is the proprietary property of The Regents of
# the University of California (“The Regents.”) Copyright 2007-12 The
# Regents of the University of California, Davis campus. All Rights
# Reserved. Redistribution and use in source and binary forms, with
# or without modification, are permitted by nonprofit, research
# institutions for research use only, provided that the following
# conditions are met: Redistributions of source code must retain the
# above copyright notice, this list of conditions and the following
# disclaimer. Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided with
# the distribution. The name of The Regents may not be used to
# endorse or promote products derived from this software without
# specific prior written permission. The end-user understands that
# the program was developed for research purposes and is advised not
# to rely exclusively on the program for any reason. THE SOFTWARE
# PROVIDED IS ON AN "AS IS" BASIS, AND THE REGENTS HAVE NO OBLIGATION
# TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS. THE REGENTS SPECIFICALLY DISCLAIM ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE REGENTS BE LIABLE TO ANY
# PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY OR
# CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES, LOSS OF USE, DATA OR PROFITS, OR
# BUSINESS INTERRUPTION, HOWEVER CAUSED AND UNDER ANY THEORY OF
# LIABILITY WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE AND ITS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY
# OF SUCH DAMAGE. If you do not agree to these terms, do not download
# or use the software. This license may be modified only in a writing
# signed by authorized signatory of both parties.

# searches for TruSeq adapters in fastq, extracts and counts the 6 bp
# "barcodes," or multiplex identifiers

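use strict;
use warnings;

my %count;    # sample ID => number of reads in which it was found

# FASTQ records are four lines each: header, sequence, "+", qualities
while (<>) {                      # header line (unused)
    my $seq = <>;                 # sequence line
    last unless defined $seq;     # stop on a truncated final record
    <>;                           # "+" separator line (unused)
    <>;                           # quality line (unused)
    # the 6 bases immediately after the adapter are taken as the sample ID
    if ($seq =~ /AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC([ACGT]{6})/) {
        $count{$1}++;
    }
}
if (%count) {
    # report IDs from most to least frequent
    foreach my $ID (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
        print "$ID\t$count{$ID}\n";
    }
} else {
    print "\nNo IDs found! Reads too short? Try visual inspection, with a shorter search string, like \"AGATCGGAAGAG\".\n\n";
}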
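
As a rough cross-check on the script's counts, the same extraction can be sketched with standard POSIX tools. This is an illustration only and not part of the commit: the function name `count_ids` is hypothetical, and it assumes four-line FASTQ records on standard input, as the Perl script does. IDs with equal counts may be listed in a different order than in the Perl output.

```shell
#!/bin/sh
# Adapter sequence copied from profileTruSeqIDs.pl
adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

# count_ids (hypothetical helper): reads FASTQ on stdin and prints
# "ID<TAB>count" lines, most frequent ID first.
count_ids() {
    awk -v a="$adapter" '
        # Line 2 of every 4-line record is the sequence. match() finds the
        # leftmost adapter that is followed by six unambiguous bases.
        NR % 4 == 2 && match($0, a "[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]") {
            count[substr($0, RSTART + length(a), 6)]++
        }
        END { for (id in count) print id "\t" count[id] }
    ' | sort -k2,2nr
}
```

It can be used in place of the Perl script in the README pipeline, e.g. `head -4000000 reads.fq | count_ids > IDsFound.txt`, and the two outputs compared.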
