diff --git a/README.md b/README.md
index 781a357..15639f0 100644
--- a/README.md
+++ b/README.md
@@ -138,9 +138,16 @@ Scythe adapter files that contain all possible barcodes concatenated with
 possible adapters, so that both can be recognized and removed. This
 has worked well and is recommended for cases when 3'-end quality
 deteriorates and prevents barcode removal. Newer Illumina
-chemistry has the barcode separated from the fragment, so that it
-appears as an entirely separate read and is used to demultiplex sample
-reads by Illumina's CASAVA pipeline.
+chemistry (TruSeq) has the barcode separated from the fragment, so
+that it appears as an entirely separate read that is used to
+demultiplex sample reads by Illumina's CASAVA pipeline. In case the 6
+bp sample IDs are not readily available, we have included a helper
+script to profile the IDs found in the forward reads of TruSeq data
+set (reverse reads are subject to contamination by a different adapter
+that doesn't contain variable sequence). Use this script as follows
+(e.g.):
+
+cat reads.fq | head -4000000 | perl profileTruSeqIDs.pl > IDsFound.txt
 
 ### Does Scythe work on 5'-end or other contaminants?
 
diff --git a/profileTruSeqIDs.pl b/profileTruSeqIDs.pl
new file mode 100755
index 0000000..e570e3e
--- /dev/null
+++ b/profileTruSeqIDs.pl
@@ -0,0 +1,58 @@
+#!/usr/bin/perl -w
+
+# AUTHOR: Joseph Fass
+# LAST REVISED: August 2012
+# The Bioinformatics Core at UC Davis Genome Center
+# http://bioinformatics.ucdavis.edu
+
+# profileTruSeqIDs.pl is the proprietary property of The Regents of
+# the University of California (“The Regents.”) Copyright 2007-12 The
+# Regents of the University of California, Davis campus. All Rights
+# Reserved. Redistribution and use in source and binary forms, with
+# or without modification, are permitted by nonprofit, research
+# institutions for research use only, provided that the following
+# conditions are met: Redistributions of source code must retain the
+# above copyright notice, this list of conditions and the following
+# disclaimer. Redistributions in binary form must reproduce the above
+# copyright notice, this list of conditions and the following
+# disclaimer in the documentation and/or other materials provided with
+# the distribution. The name of The Regents may not be used to
+# endorse or promote products derived from this software without
+# specific prior written permission. The end-user understands that
+# the program was developed for research purposes and is advised not
+# to rely exclusively on the program for any reason. THE SOFTWARE
+# PROVIDED IS ON AN "AS IS" BASIS, AND THE REGENTS HAVE NO OBLIGATION
+# TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
+# MODIFICATIONS. THE REGENTS SPECIFICALLY DISCLAIM ANY EXPRESS OR
+# IMPLIED WARRANTIES, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE REGENTS BE LIABLE TO ANY
+# PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY OR
+# CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES, LOSS OF USE, DATA OR PROFITS, OR
+# BUSINESS INTERRUPTION, HOWEVER CAUSED AND UNDER ANY THEORY OF
+# LIABILITY WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE AND ITS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY
+# OF SUCH DAMAGE. If you do not agree to these terms, do not download
+# or use the software. This license may be modified only in a writing
+# signed by authorized signatory of both parties.
+
+# searches for TruSeq adapters in fastq, extracts and counts the 6 bp
+# "barcodes," or multiplex identifiers
+
+while (<>) {
+    $seq = <>;
+    <>;
+    <>;
+    if ($seq =~ /AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC([ACGT]{6})/) {
+        $count{$1}++;
+    }
+}
+if (%count) {
+    foreach $ID (reverse sort {$count{$a}<=>$count{$b}} keys %count) {
+        print "$ID\t$count{$ID}\n";
+    }
+} else {
+    print "\nNo ID's found! Reads too short? Try visual inspection, with a shorter search string, like \"AGATCGGAAGAG\".\n\n";
+}
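
Review note: the script reads four lines per record, so the `head -4000000` in the README example samples the first 1,000,000 reads. As a cross-check on the script's output, the same profile can be approximated with standard tools. This is a minimal sketch, not part of the patch: the function name `profile_ids` and its arguments are hypothetical, and it assumes a well-formed FASTQ with exactly four lines per record. Like the Perl regex, it takes the six bases after the leftmost adapter match in each read.

```shell
#!/bin/sh
# Sketch only: profile_ids is a hypothetical helper, not part of the patch.
# Assumes a well-formed FASTQ with exactly four lines per record.

# The 34-base adapter prefix that profileTruSeqIDs.pl searches for.
ADAPTER='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'

profile_ids() {
    fastq=$1
    reads=${2:-1000000}    # reads to sample; 1,000,000 reads = 4,000,000 lines

    head -n "$((reads * 4))" "$fastq" |
    awk -v adapter="$ADAPTER" '
        # Line 2 of each four-line record holds the sequence.
        NR % 4 == 2 {
            # Take the six bases after the leftmost adapter match.
            if (match($0, adapter "[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]"))
                print substr($0, RSTART + length(adapter), 6)
        }' |
    sort | uniq -c | sort -rn |
    awk '{ printf "%s\t%s\n", $2, $1 }'    # ID, tab, count; most frequent first
}
```

A possible invocation is `profile_ids reads.fq > IDsFound.txt`. Unlike the Perl script, this sketch prints nothing when no IDs are found, so an empty output file plays the role of the script's "No ID's found!" message.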