
Create gh-pages branch via GitHub

1 parent 0035299 commit a2788c66b9a9a431431bda10f4fef068b6791582 @srobb1 committed Feb 18, 2013
Showing with 89 additions and 67 deletions.
  1. +88 −66 index.html
  2. +1 −1 params.json
@@ -31,11 +31,19 @@
<hr>
<section id="main_content">
- <p><a href="http://srobb1.github.com/RelocaTE/">RelocaTE</a>: A tool to identify the locations of transposable element insertion events that are present in DNA short read data but absent in the reference genome sequence.</p>
+ <p><a href="http://srobb1.github.com/RelocaTE/">RelocaTE</a> is a collection of scripts that takes short reads (paired or unpaired), a fasta file containing the sequences of transposable elements, and a reference genome sequence as input. The output is a series of files containing the locations (relative to the reference genome) of TE insertions in the reference and in the short reads:</p>
-<p>RelocaTE is a collection of scripts in which short reads (paired or unpaired), a fasta containing the sequences of transposable elements and a reference genome sequence are the input and the output is a series of files containing the locations (relative to the reference genome) of TE insertions in the short reads. These insertions are insertions that are present only in the short reads and not present in the reference genome. If a tab-delimited file containing the coordinates of TEs in the reference is provided a list of the number of reads that support the presence of existing TE insertions is produced.</p>
-
-<p>CharacTErizer: A companion tool compares the numbers of reads that flank the TE sequence and contain genomic sequence to the number of reads that span a predicted insertion site with no gaps. These spanners contain no TE sequence. The ratio of spanners to flankers is used to classify the insertion as homozygous, heterozygous, new (somatic) or other.</p>
+<ol>
+<li>
+<strong>non-reference</strong>: transposable element insertion events that are present in the DNA short read data but absent from the reference genome sequence.</li>
+<li>
+<strong>reference</strong>: transposable element insertions that are present in the reference.<br>
+</li>
+<li>
+<strong>shared</strong>: transposable element insertions that are present in both the reference and the reads.</li>
+<li>
+<strong>reference-only</strong>: transposable element insertions that are present in the reference with no evidence of the insertion in the reads. This could be due to a lack of data. Future releases of RelocaTE will report evidence-based reference-only insertions.</li>
+</ol><p><a href="http://srobb1.github.com/RelocaTE/#characterizer">CharacTErizer</a> is a companion tool that compares the number of reads that flank the TE sequence and contain genomic sequence to the number of reads that span a predicted insertion site with no gaps. These spanners contain no TE sequence. The ratio of spanners to flankers is used to classify the insertion as homozygous, heterozygous, or new (somatic). Somatic excision events can also be predicted.</p>
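The spanner-to-flanker comparison described above can be sketched as a small function. This is only an illustration: the cutoffs `low` and `high`, the function name, and the `"unsupported"` label are hypothetical and are not taken from CharacTErizer. The sketch also assumes that few spanners relative to flankers suggests a homozygous insertion and that many spanners suggests a new (somatic) one.

```python
def classify_insertion(spanners, flankers, low=0.2, high=5.0):
    """Classify one insertion site from its read counts.

    spanners: reads crossing the site with no gaps and no TE sequence.
    flankers: reads containing both TE sequence and flanking genome.
    low, high: HYPOTHETICAL cutoffs on spanners/flankers (not CharacTErizer's).
    """
    if spanners < 0 or flankers < 0:
        raise ValueError("read counts must be non-negative")
    if flankers == 0:
        # No read supports the insertion, so there is nothing to classify.
        return "unsupported"
    ratio = spanners / flankers
    if ratio < low:
        return "homozygous"    # almost every read carries the TE
    if ratio <= high:
        return "heterozygous"  # both alleles are well represented
    return "new"               # mostly empty-site reads: possibly somatic
```

Suitable cutoffs depend on coverage and on the sample, so they are left as parameters for the caller to choose.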
<hr><hr><h3>Table of Contents:<br>
</h3>
@@ -71,13 +79,14 @@
<ul>
<li><a href="http://genome.ucsc.edu/FAQ/FAQblat.html#blat3">Blat</a></li>
-<li>Bowtie (recommended <a href="http://bowtie-bio.sourceforge.net/manual.shtml#obtaining-bowtie">Bowtie1</a>, but RelocaTE is compatible with <a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#obtaining-bowtie-2">Bowtie2</a>)</li>
+<li><a href="http://bowtie-bio.sourceforge.net/manual.shtml#obtaining-bowtie">Bowtie 1</a></li>
<li><a href="http://www.bioperl.org/wiki/Installing_BioPerl">BioPerl</a></li>
+<li><a href="http://samtools.sourceforge.net/">Samtools</a></li>
<li>
-<a href="http://samtools.sourceforge.net/">Samtools</a>
-<br><br>
-</li>
-</ul><h3>
+<a href>BWA</a> Recommended for the creation of the BAM file needed by CharacTErizer.</li>
+<li>
+<a href="http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&amp;PAGE_TYPE=BlastDocs&amp;DOC_TYPE=Download">Blast (Legacy)</a>: formatdb and fastacmd are used for indexed sequence retrieval in an additional companion tool, ConstrucTEr (more information coming soon).</li>
+</ul><p><br></p><hr><br><h3>
<a name="cmd">RelocaTE Command Line Options</a>:</h3>
<pre>
@@ -102,10 +111,9 @@
find the location of your TE in the reference. option-2) input the
file name of a tab-delimited file containing the coordinates of TE
insertions pre-existing in the reference sequence. [no default]
-<a href="#b2">-b2 | --bowtie2</a> 0|1: Use bowtie2 use '-b2 1' else for Bowtie1 use '-b2 0' [0]
</pre>
-<h4><a name="t">-t TE Fasta File</a></h4>
+<h4><a name="t">-t | --te_fasta Str: TE FASTA File</a></h4>
<p>Required. No default value.</p>
@@ -135,33 +143,33 @@
Example: CGA followed by any character then an A then CT or G: TSD=CGA.A(CT|G)
</pre>
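To see how a TSD pattern such as the `CGA.A(CT|G)` example behaves, here is a minimal sketch that searches a read and its reverse complement with the same expression. It uses Python's `re` module, which handles this simple pattern the same way Perl would. The helper names are hypothetical and this is not RelocaTE's own code.

```python
import re

# Example pattern from the text: CGA, any character, A, then CT or G.
TSD = re.compile(r"CGA.A(CT|G)")

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_tsd(seq):
    """Return (strand, start, matched_text) for every TSD match.

    Starts on the "-" strand are offsets into the reverse-complemented read.
    """
    hits = [("+", m.start(), m.group(0)) for m in TSD.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [("-", m.start(), m.group(0)) for m in TSD.finditer(rc)]
    return hits
```

Testing a pattern this way on a few known insertion sites is a quick check that the `TSD=` value in the TE fasta header matches what is intended.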
-<h4><a name="d">-d directory of fq files</a></h4>
+<h4><a name="d">-d | --fq_dir Str: directory of fq files</a></h4>
<p>Required. No default value.</p>
<p>The name of the directory of paired and unpaired fastq files (paired _p1.fq &amp; _p2.fq). Both the .fq and .fastq extensions are accepted as extensions of fastq files. If something different is used RelocaTE will not recognize those files as being fastq files.</p>
-<h4><a name="g">-g reference genome fasta file</a></h4>
+<h4><a name="g">-g | --genome_fasta Str: reference genome fasta file</a></h4>
<p>Optional, Recommended. No default value.</p>
<p>The file name of the fasta file containing the genome sequence of the reference. If it is provided, the locations of the insertion events will be reported relative to the reference genome coordinates. </p>
<p>If the genome sequence is not provided a series of files will be generated. One set will contain the intact reads that align to the TE. The second and third set of files will be made up of trimmed reads. The second set will be only the trimmed portion of the reads in the first set that align to the TE. The third set will contain the trimmed portion of the reads that do not align to the TE, therefore the portion of the reads that should align to the genome sequence not containing a TE insertion.</p>
-<h4><a name="e">-e Sample identifier</a></h4>
+<h4><a name="e">-e | --exper Str: Sample identifier</a></h4>
<p>Optional, Recommended. The default value is not.given</p>
<p>A short string for sample name. This string will be used in the output files to create IDs for the insert (ex. A123)</p>
-<h4><a name="o">-o output directory name</a></h4>
+<h4><a name="o">-o | --outdir Str: output directory name</a></h4>
<p>Optional, Recommended. The default value is outdir_teSearch</p>
<p>A short string for the output directory name. This string will be used to create a directory to contain the output files and directories in the current working directory. The complete path is not required, only the desired name for the directory. </p>
-<h4><a name="1">-1 unique mate/pair 1 string</a></h4>
+<h4><a name="1">-1 | --mate_1_id Str: unique mate/pair 1 string</a></h4>
<p>Optional, Recommended. The default value is _p1</p>
@@ -185,7 +193,7 @@
identify all _p1 files and no _p2 files.
</pre>
-<h4><a name="2">-2 unique mate/pair 2 string</a></h4>
+<h4><a name="2">-2 | --mate_2_id Str: unique mate/pair 2 string</a></h4>
<p>Optional, Recommended. The default value is _p2</p>
@@ -198,7 +206,7 @@
String: _p2
</pre>
-<h4><a name="u">-u unique unPaired string</a></h4>
+<h4><a name="u">-u | --unpaired_id Str: unique unPaired string</a></h4>
<p>Optional, Recommended. The default value is .unPaired.</p>
@@ -211,18 +219,18 @@
String: .unPaired
</pre>
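The effect of the `-1`, `-2` and `-u` strings on file selection can be illustrated with a short sketch. It assumes plain substring matching and the default values `_p1`, `_p2` and `.unPaired`. The function name and the `"unmatched"` label are hypothetical, and RelocaTE's actual matching may differ.

```python
def classify_fastq(filename, mate_1_id="_p1", mate_2_id="_p2",
                   unpaired_id=".unPaired"):
    """Sort one file name into mate_1, mate_2, unpaired or unmatched.

    Returns None when the name does not end in .fq or .fastq, since only
    those two extensions are recognized as fastq files.
    """
    if not filename.endswith((".fq", ".fastq")):
        return None
    if mate_1_id in filename:
        return "mate_1"
    if mate_2_id in filename:
        return "mate_2"
    if unpaired_id in filename:
        return "unpaired"
    # Assumption: names matching none of the strings are reported separately.
    return "unmatched"
```

Running a check like this over the contents of the fastq directory before a long job is a cheap way to confirm that each identifier string selects only the files it should.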
-<h4><a name="p">-p n</a></h4>
+<h4><a name="p">-p | --parallel 0|1: split into many jobs</a></h4>
<p>Optional. Default value is 1.</p>
<p>n is 0 or 1.</p>
-<p>0: means only one large job will be ran.<br>
-1: many shell scripts will be generated for the user to run<br></p>
+<p>0: means only one large job will be run.<br>
+1: many shell scripts will be generated for the user to manually run<br></p>
<p>Break down the single big RelocaTE job into as many smaller jobs as possible. If selected, this option creates shell scripts that can be run manually or submitted to a queue, which enables the jobs to be run in parallel. The folders of shell scripts should be run in order: Step_1 must run to completion before the Step_2 jobs can be started properly. If the genome fasta has already been split and indexed, this job will be skipped. </p>
-<h4><a name="a">-a n</a></h4>
+<h4><a name="a">-a | --qsub_array 0|1: create PBS array job script</a></h4>
<p>Optional. Default value is 1.</p>
@@ -237,13 +245,13 @@
<p>See run_these_jobs.sh for the array jobs.</p>
-<h4><a name="w">-w working directory name</a></h4>
+<h4><a name="w">-w | --workingdir Str: working directory name</a></h4>
<p>Optional. Default value is the current working directory.</p>
<p>If a directory other than the current working directory is given, it must already exist; RelocaTE will not create it. Provide the full path. </p>
-<h4><a name="l">-l n</a></h4>
+<h4><a name="l">-l | --len_cutoff n: min length cutoff for alignment to reference</a></h4>
<p>Optional. Default value is 10.</p>
@@ -254,7 +262,7 @@
<li>How many bps are needed to limit false alignments to the reference?</li>
<li>How many bps are needed to recognize the TE? </li>
<li>The answer to the above two questions should not total more than the read length.</li>
-</ul><h4><a name="bm">-bm n</a></h4>
+</ul><h4><a name="bm">-bm | --blat_minScore n: blat minScore for alignment to TE</a></h4>
<p>Optional. Default value is 10.</p>
@@ -267,15 +275,15 @@
mismatches minus some sort of gap penalty.</p>
</blockquote>
-<h4><a name="m">-m n&lt;=0</a></h4>
+<h4><a name="m">-m | --mismatch (n&lt;=0) mismatches allowed in blat alignment to TE</a></h4>
<p>Optional. Default value is 0.</p>
<p>Any number less than or equal to 0.</p>
<p>Fraction of the bps that aligned to the TE that are allowed to not be an exact match. For example, if 10 bp align to the TE and the allowance is 0.1, 1 bp can be a mismatch.</p>
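The arithmetic in the example above (10 aligned bp with an allowance of 0.1 permits 1 mismatch) can be written out as follows. Rounding down for non-integer results is an assumption made for this sketch, and the function name is hypothetical.

```python
import math


def allowed_mismatches(aligned_bp, fraction):
    """Number of mismatching bases permitted among the aligned bases.

    Assumption: fractional results are rounded down. The tiny epsilon guards
    against floating-point products such as 2.9999999999999996.
    """
    if aligned_bp < 0 or fraction < 0:
        raise ValueError("aligned_bp and fraction must be non-negative")
    return math.floor(aligned_bp * fraction + 1e-9)
```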
-<h4><a name="bt">-bt n</a></h4>
+<h4><a name="bt">-bt | --blat_tileSize n: blat tileSize for alignment to TE</a></h4>
<p>Optional. Default value is 7.</p>
@@ -289,13 +297,13 @@
Default is 11 for DNA and 5 for protein.</p>
</blockquote>
-<h4><a name="f">-f n</a></h4>
+<h4><a name="f">-f | --flanking_seq_len n: length of the insertion site flanking seq to be returned</a></h4>
<p>Optional. Default value is 100.</p>
<p>n is the length of the sequence flanking the found insertion site to be returned in an output fasta file and in the output gff file. This sequence is taken from the reference genome.</p>
-<h4><a name="x">-r [1|File]</a></h4>
+<h4><a name="x">-r | --reference_ins [1|File] To identify reference and shared insertions (reference and reads)</a></h4>
<p>Optional. No default value.</p>
@@ -312,9 +320,7 @@
mping Chr12:1045463..1045892
</pre>
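A line of the coordinate file shown above can be parsed as in the sketch below. It assumes two whitespace-separated columns, a TE name followed by a location written as `sequence:start..end`, as in the `mping` example. The function name is hypothetical.

```python
def parse_existing_te(line):
    """Parse 'TE_name<TAB>Chr:start..end' into (te, chrom, start, end)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
    te_name, location = fields
    # Split on the last ':' so sequence names containing ':' still work.
    chrom, colon, span = location.rpartition(":")
    start_text, dots, end_text = span.partition("..")
    if not colon or not dots:
        raise ValueError(f"location is not Chr:start..end: {location!r}")
    return te_name, chrom, int(start_text), int(end_text)
```

Validating every line this way before starting a run catches malformed coordinates early instead of partway through a long job.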
-<p><br><br></p>
-
-<h3>
+<p><br></p><hr><br><h3>
<a name="quick">Quick Start Guide</a>:</h3>
<p>1.  Get the sequence of your TE, including the TIRs. Create a fasta file with your sequence, TE name and the TSD. The TSD= is required. With DNA transposons, by definition, during an insertion event the target site is duplicated. Therefore the target site will be used to identify an insertion event. The reverse complement of each read containing portions of the ends of the provided TE will also be searched for the TSD to identify insertion events. Either a specific sequence of nucleotides or a Perl regular expression can be used.
@@ -378,7 +384,7 @@
<p>7.  You are now ready to run relocaTE.pl with your data. If you run the program without any command line options, it will print out a list of the options and short descriptions.
<br><br></p>
-<h3>
+<p><br></p><hr><br><h3>
<a name="input">RelocaTE Input Files</a>:</h3>
<ol>
@@ -388,23 +394,34 @@
</li>
<li>Paired and/or unpaired Fastq files. (ex: reads_p1.fq, reads_p2.fq, reads_unPaired.fq, reads.fq) (required)</li>
<li>Tab-delimited file with coordinates of TE insertions in the reference genome (not required)</li>
-</ol><h3>
+</ol><p><br></p><hr><br><h3>
<a name="output">RelocaTE Output Files</a>:</h3>
<pre>
-Experiment_name.TE_name.all_inserts.gff: GFF3 file containing all reference and non-reference insertions
-Experiment_name.TE_name.all_nonref.txt: tab-delimited file containing all potential non-reference (insertions
- found only in reads, absent from reference) insertion sites
-Experiment_name.TE_name.confident_nonref.txt: tab-delimited file containing only the confident non-reference
- insertion sites
-Experiment_name.TE_name.confident_nonref_genomeflanks.fa: FASTA file containing the genome sequence which
- flanks each confident non-reference site
-Experiment_name.TE_name.confident_nonref_reads_list.txt: text file containing a list of the reads used to call
- each confident non-reference insertion
-Experiment_name.TE_name.all_reference.txt: text file containing counts of reads which overlap the reference insertions.
+sample_name.TE_name.all_inserts.gff
+ GFF3 file containing all reference and non-reference insertions
+
+sample_name.TE_name.all_nonref.txt
+ tab-delimited file containing all potential non-reference (insertions
+ found only in reads, absent from reference) insertion sites
+
+sample_name.TE_name.confident_nonref.txt
+ tab-delimited file containing only the confident non-reference
+ insertion sites
+
+sample_name.TE_name.confident_nonref_genomeflanks.fa
+ FASTA file containing the genome sequence which
+ flanks each confident non-reference site
+
+sample_name.TE_name.confident_nonref_reads_list.txt
+ text file containing a list of the reads used to call
+ each confident non-reference insertion
+
+sample_name.TE_name.all_reference.txt
+ text file containing counts of reads which overlap the reference insertions.
</pre>
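For scripts that post-process a run, the file names listed above can be generated from the sample name and TE name. This is a minimal sketch: the function name is hypothetical, and only the six suffixes shown in the list are used.

```python
# Suffixes copied from the output file list above.
OUTPUT_SUFFIXES = (
    "all_inserts.gff",
    "all_nonref.txt",
    "confident_nonref.txt",
    "confident_nonref_genomeflanks.fa",
    "confident_nonref_reads_list.txt",
    "all_reference.txt",
)


def expected_outputs(sample_name, te_name):
    """Return the expected output file names for one sample and one TE."""
    prefix = f"{sample_name}.{te_name}"
    return [f"{prefix}.{suffix}" for suffix in OUTPUT_SUFFIXES]
```

For example, `expected_outputs("A123", "mping")` includes `A123.mping.confident_nonref.txt`.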
-<h3>
+<p><br></p><hr><br><h3>
<a name="tips">RelocaTE Tips</a>:</h3>
<p>If you have a multi-node cluster you can speed up your RelocaTE run by dividing your fastq files into many smaller files.
@@ -422,7 +439,7 @@
./fastq_split -o split_fq ~/somedir/*fq
</pre>
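The idea behind splitting fastq files can be shown with a short sketch that groups lines into chunks holding a fixed number of reads. It is not the implementation of `fastq_split`. It assumes the common four-lines-per-record fastq layout, and the function name and `reads_per_file` parameter are hypothetical.

```python
from itertools import islice


def split_fastq_records(lines, reads_per_file):
    """Yield lists of lines, each holding at most reads_per_file records.

    Assumes every fastq record occupies exactly four lines.
    """
    if reads_per_file < 1:
        raise ValueError("reads_per_file must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, 4 * reads_per_file))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("input ended in the middle of a fastq record")
        yield chunk
```

Each yielded chunk can be written to its own file. Paired files need to be split with the same `reads_per_file` so that mates stay in matching chunks.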
-<h3>
+<p><br></p><hr><br><h3>
<a name="characterizer">CharacTErizer</a>:</h3>
<pre>
@@ -442,35 +459,40 @@
<ul>
<li>It is suggested that the BAM files are generated by using BWA to align the fastq files to the reference genome <a href="http://sourceforge.net/projects/bio-bwa/files/">download</a> <a href="http://bio-bwa.sourceforge.net/bwa.shtml">Manual</a>.</li>
<li>Example BWA command line</li>
-</ul><p>create an index</p>
-
-<pre>bwa index -a bwtsw genome.fasta</pre>
-
-<p>Align Pair 1 fastq</p>
+</ul><pre>
+#create bwa index file
+bwa index -a bwtsw MSUr7.sample.fa
-<pre>bwa aln genome.fasta read_p1.fastq &gt; aln_p1.sai</pre>
+#Align Pair 1 fastq
+bwa aln MSUr7.sample.fa fq/sample_p1.fq &gt; sample_p1.sai
-<p>align Pair 2 fastq</p>
+#align Pair 2 fastq
+bwa aln MSUr7.sample.fa fq/sample_p2.fq &gt; sample_p2.sai
-<pre>bwa aln genome.fasta read_p2.fastq &gt; aln_p2.sai</pre>
+#generate SAM for paired reads
+bwa sampe MSUr7.sample.fa sample_p1.sai sample_p2.sai fq/sample_p1.fq fq/sample_p2.fq &gt; sample.paired.sam
-<p>generate SAM for paired reads</p>
+#align unpaired
+bwa aln MSUr7.sample.fa fq/sample.unPaired.fq &gt; sample.unPaired.sai
-<pre>bwa sampe genome.fasta aln_p1.sai aln_p2.sai read_p1.fq read_p2.fq &gt; aln_paried.sam</pre>
+#generate SAM for unpaired reads
+bwa samse MSUr7.sample.fa sample.unPaired.sai fq/sample.unPaired.fq &gt; sample.unPaired.sam
-<p>align unparied</p>
+#generate BAM with SAMtools
+samtools view -h -b -S -T MSUr7.sample.fa sample.paired.sam &gt; sample.paired.bam
+samtools view -h -b -S -T MSUr7.sample.fa sample.unPaired.sam &gt; sample.unPaired.bam
-<pre>bwa aln genome.fasta read.fastq &gt; aln.sai</pre>
+#combine BAM
+samtools cat -o sample.bam sample.unPaired.bam sample.paired.bam
-<p>generate SAM for unpaired reads</p>
+#sort BAM with SAMtools
+samtools sort sample.bam sample.sorted
-<pre>bwa samse genome.fasta aln.sai read.fastq &gt; aln.sam</pre>
-
-<p>generate BAM with SAMtools</p>
-
-<pre>samtools view -h -b -S -T genome.fasta aln.sam &gt; aln.bam</pre>
+#index BAM with SAMtools
+samtools index sample.sorted.bam
+</pre>
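To repeat the command sequence above for several samples, the commands can be generated from a genome file name and a sample name. The sketch below only builds the command strings and runs nothing. The function name and the `fq_dir` parameter are hypothetical. The `samtools sort in.bam out_prefix` form is kept as written above, and newer SAMtools releases expect `-o` instead.

```python
import shlex


def bwa_to_bam_commands(genome, sample, fq_dir="fq"):
    """Return the shell commands, in order, that build a sorted, indexed BAM."""
    q = shlex.quote
    g = q(genome)
    p1, p2 = q(f"{fq_dir}/{sample}_p1.fq"), q(f"{fq_dir}/{sample}_p2.fq")
    un = q(f"{fq_dir}/{sample}.unPaired.fq")
    sai1, sai2 = q(f"{sample}_p1.sai"), q(f"{sample}_p2.sai")
    un_sai = q(f"{sample}.unPaired.sai")
    pair_sam, un_sam = q(f"{sample}.paired.sam"), q(f"{sample}.unPaired.sam")
    pair_bam, un_bam = q(f"{sample}.paired.bam"), q(f"{sample}.unPaired.bam")
    bam, prefix = q(f"{sample}.bam"), q(f"{sample}.sorted")
    return [
        f"bwa index -a bwtsw {g}",
        f"bwa aln {g} {p1} > {sai1}",
        f"bwa aln {g} {p2} > {sai2}",
        f"bwa sampe {g} {sai1} {sai2} {p1} {p2} > {pair_sam}",
        f"bwa aln {g} {un} > {un_sai}",
        f"bwa samse {g} {un_sai} {un} > {un_sam}",
        f"samtools view -h -b -S -T {g} {pair_sam} > {pair_bam}",
        f"samtools view -h -b -S -T {g} {un_sam} > {un_bam}",
        f"samtools cat -o {bam} {un_bam} {pair_bam}",
        f"samtools sort {bam} {prefix}",
        f"samtools index {q(sample + '.sorted.bam')}",
    ]
```

Printing the returned list, one command per line, gives a script that can be reviewed before it is run or submitted to a queue.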
-<h3>What does relocaTE.pl actually do?</h3>
+<p><br></p><hr><br><h3>What does relocaTE.pl actually do?</h3>
<ol>
<li>if not already done, creates a bowtie index for the complete reference fasta.</li>
@@ -481,7 +503,7 @@
<li>runs relocaTE_align.pl: aligns the trimmed reads to the reference fasta. a shell script is created if -p 1 and -a 1</li>
<li>runs relocaTE_insertionFinder.pl: one job for every TE for every sequence of the reference fasta. shell scripts and array jobs will be created if -p 1 and -a 1.</li>
<li>concatenates the results of each reference sequence into one file: one job for every TE. shell scripts and array jobs will be created if -p 1 and -a 1.</li>
-</ol><h3>
+</ol><p><br></p><hr><br><h3>
<a name="issue">Report an Issue</a>:</h3>
<p>For any of the listed reasons, or anything else, please leave us a <a href="https://github.com/srobb1/RelocaTE/issues?page=1&amp;sort=comments&amp;state=open">message here</a><br></p>