feat: new gif and other edits
a-frantz committed Mar 30, 2021
1 parent 21775e6 commit 6190da2
Showing 4 changed files with 37 additions and 46 deletions.
83 changes: 37 additions & 46 deletions docs/genomics-platform/workflow-guides/warden/index.md
@@ -42,14 +42,14 @@ LIMMA analysis.
Depending on which entry point is chosen, inputs may be an array of FastQ files, RNA-Seq BAM files, or feature count files.

Each entrypoint has its own input file type, but they all require a similarly formatted "sample sheet" which describes the relationships between samples.
Each WARDEN workflow requires two types of input files and that two or 3 parameters be set manually. All other parameters are preset with reasonable defaults.
Each WARDEN workflow requires an array of input files and a sample sheet, and has two to three parameters which must be set explicitly. All other parameters are preset with reasonable defaults.

| Name | Type | Description | Example |
| ----------------------------------- | ----------- | --------------------------------------------------- | -------------------------------------------------- |
| FastQ files (for WARDEN \[FastQ\]) | Input files | FastQ files generated by RNA-Seq experiment | Sample1.fastq.gz, Sample2.fastq.gz |
| BAM files (for WARDEN \[BAM\]) | Input files | BAM files generated by RNA-Seq experiment | Sample1.bam, Sample2.bam |
| Count files (for WARDEN \[Counts\]) | Input files | Feature count files generated by RNA-Seq experiment | Sample1.htseq_counts.txt, Sample2.htseq_counts.txt |
| Sample sheet (**required**) | Input file | Sample sheet generated and uploaded by the user | \*.txt or \*.xlsx |
| Name | Description | Example |
| ----------------------------------- | --------------------------------------------------- | -------------------------------------------------- |
| FastQ files (for WARDEN \[FastQ\]) | FastQ files generated by RNA-Seq experiment | Sample1.fastq.gz, Sample2.fastq.gz |
| BAM files (for WARDEN \[BAM\]) | BAM files generated by RNA-Seq experiment | Sample1.bam, Sample2.bam |
| Count files (for WARDEN \[Counts\]) | Feature count files generated by RNA-Seq experiment | Sample1.htseq_counts.txt, Sample2.htseq_counts.txt |
| Sample sheet (**required**) | Sample sheet generated and uploaded by the user | \*.txt or \*.xlsx |

### Sample sheet configuration

@@ -80,11 +80,11 @@ Each row in the spreadsheet (except for the last row, which we will talk about i

* The sample name should be unique and should only contain letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
* The condition/phenotype column associates similar samples together. The values should contain only letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
* If using WARDEN [FastQ]:
* If using WARDEN \[FastQ\]:
* The third column should contain forward reads (e.g. `*.R1.fastq.gz` or `*_1.fastq.gz`).
* The fourth column will contain reads in reverse orientation to the FastQ in column three (e.g. `*.R2.fastq.gz` or `*_2.fastq.gz`).
* For single end reads a single dash (`-`) should be entered in the fourth column.
* If using WARDEN [BAM] or WARDEN [Counts]:
* If using WARDEN \[BAM\] or WARDEN \[Counts\]:
* The third column should contain the name of the sample's BAM or counts file.
* The fourth column is ignored and can be safely deleted or left blank.
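
The column rules above can be checked before uploading. The following is a minimal sketch, not WARDEN's own validation: the function names `check_row` and `check_sheet` are hypothetical, and it assumes each row has already been split into a list of column values (sample name, condition, then file columns).

```python
import re

# Letters, numbers, and underscores only, starting with a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_row(row, entry_point="FastQ"):
    """Return a list of problems found in one sample sheet row (hypothetical helper)."""
    if len(row) < 3:
        return ["expected at least three columns"]
    problems = []
    sample, condition, first_file = row[0], row[1], row[2]
    if not NAME_PATTERN.fullmatch(sample):
        problems.append(f"malformed sample name: {sample!r}")
    if not NAME_PATTERN.fullmatch(condition):
        problems.append(f"malformed condition: {condition!r}")
    if not first_file:
        problems.append("third column is empty")
    if entry_point == "FastQ":
        # Paired-end rows name the reverse reads; single-end rows use "-".
        second_file = row[3] if len(row) > 3 else ""
        if not second_file:
            problems.append("fourth column needs reverse reads or '-'")
    return problems


def check_sheet(rows, entry_point="FastQ"):
    """Check every row and confirm that sample names are unique."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        problems += [f"row {number}: {p}" for p in check_row(row, entry_point)]
        if row:
            if row[0] in seen:
                problems.append(f"row {number}: duplicate sample name {row[0]!r}")
            seen.add(row[0])
    return problems
```

An empty list from `check_sheet` means no problems were found under these assumptions. For the BAM and Counts entry points, pass `entry_point="BAM"` or `entry_point="Counts"` so the fourth column is not checked.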

@@ -101,9 +101,9 @@ This line may appear anywhere in the file, but the examples place it at the bott
!!!example
The following lines are all valid examples.

1. `#comparisons=KO-WT`
2. `#comparisons=Condition1-Control,Condition2-Control`
3. `#comparisons=Phenotype2-Phenotype1,Phenotype3-Phenotype2,Phenotype3-Phenotype1`
* `#comparisons=KO-WT`
* `#comparisons=Condition1-Control,Condition2-Control`
* `#comparisons=Phenotype2-Phenotype1,Phenotype3-Phenotype2,Phenotype3-Phenotype1`
!!!
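
The format of the examples above can be read mechanically. This is a small sketch of one way to split such a line into pairs of condition names; `parse_comparisons` and `unknown_conditions` are hypothetical helpers, not part of WARDEN. It relies on condition names containing only letters, numbers, and underscores, so the `-` separator is unambiguous.

```python
def parse_comparisons(line):
    """Split a `#comparisons=` line into (first, second) condition pairs."""
    prefix = "#comparisons="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not a comparisons line")
    pairs = []
    for item in line[len(prefix):].split(","):
        first, sep, second = item.strip().partition("-")
        if not (sep and first and second):
            raise ValueError(f"malformed comparison: {item!r}")
        pairs.append((first, second))
    return pairs


def unknown_conditions(pairs, conditions):
    """Return names used in comparisons that are absent from the condition column."""
    used = {name for pair in pairs for name in pair}
    return sorted(used - set(conditions))
```

Running `unknown_conditions` against the values in the condition column is a quick way to catch a misspelled condition before starting a run.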

!!!note
@@ -145,34 +145,32 @@ Refer to [the general workflow guide](../../analyzing-data/running-sj-workflows/

### Hooking up Inputs

You'll need to hook up the FastQ files, BAM files, or count files (depending on which entrypoint you wish to use) and sample sheet you uploaded in [the upload data section](#uploading-input-files).
Click the `FASTQ_FILES`, `BAM_FILES`, or `COUNT_FILES` input field and select **all** input files listed in your sample sheet. Next, click the `sample_list` input field and select the corresponding sample sheet.
First, in the `Execution Output Folder` field, select a folder to output to. You can structure your experiments however you like (e.g. `/my_outputs`). If left blank, a directory named with the execution ID will be created in order to avoid cluttering your workspace and keep separate runs separate.

![](./inputs-warden-2.gif)
Next, you'll need to hook up the FastQ files, BAM files, or count files (depending on which entrypoint you wish to use) and sample sheet you uploaded in [the upload data section](#uploading-input-files). Click the `FASTQ_FILES`, `BAM_FILES`, or `COUNT_FILES` input field and select **all** input files listed in your sample sheet. Next, click the `sample_list` input field and select the corresponding sample sheet.

Then select the `sequence_strandedness` drop down menu and choose the appropriate option. This information can be determined from the sequencing or source
of the data. If you don't know what to put here, select "Unstranded".

Finally, select the `Genome` pulldown menu, choose the appropriate option, and WARDEN is ready to be run! Continue reading to learn about the available advanced options.

![](./warden-inputs.gif)

### Selecting Parameters

We now need to configure the parameters for the pipeline, such as reference genome and sequencing method. For the general workflow instructions, refer [here](../../analyzing-data/running-sj-workflows#selecting-parameters).

!!!example Parameter setup steps

1. In the `Execution Output Folder` field, select a folder to output to. You can
structure your experiments however you like (e.g. `/My_Outputs`). If left blank, a directory named with the execution ID will be created in order to avoid cluttering your workspace and keep separate runs separate.
2. Select the `sequence_strandedness` from the drop down menu.
This information can be determined from the sequencing or source
of the data. If you don't know what to put here, select "Unstranded".
3. Select the `Genome` pulldown menu. Choose the appropriate box.
4. Options under "Advanced: Run Control" can be enabled or disabled, though `generate_name_sorted_BAM`, `generate_transcriptome_BAM`, [`run_FastQC`](#quality-control-results-fastqc), and [`run_coverage`](#bigWig-viewer) are disabled by default to reduce run time and costs.
5. `STAR_subsample_n_reads` can be used to reduce runtime and run costs. The default of 100,000,000 reads will map the entirety of many samples and is a sufficient number of reads for differential expression analysis. Setting this value to "0" or "-1" will disable subsampling, and map the entirety of all input FastQs. With sufficiently large FastQs, this can take a long time and cost a significant amount of money. Large FastQs also occasionally cause errors in the STAR step. If those are encountered, we recommend re-enabling subsampling or increasing the size of the `star_instance`. **Warning:** a larger STAR instance will incur larger costs.
6. The LIMMA parameters can be left alone for most analyses. If you are
* Options under "Advanced: Run Control" can be enabled or disabled, though `generate_name_sorted_BAM`, `generate_transcriptome_BAM`, [`run_FastQC`](#quality-control-results-fastqc), and [`run_coverage`](#bigwig-viewer) are disabled by default to reduce run time and costs.
* `STAR_subsample_n_reads` can be used to reduce runtime and run costs. The default of 100 million reads will map the entirety of many samples and is a sufficient number of reads for differential expression analysis. Setting this value to "0" or "-1" will disable subsampling, and map the entirety of all input FastQs. With sufficiently large FastQs, this can take a long time and cost a significant amount of money. Large FastQs also occasionally cause errors in the STAR step. If those are encountered, we recommend re-enabling subsampling or increasing the size of the `star_instance`. **Warning:** a larger STAR instance will incur larger costs.
* The LIMMA parameters can be left alone for most analyses. If you are
an advanced LIMMA user, you can change the various settings exposed.
7. If you are interested in a feature besides genes, you should change the `feature_type` and `id_attribute` HTSeq-count parameters. Note that changing from the defaults will disable FPKM calculations. The other options should only be changed by advanced users of HTSeq-count.
8. Similarly STAR parameters should only be changed by advanced users familiar with the STAR aligner. You can read the STAR v2.5.3a manual [here](https://github.com/alexdobin/STAR/blob/2.5.3a/doc/STARmanual.pdf).
9. When all parameters have been set, you're ready to run WARDEN!
* If you are interested in a feature besides genes, you should change the `feature_type` and `id_attribute` HTSeq-count parameters. Note that changing from the defaults will disable FPKM calculations. The other options should only be changed by advanced users of HTSeq-count. The HTSeq documentation can be found [here](https://htseq.readthedocs.io/en/master/count.html).
* Similarly, STAR parameters should only be changed by advanced users familiar with the STAR aligner. You can read the STAR v2.5.3a manual [here](https://github.com/alexdobin/STAR/blob/2.5.3a/doc/STARmanual.pdf).
* When all parameters have been adjusted to your needs, you're ready to run WARDEN!
!!!

![](./parameters-warden-3.gif)

## Summary of Results

Each tool in St. Jude Cloud produces a visualization that makes understanding results more accessible than working with Excel spreadsheets or tab-delimited files. This is the primary way we recommend you work with your results.
@@ -197,8 +195,7 @@ generated. An example can be seen below. These files will be labeled
`mds_plot.limma.png`. For all comparisons, regardless of sample size, an MDS
plot will also be generated with Counts per million (CPM) normalized
gene counts by default. These files will be labeled `mds_plot.norm_cpm.png`.
(Within the DNAnexus output directory structure, these files will be in
the root directory.)
These files will be in the root of the output directory.

![](./mdsPlot.png)

@@ -258,13 +255,12 @@ HTSeq-count files are combined into a file called `combined_counts.htseq.txt`. I

#### Alignment statistics

Several files should be examined initially to determine the quality of
the results. **alignment_statistics.txt** shows alignment statistics for
**alignment_statistics.txt** shows alignment statistics for
all samples. This file is a plain text tab-delimited file that can be
opened in Excel or a text editor such as Notepad++. This file contains
information on the total reads per sample, the percentage of duplicate
reads and the percentage of mapped reads. An example of this file is
below. (Within the DNAnexus output directory structure, this file will be in the `STAR/` folder.)
below. This file will be in the `STAR/` folder.

> ![](./alignmentStatistics.png)
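
Because the file is plain tab-delimited text, it can also be read programmatically. The sketch below uses only Python's standard `csv` module; the function name `read_alignment_statistics` is hypothetical, and it assumes the first line of the file is a header row, which this guide does not state.

```python
import csv


def read_alignment_statistics(path):
    """Read the tab-delimited statistics file into one dict per row.

    Assumes the first line holds the column names (an assumption, not
    something this guide confirms).
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_alignment_statistics("STAR/alignment_statistics.txt")` would return one dictionary per row, keyed by whatever column names the file actually contains. All values come back as strings.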
@@ -278,8 +274,7 @@ Other useful differential expression results will be created. This includes tabu

#### GSEA.input.<*contrast*>.txt and GSEA.tStat.<*contrast*>.txt

Input files that can be used for GSEA analysis. The tStat file is preferred for a more accurate analysis, but will not give a heatmap diagram.
Within the DNAnexus output directory structure, these files will be in the `AUXILIARY/` directory.
Input files that can be used for GSEA analysis. The tStat file is preferred for a more accurate analysis, but will not give a heatmap diagram. These files will be in the `AUXILIARY/` directory.

#### Coverage results

@@ -290,10 +285,7 @@ strandedness, there will be bigWig files labeled,
`*.sortedcoverage_file.bed.bw` where '\*' is the sample name. For
stranded data there will also be `*.sortedPoscoverage_file.bed.bw` and
`*.sortedNegcoverage_file.bed.bw` which contain coverage information
for the positive and negative strand of the genome.

(Within the DNAnexus output directory structure, these files will be in
the `BIGWIG/` directory.)
for the positive and negative strand of the genome. These files will be in the `BIGWIG/` directory.

#### Quality Control Results (FastQC)

@@ -322,8 +314,7 @@ files can be found [here](http://labshare.cshl.edu/shares/gingeraslab/www-data/d
files are labeled `*.Chimeric.out.bam` and
`*.Chimeric.out.junction`.

(Within the DNAnexus output directory structure, `*.SJ.out.tab` files will be in `STAR/TABS`
and the chimeric BAMs and chimeric junction files will be in the `STAR/CHIMERIC/` directory.)
`*.SJ.out.tab` files will be in `STAR/TABS` and chimeric BAMs and chimeric junction files will be in the `STAR/CHIMERIC/` directory.

#### FPKM and count files (per sample)

@@ -335,19 +326,19 @@ the sample name. Counts files will be in the `HTSEQ/` directory, and FPKM files

#### Workflow parameters

`WARDEN_parameters.json` is the full list of parameters, including defaults, that were passed into this run of WARDEN.
`WARDEN_parameters.json` is the full list of parameters, including defaults, that were passed into this run of WARDEN. It can be found in the `AUXILIARY/` folder.
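
One use for this file is comparing the settings of two runs. The sketch below is illustrative only: `load_parameters` and `changed_parameters` are hypothetical helpers, and it assumes the top level of the JSON is an object mapping parameter names to values, which this guide does not confirm.

```python
import json


def load_parameters(path):
    """Load a WARDEN_parameters.json file into a dict."""
    with open(path) as handle:
        return json.load(handle)


def changed_parameters(old, new):
    """Return {name: (old_value, new_value)} for parameters that differ.

    A parameter missing from one run is reported with None on that side.
    """
    names = sorted(set(old) | set(new))
    return {
        name: (old.get(name), new.get(name))
        for name in names
        if old.get(name) != new.get(name)
    }
```

Loading the file from each run's `AUXILIARY/` folder and passing both dicts to `changed_parameters` gives a compact record of what was changed between runs.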

## Rerunning analysis

If you complete a WARDEN run from FastQs or BAM files and wish to change some of the final differential expression parameters, we recommend you use the count files already generated as input to the "WARDEN [Counts]" app. This should save you significant amounts of time and money.
If you complete a WARDEN run from FastQs or BAM files and wish to change some of the final differential expression parameters, we recommend you use the count files already generated as input to the "WARDEN \[Counts\]" app. This should save you significant amounts of time and money.

Similarly, if you started with FastQs and wish to rerun with different parameters to HTSeq-Count, we recommend using the previously generated BAM files as input to the "WARDEN [BAM]" app.
Similarly, if you started with FastQs and wish to rerun with different parameters to HTSeq-Count, we recommend using the previously generated BAM files as input to the "WARDEN \[BAM\]" app.

BAM files and count files are output as soon as they are created (as opposed to only appearing after a successful analysis), so if you find WARDEN has failed for any reason at a later stage, you should be able to use the already output BAMs or count files to skip rerunning the stages which completed successfully.
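
The rerun advice above amounts to picking the latest entry point that still covers the stage you want to change. As a sketch, with hypothetical stage labels (`alignment`, `counting`, `differential_expression`) that are not WARDEN terminology, and with the FastQ case inferred rather than stated in this guide:

```python
# Stage labels are hypothetical; the app names are the ones used in this guide.
ENTRY_POINT_FOR_STAGE = {
    "alignment": "WARDEN [FastQ]",                 # inferred: STAR settings changed
    "counting": "WARDEN [BAM]",                    # HTSeq-count settings changed
    "differential_expression": "WARDEN [Counts]",  # only LIMMA settings changed
}


def rerun_entry_point(changed_stage):
    """Return the app to rerun with, given the earliest stage whose settings changed."""
    try:
        return ENTRY_POINT_FOR_STAGE[changed_stage]
    except KeyError:
        raise ValueError(f"unknown stage: {changed_stage!r}") from None
```

The same mapping applies after a failed run: restart from the entry point matching the first stage that did not complete, using the BAMs or count files already written.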

## Frequently Asked Questions

Source code for the WARDEN apps can be found [here][https://github.com/stjude/WARDEN].
Source code for the WARDEN apps can be found on our [GitHub](https://github.com/stjude/WARDEN).

If you have any questions not covered here, feel free to reach
out on [our contact
Binary file not shown.
Binary file not shown.