feat: talk about feature count file format requirements

stjudecloud · Apr 13, 2021 · 2e114ab
1 parent daf69ea
Showing 1 changed file with 29 additions and 10 deletions.
diff --git a/docs/genomics-platform/workflow-guides/warden/index.md b/docs/genomics-platform/workflow-guides/warden/index.md
@@ -42,15 +42,15 @@ Depending on which entry point is chosen, inputs may include an array of FastQ f
 
 Each WARDEN workflow requires an array of input files, a sample sheet, and has one to three parameters which must be set explicitly. All other parameters are preset with reasonable defaults.
 
-| Required Input                                                | Description                                          | Example                                                        |
-| ------------------------------------------------------------- | ---------------------------------------------------- | -------------------------------------------------------------- |
-| FastQ files (WARDEN \[FastQ\])                                | FastQ files generated by RNA-Seq experiment          | Sample1.fastq.gz, Sample2.fastq.gz                             |
-| BAM files (WARDEN \[BAM\])                                    | BAM files generated by RNA-Seq experiment            | Sample1.bam, Sample2.bam                                       |
-| Count files (WARDEN \[Counts\])                               | Feature count files generated by RNA-Seq experiment  | Sample1.htseq\_counts.txt, Sample2.htseq\_counts.txt           |
-| Sample sheet (all apps)                                       | Sample sheet generated and uploaded by the user      | Sample_sheet.txt or Sample_sheet.xlsx                          |
-| Genome Build (all apps)                                       | Which genome build to use for alignment and analysis | Human\_hg38\_v31, Mouse\_mm10\_v24, etc.                       |
-| Sequencing Strandedness (WARDEN \[FastQ\] and WARDEN \[BAM\]) | Experimental procedure during sequencing             | Unstranded, First strand synthesis, or Second strand synthesis |
-| BAM sort order (WARDEN \[BAM\])                               | BAM file sort order                                  | Name or Position                                               |
+| Required Input                                                | Description                                                                                                  | Example                                                        |
+| ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- |
+| FastQ files (WARDEN \[FastQ\])                                | FastQ files generated by RNA-Seq experiment                                                                  | Sample1.fastq.gz, Sample2.fastq.gz                             |
+| BAM files (WARDEN \[BAM\])                                    | BAM files generated by RNA-Seq experiment                                                                    | Sample1.bam, Sample2.bam                                       |
+| Count files (WARDEN \[Counts\])                               | Feature count files generated by RNA-Seq experiment. [Must have a header line](#feature-counts-file-format). | Sample1.htseq\_counts.txt, Sample2.htseq\_counts.txt           |
+| Sample sheet (all apps)                                       | Sample sheet generated and uploaded by the user                                                              | Sample_sheet.txt or Sample_sheet.xlsx                          |
+| Genome Build (all apps)                                       | Which genome build to use for alignment and analysis                                                         | Human\_hg38\_v31, Mouse\_mm10\_v24, etc.                       |
+| Sequencing Strandedness (WARDEN \[FastQ\] and WARDEN \[BAM\]) | Experimental procedure during sequencing                                                                     | Unstranded, First strand synthesis, or Second strand synthesis |
+| BAM sort order (WARDEN \[BAM\])                               | BAM file sort order                                                                                          | Name or Position                                               |
 
 ### Sample sheet configuration
 
@@ -79,7 +79,7 @@ Each row in the spreadsheet (except for the last row, which we will talk about i
 
 !!!example Guidelines
 
-* The sample name should be unique and should only contain letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
+* The sample name should be unique and should contain only letters, numbers, and underscores. It should start with a letter. WARDEN will attempt to correct malformed names, except in WARDEN \[Counts\] (see [Feature counts file format](#feature-counts-file-format) for more information).
 * The condition/phenotype column associates similar samples together. The values should contain only letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
 * If using WARDEN \[FastQ\]:
   * The third column should contain forward reads (e.g. `*.R1.fastq.gz` or `*_1.fastq.gz`).
@@ -128,6 +128,25 @@ point!
 
 Creating a sample sheet with a text editor is another option. The process of creating a sample sheet with a text editor is the same as creating one with Microsoft Excel, with the small difference that you must create your columns using white-space (spaces or tabs). Save the file with a .txt extension.
 
+### Feature counts file format
+
+WARDEN \[Counts\] requires each feature count file to have a header line that links the file to a sample in the sample sheet.
+
+Each counts file should have a header with two tab-separated entries. The first is a label for the feature column, typically `gene_name`; the exact label does not matter as long as it is the same in every file. The second must be a sample name that appears in the sample sheet. Because of this linkage, WARDEN \[Counts\] will not attempt to correct malformed sample names in the sample sheet; instead, it fails and asks you to fix the names manually so that they conform to the requirements.
+
+After the header, WARDEN expects a format similar to the output of [HTSeq-count](https://htseq.readthedocs.io/en/master/count.html). In short, each row has a feature identifier followed by an integer count, separated by a tab.
+
+Example:
+
+```text
+gene_name ctrl_1
+A1BG      10
+A1BG-AS1  0
+A1CF      4
+A2M       477
+...
+```
+
 ## Creating a workspace
 
 Before you can run one of our workflows, you must first create a workspace in DNAnexus for the run. Refer to [the general workflow guide](../../analyzing-data/running-sj-workflows/#getting-started) to learn how to create a DNAnexus workspace for each workflow run.
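
Note on the new "Feature counts file format" section: the header requirement it describes can be checked locally before uploading. The following is a minimal sketch, not WARDEN's own validation code. The function name `check_counts_text` and the exact error messages are hypothetical, and the sample-name pattern is an assumption based on the guideline that names contain only letters, numbers, and underscores and start with a letter.

```python
import re

# Assumed sample-name rule, taken from the sample sheet guidelines:
# letters, digits, and underscores only, starting with a letter.
SAMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_counts_text(text, sample_names):
    """Return a list of problems found in one counts file (empty if none).

    text: full contents of a counts file.
    sample_names: set of sample names taken from the sample sheet.
    """
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    # The header must have exactly two tab-separated entries.
    header = lines[0].split("\t")
    if len(header) != 2:
        return [f"header has {len(header)} tab-separated entries, expected 2"]

    problems = []
    sample = header[1]
    if not SAMPLE_NAME.fullmatch(sample):
        problems.append(f"sample name {sample!r} is malformed")
    if sample not in sample_names:
        problems.append(f"sample name {sample!r} is not in the sample sheet")

    # Each remaining row: feature identifier, a tab, then a non-negative integer.
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != 2 or not fields[1].isdigit():
            problems.append(f"line {number}: expected feature<TAB>integer count")
    return problems
```

Calling `check_counts_text(open(path).read(), {"ctrl_1", "ctrl_2"})` for each counts file, with the names from the first column of the sample sheet, should return an empty list when the file follows the format described in the new section.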