feat: talk about feature count file format requirements
a-frantz committed Apr 13, 2021
1 parent daf69ea commit 2e114ab
Showing 1 changed file with 29 additions and 10 deletions.
39 changes: 29 additions & 10 deletions docs/genomics-platform/workflow-guides/warden/index.md
@@ -42,15 +42,15 @@ Depending on which entry point is chosen, inputs may include an array of FastQ f

Each WARDEN workflow requires an array of input files and a sample sheet, and has one to three parameters that must be set explicitly. All other parameters are preset with reasonable defaults.

| Required Input | Description | Example |
| ------------------------------------------------------------- | ---------------------------------------------------- | -------------------------------------------------------------- |
| FastQ files (WARDEN \[FastQ\]) | FastQ files generated by RNA-Seq experiment | Sample1.fastq.gz, Sample2.fastq.gz |
| BAM files (WARDEN \[BAM\]) | BAM files generated by RNA-Seq experiment | Sample1.bam, Sample2.bam |
| Count files (WARDEN \[Counts\]) | Feature count files generated by RNA-Seq experiment | Sample1.htseq\_counts.txt, Sample2.htseq\_counts.txt |
| Sample sheet (all apps) | Sample sheet generated and uploaded by the user | Sample_sheet.txt or Sample_sheet.xlsx |
| Genome Build (all apps) | Which genome build to use for alignment and analysis | Human\_hg38\_v31, Mouse\_mm10\_v24, etc. |
| Sequencing Strandedness (WARDEN \[FastQ\] and WARDEN \[BAM\]) | Experimental procedure during sequencing | Unstranded, First strand synthesis, or Second strand synthesis |
| BAM sort order (WARDEN \[BAM\]) | BAM file sort order | Name or Position |
| Required Input | Description | Example |
| ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- |
| FastQ files (WARDEN \[FastQ\]) | FastQ files generated by RNA-Seq experiment | Sample1.fastq.gz, Sample2.fastq.gz |
| BAM files (WARDEN \[BAM\]) | BAM files generated by RNA-Seq experiment | Sample1.bam, Sample2.bam |
| Count files (WARDEN \[Counts\]) | Feature count files generated by RNA-Seq experiment. [Must have a header line](#feature-counts-file-format). | Sample1.htseq\_counts.txt, Sample2.htseq\_counts.txt |
| Sample sheet (all apps) | Sample sheet generated and uploaded by the user | Sample_sheet.txt or Sample_sheet.xlsx |
| Genome Build (all apps) | Which genome build to use for alignment and analysis | Human\_hg38\_v31, Mouse\_mm10\_v24, etc. |
| Sequencing Strandedness (WARDEN \[FastQ\] and WARDEN \[BAM\]) | Experimental procedure during sequencing | Unstranded, First strand synthesis, or Second strand synthesis |
| BAM sort order (WARDEN \[BAM\]) | BAM file sort order | Name or Position |

### Sample sheet configuration

@@ -79,7 +79,7 @@ Each row in the spreadsheet (except for the last row, which we will talk about i

!!!example Guidelines

* The sample name should be unique and should only contain letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
* The sample name should be unique, should contain only letters, numbers, and underscores, and should start with a letter. WARDEN will attempt to correct malformed names, except in WARDEN \[Counts\] (see [Feature counts file format](#feature-counts-file-format) for more information).
* The condition/phenotype column associates similar samples together. The values should contain only letters, numbers, and underscores. They should start with a letter. WARDEN will attempt to correct malformed names.
* If using WARDEN \[FastQ\]:
* The third column should contain forward reads (e.g. `*.R1.fastq.gz` or `*_1.fastq.gz`).
@@ -128,6 +128,25 @@ point!

Creating a sample sheet with a text editor is another option. The process of creating a sample sheet with a text editor is the same as creating one with Microsoft Excel, with the small difference that you must create your columns using white-space (spaces or tabs). Save the file with a .txt extension.

### Feature counts file format

WARDEN \[Counts\] needs feature count files to have a header that can link the files to the information in the sample sheet.

Each counts file should have a header with two tab-separated entries. The first is a label for the features, typically `gene_name`; the exact name doesn't matter as long as it is the same in every file. The second must be a sample name that appears in the sample sheet. Because of this linkage, WARDEN \[Counts\] will not attempt to correct malformed sample names in the sample sheet. Instead, it will fail and ask you to fix the names manually so that they conform to the requirements.

After the header, WARDEN expects a format similar to the output of [HTSeq-count](https://htseq.readthedocs.io/en/master/count.html). In short, each row has a feature identifier followed by an integer count, separated by a tab.

Example:

```text
gene_name ctrl_1
A1BG 10
A1BG-AS1 0
A1CF 4
A2M 477
...
```
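
The requirements above can be checked before uploading. The following is a minimal sketch in Python, not part of WARDEN itself: the function name `check_counts_file` and the `SAMPLE_NAME` pattern are hypothetical, and the set of sample names is assumed to have been read from your sample sheet separately. It checks that the header has two tab-separated entries, that the sample name follows the naming guidelines and appears in the sample sheet, and that each remaining row is a feature identifier followed by an integer count.

```python
import re
from io import StringIO

# Assumed naming rule, taken from the sample sheet guidelines:
# letters, numbers, and underscores only, starting with a letter.
SAMPLE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
INTEGER = re.compile(r"^[0-9]+$")


def check_counts_file(handle, sheet_samples):
    """Return a list of problems found in one counts file (empty if none)."""
    header = handle.readline().rstrip("\n").split("\t")
    if len(header) != 2:
        return [f"header needs 2 tab-separated entries, found {len(header)}"]

    problems = []
    sample = header[1]
    if not SAMPLE_NAME.match(sample):
        problems.append(f"malformed sample name: {sample!r}")
    if sample not in sheet_samples:
        problems.append(f"sample {sample!r} is not in the sample sheet")

    # Data rows start on line 2: feature identifier, tab, integer count.
    for line_no, line in enumerate(handle, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not INTEGER.match(fields[1]):
            problems.append(f"line {line_no}: expected feature, tab, integer count")
    return problems


# Usage with an in-memory file; sample names here are illustrative.
example = StringIO("gene_name\tctrl_1\nA1BG\t10\nA1BG-AS1\t0\n")
print(check_counts_file(example, {"ctrl_1", "treated_1"}))  # prints []
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `open("Sample1.htseq_counts.txt")`.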

## Creating a workspace

Before you can run one of our workflows, you must first create a workspace in DNAnexus for the run. Refer to [the general workflow guide](../../analyzing-data/running-sj-workflows/#getting-started) to learn how to create a DNAnexus workspace for each workflow run.