Skip to content

Commit

Permalink
fix: simpler three prime QuantSeq cutadapt setup (#78)
Browse files Browse the repository at this point in the history
I have tested this on actual QuantSeq data, and the cutadapt setup in
this PR yields mostly the same results that the complicated previous
three step setup recommended by Lexogen gave. It seems to be a bit more
aggressive in the poly-A tail removal, but this should not affect
results much. And it greatly simplifies the workflow setup.
  • Loading branch information
dlaehnemann committed Sep 14, 2023
1 parent 657c465 commit ecc9ab7
Show file tree
Hide file tree
Showing 5 changed files with 98 additions and 116 deletions.
11 changes: 7 additions & 4 deletions .test/3-prime-config/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -120,8 +120,11 @@ params:
# the reverse reads of paired end sequencing:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
cutadapt-se:
adapters: "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
extra: "-q 20"
# This setup is for Lexogen QuantSeq FWD data, based on (but simplfied):
# https://faqs.lexogen.com/faq/what-is-the-adapter-sequence-i-need-to-use-for-t-1
# For more details, see the QuantSeq section in the `config/README.md` file.
adapters: "-a 'r1adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=7;max_error_rate=0.005'"
extra: "--minimum-length 33 --nextseq-trim=20 --poly-a"
# reasoning behind parameters:
# For reads that are produced by 3’-end sequencing, depending on the protocol, it might be recommended to remove some leading bases (e.g. see https://www.nature.com/articles/s41598-019-55434-x#Sec10)
# * `--minimum-length 33`:
Expand All @@ -134,5 +137,5 @@ params:
# * `--minimum-overlap 7`: the cutadapt default minimum overlap of `5` did trimming on the level
# of expected adapter matches by chance
cutadapt-pe:
adapters: "-a ACGGATCGATCGATCGATCGAT -g GGATCGATCGATCGATCGAT -A ACGGATCGATCGATCGATCGAT -G GGATCGATCGATCGATCGAT"
extra: "--minimum-length 33 -e 0.005 --overlap 7"
adapters: ""
extra: ""
72 changes: 64 additions & 8 deletions config/README.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,72 @@
# General settings

To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file.
To configure this workflow, modify the following files to reflect your dataset and differential expression analysis model:
* `config/samples.tsv`: samples sheet with covariates and conditions
* `config/units.tsv`: (sequencing) units sheet with raw data paths
* `config/config.yaml`: general workflow configuration and differential expression model setup

# Sample and unit sheet
## samples sheet

* Add samples to `config/samples.tsv`. For each sample, the column `sample` and any number of covariates and conditions have to be specified.
These columns can be used in the model specification section within `config/config.yaml` to define the differential expression analysis of sleuth, including batch effect modeling.
Effect size estimates are calculated as so-called beta-values by sleuth. For binary comparisons, they resemble a log2 fold change. The primary comparison variable should be encoded in a way such that the group of samples that shall be the numerator in the fold change calculation have the lexicographically greater value. For example, you could simply use values like `0` and `1` to achieve this.
* For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet `config/units.tsv`.
For each unit, provide either one (column `fq1`) or two (columns `fq1`, `fq2`) FASTQ files (these can point to anywhere in your system).
If only one FASTQ file is provided (single-end data), you also need to specify `fragment_len_mean` and `fragment_len_sd`, which should usually be available from your sequencing provider.
For each biological sample, add a line to the sample sheet in `config/samples.tsv`.
The column `sample` is required and gives the sample name.
Additional columns can specify covariates (including batch effects) and conditions.
These columns can then be used in the `diffexp: models:` specification section in `config/config.yaml` (see below)

Missing values can be specified by empty columns or by writing `NA`.

## units sheet

For each sample, add one or more sequencing unit lines (runs, lanes or replicates) to the unit sheet in `config/units.tsv`.
For each unit, provide either of the following:
* The path to two pairead-end FASTQ files in the columns `fq1`, `fq2`.
* The path to a single-end FASTQ file in the column `fq1`.
For single-end data, you also need to specify `fragment_len_mean` and `fragment_len_sd`, which should usually be available from your sequencing provider.

Missing values can be specified by empty columns or by writing `NA`.

## config.yaml

This file contains the general workflow configuration and the setup for the differential expression analysis performed by sleuth.
Configurable options should be explained in the comments above the respective entry or right here in this `config/README.md` section.
If something is unclear, don't hesitate to [file an issue in the `rna-seq-kallisto-sleuth` GitHub repository](https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/issues/new/choose).

### differential expression model setup

The core functionality of this workflow is provided by the software [`sleuth`](https://pachterlab.github.io/sleuth/about).
You can use it to test for differential expression of genes or transcripts between two or more subgroups of samples.

#### main sleuth model

The main idea of sleuth's internal model, is to test a `full:` model (containing (a) variable(s) of interest AND batch effects) against a `reduced:` model (containing ONLY the batch effects).
So these are the most important entries to set up under any model that you specify via `diffexp: models:`.
If you don't know any batch effects, the `reduced:` model will have to be `~1`.
Otherwise it will be the tilde followed by an addition of the names of any columns that contain batch effects, for example: `reduced: ~batch_effect_1 + batch_effect_2`.
The full model than additionally includes variables of interest, so fore example: `full: ~variable_of_interest + batch_effect_1 + batch_effect_2`.

#### sleuth effect sizes

Effect size estimates are calculated as so-called beta-values by `sleuth`.
For binary comparisons (your variable of interest has two factor levels), they resemble a log2 fold change.
To know which variable of interest to use for the effect size calculation, you need to provide its column name as the `primary_variable:`.
And for sleuth to know what level of that variable of interest to use as the base level, specify the respective entry as the `base_level:`.

### Lexogen 3' QuantSeq data analysis

For Lexogen 3' QuantSeq data analysis, please set `experiment: 3-prime-rna-seq: activate: true` in the `config/config.yaml` file.
For more information information on Lexogen QuantSeq 3' sequencing, see: https://www.lexogen.com/quantseq-3mrna-sequencing/
In addition, for Lexogen 3' FWD QuantSeq data, we recommend setting the `params: cutadapt-se:` with:
```
adapters: "-a 'r1adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=7;max_error_rate=0.005'"
extra: "--minimum-length 33 --nextseq-trim=20 --poly-a"
```
This is an adaptation of the [Lexogen read preprocessing recommendations for 3' FWD QuantSeq data](https://faqs.lexogen.com/faq/what-is-the-adapter-sequence-i-need-to-use-for-t-1).
Changes to the recommendations are motivated as follows:
* `-m`: We switched to the easier to read `--minimum-length` and apply this minimum length globally. In addition, we increase the minimum length to a default of 33 that makes more sense for kallisto quantification.
* `-O`: Instead of this option, minimum overlap is specified per expression.
* `-a "polyA=A{20}"`: We replace this with `cutudapt`s dedicated option for `--poly-a` tail removal (which is run after adapter trimming).
* `-a "QUALITY=G{20}"`: We replace this with `cutudapt`s dedicated option for the removal artifactual trailing `G`s in NextSeq and NovaSeq data: `--nextseq-trim=20`.
* `-n`: With the dedicated `cutadapt` options getting applied in the right order, this option is not needed any more.
* `-a "r1adapter=A{18}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=3;max_error_rate=0.100000"`: We remove A{18}, as this is handled by `--poly-a`. We increase `min_overlap` to 7 and set the `max_error_rate` to the Illumina error rate of about 0.005, both to avoid spurious adapter matches being removed.
* `-g "r1adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=20"`: This is not needed any more, as `-a` option will lead to complete removal of read sequence if adapter is found at the start of the read, see: https://cutadapt.readthedocs.io/en/stable/guide.html#rightmost
* `--discard-trimmed`: We omit this, as the `-a` with the adapter sequence will lead to complete read sequence removal if adapter is found at start, and the `--minimum-length` will then discard such empty reads.

1 change: 0 additions & 1 deletion workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@ container: "docker://continuumio/miniconda3"

include: "rules/common.smk"
include: "rules/trim.smk"
include: "rules/trim_3prime.smk"
include: "rules/qc_3prime.smk"
include: "rules/ref.smk"
include: "rules/ref_3prime.smk"
Expand Down
43 changes: 27 additions & 16 deletions workflow/rules/trim.smk
Original file line number Diff line number Diff line change
Expand Up @@ -12,22 +12,33 @@ rule cutadapt_pe:
log:
"results/logs/cutadapt/{sample}-{unit}.log",
wrapper:
"v1.22.0/bio/cutadapt/pe"
"v2.6.0/bio/cutadapt/pe"


if not is_3prime_experiment:
rule cutadapt:
input:
get_fastqs,
output:
fastq="results/trimmed/{sample}-{unit}.fastq.gz",
qc="results/trimmed/{sample}-{unit}.qc.txt",
threads: 8
params:
adapters=config["params"]["cutadapt-se"]["adapters"],
extra=config["params"]["cutadapt-se"]["extra"],
log:
"results/logs/cutadapt/{sample}-{unit}.log",
wrapper:
"v2.6.0/bio/cutadapt/se"


rule cutadapt:
input:
get_fastqs,
output:
fastq="results/trimmed/{sample}-{unit}.fastq.gz",
qc="results/trimmed/{sample}-{unit}.qc.txt",
threads: 8
params:
adapters=config["params"]["cutadapt-se"]["adapters"],
extra=config["params"]["cutadapt-se"]["extra"],
log:
"results/logs/cutadapt/{sample}-{unit}.log",
wrapper:
"v1.22.0/bio/cutadapt/se"
rule max_read_length:
input:
get_all_fastqs,
output:
"results/stats/max-read-length.json",
log:
"logs/max-read-length.log",
conda:
"../envs/pysam.yaml"
script:
"../scripts/get-max-read-length.py"
87 changes: 0 additions & 87 deletions workflow/rules/trim_3prime.smk

This file was deleted.

0 comments on commit ecc9ab7

Please sign in to comment.