cleanup


Nextflow pipeline to preprocess metagenomics reads:

  1. Removal of failed samples (< 1,000 reads) and relabeling of reads by sample name
  2. Host removal (Kraken2, typically against Homo sapiens)
  3. Removal of specific contaminant sequences via bwa mapping (including SARS-CoV-2 and PhiX; optional)
  4. Adapter filtering (fastp; quality filtering is disabled)
  5. Fast profiling (Kraken2) to evaluate fluctuations in unclassified reads
  6. MultiQC report (example)

Philosophy

This is a preprocessing pipeline that aims at minimal loss of information while allowing the reads to be stored on general-purpose storage (where human reads are not allowed).

The fastp step disables all quality filtering, limiting its action to adapter removal and the subsequent discarding of reads that are too short.
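
For reference, this corresponds roughly to a standalone fastp call like the one below; it is a sketch, and the file names and the minimum-length threshold are illustrative, not the pipeline's exact parameters:

# illustrative file names and length threshold; not the pipeline's exact parameters
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o clean_R1.fastq.gz -O clean_R2.fastq.gz \
  --disable_quality_filtering --length_required 50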

Dependencies

The pipeline is written in Nextflow, and its dependencies are available as the Docker container andreatelatin/cleanup:1.3.

A YAML file describing the conda environment with the required tools (fastp, Kraken2, MultiQC) is provided in deps/.
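
If you prefer conda over containers, the environment can be created from that file; a sketch, assuming the file is deps/env.yaml and the environment is named cleanup (the actual names may differ):

# assumes the YAML is deps/env.yaml and defines an environment named "cleanup";
# adjust both to the actual names in deps/
conda env create -f deps/env.yaml
conda activate cleanup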

Databases

The pipeline requires two databases: a host database used to remove matching sequences, and a generic database used for profiling.

Default databases can be downloaded with:

nextflow run telatin/cleanup -entry getdb --dbdir /path/to/databases/

This will download the host database and the gutcheck profiling database (see below).
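
The downloaded databases can then be supplied to the pipeline. A sketch, using hypothetical subdirectory names (check the actual directory names created under --dbdir):

# "host/" and "gutcheck/" are hypothetical names: check the actual layout under --dbdir
nextflow run telatin/cleanup --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb /path/to/databases/host/ --krakendb /path/to/databases/gutcheck/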

Host database

A custom database with a masked/filtered human genome (split by chromosome), PhiX and SARS-CoV-2 is available from Zenodo (see databases).

Alternatively, a plain human database can be downloaded as follows (RefSeq version of GRCh38.p13):

curl -L -o kraken2_human_db.tar.gz https://ndownloader.figshare.com/files/23567780
tar -xzvf kraken2_human_db.tar.gz
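
As an optional sanity check, the extracted database can be inspected with kraken2-inspect (shipped with Kraken2); a sketch, assuming the archive unpacks to kraken2_human_db/:

# prints a summary of the database contents; the directory name is assumed
kraken2-inspect --db ./kraken2_human_db/ | head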

Profiling database

The preliminary profiling is used to flag abnormal fractions of unclassified reads.

A minimal database that detects common gut microbial species using < 8 GB of RAM is available from Zenodo (see databases).

Other profiling databases

Two valid options for general-purpose profiling are:

💡 To download the human database and the standard database capped at 8 GB, you can run the bash bin/utils/download.sh script.

Usage

nextflow run main.nf  --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb $DB/kraken2_human/ --krakendb $DB/std16/ [--contaminants contam.fa] [-profile docker]

Notable options:

  • --project STR: project name for the report
  • --saveraw: save reads after host removal but prior to FASTP filtering [default: false]
  • --savehost: save the reads flagged as host [default: false]
  • Reads are relabeled as SampleID-number (e.g. Sample1-1). This behaviour can be changed with the following options (see the example after this list):
    • --separator STR: the separator between sample name and read number [default: -]
    • --tag1 STR: the tag for read 1, for example /1 [default: none]
    • --tag2 STR: the tag for read 2, for example /2 [default: none]
  • --contaminants FASTA: also filter against a FASTA file [⚠️ experimental]
  • --denovo: enable de novo assembly [⚠️ experimental]
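
For example, a sketch of a run combining the relabeling options above, which would label read pairs as Sample1_1/1 and Sample1_1/2 (paths and databases as in the usage example):

nextflow run main.nf --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb $DB/kraken2_human/ --krakendb $DB/std16/ \
   --separator _ --tag1 /1 --tag2 /2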

Profiles:

  • -profile docker: use a Docker container for dependencies (fetched from Docker Hub)
  • -profile singularity: use a Singularity image for dependencies (built from the Docker Hub image)
  • -profile test: test the pipeline with minimal reads and databases (requires 8 cores, 16 GB RAM)
  • -profile nbi,slurm: use the default locations on the NBI cluster and the SLURM scheduler
  • -profile nbi --max_cpus INT --max_memory INT.GB: use the local resources of a QIB Virtual Machine
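
For instance, a minimal smoke test of the pipeline (a sketch, assuming the test profile provides its own reads and databases as described above):

# the test profile bundles minimal inputs; requires 8 cores and 16 GB RAM
nextflow run main.nf -profile test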

Output directory

The output directory contains a MultiQC report and the following subdirectories:

  • reads: the main output: the final reads, free of adapters and host contamination
  • host-reads: FASTQ files with the host reads (files can be empty; requires --savehost)
  • raw-reads: FASTQ files after host removal (requires --saveraw)
  • kraken: Kraken2 reports with the classification against the selected database
  • pipeline_info: execution report and timeline
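
As a sketch, the layout looks roughly like this (the top-level directory name and the report file name are illustrative):

outdir/
├── multiqc_report.html
├── reads/
├── host-reads/
├── raw-reads/
├── kraken/
└── pipeline_info/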

Example logs and running output

[Screenshot: Cleanup Pipeline example run]