cleanup


Nextflow pipeline to preprocess metagenomics reads:

  1. Removal of failed samples (< 1,000 reads) and relabeling of reads by sample name
  2. Host removal (Kraken2, typically against Homo sapiens)
  3. Removal of specific contaminant sequences via bwa mapping (including SARS-CoV-2 and PhiX; optional)
  4. Adapter filtering (fastp; quality filtering is disabled)
  5. Fast profiling (Kraken2) to evaluate fluctuations in unclassified reads
  6. MultiQC report (example)

Philosophy

This is a preprocessing pipeline that aims at minimal loss of information while allowing the reads to be stored on general-purpose storage (where human reads are not allowed).

The fastp step disables all quality filtering, limiting its action to adapter removal and the subsequent discarding of reads that are too short.
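
For reference, this corresponds roughly to a standalone fastp call like the one below; it is a sketch, and the file names and the minimum-length threshold are illustrative, not the pipeline's exact parameters:

# illustrative file names and length threshold; not the pipeline's exact parameters
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o clean_R1.fastq.gz -O clean_R2.fastq.gz \
  --disable_quality_filtering --length_required 50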

Dependencies

The pipeline is written in Nextflow, and its dependencies are available as the Docker container andreatelatin/cleanup:1.3.

A YAML file describing the conda environment with the required tools (fastp, Kraken2, MultiQC) is provided in deps/.
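
If you prefer conda over containers, the environment can be created from that file; a sketch, assuming the file is deps/env.yaml and the environment is named cleanup (the actual names may differ):

# assumes the YAML is deps/env.yaml and defines an environment named "cleanup";
# adjust both to the actual names in deps/
conda env create -f deps/env.yaml
conda activate cleanup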

Databases

The pipeline requires two databases: a host database used to remove matching sequences, and a generic database used for profiling.

Default databases can be downloaded with:

nextflow run telatin/cleanup -entry getdb --dbdir /path/to/databases/

This will download the host database and the gutcheck profiling database (see below).
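
The downloaded databases can then be supplied to the pipeline. A sketch, using hypothetical subdirectory names (check the actual directory names created under --dbdir):

# "host/" and "gutcheck/" are hypothetical names: check the actual layout under --dbdir
nextflow run telatin/cleanup --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb /path/to/databases/host/ --krakendb /path/to/databases/gutcheck/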

Host database

A custom database with a masked/filtered human genome (split by chromosome), PhiX and SARS-CoV-2 is available from Zenodo (see databases).

Alternatively, a plain human database can be downloaded as follows (RefSeq version of GRCh38.p13):

curl -L -o kraken2_human_db.tar.gz https://ndownloader.figshare.com/files/23567780
tar -xzvf kraken2_human_db.tar.gz
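
As an optional sanity check, the extracted database can be inspected with kraken2-inspect (shipped with Kraken2); a sketch, assuming the archive unpacks to kraken2_human_db/:

# prints a summary of the database contents; the directory name is assumed
kraken2-inspect --db ./kraken2_human_db/ | head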

Profiling database

The preliminary profiling is used to flag abnormal fractions of unclassified reads.

A minimal database that detects common gut microbial species using < 8 GB of RAM is available from Zenodo (see databases).

Other profiling databases

Two valid options for general-purpose profiling are:

💡 To download the human database and the standard database capped at 8 GB, you can run the bash bin/utils/download.sh script.

Usage

nextflow run main.nf  --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb $DB/kraken2_human/ --krakendb $DB/std16/ [--contaminants contam.fa] [-profile docker]

Notable options:

  • --project STR: project name for the report
  • --saveraw: save reads after host removal but prior to FASTP filtering [default: false]
  • --savehost: save the reads flagged as host [default: false]
  • Reads are relabeled as SampleID-number (e.g. Sample1-1). This behaviour can be changed with the following options (see the example after this list):
    • --separator STR: the separator between sample name and read number [default: -]
    • --tag1 STR: the tag for read 1, for example /1 [default: none]
    • --tag2 STR: the tag for read 2, for example /2 [default: none]
  • --contaminants FASTA: also filter against a FASTA file [⚠️ experimental]
  • --denovo: enable de novo assembly [⚠️ experimental]
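
For example, a sketch of a run combining the relabeling options above, which would label read pairs as Sample1_1/1 and Sample1_1/2 (paths and databases as in the usage example):

nextflow run main.nf --reads 'data/*_R{1,2}.fastq.gz' \
   --hostdb $DB/kraken2_human/ --krakendb $DB/std16/ \
   --separator _ --tag1 /1 --tag2 /2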

Profiles:

  • -profile docker: use a Docker container for dependencies (fetched from Docker Hub)
  • -profile singularity: use a Singularity image for dependencies (built from the Docker Hub image)
  • -profile test: test the pipeline with minimal reads and databases (requires 8 cores, 16 GB RAM)
  • -profile nbi,slurm: use the default locations on the NBI cluster and the SLURM scheduler
  • -profile nbi --max_cpus INT --max_memory INT.GB: use the local resources of a QIB Virtual Machine
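
For instance, a minimal smoke test of the pipeline (a sketch, assuming the test profile provides its own reads and databases as described above):

# the test profile bundles minimal inputs; requires 8 cores and 16 GB RAM
nextflow run main.nf -profile test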

Output directory

The output directory contains a MultiQC report and the following subdirectories:

  • reads: the main output: the final reads, free of adapters and host contamination
  • host-reads: FASTQ files with the host reads (files can be empty; requires --savehost)
  • raw-reads: FASTQ files after host removal (requires --saveraw)
  • kraken: Kraken2 reports with the classification against the selected database
  • pipeline_info: execution report and timeline
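
As a sketch, the layout looks roughly like this (the top-level directory name and the report file name are illustrative):

outdir/
├── multiqc_report.html
├── reads/
├── host-reads/
├── raw-reads/
├── kraken/
└── pipeline_info/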

Example logs and running output

[Screenshot: Cleanup Pipeline example run]