Serka-M/mmlong2: Bioinformatics pipeline for recovery and analysis of metagenome-assembled genomes

Automated long-read metagenomics workflow, using either PacBio HiFi or Nanopore sequencing reads as input to generate characterized MAGs. The mmlong2 workflow is a continuation of mmlong.

Workflow description

Core features

Snakemake workflow running dependencies from a Singularity container for enhanced reproducibility
Bioinformatics tool and parameter optimizations for high complexity metagenomics samples
Circular prokaryotic genome extraction as separate genome bins
Eukaryotic contig removal for reduced prokaryotic genome contamination
Differential coverage support for improved prokaryotic genome recovery
Iterative ensemble binning strategy for improved prokaryotic genome recovery
Genome quality classification according to MIMAG guidelines
Expanded prokaryotic genome quality assessment, including microdiversity approximation and chimerism checks
Taxonomic classification at prokaryotic genome, contig and 16S rRNA levels
Generation of analysis-ready dataframes at genome bin and contig levels

Schematic overview

Installation

Bioconda

The recommended way of installing mmlong2 is by setting up a Conda environment through Bioconda:

mamba install -c bioconda mmlong2

From source (Conda)

A local Conda environment with the latest workflow code can also be created by using the following code:

mamba create --prefix mmlong2 -c conda-forge -c bioconda snakemake=8.2.3 singularity=3.8.6 zenodo_get pv pigz tar yq ncbi-amrfinderplus -y
mamba activate ./mmlong2 || source activate ./mmlong2
git clone https://github.com/Serka-M/mmlong2 mmlong2/repo
cp -r mmlong2/repo/src/* mmlong2/bin
chmod +x mmlong2/bin/mmlong2
mmlong2 -h

Databases and bioinformatics software

Bioinformatics tools and other software dependencies will be automatically installed when running the workflow for the first time. By default, a pre-built Singularity container will be downloaded and set up, although pre-defined Conda environments can also be used by running the workflow with the --conda_envs_only setting.

To acquire prokaryotic genome taxonomy and annotation results, databases are necessary and can be automatically installed by running the following command:

mmlong2 --install_databases

If some of the databases are already installed, they can also be used by the workflow without downloading (e.g. --database_gtdb option). Alternatively, a guide for manual database installation is also provided.
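
As a rough sketch, the database installation can be pointed at a dedicated directory with the --database_dir option from the usage listing. The directory name below is a placeholder chosen for illustration, and the guard around the call is only there so the snippet does nothing harmful when mmlong2 is not on the PATH.

```shell
# Hypothetical location for the databases; pick any directory with enough space.
db_dir="$PWD/mmlong2_databases"
mkdir -p "$db_dir"

if command -v mmlong2 >/dev/null 2>&1; then
  # Install missing databases into the chosen directory.
  mmlong2 --install_databases --database_dir "$db_dir"
else
  echo "mmlong2 not found on PATH; activate the Conda environment first"
fi
```

Without --database_dir, the databases are installed into the current working directory, as stated in the usage listing.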

Running mmlong2

Usage example

mmlong2 -np nanopore_reads.fastq.gz -o output_dir -p 100

Full usage

MAIN INPUTS:
-np     --nanopore_reads        Path to Nanopore reads
-pb     --pacbio_reads          Path to PacBio HiFi reads
-o      --output_dir            Output directory name (default: mmlong2)
-p      --processes             Number of processes/multi-threading (default: 3)

OPTIONAL SETTINGS:
-db     --install_databases     Install missing databases used by the workflow
-dbd    --database_dir          Output directory for database installation (default: current working directory)
-cov    --coverage              CSV dataframe for differential coverage binning (e.g. NP/PB/IL,/path/to/reads.fastq)
-run    --run_until             Run pipeline until a specified stage completes (e.g.  assembly polishing filtering singletons coverage binning taxonomy annotation extraqc stats)
-tmp    --temporary_dir         Directory for temporary files (default: current working directory)
-dbg    --use_metamdbg          Use metaMDBG for assembly of PacBio reads (default: use metaFlye)
-med    --medaka_model          Medaka polishing model (default: r1041_e82_400bps_sup_v5.0.0)
-mo     --medaka_off            Do not run Medaka polishing with Nanopore assemblies (default: use Medaka)
-vmb    --use_vamb              Use VAMB for binning (default: use GraphMB)
-sem    --semibin_model         Binning model for SemiBin (default: global)
-mlc    --min_len_contig        Minimum assembly contig length (default: 3000)
-mlb    --min_len_bin           Minimum genomic bin size (default: 250000)
-rna    --database_rrna         16S rRNA database to use
-gunc   --database_gunc         Gunc database to use
-bkt    --database_bakta        Bakta database to use
-kj     --database_kaiju        Kaiju database to use
-gtdb   --database_gtdb         GTDB-tk database to use
-h      --help                  Print help information
-v      --version               Print workflow version number

ADVANCED SETTINGS:
-fmo    --flye_min_ovlp         Minimum overlap between reads used by Flye assembler (default: auto)
-fmc    --flye_min_cov          Minimum initial contig coverage used by Flye assembler (default: 3)
-env    --conda_envs_only       Use conda environments instead of container (default: use container)
-n      --dryrun                Print summary of jobs for the Snakemake workflow
-t      --touch                 Touch Snakemake output files
-r1     --rule1                 Run specified Snakemake rule for the MAG production part of the workflow
-r2     --rule2                 Run specified Snakemake rule for the MAG processing part of the workflow
-x1     --extra_inputs1         Extra inputs for the MAG production part of the Snakemake workflow
-x2     --extra_inputs2         Extra inputs for the MAG processing part of the Snakemake workflow
-xb     --extra_inputs_bakta    Extra inputs (comma-separated) for MAG annotation using Bakta
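
A few of the options above can be combined as in the sketch below. The read file names, output directory names, process count and the run_or_show helper are hypothetical; the helper only prints the command when mmlong2 is not installed. The sketch assumes that -dbg, -mo and -n are switches that take no value, which the usage text suggests but does not state outright.

```shell
# Helper (hypothetical, not part of mmlong2): run the command if its program
# is available, otherwise print it so the example stays harmless.
run_or_show() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# PacBio HiFi input, metaMDBG assembly, stop after the binning stage,
# and only print the Snakemake job summary (-n).
run_or_show mmlong2 -pb pacbio_reads.fastq.gz -o mmlong2_pb -p 32 -dbg -run binning -n

# Nanopore input without Medaka polishing and a longer minimum contig length,
# again as a dry run.
run_or_show mmlong2 -np nanopore_reads.fastq.gz -o mmlong2_np -p 32 -mo -mlc 5000 -n
```

Dropping -n from either command would start the actual workflow.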

Using differential coverage binning

To perform genome recovery with differential coverage, prepare a two-column, comma-separated dataframe that lists, for each additional read set, the read type (NP for Nanopore, PB for PacBio, IL for short reads) and the path to the read file.
Dataframe example:

PB,/path/to/your/reads/file1.fastq
NP,/path/to/your/reads/file2.fastq
IL,/path/to/your/reads/file3.fastq.gz

The prepared dataframe can be provided to the workflow through the -cov option.
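
One way to prepare and sanity-check such a dataframe is sketched below. The file name coverage.csv is an arbitrary choice, the read paths are the placeholders from the example above, and the awk check is an illustrative addition rather than part of the workflow.

```shell
# Write the two-column coverage dataframe (read type, read file path).
# The paths are placeholders; replace them with real read files.
cat > coverage.csv <<'EOF'
PB,/path/to/your/reads/file1.fastq
NP,/path/to/your/reads/file2.fastq
IL,/path/to/your/reads/file3.fastq.gz
EOF

# Check that every line has exactly two comma-separated fields and that the
# first field is NP, PB or IL. Offending lines are printed and awk exits 1.
awk -F',' '
  NF != 2 || $1 !~ /^(NP|PB|IL)$/ { print "bad line " NR ": " $0; bad = 1 }
  END { exit bad + 0 }
' coverage.csv && echo "coverage.csv looks well-formed"

# The file can then be passed to the workflow (not run here):
# mmlong2 -np nanopore_reads.fastq.gz -o output_dir -p 100 -cov coverage.csv
```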

Overview of workflow results

<output_name>_assembly.fasta - assembled and polished metagenome
<output_name>_16S.fa - 16S rRNA sequences, recovered from the polished metagenome
<output_name>_bins.tsv - per-bin results dataframe
<output_name>_contigs.tsv - per-contig results dataframe
<output_name>_general.tsv - workflow result summary as a single row dataframe
dependencies.csv - list of dependencies used and their versions
bins - directory for metagenome assembled genomes
bakta - directory containing bin annotation results from Bakta
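
The list above can be turned into a quick completeness check, sketched here as a small shell function. The function name check_outputs is hypothetical, and the directory that holds the results is an assumption, since its location inside the output directory is not stated here; adjust both arguments to your own run.

```shell
# Report which of the listed result files and directories are missing.
# $1: directory that holds the results, $2: the <output_name> prefix.
check_outputs() {
  dir=$1
  name=$2
  missing=0
  for f in "${name}_assembly.fasta" "${name}_16S.fa" "${name}_bins.tsv" \
           "${name}_contigs.tsv" "${name}_general.tsv" \
           dependencies.csv bins bakta; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call; both arguments are placeholders.
check_outputs "output_dir" "output_name" || echo "some expected outputs are missing"
```

The function returns 0 only when every listed item exists, so it can also be used as a guard before downstream analysis scripts.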

Additional documentation

Future improvements

Suggestions for improving the workflow and bug reports are always welcome.
To provide feedback, please use the GitHub Issues section or send an e-mail to mase@bio.aau.dk.
