Automated strain separation of low-complexity metagenomes

License: MIT

Strainberry is a method that performs strain separation in low-complexity metagenomes using error-prone long reads. It combines state-of-the-art tools for variant calling, haplotype phasing, and genome assembly to produce single-sample strain assemblies of higher quality than other state-of-the-art long-read assemblers.

System requirements

Strainberry has been developed and tested in a Linux environment. Installing and running it requires the following tools:

  • GNU bash (version 4 or later recommended)
  • miniconda3

Installation

The simplest (and recommended) way to install Strainberry dependencies is through conda in an isolated environment (e.g., named sberry):

git clone https://github.com/rvicedomini/strainberry.git
cd strainberry
conda env create -n sberry --file environment.yml

The whole installation process should take about 5-10 minutes.

To make the strainberry command available, we advise including the absolute path of Strainberry's directory in your PATH environment variable by putting the following line in your ~/.bashrc file:

export PATH=/absolute/path/to/strainberry:${PATH}

Updating to the latest version

Assuming the Strainberry conda environment is named sberry, the following commands update Strainberry to the latest (development) version:

cd strainberry
git pull
conda env create -n sberry --file environment.yml --force

Usage

Activate the Strainberry conda environment:

conda activate sberry

Running Strainberry:

strainberry [options] -r FASTA -b BAM -o OUTPUT_DIR

where FASTA is a strain-oblivious metagenome assembly (e.g., generated with metaFlye) and BAM is a coordinate-sorted long-read alignment in BAM format. Both FASTA and BAM files are expected to be indexed with the samtools faidx and samtools index commands, respectively. Strainberry's output is stored in OUTPUT_DIR.
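The indexing requirement can be checked before launching a run. The following is a minimal sketch and not part of Strainberry: the helper name sberry_missing_indexes is hypothetical, and it assumes the indexes sit next to the input files under the default names written by samtools (FASTA.fai for the assembly, BAM.bai or the .bam suffix replaced by .bai for the alignment).

```shell
# Hypothetical helper (not part of Strainberry): print the samtools command
# needed for each missing index; return 1 if any index is missing, 0 otherwise.
sberry_missing_indexes() {
    local fasta="$1" bam="$2" missing=0
    # samtools faidx writes FASTA.fai next to the FASTA file
    if [ ! -s "${fasta}.fai" ]; then
        echo "samtools faidx ${fasta}"
        missing=1
    fi
    # samtools index writes FILE.bam.bai; FILE.bai is accepted here as well
    if [ ! -s "${bam}.bai" ] && [ ! -s "${bam%.bam}.bai" ]; then
        echo "samtools index ${bam}"
        missing=1
    fi
    return "${missing}"
}
```

Calling sberry_missing_indexes assembly.fasta alignment.sorted.bam prints nothing when both indexes are present. Otherwise it lists the commands still to be run.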

After Strainberry finishes, the conda environment can be deactivated with the command:

conda deactivate

Output files

The output directory of Strainberry has the following structure:

OUTPUT_DIR/
├── strainberry_n2/
├── strainberry_n3/
├── ...
├── strainberry_nK/
├── assembly.scaffolds.bam
├── assembly.scaffolds.bam.bai
├── assembly.scaffolds.fa
└── assembly.scaffolds.fa.fai

The Strainberry output assembly is stored in the assembly.scaffolds.fa file. A minimap2-based alignment of the input reads to the output assembly is also available in the assembly.scaffolds.bam file.

All sub-directories named strainberry_n{k} contain intermediate results of Strainberry iterations (e.g., the VCF of variants found by mapping reads to contigs) and log files. After a successful run of Strainberry, they can be deleted.
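The cleanup of intermediate results could be sketched as follows. This helper is hypothetical and not shipped with Strainberry. It treats the presence of a non-empty assembly.scaffolds.fa as the sign of a successful run, which is an assumption made for this sketch.

```shell
# Hypothetical helper (not part of Strainberry): delete the strainberry_n{k}
# sub-directories of an output directory, but only if the final assembly exists.
sberry_clean_intermediates() {
    local out_dir="$1"
    if [ ! -s "${out_dir}/assembly.scaffolds.fa" ]; then
        echo "no final assembly in ${out_dir}; keeping intermediate results" >&2
        return 1
    fi
    # The glob matches strainberry_n2, strainberry_n3, ...; the trailing slash
    # restricts it to directories. If nothing matches, rm -f ignores the pattern.
    rm -rf -- "${out_dir}"/strainberry_n*/
}
```

Since the log files live in the same sub-directories, it may be worth copying them elsewhere before calling sberry_clean_intermediates OUTPUT_DIR.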

Example

To verify that Strainberry has been installed correctly, you can test it on the small dataset in the example sub-directory.

Generating the input from the reads (optional)

To generate a strain-oblivious assembly and a read alignment, we recommend using metaFlye, minimap2, and samtools. Assuming these tools are available, the following commands can be run, using 12 threads:

cd example
flye --meta --pacbio-raw reads.fq.gz --out-dir flye_out --genome-size 300k --threads 12
minimap2 -ax map-pb -t 12 ./flye_out/assembly.fasta reads.fq.gz | samtools sort >./flye_out/alignment.sorted.bam
samtools faidx ./flye_out/assembly.fasta
samtools index ./flye_out/alignment.sorted.bam

where --genome-size provides an estimate of the metagenome size to metaFlye (not required from version 2.8 onward). The assembly and read alignment are then available in the flye_out directory as assembly.fasta and alignment.sorted.bam, respectively.

Running Strainberry

Given a strain-oblivious assembly (file assembly.fasta) and a long-read mapping (file alignment.sorted.bam), it is possible to run Strainberry using 4 threads as follows:

cd example
strainberry -r assembly.fasta -b alignment.sorted.bam -o sberry_out -c 4

Strainberry should take around 5 minutes to finish. The file assembly.fasta contains a single sequence which is a consensus of a small region of E. coli strains K12 and W. After a successful run of Strainberry, in the sberry_out directory, the file assembly.scaffolds.fa should contain two scaffolds (one closer to strain K12, the other closer to strain W).
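One way to check the result is to count the sequences in the output FASTA file. The function name count_fasta_records is made up for this sketch. It only assumes that every sequence starts with a header line beginning with '>'.

```shell
# Count the sequences in a FASTA file: one header line starting with '>' each.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

If the example ran as described above, count_fasta_records sberry_out/assembly.scaffolds.fa should report two scaffolds.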

Command line options

usage: strainberry -r PATH -b PATH -o PATH [--nanopore] [-n int] [-s float]
                   [-c int] [-h] [-V] [-v]

Automated strain separation of low-complexity metagenomes

Required arguments:
  -r PATH, --reference PATH
                        Strain-oblivious assembly in FASTA format
  -b PATH, --bam PATH   Read alignment in BAM format
  -o PATH, --out-dir PATH
                        Output directory of Strainberry assemblies

Optional arguments:
  --nanopore            Input consists of Oxford Nanopore reads
  -n int, --max-strains int
                        Attempt strain-separation at most for the provided
                        strain multiplicity [5]
  -s float, --snv-density float
                        Minimum SNV percentage to consider haplotype blocks
                        [0.1]
  -c int, --cpus int    Maximum number of CPUs to be used [1]

Other arguments:
  -h, --help            Show this help message and exit
  -V, --version         Show version number and exit
  -v, --verbose         Verbose output
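As an illustration of the --nanopore option, the example pipeline could be adapted to Oxford Nanopore reads roughly as follows. This is a sketch under assumptions, not a tested recipe from the Strainberry authors: the function name run_nanopore_example is hypothetical, the flye_out and sberry_out directory names are reused from the example above, and the Nanopore presets of Flye (--nano-raw) and minimap2 (map-ont) are swapped in for the PacBio ones. It also assumes Flye 2.8 or later, so that --genome-size can be omitted.

```shell
# Sketch of a Nanopore run; assumes flye (>= 2.8), minimap2, samtools and
# strainberry are on the PATH, and that the first argument is a Nanopore read file.
run_nanopore_example() {
    local reads="$1" threads="${2:-4}"
    # Strain-oblivious assembly with metaFlye, using the Nanopore preset
    flye --meta --nano-raw "${reads}" --out-dir flye_out --threads "${threads}" &&
    # Map the reads back to the assembly and sort the alignment by coordinate
    minimap2 -ax map-ont -t "${threads}" flye_out/assembly.fasta "${reads}" \
        | samtools sort -o flye_out/alignment.sorted.bam - &&
    # Index both inputs, as expected by Strainberry
    samtools faidx flye_out/assembly.fasta &&
    samtools index flye_out/alignment.sorted.bam &&
    # Strain separation, telling Strainberry that the reads are Nanopore
    strainberry -r flye_out/assembly.fasta -b flye_out/alignment.sorted.bam \
        -o sberry_out --nanopore -c "${threads}"
}
```

For instance, run_nanopore_example reads.fq.gz 12 would run the whole chain with 12 threads. Other options such as -n or -s can be appended to the strainberry call as listed in the help message above.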

Reference

If you use Strainberry in your work, please cite:

R. Vicedomini, C. Quince, A. E. Darling, R. Chikhi, Strainberry: automated strain separation in low-complexity metagenomes using long reads, Nature Communications (2021). doi: 10.1038/s41467-021-24515-9