Strainberry is a method that performs strain separation in low-complexity metagenomes using error-prone long reads. It combines state-of-the-art tools for variant calling, haplotype phasing, and genome assembly to achieve single-sample assembly of strains with higher quality than other state-of-the-art long-read assemblers.
Strainberry has been developed and tested in a Linux environment. Installing and running it requires:
- GNU bash (version 4 or later recommended)
- miniconda3
The simplest (and recommended) way to install Strainberry dependencies is through conda in an isolated environment (e.g., named sberry):
git clone https://github.com/rvicedomini/strainberry.git
cd strainberry
conda env create -n sberry --file environment.yml
The whole installation process should take about 5-10 minutes.
To make the strainberry command available, add the absolute path of Strainberry's directory to your PATH environment variable by appending the following line to your ~/.bashrc file:
export PATH=/absolute/path/to/strainberry:${PATH}
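After opening a new shell, it can be useful to confirm that the directory actually ended up in PATH. The following is a minimal sketch; dir_in_path is a hypothetical helper name, not something Strainberry provides.

```shell
# dir_in_path is a hypothetical helper, not part of Strainberry.
# It succeeds when the given directory is one of the entries of PATH.
dir_in_path() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;   # exact match on a colon-delimited entry
        *)        return 1 ;;
    esac
}
```

For example, dir_in_path /absolute/path/to/strainberry && echo "found" prints "found" only when that exact path is listed in PATH.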
Assuming the Strainberry conda environment is named sberry, the following commands update Strainberry to the latest (development) version:
cd strainberry
git pull
conda env create -n sberry --file environment.yml --force
Activate the Strainberry conda environment:
conda activate sberry
Running Strainberry:
strainberry [options] -r FASTA -b BAM -o OUTPUT_DIR
where FASTA is a strain-oblivious metagenome assembly (e.g., generated with metaFlye) and BAM is a coordinate-sorted long-read alignment in BAM format.
Both FASTA and BAM files are expected to be indexed with the samtools faidx and samtools index commands, respectively.
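Since both inputs must be indexed, a quick check before launching a run can save time. The sketch below assumes the default index names written by samtools (FASTA.fai and BAM.bai); check_indexes is a hypothetical helper, not part of Strainberry.

```shell
# check_indexes is a hypothetical helper, not part of Strainberry.
# It reports which index files are missing and returns 1 if any is absent.
# Assumption: indexes use the default samtools names (.fai and .bai suffixes).
check_indexes() {
    local fasta="$1" bam="$2" status=0
    if [ ! -s "${fasta}.fai" ]; then
        echo "missing FASTA index: run 'samtools faidx ${fasta}'" >&2
        status=1
    fi
    if [ ! -s "${bam}.bai" ]; then
        echo "missing BAM index: run 'samtools index ${bam}'" >&2
        status=1
    fi
    return "$status"
}
```

The helper only inspects the file system; it does not create the indexes or verify that they are up to date with respect to the files they describe.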
Strainberry's output is stored in OUTPUT_DIR.
After Strainberry finishes, the conda environment can be deactivated with the command:
conda deactivate
The output directory of Strainberry has the following structure:
OUTPUT_DIR/
├── strainberry_n2/
├── strainberry_n3/
├── ...
├── strainberry_nK/
├── assembly.scaffolds.bam
├── assembly.scaffolds.bam.bai
├── assembly.scaffolds.fa
└── assembly.scaffolds.fa.fai
Strainberry's output assembly is stored in the assembly.scaffolds.fa file.
A minimap2-based alignment of the input reads to the output assembly is also available in the assembly.scaffolds.bam file.
All sub-directories named strainberry_n{k} contain intermediate results of Strainberry iterations (e.g., the VCF of variants found by mapping reads to contigs) and log files.
After a successful run of Strainberry, they can be deleted.
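One way to remove the intermediate sub-directories while leaving the final assembly and alignment untouched is sketched below. clean_intermediate is a hypothetical helper, not part of Strainberry, and it relies only on the strainberry_n{k} naming shown above.

```shell
# clean_intermediate is a hypothetical helper, not part of Strainberry.
# It deletes only the strainberry_n* sub-directories of an output directory.
clean_intermediate() {
    local out_dir="$1" d
    for d in "${out_dir}"/strainberry_n*/; do
        # Skip the unexpanded pattern when no sub-directory matches
        [ -d "$d" ] || continue
        rm -r -- "$d"
    done
}
```

Because the log files live in the same sub-directories, it may be worth copying them elsewhere before calling clean_intermediate OUTPUT_DIR.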
To verify that Strainberry has been correctly installed, you can test it on the small dataset in the example sub-directory.
To generate a strain-oblivious assembly and a read alignment, we recommend using metaFlye, minimap2, and samtools. Assuming these tools are available, the following commands can be run, using 12 threads:
cd example
flye --meta --pacbio-raw reads.fq.gz --out-dir flye_out --genome-size 300k --threads 12
minimap2 -ax map-pb -t 12 ./flye_out/assembly.fasta reads.fq.gz | samtools sort >./flye_out/alignment.sorted.bam
samtools faidx ./flye_out/assembly.fasta
samtools index ./flye_out/alignment.sorted.bam
where --genome-size provides metaFlye with an estimate of the metagenome size (not required from version 2.8).
The assembly and read alignment are then available in the flye_out directory as assembly.fasta and alignment.sorted.bam, respectively.
Given a strain-oblivious assembly (file assembly.fasta) and a long-read mapping (file alignment.sorted.bam), Strainberry can be run using 4 threads as follows:
cd example
strainberry -r assembly.fasta -b alignment.sorted.bam -o sberry_out -c 4
Strainberry should take around 5 minutes to finish. The file assembly.fasta contains a single sequence which is a consensus of a small region of E. coli strains K12 and W.
After a successful run of Strainberry, in the sberry_out directory, the file assembly.scaffolds.fa should contain two scaffolds (one closer to strain K12, the other closer to strain W).
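A simple way to check the number of scaffolds is to count the FASTA header lines of the output assembly. The sketch below uses a hypothetical helper name, count_sequences; it is not part of Strainberry.

```shell
# count_sequences is a hypothetical helper, not part of Strainberry.
# It prints the number of FASTA records, i.e. lines starting with '>'.
count_sequences() {
    grep -c '^>' "$1"
}
```

Run from the example directory, count_sequences sberry_out/assembly.scaffolds.fa should report 2 if the test run matches the description above.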
usage: strainberry -r PATH -b PATH -o PATH [--nanopore] [-n int] [-s float]
                   [-c int] [-h] [-V] [-v]

Automated strain separation of low-complexity metagenomes

Required arguments:
  -r PATH, --reference PATH
                        Strain-oblivious assembly in FASTA format
  -b PATH, --bam PATH   Read alignment in BAM format
  -o PATH, --out-dir PATH
                        Output directory of Strainberry assemblies

Optional arguments:
  --nanopore            Input consists of Oxford Nanopore reads
  -n int, --max-strains int
                        Attempt strain-separation at most for the provided
                        strain multiplicity [5]
  -s float, --snv-density float
                        Minimum SNV percentage to consider haplotype blocks
                        [0.1]
  -c int, --cpus int    Maximum number of CPUs to be used [1]

Other arguments:
  -h, --help            Show this help message and exit
  -V, --version         Show version number and exit
  -v, --verbose         Verbose output
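As an illustration of how the optional arguments combine, the sketch below wraps a run on Oxford Nanopore data. run_sberry_nanopore is a hypothetical wrapper name, and the values 3 and 8 are arbitrary choices for this example, not recommendations. For Nanopore reads, the input alignment would typically be produced with the map-ont preset of minimap2 rather than map-pb.

```shell
# run_sberry_nanopore is a hypothetical wrapper, not part of Strainberry.
# Arguments: FASTA BAM OUTPUT_DIR
run_sberry_nanopore() {
    # --nanopore: input consists of Oxford Nanopore reads
    # -n 3: attempt separation up to strain multiplicity 3 (illustrative value)
    # -c 8: use at most 8 CPUs (illustrative value)
    strainberry -r "$1" -b "$2" -o "$3" --nanopore -n 3 -c 8
}
```

Keeping the options in a small function like this makes it easier to rerun the same configuration on several samples.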
If you use Strainberry in your work, please cite:
R. Vicedomini, C. Quince, A. E. Darling, R. Chikhi, Strainberry: automated strain separation in low-complexity metagenomes using long reads, Nature Communications (2021). doi: 10.1038/s41467-021-24515-9