MARVEL is a tool for recovery of draft phage genomes from whole community shotgun metagenomic sequencing data.
Main script:
- marvel_bins.py - Machine learning prediction of phage bins
Auxiliary script:
- generate_bins_from_reads.py - Generates metagenomic bins, given Illumina sequencing reads
A manuscript describing MARVEL was published in Frontiers in Genetics.
If you find MARVEL useful in your research, please cite:
Amgarten DE, Braga LP, Da Silva AM, Setubal JC. MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins. Frontiers in Genetics. 2018;9:304.
- Fix for Prokka recent changes in its out files. All Prokka versions should work just fine.
- Bins containig contigs which are too short (<2000bp) will not be considered in downstream analyses and a warning will be printed.
- Minor improvements.
- Bugs ou others suggestions for improvement, just let me know: deyvid.amgarten@usp.br
We are working to train models to new viral groups. Let us know if a particular group would be helpful in your research.
All scripts from this project were coded in Python 3. So, first of all, make sure you have it installed and updated.
MARVEL's main scrip (marvel_bins.py) requires Prokka and HMMER tools as dependencies. By installing Prokka and its dependencies, you will usually install HMMER tools automatically.
- Prokka - Rapid Prokaryotic genome annotation.
- HMMER Tools - Biosequence analysis using profile hidden Markov models
These Python libraries are required:
- Numpy, Scipy - Efficiently handling arrays and scientific computing
- Biopython - Handling biological sequences and records
- Scikit-learn - Machine learning
To install these Python libraries, just type:
pip3 install -U numpy scipy biopython scikit-learn
- If you use conda you can use the
environment.yml
file within this project to install all dependencies.
$ conda env create -n marvel -f=environment.yml
$ conda activate marvel
(marvel)$ python marvel_bins.py -h
Getting MARVEL ready to run is as simple as cloning this Github project or dowload and extract it to a directory inside your computer:
git clone https://github.com/LaboratorioBioinformatica/MARVEL
Inside the directory where MARVEL was extracted (or cloned), you will need to download and set the models. This is required only once and it is simple. Just run:
python3 download_and_set_models.py
All set!
Now, to run MARVEL type:
python3 marvel_bins.py -i input_directory -t num_threads
Change 'input_directory' to the folder where bins are stored in fasta format and 'num_threads' to the number of CPU cores to be used. Several threads should be used to speed up prokka and hmm searches.
Results will be stored in the 'Results' folder inside the input directory.
Obs: You need to execute the scripts from the directory where MARVEL was extracted, i.e., MARVEL's root folder.
We provide a folder with example datasets containing mocking bins of RefSeq viral and bacterial genomes.
To try these examples, run:
python3 marvel_bins.py -i example_data/bins_8k_refseq -t 12
MARVEL's main script receives metagenomic bins as input. However, we additionally provide a simple scrip which receives metagenomic reads (Illumina sequencing) and generates bins. metaSpades, Bowtie2 and Metabat2 are used for assembling, mapping and binning, respectively.
In case you want to generate the bins by yourself, we recommend two special parameters in the binning process: -m 1500 -s 10000. These parameters tell metabat to generate bins with contigs of at least 1500 bp and with a minimum total size of 10 kbp, which makes more sense when one is trying to retrieve viral genomes.
We can't stress enough that there are several tools for assembly and binning, which should be well-chosen according to
the researcher's purposes. Our intention here is to facilitate the use of our tool.
python3 generate_bins_from_reads.py -1 reads-R1.fastq -2 reads-R2.fastq -t num_threads
MARVEL is indicated to metagenomic studies where whole community DNA sequencing reads are available. The process of binning with Metabat2 will work better if you have several samples from a same environment (time-series samples for example), so Metabat2 can assess contigs' abundancy correlation among samples and create better bins. Here we suggest a workflow of analyses to generate draft phage genomes from time-series samples raw sequencing reads.
-
Assembly raw reads for each sample individually with metaSpades.
-
Join contigs in a single multi-FASTA and use dedupe from BBMap tools to remove duplicated contigs.
-
Use resulting deduplicated contigs as reference for mapping samples' reads individually with Bowtie2.
-
Generate bins with metabat2 using BAM files obtained in step 3. At this point, it is important to set these specific parameters in metabat2: -m 1500 -s 10000. You should use these parameters, so metabat2 can generate bins with phage genomes characteristics.
-
Run MARVEL giving bins folder as input. Bins predicted as phage will be in the folder: results/phage_genomes/.
For improved draft genomes, we suggested the following additional steps:
-
Merge contigs with overlapping ends with Phrap for each individual bin predicted by MARVEL as phage.
-
Further validate predicted bins by assessing bacterial/archaeal genes with CheckM and predicting tail/capsid proteins with VIRALpro.
If you have any question, please contact us. We will be glad to help with your analyses.
deyvid.amgarten@usp.br
All the simulated datasets used for training and testing the Random Forest classifier, as well as predicted bins from composting samples were made available through this link:
Deyvid Amgarten
This tool was developed as part of my PhD thesis by the Bioinformatics Graduate Program from the University of Sao Paulo, Brazil.
Our group is interested in studying:
- Classical machine learning techniques to create predictors, which are being applied in genomics and metagenomics;
- Deep Learning techniques applied to more broad problems, as for instance prediction of complex biological attributes in viruses and bacteria.
Please visit Setulab's page for more information.
This project is licensed under GNU license. Codes here may be used for any purposed or modified.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.