Skip to content

thackl/detectEVE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

96 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Usage

# download workflow
git clone https://github.com/thackl/detectEVE
cd detectEVE

# install dependencies via conda or mamba (https://github.com/conda-forge/miniforge)
mamba create -n detectEVE
mamba activate detectEVE
mamba env update --file workflow/envs/env.yaml

# run detectEVE
./detectEVE -h                       # show help
./detectEVE [options] [<in.fa> ...]  # analyze local fasta files
./detectEVE [options] -a acc.csv     # download & analyze NCBI accession table
./detectEVE [options] -A acc,acc     # download & analyze NCBI accession list

# or combine local fasta files and remote accessions
./detectEVE [options] [(-a acc.csv | -A acc,acc)] [<in.fa> ...]

# download and prep databases
./detectEVE --setup-databases [--snake ARGS]

# run example data
cd examples
../detectEVE *.fna

See NCBI SRA WGS for downloadable accessions. Exported csv tables from the Sequence Set Browser can be directly used with detectEVE (-a wgs_selector.csv). Genomes will be downloaded to genomes/<accession>.fna.

Note, by default, downloaded genomes will be removed again automatically after being scanned. If you want to keep them, add --notemp to --snake-args.

See Advanced database setup for alternatives and customization of databases.

See Known issues and https://github.com/thackl/detectEVE/issues for questions, problems or feedback.

Output

The pipeline produces the following final files in results/:

  • <genome_id>-validatEVEs.tsv - best hit of the EVE with evidence and confidence annotation (high confidence: EVE score > 30, low confidence: EVE score > 10)
  • <genome_id>-validatEVEs.fna - validatEVEs nucleotide sequences
  • <genome_id>-validatEVEs.pdf - graphical overview of hit distribution for validatEVEs

figures/wf-example-output.png

Workflow overview and background

detectEVE is based on the EVE search strategy developed by S. Lequime and previously used in the following publications:

The current workflow involves the following steps:

figures/wf-rulegraph.png

Advanced database setup

If you need databases in a different location you can adjust db_dir in config.yaml to whatever suits your system.

If you prefer to handle downloads manually or use existing files, copy any file you don’t want detectEVE to download automatically into databases/ (or the respective config.yaml/db_dir) before running --setup-databases.

Note though, unless you add --notemp to the --snake-arguments, all but the final diamond-formatted database files will be deleted from databases/ at the end of the setup phase.

cd databases/

# Latest RVDB
url=https://rvdb-prot.pasteur.fr/ && 
db=$(curl -fs $url | grep -oPm1 'files/U-RVDBv[0-9.]+-prot.fasta.xz')
curl $url/$db -o rvdb100.faa.xz

# UniRef50
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz

# NCBI taxonomy
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz nodes.dmp names.dmp

Known issues

tidyverse stringi libicui

If you encounter an error related to tidyverse/stringi/libicui18n.so.58, try reinstalling stringi locally. To restart the workflow from where it failed, just run the same command again.

mamba remove r-stringi r-tidyverse
R -e 'install.packages("stringi")'
mamba install r-tidyverse

diamond v2.1.9 send bug

diamond v2.1.9 has a bug and does not output send correctly when using long-read mode (–range-culling). Since at this point v2.1.9 is the latest diamond version, detectEVE defaults to diamond v2.1.8 to avoid this bug.