This repository contains all scripts used to produce the results shown in the paper. Some of these scripts are just shell wrappers around other commands; they are included because they document the exact parameters used in the pipeline.

To install, copy the scripts to a folder and execute them from there. Make sure that all dependencies are in your PATH. The required external dependencies are listed below. The entire enveomics package is not required to reproduce the results shown in the paper: the only requirement from enveomics is ani.rb, a Ruby script for ANI calculation. This pipeline has been tested on Ubuntu 16.04 LTS.
External programs:
- AcCNET v1.2
- ani.rb from enveomics repository
- BLAST+ v2.6.0
- HMMER v3.1b2
- PlasmidFinder v1.3
- Gephi v0.9.2
- graph-tool v2.29
Other software and libraries:
- Matlab 2019a
- Python 3.6
- BioPython v1.69
- R 3.4
- Perl v5.26.2
- BioPerl 1.6.924
- Easyfig v2.2.2
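Since the scripts expect their dependencies to be reachable through PATH, a quick check before starting can save a failed run. The following is only a minimal sketch: the executable names in `REQUIRED_TOOLS` are assumptions (for example `blastn` for BLAST+ and `hmmsearch` for HMMER) and should be adapted to how each tool is installed on your system; `find_missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumed executable names; adjust them to your local installation.
REQUIRED_TOOLS = ["ani.rb", "blastn", "hmmsearch"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be resolved through PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found in PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.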
To reproduce the manuscript's results, a number of scripts must be executed in a precise order, as some steps require the output of previous commands. Not all scripts are documented, as some are simply wrappers around other commands. A broad description of the different steps follows:
download_ncbi_taxonomy.sh
: Generate a database for taxonomy annotations
extract_RefSeq84_database.sh
: Extract the RefSeq plasmid sequence database (modify according to your database version). One of the outputs of this command is plasmid.lst, a file listing all accession numbers of the plasmid dataset
generate_protein_seqs.sh
: Extract amino acid sequences from genome GBK files. As argument, use the file plasmid.lst generated by the previous step
extract_plasmid_info.sh
: Generate the plasmid metadata database plasmid.tsv using GenBank annotations and taxonomy
assign_mob_classes.sh
: Find the plasmid relaxases. The relaxase HMM profile database is shared from MOBscan. This wrapper's argument is the same plasmid.lst file used previously
assign_pfinder_classes.sh
: Type plasmid replicons with the PlasmidFinder software and database
list_subgroups.sh
: Generate different subsets of plasmids (Enterobacterales, Escherichia, etc.). Some accession numbers are blacklisted because they were found not to be real plasmids
append_pGroup_annotation.sh
: This is a bit underhanded, but here we update the plasmid metadata database with the PTUs manually defined based on the output of the next commands
accnet_RefSeq84.sh
: Execute AcCNET to generate the plasmidome/ORFeome bipartite network. Use Gephi with the output of this script to produce the network layout
ani_RefSeq84.sh
: This is the main step of our analysis, as it produces the files later used with Gephi to lay out the PTU network. This script combines two distinct functions:
calculate_ani_distances_p.py
: Produce the list of ANI pairwise comparisons. As it stands, this script is very inefficient to run on a personal computer and will take several weeks to complete
genome_similarity_nerwork.py
: Take the ANI comparisons and generate the file of edge similarity and distance measures
This algorithm has been implemented with the Matlab files setglobal.m, divide.m, escribe_componentes.m, keephojas.m and dibuja.m.
To execute it, simply enter the following commands in the Matlab Command Window:
>> setglobal;
>> divide(G, '0');
>> keephojas;
graph-tool_SBM_script.py
: Script used to generate different SBM models
graph-tool_NSBM_script.py
: Script used to generate different Hierarchical SBM models
simulation.py
: Script used for sHSBM performance simulation
ptu_classifier.py
: Script used for sHSBM PTU classification
ptu_comparison.py
: Script used for sHSBM and PID PTU classifications
bipartite_kept_190823.sh
: Generate the bipartite network used for host range visualization. Use Gephi to convert it into a monopartite network of the hosts present per PTU
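In the original workflow the bipartite-to-monopartite conversion is done in Gephi. For readers unfamiliar with the idea, the sketch below illustrates what a one-mode projection does, under the assumption that the bipartite edges link PTUs to host taxa; two hosts become connected when they share at least one PTU, and the edge weight counts the shared PTUs. The function name and the labels in the example are made up for illustration and do not come from this repository.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_hosts(ptu_host_edges):
    """Collapse (PTU, host) edges into host-host edges weighted by shared PTUs."""
    hosts_by_ptu = defaultdict(set)
    for ptu, host in ptu_host_edges:
        hosts_by_ptu[ptu].add(host)

    weights = defaultdict(int)
    for hosts in hosts_by_ptu.values():
        # Every pair of hosts carrying the same PTU gains one unit of weight.
        for host_a, host_b in combinations(sorted(hosts), 2):
            weights[(host_a, host_b)] += 1
    return dict(weights)


# Made-up labels, for illustration only.
edges = [
    ("PTU-A", "host_1"), ("PTU-A", "host_2"),
    ("PTU-B", "host_1"), ("PTU-B", "host_2"), ("PTU-B", "host_3"),
]
projection = project_onto_hosts(edges)
```

Here `host_1` and `host_2` share two PTUs, so their edge has weight 2, while the pairs involving `host_3` have weight 1.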
Connections_v2.py
: Calculate plasmid/HpC statistics of the bipartite network, stratified by the taxonomic levels of the nodes
Connection_plots.R
: R script for visualization of the Connections_v2.py output
pANI_prepare_data.sh
: Compile a BLAST database of plasmid fragments sized for alignment fraction (AF) calculation
pANI_Enterobacterales.sh
: Generate the pairwise list of alignment fraction (AF) results. Again, this step will take too long to execute on a personal computer
summarize_pGroups_info.sh
: Generate a basic description of PTU composition
calculate_cluster_density.py
: Calculate the inter- and intra-cluster density of PTU clusters
check_database_redundancy.py
: Verify the percentage of plasmid duplication in PTU clusters
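As an illustration of the inter- and intra-cluster density mentioned above, the sketch below uses one common definition: the number of observed edges divided by the number of possible node pairs, counted separately for pairs inside the same cluster and for pairs spanning two clusters. This is an assumption made for the example; calculate_cluster_density.py may define or compute these quantities differently, and `cluster_densities` is a hypothetical name.

```python
from itertools import combinations


def cluster_densities(edges, membership):
    """Return (intra, inter) edge densities for an undirected simple graph.

    edges: iterable of (node_a, node_b) pairs
    membership: dict mapping every node to its cluster label
    """
    # Count the possible node pairs within clusters and across clusters.
    intra_pairs = inter_pairs = 0
    for node_a, node_b in combinations(list(membership), 2):
        if membership[node_a] == membership[node_b]:
            intra_pairs += 1
        else:
            inter_pairs += 1

    # Ignore self-loops and duplicated edges.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    intra_edges = inter_edges = 0
    for edge in unique_edges:
        node_a, node_b = tuple(edge)
        if membership[node_a] == membership[node_b]:
            intra_edges += 1
        else:
            inter_edges += 1

    intra = intra_edges / intra_pairs if intra_pairs else 0.0
    inter = inter_edges / inter_pairs if inter_pairs else 0.0
    return intra, inter


# Made-up example: a fully connected cluster A, a pair in B, and one bridge.
membership = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p4", "p5"), ("p3", "p4")]
intra, inter = cluster_densities(edges, membership)
```

Enumerating all node pairs keeps the sketch short but is itself quadratic; for a large network the pair counts could instead be derived from the cluster sizes.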
The expected outputs of this pipeline are the files defining the networks shown in the paper and the list of plasmids classified into different PTUs. The Gephi network files corresponding to the Plasmidome/ORFeome bipartite network (Figure 1), the full RefSeq84 plasmidome PTU network (Figures 3 and 6) and the Enterobacterales plasmidome subset (Figure 4) can be downloaded from the Supplementary Material attached to the paper. These files are, respectively, Supplementary Files SF4, SF1, and SF2. Supplementary Table ST5 lists the plasmids automatically classified into different PTUs after applying the PID and sHSBM algorithms to the adjacency matrix of the RefSeq84 plasmidome network.
Completing the full pipeline on a personal computer will take a few weeks because of the quadratic number of pairwise ANI similarity comparisons. Moreover, the Plasmidome/ORFeome bipartite AcCNET network is large enough to compromise the normal execution of Gephi on typical personal computers. ANI networks, being monopartite, are not yet limited by plasmid number.
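To make the quadratic growth concrete, the sketch below counts the unordered pairs among n sequences, n(n-1)/2, and enumerates them for a tiny list. It assumes that each pair is compared once; whether the ANI scripts in this repository compare each pair once or in both directions is not stated here. The accession identifiers are placeholders.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def iter_pairs(accessions):
    """Yield each unordered pair of accessions exactly once."""
    return combinations(accessions, 2)


accessions = ["acc_1", "acc_2", "acc_3", "acc_4"]  # placeholder identifiers
pairs = list(iter_pairs(accessions))
assert len(pairs) == count_pairs(len(accessions))  # 6 pairs for 4 sequences

# Doubling the dataset roughly quadruples the number of comparisons.
print(count_pairs(1000), count_pairs(2000))  # → 499500 1999000
```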
All this software is released under the GPL license.