phyBWT2 is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding to use a distance matrix to infer phylogeny.
It applies the properties of the Extended Burrows-Wheeler Transform (EBWT) to the idea of decomposition for phylogenetic inference. In particular, it hinges the combinatorial properties of the EBWT positional clustering framework recently introduced , overcoming the limitations of employing k-mers with a priori fixed size. Finally, phyBWT2 infers the tree structure by comparing all the sequences simultaneously, instead of performing their pairwise comparisons.
Let S={S1,...,Sn} be the input collection of sequences, where each Si is a multiset of strings representing an organism (e.g. sequencing reads, contigs, genome). The tool phyBWT takes as input the following data structures:
- the extended Burrows–Wheeler transform (eBWT), or multi-string BWT, of collection S;
- the longest common prefix array (LCP) of collection S;
- the color document array (CDA) of collection S, obtained from the document array (DA) of collection S.
git clone https://github.com/veronicaguerrini/phyBWT2
cd phyBWT2
phyBWT2 can handle datasets of different types (short reads, long reads, contigs, or entire genomes).
For short reads, phyBWT2 must be compiled by using
make
while for sequences longer than 250 bp, set SHORT=0
make SHORT=0
The required data structures eBWT, LCP and CDA can be built independently from phyBWT2. This is a good feature that allows the user to choose the most appropriate tool according to the resources available and the dataset composition (short reads or longer sequences).
For instance, to build .ebwt, .lcp, and .da files from scratch from a single fasta file, one could use BCR [https://github.com/giovannarosone/BCR_LCP_GSA] for short reads, and for longer sequences, gsufsort [https://github.com/felipelouza/gsufsort]. Note that gsufsort tool returns the output files with slightly different filename extensions.
To install and compile both BCR and gsufsort for the preprocessing step, in addition to phyBWT2, one could run
./Install.sh
Note that by default the above script compiles phyBWT2 for short reads. To correctly compile phyBWT2 for longer sequences, the parameter SHORT inside the script must be set to 0.
To obtain the color document array CDA from the document array DA (file fastaName.da), one could use
./create_cda fastaName fileInfo
where fileInfo is a tab-separated file that stores per line the number of sequences of each Si of the collection S, according to the format:
Si num
Alternatively, one could run the following script that builds up the three data structures eBWT, LCP and CDA starting from a collection of FASTA files (one for each multiset Si of the collection S) stored in the directory input_directory.
./Preprocessing.sh input_directory fastaName
The script outputs the three data structures eBWT, LCP and CDA (files fastaName.ebwt, fastaName.lcp and fastaName.cda), and a tab-separated file (fastaName.txt) that stores per line the number of sequences of each Si of the collection S (i.e. fileInfo). By default the above script computes the data structures eBWT, LCP and DA by using BCR (preferred choice for constructing the eBWT of short reads). To build them for longer sequences, please change the parameter SHORT inside the script and set it to 0.
Our alignment-, assembly-, and reference-free method can be run by using:
./phyBWT2 fastaName fileInfo output k_min tau
where fastaName is the base name of files fastaName.ebwt, fastaName.lcp and fastaName.cda, while fileInfo is a tab-separated file that stores per line the number of sequences of each Si of the collection S.
By setting SHORT=0 in Preprocessing.sh and compiling phyBWT2 with parameter SHORT=0, the input datastructures can be built by
./Preprocessing.sh ./example/hiv_sequences/ HIV-1_data
and phyBWT2 can be run by using
./phyBWT2 HIV-1_data.fasta HIV-1_data.txt Tree_HIV-1_data.new 16 0.6
Guerrini V., Conte A., Grossi R., Liti G., Rosone G., Tattini L., phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering. In Proceedings of the 22nd International Workshop on Algorithms in Bioinformatics WABI 2022. LIPIcs, vol 242, pp 23:1--23:19. doi: 10.4230/LIPIcs.WABI.2022.23
Guerrini V., Conte A., Grossi R., Liti G., Rosone G., Tattini L., phyBWT2: phylogeny reconstruction via eBWT positional clustering. (2023) Algorithms Mol Biol 18, 11. doi: 10.1186/s13015-023-00232-4
--
Supported by PNRR project “THE—Tuscany Health Ecosystem” — Spoke 6 “Precision medicine & personalized healthcare”, funded by the European Commission under the NextGeneration EU programme.