Phage-Host Interaction Search Tool
A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of k-mers shared between their sequences.
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
cd PHIST
make
mkdir ./out
python3 phist.py ./example/virus ./example/host ./out/common_kmers.csv ./out/predictions.csv
PHIST uses Kmer-db as a submodule, therefore a recursive repository clone must be performed:
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
Under Linux/OS X the package can be built by running make in the project directory (tested with G++ 5.3):
cd PHIST
make
Under Windows, one has to build the Visual Studio 2015 solutions in the kmer-db and utils subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).
PHIST takes as input two directories containing FASTA files (gzipped or not) with genomic sequences of viruses and candidate hosts (see example).
python3 phist.py [options] <virus_dir> <host_dir> <out_table> <out_predictions>
Positional arguments:
- virus_dir - input directory with virus FASTA files (gzipped or not),
- host_dir - input directory with host FASTA files (gzipped or not),
- out_table - output CSV file with common k-mers table (default: common_kmers.csv),
- out_predictions - output CSV file with host predictions (default: predictions.csv).
Options:
- -k, --k <kmer-length> - k-mer length (default: 25, max: 30),
- -t, --t <num-threads> - number of threads (default: number of cores),
- -h, --help - show this help message and exit,
- --version - show tool's version number and exit.
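When PHIST is driven from another Python program, the command line can be assembled programmatically. The following is a minimal sketch and not part of PHIST: the helper names `build_phist_command` and `run_phist` are hypothetical, and it assumes that `phist.py` is reachable from the current working directory and that options precede the positional arguments, as in the usage line.

```python
import subprocess
import sys


def build_phist_command(virus_dir, host_dir, out_table, out_predictions,
                        k=None, threads=None, script="phist.py"):
    """Assemble the argument list for a PHIST run (hypothetical helper)."""
    cmd = [sys.executable, script]
    if k is not None:
        if k > 30:  # the documented maximum k-mer length is 30
            raise ValueError("k-mer length must not exceed 30")
        cmd += ["-k", str(k)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    # Positional arguments follow the options, in the documented order.
    cmd += [virus_dir, host_dir, out_table, out_predictions]
    return cmd


def run_phist(*args, **kwargs):
    """Run PHIST; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_phist_command(*args, **kwargs), check=True)
```

For instance, `run_phist("./example/virus", "./example/host", "./out/common_kmers.csv", "./out/predictions.csv", k=25)` would mirror the quick-start invocation with an explicit k-mer length.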
PHIST outputs two CSV files: a table of common k-mers between phages and hosts, and a file with virus-host predictions.
The common_kmers.csv file stores the numbers of common k-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero k-mer counts are represented as pairs (column_number : value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:
kmer-length: k fraction: f | phages | φ1 | φ2 | ... | φn |
---|---|---|---|---|---|
hosts | total-kmers | |φ1| | |φ2| | ... | |φn| |
h1 | |h1| | i11 : |h1 ∩ φi11| | i12 : |h1 ∩ φi12| | | |
h2 | |h2| | i21 : |h2 ∩ φi21| | i22 : |h2 ∩ φi22| | i23 : |h2 ∩ φi23| | |
h3 | |h3| | | | | |
... | ... | ... | | | |
hm | |hm| | im1 : |hm ∩ φim1| | | | |
where:
- k - k-mer length,
- φ1, φ2, ..., φn - phage names,
- h1, h2, ..., hm - host names,
- |a| - number of k-mers in sample a,
- |a ∩ b| - number of k-mers common for samples a and b.
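To make the sparse layout concrete, the sketch below expands such a table into a dictionary of non-zero counts. It is an illustration only: it assumes comma-separated cells, two header rows laid out as above, and that column number 1 refers to the first phage. The function name `parse_common_kmers` and the sample data are made up for the example.

```python
import csv
import io


def parse_common_kmers(handle):
    """Return {host: {phage: common_kmer_count}} for the non-zero entries."""
    reader = csv.reader(handle)
    header = next(reader)   # k-mer length/fraction cell, "phages", phage names
    next(reader)            # second header row: total k-mers per phage (unused)
    phages = [name.strip() for name in header[2:]]

    table = {}
    for row in reader:
        if not row:
            continue
        host = row[0].strip()
        counts = {}
        for cell in row[2:]:            # row[1] is the host's total k-mer count
            if not cell.strip():
                continue                # empty trailing cells carry no data
            column, value = cell.split(":")
            # Assumed: 1-based column numbers index the phage list.
            counts[phages[int(column) - 1]] = int(value)
        table[host] = counts
    return table


# Made-up sample data in the assumed layout.
sample = io.StringIO(
    "kmer-length: 25 fraction: 1,phages,phiA,phiB,phiC\n"
    "hosts,total-kmers,100,200,300\n"
    "h1,5000,1:12,3:7\n"
    "h2,6000,2:40\n"
    "h3,7000\n"
)
print(parse_common_kmers(sample))
# {'h1': {'phiA': 12, 'phiC': 7}, 'h2': {'phiB': 40}, 'h3': {}}
```

When reading a real file, opening it with `open(path, newline="")` is the usual way to hand it to the `csv` module.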
The predictions.csv file assigns each phage to its most likely host (i.e., the one sharing the most k-mers with it). If multiple potential hosts have the same number of common k-mers, all of them are reported. Each virus-host interaction is accompanied by a p-value and a p-value adjusted for multiple comparisons.
phage | host | common k-mers | p-value | adj. p-value |
---|---|---|---|---|
φ1 | host(φ1) | |φ1 ∩ host(φ1)| | ... | ... |
φ2 | host(φ2) | |φ2 ∩ host(φ2)| | ... | ... |
φ3 | host1(φ3) | |φ3 ∩ host1(φ3)| | ... | ... |
φ3 | host2(φ3) | |φ3 ∩ host2(φ3)| | ... | ... |
... | ... | ... | ... | ... |
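The assignment rule for predictions.csv (pick the host with the most common k-mers and report every host tied for that maximum) can be sketched as follows. This is not PHIST's own code: the function name `top_hosts` and the input layout (a mapping from phage to per-host counts) are assumptions, and the p-value computation is left out because it is not described here.

```python
def top_hosts(common_kmers):
    """Map each phage to the hosts sharing the most k-mers with it.

    common_kmers: {phage: {host: number_of_common_kmers}}
    """
    predictions = {}
    for phage, per_host in common_kmers.items():
        if not per_host:
            # Assumption of this sketch: phages sharing no k-mers with
            # any host get no prediction.
            continue
        best = max(per_host.values())
        # Keep every host that reaches the maximum, so ties are all reported.
        predictions[phage] = sorted(h for h, n in per_host.items() if n == best)
    return predictions


# Made-up counts: phi3 has two hosts tied at 50 common k-mers.
counts = {
    "phi1": {"hostA": 120, "hostB": 35},
    "phi3": {"hostA": 50, "hostC": 50, "hostD": 2},
}
print(top_hosts(counts))
# {'phi1': ['hostA'], 'phi3': ['hostA', 'hostC']}
```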
The utils/matcher tool retrieves the list of all exact matches of length >= k for a given pair of phage and host FASTA sequences. The matches are reported with their coordinates in the viral genome and in the corresponding bacterial genome (a reversed interval in the latter indicates a reverse-complement match).
./utils/matcher [options] <virus> <host> <output>
Positional arguments:
- virus - virus FASTA file (gzipped or not),
- host - host FASTA file (gzipped or not),
- output - output CSV file.
Options:
- -k, --k <kmer-length> - k-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution).
./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
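Each output line can be split into coordinates to recover match lengths and orientation. The sketch below is a hypothetical helper, not part of PHIST: it assumes the `name:start-end` layout shown in the example and treats both coordinates as inclusive, which is an assumption (it is consistent with the shortest listed match spanning 25 positions at the default k).

```python
from typing import NamedTuple


class Match(NamedTuple):
    virus: str
    virus_start: int
    virus_end: int
    host: str
    host_start: int
    host_end: int

    @property
    def length(self):
        # Assumes inclusive coordinates.
        return abs(self.virus_end - self.virus_start) + 1

    @property
    def reverse_complement(self):
        # A reversed host interval marks a reverse-complement match.
        return self.host_start > self.host_end


def _parse_interval(text):
    # Split on the last colon so that names containing colons still work.
    name, _, span = text.rpartition(":")
    start, end = span.split("-")
    return name, int(start), int(end)


def parse_match(line):
    """Parse one data line of the matcher output into a Match."""
    virus_part, host_part = line.strip().split(",")
    return Match(*_parse_interval(virus_part), *_parse_interval(host_part))


m = parse_match("NC_024123.1:52942-52968,NC_017548.1:1456873-1456847")
print(m.length, m.reverse_complement)
# 27 True
```

The first line of the output file holds the two input file names rather than coordinates, so it should be skipped before calling `parse_match` on the remaining lines.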