
tlensk/HMM


Exploring the limits of bacteriophage diversity: an in-depth analysis of
the non-contractile tail tips among Proteobacteria-infecting phages

Tatiana Lenskaia1,*, Sherwood Casjens2,3, and Alan Davidson1,4

1 Molecular Genetics, Temerty Faculty of Medicine, University of Toronto, Toronto, Canada

2 School of Biological Sciences, University of Utah, Salt Lake City, USA

3 Pathology Department, School of Medicine, University of Utah, Salt Lake City, USA

4 Department of Biochemistry, University of Toronto, Toronto, Canada

* Corresponding author: t.lenskaia@utoronto.ca

Abstract

The staggering diversity of bacteriophages seems to exceed human comprehension and defy systematic understanding. In this study, we focus on the non-contractile tail tips of phages that infect Proteobacteria, a group of phages notable for both their ecological significance and their sequence variability. By focusing on this specific group of phages, we uncover patterns that bring clarity to this apparent complexity. Despite substantial divergence in tail tip protein sequences, our analysis reveals recurring organizational features that simplify their interpretation and highlight underlying functional determinants. This demonstrates that even in highly variable regions of phage genomes, systematic understanding is within reach. We further suggest that this perspective extends beyond tail tips, providing a framework for studying phage proteins at large. By narrowing the scope to carefully defined structural components, the seemingly limitless complexity of bacteriophages becomes progressively more approachable, bringing us closer to understanding the basis for their diversity and evolutionary success.

Keywords

Morphogenetic proteins, comparative genomics, structural prediction

1. Introduction

Bacteriophages, viruses that infect bacteria, are the most abundant biological entities on Earth and exhibit an extraordinary range of strategies for recognizing their hosts and delivering their genomes into their host cells. At the molecular level, these strategies are encoded in phage genomes. Tailed phages are by far the most prevalent group. These phages have tail-like structures that serve as devices for detecting and binding to specific bacterial surface receptors and as conduits for delivering viral DNA into the host cell. Although tailed bacteriophages are highly diverse, they exhibit three main tail morphologies: short-tailed podophages, long contractile-tailed myophages, and long non-contractile-tailed siphophages. Gene analyses of virion assembly proteins reveal that the three phage morphotypes share strong similarities in head and DNA-packaging proteins, but differ markedly in their tail proteins, reflecting their distinct tail structures. The number of complete tailed phage genome sequences in public databases has grown to tens of thousands: as of September 9, 2025, complete Caudoviricetes genomes account for 23,363 entries in NCBI GenBank and 5,145 entries in NCBI RefSeq. A large fraction of their genes has no specific functional prediction because current annotation pipelines fail to assign one: about 50-70% of genes are annotated as hypothetical or as having only a putative function.

Our approach to understanding the nature of phage diversity is to examine limited representative panels of phage genome sequences in substantially more detail than is possible using all known phage genomes (Casjens & Grose, 2016; Casjens et al., 2022; Zheng et al., 2025). The nature of such “slices” of extant diversity thus revealed can then be used to extrapolate insights and guide further examinations of the wider diversity. In this study we examine tail tip structures of siphophages that infect the Proteobacteria clade of phage hosts.

Over long evolutionary timescales, many phage proteins have diverged beyond recognizable sequence similarity, leaving genome annotation pipelines unable to assign functions. This is especially true for the unusually large number of small proteins encoded by phages. Fremin et al. (2022) reported over 40,000 small-gene families in ~2.3 million phage genome contigs, with small genes being roughly three-fold more prevalent in phage genomes than in their bacterial hosts. Many phage proteins are labeled as “hypothetical proteins” or given vague designations such as “phage protein.” As a result, a substantial fraction of phage genes appears uncharacterized, creating the illusion of an incomprehensibly vast repertoire of novel genes (Paul et al., 2002). This challenge is compounded by inconsistent nomenclature: proteins with the same function are often described differently across phage systems, obscuring true homology and complicating systematic study. While it is undeniable that phages harbor genuine innovations, we argue that a significant fraction of these seemingly novel proteins are in fact distant homologs of known structural components whose functions can now be recognized.

For decades, progress in resolving phage structure was slowed by the limits of crystallography and the difficulty of analyzing large, flexible assemblies. Recent advances in cryo-electron microscopy (cryo-EM) and computational structure determination methods have made it possible to analyze intact virions at near-atomic resolution, leading to detailed structural models of some tailed phages. These include a number of siphophages infecting Proteobacteria such as lambda, where Wang et al. (2024) resolved the distal tail region, Ge & Wang (2024) demonstrated conserved and flexible elements across related phages, and Xiao et al. (2023) linked structural differences to host specificity. Chen et al. (2025) extended this to T1, while structures of Chi (Sonani et al., 2024), JBD30 (Valentová et al., 2024), R4C (Huang et al., 2023), and gene transfer agent RcGTA (Bárdy et al., 2020) further expanded the spectrum of tail-related assemblies. The T5 system has been dissected in particular depth, beginning with receptor-binding proteins solved by crystallography (Flayhan et al., 2014), followed by cryo-EM analyses of the distal tail complex (Degroux et al., 2023; van den Berg et al., 2022), and culminating in studies linking structural transitions to DNA ejection (Linares et al., 2023). Ayala et al. (2023) extended these insights to DT57C, highlighting how closely related phages can diversify to infect different E. coli strains.

These structural achievements, while transformative, now serve primarily as anchoring points for computational biology. The arrival of AlphaFold (Jumper et al., 2021) and a myriad of subsequent structure prediction methods has moved the resource-intensive and laborious task of protein fold determination into the computational realm. What once required years of experimental work can now be explored at scale, with cryo-EM acting as a reference framework for validation rather than a universal requirement. Our in-depth examination of solved phage structures has revealed general organizing principles of infection complexes, enabling us to detect related architectures even in phage genomes lacking structural data. This shift from experimental to computational diagnosis of molecular function mirrors a broader trend familiar from medicine, where increasingly accurate and precise diagnostics have historically laid the foundation for disease classification and culminated in codified systems such as the International Classification of Diseases. In phage research, by contrast, our ability to classify pathogens of bacteria remains comparatively young, but the same lesson applies: robust diagnostic methods - whether structural, molecular, or computational - are indispensable for developing a reliable and meaningful phage taxonomy.

Our study builds directly on this foundation. By performing a large-scale computational analysis anchored in the experimentally determined structures, we systematically explored the limits of phage diversity at the structural and functional levels. Through in-depth computational analysis, protein sequence and structure comparisons, and protein structure predictions, we identified major organizational themes in infection complexes and demonstrated that these themes recur across highly divergent phage protein families. This analysis not only increases the fraction of phage proteins with plausible functional annotation but also provides a unifying framework for describing phage mosaic architecture and constraints on gene exchange. Most importantly, it shows that computational molecular biology has matured to the point of defining paradigmatic structural and functional solutions in phages at scale. In doing so, our work pioneers the transition from individual structural case studies to a comprehensive computational exploration of phage diversity, establishing a foundation for systematic discovery of the repeating themes that are exhibited by the most abundant and diverse biological entities on Earth.

In this study, we examine tail tip structures in phages that infect Proteobacteria hosts (since it is not yet widely accepted, we do not use the recently proposed term Pseudomonadota for this phylum-level classification). Our collection of siphophage genomes (~42,000 genes in total) encompasses phages infecting hosts from 40 bacterial genera across three classes - Alpha-, Beta-, and Gammaproteobacteria. We recognize that it does not exhaustively represent the full diversity of Proteobacteria phages in nature, as sequencing efforts are biased toward phages of bacteria that are pathogenic to humans, livestock, or crops. In addition, many Proteobacteria host genera remain unstudied. Nevertheless, we are confident that the size and diversity of our panel make it a valuable resource for systematic exploration of phage diversity.

First, we demonstrate that these diverse siphophage tail tips can be classified into distinct “tail tip types” that have not been previously delineated and examine these types in detail. Second, we show that, using HMM analysis (Finn et al., 2011), PSI-BLAST comparisons (Altschul et al., 1997), and AlphaFold3 (Abramson et al., 2024) structure predictions, all core proteins of the tail tip components in a large panel of Proteobacterial siphophages can be unambiguously identified, allowing the nature of their diversity to be understood. This approach substantially increases the number of encoded proteins with predicted functions and consequently reduces the number of genes that might otherwise be perceived as potentially novel. Third, we describe the mosaic organization of these tail tip components and the apparent constraints on the free exchange of genes among phages. Finally, we propose a streamlined universal nomenclature for siphophage tail tip proteins, which should facilitate future analyses and make this area of research more accessible to readers beyond the immediate field.

2. Results

2.1. Characterizing conserved tail tip features of long-tailed phages

Long-tailed bacteriophages, including siphophages and myophages infecting monoderm and diderm hosts, share important structural and functional similarities in their tail tips. The primary role of the tail tip complex is to make way for successful genome delivery into the bacterial host cell. To achieve this, the tail tip must perform a series of essential functions - recognizing a susceptible bacterium, adhering to its surface, and penetrating its cell envelope to create a conduit for phage DNA ejection.

Despite employing distinct infection strategies, myophages and siphophages face similar challenges imposed by the complex organization of bacterial cell envelopes. In diderm bacteria, the envelope consists of an inner cytoplasmic membrane, a thin peptidoglycan layer, and an outer membrane rich in lipopolysaccharides that limits macromolecular entry. By contrast, monoderm bacteria possess a single cytoplasmic membrane surrounded by a thick, highly cross-linked peptidoglycan layer embedded with teichoic acids. These structural differences create unique mechanical and biochemical barriers for phage infection, requiring distinct evolutionary adaptations.

Myophages, which possess contractile tails, penetrate the cell envelope by a forceful mechanical process: contraction of the tail sheath drives the tail tube through the cell envelope layers, enabling direct DNA injection into the cytoplasm. Siphophages, which lack a contractile sheath, employ alternative strategies. Those infecting diderm bacteria exploit existing outer membrane porins such as FhuA or LamB as docking sites and utilize enzymatic tail-associated proteins to locally degrade the peptidoglycan, allowing more passive translocation through the envelope. In contrast, siphophages infecting monoderm bacteria rely on enzymatic and structural adaptations tailored to breaching their hosts' thick cell walls. Despite the divergent approaches, both phage types must solve the same fundamental problem - overcoming the multilayered bacterial envelope - to ensure efficient genome delivery.

Across long-tailed phages, there are key functional modules that are consistently identifiable in the tail tip complex: the Distal Tail (DT), Tail Hub (TH), and Central Fiber (CF)/central spike components. These represent the most conserved architectural elements among diverse long-tailed phages. The DT, typically exhibiting a conserved Tail Tube (TT)-fold, acts as a structural “strap belt” linking the tail shaft and tip components. In addition to this anchoring role, the DT often carries auxiliary domains that function as a “utility belt,” providing docking surfaces for enzymatic or receptor-binding elements. The TH serves as a structural hub, coordinating the transition between the 3-fold symmetry of the tail tip and the 6-fold symmetry of the tail shaft, ensuring mechanical stability and symmetry alignment during infection. The CF elements extend outward from the TH, forming fibers that participate in host recognition and attachment.

In siphophages infecting Gram-positive bacteria, these modules are represented by homologous yet specialized proteins: the DT is often referred to as a Dit-like protein, while the fused TH–CF component corresponds to a Tal-like protein. Figure 1(A) highlights these structural modules in three exemplar phages - a Gram-negative siphophage (lambda), a Gram-positive siphophage (phi80alpha), and a myophage (Mu). Figure 1(B) provides a closer view of the hub junctions (DT–TH interfaces) in these phages, illustrating the remarkable conservation of the tail tip’s core architecture across the long-tailed phage lineages.


Fig.1. Conserved tail tip components in long-tailed phages:
(A) tail tip and (B) tail hub junction.

2.2. Establishing a Representative Database of Proteobacteria Siphophage Genomes

Our approach to investigating phage diversity follows a strategy previously demonstrated to be highly informative: examining limited, representative panels of genome sequences in depth rather than attempting to survey all known phage genomes simultaneously. Such focused “slices” of extant diversity provide tractable and biologically coherent datasets that allow identification of conserved patterns and facilitate extrapolation to the broader phage universe.

In this study, we concentrate on siphophages infecting members of the Proteobacteria clade, a major group of Gram-negative hosts. To achieve this, we assembled a comprehensive panel of Proteobacteria-infecting siphophage genomes through the following process:

1. Initial dataset construction – We began with all siphophages listed in the ICTV Master Species List 2020, the last formal classification release in which tail morphology served as a defining taxonomic criterion.

2. Database integration – These entries were merged with siphophages from three complementary sources: (i) our Lambdoid phage panel (Zheng et al. 2025), (ii) the small virulent Enterobacteriales siphophage panel (Casjens et al. 2022), and (iii) our in-house PAT database (Buttner et al. 2016).

3. Curation and filtering – From this merged list, we selected approximately 500 siphophages with complete or nearly complete genome sequences, applying several curation criteria: (i) Only phages infecting Proteobacteria were retained; prophage sequences were excluded to ensure the inclusion of functional, virion-producing phages; (ii) Genomes exhibiting numerous frameshifts or assembly artifacts that disrupted conserved structural genes were excluded; (iii) To reduce redundancy, near-identical genomes were culled; and (iv) The final dataset retained well-characterized model phages, including lambda, phi80, 21, 434, N15, PY54, T1, T5, DT57C, R4C, Chi, JBD30, and D3, to anchor the panel in existing experimental knowledge.
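The curation criteria in step 3 can be read as a sequence of filters. The following is a minimal sketch under stated assumptions, not the authors' actual pipeline: the `PhageRecord` fields, the `cluster_id` label for near-identical genomes, and the `max_disrupted` threshold are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Model phages named in the text; they are preferred when culling duplicates.
MODEL_PHAGES = {"lambda", "phi80", "21", "434", "N15", "PY54", "T1",
                "T5", "DT57C", "R4C", "Chi", "JBD30", "D3"}


@dataclass(frozen=True)
class PhageRecord:
    """Hypothetical per-genome summary; field names are assumptions."""
    name: str
    host_phylum: str
    is_prophage: bool
    disrupted_structural_genes: int  # frameshifts/artifacts in conserved genes
    cluster_id: str                  # label shared by near-identical genomes


def curate(records, max_disrupted=0):
    """Apply criteria (i)-(iv): host, prophage, assembly quality, redundancy."""
    kept = {}
    for rec in records:
        if rec.host_phylum != "Proteobacteria":   # (i) host restriction
            continue
        if rec.is_prophage:                       # (i) exclude prophages
            continue
        if rec.disrupted_structural_genes > max_disrupted:  # (ii) quality
            continue
        current = kept.get(rec.cluster_id)
        # (iii) one genome per cluster; (iv) a model phage wins the slot
        if current is None or (rec.name in MODEL_PHAGES
                               and current.name not in MODEL_PHAGES):
            kept[rec.cluster_id] = rec
    return sorted(kept.values(), key=lambda r: r.name)


example = [
    PhageRecord("lambda_variant", "Proteobacteria", False, 0, "c1"),
    PhageRecord("lambda", "Proteobacteria", False, 0, "c1"),
    PhageRecord("prophage_x", "Proteobacteria", True, 0, "c2"),
    PhageRecord("grampos_y", "Firmicutes", False, 0, "c3"),
    PhageRecord("broken_z", "Proteobacteria", False, 4, "c4"),
    PhageRecord("T5", "Proteobacteria", False, 0, "c5"),
]
curated_names = [r.name for r in curate(example)]
print(curated_names)
```

In practice the redundancy and quality steps would rest on genome comparisons and manual inspection, as the text emphasizes; the sketch only shows the order in which the criteria could be applied.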

The resulting curated panel of 436 siphophages (Table S1) provides broad coverage of Proteobacterial siphophage diversity while remaining sufficiently compact to allow detailed manual examination. Manual curation was essential given the pervasive annotation inaccuracies in many deposited phage sequences. Analysis of this curated panel revealed that 429 of the 436 phages encode tail tip protein sets exhibiting conserved genetic organization and predicted structural motifs consistent with the canonical DT–TH–CF module arrangement described above. Only seven phages (listed in Table S1, towards the bottom of the list) showed substantial deviations from this pattern. Although their tips are not unrelated to the tips in the other 429 phages (e.g., they have DT and CF proteins), these seven have substantial gene differences, and since no tip structures are available, detailed functional and structural predictions cannot be made for all their tail tip proteins. These seven phages will not be discussed further here and will be considered in a future publication.

Accordingly, the final analysis panel described below comprises 429 Proteobacteria-infecting siphophages (listed in Table S1, towards the top of the list). This curated dataset represents a robust and balanced sample of extant Proteobacteria siphophage diversity, providing a justified foundation for assessing the conserved features of their tail tip structures.

2.3. Modeling a minimal tail tip in Proteobacteria infecting siphophages

Six proteins, or functional components (four main and two accessory), are usually required to build siphophage tail tip structures in the few cases studied genetically (not including receptor-binding proteins (RBPs) encoded by separate genes). The lack of a standard nomenclature for these proteins has seriously hindered access to this research field. We suggest the following short names for these proteins, use them in this report, and recommend that they be used in the future: main - TM (tape measure protein), DT (distal tail protein), TH (tail hub protein), and CF (central fiber protein); accessory - TNLP (tail NlpC/P60-like protein) and THI (tail hub internal protein). Additional sporadically present “accessory” tail tip genes are also discussed. Fig.2A shows a diagrammatic depiction of a minimal typical Proteobacterial siphophage tail tip with the locations of the six proteins indicated. These six protein types are encoded by a cluster of genes whose order in the genome is essentially constant; some phages may lack one of these genes or have multiple functions fused in a single gene. This uniform gene order is the phage lambda order shown in Fig.2B.


Fig.2. Minimal tail tip: (A) tail tip model and (B) conserved gene order.

2.4. Nomenclature and Functional Definitions

To facilitate cross-group comparison and standardization of terminology, we propose a concise nomenclature for the principal structural proteins of siphophage tail tips. This terminology follows established usage where possible, while introducing distinctions necessary to differentiate between tail tip systems of myophages, diderm siphophages, and monoderm siphophages.

DT (Distal Tail protein, diderm type) – The term “distal tail” has been used broadly in the phage literature to describe proteins located near, but not at the extreme end of, the tail tip. DT proteins form a structurally conserved module that links the tail shaft to the tip complex and frequently include auxiliary domains involved in receptor binding or cell wall interaction.

TH (Tail Hub protein, diderm type) – The hub represents the structural junction where the sixfold symmetry of the tail shaft cylinder transitions to the threefold symmetry characteristic of the distal tail cone. This definition parallels the “hub” used in myophage studies and corresponds to the same architectural role in monoderm siphophages. TH proteins often contain multiple stacked domains and sometimes merge structurally with CF-like regions to form an integrated tail tip core.

CF (Central Fiber protein) – The term “central fiber” has a long history in phage structural literature, and we retain it here for continuity. Although many siphophages lack an obvious fibrous appendage, the CF protein forms a central axial component of the tail tip. Its most conserved domains participate directly in forming the hub structure rather than extending as an external fiber. Thus, the CF should be regarded as an internal “hub-derived” element rather than a true external appendage in most diderm siphophages.

In addition to these three conserved components, our comparative analyses revealed two additional proteins that appear unique to diderm siphophage tail tips:

THI (Tail Hub Internal protein) – Found intercalated between TH and CF genes, this small structural component may stabilize the hub complex or mediate specific protein–protein interactions within the tail tip assembly.

TNLP (Tail NlpC/P60-like protein) – Originally named to reflect its predicted catalytic domains, this protein is found adjacent to DT–TH clusters in some phages. The precise naming may be revisited following experimental validation of its activity.

2.5. Delineating the main tail tip types

Our analysis of the 436 siphophage genomes infecting Proteobacteria revealed that 98.4% possess tail tips apparently built upon the same underlying architectural principles. This broadly conserved design represents the overwhelmingly predominant type of extant Proteobacteria-infecting siphophage tail tip. Only seven phages (1.6%) could not be readily assigned to any of the seven main tail tip types defined below. Although these outliers encode DT and CF proteins that retain the universal core domains, their tail tip gene clusters diverge substantially from the canonical organizations observed in the main types. Understanding their construction will likely require additional structural data and will be addressed in a separate study.

The first major conclusion from our analysis is that similar general tail tip architectures and functions occur across the vast majority (429 of 436) of the Proteobacteria siphophages. Having established a representative database of 429 Proteobacteria-infecting siphophages, we next sought to define the principal groups of tail tip architectures present among these phages. Our goal was to determine which structural modules are conserved and how they vary across the diverse set of siphophages infecting Proteobacteria. To this end, we combined hidden Markov model (HMM) - based homology searches with genomic synteny and gene content analysis to classify the main types of tail tip organization.

This two-step approach - first HMM-based clustering and then structural-genetic refinement - proved to be both efficient and biologically meaningful. The HMM step captured the evolutionary relationships among distantly related phage tips, while the genomic context analysis helped resolve ambiguous or mosaic cases that arose from horizontal gene transfer or recombination events. Together, these methods provided a consistent framework for defining the main tail tip types.

We began by using HMMs representing the distal tail (DT) and tail hub (TH) proteins, as these two components constitute the conserved core of all known long-tailed phage tips. The initial grouping was based primarily on the TH proteins, since most of the HMMs corresponding to these were already available from the Pfam database, providing a convenient and reproducible foundation for defining groups. HMM searches were performed against all proteins in the curated phage panel, allowing us to identify clusters of homologous TH and DT sequences and to delineate the largest, most coherent tail tip groups. This initial grouping separated tail tips based on TH comparisons into four major branches: Lambda-like, D3-like, MP22-like and PY54-like. Once these major branches were defined by sequence similarity, we refined their boundaries and subdivided them further (the MP22-like branch into three groups: MP22, KL1, and øCbK; the PY54-like branch into two groups: PY54 and T5) through comparative genomic analyses - examining local synteny, morphogenetic structure, and the presence or absence of accessory tail tip-associated genes.
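As an illustration of the first grouping step, the sketch below assigns each protein to its best-scoring TH profile from a HMMER `hmmsearch --tblout` table. It is a minimal example under assumptions, not the procedure used in the study: the profile names (`TH_lambda_like`, `TH_D3_like`), the protein identifiers, the sample rows, and the E-value cutoff of 1e-5 are all invented for demonstration.

```python
def parse_tblout(lines):
    """Yield (protein, profile, evalue, score) from hmmsearch --tblout rows.

    Columns used: 0 = target name, 2 = query (profile) name,
    4 = full-sequence E-value, 5 = full-sequence bit score.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(maxsplit=18)
        yield fields[0], fields[2], float(fields[4]), float(fields[5])


def best_profile_per_protein(lines, max_evalue=1e-5):
    """Map each protein to the profile giving its highest bit score."""
    best = {}
    for protein, profile, evalue, score in parse_tblout(lines):
        if evalue > max_evalue:  # assumed significance cutoff
            continue
        if protein not in best or score > best[protein][1]:
            best[protein] = (profile, score)
    return {protein: profile for protein, (profile, _) in best.items()}


# Invented rows in tblout layout (18 fields followed by a description).
sample = """\
# target name  accession  query name  accession  E-value  score  bias ...
protA - TH_lambda_like - 1e-50 170.2 0.1 2e-50 169.5 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
protA - TH_D3_like - 1e-12 45.0 0.0 2e-12 44.1 0.0 1.0 1 0 0 1 1 1 1 hypothetical protein
protB - TH_D3_like - 3e-40 140.8 0.2 5e-40 140.0 0.2 1.0 1 0 0 1 1 1 1 tail protein
protC - TH_lambda_like - 0.5 8.3 0.0 0.9 7.5 0.0 1.0 1 0 0 1 1 0 0 phage protein
""".splitlines()

branch_of = best_profile_per_protein(sample)
print(branch_of)
```

A best-hit assignment of this kind only gives a first-pass grouping; as described above, group boundaries were then refined using synteny and gene content.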

Therefore, comparative sequence analysis using PSI-BLAST and HMMs of the core (DT, TH, CF) and auxiliary (TNLP and THI) tail tip components naturally organizes these 429 phages into seven coherent and internally consistent groups, each represented by one of the following phages: lambda, D3, MP22, KL1, øCbK, PY54, and T5. For each group, phages sharing a lambda-like DT, for example, also share lambda-like TH and CF core proteins and lambda-like auxiliary proteins, TNLP and THI. No instance was observed where a phage combined a lambda-like DT with a non-lambda-like version of another tip component, confirming the congruence of relationships across all core proteins within each group. This consistency indicates that the sequence relationships among tail tip proteins are sufficient to delineate these robust tip types within our panel.
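The congruence test described here amounts to checking that, within each phage, every tip component that could be typed carries the same type label. A minimal sketch follows, assuming per-component type calls are already available as a nested dictionary; the phage names and the mixed example are invented, since the text reports that no mixed case occurs in the panel.

```python
COMPONENTS = ("DT", "TH", "CF", "TNLP", "THI")


def incongruent_phages(assignments):
    """Return phages whose tip components carry more than one type label.

    `assignments` maps phage name -> {component: type label or None};
    None marks a component that is absent or could not be typed.
    """
    mixed = {}
    for phage, calls in assignments.items():
        labels = {calls.get(c) for c in COMPONENTS} - {None}
        if len(labels) > 1:
            mixed[phage] = sorted(labels)
    return mixed


# Toy input: two congruent phages and one invented mosaic for contrast.
toy_calls = {
    "phage_1": {"DT": "lambda", "TH": "lambda", "CF": "lambda",
                "TNLP": "lambda", "THI": "lambda"},
    "phage_2": {"DT": "T5", "TH": None, "CF": "T5",
                "TNLP": None, "THI": None},
    "invented_mosaic": {"DT": "lambda", "TH": "D3", "CF": "lambda",
                        "TNLP": None, "THI": None},
}
mixed_calls = incongruent_phages(toy_calls)
print(mixed_calls)
```

An empty result for a real panel would correspond to the congruence reported in the text.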

Additional support for these groupings comes from gene content and organization within the tail tip clusters. These features corroborate the sequence-based classification and further justify separating the MP22-like and KL1-like tips, yielding seven total tip types. Specifically, TH and THI domains occur in distinct gene contexts in MP22-like versus KL1-like phages, while the tandem DT domain duplication found in a few D3-like phages was not used to define a separate category, as all other features of these tips remain D3-like.

Together, these observations define seven discrete and unambiguous Proteobacteria siphophage tail tip types. This framework greatly simplifies comparative analysis of siphophage tail architecture and provides a unified reference for future functional and structural studies. Table 1 summarizes the detailed characteristics of each of the seven main tail tip types.

Table 1. Summary of the main tail tip types.


1. Lambda-like tips. DT proteins have single domains forming a hexameric ring. TH proteins are encoded by separate genes and contain an HDI N-terminal domain (NTD) and a [4Fe–4S]²⁺-binding C-terminal domain (CTD). TNLP proteins possess N-terminal Prok-JAB and C-terminal NlpC/P60 protease domains. THI proteins are encoded as separate genes. CF cores contain an OB domain inserted into the HDIV domain. In phage lambda itself, receptor binding is mediated by the C-terminal domain of the CF protein, although this has not been characterized in other lambda-like members.

2. D3-like tips. DT proteins are single-domain proteins forming hexameric rings, though three phages in this group exhibit two β-sandwich domains that likely assemble as trimers rather than hexamers. TH proteins resemble those in lambda-like tips, with HDI NTD and [4Fe–4S]²⁺-binding CTD. THI sequences are not encoded separately but are fused to the N-termini of CF proteins. TNLP proteins carry only an NlpC/P60 domain. CFs resemble those of lambda-like phages, with OB insertions in HDIV, but uniquely include additional lambda CF HDII-Ins-like domains in which β-strands are discontinuous in sequence. A tailspike protein bound to the DT ring, as seen in phage D3, may serve as the adsorption apparatus in this group.

3. MP22-like tips. DT proteins consist of two β-sandwich domains forming a trimeric rather than hexameric ring. THs are encoded separately and have the characteristic HDI NTD and [4Fe–4S]²⁺-binding CTD. These phages lack a TNLP gene. THI proteins occur as separate genes. CFs lack an OB domain within HDIV but include one within the N-terminal FNIII domain. Phage JBD30 carries both a pilus-binding protein attached to the DT ring and a possible O-antigen-binding protein bound distally to the CF (Valentová et al. 2024).

4. KL1-like tips. DT proteins have two β-sandwich domains predicted to assemble as trimers. TH proteins, encoded by separate genes, contain HDI NTDs and [4Fe–4S]²⁺-binding CTDs. TNLP genes are absent. THI sequences are fused to the N-termini of CFs. CFs lack OB insertions in HDIV but contain an OB domain within the N-terminal FNIII-type domain, forming two distinct CF subbranches apart from the MP22-like group. These phages encode receptor-binding proteins homologous to JBD30 gp47/gp48.

5. øCbK-like tips. DT proteins are single-domain proteins forming hexameric rings. TH proteins are encoded separately and have the HDI NTD and [4Fe–4S]²⁺-binding CTD. TNLPs possess only an NlpC/P60 domain. THI sequences are fused to CF N-termini. CFs lack an OB insertion in HDIV but include one in an N-terminal FNIII-like domain and have large insertions between HDII and HDIV. In phage R4C, a JBD30 hp54-like receptor-binding protein attaches to the distal CF tip. Notably, this tip type occurs in phages spanning a wide range of genome sizes - from 36 Kbp in R4C to 322 Kbp in CcrBL9, the largest siphophage identified to date.

6. PY54-like tips. DTs are single-domain hexameric proteins. THs, encoded separately, contain HDI domains but lack the [4Fe–4S]²⁺-binding CTD. TNLPs have only an NlpC/P60 domain. These phages have no separate THI gene, and it remains unclear whether the CF N-termini fulfill this role. CFs lack OB domains and have unique N-terminal architectures. Instead, an OB domain is inserted in the DT. The mode of receptor binding remains unknown.

7. T5-like tips. DT proteins have single domains forming a hexameric ring. An additional TTMP protein creates a trimeric ring between the TTP and DT layers, serving as an anchor for side tail fibers. These phages lack TH, TNLP, and THI genes. Linares et al. (2023) proposed that the C-terminal region of TM performs the THI-like function in phage T5, forming a pore in the host outer membrane. The phylogenetic proximity between PY54- and T5-like tips raises the possibility of similar mechanisms, although no sequence similarity has been detected between their CF C-termini. CFs lack OB domains and uniquely carry an HDI domain fused to their N-termini. As in PY54-like tips, an OB domain is inserted in the DT. The T5 CSF and receptor-binding domains are encoded by two separate genes.

Together, these seven types capture the structural and genetic diversity of Proteobacteria siphophage tail tips and provide a coherent framework for interpreting their modular evolution and host interaction mechanisms.
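For readers who want to query these descriptions programmatically, the seven types can be encoded as a small lookup table. The sketch below records only features stated in the list above; the field names (`dt_ring`, `th_fes_ctd`, `tnlp`, `thi`, `cf_ob`) and the helper `types_where` are hypothetical, and group-level exceptions (such as the three two-domain D3-like DTs) are deliberately not represented.

```python
# Field meanings (names are illustrative assumptions):
#   dt_ring    - stated or predicted DT ring symmetry for the group
#   th_fes_ctd - TH has a [4Fe-4S]-binding CTD (None = no TH gene)
#   tnlp       - TNLP domains (None = no TNLP gene)
#   thi        - how THI is encoded ("absent" = no separate THI gene)
#   cf_ob      - location of the OB domain in CF (None = CF lacks OB)
TIP_TYPES = {
    "lambda": dict(dt_ring="hexamer", th_fes_ctd=True,
                   tnlp="Prok-JAB + NlpC/P60", thi="separate", cf_ob="HDIV"),
    "D3":     dict(dt_ring="hexamer", th_fes_ctd=True,
                   tnlp="NlpC/P60", thi="fused to CF", cf_ob="HDIV"),
    "MP22":   dict(dt_ring="trimer", th_fes_ctd=True,
                   tnlp=None, thi="separate", cf_ob="FNIII"),
    "KL1":    dict(dt_ring="trimer", th_fes_ctd=True,
                   tnlp=None, thi="fused to CF", cf_ob="FNIII"),
    "øCbK":   dict(dt_ring="hexamer", th_fes_ctd=True,
                   tnlp="NlpC/P60", thi="fused to CF", cf_ob="FNIII"),
    "PY54":   dict(dt_ring="hexamer", th_fes_ctd=False,
                   tnlp="NlpC/P60", thi="absent", cf_ob=None),
    "T5":     dict(dt_ring="hexamer", th_fes_ctd=None,
                   tnlp=None, thi="absent", cf_ob=None),
}


def types_where(**criteria):
    """List tip types (in table order) whose features match all criteria."""
    return [name for name, features in TIP_TYPES.items()
            if all(features[key] == value for key, value in criteria.items())]


print(types_where(tnlp=None))            # types without a TNLP gene
print(types_where(dt_ring="trimer"))     # types with trimeric DT rings
print(types_where(thi="fused to CF"))    # types with THI fused to CF
```

Such a table makes it easy to see, for example, which features co-occur across types, but it flattens nuance that the prose descriptions retain.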

The HMM-based classification thus provides a robust first-order definition of the principal tail tip types among diderm siphophages, with the DT–TH–CF triad forming the structural and evolutionary core. Additional proteins such as THI and TNLP define subtype-specific elaborations that may correlate with host range or infection strategy. In subsequent analyses, we will describe how variations in gene synteny and domain architecture within these groups illuminate the structural logic of siphophage tail tips and support the proposed nomenclature framework for tail tip components.

2.6. Comparative anatomy of tail tip components

2.6.1. DT - its role and diversity

A DT protein is made by all 429 phages in our panel. It forms a ring at the distal end of the tail tube, where it bridges the cylindrical tail shaft and the conical tail tip structures. DTs can also serve as a “utility belt” that provides attachment sites for receptor-binding proteins. The DT core beta sandwich fold is similar to that of the tail shaft subunit of the major tail tube (TT) protein in all panel phages.

The D3-like DT sequence type group in our panel has single-domain DTs with the exception of three phages that have two core structural domains, JWX, BcepGomr and 83-24, all of which (uniquely in this group) infect Betaproteobacteria hosts. In this case the two domains are more closely related to one another and more distantly related to those of the MP22-like phages, suggesting that this duplication occurred independently of the MP22 group. This DT duplication apparently occurred within the D3 group after initial divergence of its members.

Panel phage DTs often have “extra” domains compared to the minimal DT exemplified by the lambda gpM protein. Fig.3. shows examples of such domains, including the short C-terminal extension in phage D3, a moderately large N-terminal domain (NTD) in phages Chi and JBD30 (not shown), and an insert in a loop between beta strands of the core domain in MP22 (not shown), PY54, øCbK and T5. The latter inserts in the MP22-like and T5-like phage DT structures have an OB fold. The MP22-like DTs are particularly variable regarding extra domains. Other examples are RcSpartan, which has a 210 AA insert between the two core domains, Soft, which has a 485 AA NTD, and BcepNazgul, which has a 438 AA C-terminal domain (CTD) (Figure S1 shows their AF3 ribbon diagrams). The roles of the DT “extra” domains have mostly not been studied; however, the short ~35 AA CTD of the phage D3 DT protein appears to serve as the anchor point for its gp27 tailspike, while the ~174 AA NTD of the phage JBD30 DT binds gp47-gp48 heterodimers that have also been proposed to be involved in adsorption. The tail tip structures of phages D3, R4C, Chi and JBD30 and our AF3 predictions for many other DTs show that extra material at any of the above sites is positioned on the outside of the tail tip structure, where it should not block assembly or clog the tail lumen. It is not known whether some of the larger extra DT domains are directly involved in receptor binding.

The T5-like phages have an extra ring of subunits (tail tip middle protein, called TTMP in phage DT57C and p140 in phage T5) between the tail tube (TT) and the DT ring that is not present in other panel phage tail tip types. This ring is made up of three TTMP subunits, each with two beta-sandwich core domains. It is in turn surrounded by a ring of twelve p132 proteins that serves as the anchor for the tail’s long side fibers (Linares et al. 2023).

We note that the T5-like and øCbK-like tail tubes have rings of three two-domain TTs and six one-domain DTs, while MP22-like phages have TT rings of six one-domain proteins and DT rings of three two-domain proteins, and in lambda-, D3- and PY54-like phages both rings are made up of six one-domain proteins. There is no apparent correlation between the multiplicities (six or three) of the tail tube and DT rings.
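The ring compositions listed above can be tabulated to make the comparison explicit. The following is a minimal Python sketch and not part of the original analysis: the names `RINGS`, `domains_per_ring` and `ring_summary` are hypothetical, the counts are transcribed from the preceding paragraph, and the three D3-like phages with two-domain DTs are not modeled. Under these counts, each ring contributes six core domains whether it is built from three two-domain subunits or six one-domain subunits.

```python
# Ring composition per tail tip type, transcribed from the text as
# (subunits per ring, core domains per subunit). All names are illustrative.
RINGS = {
    "T5-like":     {"TT": (3, 2), "DT": (6, 1)},
    "øCbK-like":   {"TT": (3, 2), "DT": (6, 1)},
    "MP22-like":   {"TT": (6, 1), "DT": (3, 2)},
    "lambda-like": {"TT": (6, 1), "DT": (6, 1)},
    "D3-like":     {"TT": (6, 1), "DT": (6, 1)},
    "PY54-like":   {"TT": (6, 1), "DT": (6, 1)},
}


def domains_per_ring(subunits: int, domains_per_subunit: int) -> int:
    """Total number of core domains contributed by one ring."""
    return subunits * domains_per_subunit


def ring_summary(rings=RINGS):
    """Map each tip type to the core-domain count of its TT and DT rings."""
    return {
        tip: {ring: domains_per_ring(*comp) for ring, comp in parts.items()}
        for tip, parts in rings.items()
    }


for tip, counts in ring_summary().items():
    print(f"{tip}: TT ring = {counts['TT']} domains, DT ring = {counts['DT']} domains")
```

The subunit multiplicity (three or six) therefore varies between rings and tip types, while the number of core domains per ring stays the same in this tabulation.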

Figure 3.

Fig.3. DT representative structures.

2.6.2. TH - its role and diversity

All panel phages encode another beta sandwich fold domain that is similar to the TT and DT core folds. It is usually encoded as the NTD of the TH proteins; however, in the T5-like phages it is present as the CF NTD and there is no TH gene. This domain has been called the HDI domain. In all panel phages, it alternates with the similarly folded CF domain HDIV to form a six-member ring of domains immediately below the DT protein ring in the tip structure. The panel phage HDI domains fall into six major sequence types by PsiBLAST analyses and four types by HMM analysis. The PsiBLAST types are exemplified by phages lambda, D3, MP22, øCbK, PY54 and T5. PsiBLAST separates the HDI domains of the T5-like and PY54-like phages into two robust self-contained subgroups (within a single more inclusive HMM group), and similarly separates the øCbK-like and MP22-like HDIs into robust PsiBLAST subgroups (also within a single HMM group). The øCbK-like and PY54-like TH proteins appear to have diverged less from their MP22-like and T5-like relatives, respectively, than the other groups. Thus, the TH sequence types parallel the above DT types perfectly, if the two above examples of differential divergence rates are taken into account.
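The relationship between the two groupings described above (six PsiBLAST types nested within four HMM groups) can be expressed as a partition refinement check. This is a minimal Python sketch under stated assumptions: the function and variable names are hypothetical, and placing lambda and D3 in their own singleton HMM groups is inferred from the statement that HMM analysis yields four types, not taken from the underlying data.

```python
# HDI sequence-type groupings as described in the text. The singleton HMM
# groups for lambda and D3 are an assumption inferred from "four types by HMM".
PSIBLAST_TYPES = ["lambda", "D3", "MP22", "øCbK", "PY54", "T5"]
HMM_GROUPS = [
    {"lambda"},
    {"D3"},
    {"MP22", "øCbK"},
    {"PY54", "T5"},
]


def hmm_group_of(psiblast_type, hmm_groups=HMM_GROUPS):
    """Return the index of the single HMM group containing a PsiBLAST type."""
    matches = [i for i, group in enumerate(hmm_groups) if psiblast_type in group]
    if len(matches) != 1:
        raise ValueError(f"{psiblast_type!r} is in {len(matches)} HMM groups")
    return matches[0]


def is_refinement(fine_types, coarse_groups):
    """True if each fine type sits in exactly one coarse group and every
    member of the coarse groups is one of the fine types."""
    try:
        assigned = {t: hmm_group_of(t, coarse_groups) for t in fine_types}
    except ValueError:
        return False
    return set().union(*coarse_groups) == set(assigned)


print(is_refinement(PSIBLAST_TYPES, HMM_GROUPS))
```

A check of this kind makes it easy to see that the two methods differ only in resolution: PsiBLAST splits two of the HMM groups further, and neither method moves a type across a group boundary drawn by the other.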

Ribbon diagrams of prototypes of the major TH sequence type groups in Fig.4A. show that their HDI folds are very similar. Like the DTs, THs can also have variable “extra” domains. Purified TH from lambda-like phage N15 was found to contain a [4Fe–4S]²⁺ iron-sulfur cluster. Four cysteine side chains (dark orange in Fig.4A.; gold in Fig.4B.) coordinate the Fe ions. This domain is visualized in the virion structures of lambda, JBD30 and R4C, where it lies on the outside of the conical tail tip structures. However, this domain is not universally present and is replaced by a relatively short extended CTD polypeptide chain in D3 TH (by cryoEM structure) and PY54 TH (by AF3 prediction); recall that T5-like phages have no separate TH. Curiously, the MP22-like TH group has an OB domain (magenta in Fig.4A.) inserted into the small Fe-containing domain. In addition, MP22-like phage MW2 and several others in that group have a 204 AA domain inserted into a loop between beta strands of the HDI domain (gray in Fig.4B.), the øCbK-like group has a large 307 AA region that includes an OB domain and another domain inserted into the Fe-containing domain (Fig.4C.), and Kp3 has a 173 AA NTD (not shown). The roles of these extra TH domains, including the Fe-containing domains, are not known, but they could in theory participate in stabilizing virion structure, in virion adsorption directly, or in anchoring receptor-binding proteins.

Curiously, a subset of three PY54-like phages, Loki, Halfdan and IMEAB3, have a second TNH-like gene adjacent to the TH gene; its product also has a predicted beta-sandwich HDI-like domain and a large >700 AA extra domain at its N-terminus (phage Loki gp17 shown in Fig.4D.). The function of these proteins and their extra domain is unknown (they may play a role in receptor binding), but the presence of an HDI-like domain suggests they might participate as a unique “accessory” tail shaft ring.

Figure 4.

Fig.4. TH representative structures.

2.6.3. TNLP - its role and diversity

Phage lambda gpK (TNLP) was detected in virions by Wang et al.’s (2024) proteomic analysis, but was not seen in their cryoEM reconstruction of the tail tip or in any previous analysis of lambda virion proteins. Huang et al. (2023) suspected that 1-3 molecules of unmodeled TNLP (gp15) may be present in the distal end of the phage R4C tail shaft lumen. The phages with MP22- or T5-like DTs and THs (above) do not encode a TNLP.

Most of the panel phage TNLPs are single NlpC/P60 domain proteins; however, those in the lambda-like group also have an N-terminal Prok-JAB domain (uniquely, phage 9A appears to have only the latter). PsiBLAST analysis recognizes four sequence types of panel phage TNLP NlpC/P60 domains, typified by those of phages Lambda, D3, øCbK and PY54, but their folds are very similar, while HMM is less discriminatory and places them in groups that include two Lambda-like groups and one PY54-like group. Sequence comparison of the NlpC/P60 domains shows that, again, although they can be very different in sequence even within a sequence type (e.g., D3 TNLP is only 24% identical to that of RcapMu), their predicted folds are remarkably conserved.
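Percent identity figures such as the 24% quoted above depend on how the alignment columns are counted. The sketch below shows one common convention (identical residues divided by the number of columns in which neither sequence has a gap). It is an assumption for illustration only: the source does not state which convention or tool produced its value, the function name is hypothetical, and the two aligned fragments are invented toy strings, not real TNLP sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over ungapped columns of a pairwise alignment.

    Both inputs must come from the same alignment (equal length, '-' for gaps).
    Columns where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap column: excluded from the denominator
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Toy aligned fragments, invented for illustration only.
seq_a = "MKT-LLAG"
seq_b = "MRTQLV-G"
print(f"{percent_identity(seq_a, seq_b):.1f}% identity")
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, which gives lower values for the same alignment, so identity figures are only comparable when computed the same way.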

The NlpC/P60 domain fold was named after those of E. coli new lipoprotein C (now MepS) (Singh et al., 2012) and Listeria monocytogenes peptidase P60 (Bubert et al., 1992). Some members of this domain type have been shown to be cysteine peptidases with peptidoglycan hydrolase activity, and it has been proposed that the siphophage NlpC/P60 domain is injected into the periplasm and digests the cell wall to create a passage for the injected DNA. There are conserved, juxtaposed Cys and His amino acids in the active site (reviewed by Griffin et al., 2023), and the Lambda-, D3- and øCbK-like TNLPs typically have these two amino acids and so are likely active enzymes, but the PY54-like TNLPs lack the conserved His.

The Prok-JAB domain is related to the metalloprotease that cleaves ubiquitin from ubiquitinated proteins in eukaryotes. This potential activity correlates with the fact that the NTDs of the lambda-like THIs have a ubiquitin-like fold (see below), and has led to the suggestion that the lambda gpK Prok-JAB NTD might cleave between the two domains of lambda-like THI (Iyer et al., 2006). TNLP is essential for assembly of the phage lambda tail tip (Tsui & Hendrix, 1983), but its possible roles during injection (discussed above) and its location in the virion remain unproven.

2.6.4. THI - its role and diversity

The Lambda virion structure shows AAs 135-232 of gpI (THI) CTD in an extended conformation in the lumen of the virion tip structure below the three gpH (TMP) C-termini. The rest of gpI is not visualized in the virion cryo-structure (Wang et al. 2023). The lambda gpI CTD moves down to near the outer membrane during injection (Ge & Wang, 2024). Uniquely, as mentioned above THIs in the lambda-like group have an NTD with an AF3 predicted fold that is similar to ubiquitin. We also note that the D3-like CF NTDs have a fold that is very similar to that of Lambda THI NTD and ubiquitin, but have no recognizable AA sequence similarity to those proteins by PsiBLAST; it is not known if these ubiquitin-like CF NTDs might be removed by TNLP mediated cleavage.

Gene 28 protein (also called TAP, tail tip assembly protein) of the MP22-like phage Chi is a small protein with an extended conformation that occupies a position in its tail tip that is similar to that of lambda gpI (Sonani et al., 2024). It is encoded by a small gene in the same location as lambda gene I. HMM and PsiBLAST find rather weak but convincing sequence similarity between Chi gp28 and the Lambda gpI CTD, and place them in two very distinct sequence type groups, the lambda-like and MP22-like THIs (Table S2). The lambda THI CTD has been suggested to form the pore in the outer membrane during injection (Ge & Wang, 2024). On the other hand, Linares et al. (2023) have proposed that the C-terminal region of TM performs this function in phage T5.

Among the panel phages, only Lambda-like and MP22-like THIs are encoded by separate genes (above), but in other panel phages homologous sequences are often found at the N-terminal tip of CF. A summary of conclusions from PsiBLAST and HMM analyses of THI in panel phages follows:

  • Lambda-like THI. THI is encoded by a separate gene between the TNLP and CF genes. It is a two-domain protein of which only the CTD in an extended state is seen in the lumen of the tail tip cryoEM structure.

  • D3-like THI. There is no separate THI gene. An AF3-predicted structural homolog of the lambda-like THI NTD is present at the N-termini of the D3-like CF proteins.

  • MP22-like THI. The small THI protein of this group is encoded by a separate gene between the TH and CF genes. It is seen in the Chi tail tip cryoEM structure in an extended conformation.

  • KL1-like THI. There is no separate THI gene. A sequence related to the MP22-like THI is present at the N-termini of these CF proteins.

  • øCbK-like THI. There is no separate THI gene. A sequence related to the MP22-like THIs is present at the N-termini of these CF proteins. Bárdy et al. (2020) suggested that the CF N-terminal region of the gene transfer agent RcGTA particle makes the pore in the membrane. RcGTA has a tail tip that is similar to that of øCbK, but it is not included in our panel since it is not a true phage.

  • PY54-like THI. There is no separate THI gene. The PY54-like CF NTDs have an ~60 AA region with an AF3-predicted extended structure, but they are very variable and no homology has been found between them and the THIs or other panel phage CF NTDs.

  • T5-like THI. There is no separate THI gene and the CF N-terminus is occupied by an HDI domain. Linares et al.’s T5 injection model has a cleaved off C-terminal part of T5 TM performing the putative THI function by rearranging during injection to form the outer membrane pore.
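The summary list above can also be held as a small lookup table, which is convenient when annotating genomes by tip type. This is a minimal Python sketch; the class `ThiSummary`, the table `THI_BY_TIP_TYPE` and the helper function are hypothetical names, and the entries are condensed paraphrases of the bullets above and carry no additional data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThiSummary:
    separate_gene: bool  # True if THI is encoded by its own gene
    location: str        # where the THI or THI-related sequence is found


# Condensed from the bullet list in the text; wording is paraphrased.
THI_BY_TIP_TYPE = {
    "lambda-like": ThiSummary(True, "separate gene between the TNLP and CF genes"),
    "D3-like": ThiSummary(False, "structural homolog of the lambda-like THI NTD at the CF N-terminus"),
    "MP22-like": ThiSummary(True, "separate gene between the TH and CF genes"),
    "KL1-like": ThiSummary(False, "MP22-like THI-related sequence at the CF N-terminus"),
    "øCbK-like": ThiSummary(False, "MP22-like THI-related sequence at the CF N-terminus"),
    "PY54-like": ThiSummary(False, "no homolog found; variable ~60 AA extended region in the CF NTD"),
    "T5-like": ThiSummary(False, "CF N-terminus is an HDI domain; THI role proposed for a cleaved TM fragment"),
}


def tip_types_with_separate_thi_gene(table=THI_BY_TIP_TYPE):
    """Return the tip types in which THI is encoded by a separate gene."""
    return sorted(tip for tip, summary in table.items() if summary.separate_gene)


print(tip_types_with_separate_thi_gene())
```

Keeping the presence of a separate gene as a boolean and the positional detail as free text mirrors the two questions the list answers for each tip type.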

2.6.5. TM - its role and diversity

Extended TMs have been shown to act as a molecular ruler in tail tube length determination during tail assembly (Katsura & Hendrix, 1984), and the C-terminal tips of three TM molecules are seen in the lumen at the top of the tip cone in the experimentally determined Proteobacteria phage tail tip structures. The bulk of the three TM molecules fills the tail tube lumen in an extended (probably largely helical) conformation.

As mentioned above, Linares et al. (2023) report that the T5 TM is cleaved by an unknown protease between residues 93 and 94 from the C-terminus, and both parts of the protein are present in the virion. They also report that the 93 AA C-terminal part has a “metalloprotease motif” and may affect cell wall cleavage. In theory this could perform the same function as the NlpC/P60 domain of TNLP in other phages, but this remains to be proven. Finally, it has been suggested that two putative membrane-spanning regions near the C-terminus of the N-terminal cleavage product of TM perform the outer membrane pore-forming function in phage T5 (Linares et al., 2023). PY54-like and T5-like CFs are moderately closely related (above and below), and neither group has a separate THI gene. Outer membrane pore function may be different in these two groups, although their close relationship overall could indicate that they have similar THI and/or TM functions.

The lambda TM in the virion is also cleaved in an assembly-dependent manner (Tsui & Hendrix, 1983), but the cleavage site and role of this cleavage remain unknown. It is not known if TMs of other phages are cleaved, or what performs the cleavage in the known cases.

2.6.6. CF - its role and diversity

The published siphophage tail tip structures show that CFs are very diverse, large multidomain proteins that form the bulk of the conserved part of the siphophage tail tip structure. The multidomain nature of the CFs encoded by the panel phages complicates their analysis, and there are clear examples of recent domain exchange; for example, the CF C-terminal domains of the temperate E. coli phage HK97 and the very different lytic phage 9g are 89% identical in AA sequence, while their other CF domains (and other encoded proteins) are very different. Our inspection of the few experimentally determined and many AF3-predicted domain contents of siphophage CFs indicates that there is a previously unrecognized, clear set of universally present “core” domains whose basic folds are largely invariant. These CF domains have been named HDI (see above), HDII, HDIII and HDIV (Fig.5A). We note that this universality also extends to the CFs of myophage baseplates and to the tail tips of siphophages that infect hosts outside the Proteobacteria phylum. The latter phage types often have “minimal” CFs that contain five or fewer domains (Fig.5B). These four universal domains perform a common underlying role during injection: formation of the upper part/sides of the cup-shaped portion of known siphophage tail tip structures.

Figure 5.

Fig.5. CF core domain organization.

We note that in some myophage baseplates the CF domain HDIII is significantly shorter than is common. Cyanophage Pam3 infects Pseudanabaena mucicola (Cyanobacteriota phylum), and its gp19 CF is, to our knowledge, the smallest currently known at 229 AAs. Its tail tip structure has been determined, and its CF has essentially no HDIII domain. Very short HDIIIs are not limited to Cyanobacteria phages; for example, CFs of some Gammaproteobacteria myophages (e.g., Erwinia phage øET88 and Salmonella phage SEN34) also have short or no HDIII domains in their AF3-predicted structures. HDIII domains are present in all the Proteobacteria siphophages whose tail tip structures are known, and unusually large ones are found in phages Bxb1 (not shown) and Douge (Maharana et al., 2025), which infect the very distant host Mycobacterium smegmatis (Actinobacteria phylum). Unambiguous HDIIIs are present in all CFs in our siphophage panel by their predicted AF3 structures and sequence similarities.

As described above, the HDI domain is universally present, but in some panel phages it is encoded by a separate TH gene (e.g., lambda-, MP22-, PY54- and D3-like phages), while in T5-like tips it forms the CF NTD (as in all the myophage CFs we analyzed). Thus, the contiguous HDII, HDIII and HDIV domains (circled in Fig.5B) can usefully serve in siphophage CF sequence comparisons. In spite of their great sequence diversity (see below), these three domains can be readily identified since AF3 does an apparently excellent job of folding the various CF domains (e.g., phage FSP_SP-016 CF in Fig.S2). Figure S2 shows ribbon diagrams that demonstrate the uniformity of these domain structures in Lambda-like, D3-like, MP22-like, PY54-like and T5-like siphophage CFs. The figure includes the very distantly related Mycobacterium siphophage Douge CF domains for comparison and to show (i) the very wide conservation of the basic domain folds, and (ii) that HDIII can accommodate fairly large lengthening (in addition to shortening, above). As noted above, the HDI and HDIV domains both have the same basic beta sandwich fold as DTs and TTs.

To avoid domain exchange and presence/absence issues, the panel phage CF sequence comparisons that follow used only the universally present contiguous domains HDII, HDIII and HDIV (to avoid probing with non-contiguous sequences the relatively short OB domain was included when it is present on HDIV). This PsiBLAST analysis placed the CF cores from the panel phages into six CF sequence types – lambda-like, D3-like, MP22-like, øCbK-like, PY54-like and T5-like.

Domain connection diagrams for a sample of CFs from the lambda-like, D3-like, MP22-like, KL1-like, øCbK-like, PY54-like, and T5-like tail tip groups (see below) are shown in Figure 6. The CF core domains have constant features within each of the groups, but there are numerous intra- and inter-group non-core domain variations. The non-core CF domains in each of these types are discussed in following sections.

Figure 6.

Fig.6. CF domain connection diagrams for the seven tail tip types.

3. Materials and methods

We analyzed the tail tip proteins encoded by each of the 429 panel phages with the following strategy in order to discover systematic similarities and differences among these extremely diverse siphophage tail tips, with the hope of understanding their diversity in more detail and organizing them into coherent groups:

1. Gene synteny points out tail tip protein candidates. It has long been known that small to medium sized tailed phages (genome sizes under about 100,000 bp) have genes arranged in particular conserved orders according to their functions (Casjens & Hendrix, 1974). Siphophage tails are an especially good example of such a common gene order. However, the specific functions of tail genes have only been experimentally studied in a very small number of phages. In phage lambda, where the tip assembly genes have been most thoroughly studied by genetic and biochemical means, the six contiguous members of the H, M, L, K, I and J gene cluster are essential for tail tip assembly. We thus began our analysis by assuming the phages in the panel have similar gene orders, even though most previous annotations of siphophage genomes have failed to predict functions for a large number of the genes in this cluster. When no homolog of one of the above six proteins was found in the cluster (see, for example, the lack of a TNLP in the MP22-like and T5-like tips), homologs were searched for in other parts of the genome, but none were found.

2. Identification of groups of distantly related proteins with similar sequence. The panel phages were examined independently by hidden Markov model analysis (Finn et al., 2011) and systematic reiterative PsiBLAST searches (Altschul et al., 1997) to identify sets of more closely related proteins or “sequence types” encoded by the genes in their tail tip gene regions. These methods identify homologs that are very different in amino acid sequence, and they converged on similar protein sets.

3. Similar polypeptide folds indicate ancient homology and thus likely similar function. Polypeptide fold predictions were made using the online implementation of AlphaFold3 (Jumper et al., 2021; Abramson et al., 2024) for proteins representative of each of the sequence types and subtypes determined in the previous step. These were examined manually in UCSF ChimeraX v1.6 for fold similarities and domain content differences. We assume that sequence types with very similar predicted folds are in fact ancient homologs and therefore almost certainly have similar specific functions in tail tip assembly.

We believe that this extensive analysis resulted in the successful identification of all previously unrecognized DTs, THs, TNLPs, THIs and CFs encoded by the 429 panel phages, and further searches with the sequence types thus identified did not find any homologs encoded by genes outside the tail tip cluster of the panel phages.

In the Results section we describe the diversity of each of the six tail tip proteins in our phage panel and discuss them in terms of their detailed functions; we believe there are no other nearly universally present tail tip proteins (discussed below).

4. Discussion

4.1. Synteny and potentially unrecognized tail tip proteins

Were any widely present, potentially critical tail tip assembly cluster genes missed because the few best-studied phages happen to lack them? Our analysis did not identify any widely present proteins that are not recognizable homologs of one of the lambda H, M, L, K, I or J proteins. Figures S3A, B, C and D show that although there are sporadically present genes with unknown function (white in figures), they are not common and are generally unique to particular phages; if they are functional these could fall into the category of “morons” (Abedon, 2022). However, in a few cases such genes are ubiquitously present within a tail tip type or a subgroup within a type. Phages with MP22-like tips all carry homologous short genes (e.g., gene 52 in MP22; Fig. S3C) of unknown function between their THI and CF encoding genes, and a subset of the PY54-like phages (e.g., genes 17 and 9 in Loki and Halfdan respectively; Fig. S3D) have an extra TNH gene between the DT and canonical TH gene. In these two examples the novel gene may well encode an accessory tail tip protein that is used only in these contexts.

In addition, most but not all of the MP22- and KL1-like phages have two “extra” RBP genes between their TM and DT genes (Fig. S3D). Similarly, members of the T5-like group encode a protein that forms an extra ring of subunits between the TT shaft and the DT ring and another that forms a bridge between the extra ring and the long tail fiber. Such “accessory” genes all have no homologs outside their tail tip sequence type group, so any function in tip assembly is limited to their particular group. We believe that our studies have identified all the critical players in the assembly of our panel phage tail tips.

4.2. Is this general type of tail tip limited to phages with Proteobacteria hosts?

We have not done an extensive study, but such tail tips appear to be uncommon outside the Proteobacteria, since only one of 42 non-Proteobacteria siphophages has the required characteristics (data not shown). The lone exception, Synechococcus (Cyanobacteria host phylum) phage S-CBS4 (acc. no. HQ698895), encodes tail tip proteins with all the characteristics of the lambda-like tip.

4.3. Dearth of genetic exchange among major tail tip types

We found an excellent correlation of the different protein sequence types in each of the seven tail tip types; for example, no tip protein with a Lambda-like DT, TH, TNLP, THI or CF core sequence type is found in a phage with a different tip type. This indicates that there has been little to no exchange of whole genes or subsets of tail tip genes among the different panel phage tail tip types since their evolutionary separation. Given the high level of mosaicism in other parts of these phage genomes, this dearth of observed exchange almost certainly indicates that there is selection against such exchange in nature. Such a selection is likely a consequence of the very intimate interactions among the tip proteins that would prohibit interaction between divergent partners. Thus, the tail tip genes constitute a “module” of genes whose members cannot be exchanged with distant relatives, so they stay together during long evolutionary descent. Placement of most RBP tail appendage genes at one edge of the tail tip gene cluster downstream of the CF gene means that they can be exchanged without disturbing the precise intergene relationships among the other tip protein genes. The very few exceptional gene placements beg explanation: (i) the location of the JBD30 gp47/gp48 RBP genes (Fig. S3C) might be explained by the fact that they bind DT, and genes that encode interacting proteins are usually close to one another; (ii) the “odd” placement of the phage Carrot tailspike gene (Fig. S3A) could be the result of very recent “random” arrival by horizontal transfer.

Finally, a few of the tail tip proteins encoded by phages 9A (host in Colwellia genus), Seuss (Caulobacter), Shpa (Paracoccus), 16-3 (Rhizobium), Nickie (Pseudomonas), R5C (Dinoroseobacter), R7M (Alteromonas), RcapMu (Rhodobacter) and the JWX/83-24/BcepGomr group (all Betaproteobacteria hosts) form unique sequence subtypes. A few of these also have domain differences from other members of their group; for example, unlike other members of their tail tip types, phage 9A TH has no Fe-S cluster domain, Nickie TH has the Fe-S cluster domain, and JWX/83-24/BcepGomr have two-core-domain DTs. However, some of these domain differences could be due to differential loss (or recent duplication in the latter case) rather than horizontal exchange, and none of these proteins’ common region AA sequences suggests that it should be placed in a different sequence type from the other tail tip proteins encoded by the same phage. Thus, they most likely represent highly diverged members of their groups. This is perhaps not surprising since their hosts are distantly related to those of most phages in the panel.
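The congruence described in this section (every tail tip protein of a phage belonging to the same tip type) is straightforward to test once sequence types have been assigned. The following is a minimal Python sketch of such a check, not the procedure used in the study; the function name is hypothetical, and the phage names and assignments in `toy` are invented placeholders that include one deliberately mixed entry to show what a violation would look like.

```python
PROTEINS = ("DT", "TH", "TNLP", "THI", "CF")


def incongruent_phages(assignments):
    """Return phages whose tail tip proteins fall into more than one type.

    assignments: {phage: {protein: sequence type, or None if the protein
    is absent in that phage}}. Absent proteins are ignored.
    """
    mixed = {}
    for phage, types in assignments.items():
        present = {types.get(protein) for protein in PROTEINS} - {None}
        if len(present) > 1:
            mixed[phage] = sorted(present)
    return mixed


# Hypothetical toy input; these are not real panel phages or real assignments.
toy = {
    "phage_A": {"DT": "lambda-like", "TH": "lambda-like", "TNLP": "lambda-like",
                "THI": "lambda-like", "CF": "lambda-like"},
    "phage_B": {"DT": "T5-like", "TH": None, "TNLP": None, "THI": None,
                "CF": "T5-like"},
    "phage_C": {"DT": "D3-like", "TH": "lambda-like", "CF": "D3-like"},  # deliberately mixed
}

print(incongruent_phages(toy))
```

For a panel that is fully congruent, as reported here, the function would return an empty dictionary; any non-empty result would flag a candidate for horizontal exchange or misassignment.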

4.4. Annotation improvement and phage gene novelty

Tailed phage genomes are extremely diverse and currently much less than half of their genes are annotated for specific function. It was estimated in 2013 that virus genomes encode between millions and billions of novel protein types of unknown function and the ensuing decade has not led to increased insight into this issue. In the current study we identified the specific tail tip assembly proteins (functions) in all 429 of the panel phages, thus providing specific functional information on a large number of previously functionally unannotated or incorrectly annotated genes. Our similar analyses of phage head proteins, head-tail junction proteins, contractile tail assembly proteins (unpublished results), recombination proteins (Zheng et al., 2025), and replication proteins (unpublished results) strongly support the idea that such an annotation improvement is possible for many other tailed phage genes. It is therefore likely that a significant fraction of all tailed virion assembly genes whose specific functions are currently unannotated can be identified as having a very specific function through the approach used here - delineating sequence types and analyzing them with AF3 for polypeptide fold similarities.

5. Conclusion

The first striking conclusion from our analysis is that similar general tail tip functions are present in a very large majority (429 of 436) of Proteobacteria siphophage tail tips.

Another major conclusion is that our analysis robustly groups the panel phage tail tips into a small number of “types”. The PsiBLAST and HMM analyses of the five tail tip proteins/domains (DT, TH, TNLP, THI and CF core) naturally divide the tail tips into distinct types that are typified by phages lambda, D3, MP22, øCbK, PY54 and T5. No phage with a lambda-like DT, for example, has a convincingly non-lambda-like TH, TNLP, THI or CF core protein. The same is true for each of the other tip protein types; that is, the sequence types are congruent across the tail tip types.

Domain/gene content and arrangement differences lend support to the tail tip types defined above by AA sequence alone, and they further support separation of MP22-like and KL1-like tips to give seven tip types. THH and TNI domains are located in different genes in the MP22-like tips and in the KL1-like tips (we do not use the apparently recent DT tandem domain duplication in three D3-like tip proteins to define a separate type, since the other aspects of these three tips are D3-like).

Thus, our analysis defines seven robust, unambiguous and discrete Proteobacteria siphophage tail tip types. This greatly simplifies understanding of the Proteobacteria-infecting siphophage tail tips and will help to avoid redundant studies in the future.

References

Abedon, S. T. (2022). Phage Morons. Bacteriophages as Drivers of Evolution: An Evolutionary Ecological Perspective (pp. 153–164). Springer.

Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., & Bambrick, J. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500.

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389–3402.

Ayala, R., Moiseenko, A. V., Chen, T., Kulikov, E. E., Golomidova, A. K., Orekhov, P. S., Street, M. A., Sokolova, O. S., Letarov, A. V., & Wolf, M. (2023). Nearly complete structure of bacteriophage DT57C reveals architecture of head-to-tail interface and lateral tail fibers. Nature Communications, 14(1), 8205.

Bárdy, P., Füzik, T., Hrebík, D., Pantůček, R., Thomas Beatty, J., & Plevka, P. (2020). Structure and mechanism of DNA delivery of a gene transfer agent. Nature Communications, 11(1), 3034. 10.1038/s41467-020-16669-9

Bubert, A., Kuhn, M., Goebel, W., & Köhler, S. (1992). Structural and functional properties of the p60 proteins from different Listeria species. Journal of Bacteriology, 174(24), 8166–8171. 10.1128/jb.174.24.8166-8171.1992

Casjens, S. R., Davidson, A. R., & Grose, J. H. (2022). The small genome, virulent, non-contractile tailed bacteriophages that infect Enterobacteriales hosts. Virology, 573, 151–166.

Casjens, S. R., & Grose, J. H. (2016). Contributions of P2-and P22-like prophages to understanding the enormous diversity and abundance of tailed bacteriophages. Virology, 496, 255–276.

Casjens, S., & Hendrix, R. (1974). Comments on the arrangement of the morphogenetic genes of bacteriophage lambda. Journal of Molecular Biology, 90(1), 20–23. 10.1016/0022-2836(74)90253-8

Chen, Y., Xiao, H., Zhou, J., Peng, Z., Peng, Y., Song, J., Zheng, J., & Liu, H. (2025). The In Situ Structure of T-Series T1 Reveals a Conserved Lambda-Like Tail Tip. Viruses, 17(3), 351.

Degroux, S., Effantin, G., Linares, R., Schoehn, G., & Breyton, C. (2023). Deciphering bacteriophage T5 host recognition mechanism and infection trigger. Journal of Virology, 97(3), 1584.

Finn, R. D., Clements, J., & Eddy, S. R. (2011). HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39(suppl_2), W29–W37.

Flayhan, A., Vellieux, F. M., Lurz, R., Maury, O., Contreras-Martel, C., Girard, E., Boulanger, P., & Breyton, C. (2014). Crystal structure of pb9, the distal tail protein of bacteriophage T5: a conserved structural motif among all siphophages. Journal of Virology, 88(2), 820–828.

Fremin, B. J., Bhatt, A. S., Kyrpides, N. C., Sengupta, A., Sczyrba, A., da Silva, A. M., Buchan, A., Gaudin, A., Brune, A., & Hirsch, A. M. (2022). Thousands of small, novel genes predicted in global phage genomes. Cell Reports, 39(12).

Ge, X., & Wang, J. (2024). Structural mechanism of bacteriophage lambda tail’s interaction with the bacterial receptor. Nature Communications, 15(1), 4185. 10.1038/s41467-024-48686-3

Griffin, M. E., Klupt, S., Espinosa, J., & Hang, H. C. (2023). Peptidoglycan NlpC/P60 peptidases in bacterial physiology and host interactions. Cell Chemical Biology, 30(5), 436–456.

Huang, Y., Sun, H., Wei, S., Cai, L., Liu, L., Jiang, Y., Xin, J., Chen, Z., Que, Y., Kong, Z., Li, T., Yu, H., Zhang, J., Gu, Y., Zheng, Q., Li, S., Zhang, R., & Xia, N. (2023). Structure and proposed DNA delivery mechanism of a marine roseophage. Nature Communications, 14(1), 3609. 10.1038/s41467-023-39220-y

Iyer, L. M., Burroughs, A. M., & Aravind, L. (2006). The prokaryotic antecedents of the ubiquitin-signaling system and the early evolution of ubiquitin-like β-grasp domains. Genome Biology, 7(7), R60.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., & Potapenko, A. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.

Katsura, I., & Hendrix, R. W. (1984). Length determination in bacteriophage lambda tails. Cell, 39(3), 691–698.

Linares, R., Arnaud, C., Effantin, G., Darnault, C., Epalle, N. H., Boeri Erba, E., Schoehn, G., & Breyton, C. (2023). Structural basis of bacteriophage T5 infection trigger and E. coli cell wall perforation. Science Advances, 9(12), eade9674.

Maharana, J., Wang, C., Tsai, L., Liao, Y., Yang, C., Shen, M. C., Macale, L. S., Tran, T. N., Narsico, J., & Perez, R. J. (2025). Cryo-EM and cryo-ET reveal the molecular architecture and host interactions of mycobacteriophage Douge. Cell Reports, 44(8).

Paul, J. H., Sullivan, M. B., Segall, A. M., & Rohwer, F. (2002). Marine phage genomics. Comparative Biochemistry and Physiology. Part B, Biochemistry & Molecular Biology, 133(4), 463–476. 10.1016/s1096-4959(02)00168-9

Singh, S. K., SaiSree, L., Amrutha, R. N., & Reddy, M. (2012). Three redundant murein endopeptidases catalyse an essential cleavage step in peptidoglycan synthesis of Escherichia coli K12. Molecular Microbiology, 86(5), 1036–1051.

Sonani, R. R., Esteves, N. C., Scharf, B. E., & Egelman, E. H. (2024). Cryo-EM structure of flagellotropic bacteriophage Chi. Structure, 32(7), 856–865.e3. 10.1016/j.str.2024.03.011

Tsui, L., & Hendrix, R. W. (1983). Proteolytic processing of phage λ tail protein gpH: timing of the cleavage. Virology, 125(2), 257–264.

Valentová, L., Füzik, T., Nováček, J., Hlavenková, Z., Pospíšil, J., & Plevka, P. (2024). Structure and replication of Pseudomonas aeruginosa phage JBD30. The EMBO Journal, 43(19), 4384–4405.

van den Berg, B., Silale, A., Baslé, A., Brandner, A. F., Mader, S. L., & Khalid, S. (2022). Structural basis for host recognition and superinfection exclusion by bacteriophage T5. Proceedings of the National Academy of Sciences, 119(42), e2211672119.

Wang, C., Duan, J., Gu, Z., Ge, X., Zeng, J., & Wang, J. (2024). Architecture of the bacteriophage lambda tail. Structure, 32(1), 35–46.e3. 10.1016/j.str.2023.10.006

Xiao, H., Tan, L., Tan, Z., Zhang, Y., Chen, W., Li, X., Song, J., Cheng, L., & Liu, H. (2023). Structure of the siphophage neck–tail complex suggests that conserved tail tip proteins facilitate receptor binding and tail assembly. PLOS Biology, 21(12), e3002441. 10.1371/journal.pbio.3002441

Zheng, C., Casjens, S. R., Davidson, A. R., Amundsen, S. K., & Smith, G. R. (2025). Lambdoid phages with abundant Chi recombination hotspots reflect diverse viral strategies for recombination-dependent growth. Genome Research, 35(8), 1767–1780.
