Hallmark gene panel for phage ecology fingerprinting in human microbiomes.
Metagenomic sequencing is the standard approach for studying human phageomes, but it has fundamental limitations for ecological analysis:
| Problem | Impact on Ecology Studies |
|---|---|
| Contamination | Bacterial DNA, host DNA, reagent contamination |
| Short contigs | Phages hard to assemble, fragmented data |
| Compositional bias | Relative abundance only, not absolute |
| Limit of detection | Rare phages missed |
| Complexity | Hard to translate into ecological metrics |
| Between-study variation | Results not comparable |
The core issue: most virome studies aim to detect ecological disruption (dysbiosis), but metagenomics produces data that is poorly suited to ecological modeling.
Instead of capturing entire phage genomes, target conserved hallmark genes that serve as ecological "barcodes":
| Traditional Metagenomics | Phageprint Approach |
|---|---|
| Capture everything | Target hallmark genes |
| Assemble contigs | Detect/quantify markers |
| Annotate genes | Direct ecological signal |
| Build complex models | Standardized diversity metrics |
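
As a rough illustration of what a "standardized diversity metric" computed from marker detections could look like, the sketch below derives richness and the Shannon index from per-marker counts. This is an assumption made for illustration: the README does not state which metrics phageprint reports, and the marker names and counts in the example are hypothetical.

```python
import math


def richness(counts: dict[str, float]) -> int:
    """Number of markers detected with a non-zero count."""
    return sum(1 for c in counts.values() if c > 0)


def shannon_index(counts: dict[str, float]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-zero markers."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one marker must have a positive count")
    return -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )


# Hypothetical read counts per hallmark cluster in one sample.
sample = {"TerL_cluster_1": 50, "TerL_cluster_2": 50, "MCP_cluster_1": 0}
print(richness(sample), round(shannon_index(sample), 3))
```

Because both functions depend only on a table of marker counts, the same calculation can be applied unchanged across samples and studies.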
Phageprint defines 24 curated hallmark genes across major phage groups:
| Category | Genes | Ecological Signal |
|---|---|---|
| Capsid proteins | MCP (HK97-fold, crAss, Microviridae) | Community structure |
| Terminases | TerL families | Phylogenetic diversity |
| crAss-specific | RNAP, PolB | crAss-like abundance |
| Microviridae | VP1, gpA | ssDNA phage presence |
| Inoviridae | pI, pVIII | Filamentous phages |
| Integrases | Tyr/Ser recombinases | Lysogeny potential |
| Lysis genes | Holins, endolysins | Lytic activity |
Three panel sizes are available:
- Core panel: 12 essential hallmarks for basic fingerprinting
- Extended panel: 18 hallmarks with broader coverage
- Full panel: All 24 hallmarks for comprehensive analysis
# Basic installation
pip install phageprint
# With probe design support (requires probesmith)
pip install "phageprint[probes]"
# Development installation
git clone https://github.com/shandley/phageprint.git
cd phageprint
pip install -e ".[dev]"

# View available hallmark genes
phageprint hallmarks list
# View panel information
phageprint hallmarks panel core
# Fetch database annotations
phageprint databases fetch phrogs
# Analyze sequence diversity for a hallmark
phageprint diversity analyze MCP_HK97 --output results/
# Generate probes for a panel
phageprint design amino --panel core --output probes.fasta
# Generate nucleotide probes (with GenBank fetching)
phageprint design nucleotide --panel core --email you@example.com --output probes.fasta

Integrates with three phage protein family databases:
- PHROGs: 38,880 phage protein families with functional annotations
- pVOGs: 9,518 prokaryotic virus orthologous groups
- INPHARED: Curated phage genome database (monthly updates)
phageprint databases fetch phrogs
phageprint databases status
phageprint databases search "major capsid protein"

24 curated hallmark genes with PHROGs/pVOGs mappings:
phageprint hallmarks list
phageprint hallmarks show MCP_HK97
phageprint hallmarks panel extended --format yaml

Sequence clustering and conservation analysis:
# Assess feasibility for probe/primer design
phageprint diversity analyze TerL --max-sequences 1000
# Analyze entire panel
phageprint diversity panel core --output diversity_report/

Probe generation for hybrid capture:
# Amino acid probes (from cluster representatives)
phageprint design amino --panel core
# Nucleotide probes (120bp, tiled)
phageprint design nucleotide --panel core --probe-length 120 --tiling 1.5
# Export for synthesis vendors
phageprint design nucleotide --panel core --format twist --output probes.csv

Analysis of the core panel showed high sequence diversity (>30 clusters at 80% identity for most hallmarks), which makes traditional PCR approaches infeasible, so hybridization probe capture is the recommended approach. Feasibility is rated by the number of clusters per hallmark:
| Rating | Clusters (80% ID) | Approach |
|---|---|---|
| Excellent | ≤3 | Single primer pair |
| Good | 4-10 | Degenerate primers |
| Moderate | 11-30 | Multiple primers |
| Difficult | >30 | Probes recommended |
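
The rating table can be read as a simple threshold lookup. The sketch below encodes those thresholds in a standalone function. `rate_feasibility` is a hypothetical name used for illustration, not part of the phageprint API.

```python
def rate_feasibility(n_clusters: int) -> tuple[str, str]:
    """Map a cluster count (at 80% identity) to (rating, approach)."""
    if n_clusters < 1:
        raise ValueError("cluster count must be at least 1")
    if n_clusters <= 3:
        return ("Excellent", "Single primer pair")
    if n_clusters <= 10:
        return ("Good", "Degenerate primers")
    if n_clusters <= 30:
        return ("Moderate", "Multiple primers")
    return ("Difficult", "Probes recommended")


# A hallmark with 42 clusters falls in the probe-capture category.
print(rate_feasibility(42))
```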
The core panel generates ~13,500 nucleotide probes (120bp) to capture the full diversity.
- Cluster sequences at 80% identity using MMseqs2
- Select representatives from each cluster
- Fetch nucleotide CDS from GenBank (or reverse-translate)
- Tile probes with configurable overlap (default 1.5x = 33% overlap)
- Filter by GC content and melting temperature
- Deduplicate to remove redundant probes
- Export in vendor formats (Twist, IDT)
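
To make the tiling and deduplication steps concrete, here is a minimal sketch. It assumes that a tiling depth of 1.5x means the step between probe starts is the probe length divided by 1.5 (80 bp for 120 bp probes, giving the 33% overlap mentioned above), and that deduplication removes exact duplicates only. The function names and the choice to anchor a final probe at the 3' end are illustrative assumptions, not phageprint's actual implementation.

```python
def tile_probes(seq: str, probe_length: int = 120, tiling: float = 1.5) -> list[str]:
    """Tile fixed-length probes across seq at the given tiling depth."""
    if tiling < 1:
        raise ValueError("tiling must be >= 1")
    seq = seq.upper()
    if len(seq) < probe_length:
        return []  # too short to yield a full-length probe
    # 120 bp probes at 1.5x tiling -> 80 bp step, 40 bp overlap.
    step = max(1, round(probe_length / tiling))
    last_start = len(seq) - probe_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # cover the 3' end with a final probe
    return [seq[s:s + probe_length] for s in starts]


def deduplicate(probes: list[str]) -> list[str]:
    """Drop exact duplicate probes, keeping first-seen order."""
    return list(dict.fromkeys(probes))


# A 300 bp sequence yields probes starting at 0, 80, 160 and 180.
print(len(tile_probes("ACGTTGCA" * 37 + "ACGT")))
```

A production pipeline would likely also collapse near-identical probes, which exact matching does not catch.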
When probesmith is installed, phageprint uses it for professional-grade probe design with:
- GC content filtering (30-70%)
- Melting temperature filtering (60-85°C)
- Homopolymer detection
- Redundancy optimization
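
The filters listed above can be approximated in a few lines of standalone Python. This sketch does not use or reproduce probesmith. The melting temperature uses a basic GC-content formula, Tm = 64.9 + 41 × (G + C − 16.4) / N, which is an assumption chosen for simplicity, and the homopolymer limit of 8 is a hypothetical threshold because the README does not give one.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def approx_tm(seq: str) -> float:
    """Rough melting temperature in deg C from GC count and length."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def passes_filters(
    seq: str,
    gc_range: tuple[float, float] = (0.30, 0.70),  # from the list above
    tm_range: tuple[float, float] = (60.0, 85.0),  # from the list above
    max_homopolymer: int = 8,  # hypothetical threshold
) -> bool:
    """Return True if seq passes the GC, Tm and homopolymer checks."""
    if not seq:
        return False
    return (
        gc_range[0] <= gc_fraction(seq) <= gc_range[1]
        and tm_range[0] <= approx_tm(seq) <= tm_range[1]
        and longest_homopolymer(seq) <= max_homopolymer
    )


print(passes_filters("ACGT" * 30))
```

For real designs, a nearest-neighbor Tm model with explicit salt conditions would be more accurate than this GC-based estimate.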
phageprint/
├── src/phageprint/
│ ├── databases/ # PHROGs, pVOGs, INPHARED integration
│ ├── hallmarks/ # Gene definitions and panel management
│ ├── diversity/ # Clustering and conservation analysis
│ └── design/ # Probe generation and export
├── data/
│ └── hallmark_genes.yaml
└── tests/
- probesmith: Probe design toolkit for hybridization capture
- PHROGs: Phage protein families database
- pVOGs: Prokaryotic virus orthologous groups
- INPHARED: Phage genome database
Manuscript in preparation.
MIT