Phageprint

Hallmark gene panel for phage ecology fingerprinting in human microbiomes.

The Problem

Metagenomic sequencing is the standard approach for studying human phageomes, but it has fundamental limitations for ecological analysis:

Problem                    Impact on Ecology Studies
───────────────────────    ──────────────────────────────────────────────
Contamination              Bacterial DNA, host DNA, reagent contamination
Short contigs              Phages hard to assemble, fragmented data
Compositional bias         Relative abundance only, not absolute
Limit of detection         Rare phages missed
Complexity                 Hard to translate into ecological metrics
Between-study variation    Results not comparable

The core issue: Most virome studies want to detect ecological disruption (dysbiosis), but metagenomics produces data that's poorly suited for ecological modeling.

The Solution

Instead of capturing entire phage genomes, target conserved hallmark genes that serve as ecological "barcodes":

Traditional Metagenomics          Phageprint Approach
─────────────────────────         ─────────────────────
Capture everything         →      Target hallmark genes
Assemble contigs          →      Detect/quantify markers
Annotate genes            →      Direct ecological signal
Build complex models      →      Standardized diversity metrics

Hallmark Gene Panel

Phageprint defines 24 curated hallmark genes across major phage groups:

Category           Genes                                   Ecological Signal
───────────────    ────────────────────────────────────    ──────────────────────
Capsid proteins    MCP (HK97-fold, crAss, Microviridae)    Community structure
Terminases         TerL families                           Phylogenetic diversity
crAss-specific     RNAP, PolB                              crAss-like abundance
Microviridae       VP1, gpA                                ssDNA phage presence
Inoviridae         pI, pVIII                               Filamentous phages
Integrases         Tyr/Ser recombinases                    Lysogeny potential
Lysis genes        Holins, endolysins                      Lytic activity

Three panel sizes are available:

  • Core panel: 12 essential hallmarks for basic fingerprinting
  • Extended panel: 18 hallmarks with broader coverage
  • Full panel: All 24 hallmarks for comprehensive analysis

Installation

# Basic installation
pip install phageprint

# With probe design support (requires probesmith)
pip install "phageprint[probes]"

# Development installation
git clone https://github.com/shandley/phageprint.git
cd phageprint
pip install -e ".[dev]"

Quick Start

# View available hallmark genes
phageprint hallmarks list

# View panel information
phageprint hallmarks panel core

# Fetch database annotations
phageprint databases fetch phrogs

# Analyze sequence diversity for a hallmark
phageprint diversity analyze MCP_HK97 --output results/

# Generate probes for a panel
phageprint design amino --panel core --output probes.fasta

# Generate nucleotide probes (with GenBank fetching)
phageprint design nucleotide --panel core --email you@example.com --output probes.fasta

Modules

Databases

Integrates with three phage protein family databases:

  • PHROGs: 38,880 phage protein families with functional annotations
  • pVOGs: 9,518 prokaryotic virus orthologous groups
  • INPHARED: Curated phage genome database (monthly updates)

phageprint databases fetch phrogs
phageprint databases status
phageprint databases search "major capsid protein"

Hallmarks

24 curated hallmark genes with PHROGs/pVOGs mappings:

phageprint hallmarks list
phageprint hallmarks show MCP_HK97
phageprint hallmarks panel extended --format yaml

Diversity

Sequence clustering and conservation analysis:

# Assess feasibility for probe/primer design
phageprint diversity analyze TerL --max-sequences 1000

# Analyze entire panel
phageprint diversity panel core --output diversity_report/

Design

Probe generation for hybrid capture:

# Amino acid probes (from cluster representatives)
phageprint design amino --panel core

# Nucleotide probes (120 bp, tiled)
phageprint design nucleotide --panel core --probe-length 120 --tiling 1.5

# Export for synthesis vendors
phageprint design nucleotide --panel core --format twist --output probes.csv

Technical Details

Viability Assessment

Analysis of the core panel showed high sequence diversity: most hallmarks form more than 30 clusters at 80% identity, which makes primer-based PCR approaches infeasible. The recommended approach is therefore hybridization probe capture. Ratings are assigned by cluster count:

Rating       Clusters (80% ID)    Approach
─────────    ─────────────────    ──────────────────
Excellent    ≤3                   Single primer pair
Good         4-10                 Degenerate primers
Moderate     11-30                Multiple primers
Difficult    >30                  Probes recommended
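
The thresholds in this table can be expressed as a small lookup. The sketch below is illustrative only and is not taken from the phageprint source; the function name `rate_hallmark` is hypothetical, and it assumes the cluster count has already been computed at 80% identity.

```python
def rate_hallmark(n_clusters: int) -> tuple[str, str]:
    """Map a cluster count (at 80% identity) to (rating, suggested approach).

    Thresholds are copied from the table above; the function itself is a
    hypothetical helper, not part of the phageprint API.
    """
    if n_clusters < 1:
        raise ValueError("a hallmark must have at least one cluster")
    if n_clusters <= 3:
        return ("Excellent", "Single primer pair")
    if n_clusters <= 10:
        return ("Good", "Degenerate primers")
    if n_clusters <= 30:
        return ("Moderate", "Multiple primers")
    return ("Difficult", "Probes recommended")


if __name__ == "__main__":
    for count in (2, 7, 25, 140):
        rating, approach = rate_hallmark(count)
        print(f"{count:>4} clusters: {rating} ({approach})")
```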

The core panel generates ~13,500 nucleotide probes (120 bp each) to capture the full diversity.

Probe Design Pipeline

  1. Cluster sequences at 80% identity using MMseqs2
  2. Select representatives from each cluster
  3. Fetch nucleotide CDS from GenBank (or reverse-translate)
  4. Tile probes at a configurable density (default 1.5x coverage, i.e. adjacent probes overlap by 33%)
  5. Filter by GC content and melting temperature
  6. Deduplicate to remove redundant probes
  7. Export in vendor formats (Twist, IDT)
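
Steps 4 and 6 can be sketched in a few lines. This is a minimal illustration, not phageprint's implementation: `tile_probes` and `deduplicate` are hypothetical names, the step size is assumed to be probe length divided by tiling density (120 / 1.5 = 80 bases, so neighbours share 40 bases, or 33%), and deduplication here removes exact duplicates only, whereas the real pipeline may also collapse near-identical probes.

```python
def tile_probes(cds: str, probe_length: int = 120, tiling: float = 1.5) -> list[str]:
    """Cut a coding sequence into overlapping fixed-length probes.

    Assumption: the start of each probe advances by probe_length / tiling.
    """
    if tiling < 1:
        raise ValueError("tiling density must be at least 1x")
    if len(cds) < probe_length:
        return []  # too short to yield a full-length probe

    step = max(1, round(probe_length / tiling))
    last_start = len(cds) - probe_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Anchor one extra probe at the end so the tail is not left uncovered.
        starts.append(last_start)
    return [cds[s:s + probe_length] for s in starts]


def deduplicate(probes: list[str]) -> list[str]:
    """Drop exact duplicate probes while keeping first-seen order."""
    return list(dict.fromkeys(probes))


if __name__ == "__main__":
    sequence = "ATGGCTAAACGT" * 25  # 300-base toy sequence
    probes = deduplicate(tile_probes(sequence))
    print(len(probes), "probes of length", len(probes[0]))
```

With the defaults, a 300-base sequence is tiled at start positions 0, 80, and 160, plus one final probe anchored at position 180 to cover the end.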

Integration with Probesmith

When probesmith is installed, phageprint uses it for probe design, with:

  • GC content filtering (30-70%)
  • Melting temperature filtering (60-85°C)
  • Homopolymer detection
  • Redundancy optimization
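
To make two of these filters concrete, the sketch below checks GC content against the 30-70% window and rejects long homopolymer runs. It does not use or reproduce probesmith's API: all function names are hypothetical, and the homopolymer cutoff of 8 bases is an assumed value, since no threshold is stated here. Melting temperature filtering is omitted because it depends on the thermodynamic model chosen.

```python
import re


def gc_fraction(probe: str) -> float:
    """Fraction of G and C bases in a non-empty probe sequence."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def longest_homopolymer(probe: str) -> int:
    """Length of the longest run of a single repeated base in a non-empty probe."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", probe.upper()))


def passes_filters(
    probe: str,
    gc_min: float = 0.30,      # lower GC bound from the list above
    gc_max: float = 0.70,      # upper GC bound from the list above
    max_homopolymer: int = 8,  # assumed cutoff, not documented
) -> bool:
    """Return True if the probe satisfies the GC and homopolymer checks."""
    if not probe:
        return False
    if not gc_min <= gc_fraction(probe) <= gc_max:
        return False
    return longest_homopolymer(probe) <= max_homopolymer


if __name__ == "__main__":
    candidates = ["ACGT" * 30, "A" * 120, "GC" * 60]
    kept = [p for p in candidates if passes_filters(p)]
    print(f"kept {len(kept)} of {len(candidates)} probes")
```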

Project Structure

phageprint/
├── src/phageprint/
│   ├── databases/     # PHROGs, pVOGs, INPHARED integration
│   ├── hallmarks/     # Gene definitions and panel management
│   ├── diversity/     # Clustering and conservation analysis
│   └── design/        # Probe generation and export
├── data/
│   └── hallmark_genes.yaml
└── tests/

Related Projects

  • probesmith: Probe design toolkit for hybridization capture
  • PHROGs: Phage protein families database
  • pVOGs: Prokaryotic virus orthologous groups
  • INPHARED: Phage genome database

Citation

Manuscript in preparation.

License

MIT
