# RBPdetect2: Automated Phage Receptor-Binding Protein Detection

This notebook automatically detects and classifies Receptor-Binding Proteins (RBPs) from phage sequences.

**Instructions:**
1. Run all cells (Runtime → Run all)
2. Upload your FASTA file (genome or protein sequences)
3. Wait for processing to complete
4. Review results and download output files

---

In [None]:
#@title Setup: Check GPU Availability { display-mode: "form" }
import torch

if not torch.cuda.is_available():
    raise RuntimeError(
        "❌ GPU NOT detected.\n"
        "Go to Runtime → Change runtime type → Hardware accelerator → GPU"
    )
else:
    print("✅ GPU detected:", torch.cuda.get_device_name(0))

In [None]:
#@title Setup: Install Dependencies { display-mode: "form" }
import sys
import os

print("📦 Installing RBPdetect2 and dependencies...")
%cd /content
!rm -rf RBPdetect2
!git clone -q https://github.com/victornemeth/RBPdetect2.git
%cd RBPdetect2
!pip install -q -r requirements.txt

# Install phanotate for genome processing
!pip install -q phanotate

print("✅ Installation complete!")

---
## Upload Your FASTA File

Upload a FASTA file containing either:
- **Genome sequences** (nucleotide sequences: ATGC...)
- **Protein sequences** (amino acid sequences)

The pipeline will automatically detect the file type and process accordingly.
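
Detection of this kind comes down to checking which alphabet a sequence uses. The sketch below illustrates the idea with a hypothetical helper, `looks_like_nucleotide`, which is an assumption made for illustration and not a function from RBPdetect2. Note that a short peptide made up only of A, C, G, T and N would be mistaken for DNA by any alphabet-based check.

```python
# Hypothetical helper for illustration only; not part of RBPdetect2.
NUCLEOTIDES = set("ACGTN")


def looks_like_nucleotide(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only A, C, G, T or N."""
    residues = sequence.strip().upper()
    return bool(residues) and set(residues) <= NUCLEOTIDES


print(looks_like_nucleotide("ATGCGTNNA"))  # True: only nucleotide letters
print(looks_like_nucleotide("MKTLLVAGS"))  # False: M, K, L, V, S are amino-acid-only letters
```

If a file is misclassified, checking its first few records this way is a quick way to see why.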

In [None]:
from google.colab import files
import os

print("📁 Please upload your FASTA file (.fasta, .fa, .faa, .txt, etc.)\n")
uploaded = files.upload()

if not uploaded:
    raise ValueError("❌ No file uploaded. Please run this cell again and upload a file.")

input_file = next(iter(uploaded.keys()))
input_path = os.path.abspath(input_file)
print(f"\n✅ File uploaded: {input_file}")

In [None]:
#@title Processing: Auto-detect File Type and Run Pipeline { display-mode: "form" }
import re
import subprocess

def detect_fasta_type(fasta_path):
    """Detect if FASTA file contains genome (nucleotide) or protein sequences."""
    nucleotide_pattern = re.compile(r'^[ATGCN]+$', re.IGNORECASE)
    protein_pattern = re.compile(r'^[ACDEFGHIKLMNPQRSTVWY*]+$', re.IGNORECASE)
    
    with open(fasta_path, 'r') as f:
        sequences = []
        current_seq = []
        
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if current_seq:
                    sequences.append(''.join(current_seq))
                    current_seq = []
                    if len(sequences) >= 5:  # Sample first 5 sequences
                        break
            else:
                current_seq.append(line)
        
        if current_seq:
            sequences.append(''.join(current_seq))
    
    # Check first few sequences
    nucleotide_count = 0
    protein_count = 0
    
    for seq in sequences[:5]:
        seq_clean = seq.replace('*', '').replace('#', '').strip()
        if nucleotide_pattern.match(seq_clean):
            nucleotide_count += 1
        elif protein_pattern.match(seq_clean):
            protein_count += 1
    
    if nucleotide_count > protein_count:
        return 'genome'
    else:
        return 'protein'

# Detect file type
print("🔍 Detecting file type...")
file_type = detect_fasta_type(input_path)
print(f"✅ Detected: {file_type.upper()} sequences\n")

# Process based on file type
if file_type == 'genome':
    print("🧬 Converting genome to protein sequences using Phanotate...")
    %cd /content
    
    # Run phanotate; stderr goes to a separate log so warnings never end up inside the FASTA
    protein_fasta = 'phanotate_proteins.faa'
    !phanotate.py "$input_path" -f faa > {protein_fasta} 2> phanotate_stderr.log || true
    
    # Clean the protein FASTA
    with open(protein_fasta, 'r') as f:
        content = f.read()
    cleaned_content = content.replace('*', '').replace('#', '')
    with open(protein_fasta, 'w') as f:
        f.write(cleaned_content)
    
    protein_path = os.path.abspath(protein_fasta)
    print(f"✅ Protein sequences generated: {protein_fasta}\n")
else:
    # Already protein sequences
    protein_path = input_path
    print("✅ Using uploaded protein sequences directly\n")

# Run RBPdetect2 inference
print("🚀 Running RBPdetect2 classification...")
print("   This may take a few minutes depending on file size...\n")

model_dir = "/content/RBPdetect2/final_esm2_classifier"
output_tsv = "/content/output.tsv"
batch_size = 64

cmd = [
    "python", "/content/RBPdetect2/inference_parallel.py",
    model_dir, protein_path,
    "-o", output_tsv,
    "-b", str(batch_size),
]

subprocess.check_call(cmd)

print("\n✅ Classification complete!")

In [None]:
#@title Processing: Filter Results and Generate Output FASTA { display-mode: "form" }
import pandas as pd

print("🔬 Filtering for Tail Fibers (LABEL_1) and Tail Spikes (LABEL_2)...\n")

# Read the TSV file
df_output = pd.read_csv('/content/output.tsv', sep='\t')

# Filter for LABEL_1 and LABEL_2
df_filtered = df_output[df_output['predicted_label'].isin(['LABEL_1', 'LABEL_2'])].copy()

# Show summary statistics
total_proteins = len(df_output)
label_0_count = len(df_output[df_output['predicted_label'] == 'LABEL_0'])
label_1_count = len(df_output[df_output['predicted_label'] == 'LABEL_1'])
label_2_count = len(df_output[df_output['predicted_label'] == 'LABEL_2'])

print(f"📊 Classification Summary:")
print(f"   Total proteins analyzed: {total_proteins}")
print(f"   LABEL_0 (Other):         {label_0_count}")
print(f"   LABEL_1 (Tail Fibers):   {label_1_count}")
print(f"   LABEL_2 (Tail Spikes):   {label_2_count}")
print(f"\n   RBPs detected: {label_1_count + label_2_count}\n")

# Generate filtered FASTA
fasta_content = []

for index, row in df_filtered.iterrows():
    protein_id = str(row['protein_id'])  # pandas may parse purely numeric IDs as integers
    predicted_label = row['predicted_label']
    sequence = row['sequence']
    
    # Apply prefixes based on predicted_label
    if predicted_label == 'LABEL_1':
        new_protein_id = 'TF_' + protein_id
    elif predicted_label == 'LABEL_2':
        new_protein_id = 'TSP_' + protein_id
    else:
        new_protein_id = protein_id
    
    fasta_content.append(f'>{new_protein_id}\n{sequence}')

# Write to filtered FASTA file
filtered_fasta = '/content/filtered_proteins.fasta'
with open(filtered_fasta, 'w') as f:
    f.write('\n'.join(fasta_content))

print(f"✅ Filtered FASTA file generated: filtered_proteins.fasta")

---
## Results

### Preview: Classification Results (output.tsv)

`output.tsv` lists every analyzed protein with its predicted label and confidence score. The cell below previews the first 20 rows, followed by all detected RBPs.

In [None]:
import pandas as pd

df = pd.read_csv('/content/output.tsv', sep='\t')

# Display first 20 rows
print("📄 First 20 entries from output.tsv:\n")
display(df.head(20))

# Show RBPs only
df_rbps = df[df['predicted_label'].isin(['LABEL_1', 'LABEL_2'])]
if len(df_rbps) > 0:
    print(f"\n🎯 Detected RBPs ({len(df_rbps)} total):\n")
    display(df_rbps)
else:
    print("\n⚠️ No RBPs (Tail Fibers or Tail Spikes) detected in this dataset.")

### Preview: Filtered RBP Sequences (filtered_proteins.fasta)

This file contains only the detected Tail Fibers (TF_) and Tail Spikes (TSP_).
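
Once downloaded, the file can be tallied by prefix with a few lines of Python. This is a minimal sketch: `count_rbp_prefixes` is a hypothetical helper, and the example records are made up rather than real pipeline output.

```python
from collections import Counter
from io import StringIO


def count_rbp_prefixes(handle):
    """Count FASTA headers whose ID starts with 'TF_' or 'TSP_'."""
    counts = Counter()
    for line in handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        if header.startswith("TF_"):
            counts["tail_fiber"] += 1
        elif header.startswith("TSP_"):
            counts["tail_spike"] += 1
    return counts


# Made-up records standing in for filtered_proteins.fasta.
example = StringIO(">TF_protein_1\nMKT\n>TSP_protein_2\nMAA\n>TF_protein_3\nMGG\n")
print(count_rbp_prefixes(example))
```

To run it on real output, pass an open file handle instead, for example `open("filtered_proteins.fasta")`.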

In [None]:
print("🧬 Preview of filtered_proteins.fasta (first 30 lines):\n")
!head -n 30 /content/filtered_proteins.fasta

---
## Download Results

Run the cell below to download both result files through your browser.

In [None]:
from google.colab import files

print("📥 Downloading output files...\n")

# Download TSV
print("1️⃣ Downloading output.tsv (all classification results)")
files.download('/content/output.tsv')

# Download filtered FASTA
print("2️⃣ Downloading filtered_proteins.fasta (RBPs only)")
files.download('/content/filtered_proteins.fasta')

print("\n✅ Downloads complete!")

---
## Label Legend

- **LABEL_0**: Other proteins (not RBPs)
- **LABEL_1**: Tail Fibers (TF_)
- **LABEL_2**: Tail Spikes (TSP_)
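
For downstream analysis it can help to translate these labels into readable names. The sketch below assumes a TSV with the `protein_id` and `predicted_label` columns used in this notebook; `LABEL_NAMES` and `readable_labels` are hypothetical names, and the example rows are made up.

```python
import csv
from io import StringIO

# Mapping taken from the legend above.
LABEL_NAMES = {
    "LABEL_0": "Other",
    "LABEL_1": "Tail Fiber",
    "LABEL_2": "Tail Spike",
}


def readable_labels(tsv_handle):
    """Return (protein_id, readable label) pairs from a tab-separated file."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        (row["protein_id"], LABEL_NAMES.get(row["predicted_label"], "Unknown"))
        for row in reader
    ]


# Made-up rows standing in for output.tsv.
example_tsv = StringIO("protein_id\tpredicted_label\np1\tLABEL_0\np2\tLABEL_2\n")
print(readable_labels(example_tsv))
```

Labels missing from the mapping come back as "Unknown" rather than raising an error, which keeps the helper usable if the model's label set changes.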

---

**Citation**: If you use RBPdetect2 in your research, please cite the original repository:
https://github.com/victornemeth/RBPdetect2