# Streptococcus S protein

OG: 4HHT0

S proteins are LysM-containing proteins playing a central role in pathogenicity:
- immune evasion in Group A Streptococci (camouflaging with red blood cell debris) [1]
- resistance to beta-lactam antibiotics through recruitment of protein PBP1a [2]
- resistance to lysozyme through recruitment of protein PgdA [2]
  - PgdA is a peptidoglycan N-acetylglucosamine deacetylase, i.e. it removes the acetyl group from N-acetylglucosamine residues of the peptidoglycan.
  - This modification helps the bacteria resist lysozyme [3].

Mutants of _S. pneumoniae_ lacking the S protein show reduced virulence in mouse models and are more susceptible to beta-lactam antibiotics and lysozyme [2].

Refs: 
- [1] [Wierzbicki et al., 2019](https://doi.org/10.1016/j.celrep.2019.11.001)
- [2] [Burnier et al., 2024](https://doi.org/10.1101/2024.11.08.622053)
- [3] [Bui et al., 2011](https://doi.org/10.1016/j.bcp.2011.03.028)

We found that S proteins are absent from all complete genomes of _S. equi_.

_S. equi_ is primarily a horse pathogen but _S. equi subsp. zooepidemicus_ has zoonotic potential: it also infects a wide range of mammals, including dogs, cats, ruminants, pigs, and humans.

In [98]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import Phylo

cwd = os.getcwd()
if cwd.endswith('notebook'):
    os.chdir('..')
    cwd = os.getcwd()

from src.tree.tree_util import prune_leaves_with_unknown_id
from src.tree.itol_annotation import itol_labels, itol_colored_ranges, itol_binary_annotations, hex_to_rgba

In [67]:
sns.set_theme(palette='colorblind', font_scale=1.3)
palette_colorblind = sns.color_palette('colorblind').as_hex()
palette_pastel = sns.color_palette('pastel').as_hex()

data_folder = Path('./data/')
assert data_folder.is_dir()

db_proka = Path('../db_proka/')
assert db_proka.is_dir()

gtdb_folder = Path('../data/gtdb_r220/')
assert gtdb_folder.is_dir()

strep_folder = gtdb_folder / 'Streptococcus'
assert strep_folder.is_dir()

In [68]:
metadata_df = pd.read_csv(strep_folder / 'genomes_metadata.csv', index_col='assembly_accession')
metadata_df.head()

assembly_accession,accession,ambiguous_bases,checkm2_completeness,checkm2_contamination,checkm2_model,checkm_completeness,checkm_contamination,checkm_marker_count,checkm_marker_lineage,checkm_marker_set_count,...,trna_aa_count,trna_count,trna_selenocysteine_count,domain,gtdb_phylum,gtdb_class,gtdb_order,gtdb_family,gtdb_genus,gtdb_species
GCF_900636555.1,RS_GCF_900636555.1,0,100.0,0.14,Specific,100.0,0.0,475,o__Lactobacillales (UID544),267,...,19,59,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus gordonii
GCF_015908985.1,RS_GCF_015908985.1,0,100.0,0.23,Specific,100.0,0.0,475,o__Lactobacillales (UID544),267,...,19,56,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus suis
GCF_001266635.1,RS_GCF_001266635.1,0,100.0,0.06,Specific,99.82,0.0,524,f__Streptococcaceae (UID545),282,...,18,63,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus agalactiae
GCF_004154025.1,RS_GCF_004154025.1,0,99.99,0.13,Specific,99.85,0.0,676,g__Streptococcus (UID722),182,...,19,67,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus pyogenes
GCF_029011635.1,RS_GCF_029011635.1,0,100.0,0.2,Specific,100.0,0.18,524,f__Streptococcaceae (UID545),282,...,19,80,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Streptococcus,Streptococcus agalactiae


In [69]:
pfam_df = pd.read_csv(strep_folder / 'Streptococcus_all_proteins.pfam.csv', index_col='assembly_accession')
pfam_df['gtdb_species'] = [metadata_df.loc[a, 'gtdb_species'] for a in pfam_df.index]

tigr_df = pd.read_csv(strep_folder / 'Streptococcus_all_proteins.tigr.csv', index_col='assembly_accession')
tigr_df['gtdb_species'] = [metadata_df.loc[a, 'gtdb_species'] for a in tigr_df.index]

In [70]:
eggNOG_annotations_path = strep_folder / 'Streptococcus_eggNOG_annotations.csv'
eggNOG_df = pd.read_csv(eggNOG_annotations_path, index_col='assembly_accession')
eggNOG_df.head()

assembly_accession,protein_id,seed_ortholog,evalue,score,eggNOG_OGs,max_annot_lvl,COG_category,Description,Preferred_name,GOs,...,KEGG_ko,KEGG_Pathway,KEGG_Module,KEGG_Reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG_Reaction,PFAMs
GCF_003963555.1,WP_126467658.1@GCF_003963555.1,1000570.HMPREF9966_1759,9.704e-82,271.0,"COG0716@1|root,COG0716@2|Bacteria,1V45R@1239|F...",2|Bacteria,C,Flavodoxin,,,...,,,,,,,,,,Flavodoxin_4
GCF_003963555.1,WP_126467724.1@GCF_003963555.1,862970.SAIN_1577,1.3129999999999999e-186,582.0,"COG1396@1|root,COG1396@2|Bacteria,1VIH9@1239|F...",2|Bacteria,K,Helix-turn-helix XRE-family like proteins,,,...,,,,,,,,,,"HTH_19,HTH_3,TPR_12,TPR_8"
GCF_003963555.1,WP_126467790.1@GCF_003963555.1,862970.SAIN_1636,2.776e-131,418.0,"COG1564@1|root,COG1564@2|Bacteria,1VA0W@1239|F...",2|Bacteria,H,"Thiamin pyrophosphokinase, vitamin B1 binding ...",thiN,,...,ko:K00949,"ko00730,ko01100,map00730,map01100",,R00619,"RC00002,RC00017","ko00000,ko00001,ko01000",,,,"TPK_B1_binding,TPK_catalytic"
GCF_003963555.1,WP_126467857.1@GCF_003963555.1,176090.SSIN_0693,1.181e-201,627.0,"COG3677@1|root,COG3677@2|Bacteria,1V4D1@1239|F...",2|Bacteria,L,ISXO2-like transposase domain,,,...,,,,,,,,,,"DDE_Tnp_IS1595,Zn_Tnp_IS1595"
GCF_003963555.1,WP_126467993.1@GCF_003963555.1,862969.SCI_1925,1.812e-256,791.0,"COG0612@1|root,COG0612@2|Bacteria,1TPN6@1239|F...",2|Bacteria,S,Peptidase M16 inactive,ymfF,,...,,,,,,,,,,"Peptidase_M16,Peptidase_M16_C"


In [71]:
def get_unique_ogs(eggNOG_df, accessions, og_whitelist=None):
    index = eggNOG_df.index

    og_union = set()
    for acc in accessions:
        if acc not in index:
            print(f'Not in index: {acc}')
            continue

        # All proteins of this genome, indexed by protein id
        genome_ogs = eggNOG_df.loc[[acc]].set_index('protein_id')

        ogs = set()
        for protein_id in genome_ogs.index:
            protein_ogs = genome_ogs.loc[protein_id, 'eggNOG_OGs']

            candidate_ogs = {
                og_with_tax.split('@')[0]
                for og_with_tax in protein_ogs.split(',')
            }

            if og_whitelist is not None:
                candidate_ogs = candidate_ogs & og_whitelist
                
            ogs |= candidate_ogs

        og_union = og_union.union(ogs)

    return og_union

In [72]:
def get_ogs_present_in_all(eggNOG_df, accessions, og_whitelist=None):
    index = eggNOG_df.index

    og_intersection = None
    for acc in accessions:
        if acc not in index:
            print(f'Not in index: {acc}')
            continue

        # All proteins of this genome, indexed by protein id
        genome_ogs = eggNOG_df.loc[[acc]].set_index('protein_id')

        ogs = set()
        for protein_id in genome_ogs.index:
            protein_ogs = genome_ogs.loc[protein_id, 'eggNOG_OGs']

            candidate_ogs = {
                og_with_tax.split('@')[0]
                for og_with_tax in protein_ogs.split(',')
            }

            if og_whitelist is not None:
                candidate_ogs = candidate_ogs & og_whitelist
                
            ogs |= candidate_ogs

        if og_intersection is None:
            og_intersection = ogs
        else:
            og_intersection &= ogs

    return og_intersection
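The two helpers above differ only in how the per-genome OG sets are combined: a union gives every OG seen in at least one genome, an intersection gives the OGs shared by all genomes. The block below is a minimal sketch of that idea on toy data, using plain dictionaries instead of the eggNOG DataFrame. The names `parse_ogs` and `core_and_pan` and the toy genome ids are hypothetical and not part of this notebook's code base.

```python
def parse_ogs(og_string):
    """Return the bare OG ids from an eggNOG_OGs string,
    e.g. 'COG0716@1|root,1V45R@1239|Firmicutes' -> {'COG0716', '1V45R'}."""
    return {token.split('@')[0] for token in og_string.split(',') if token}


def core_and_pan(og_strings_by_genome):
    """Return (core, pan): OGs found in every genome, and OGs found in at least one."""
    per_genome = [
        set().union(*(parse_ogs(s) for s in og_strings))
        for og_strings in og_strings_by_genome.values()
    ]
    if not per_genome:
        return set(), set()
    return set.intersection(*per_genome), set.union(*per_genome)


# Toy input (made-up): genome id -> one eggNOG_OGs string per protein.
toy = {
    'genome_A': ['COG0716@1|root,COG0716@2|Bacteria', 'COG1396@1|root'],
    'genome_B': ['COG0716@1|root', 'COG1564@1|root,COG1564@2|Bacteria'],
}
core, pan = core_and_pan(toy)
print('core:', sorted(core))
print('pan: ', sorted(pan))
```

Unlike `get_ogs_present_in_all`, this sketch returns an empty set rather than `None` when no genome is given.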

In [73]:
cog_ogs = {
    og
    for og_str in eggNOG_df['eggNOG_OGs']
    for og_raw in og_str.split(',')
    if (og := og_raw.split('@')[0]).startswith('COG')
}

len(cog_ogs)

2190

In [74]:
og_metadata = pd.read_csv(
    strep_folder / 'eggNOG' / 'e5.og_annotations.tsv',
    sep='\t', 
    header=None,
    names=['og', 'og_level', 'description'],
).drop_duplicates('og').set_index('og')
og_metadata.head()

def display_ogs(ogs):
    for og in ogs:
        desc = og_metadata.loc[og, 'description']
        print(f'{og}: {desc}')

In [75]:
species = [
    'Streptococcus pyogenes',
    'Streptococcus dysgalactiae',
    'Streptococcus equi',
    'Streptococcus canis',
]
metadata_subset = metadata_df[metadata_df['gtdb_species'].isin(species)].copy()

accessions = metadata_subset.index

eggNOG_subset = eggNOG_df.loc[accessions]

In [76]:
accessions_S_protein = sorted(set(eggNOG_subset[eggNOG_subset['eggNOG_OGs'].str.contains('4HHT0')].index))
accessions_no_S_protein = sorted(set(accessions) - set(accessions_S_protein))
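The selection above relies on substring matching (`str.contains('4HHT0')`), which would also match any longer identifier that happens to contain the query. A stricter alternative is to compare whole OG tokens. The helper below is a hypothetical sketch of that check and is not used elsewhere in this notebook; the example strings are made up.

```python
def has_og(og_string, og_id):
    """True if og_id is exactly one of the OG ids listed in an eggNOG_OGs string."""
    return any(token.split('@')[0] == og_id for token in og_string.split(','))


# Made-up annotation strings for illustration only.
exact = '4HHT0@91061|Bacilli,COG1388@1|root'
lookalike = 'X4HHT0Y@2|Bacteria'

print(has_og(exact, '4HHT0'))
print(has_og(lookalike, '4HHT0'))
```

With a DataFrame, such a helper could be applied row by row, e.g. `eggNOG_subset['eggNOG_OGs'].map(lambda s: has_og(s, '4HHT0'))`, assuming the column contains no missing values.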

In [77]:
no_S_protein = get_ogs_present_in_all(eggNOG_subset, accessions_no_S_protein, cog_ogs)
no_S_protein_all = get_unique_ogs(eggNOG_subset, accessions_no_S_protein, cog_ogs)

len(no_S_protein), len(no_S_protein_all)

(856, 1237)

In [78]:
yes_S_protein = get_ogs_present_in_all(eggNOG_subset, accessions_S_protein, cog_ogs)
yes_S_protein_all = get_unique_ogs(eggNOG_subset, accessions_S_protein, cog_ogs)

len(yes_S_protein), len(yes_S_protein_all)

(586, 1435)

## Gain in S. equi

COGs present in every genome lacking the S protein (_S. equi_) but absent from all related streptococci that carry it.
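Both comparisons in this notebook reduce to a set difference between the core of one group and the pan-genome of the other: a gain is in the core of the genomes without S protein and nowhere in the genomes with it, and a loss is the mirror image. The sketch below illustrates this on made-up OG ids; the function name `gains_and_losses` is hypothetical.

```python
def gains_and_losses(core_without, pan_without, core_with, pan_with):
    """Gains: in every genome without S protein, in no genome with it.
    Losses: in every genome with S protein, in no genome without it."""
    gains = core_without - pan_with
    losses = core_with - pan_without
    return gains, losses


# Made-up OG ids for illustration only.
core_without = {'OG1', 'OG2', 'OG5'}
pan_without = {'OG1', 'OG2', 'OG5', 'OG6'}
core_with = {'OG1', 'OG3', 'OG6'}
pan_with = {'OG1', 'OG2', 'OG3', 'OG4', 'OG6'}

gains, losses = gains_and_losses(core_without, pan_without, core_with, pan_with)
print('Gain:', sorted(gains))
print('Loss:', sorted(losses))
```

Comparing a core against a pan-genome is deliberately strict: a single genome carrying the OG in the other group is enough to exclude it, as happens to `OG2` and `OG6` in the toy data.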

In [79]:
missing_in_yes = no_S_protein - yes_S_protein_all
print('Gain:', len(missing_in_yes), 'OGs')
display_ogs(sorted(missing_in_yes))

Gain: 13 OGs
COG1345: cell adhesion
COG1376: peptidoglycan L,D-transpeptidase activity
COG1434: Gram-negative-bacterium-type cell wall biogenesis
COG2848: Uncharacterised ACR (DUF711)
COG3458: cephalosporin-C deacetylase activity
COG3830: Belongs to the UPF0237 family
COG3958: Transketolase
COG3959: Transketolase, thiamine diphosphate binding domain
COG4495: Domain of unknown function (DUF4176)
COG4935: serine-type endopeptidase activity
COG5353: nan
COG5444: nuclease activity
COG5585: self proteolysis


## Loss in S. equi

COGs present in every related streptococcal genome carrying the S protein but absent from all genomes lacking it (_S. equi_).

In [80]:
missing_in_no = yes_S_protein - no_S_protein_all
print('Loss:', len(missing_in_no), 'OGs')
display_ogs(sorted(missing_in_no))

Loss: 9 OGs
COG0010: Belongs to the arginase family
COG0625: glutathione transferase activity
COG0730: response to heat
COG0813: purine-nucleoside phosphorylase activity
COG1054: Rhodanese Homology Domain
COG1902: FMN binding
COG2110: O-acetyl-ADP-ribose deacetylase activity
COG2141: oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen
COG3643: Formiminotransferase domain


### Export list of gains & losses

In [81]:
all_ogs_in_subset = {og_raw.split('@')[0] for l in eggNOG_subset['eggNOG_OGs'].values for og_raw in l.split(',')}
all_COGs_in_subset = {og for og in all_ogs_in_subset if og.startswith('COG')}
n_ogs = len(all_ogs_in_subset)
n_cogs = len(all_COGs_in_subset)

print(f'# OGs: {n_ogs:,} (COGs: {n_cogs:,})')

# OGs: 13,782 (COGs: 1,489)


In [82]:
no_S_protein = get_ogs_present_in_all(eggNOG_subset, accessions_no_S_protein)
no_S_protein_all = get_unique_ogs(eggNOG_subset, accessions_no_S_protein)

yes_S_protein = get_ogs_present_in_all(eggNOG_subset, accessions_S_protein)
yes_S_protein_all = get_unique_ogs(eggNOG_subset, accessions_S_protein)

missing_in_yes = no_S_protein - yes_S_protein_all
missing_in_no = yes_S_protein - no_S_protein_all

missing_in_yes_cogs = {og for og in missing_in_yes if og.startswith('COG')}
missing_in_no_cogs = {og for og in missing_in_no if og.startswith('COG')}

print(f'# gains : {len(missing_in_yes):,} ({100 * len(missing_in_yes) / n_ogs:.1f} %)') 
print(f'\tCOGs: {len(missing_in_yes_cogs):,} ({100 * len(missing_in_yes_cogs) / n_cogs:.1f} %)')
print(f'# losses: {len(missing_in_no):,} ({100 * len(missing_in_no) / n_ogs:.1f} %)')
print(f'\tCOGs: {len(missing_in_no_cogs):,} ({100 * len(missing_in_no_cogs) / n_cogs:.1f} %)')

# gains : 133 (1.0 %)
	COGs: 13 (0.9 %)
# losses: 44 (0.3 %)
	COGs: 9 (0.6 %)


In [83]:
uniprot_mapping = pd.read_csv(strep_folder / 'Streptococcus_all_proteins_UniProtKB_map.tsv', sep='\t', index_col='query')
uniprot_set = set(uniprot_mapping.index)
uniprot_mapping.head()

query,target,qlen,tlen,fident,alnlen,mismatch,qstart,qend,tstart,tend,evalue,bits
WP_000031175.1@GCF_009730515.1,A0A380KSV8,417,417,1.0,417,0,1,417,1,417,1.929e-311,963
WP_000730403.1@GCF_009730515.1,A0A1T0C0H9,305,305,1.0,305,0,1,305,1,305,2.411e-230,719
WP_011999767.1@GCF_009730535.1,A0A1V0H0P2,426,426,1.0,426,0,1,426,1,426,0.0,1008
WP_012130945.1@GCF_009730535.1,A8AZI4,849,849,1.0,849,0,1,849,1,849,0.0,2006
WP_156011810.1@GCF_009731465.1,A0A6H3S4G9,250,250,1.0,250,0,1,250,1,250,2.376e-184,582


In [84]:
data_og = {
    'eggNOG_OG': [],
    'description': [],
    'is_gain': [],
    'is_cog': [],
    'PFAMs': [],
    'example_uniprot_id': [],
}
for is_gain, ogs in [(True, missing_in_yes), (False, missing_in_no)]:
    for og in sorted(ogs):
        description = og_metadata.loc[og, 'description']
        if pd.isnull(description) or description == '' or description == 'nan':
            description = None

        df = eggNOG_subset[eggNOG_subset['eggNOG_OGs'].str.contains(og)]
        protein_ids = df['protein_id'].unique()
        proteins_with_uniprot_match = sorted(uniprot_set & set(protein_ids))
        if len(proteins_with_uniprot_match) > 0:
            example_uniprot_id = uniprot_mapping.loc[proteins_with_uniprot_match[0], 'target']
        else:
            example_uniprot_id = None

        pfams = None
        for pfam_list in df['PFAMs'].unique():
            # Missing annotations (NaN) are treated as an empty string, as before
            s = set((pfam_list if isinstance(pfam_list, str) else '').split(','))

            if pfams is None:
                pfams = s
            else:
                pfams &= s

        pfams = ','.join(sorted(pfams))
        
        data_og['eggNOG_OG'].append(og)
        data_og['description'].append(description)
        data_og['is_gain'].append(is_gain)
        data_og['is_cog'].append(og.startswith('COG'))
        data_og['PFAMs'].append(pfams)
        data_og['example_uniprot_id'].append(example_uniprot_id)

comparative_genomics_df = pd.DataFrame.from_dict(data_og).set_index('eggNOG_OG', drop=True).sort_values(
    ['is_gain', 'is_cog'],
    ascending=False,    
)
comparative_genomics_df.head()

eggNOG_OG,description,is_gain,is_cog,PFAMs,example_uniprot_id
COG1345,cell adhesion,True,True,"Flagellin_IN,FliD_C,FliD_N",C0M7S8
COG1376,"peptidoglycan L,D-transpeptidase activity",True,True,YkuD,A0A380JSM9
COG1434,Gram-negative-bacterium-type cell wall biogenesis,True,True,DUF218,B4U0X8
COG2848,Uncharacterised ACR (DUF711),True,True,DUF711,C0M7F4
COG3458,cephalosporin-C deacetylase activity,True,True,AXE1,B4U0U4


In [85]:
comparative_genomics_df.to_csv(strep_folder / 'S_equi_gain_loss.csv')

## Tree

In [97]:
strep_tree_path = strep_folder / 'tree' / 'Streptococcus.tree'
strep_tree = Phylo.read(strep_tree_path, 'phyloxml')

strep_tree_subset = prune_leaves_with_unknown_id(strep_tree, set(metadata_subset.index))

assert len(strep_tree_subset.get_terminals()) == len(metadata_subset)

strep_tree_subset_path = strep_folder / 'tree' / 'Streptococcus_zoom.tree'
with strep_tree_subset_path.open('w') as f_out:
    Phylo.write([strep_tree_subset], f_out, 'phyloxml')

strep_tree_subset = Phylo.read(strep_tree_subset_path, 'phyloxml')

### Labels

In [99]:
labels = []
for accession in metadata_subset.index:
    ncbi_organism_name = metadata_df.loc[accession, 'ncbi_organism_name']
    label = f'{ncbi_organism_name} [{accession}]'
    labels.append([accession, label])

itol_labels(
    labels, 
    strep_folder / 'tree' / 'labels_zoom.txt'
)

### Binary annotations: presence or absence of S-protein

In [107]:
cwb_bins = pd.read_csv(strep_folder / 'Streptococcus_cell_wall_binding.csv', index_col='assembly_accession')
lysM_accessions = set(cwb_bins[cwb_bins['LysM'] > 0].index)

accessions_pgh = sorted(set(eggNOG_subset[
    eggNOG_subset['eggNOG_OGs'].str.contains('COG0860') &
    eggNOG_subset['eggNOG_OGs'].str.contains('COG1388')
].index) & lysM_accessions)

In [110]:
s_protein_binary_data = []
for accession in sorted(metadata_subset.index):
    s_protein = '1' if accession in accessions_S_protein else '-1'
    pgh = '1' if accession in accessions_pgh else '-1'
    s_protein_binary_data.append([accession, s_protein, pgh])

itol_binary_annotations(
    data=s_protein_binary_data,
    output_path=strep_folder / 'tree' / 'zoom_S_protein_binary_presence.txt',
    field_shapes=[1, 1],
    field_labels=['S-protein', 'Amidase_3 + LysM'],
    dataset_label='S-protein',
    field_colors=['#008000', '#FF6347'],
    legend_title='S-protein',
    height_factor=1,
)
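For readers without access to `src.tree.itol_annotation`, the block below is a rough sketch of what a binary iTOL dataset file can look like. It follows the general layout of iTOL's `DATASET_BINARY` template (1 draws a filled shape, -1 draws nothing) but is not the implementation of `itol_binary_annotations`. The function name `binary_dataset_text` and the accessions are made up for illustration.

```python
def binary_dataset_text(rows, field_labels, field_colors, field_shapes, dataset_label):
    """Build the text of an iTOL binary dataset (hypothetical helper).

    rows: iterable of [leaf_id, value_1, ..., value_n], one value per field.
    """
    if not (len(field_labels) == len(field_colors) == len(field_shapes)):
        raise ValueError('field_labels, field_colors and field_shapes must have equal length')

    lines = [
        'DATASET_BINARY',
        'SEPARATOR COMMA',
        f'DATASET_LABEL,{dataset_label}',
        f'COLOR,{field_colors[0]}',
        'FIELD_SHAPES,' + ','.join(str(s) for s in field_shapes),
        'FIELD_LABELS,' + ','.join(field_labels),
        'FIELD_COLORS,' + ','.join(field_colors),
        'DATA',
    ]
    for leaf_id, *values in rows:
        if len(values) != len(field_labels):
            raise ValueError(f'{leaf_id}: expected {len(field_labels)} values')
        lines.append(','.join([leaf_id, *map(str, values)]))
    return '\n'.join(lines) + '\n'


# Made-up accessions for illustration only.
rows = [
    ['GCF_000000001.1', '1', '-1'],
    ['GCF_000000002.1', '-1', '1'],
]
text = binary_dataset_text(
    rows,
    field_labels=['S-protein', 'Amidase_3 + LysM'],
    field_colors=['#008000', '#FF6347'],
    field_shapes=[1, 1],
    dataset_label='S-protein',
)
print(text)
```

The comma separator works here because none of the labels contain a comma; a label with commas would need a different `SEPARATOR`.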

### Colored ranges

In [102]:
range_colors = {
    'Streptococcus pyogenes': '#a1c9f4',
    'Streptococcus dysgalactiae': '#debb9b',
    'Streptococcus equi': '#fab0e4',
}

colored_ranges = []
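# Use a distinct loop variable so the `species` list defined earlier is not overwritten.
# 'Streptococcus canis' has no entry in range_colors, so it gets no colored range.
for species_name, color in range_colors.items():
    node_id = f's__{species_name}'
    colored_ranges.append(
        [node_id, node_id, hex_to_rgba(color, 0.25), '', '', '', '', species_name, '', '10', 'bold-italic']
    )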

itol_colored_ranges(
    colored_ranges,
    output_path=strep_folder / 'tree' / 'zoom_species_colored_range.txt',
    range_type='box',
    range_cover='tree',
    dataset_label='Species',
)
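The range colors above are pastel hex codes made translucent with `hex_to_rgba(color, 0.25)`. That helper lives in `src.tree.itol_annotation` and is not shown here; the sketch below is one plausible way to do such a conversion, assuming the target is a CSS-style `rgba(r,g,b,a)` string. The name `hex_to_rgba_sketch` is hypothetical and the real helper may format its output differently.

```python
def hex_to_rgba_sketch(hex_color, alpha):
    """Convert '#rrggbb' plus an opacity in [0, 1] to an 'rgba(r,g,b,a)' string."""
    value = hex_color.lstrip('#')
    if len(value) != 6:
        raise ValueError(f'expected a 6-digit hex color, got {hex_color!r}')
    if not 0 <= alpha <= 1:
        raise ValueError('alpha must be between 0 and 1')
    # Each pair of hex digits encodes one 0-255 channel.
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return f'rgba({r},{g},{b},{alpha})'


rgba = hex_to_rgba_sketch('#a1c9f4', 0.25)
print(rgba)
```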