# Focus on genus Streptococcus

We found `LysM` in 9 out of 10 representative genomes of _Streptococcus_ species in our dataset; the exception, _Streptococcus equi_, does not encode LysM in its otherwise complete genome. Does this mean that LysM does not bind the cell wall of _Streptococcus equi_? We don't know, but the absence is intriguing.

Our phylogenetically balanced dataset `db_proka` includes at most 10 species per genus. To keep exploring the presence and absence patterns of LysM and other cell wall binding (CWB) domains, this notebook gathers all Streptococci with a complete genome in GTDB (release 220).

In [9]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import Phylo

cwd = os.getcwd()
if cwd.endswith('notebook'):
    os.chdir('..')
    cwd = os.getcwd()

from src.cell_wall_binding_domains import cwb_domains

In [48]:
sns.set_theme(palette='colorblind', font_scale=1.3)
palette = sns.color_palette().as_hex()

data_folder = Path('./data/')
assert data_folder.is_dir()

db_proka = Path('../db_proka/')
assert db_proka.is_dir()

gtdb_folder = Path('../data/gtdb_r220/')
assert gtdb_folder.is_dir()

## Load GTDB r220 bacterial metadata

In [11]:
bac_metadata = pd.read_csv(
    gtdb_folder / 'bac120_metadata_r220.tsv.gz', 
    sep='\t',
)
bac_metadata['assembly_accession'] = [a[3:] for a in bac_metadata['accession'].values]

bac_metadata['domain'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[0].replace('d__', ''))
bac_metadata['gtdb_phylum'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[1].replace('p__', ''))
bac_metadata['gtdb_class'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[2].replace('c__', ''))
bac_metadata['gtdb_order'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[3].replace('o__', ''))
bac_metadata['gtdb_family'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[4].replace('f__', ''))
bac_metadata['gtdb_genus'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[5].replace('g__', ''))
bac_metadata['gtdb_species'] = bac_metadata['gtdb_taxonomy'].apply(lambda t: t.split(';')[6].replace('s__', ''))

bac_metadata = bac_metadata.set_index('assembly_accession', drop=True)

print(f'Number of genomes: {len(bac_metadata):,}')

bac_metadata.head()

Number of genomes: 584,382


Unnamed: 0_level_0,accession,ambiguous_bases,checkm2_completeness,checkm2_contamination,checkm2_model,checkm_completeness,checkm_contamination,checkm_marker_count,checkm_marker_lineage,checkm_marker_set_count,...,trna_aa_count,trna_count,trna_selenocysteine_count,domain,gtdb_phylum,gtdb_class,gtdb_order,gtdb_family,gtdb_genus,gtdb_species
assembly_accession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
GCF_000657795.2,RS_GCF_000657795.2,0,100.0,0.14,Specific,99.53,0.0,426,o__Burkholderiales (UID4000),213,...,19,55,1,Bacteria,Pseudomonadota,Gammaproteobacteria,Burkholderiales,Burkholderiaceae,Bordetella,Bordetella pseudohinzii
GCF_001072555.1,RS_GCF_001072555.1,7,100.0,0.52,Specific,99.81,0.09,773,g__Staphylococcus (UID294),178,...,14,36,0,Bacteria,Bacillota,Bacilli,Staphylococcales,Staphylococcaceae,Staphylococcus,Staphylococcus epidermidis
GCF_003050715.1,RS_GCF_003050715.1,0,100.0,0.04,Specific,99.6,0.22,769,g__Burkholderia (UID4006),248,...,19,52,0,Bacteria,Pseudomonadota,Gammaproteobacteria,Burkholderiales,Burkholderiaceae,Paraburkholderia,Paraburkholderia graminis
GCF_016772635.1,RS_GCF_016772635.1,0,100.0,0.16,Specific,100.0,0.04,1169,f__Enterobacteriaceae (UID5139),340,...,19,86,1,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Salmonella,Salmonella enterica
GCA_000615405.1,GB_GCA_000615405.1,0,100.0,1.37,Specific,99.62,0.57,471,o__Lactobacillales (UID543),264,...,17,45,0,Bacteria,Bacillota,Bacilli,Lactobacillales,Streptococcaceae,Lactococcus,Lactococcus lactis
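
The seven `apply` calls above all do the same thing: split the `gtdb_taxonomy` string on `;` and strip the rank prefix. A minimal sketch of a single helper that does this once per lineage is shown below. The names `GTDB_RANKS` and `parse_gtdb_taxonomy` are hypothetical, and the example lineage string is illustrative rather than copied from the metadata file.

```python
# Rank names in the order they appear in a GTDB lineage string.
GTDB_RANKS = ('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')


def parse_gtdb_taxonomy(lineage):
    """Split a GTDB lineage string into a {rank: name} dictionary."""
    taxa = lineage.split(';')
    if len(taxa) != len(GTDB_RANKS):
        raise ValueError(f'Expected {len(GTDB_RANKS)} ranks, got {len(taxa)}: {lineage!r}')

    parsed = {}
    for rank, taxon in zip(GTDB_RANKS, taxa):
        prefix = f'{rank[0]}__'  # d__, p__, c__, o__, f__, g__, s__
        if not taxon.startswith(prefix):
            raise ValueError(f'Expected prefix {prefix!r} in {taxon!r}')
        # Slice the prefix off instead of calling str.replace,
        # so only the leading prefix is removed.
        parsed[rank] = taxon[len(prefix):]
    return parsed


# Illustrative lineage string (an assumption, not a row from the file).
example = parse_gtdb_taxonomy(
    'd__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;'
    'f__Streptococcaceae;g__Streptococcus;s__Streptococcus equi'
)
print(example['genus'], '|', example['species'])
```

Unlike the cell above, this helper raises on malformed lineages instead of silently producing shifted columns. Whether that strictness is wanted depends on how clean the metadata file is.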


## Streptococcus

In [12]:
strep_df = bac_metadata[bac_metadata['gtdb_genus'] == 'Streptococcus'].copy()

n_streptococci = len(strep_df)
print(f'Number of Streptococci genomes: {n_streptococci:,}')

n_complete_genomes = len(strep_df[strep_df['ncbi_assembly_level'] == 'Complete Genome'])
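# NOTE: checkm2_completeness appears to be a percentage (values such as 100.0 above),
# so the 0.98 cutoff below is met by every genome; a 98% cutoff would be `> 98`.
n_good_genomes = len(strep_df[
    (strep_df['ncbi_assembly_level'] == 'Complete Genome') |
    (strep_df['checkm2_completeness'] > 0.98)
])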

print(f'Number of Streptococci complete genomes: {n_complete_genomes:,}')
print(f'Number of Streptococci > 98% complete  : {n_good_genomes:,}')

Number of Streptococci genomes: 20,045
Number of Streptococci complete genomes: 1,228
Number of Streptococci > 98% complete  : 20,045
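
The "> 98% complete" count equals the total number of genomes. The metadata preview shows `checkm2_completeness` values such as 100.0, which suggests the column is on a 0-100 scale. The sketch below uses a small made-up table (the genome names and values are assumptions) to show how the two cutoffs behave under that assumption.

```python
import pandas as pd

# Toy metadata table; values are invented for illustration only.
toy = pd.DataFrame(
    {
        'ncbi_assembly_level': ['Complete Genome', 'Contig', 'Scaffold'],
        'checkm2_completeness': [100.0, 99.2, 85.4],
    },
    index=['genome_a', 'genome_b', 'genome_c'],
)

MIN_COMPLETENESS = 98.0  # percent, assuming a 0-100 scale

is_complete = toy['ncbi_assembly_level'] == 'Complete Genome'

# Cutoff expressed as a percentage: keeps genome_a and genome_b.
percent_cutoff = toy[is_complete | (toy['checkm2_completeness'] > MIN_COMPLETENESS)]

# Cutoff expressed as a fraction: every genome passes on a 0-100 scale.
fraction_cutoff = toy[is_complete | (toy['checkm2_completeness'] > 0.98)]

print(f'percent cutoff : {len(percent_cutoff)} genomes')
print(f'fraction cutoff: {len(fraction_cutoff)} genomes')
```

If the column really is a percentage, rerunning the count with `> 98` would give the number of complete or near-complete genomes.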


In [43]:
[c for c in strep_df.columns if 'repr' in c]

['gtdb_genome_representative',
 'gtdb_representative',
 'ncbi_genome_representation']

We'll focus on complete genomes first.

In [13]:
strep_cg = strep_df[strep_df['ncbi_assembly_level'] == 'Complete Genome'].copy()

n_genomes_per_species = strep_cg[['gtdb_species', 'accession']].groupby('gtdb_species').nunique().sort_values(
    'accession', ascending=False,
)

print(f'Number of species: {len(n_genomes_per_species):,}')

n_genomes_per_species.head(10)

Number of species: 110


Unnamed: 0_level_0,accession
gtdb_species,Unnamed: 1_level_1
Streptococcus pyogenes,271
Streptococcus agalactiae,197
Streptococcus pneumoniae,158
Streptococcus suis,109
Streptococcus thermophilus,88
Streptococcus dysgalactiae,43
Streptococcus equi,38
Streptococcus mutans,26
Streptococcus iniae,14
Streptococcus gordonii,13


There are 38 genomes of _Streptococcus equi_, great.

### Manual accession change

Since GTDB release 220 came out in April 2024, one genome in our Streptococcus shortlist has been updated:

GCA_029203915.1 --> GCA_029203915.2

In [14]:
strep_cg.index = strep_cg.index.where(strep_cg.index != 'GCA_029203915.1', 'GCA_029203915.2')

assert 'GCA_029203915.1' not in strep_cg.index

strep_cg.loc['GCA_029203915.2', 'gtdb_species']

'Streptococcus suis_W'
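
If more accessions change version later, the same fix can be kept in one mapping. The sketch below is one possible way to do it with `DataFrame.rename`. The toy table and the name `accession_updates` are assumptions for illustration; the two accessions and species labels are taken from this notebook.

```python
import pandas as pd

# Old accession -> new accession; extend this mapping as more updates appear.
accession_updates = {'GCA_029203915.1': 'GCA_029203915.2'}

# Toy stand-in for strep_cg, indexed by assembly accession.
toy = pd.DataFrame(
    {'gtdb_species': ['Streptococcus suis_W', 'Streptococcus equi']},
    index=pd.Index(['GCA_029203915.1', 'GCF_000219765.1'], name='assembly_accession'),
)

# Fail early if an accession listed for update is not in the table.
unknown = set(accession_updates) - set(toy.index)
if unknown:
    raise KeyError(f'Accessions not found: {sorted(unknown)}')

# rename leaves labels that are absent from the mapping untouched.
updated = toy.rename(index=accession_updates)
print(updated.index.tolist())
```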

### Export assembly accession list

In [15]:
strep_cg.to_csv(gtdb_folder / 'Streptococcus' / 'genomes_metadata.csv')

strep_cg.reset_index()[['assembly_accession']].to_csv(
    gtdb_folder / 'Streptococcus' / 'assembly_accessions.txt', 
    index=False, 
    header=False,
)

### Check outputs

Have all genomes been downloaded? Are all protein files present?

In [16]:
accessions = set()
missing_protein_files = []
for p in (gtdb_folder / 'Streptococcus' / 'genomes').iterdir():
    if p.name.startswith('GC'):
        accession = '_'.join(p.name.split('_')[:2])
        accessions.add(accession)

        if not (p / f'{p.name}_protein.faa.gz').is_file():
            missing_protein_files.append(p.name)

assert len(accessions & set(strep_cg.index)) == len(strep_cg)

print(f'Number of genomes without predicted proteins: {len(missing_protein_files):,}')

Number of genomes without predicted proteins: 0
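
The assertion above only says whether all genomes are present. When it fails, it helps to know which accessions are missing. Below is a minimal sketch of a helper that returns both lists; `check_genome_downloads` is a hypothetical name, and it assumes the same folder layout as the cell above (one folder per genome, named `<accession>_<assembly name>`).

```python
from pathlib import Path


def check_genome_downloads(genomes_dir, expected_accessions):
    """Return (missing genome accessions, folders without a protein file)."""
    found = set()
    missing_protein_files = []
    for p in sorted(Path(genomes_dir).iterdir()):
        if not (p.is_dir() and p.name.startswith('GC')):
            continue
        # Folder names look like GCA_000013525.1_<assembly name>;
        # the first two underscore-separated fields form the accession.
        found.add('_'.join(p.name.split('_')[:2]))
        if not (p / f'{p.name}_protein.faa.gz').is_file():
            missing_protein_files.append(p.name)

    missing_genomes = sorted(set(expected_accessions) - found)
    return missing_genomes, missing_protein_files
```

In this notebook it could be called as `check_genome_downloads(gtdb_folder / 'Streptococcus' / 'genomes', strep_cg.index)`, printing the two lists before asserting that both are empty.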


### GTDB tree

How many of these Streptococci are present in the GTDB tree? Presumably only the species representatives; let's check.

In [17]:
bac_tree = Phylo.read(gtdb_folder / 'bac120_r220.tree', 'newick')

In [18]:
gtdb_ids = {leaf.name for leaf in bac_tree.get_terminals()}

strep_ids_in_tree = gtdb_ids & set(strep_cg['accession'].unique())

print(f'Number of Streptococci in GTDB tree: {len(strep_ids_in_tree):,}')

Number of Streptococci in GTDB tree: 79


We need to build a Streptococcus tree:

- GTDB-Tk to identify marker genes
- Concatenate the 120 core genes
- Align with MAFFT L-INS-i
- Build the tree with IQ-TREE

TBD
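
The concatenation step in the plan above can be sketched in plain Python. This is a sketch under assumptions, not the pipeline itself: it supposes each marker gene has already been aligned on its own and parsed into a `{genome_id: aligned_sequence}` dictionary, and that genomes lacking a marker are padded with gaps. The function name `concatenate_alignments` is hypothetical.

```python
def concatenate_alignments(alignments, genomes):
    """Concatenate per-gene alignments into one supermatrix row per genome.

    alignments: list of {genome_id: aligned_sequence} dicts, one per marker gene.
    genomes: genome ids to include, in output order.
    """
    supermatrix = {genome: [] for genome in genomes}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError('All sequences in one alignment must have the same length')
        width = lengths.pop()
        for genome in genomes:
            # A genome lacking this marker contributes a run of gaps.
            supermatrix[genome].append(alignment.get(genome, '-' * width))
    return {genome: ''.join(parts) for genome, parts in supermatrix.items()}


# Two toy marker alignments; genome 'b' lacks the second marker.
toy_alignments = [
    {'a': 'MK-L', 'b': 'MKAL'},
    {'a': 'GG'},
]
concatenated = concatenate_alignments(toy_alignments, genomes=['a', 'b'])
print(concatenated)
```

Every row of the result has the same length, which is what a tree-building program expects from a concatenated alignment.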

## CWB domains in Streptococcus

The Pfam annotations below come from the database assembled with the software at [github.com/srom/assembly](https://github.com/srom/assembly).

In [19]:
strep_folder = gtdb_folder / 'Streptococcus'

In [20]:
pfam_df = pd.read_csv(strep_folder / 'Streptococcus_all_proteins.pfam.csv', index_col='assembly_accession')
pfam_df.head()

Unnamed: 0_level_0,id,protein_id,hmm_accession,hmm_query,evalue,bitscore,accuracy,start,end
assembly_accession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
GCA_000013525.1,ABF35053.1@GCA_000013525.1,ABF35053.1,PF01695.22,IstB_IS21,3.5e-07,26.8,0.71,61,266
GCA_000013525.1,ABF35053.1@GCA_000013525.1,ABF35053.1,PF00308.23,Bac_DnaA,1e-74,247.2,0.98,111,293
GCA_000013525.1,ABF35053.1@GCA_000013525.1,ABF35053.1,PF00004.34,AAA,6.3e-08,30.0,0.74,147,272
GCA_000013525.1,ABF35053.1@GCA_000013525.1,ABF35053.1,PF08299.16,Bac_DnaA_C,3.1e-31,104.1,0.99,360,429
GCA_000013525.1,ABF35054.1@GCA_000013525.1,ABF35054.1,PF00712.24,DNA_pol3_beta,7.2e-27,90.8,0.98,1,127


In [21]:
assert len(set(pfam_df.index)) == len(strep_cg)

In [22]:
data = {
    'assembly_accession': [],
}
for cwb in cwb_domains:
    data[cwb] = []

for accession in sorted(set(pfam_df.index)):
    data['assembly_accession'].append(accession)
    df = pfam_df.loc[[accession]]
    for cwb in cwb_domains:
        n = len(df[df['hmm_query'] == cwb]['protein_id'].unique())
        data[cwb].append(n)

cwb_domain_counts = pd.DataFrame.from_dict(data).set_index('assembly_accession', drop=True)
cwb_domain_counts.head()

Unnamed: 0_level_0,PG_binding_1,PG_binding_2,PG_binding_3,AMIN,SPOR,SH3_1,SH3_2,SH3_3,SH3_4,SH3_5,...,Choline_bind_1,Choline_bind_2,Choline_bind_3,CW_binding_2,CW_7,PSA_CBD,ZoocinA_TRD,GW,OapA,WxL
assembly_accession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
GCA_000013525.1,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
GCA_000014305.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000014325.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000188715.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000211095.1,0,0,0,0,0,0,0,0,0,1,...,12,7,8,0,0,0,0,0,0,0


In [23]:
cwb_domain_bins = cwb_domain_counts.copy()

for cwb in cwb_domains:
    cwb_domain_bins[cwb] = (cwb_domain_bins[cwb] > 0).astype(int)

cwb_domain_bins.head()

Unnamed: 0_level_0,PG_binding_1,PG_binding_2,PG_binding_3,AMIN,SPOR,SH3_1,SH3_2,SH3_3,SH3_4,SH3_5,...,Choline_bind_1,Choline_bind_2,Choline_bind_3,CW_binding_2,CW_7,PSA_CBD,ZoocinA_TRD,GW,OapA,WxL
assembly_accession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
GCA_000013525.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000014305.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000014325.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000188715.1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
GCA_000211095.1,0,0,0,0,0,0,0,0,0,1,...,1,1,1,0,0,0,0,0,0,0


In [24]:
n_no_lysm = len(cwb_domain_bins[cwb_domain_bins['LysM'] == 0])
total = len(cwb_domain_bins)
p = 100 * n_no_lysm / total

print(f'Number of Streptococci strains without LysM: {n_no_lysm:,} ({p:.1f}%)')

Number of Streptococci strains without LysM: 52 (4.2%)


In [25]:
cwb_domain_bins_with_species = cwb_domain_bins.copy()
cwb_domain_bins_with_species['gtdb_species'] = [strep_cg.loc[a, 'gtdb_species'] for a in cwb_domain_bins_with_species.index]

In [26]:
cwb_domain_bins_with_species[['gtdb_species'] + cwb_domains].to_csv(strep_folder / 'Streptococcus_cell_wall_binding.csv')

In [27]:
lysM_per_species = cwb_domain_bins_with_species.reset_index()[
    ['assembly_accession', 'gtdb_species', 'LysM']
].groupby('gtdb_species').agg(
    {'LysM': 'sum', 'assembly_accession': 'nunique'}
).rename(columns={
    'LysM': 'n_LysM',
    'assembly_accession': 'total',
})

lysM_per_species['percentage'] = (100 * lysM_per_species['n_LysM'] / lysM_per_species['total']).round(2)
lysM_per_species.sort_values(['percentage', 'total'], ascending=[True, False]).head(15)

Unnamed: 0_level_0,n_LysM,total,percentage
gtdb_species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Streptococcus parauberis,0,11,0.0
Streptococcus porcinus,0,3,0.0
Streptococcus halichoeri,0,1,0.0
Streptococcus porcinus_A,0,1,0.0
Streptococcus equi,4,38,10.53
Streptococcus pseudoporcinus,1,3,33.33
Streptococcus pyogenes,271,271,100.0
Streptococcus agalactiae,197,197,100.0
Streptococcus pneumoniae,158,158,100.0
Streptococcus suis,109,109,100.0


In [28]:
s_equi_with_lysM = cwb_domain_bins_with_species[
    (cwb_domain_bins_with_species['gtdb_species'] == 'Streptococcus equi') &
    (cwb_domain_bins_with_species['LysM'] > 0)
].index.tolist()

strep_cg.loc[s_equi_with_lysM][['ncbi_organism_name']]

Unnamed: 0_level_0,ncbi_organism_name
assembly_accession,Unnamed: 1_level_1
GCF_000219765.1,Streptococcus equi subsp. zooepidemicus ATCC 3...
GCF_009676645.1,Streptococcus equi subsp. zooepidemicus
GCF_009676685.2,Streptococcus equi subsp. zooepidemicus
GCF_900636805.1,Streptococcus equi subsp. zooepidemicus


In [29]:
s_equi_with_zooA = cwb_domain_bins_with_species[
    (cwb_domain_bins_with_species['gtdb_species'] == 'Streptococcus equi') &
    (cwb_domain_bins_with_species['ZoocinA_TRD'] > 0)
].index.tolist()

strep_cg.loc[s_equi_with_zooA][['ncbi_organism_name']]

Unnamed: 0_level_0,ncbi_organism_name
assembly_accession,Unnamed: 1_level_1
GCF_015689395.1,Streptococcus equi subsp. zooepidemicus
GCF_015767555.1,Streptococcus equi subsp. zooepidemicus
GCF_900636805.1,Streptococcus equi subsp. zooepidemicus


### How often are LysM-containing proteins associated with peptidoglycan catalytic domains?

In [30]:
catalytic_domains = [
    'Amidase',
    'Amidase_2',
    'Amidase_3',
    'Amidase_5',
    'Amidase_6',
    'Pepdidase_M14_N',
    'Peptidase_C107',
    'Peptidase_C21',
    'Peptidase_C23',
    'Peptidase_C24',
    'Peptidase_C27',
    'Peptidase_C30',
    'Peptidase_C34',
    'Peptidase_C36',
    'Peptidase_C42',
    'Peptidase_C53',
    'Peptidase_C92',
    'Peptidase_M14',
    'Peptidase_M15',
    'Peptidase_M15_2',
    'Peptidase_M15_3',
    'Peptidase_M15_4',
    'Peptidase_M2',
    'Peptidase_M23',
    'Peptidase_M26_C',
    'Peptidase_M26_N',
    'Peptidase_M32',
    'Peptidase_M73',
    'Peptidase_M74',
    'Peptidase_M99',
    'Peptidase_M99_C',
    'Peptidase_M99_m',
    'Peptidase_S10',
    'Peptidase_S11',
    'Peptidase_S13',
    'Peptidase_S28',
    'Peptidase_S32',
    'Peptidase_S37',
    'Peptidase_S66',
    'Peptidase_S66C',
    'Propep_M14',
    'Prophage_tail',
    'Prophage_tailD1',
    'VanY',
    'Glucosaminidase',
    'CHAP',
    'Aspzincin_M35',
    'Glyco_hydro_25',
    'Glyco_hydro_108',
    'Lysozyme_like',
    'Phage_lysozyme',
    'Phage_lysozyme2',
    'SLT',
    'SLT_2',
    'NLPC_P60',
]

In [31]:
ids = cwb_domain_bins_with_species[cwb_domain_bins_with_species['LysM'] > 0].index
df = pfam_df.loc[ids]
protein_ids = df[df['hmm_query'] == 'LysM']['id'].unique()
accessions = set(df[
    df['id'].isin(protein_ids) &
    df['hmm_query'].isin(catalytic_domains)
].index)

cwb_domain_bins_with_species['has_LysM_pgh'] = [a in accessions for a in cwb_domain_bins_with_species.index]
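
The co-occurrence test in this cell can be checked on a toy table. The sketch below wraps the same logic in a hypothetical function, `genomes_with_fused_domains`, and separates two cases: a binding domain and a catalytic domain on the same protein, versus on two different proteins of the same genome. The toy identifiers are made up.

```python
import pandas as pd


def genomes_with_fused_domains(pfam_hits, binding_domain, catalytic_domains):
    """Return genomes where one protein carries both the binding and a catalytic domain."""
    binding_proteins = set(
        pfam_hits.loc[pfam_hits['hmm_query'] == binding_domain, 'id']
    )
    fused = pfam_hits[
        pfam_hits['id'].isin(binding_proteins)
        & pfam_hits['hmm_query'].isin(catalytic_domains)
    ]
    return set(fused['assembly_accession'])


toy_hits = pd.DataFrame({
    'assembly_accession': ['A', 'A', 'B', 'B'],
    # Genome A: LysM and CHAP on the same protein.
    # Genome B: LysM and CHAP on two different proteins.
    'id': ['p1@A', 'p1@A', 'p2@B', 'p3@B'],
    'hmm_query': ['LysM', 'CHAP', 'LysM', 'CHAP'],
})
fused_genomes = genomes_with_fused_domains(toy_hits, 'LysM', ['CHAP', 'Amidase_3'])
print(fused_genomes)
```

Only genome A is reported, because the test is per protein (`id`), not per genome.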


In [32]:
lysM_pgh_per_species = cwb_domain_bins_with_species.reset_index()[
    ['assembly_accession', 'gtdb_species', 'has_LysM_pgh']
].groupby('gtdb_species').agg(
    {'has_LysM_pgh': 'sum', 'assembly_accession': 'nunique'}
).rename(columns={
    'assembly_accession': 'total',
})

lysM_pgh_per_species['percentage'] = (100 * lysM_pgh_per_species['has_LysM_pgh'] / lysM_pgh_per_species['total']).round(2)

In [33]:
lysM_per_species.sort_values(['percentage', 'total'], ascending=[True, False]).head(15)

Unnamed: 0_level_0,n_LysM,total,percentage
gtdb_species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Streptococcus parauberis,0,11,0.0
Streptococcus porcinus,0,3,0.0
Streptococcus halichoeri,0,1,0.0
Streptococcus porcinus_A,0,1,0.0
Streptococcus equi,4,38,10.53
Streptococcus pseudoporcinus,1,3,33.33
Streptococcus pyogenes,271,271,100.0
Streptococcus agalactiae,197,197,100.0
Streptococcus pneumoniae,158,158,100.0
Streptococcus suis,109,109,100.0


In [34]:
pyogenes_ids = strep_cg[strep_cg['gtdb_species'] == 'Streptococcus pyogenes'].index
df = pfam_df.loc[pyogenes_ids]
protein_ids = df[df['hmm_query'] == 'LysM']['id'].unique()
pfam_df[pfam_df['id'].isin(protein_ids) & (pfam_df['hmm_query'] != 'LysM')]['hmm_query'].unique()

array([], dtype=object)

In _Streptococcus pyogenes_, every LysM-containing protein carries LysM alone: no other Pfam domain is predicted on these proteins.

### Has LysM been replaced with another CWB domain?

In [35]:
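# Split species by whether all of their complete genomes encode LysM.
# The first group holds the six species with percentage < 100 in the table above.
sorted_species = lysM_per_species.sort_values(['percentage', 'total'], ascending=[True, False])
no_LysM_species = sorted_species[sorted_species['percentage'] < 100].index.tolist()
yes_LysM_species = sorted_species[sorted_species['percentage'] == 100].index.tolist()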

In [36]:
no_series = cwb_domain_bins_with_species[
    cwb_domain_bins_with_species['gtdb_species'].isin(no_LysM_species)
][cwb_domains].sum() / len(no_LysM_species)

yes_series = cwb_domain_bins_with_species[
    cwb_domain_bins_with_species['gtdb_species'].isin(yes_LysM_species)
][cwb_domains].sum() / len(yes_LysM_species)

cwb_stats = pd.DataFrame({
    'yes_LysM': yes_series,
    'no_LysM': no_series
})
cwb_stats['diff'] = (cwb_stats['yes_LysM'] - cwb_stats['no_LysM']).abs()

cwb_stats = cwb_stats.sort_values('diff', ascending=False)
cwb_stats[cwb_stats['diff'] > 0]

Unnamed: 0,yes_LysM,no_LysM,diff
LysM,11.259615,0.833333,10.426282
GW,0.144231,4.333333,4.189103
Choline_bind_3,3.307692,0.0,3.307692
Choline_bind_1,3.278846,0.0,3.278846
Choline_bind_2,2.182692,0.0,2.182692
ZoocinA_TRD,2.096154,0.5,1.596154
SH3_5,10.432692,9.5,0.932692
CW_7,1.432692,1.0,0.432692
SH3_3,0.201923,0.333333,0.13141


In [37]:
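def get_companion_domains(strep_cg, hmm_query):
    """Print how often `hmm_query` occurs alone and its top co-occurring Pfam domains.

    Hits are read from the global `pfam_df`, restricted to the genomes indexed in `strep_cg`.
    """
    strep_ids = strep_cg.index
    df = pfam_df.loc[strep_ids]
    protein_ids = df[df['hmm_query'] == hmm_query]['id'].unique()
    if len(protein_ids) == 0:
        # Avoid a division by zero when the domain is absent
        print(f'No protein with {hmm_query}')
        return

    # Other Pfam domains predicted on the same proteins
    companion_domains = pfam_df[
        pfam_df['id'].isin(protein_ids) &
        (pfam_df['hmm_query'] != hmm_query)
    ]

    # Proteins where `hmm_query` is the only predicted domain
    n_standalone = len(set(protein_ids) - set(companion_domains['id'].unique()))
    percent = 100 * n_standalone / len(protein_ids)

    print(f'Number of standalone {hmm_query}: {n_standalone:,} ({percent:.0f}%)')
    print('Top 5 companion domains:')
    print(companion_domains['hmm_query'].value_counts()[:5])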

In [38]:
for hmm_query in cwb_stats[cwb_stats['diff'] > 0].index:
    get_companion_domains(strep_cg, hmm_query)
    print()

Number of standalone LysM: 2,172 (82%)
Top 5 companion domains:
hmm_query
CHAP              365
Amidase_3         116
SLT                82
GA                 21
Glyco_hydro_25     15
Name: count, dtype: int64

Number of standalone GW: 0 (0%)
Top 5 companion domains:
hmm_query
Beta-lactamase2    41
DUF5776            13
Peptidase_S11       9
Name: count, dtype: int64

Number of standalone Choline_bind_3: 38 (1%)
Top 5 companion domains:
hmm_query
Choline_bind_1    14864
Choline_bind_4      941
Choline_bind_2      787
KxYKxGKxW_sig       499
SH3_5               333
Name: count, dtype: int64

Number of standalone Choline_bind_1: 120 (4%)
Top 5 companion domains:
hmm_query
Choline_bind_3    8911
Choline_bind_2    1651
Choline_bind_4     941
KxYKxGKxW_sig      439
SH3_5              333
Name: count, dtype: int64

Number of standalone Choline_bind_2: 145 (11%)
Top 5 companion domains:
hmm_query
Choline_bind_1     8205
Choline_bind_3     2932
Choline_bind_4      933
Glucosaminidase     198
L