# SiteAlign features

This notebook transcribes the SiteAlign features from the SiteAlign [paper](https://onlinelibrary.wiley.com/doi/full/10.1002/prot.21858) and its [SI table](https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fprot.21858&file=prot21858-SupplementaryTable.pdf) and uses them to verify `kissim`'s implementation of the SiteAlign definitions, shown here:

In [1]:
from kissim.definitions import SITEALIGN_FEATURES

In [2]:
SITEALIGN_FEATURES

Unnamed: 0,size,hbd,hba,charge,aromatic,aliphatic
ALA,1.0,0.0,0.0,0.0,0.0,1.0
ARG,3.0,3.0,0.0,1.0,0.0,0.0
ASN,2.0,1.0,1.0,0.0,0.0,0.0
ASP,2.0,0.0,2.0,-1.0,0.0,0.0
CYS,1.0,1.0,0.0,0.0,0.0,1.0
GLN,2.0,1.0,1.0,0.0,0.0,0.0
GLU,2.0,0.0,2.0,-1.0,0.0,0.0
GLY,1.0,0.0,0.0,0.0,0.0,0.0
HIS,2.0,1.0,1.0,0.0,1.0,0.0
ILE,2.0,0.0,0.0,0.0,0.0,1.0


## Size

SiteAlign's size definitions:

> Natural amino acids have been classified into three groups according to the number of heavy atoms (<4 heavy atoms: Ala, Cys, Gly, Pro, Ser, Thr, Val; 4–6 heavy atoms: Asn, Asp, Gln, Glu, His, Ile, Leu, Lys, Met; >6 heavy atoms: Arg, Phe, Trp, Tyr) and three values (“1,” “2,” “3”) are outputted according to the group to which the current residues belong to (Table I)

https://onlinelibrary.wiley.com/doi/full/10.1002/prot.21858
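
The thresholds in the quote can be checked against heavy-atom counts. The following is a minimal sketch under one assumption: that "number of heavy atoms" means side-chain heavy atoms, which the quote does not state explicitly. `SIDE_CHAIN_HEAVY_ATOMS` and `size_class` are hypothetical names introduced for illustration and are not part of `kissim`.

```python
# Hypothetical helper data, not part of kissim: side-chain heavy-atom counts
# (assumed interpretation of "number of heavy atoms" in the quote).
SIDE_CHAIN_HEAVY_ATOMS = {
    "Gly": 0, "Ala": 1, "Cys": 2, "Ser": 2, "Pro": 3, "Thr": 3, "Val": 3,
    "Asn": 4, "Asp": 4, "Ile": 4, "Leu": 4, "Met": 4,
    "Gln": 5, "Glu": 5, "Lys": 5, "His": 6,
    "Arg": 7, "Phe": 7, "Tyr": 8, "Trp": 10,
}


def size_class(n_heavy_atoms):
    """Map a heavy-atom count to the size classes 1.0, 2.0, 3.0 from the quote."""
    if n_heavy_atoms < 4:
        return 1.0
    if n_heavy_atoms <= 6:
        return 2.0
    return 3.0


# Group amino acids by the size class derived from their heavy-atom count
size_from_counts = {}
for amino_acid, n in SIDE_CHAIN_HEAVY_ATOMS.items():
    size_from_counts.setdefault(size_class(n), []).append(amino_acid)
size_from_counts = {k: sorted(v) for k, v in sorted(size_from_counts.items())}
print(size_from_counts)
```

Under this assumption, the three groups produced by the thresholds coincide with the three groups listed in the quote.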

### Parse text from SiteAlign paper

In [3]:
size = {
    1.0: "Ala, Cys, Gly, Pro, Ser, Thr, Val".split(", "),
    2.0: "Asn, Asp, Gln, Glu, His, Ile, Leu, Lys, Met".split(", "),
    3.0: "Arg, Phe, Trp, Tyr".split(", ")
}

### `kissim` definitions correct?

In [4]:
import pandas as pd

# Format SiteAlign data as a Series mapping amino acid (upper case) to size class
size_list = []
for value, amino_acids in size.items():
    size_list.extend((amino_acid.upper(), value) for amino_acid in amino_acids)
size_series = (
    pd.DataFrame(size_list, columns=["amino_acid", "size"])
    .sort_values("amino_acid")
    .set_index("amino_acid")
    .squeeze()
)

# KiSSim implementation of SiteAlign features correct?
# `diff` is True wherever kissim agrees with the paper
diff = (size_series == SITEALIGN_FEATURES["size"])
if not diff.all():
    # display() returns None, so embed the comparison as plain text instead
    raise ValueError(f"KiSSim implementation of SiteAlign features is incorrect!!!\n{diff.to_string()}")
else:
    print("KiSSim implementation of SiteAlign features is correct :)")

KiSSim implementation of SiteAlign features is correct :)
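
If such a check ever failed, a list of the differing entries would be easier to read than a Boolean mask. The sketch below shows one way to get it with `pandas.Series.compare`. The two Series hold toy values made up for illustration; they are not taken from `kissim` or the paper.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from kissim or the paper)
reference = pd.Series({"ALA": 1.0, "ARG": 3.0, "ASN": 2.0}, name="size")
implementation = pd.Series({"ALA": 1.0, "ARG": 2.0, "ASN": 2.0}, name="size")

# Series.compare keeps only the entries that differ between the two Series
# (both Series must carry identical index labels)
mismatches = reference.compare(implementation)
mismatches.columns = ["reference", "implementation"]
print(mismatches)
```

An empty result means that both Series agree on every amino acid.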


## HBA, HBD, charge, aromatic, aliphatic

### Parse table from SiteAlign SI

In [5]:
sitealign_table = """
Ala 0 0 0 1 0
Arg 3 0 +1 0 0
Asn 1 1 0 0 0
Asp 0 2 -1 0 0
Cys 1 0 0 1 0
Gly 0 0 0 0 0
Gln 1 1 0 0 0
Glu 0 2 -1 0 0
His/Hid/Hie 1 1 0 0 1
Hip 2 0 1 0 0
Ile 0 0 0 1 0
Leu 0 0 0 1 0
Lys 1 0 +1 0 0
Met 0 0 0 1 0
Phe 0 0 0 0 1
Pro 0 0 0 1 0
Ser 1 1 0 0 0
Thr 1 1 0 1 0
Trp 1 0 0 0 1
Tyr 1 1 0 0 1
Val 0 0 0 1 0
"""
# Split the table into rows and whitespace-separated fields
sitealign_table = [i.split() for i in sitealign_table.split("\n")[1:-1]]
sitealign_dict = {i[0]: i[1:] for i in sitealign_table}
sitealign_df = pd.DataFrame.from_dict(sitealign_dict).transpose()
# Name the columns in the order transcribed above, then reorder them to match kissim
sitealign_df.columns = ["hbd", "hba", "charge", "aliphatic", "aromatic"]
sitealign_df = sitealign_df[["hbd", "hba", "charge", "aromatic", "aliphatic"]]
# Keep the His/Hid/Hie row as His and drop the separate Hip row
sitealign_df = sitealign_df.rename(index={"His/Hid/Hie": "His"})
sitealign_df = sitealign_df.drop("Hip", axis=0)
sitealign_df = sitealign_df.astype("float")
sitealign_df.index = [i.upper() for i in sitealign_df.index]
sitealign_df = sitealign_df.sort_index()
sitealign_df

Unnamed: 0,hbd,hba,charge,aromatic,aliphatic
ALA,0.0,0.0,0.0,0.0,1.0
ARG,3.0,0.0,1.0,0.0,0.0
ASN,1.0,1.0,0.0,0.0,0.0
ASP,0.0,2.0,-1.0,0.0,0.0
CYS,1.0,0.0,0.0,0.0,1.0
GLN,1.0,1.0,0.0,0.0,0.0
GLU,0.0,2.0,-1.0,0.0,0.0
GLY,0.0,0.0,0.0,0.0,0.0
HIS,1.0,1.0,0.0,1.0,0.0
ILE,0.0,0.0,0.0,0.0,1.0
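
As an alternative to splitting the text by hand, a whitespace-separated table like this one can also be read with `pandas.read_csv`. The sketch below uses only three rows copied from the table above; `table_text`, `columns`, and `df` are illustrative names, and the column order is the one assumed in the parsing cell.

```python
import io

import pandas as pd

# Three rows copied from the table above, for illustration only
table_text = """
Ala 0 0 0 1 0
Arg 3 0 +1 0 0
Asp 0 2 -1 0 0
"""

# Column order as assumed in the parsing cell above
columns = ["hbd", "hba", "charge", "aliphatic", "aromatic"]

# Read whitespace-separated fields; blank lines are skipped by default
df = pd.read_csv(
    io.StringIO(table_text),
    sep=r"\s+",
    header=None,
    names=["amino_acid"] + columns,
    index_col="amino_acid",
).astype(float)
df.index = df.index.str.upper()
print(df)
```

Special rows such as `His/Hid/Hie` and `Hip` would still need the same renaming and dropping steps as in the cell above.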


### `kissim` definitions correct?

In [6]:
# `diff` is True wherever kissim agrees with the SI table
diff = (sitealign_df == SITEALIGN_FEATURES.drop("size", axis=1).sort_index())
if not diff.all().all():
    # display() returns None, so embed the comparison as plain text instead
    raise ValueError(f"KiSSim implementation of SiteAlign features is incorrect!!!\n{diff.to_string()}")
else:
    print("KiSSim implementation of SiteAlign features is correct :)")

KiSSim implementation of SiteAlign features is correct :)


## Table style

In [7]:
from Bio.Data.IUPACData import protein_letters_3to1

for feature_name in SITEALIGN_FEATURES.columns:
    print(feature_name)
    for name, group in SITEALIGN_FEATURES.groupby(feature_name):
        amino_acids = {protein_letters_3to1[i.capitalize()] for i in group.index}
        amino_acids = sorted(amino_acids)
        print(f"{name:<7}{' '.join(amino_acids)}")
    print()

size
1.0    A C G P S T V
2.0    D E H I K L M N Q
3.0    F R W Y

hbd
0.0    A D E F G I L M P V
1.0    C H K N Q S T W Y
3.0    R

hba
0.0    A C F G I K L M P R V W
1.0    H N Q S T Y
2.0    D E

charge
-1.0   D E
0.0    A C F G H I L M N P Q S T V W Y
1.0    K R

aromatic
0.0    A C D E G I K L M N P Q R S T V
1.0    F H W Y

aliphatic
0.0    D E F G H K N Q R S W Y
1.0    A C I L M P T V

