# Advanced Structure Similarity Analysis

This advanced example illustrates how to use Melodia to compare structural similarities between members of a protein family. Curvature and torsion are very sensitive to small changes in the backbone geometry and are also rotationally invariant. This makes them robust descriptors for finding the building blocks of a protein family.
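
The invariance claim can be checked on a toy curve. The sketch below is not Melodia's implementation: it assumes a simple three-point (Menger) curvature estimate and a helix with made-up parameters, and only shows that a rigid rotation plus translation leaves the values unchanged.

```python
import numpy as np


def menger_curvature(p0, p1, p2):
    """Curvature of the circle through three points (1 / circumradius)."""
    a = np.linalg.norm(p1 - p0)
    b = np.linalg.norm(p2 - p1)
    c = np.linalg.norm(p2 - p0)
    area = 0.5 * np.linalg.norm(np.cross(p1 - p0, p2 - p0))
    return 4.0 * area / (a * b * c)


def curvature_profile(points):
    """Three-point curvature for every consecutive triplet of points."""
    return [menger_curvature(*points[i:i + 3]) for i in range(len(points) - 2)]


# Toy "backbone": points sampled along a helix (illustrative values only)
t = np.linspace(0.0, 4.0 * np.pi, 20)
points = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])

# Rigid motion: rotate 40 degrees about the z axis, then translate
angle = np.radians(40.0)
rotation = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
moved = points @ rotation.T + np.array([5.0, -2.0, 1.0])

original = curvature_profile(points)
transformed = curvature_profile(moved)

# The two profiles agree up to floating-point error
invariant = np.allclose(original, transformed)
print(invariant)
```

Because the estimate depends only on distances between points, no prior superposition of the structures is needed before comparing profiles.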

In [1]:
import dill
import warnings

import pandas as pd
import melodia as mel
import seaborn as sns
import matplotlib.pyplot as plt

from os import path
from math import sqrt
from sklearn.preprocessing import StandardScaler
from Bio.PDB.PDBExceptions import PDBConstructionWarning

warnings.filterwarnings("ignore", category=PDBConstructionWarning)

The first step is to read and process an alignment file. See **Notebook 2** for more information about this process.

In [2]:
# Cache the parsed alignment with dill so the PIR file is only parsed once
if path.exists('model.dill'):
    with open('model.dill', 'rb') as file:
        align = dill.load(file)
else:
    align = mel.parser_pir_file('model.ali')
    with open('model.dill', 'wb') as file:
        dill.dump(align, file)

In [3]:
# Create a Pandas DataFrame for only the Curvature and Torsion descriptors
df = mel.dataframe_from_alignment(align=align, keys=['curvature', 'torsion'])
df.head()

In [4]:
# Get the ids for all the sequences with a structure file
ids = [record.description.split(':')[1] for record in align if 'structure' in record.description]
ids
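
The list comprehension above relies on the layout of the PIR description line, in which colon-separated fields start with the entry type followed by the structure code. A minimal sketch on plain strings, assuming the record descriptions follow that layout; the two description strings and their codes are hypothetical:

```python
# Hypothetical PIR description lines (type:code:start:chain:end:chain:...)
descriptions = [
    'structureX:1abc:1:A:120:A:::-1.00:-1.00',
    'sequence:target:::::::0.00:0.00',
    'structureX:2xyz:5:B:130:B:::-1.00:-1.00',
]

# Keep only entries backed by a structure file and take the second field
example_ids = [d.split(':')[1] for d in descriptions if 'structure' in d]
print(example_ids)
```

The `sequence` entry is skipped because it has no associated structure, so no curvature or torsion columns exist for it.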

In [5]:
# Create a copy so the original DataFrame stays unscaled
dfa = df.copy()

# Define the scikit-learn autoscaler
autoscaler = StandardScaler()

# Create a list of the features to scale
# (the loop variable is named pid to avoid shadowing the built-in id)
features = []
for pid in ids:
    features.append(f'curvature_{pid}')
    features.append(f'torsion_{pid}')

# Apply the autoscaler
dfa[features] = autoscaler.fit_transform(dfa[features])
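
To make the scaling step concrete, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` on a toy DataFrame. The column names and values are invented for illustration; the only point is that each column ends up with zero mean and unit variance, using the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for two descriptor columns (values are made up)
toy = pd.DataFrame({
    'curvature_demo': [0.1, 0.4, 0.3, 0.8],
    'torsion_demo': [-0.2, 0.0, 0.5, 0.1],
})

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Scale by hand: subtract the column mean, divide by the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

same = np.allclose(scaled, manual.to_numpy())
print(same)
```

Scaling each column separately puts curvature and torsion on a comparable footing before they are combined into a single distance.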

In [6]:
# Plot the scaled data for the 1st protein
cmap = sns.color_palette("Blues", as_cmap=True)
sns.jointplot(x=features[0], y=features[1], data=dfa, kind='kde', cmap=cmap, height=10, fill=True);

In [7]:
# Copy the ids and pop the last protein to use as the base
tags = ids.copy()
base = tags.pop()
base

In this case, we will compare the *base protein* with **all** the other *aligned proteins*. Our idea is to find regions of high similarity with the base protein in all the other proteins.

In [8]:
# Define the Euclidean distance between the base protein and the cmp protein
# (cmp is passed explicitly instead of being read from the enclosing loop)
def dist(row, cmp):
    d = (row[f'curvature_{base}'] - row[f'curvature_{cmp}'])**2 + (row[f'torsion_{base}'] - row[f'torsion_{cmp}'])**2
    return sqrt(d)

# Create new columns with the Euclidean distance
columns = []
for cmp in tags:
    dfa[f'{base}_{cmp}'] = dfa.apply(dist, axis=1, cmp=cmp)
    columns.append(f'{base}_{cmp}')
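
For larger alignments, the same per-residue distance can be computed without a row-wise `apply`. This is a minimal sketch on a toy DataFrame, assuming the same `curvature_<id>` / `torsion_<id>` column naming; the ids `aaaa` and `bbbb` and all values are hypothetical.

```python
from math import sqrt

import numpy as np
import pandas as pd

# Toy scaled descriptors for two hypothetical proteins
toy = pd.DataFrame({
    'curvature_aaaa': [0.0, 1.0, -0.5],
    'torsion_aaaa': [0.0, 0.5, 0.5],
    'curvature_bbbb': [3.0, 1.0, 0.5],
    'torsion_bbbb': [4.0, 0.5, 0.5],
})
base_id, cmp_id = 'aaaa', 'bbbb'

# Vectorised Euclidean distance in the (curvature, torsion) plane
toy[f'{base_id}_{cmp_id}'] = np.hypot(
    toy[f'curvature_{base_id}'] - toy[f'curvature_{cmp_id}'],
    toy[f'torsion_{base_id}'] - toy[f'torsion_{cmp_id}'],
)

# Row-wise reference, mirroring the apply-based version
reference = [
    sqrt((r[f'curvature_{base_id}'] - r[f'curvature_{cmp_id}'])**2
         + (r[f'torsion_{base_id}'] - r[f'torsion_{cmp_id}'])**2)
    for _, r in toy.iterrows()
]
distances = toy[f'{base_id}_{cmp_id}'].tolist()
print(distances)
```

Both forms give the same values; the vectorised one simply operates on whole columns at once.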

In [9]:
dfa[[i for i in dfa.columns if 'seq' in i] + columns]

In [10]:
# Select the 1st protein for comparison
column = columns[0]
df_clust = dfa[column].copy().reset_index()
df_clust.head()

***
This simple algorithm selects regions where the distance to the base protein stays at or below the specified threshold, that is, regions of high similarity. Only regions that are at least three residues long are kept.
***

In [11]:
import more_itertools as mit

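threshold = 0.5

# 1 = distance within the threshold (similar), 0 = otherwise
# (a NaN distance, if present, compares False and is labelled 0)
clabels = [1 if x <= threshold else 0 for x in list(df_clust[column])]

# Positions of the residues flagged as similar
similar = [i for i, label in enumerate(clabels) if label > 0]

# Group the positions into runs of consecutive residues
regions = [list(group) for group in mit.consecutive_groups(similar)]

# Label each run of at least three residues with a positive integer;
# shorter runs are reset to 0. Counting starts at 1 so the first run keeps its label.
for i, block in enumerate(regions, start=1):
    if len(block) >= 3:
        value = i
    else:
        value = 0
    for j in block:
        clabels[j] = value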
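
The region selection can also be written as a small standalone function, which makes it easy to test on a known input. This is a sketch using only the standard library; the function name `label_similar_regions` and the toy distances are hypothetical, and unlike the cell above it numbers the kept regions consecutively.

```python
from itertools import groupby


def label_similar_regions(distances, threshold=0.5, min_length=3):
    """Label runs of consecutive values <= threshold that are long enough.

    Returns a list with 0 for unlabelled positions and 1, 2, ... for each
    kept region, in order of appearance.
    """
    labels = [0] * len(distances)
    region_id = 0
    position = 0
    for is_similar, run in groupby(distances, key=lambda d: d <= threshold):
        length = len(list(run))
        if is_similar and length >= min_length:
            region_id += 1
            labels[position:position + length] = [region_id] * length
        position += length
    return labels


# Toy distances: one run of 3, one run of 2 (too short), one run of 4
toy_distances = [0.9, 0.2, 0.1, 0.3, 0.8, 0.4, 0.2, 0.7, 0.1, 0.2, 0.3, 0.4]
toy_labels = label_similar_regions(toy_distances)
print(toy_labels)  # [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
```

The two-residue run at positions 5 and 6 falls under the threshold but is discarded because it is shorter than `min_length`.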

In [13]:
# Display the similar regions between the two proteins
n_clusters = max(clabels)+1

color_palette = sns.color_palette('Dark2', n_clusters)
cluster_colors = [color_palette[x] if x > 0 else (1.0, 1.0, 1.0) for x in clabels]

ax = df_clust.plot.scatter(x='index', y=column, s=50, linewidth=0, c=cluster_colors, alpha=1.0, figsize=(6.4*3.5, 4.8*1.5));
df_clust.plot(x='index', y=column, figsize=(6.4*3.5, 4.8*1.5), ax=ax)
plt.title(f'{n_clusters}')
plt.grid()