# Getting Started with Alignment Files in Melodia

In [15]:
import dill
import warnings

import pandas as pd
import melodia as mel
import seaborn as sns

from os import path
from Bio.PDB.PDBExceptions import PDBConstructionWarning

warnings.filterwarnings("ignore", category=PDBConstructionWarning)

## Parsing an alignment in the PIR file format

Melodia can read PIR alignment files and parse the geometric descriptors from the PDB files in the same directory.

***
![PIR record](model_pir.png)
***

In this example, the **structureX** and the **>P1;1cdoa** records inform Melodia that this sequence is related to a protein structure in a file called **1cdoa.pdb**.
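
To make the naming convention concrete, here is a minimal sketch of how a PIR record could be mapped to a PDB file name. It is not Melodia's parser: the helper `pdb_name_from_pir` is hypothetical, the field values in the sample description line are made up, and the sketch assumes that any entry type starting with `structure` points to a file named after the code that follows `>P1;`.

```python
def pdb_name_from_pir(header, description):
    """Return the PDB file name implied by a PIR record, or None.

    Assumption: the file name is the code after '>P1;' plus '.pdb',
    and only entry types starting with 'structure' refer to a file.
    """
    prefix = ">P1;"
    if not header.startswith(prefix):
        raise ValueError(f"not a PIR header line: {header!r}")
    code = header[len(prefix):].strip()

    # The first colon-separated field of the description line is the entry type.
    entry_type = description.split(":", 1)[0].strip().lower()
    if not entry_type.startswith("structure"):
        return None
    return f"{code}.pdb"


# Illustrative description line; the residue ranges here are invented.
print(pdb_name_from_pir(">P1;1cdoa", "structureX:1cdoa:1:A:374:A::::"))
```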

In [4]:
# Dill can be used for storage

# Load the model if it already exists
if path.exists('model.dill'):
    with open('model.dill', 'rb') as file:
        align = dill.load(file)
else:
    # Parse and save a new alignment
    align = mel.parser_pir_file('model.ali')
    with open('model.dill', 'wb') as file:
        dill.dump(align, file)
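
The load-or-parse pattern above can be factored into a small reusable helper. The sketch below is an illustration only: `load_or_build` is a hypothetical name, and it uses the standard-library `pickle` module as a stand-in for `dill`, which offers the same `load` and `dump` calls.

```python
import pickle
from pathlib import Path


def load_or_build(cache_file, build):
    """Return the object cached in cache_file, or build it and cache it."""
    cache_path = Path(cache_file)
    if cache_path.exists():
        # Reuse the stored object instead of rebuilding it
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    # No cache yet: build the object and store it for the next run
    result = build()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

With this helper, the cell above would reduce to `align = load_or_build('model.dill', lambda: mel.parser_pir_file('model.ali'))`, with `dill` swapped in for `pickle` if the alignment object requires it. Note that the cache is never invalidated: if `model.ali` changes, the stale file must be deleted by hand.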

The result is a BioPython alignment:

https://biopython.org/docs/1.74/api/Bio.Align.html

In [5]:
align

In [7]:
# It is easy to iterate over the alignment records
for record in align:
    print(record)
    break

## Accessing Geometric Attributes

All geometric attributes can be accessed through the `letter_annotations` functionality:

https://biopython.org/docs/1.75/api/Bio.SeqRecord.html

In [11]:
# Select the third sequence in the alignment
record = align[2]

# Print some of the record's data
print(record.id)
print(record.seq)
print(record.letter_annotations.keys())
print()

Each geometric annotation is stored per residue, so it can be read alongside the sequence.

In [13]:
# Print the curvature and torsion for a few residues
for i, residue in enumerate(record.seq):
    print(f"{i} - {residue} - {record.letter_annotations['curvature'][i]:7.4f} - {record.letter_annotations['torsion'][i]:7.4f}")
    if i > 4:
        break
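
The same per-residue access can be wrapped in a generator so that the choice of annotations and the number of residues become parameters. This is a sketch under stated assumptions: `annotation_rows` is a hypothetical helper, and the plain string and dictionary below are made-up stand-ins for `record.seq` and `record.letter_annotations`.

```python
def annotation_rows(sequence, annotations, keys=("curvature", "torsion"), limit=None):
    """Yield (index, residue, value, ...) tuples for the chosen annotation keys."""
    for i, residue in enumerate(sequence):
        if limit is not None and i >= limit:
            break
        yield (i, residue) + tuple(annotations[key][i] for key in keys)


# Made-up values standing in for record.seq and record.letter_annotations
sequence = "STAG"
annotations = {
    "curvature": [0.52, 0.61, 0.48, 0.55],
    "torsion": [0.10, -0.12, 0.08, 0.15],
}

for i, residue, curvature, torsion in annotation_rows(sequence, annotations, limit=3):
    print(f"{i} - {residue} - {curvature:7.4f} - {torsion:7.4f}")
```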

## Alignments and Pandas DataFrame

### Converting a BioPython alignment to a Pandas DataFrame

In [7]:
mel.dataframe_from_alignment(align=align)

***
The `keys` argument selects which geometric annotations are included in the DataFrame.
***

In [8]:
df = mel.dataframe_from_alignment(align=align, keys=['curvature', 'torsion'])

In [9]:
df.head()

### DataFrame Storage

***
A Pandas DataFrame can be stored using the Parquet file format.
***

In [10]:
df.to_parquet('df.parquet.gzip', compression='gzip')  

In [11]:
pd.read_parquet('df.parquet.gzip') 

## Structural Similarity Analysis

### Alignment Clustering

Melodia can cluster segments of the proteins in the alignment to determine the highly conserved regions. It infers the conservation patterns by comparing the differential geometry of aligned positions. Melodia uses a threshold to determine and group the regions where curvature and torsion are deemed similar. 

For more information about structural clustering see:

Rinaldo W. Montalvão, Richard E. Smith, Simon C. Lovell, Tom L. Blundell, CHORAL: a differential geometry approach to the prediction of the cores of protein structures, *Bioinformatics*, Volume 21, Issue 19, January 2005, Pages 3719–3725, https://doi.org/10.1093/bioinformatics/bti595
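
To illustrate what grouping by a threshold means, the sketch below clusters made-up (curvature, torsion) pairs for a single aligned position. It is not the CHORAL or Melodia algorithm: the greedy seed-based grouping, the Euclidean distance, and the name `group_by_threshold` are all assumptions chosen for brevity.

```python
from math import dist


def group_by_threshold(points, threshold):
    """Greedy grouping: a point joins the first group whose seed is within threshold."""
    groups = []  # each group lists indices into points; the first index is the seed
    for index, point in enumerate(points):
        for group in groups:
            if dist(point, points[group[0]]) <= threshold:
                group.append(index)
                break
        else:
            # No existing seed is close enough, so this point starts a new group
            groups.append([index])
    return groups


# Made-up (curvature, torsion) pairs for one aligned position in four structures
column = [(0.50, 0.10), (0.55, 0.12), (1.90, -0.80), (0.48, 0.09)]
print(group_by_threshold(column, threshold=1.1))  # [[0, 1, 3], [2]]
```

In this toy column, three structures share similar local geometry and one differs, which is the kind of split the colour annotation in the following sections makes visible.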


In [21]:
mel.cluster_alignment(align=align, threshold=1.1, long=True)

### Alignment Cluster Annotation

Melodia can save a colour-annotated version of the alignment as a PostScript file. Within the aligned positions, each colour marks a block of residues with similar geometry. These clusters accommodate the multimodal nature of the structural ensemble of a homologous family, which is particularly characteristic of the Cα spatial distributions in low-similarity superfamilies and in loop regions. The differential geometry approach classifies these regions better than Cα distances alone.

In [25]:
# First select a colour palette
palette = 'Dark2'
colors = 7
sns.color_palette(palette, colors)

In [23]:
# Save a PS file with the colour-coded alignment
mel.save_align_to_ps(align=align, ps_file='model', palette=palette, colors=colors)

![alignment cluster](ali_cluster.png)

### Structure Superposition and Annotation

Melodia can also create a PyMOL script that loads and superposes the protein structures and colours the clustered regions as in the PS file. The following command produces a **cluster_models.pml** script file for this operation.

In [26]:

mel.save_pymol_script(align=align, pml_file='cluster_models', palette=palette, colors=colors)

Run **pymol cluster_models.pml** on the command line to create this visualization.

![pdb clusters](pdbs_cluster.png)

***
Using a colour-coded alignment and colour-coded structures can give valuable insights into the structural conservation patterns of a protein family. Colour-coded structures can indicate different conformational patterns in an otherwise reasonably similar region that would be classified as a common framework in many programs.
***