# MMLigner User Tutorial


#### Developed at Volkamerlab, Charité/FU Berlin 

by Dennis Köser

# Reference

MMLigner (Collier et al., 2017) works by minimizing the ivalue of the alignment. The ivalue is based on
the Minimum Message Length framework (Wallace and Boulton, 1968; Wallace, 2005), a Bayesian framework for
statistical inductive inference. The ivalue represents the hypothetical minimum message length needed to transmit
the computed alignment losslessly (Shannon, 1948).
Using the ivalue measure, the algorithm rapidly creates crude-but-effective structural alignments to act as seeds.
These seeds are iteratively refined with an Expectation-Maximization approach using the ivalue criterion.
Subtracting the ivalue from the message length of the null model gives the statistical significance of the alignment. If the
difference is positive, the alignment is significant.
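
The significance test described above comes down to one subtraction. The following is a minimal sketch of that arithmetic only, not MMLigner's own code; the function names and the message lengths are made up for illustration.

```python
def compression_bits(null_model_bits: float, ivalue_bits: float) -> float:
    """Return how many bits the alignment saves compared to the null model."""
    return null_model_bits - ivalue_bits


def is_significant(null_model_bits: float, ivalue_bits: float) -> bool:
    """Treat an alignment as significant if the difference is positive."""
    return compression_bits(null_model_bits, ivalue_bits) > 0


# Made-up message lengths in bits, for illustration only.
print(is_significant(null_model_bits=5000.0, ivalue_bits=4200.0))  # True
print(is_significant(null_model_bits=5000.0, ivalue_bits=5100.0))  # False
```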

Collier, J.H., Allison, L., Lesk, A.M., Stuckey, P.J., Garcia de la Banda, M., Konagurthu, A.S. (2017)
Statistical inference of protein structural alignments using information and compression.
Bioinformatics, 33(7), 1005–1013.

Wallace, C.S. and Boulton, D.M. (1968) An information measure for classification.
Comput. J., 11, 185–194.

Wallace, C.S. (2005) Statistical and Inductive Inference Using Minimum Message Length.
Information Science and Statistics. Springer-Verlag, New York, NY.

Shannon, C.E. (1948) A mathematical theory of communication.
Bell Syst. Tech. J., 27, 379–423.

# Introduction


## What are the chosen structures

Since this project was developed during the SARS-CoV-2 pandemic of 2020, we chose the main protease of SARS-CoV from 2006 (2GZ9) and the main protease of SARS-CoV-2 from 2020 (5R8T) as example structures for this tutorial. SARS-CoV and SARS-CoV-2 are strains of viruses that cause severe acute respiratory syndrome (SARS). The chosen proteases are required for the maturation of SARS-CoV and SARS-CoV-2 respectively, which makes them good targets for structure-based design of anti-SARS drugs.

## Why they have been chosen

In addition to their relevance for the pandemic of 2020, these structures work well as an example for this tutorial because they are closely related, which results in very similar but not identical structures. Both have a length of 306 residues and, when aligned with Needleman-Wunsch, show a similarity of 98.69 % and an identity of 96.08 %. The alignment with MMLigner results in an RMSD of 1.244 Å, a coverage of 303 and an ivalue of 29.559 bits.
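
To make the identity figure concrete, here is a minimal sketch of computing percent identity from two aligned sequences. It assumes that the alignment is given as two gapped strings of equal length and that identity is counted over all alignment columns; tools differ in which denominator they use, so this is an illustration rather than the exact calculation behind the numbers above. Under that assumption, 96.08 % over 306 positions would correspond to 294 identical positions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    return 100.0 * matches / len(aligned_a)


# Toy sequences, not the real proteases: 4 of 6 columns are identical.
print(round(percent_identity("ACDE-G", "ACDQFG"), 2))  # 66.67

# Arithmetic check for the tutorial's figure under the stated assumption.
print(round(100.0 * 294 / 306, 2))  # 96.08
```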

# Theory

## About the RCSB

The [RCSB PDB](http://www.rcsb.org/) (Research Collaboratory for Structural Bioinformatics Protein Data Bank) provides a global PDB archive and makes PDB data available at no charge to all data consumers without limitations on usage.

## RMSD

The RMSD (root-mean-square deviation) is a measure of the average distance between the atoms of superposed structures, given in Ångström.
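
As a minimal sketch of the standard definition, the RMSD is the square root of the mean squared distance between paired atoms. The function below assumes that the structures are already superposed and that the atoms are given as two equally long lists of paired (x, y, z) coordinates; it is not the routine MMLigner uses internally.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    # Sum of squared Euclidean distances between paired atoms.
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates in Ångström: every atom pair is exactly 1 Å apart.
print(rmsd([(0, 0, 0), (0, 0, 0)], [(1, 0, 0), (0, 1, 0)]))  # 1.0
```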

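## Coverage

The coverage describes how much of the aligned structures is covered by the alignment. For the example structures, the coverage is 303 at a length of 306 residues.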

## ivalue

The minimum message length of the compressed alignment in bits.
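
Message lengths are measured in bits because, following Shannon (1948), an event with probability p can be encoded in -log2(p) bits. The sketch below shows only this basic relationship; the function name is illustrative and it is not how MMLigner computes the ivalue of a full alignment.

```python
import math


def code_length_bits(probability: float) -> float:
    """Shannon code length in bits of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


print(code_length_bits(0.5))    # 1.0
print(code_length_bits(0.125))  # 3.0
```

Less probable events need longer messages, which is why a better model of the data yields a shorter total message.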

# Preparation

## How to get the structure from the CLI

To get the structures directly from the RCSB, the syntax looks like this:

In [1]:
!structuralalignment --method=mmligner 2GZ9 5R8T

The command "structuralalignment" is either misspelled or
could not be found.


To use locally saved structures instead, pass their file paths:

In [2]:
!structuralalignment --method=mmligner PATH_OF_2GZ9 PATH_OF_5R8T

The command "structuralalignment" is either misspelled or
could not be found.


## Getting the structure in Python

The aligner takes atomium models (`atomium.Model` objects) as input.

If you want to get the structures from the RCSB, you can do the following:

In [3]:
%load_ext autoreload

In [4]:
%autoreload 2

In [5]:
import atomium

structure1 = atomium.fetch("2GZ9").model
structure2 = atomium.fetch("5R8T").model

# Alignment 

Using MMLigner in Python looks like this:

In [None]:
from structuralalignment.superposition.mmligner import MMLignerAligner

mmligner = MMLignerAligner()
results = mmligner.calculate([structure1, structure2])

In [7]:
results["scores"]["rmsd"]

NameError: name 'results' is not defined

In addition, you can compute the ivalue of an already computed alignment. This looks like:

In [8]:
mmligner = MMLignerAligner()

# `alignment` must hold a previously computed alignment of the two structures
results = mmligner.ivalue([structure1, structure2], alignment)

NameError: name 'MMLignerAligner' is not defined

The results are returned as a dictionary containing the superposed structures (`results["superposed"]`), the scores such as the RMSD (`results["scores"]["rmsd"]`), the ivalue and the coverage, as well as the alignment.
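
Because the exact contents of the dictionary are not listed here, one cautious way to inspect the scores is to iterate over whatever `results["scores"]` contains instead of assuming key names. The helper below is a hypothetical sketch and not part of the package; the stand-in dictionary contains only the `"rmsd"` key used earlier in this tutorial.

```python
def format_scores(results):
    """Return one 'name: value' line per entry found in results['scores']."""
    return [f"{name}: {value}" for name, value in results["scores"].items()]


# Stand-in for a real results dictionary; only "rmsd" is taken from this tutorial.
example_results = {"scores": {"rmsd": 1.244}}
for line in format_scores(example_results):
    print(line)  # rmsd: 1.244
```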

# Analysis

## NGLview

If you have trouble with NGLview, follow this [troubleshooting guide](https://github.com/SBRG/ssbio/wiki/Troubleshooting#tips-for-nglview).

In [9]:
def atomium_to_nglview(*models):
    """
    Represent Atomium models in NGLView.

    Parameters
    ----------
    models : atomium.Model
        One or more models to display.

    Returns
    -------
    nglview.NGLWidget
    """
    import nglview as nv
    from structuralalignment.utils import enter_temp_directory

    v = nv.NGLWidget()
    with enter_temp_directory():
        for i, model in enumerate(models):
            fn = f"tmp{i}.pdb"
            model.save(fn)
            fs = nv.FileStructure(fn)
            v.add_component(fs)
    return v

In [10]:
import nglview as nv
print("nglview version = {}".format(nv.__version__))
# your nglview version should be 1.1.7 or later

v = atomium_to_nglview(*results["superposed"])
v

nglview version = 1.1.7


NameError: name 'results' is not defined

In [11]:
vv = nv.show_file("/tmp/tmp8zrlttud/structure1.pdb")
vv.add_component("/tmp/tmp8zrlttud/p_superposed__1.pdb")
vv

ValueError: you must provide file extension if using file-like object or text content

## Report

* RMSD before and after
* coverage
* what residues are mapped
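
For the last two report items, a minimal sketch of listing the mapped residues could look like this. It assumes that the alignment is available as two gapped strings of equal length, which is an assumption about the format and not something the package guarantees. Counting the mapped pairs as the coverage is likewise an assumption about how coverage is defined.

```python
def mapped_pairs(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (index_a, index_b) for every column in which both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    index_a = index_b = 0  # zero-based residue positions in the ungapped sequences
    for a, b in zip(aligned_a, aligned_b):
        if a != gap and b != gap:
            pairs.append((index_a, index_b))
        if a != gap:
            index_a += 1
        if b != gap:
            index_b += 1
    return pairs


# Toy alignment, not the real proteases.
pairs = mapped_pairs("AC-DE", "ACGD-")
print(pairs)       # [(0, 0), (1, 1), (2, 3)]
print(len(pairs))  # 3
```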