# Beyond the primary structure of sequences


We may have access to the 3D structure of our sequences (for example, predicted by a folding tool). How can we use seqme to evaluate sequences based on their 3D structure? This notebook shows how.


In [None]:
# !pip install tmtools

In [None]:
from typing import Literal

import numpy as np
from tmtools import tm_align

import seqme as sm

Let's define a metric which uses atomic positions. Here we use RMSD.
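As a reminder of what the score measures, RMSD over N paired atoms is the square root of the mean squared distance between corresponding atoms. The sketch below computes it with NumPy for two coordinate arrays that are assumed to have the same shape and to be already superimposed. The helper name `rmsd` is illustrative and is not part of seqme or tmtools.

```python
import numpy as np


def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate arrays, assumed to be pre-aligned."""
    if coords_a.shape != coords_b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Squared Euclidean distance per atom pair, then mean and square root.
    squared_distances = np.sum((coords_a - coords_b) ** 2, axis=1)
    return float(np.sqrt(squared_distances.mean()))


# Toy example: three atoms, the second structure shifted by 1 along x.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
b = a + np.array([1.0, 0.0, 0.0])
print(rmsd(a, a))  # 0.0
print(rmsd(a, b))  # 1.0
```

Note that this plain version counts a rigid shift as deviation. `tm_align`, used in the metric that follows, first aligns and superimposes the two structures and reports the RMSD over the aligned residues, which is why it can also compare sequences of different lengths.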


In [None]:
class RMSD(sm.Metric):
    """Root mean square deviation of atomic positions."""

    def __init__(self, reference: str, sequence_to_coordinates: dict[str, np.ndarray]):
        self.reference = reference
        self.sequence_to_coordinates = sequence_to_coordinates

    def __call__(self, sequences: list[str]) -> sm.MetricResult:
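        ref_coords = self.sequence_to_coordinates[self.reference]
        # Align each structure to the reference with TM-align and keep the RMSD of that alignment.
        scores = np.array(
            [tm_align(self.sequence_to_coordinates[seq], ref_coords, seq, self.reference).rmsd for seq in sequences]
        )
        # Report the mean RMSD over all sequences in the group.
        return sm.MetricResult(scores.mean().item())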

    @property
    def name(self) -> str:
        return "RMSD"

    @property
    def objective(self) -> Literal["minimize", "maximize"]:
        return "minimize"

Let's define our protein folding model.

In [None]:
cache = sm.ModelCache(models={"esm-fold": sm.models.EsmFold()})

Some weights of EsmForProteinFolding were not initialized from the model checkpoint at facebook/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
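# Short toy sequences; the string literal that follows is unused and only keeps the full-length proteins for reference.
sequences = ["MRKIVV", "MVHAT"]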
"""
sequences = [
    "MRKIVVAAIAVSLTTVSITASASADPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ",
    "MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKESPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE",
]
"""
coordinates = cache.model("esm-fold", variable_length=True)(sequences)
sequence_to_coordinates = dict(zip(sequences, coordinates, strict=True))

Let's create the metric, using the first sequence as the reference, and give each sequence a group name.


In [None]:
metrics = [RMSD(reference=sequences[0], sequence_to_coordinates=sequence_to_coordinates)]

In [None]:
# Each named group holds a single sequence.
sequences = {f"Protein {i + 1}": [seq] for i, seq in enumerate(sequence_to_coordinates)}

Let's compute the metric.


In [None]:
df = sm.compute_metrics(sequences, metrics)

100%|██████████| 2/2 [00:00<00:00, 35.37it/s, data=Protein 2, metric=RMSD]


In [None]:
sm.show_table(df)

| | RMSD↓ |
|---|---|
| Protein 1 | 0.0 |
| Protein 2 | 0.59 |


Recall that seqme defines three groups of metrics: sequence-based, embedding-based, and property-based. Which group does this metric belong to? Metrics operating on 3D structure closely resemble property-based metrics: sequence → 3D structure (property) → metric. The 3D structure plays the role of the property, so structure-based metrics fit in the property-based group.
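The sequence → 3D structure → metric pipeline can be sketched without any folding model. In this minimal sketch, `structure_metric`, `toy_fold`, and `max_extent` are hypothetical names introduced for illustration only; they are not part of seqme, and `toy_fold` simply places residues on a straight line instead of predicting a real structure.

```python
from collections.abc import Callable, Sequence

import numpy as np


def structure_metric(
    sequences: Sequence[str],
    fold: Callable[[str], np.ndarray],
    score: Callable[[np.ndarray], float],
) -> float:
    """Average a structure-derived score over a group of sequences."""
    # Step 1: sequence -> 3D structure (the "property").
    structures = [fold(seq) for seq in sequences]
    # Step 2: 3D structure -> scalar score, averaged into one metric value.
    return float(np.mean([score(coords) for coords in structures]))


def toy_fold(sequence: str) -> np.ndarray:
    """Hypothetical stand-in for a folding model: residues on a straight line."""
    return np.array([[float(i), 0.0, 0.0] for i in range(len(sequence))])


def max_extent(coords: np.ndarray) -> float:
    """Largest distance from the first residue, a toy structural property."""
    return float(np.linalg.norm(coords - coords[0], axis=1).max())


print(structure_metric(["MRKIVV", "MVHAT"], toy_fold, max_extent))  # 4.5
```

Swapping `toy_fold` for a real structure predictor and `max_extent` for a structural score such as RMSD to a reference gives the same shape of computation as the `RMSD` metric defined earlier.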