# Embeddings and Vectors with Scikit-Bio

**Welcome to scikit-bio tutorial-02!** In this tutorial, we will showcase how the scikit-bio library can be utilized for embedding and vectorizing sets of protein sequences. Our goal is to demonstrate sequence classification and structural alignment using *TM-Vec* and *DeepBLAST* respectively.

### Introduction: the role of deep learning in `scikit-bio`
The application of deep learning in biology is gaining widespread popularity, with protein language models (pLMs) being a notable success story. By generating protein embeddings from sequence data, pLMs have paved the way for a range of innovative applications, such as fluorescent protein design by [ESM3](https://techcrunch.com/2024/06/25/evolutionaryscale-backed-by-amazon-and-nvidia-raises-142m-for-protein-generating-ai/) and [CRISPR-Cas design](https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1).

Compared to traditional letter-encoded amino acids, protein embeddings offer a more expressive representation of a protein. In addition to sequence information, they encode structural, evolutionary (including organism of origin), and functional (such as thermostability and fluorescence) properties. These embeddings can be viewed as a compressed representation of multiple sequence alignments (MSAs), so tasks that benefit from an MSA are also likely to benefit from operations on embeddings. As computational optimizations of pLMs continue to emerge, embedding-based methods are likely to gain popularity. Our goal is to provide an infrastructure that enables users to conduct sequence analysis leveraging embeddings.

However, not all tasks benefit from embeddings. The limitations of embeddings are discussed in detail in [Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.02.05.578959v2.abstract).

### What is TM-Vec?
[TM-Vec](https://www.nature.com/articles/s41587-023-01917-2) is a tool that uses protein representations from a pLM to predict the structural similarity of two proteins.
### How does TM-Vec work?
General protein representations (also called "embeddings") that are predictive of protein structure are obtained from a pLM (ProtT5). A second neural network, TM-Vec, then transforms them into vectors. TM-Vec encodes proteins so that the cosine similarity between two vectors approximates their structural similarity (measured via TM-score), thereby eliminating the need for time-consuming structure prediction.
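
To make the vector comparison concrete, here is a minimal sketch of cosine similarity (one minus the cosine distance) in plain Python. The three short vectors are made-up stand-ins for TM-Vec outputs, not real model values, and the function name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional vectors (hypothetical values, for illustration only).
vec_a = [0.9, 0.1, 0.3, 0.2]
vec_b = [0.8, 0.2, 0.4, 0.1]    # points in nearly the same direction as vec_a
vec_c = [-0.1, 0.9, -0.2, 0.7]  # points in a very different direction

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, which in the TM-Vec setting would correspond to a high predicted TM-score.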

In [1]:
from importlib.util import find_spec
if find_spec('skbio') is None:
    !pip install -q scikit-bio

if find_spec('tmvec') is None:
    !pip install -q git+https://github.com/valentynbez/tmvec.git

if find_spec('deepblast') is None:
    !pip install -q git+https://github.com/valentynbez/deepblast.git

if find_spec('onnx') is None:
    !pip install -q onnx

!pip install -q biopython

In [2]:
import skbio
skbio.__version__

'0.6.2'

In [None]:
# get data
!mkdir -p data
!wget -q -O ./data/pdb_hits.fa "https://raw.githubusercontent.com/scikit-bio/scikit-bio-tutorials/main/02-language-model/data/pdb_hits.fa"
!wget -q -O ./data/bacteriocin.csv "https://raw.githubusercontent.com/scikit-bio/scikit-bio-tutorials/main/02-language-model/data/bagel_bacteriocins_all_classes.csv"

In [4]:
# All Necessary imports
from deepblast.dataset.utils import get_sequence, pack_sequences, revstate_f, states2alignment
from deepblast.utils import load_model
from skbio.embedding import ProteinEmbedding, ProteinVector
from tmvec.embedding import ProtT5Encoder
from skbio.alignment import TabularMSA
from skbio.sequence import Protein
import matplotlib.pyplot as plt
import skbio.embedding as emb
from skbio.io import read
from tqdm import tqdm
import numpy as np
import pandas as pd
import torch

### Helper Functions

In [5]:
# Helper functions for embedding and aligning sequences
def load_protein_t5_embedding(sequence, model_path, tokenizer_path):
    """Embed one protein sequence with ProtT5 and wrap it in a ProteinEmbedding."""
    # note: the encoder is instantiated on every call
    embedder = ProtT5Encoder(model_path=model_path, tokenizer_path=tokenizer_path, backend="onnx")

    # generate embeddings for a one-item batch and keep the only result
    embedding = embedder.get_sequence_embeddings([sequence])[0]

    return ProteinEmbedding(embedding, sequence)


def to_embeddings(sequences, model_name, tokenizer_name):
    """Lazily yield a ProteinEmbedding for each input sequence."""
    for sequence in tqdm(sequences):
        yield load_protein_t5_embedding(str(sequence), model_name, tokenizer_name)

def align(x, y, model):
    pred_alignment = model.align(str(x), str(y))
    return pred_alignment


def predict_aln_matrix(query_seq, target_seq, model):
    x_code = get_sequence(str(query_seq), model.tokenizer)[0].to(model.device)
    y_code = get_sequence(str(target_seq), model.tokenizer)[0].to(model.device)
    seq, order = pack_sequences([x_code], [y_code])
    with torch.no_grad():
        gen = model.aligner.traceback(seq, order)
    _, aln_mat = next(gen)

    # move to CPU first so the conversion also works when the model runs on a GPU
    return aln_mat.squeeze().detach().cpu().numpy()

In [6]:
# Load TM-Vec vectors from an .npz file and pair them with their sequences
def load_vectors(file_path, sequence_list):
    data = np.load(file_path, allow_pickle=True)
    vectors = data['embeddings']

    protein_vectors = [ProteinVector(vector, sequence) for vector, sequence in zip(vectors, sequence_list)]
    return protein_vectors

### Part 1: Embedding sequences to file

We're going to start by reading in the sequences with `skbio.io.read`, then storing the embedded sequences in `ProteinEmbedding` objects. Because `to_embeddings` is a generator, each sequence is embedded only when it is requested.

In [7]:
# an ONNX model we prepared for faster execution on CPU
model_path = "scikit-bio/prot-t5-xl-uniref50-onnx"
# tokenizer from the standard repo
tokenizer_path = "Rostlab/prot_t5_xl_uniref50"

# Parse pdb_hits.fa
sequence_list = read("data/pdb_hits.fa", format='fasta')
embed_list = to_embeddings(sequence_list, model_path, tokenizer_path)

In [None]:
next(embed_list)

### Part 2: Building a vector database and plotting an ordination

We can feed our FASTA file directly into the `tmvec build-db` command-line tool, which will write our
vectors to a .npz file at the specified location.

This command takes as input:
1. `--input-fasta`: a FASTA file containing your sequences.
2. `--output`: the file location to write the output to.

In [None]:
!tmvec build-db --input-fasta data/pdb_hits.fa --output outputs/pdb_hits_output

With the proteins encoded as vectors within a database, we can extract the vectors and
cast them to `ProteinVector` objects, which can then be used to visualize the structural similarity between the proteins.

In [None]:
# load sequences
sequence_list = read("data/pdb_hits.fa", format='fasta')

# read the vectors into a list of ProteinVector objects
vec_generator = load_vectors("outputs/pdb_hits_output.npz", sequence_list)

# convert the vectors into an OrdinationResults object for plotting
ord_results = emb.embed_vec_to_ordination(vec_generator)

# read in the bacteriocin sequence / function metadata
df = pd.read_csv("data/bacteriocin.csv")
df = df.dropna(subset=['Sequence']).set_index('Sequence')
df = df.groupby('Sequence').first()
# match the sequence ids to the rows in the metadata
common_ids = list(set(ord_results.samples.index) & set(df.index))

df = df.loc[common_ids]

ord_results.samples = ord_results.samples.loc[common_ids]

# plot the results
ord_results.plot(df, column='class', title='Bacteriocin Sequence TM-Scores');

In [None]:
del(to_embeddings)

### Part 3: Structural Alignment with DeepBLAST

DeepBLAST is a deep learning tool that leverages protein embeddings to incorporate structural alignment information into sequence alignment. In essence, DeepBLAST has distilled information from thousands of observed protein structure alignments and integrated it into its alignment matrix. Consequently, DeepBLAST alignments can accurately match dissimilar amino acids based on structural information, making it the most accurate sequence-based method in the [Malidup and Malisam benchmarks](https://www.nature.com/articles/s41587-023-01917-2/tables/2). While it falls short of structure alignment methods in terms of accuracy, DeepBLAST significantly reduces computational requirements, rendering the analysis of large datasets more feasible.
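
For orientation, the sketch below shows the classical Needleman-Wunsch global alignment with a fixed match/mismatch/gap scheme. It is not DeepBLAST's implementation: DeepBLAST derives its scores from embeddings rather than from letter identity, and the scoring values and the function name `needleman_wunsch` here are assumptions chosen for illustration.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings x and y with a fixed scoring scheme.

    Returns (aligned_x, aligned_y, score).
    """
    n, m = len(x), len(y)
    # score[i][j] holds the best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,      # gap in y
                score[i][j - 1] + gap,      # gap in x
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_x.append(x[i - 1])
                aligned_y.append(y[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return "".join(reversed(aligned_x)), "".join(reversed(aligned_y)), score[n][m]


top, bottom, total = needleman_wunsch("GATTACA", "GCATGCU")
print(top)
print(bottom)
print("score:", total)
```

With identity-based scores, two residues can only be matched favorably when the letters agree; replacing the fixed `match`/`mismatch` values with learned, position-specific scores is what allows a method like DeepBLAST to pair dissimilar amino acids.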

In [None]:
# download the model from scikit-bio HuggingFace repository!
!wget https://huggingface.co/scikit-bio/deepblast/resolve/main/deepblast-v3.ckpt -O ./data/deepblast-v3.ckpt

In [12]:
bagel_list = read("data/pdb_hits.fa", format='fasta', constructor=Protein)

In [14]:
x = next(bagel_list)
y = next(bagel_list)

In [15]:
model = load_model("./data/deepblast-v3.ckpt", device="cpu",
                   alignment_mode="needleman-wunsch"
                   )

path = align(x, y, model)

In [17]:
x_aligned, y_aligned = states2alignment(path, str(x), str(y))

In [None]:
print(x_aligned)
print(y_aligned)
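
The conversion from a predicted path to two gapped strings can be sketched as follows. This is a hypothetical re-implementation, not the `states2alignment` function from DeepBLAST: it assumes a state string in which `:` marks an aligned pair, `1` marks a residue present only in the first sequence, and `2` marks a residue present only in the second. The actual encoding used by DeepBLAST may differ.

```python
def states_to_alignment_sketch(states, x, y):
    """Expand a state string into two gapped sequences (assumed encoding).

    ':' consumes one residue from both x and y,
    '1' consumes one residue from x and puts a gap in y,
    '2' consumes one residue from y and puts a gap in x.
    """
    i = j = 0
    aligned_x, aligned_y = [], []
    for state in states:
        if state == ":":
            aligned_x.append(x[i])
            aligned_y.append(y[j])
            i += 1
            j += 1
        elif state == "1":
            aligned_x.append(x[i])
            aligned_y.append("-")
            i += 1
        elif state == "2":
            aligned_x.append("-")
            aligned_y.append(y[j])
            j += 1
        else:
            raise ValueError(f"unknown state: {state!r}")
    if i != len(x) or j != len(y):
        raise ValueError("state string does not consume both sequences fully")
    return "".join(aligned_x), "".join(aligned_y)


demo_x, demo_y = states_to_alignment_sketch(":1:2:", "ACGT", "AGTT")
print(demo_x)
print(demo_y)
```

Both output strings always have the same length as the state string, which is what makes them printable one above the other.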

### Part 4: Visualization of Predicted Alignment Matrix

The predicted alignment matrix can be plotted as a heatmap that shows which regions of the two sequences align and with what probability, enabling detailed analysis of conserved regions of proteins.

In [None]:
matrix = predict_aln_matrix(x, y, model)

# visualize the matrix as a heatmap with a colorbar
plt.imshow(matrix, cmap='viridis')
plt.colorbar()
plt.show();
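
Beyond the heatmap, the confidently aligned residue pairs can be listed directly. The sketch below assumes that rows index residues of the first sequence, columns index residues of the second, and values lie between 0 and 1; the toy matrix, the threshold of 0.5, and the helper name `confident_pairs` are all illustrative choices rather than part of DeepBLAST.

```python
def confident_pairs(matrix, threshold=0.5):
    """Return (row, column, value) for every cell at or above the threshold."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value >= threshold
    ]


# Toy 3 x 4 matrix standing in for a predicted alignment matrix (made-up values).
toy_matrix = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]

pairs = confident_pairs(toy_matrix, threshold=0.5)
for i, j, p in pairs:
    print(f"x[{i}] ~ y[{j}] (p={p})")
```

A NumPy array such as the one returned by `predict_aln_matrix` could be passed as `matrix.tolist()`.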