# Test Proteins Embedding

In this notebook, we preprocess the data used to test a protein function prediction model: we embed each test protein sequence as a fixed-length vector.

## Libraries

In [None]:
!pip install biopython

In [None]:
import re
import torch
import pandas as pd
import numpy as np

from Bio import SeqIO
from tqdm import tqdm
from transformers import T5Tokenizer, T5EncoderModel

## Data

### Protein Sequence

Our data consists of protein sequences, each a string of one-letter amino acid codes, together with the taxonomy ID of the source organism. The sequences are stored in the file `testsuperset.fasta`.

In [None]:
proteins = SeqIO.parse('./data/cafa-5-protein-function-prediction/Test (Targets)/testsuperset.fasta', "fasta")
test_proteins = {}

for protein in proteins:
    taxonomyID = protein.description.split()[1]
    test_proteins[protein.id] = {'sequence': str(protein.seq), 'taxonomyID': taxonomyID}

In [None]:
list(test_proteins.items())[:3]
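
To make the header parsing concrete, here is a minimal sketch on a made-up record. It assumes the description line has the form `<protein ID> <taxonomy ID>`, which is what the `split()[1]` call above relies on. The function name `parse_description` and the sample values are hypothetical and are not taken from the real file.

```python
def parse_description(description):
    """Split a FASTA description into (protein_id, taxonomy_id).

    Assumes whitespace-separated fields with the taxonomy ID in second
    position; returns None for the taxonomy ID when that field is missing.
    """
    fields = description.split()
    protein_id = fields[0]
    taxonomy_id = fields[1] if len(fields) > 1 else None
    return protein_id, taxonomy_id


# Hypothetical headers, used only for illustration.
print(parse_description("P12345\t9606"))
print(parse_description("P12345"))
```

Returning `None` instead of raising an `IndexError` is one possible design choice; it makes records with an unexpected header easy to spot afterwards.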

### Taxonomy

The file `testsuperset-taxon-list.tsv` contains a list of taxonomy descriptions. The first column is the taxon ID and the second is the corresponding description.

In [None]:
test_taxonomy = pd.read_csv('./data/cafa-5-protein-function-prediction/Test (Targets)/testsuperset-taxon-list.tsv', sep='\t', encoding='latin1')

In [None]:
test_taxonomy.head(3)

## Embedding

We start by initializing the tokenizer and the model that we will use to generate the protein embeddings. We use the encoder-only, half-precision version of the ProtT5-XL-UniRef50 model, which was pretrained on a large corpus of protein sequences in a self-supervised fashion. This version lets us generate protein embeddings even with limited GPU memory, because it is fully usable on 8 GB of video RAM.

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")
model = model.to(device)
model = model.eval()
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
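
Before a sequence reaches the tokenizer, the notebook maps the rare or ambiguous residues U, Z, O and B to X and separates the residues with spaces, so that each amino acid becomes one token. The sketch below shows both steps on a toy string; the helper name `prepare_sequence` is hypothetical, and the notebook performs the same two operations inline in later cells.

```python
import re


def prepare_sequence(sequence):
    """Map rare/ambiguous residues to X and space-separate the residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence)
    # Joining a string with " " inserts a space between every character.
    return " ".join(cleaned)


# Toy sequence, not a real protein.
print(prepare_sequence("MKUZA"))  # M K X X A
```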

In [None]:
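# Sort IDs and sequences together (longest first) so that all_ids[i] matches sequences[i].
sorted_proteins = sorted(test_proteins.items(), key=lambda kv: len(kv[1]['sequence']), reverse=True)
all_ids = [protein_id for protein_id, _ in sorted_proteins]
# Map rare/ambiguous amino acids (U, Z, O, B) to X.
sequences = [re.sub(r"[UZOB]", "X", protein['sequence']) for _, protein in sorted_proteins]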

for seq in sequences[:3]:
    print(len(seq))
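
The batching loop that follows closes a batch when it holds too many sequences, when it holds too many residues, when a very long sequence has just been added, or when the input is exhausted. The sketch below isolates that rule on plain integers. It is a simplified variant and not the notebook's exact loop: it counts each sequence length once, and the function name `make_batches` is hypothetical.

```python
def make_batches(lengths, max_batch=100, max_residues=4000, max_seq_len=1000):
    """Group sequence indices into batches under size and residue budgets.

    `lengths` is expected in descending order. A batch is closed when it
    holds `max_batch` sequences, when its residue count reaches
    `max_residues`, when the sequence just added is longer than
    `max_seq_len`, or when the last sequence has been added.
    """
    batches, current, n_residues = [], [], 0
    for idx, seq_len in enumerate(lengths):
        current.append(idx)
        n_residues += seq_len
        is_last = idx == len(lengths) - 1
        if (len(current) >= max_batch or n_residues >= max_residues
                or seq_len > max_seq_len or is_last):
            batches.append(current)
            current, n_residues = [], 0
    return batches


# Toy lengths, sorted longest first.
print(make_batches([1200, 900, 900, 900, 500], max_residues=2000))
```

Sorting by length first keeps sequences of similar size in the same batch, which reduces the amount of padding added by `padding="longest"`.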

In [None]:
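batch = list()
max_batch = 100      # maximum number of sequences per batch
max_residues = 4000  # residue budget per batch
max_seq_len = 1000   # a sequence longer than this closes the batch immediately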

for idx, seq in enumerate(sequences):
    seq_len = len(seq)
    seq = ' '.join(list(seq))
    batch.append((all_ids[idx], seq, seq_len))

    n_res_batch = sum([s_len for _, _, s_len in batch]) + seq_len
    if len(batch) >= max_batch or n_res_batch >= max_residues or idx == len(sequences) - 1 or seq_len > max_seq_len:
        protein_ids, seqs, seq_lens = zip(*batch)
        batch = list()

        token_encoding = tokenizer.batch_encode_plus(seqs, add_special_tokens=True, padding="longest")
        input_ids = torch.tensor(token_encoding['input_ids']).to(device)
        attention_mask = torch.tensor(token_encoding['attention_mask']).to(device)

        try:
            with torch.no_grad():
                embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)
        except RuntimeError:
            print("RuntimeError during embedding for {} (L={})".format(all_ids[idx], seq_len))
            continue

        for batch_idx, identifier in enumerate(protein_ids):
            s_len = seq_lens[batch_idx]
            protein_emb = embedding_repr.last_hidden_state[batch_idx,:s_len].mean(dim=0)

            test_proteins[identifier]['embedding'] = protein_emb.detach().cpu().numpy().squeeze()
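
The per-protein embedding is the mean of the per-residue vectors, taken over the first `s_len` positions only, so padding rows (and the end-of-sequence token that the T5 tokenizer appends) do not affect the result. The sketch below reproduces that pooling step in plain Python on toy numbers; `mean_pool` is a hypothetical helper, and the notebook itself does this with a tensor slice followed by `.mean(dim=0)`.

```python
def mean_pool(token_vectors, seq_len):
    """Average the first `seq_len` token vectors and ignore the rest."""
    residues = token_vectors[:seq_len]
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / seq_len for d in range(dim)]


# Toy "last hidden state" for one protein: 2 real residues + 1 padding row.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(hidden, seq_len=2))  # [2.0, 3.0]
```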

In [None]:
list(test_proteins.items())[:3]

## Integrating Data

Now that our protein sequences are embedded, we can integrate the data into a single table that contains every protein sequence together with its taxonomy ID and taxonomy description.

In [None]:
lookup_taxonomy = dict(zip(test_taxonomy['ID'], test_taxonomy['Species']))

for protein in test_proteins.values():
    if int(protein['taxonomyID']) in lookup_taxonomy:
        protein['taxonomy'] = lookup_taxonomy[int(protein['taxonomyID'])]
    else:
        protein['taxonomy'] = ''

In [None]:
list(test_proteins.items())[:3]
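
The lookup above can also be written with `dict.get`, which returns a default value when the key is absent. This is a minimal sketch on made-up records, assuming the same conventions as the notebook: integer keys in the lookup table and string taxonomy IDs on the proteins. The protein IDs `P1` and `P2` are hypothetical.

```python
# Toy lookup table (taxon ID -> species name).
lookup = {9606: "Homo sapiens", 10090: "Mus musculus"}

# Hypothetical proteins; the second one has a taxon ID missing from the table.
proteins = {
    "P1": {"taxonomyID": "9606"},
    "P2": {"taxonomyID": "99999"},
}

for protein in proteins.values():
    # Fall back to an empty string when the taxon ID is unknown.
    protein["taxonomy"] = lookup.get(int(protein["taxonomyID"]), "")

print(proteins)
```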

In [None]:
data_list = []
features = []

for protein_id, data in test_proteins.items():
    data_list.append([
        protein_id,
        data['taxonomyID'],
        data['taxonomy'],
        data['sequence'],
        data['embedding']
    ])

    protein_features = [
        protein_id,
        data['taxonomyID'],
    ]

    protein_features.extend(data['embedding'])
    features.append(protein_features)

In [None]:
test_df = pd.DataFrame(data_list, columns=['ProteinID', 'TaxonomyID', 'Taxonomy', 'Sequence', 'Embedding'])
test_df.set_index('ProteinID', inplace=True)
test_df.index.name = None

In [None]:
test_df.head(3)

In [None]:
column_names = ['ProteinID', 'TaxonomyID'] + ['Embed_' + str(i+1) for i in range(1024)]
X_test = pd.DataFrame(features, columns=column_names)
X_test.set_index('ProteinID', inplace=True)
X_test.index.name = None

In [None]:
X_test.head(3)
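
The column list above hardcodes 1024 embedding dimensions, which matches the vectors produced in this notebook. As a hedged alternative, the column names can be derived from the length of an actual embedding, so the code keeps working if the model changes. The helpers `make_columns` and `flatten_row` are hypothetical names used only for this sketch.

```python
def make_columns(embedding_dim):
    """Build the feature column names for a given embedding size."""
    return ["ProteinID", "TaxonomyID"] + [f"Embed_{i + 1}" for i in range(embedding_dim)]


def flatten_row(protein_id, taxonomy_id, embedding):
    """Put the ID fields and every embedding component in one flat row."""
    return [protein_id, taxonomy_id, *embedding]


# Toy embedding with 3 components instead of 1024.
embedding = [0.1, 0.2, 0.3]
columns = make_columns(len(embedding))
row = flatten_row("P1", "9606", embedding)
print(columns)
print(row)
```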

## Store

Let's save the embedded data as CSV files so that we can easily load them later for testing.

In [None]:
# ProteinID is the DataFrame index, so write the index to keep the IDs in the file.
test_df.to_csv('data_test.csv', index_label='ProteinID')

In [None]:
# ProteinID is the DataFrame index, so write the index to keep the IDs in the file.
X_test.to_csv('features_test.csv', index_label='ProteinID')