# Training Protein Embeddings

In this notebook, we will preprocess the data that will be used for training a protein function prediction model, i.e. we will embed the training protein sequences into a vector format.

## Libraries

In [1]:
!pip install biopython

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import re
import torch
import pandas as pd
import numpy as np

from Bio import SeqIO
from tqdm import tqdm
from transformers import T5Tokenizer, T5EncoderModel

## Data

The objective of the model will be to predict the terms (functions) of a protein sequence. It's important to keep in mind that one protein sequence can have many functions (GO Term IDs) and all of them must be predicted by our model for each protein sequence.
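Because each protein can carry several GO terms at once, the prediction target is usually represented as a multi-hot vector with one position per term. The snippet below is a minimal sketch of that encoding; the protein IDs and annotations are hypothetical toy values, and `to_multi_hot` is an illustrative helper, not part of this notebook's pipeline.

```python
# Hypothetical toy annotations: protein ID -> GO term IDs (not real labels).
annotations = {
    "protein_a": ["GO:0003674", "GO:0005515"],
    "protein_b": ["GO:0005515"],
    "protein_c": [],
}

# Fixed, sorted vocabulary so that every position always means the same term.
vocabulary = sorted({term for terms in annotations.values() for term in terms})
term_index = {term: i for i, term in enumerate(vocabulary)}


def to_multi_hot(terms):
    """Return a 0/1 vector with a 1 at the position of every annotated term."""
    vector = [0] * len(term_index)
    for term in terms:
        vector[term_index[term]] = 1
    return vector


targets = {pid: to_multi_hot(terms) for pid, terms in annotations.items()}
```

A protein without annotations simply maps to an all-zero vector, so every protein ends up with a target of the same length.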

### Protein Sequence

Our data is composed of protein sequences, each stored as a string of letters in which every letter is the one-letter code of an amino acid. The sequences can be found in the file `train_sequences.fasta`.

In [3]:
proteins = SeqIO.parse('./data/cafa-5-protein-function-prediction/Train/train_sequences.fasta', "fasta")
train_proteins = {}

for protein in proteins:
    train_proteins[protein.id] = {'sequence': str(protein.seq), 'GO': {'BPO': [], 'CCO': [], 'MFO': []}}
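For readers unfamiliar with the FASTA layout that `SeqIO.parse` reads: each record starts with a `>` header line whose first token is the record ID, followed by one or more lines of sequence. The sketch below parses a made-up two-record example with the standard library only; `read_fasta` is a hypothetical helper for illustration and is not a replacement for Biopython.

```python
from io import StringIO

# Hypothetical two-record FASTA text; the IDs and sequences are made up.
fasta_text = """>seq1 example header
MKTAYIAK
QRQISFVK
>seq2
MNSVTV
"""


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted file handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            record_id, chunks = line[1:].split()[0], []
        else:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


records = dict(read_fasta(StringIO(fasta_text)))
```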

In [4]:
list(train_proteins.items())[:3]

[('P20536',
  {'sequence': 'MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNKRVCVCGIDPYPKDGTGVPFESPNFTKKSIKEIASSISRLTGVIDYKGYNLNIIDGVIPWNYYLSCKLGETKSHAIYWDKISKLLLQHITKHVSVLYCLGKTDFSNIRAKLESPVTTIVGYHPAARDRQFEKDRSFEIINVLLELDNKVPINWAQGFIY',
   'GO': {'BPO': [], 'CCO': [], 'MFO': []}}),
 ('O73864',
  {'sequence': 'MTEYRNFLLLFITSLSVIYPCTGISWLGLTINGSSVGWNQTHHCKLLDGLVPDQQQLCKRNLELMHSIVRAARLTKSACTSSFSDMRWNWSSIESAPHFTPDLAKGTREAAFVVSLAAAVVSHAIARACASGDLPSCSCAAMPSEQAAPDFRWGGCGDNLRYYGLQMGSAFSDAPMRNRRSGPQDFRLMQLHNNAVGRQVLMDSLEMKCKCHGVSGSCSVKTCWKGLQDISTISADLKSKYLSATKVIPRQIGTRRQLVPREMEVRPVGENELVYLVSSPDYCTQNAKQGSLGTTDRQCNKTASGSESCGLMCCGRGYNAYTEVLVERCQCKYHWCCYVSCKTCKRTVERYVSK',
   'GO': {'BPO': [], 'CCO': [], 'MFO': []}}),
 ('O95231',
  {'sequence': 'MRLSSSPPRGPQQLSSFGSVDWLSQSSCSGPTHTPRPADFSLGSLPGPGQTSGAREPPQAVSIKEAAGSSNLPAPERTMAGLSKEPNTLRAPRVRTAFTMEQVRTLEGVFQHHQYLSPLERKRLAREMQLSEVQIKTWFQNRRMKHKRQMQDPQLHSPFSGSLHAPPAFYSTSSGLANGLQLLCPWAPLSGPQALMLPPGSFWGLCQVAQEALASAGASCCGQPLASHPPTPGRPSLGPALSTGPR

### Taxonomy

The file `train_taxonomy.tsv` contains a list of proteins and the species to which they belong (taxonomy ID). The first column is the protein's UniProt accession ID and the second is the taxon ID.

In [5]:
train_taxonomy = pd.read_csv('./data/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv', sep='\t')

In [6]:
train_taxonomy.head(3)

| | EntryID | taxonomyID |
|---|---|---|
| 0 | Q8IXT2 | 9606 |
| 1 | Q04418 | 559292 |
| 2 | A8DYA3 | 7227 |


### Gene Ontology

The functional properties of proteins are defined using the Gene Ontology (GO) with respect to Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The annotated terms (ground truths) for the protein sequences are available in the file `train_terms.tsv`, where the first column is the protein's UniProt accession ID (unique protein ID), the second is the GO Term ID, and the third is the ontology (aspect) in which the term appears.

In [7]:
train_terms = pd.read_csv('./data/cafa-5-protein-function-prediction/Train/train_terms.tsv', sep='\t')

In [8]:
train_terms.head(3)

| | EntryID | term | aspect |
|---|---|---|---|
| 0 | A0A009IHW8 | GO:0008152 | BPO |
| 1 | A0A009IHW8 | GO:0034655 | BPO |
| 2 | A0A009IHW8 | GO:0072523 | BPO |


## Embedding

We start by initializing the tokenizer and the model that will generate the protein embeddings. We use an encoder-only, half-precision version of the [ProtT5-XL-UniRef50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) model, which was pretrained on a large corpus of protein sequences in a self-supervised fashion. [This version](https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc) lets us generate protein embeddings even with limited GPU memory, since it is fully usable on 8 GB of video RAM.

In [26]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16).to(device)
# Half precision targets GPU inference; switch to full precision when running on CPU
model = model.float() if device.type == 'cpu' else model

In [29]:
def generate_embedding(sequence):
    # Map the rare amino acids "U,Z,O,B" to "X"
    seq = [" ".join(list(re.sub(r"[UZOB]", "X", sequence)))]

    # Encode the sequence
    ids = tokenizer.batch_encode_plus(seq, add_special_tokens=True, padding="longest")
    input_ids = torch.tensor(ids['input_ids']).to(device)
    attention_mask = torch.tensor(ids['attention_mask']).to(device)

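    # Generate the per-residue embeddings
    with torch.no_grad():
        output = model(input_ids=input_ids, attention_mask=attention_mask)

    # Average over the residue positions only, skipping the trailing end-of-sequence token
    return output.last_hidden_state[0, :len(sequence)].mean(dim=0)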
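When per-residue vectors are pooled into a single per-protein vector, it matters which positions enter the mean: T5-style tokenizers typically append an end-of-sequence token, and batched inputs also contain padding. The following is a minimal, framework-free sketch of masked mean pooling on made-up numbers; `masked_mean` and the toy values are illustrative assumptions, not the model's actual output.

```python
def masked_mean(token_vectors, mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy 2-dimensional "hidden states": three residues, one end-of-sequence
# token, and one padding position (all values are made up).
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0], [0.0, 0.0]]
residue_mask = [1, 1, 1, 0, 0]

pooled = masked_mean(hidden, residue_mask)      # residues only
naive = masked_mean(hidden, [1] * len(hidden))  # every position
```

In this toy case the two results differ noticeably, which is why restricting the mean to residue positions is generally the safer choice.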

Let's now convert the protein sequences into vectors (embeddings) that will be used to train our model. Because a protein is a sequence of letters, we can treat it much like a sentence in NLP and embed it with a pretrained language model, here the ProtT5 encoder loaded above. The function `generate_embedding` averages the per-residue vectors into a single fixed-length vector per protein.
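The input preparation used before tokenization can be looked at in isolation: rare residue codes are mapped to `X`, and residues are separated by spaces so that each one becomes its own token. The sketch below mirrors that step with the standard library only; `prepare_sequence` is a hypothetical helper name and the input string is made up.

```python
import re


def prepare_sequence(sequence):
    """Replace the rare codes U, Z, O, B with X and space-separate residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence)
    return " ".join(cleaned)


example = prepare_sequence("MUKZB")
```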

In [31]:
for data in tqdm(train_proteins.values(), desc="Generating embeddings"):
    # Store each embedding on the CPU as a NumPy array so results do not occupy GPU memory
    data['embedding'] = generate_embedding(data['sequence']).detach().cpu().numpy()

Generating embeddings:   0%|                                                                | 0/142246 [00:00<?, ?it/s]

In [None]:
list(train_proteins.items())[:3]
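Embedding one sequence per call can be slow for a dataset of this size. One common mitigation, sketched here as an assumption rather than something this notebook does, is to sort sequences by length and group them into batches whose padded size stays under a budget. `batch_by_length` and `max_residues` are hypothetical names, and the toy sequences are made up.

```python
def batch_by_length(sequences, max_residues):
    """Group sequence IDs into batches whose padded size stays under a budget.

    `sequences` maps ID -> sequence. The padded size of a batch is estimated
    as (longest sequence in the batch) * (number of sequences in the batch).
    A sequence longer than the budget ends up in a batch of its own.
    """
    ordered = sorted(sequences, key=lambda pid: len(sequences[pid]))
    batches, current = [], []
    for pid in ordered:
        longest = len(sequences[pid])  # ascending order: the newest is the longest
        if current and longest * (len(current) + 1) > max_residues:
            batches.append(current)
            current = []
        current.append(pid)
    if current:
        batches.append(current)
    return batches


toy = {"a": "MK", "b": "MKTAYIAK", "c": "MNS", "d": "MKTAYIAKQRQISFVK"}
batches = batch_by_length(toy, max_residues=16)
```

Each batch could then be tokenized with padding and pooled with the attention mask, so that padded positions do not enter the mean.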

## Integrating Data

Now that the protein sequences are embedded, we can integrate the data into a single table that contains every protein sequence together with its taxonomy ID and its GO Term IDs for all three aspects (MF, BP, and CC).

In [89]:
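# Assumption: `embedded_train_proteins` (not defined above) is `train_proteins` with embeddings added
embedded_train_proteins = train_proteins

for _, row in train_taxonomy.iterrows():
    if row['EntryID'] in train_proteins:
        embedded_train_proteins[row['EntryID']]['taxonomyID'] = row['taxonomyID']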

In [90]:
for _, row in train_terms.iterrows():
    if row['EntryID'] in train_proteins:
        embedded_train_proteins[row['EntryID']]['GO'][row['aspect']].append(row['term'])
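The grouping performed by the loop above, collecting GO terms per protein and per aspect, can be illustrated on a few rows. This is a standard-library sketch on hypothetical data in the same `(EntryID, term, aspect)` layout; the protein IDs `P1` and `P2` are made up.

```python
from collections import defaultdict

# Hypothetical rows in the same (EntryID, term, aspect) layout as train_terms.
rows = [
    ("P1", "GO:0008152", "BPO"),
    ("P1", "GO:0003674", "MFO"),
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005575", "CCO"),
]


def empty_annotation():
    """Fresh container with one term list per ontology aspect."""
    return {"BPO": [], "CCO": [], "MFO": []}


go_by_protein = defaultdict(empty_annotation)
for entry_id, term, aspect in rows:
    go_by_protein[entry_id][aspect].append(term)
```

On the full table, iterating with `DataFrame.itertuples` instead of `iterrows` is usually noticeably faster, since it avoids building a Series object for every row.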

In [105]:
data_list = []
features = []
labels = []

for protein_id, data in embedded_train_proteins.items():
    data_list.append([
        protein_id,
        data['taxonomyID'],
        data['sequence'],
        data['embedding'],
        data['GO']['BPO'],
        data['GO']['CCO'],
        data['GO']['MFO']
    ])

    protein_features = [
        protein_id,            
        data['taxonomyID'],     
    ]

    protein_features.extend(data['embedding'])
    features.append(protein_features)

    labels.append([
        protein_id,
        data['GO']['BPO'],
        data['GO']['CCO'],
        data['GO']['MFO']
    ])

In [106]:
train_df = pd.DataFrame(data_list, columns=['ProteinID', 'TaxonomyID', 'Sequence', 'Embedding', 'BPO', 'CCO', 'MFO'])
train_df.set_index('ProteinID', inplace=True)
train_df.index.name = None

In [108]:
train_df.head(3)

| | TaxonomyID | Sequence | Embedding | BPO | CCO | MFO |
|---|---|---|---|---|---|---|
| P20536 | 10249 | MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIP... | [-0.090443626, -0.1601761, -0.020129055, 0.067... | [GO:0008152, GO:0071897, GO:0044249, GO:000625... | [] | [GO:0005515, GO:0005488, GO:0003674] |
| O73864 | 7955 | MTEYRNFLLLFITSLSVIYPCTGISWLGLTINGSSVGWNQTHHCKL... | [-0.090443626, -0.1601761, -0.020129055, 0.067... | [GO:0061371, GO:0048589, GO:0051641, GO:004885... | [GO:0071944, GO:0005575, GO:0110165, GO:001602... | [GO:0046982, GO:0003674, GO:0005488, GO:000551... |
| O95231 | 9606 | MRLSSSPPRGPQQLSSFGSVDWLSQSSCSGPTHTPRPADFSLGSLP... | [-0.090443626, -0.1601761, -0.020129055, 0.067... | [GO:0006357, GO:0010557, GO:0045935, GO:006500... | [GO:0005622, GO:0031981, GO:0043229, GO:004322... | [GO:0003676, GO:1990837, GO:0001216, GO:000548... |


In [111]:
column_names = ['ProteinID', 'TaxonomyID'] + ['Embed_' + str(i+1) for i in range(1024)]
X_train = pd.DataFrame(features, columns=column_names)
X_train.set_index('ProteinID', inplace=True)
X_train.index.name = None

In [114]:
X_train.head(30)

Unnamed: 0,TaxonomyID,Embed_1,Embed_2,Embed_3,Embed_4,Embed_5,Embed_6,Embed_7,Embed_8,Embed_9,...,Embed_1015,Embed_1016,Embed_1017,Embed_1018,Embed_1019,Embed_1020,Embed_1021,Embed_1022,Embed_1023,Embed_1024
P20536,10249,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
O73864,7955,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
O95231,9606,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
A0A0B4J1F4,10090,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
P54366,7227,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
P33681,9606,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
P77596,83333,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
Q16787,9606,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
Q59VP0,237561,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414
P13508,6239,-0.090444,-0.160176,-0.020129,0.067913,-0.015551,0.033842,0.043394,0.041893,-0.16812,...,-0.054649,-0.084524,0.044401,0.080548,-0.015732,-0.013776,-0.005043,-0.011905,-0.090861,-0.008414


In [None]:
y_train = pd.DataFrame(labels, columns=['ProteinID', 'BPO', 'CCO', 'MFO'])
y_train.set_index('ProteinID', inplace=True)
y_train.index.name = None

## Store 

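Let's save the embedded training data as CSV files so we can easily load it later. Note that list- and array-valued columns (`Embedding`, `BPO`, `CCO`, `MFO`) are written to CSV as their string representations.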

In [None]:
# Keep the index: it holds the protein IDs (index=False would drop them)
train_df.to_csv('data_train.csv')

In [None]:
X_train.to_csv('features_train.csv')

In [None]:
y_train.to_csv('labels_train.csv')