# Protein Embeddings

In this notebook, we preprocess the data that will be used to train a protein function prediction model: we embed the protein sequences into a vector format and combine them with their taxonomy and function annotations.

## Libraries

In [1]:
!pip install biopython

Defaulting to user installation because normal site-packages is not writeable


In [32]:
import pandas as pd
import torch
from Bio import SeqIO

## Data

The objective of the model is to predict the terms (functions) of a protein sequence. Keep in mind that one protein sequence can have many functions (GO Term IDs), and the model must predict all of them for each sequence. This makes the task a multi-label prediction problem.
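
Because each protein can carry several GO terms, the targets are usually encoded as one 0/1 vector per protein. The following is a minimal sketch of such a multi-hot encoding. The names `annotations`, `vocabulary`, and `multi_hot` are illustrative assumptions and not part of this notebook. The GO IDs are taken from the sample rows shown in this notebook.

```python
# Toy annotations: protein ID -> GO term IDs (a small subset, for illustration only).
annotations = {
    "A0A009IHW8": ["GO:0008152", "GO:0034655", "GO:0072523"],
    "P20536": ["GO:0008152", "GO:0005515"],
}

# Fixed, sorted vocabulary of every GO term that appears in the annotations.
vocabulary = sorted({term for terms in annotations.values() for term in terms})
term_index = {term: i for i, term in enumerate(vocabulary)}


def multi_hot(terms, term_index):
    """Return a 0/1 list with a 1 at the position of every annotated term."""
    vector = [0] * len(term_index)
    for term in terms:
        vector[term_index[term]] = 1
    return vector


# One target vector per protein; all vectors share the same length and term order.
targets = {pid: multi_hot(terms, term_index) for pid, terms in annotations.items()}
```

Every position in a target vector corresponds to one GO term, so a model can predict an independent probability for each term.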

### Protein Sequence

Our data is composed of protein sequences, each stored as a string of letters in which every letter is the one-letter code of an amino acid. The sequences can be found in the file `train_sequences.fasta`.

In [3]:
proteins = SeqIO.parse('./data/cafa-5-protein-function-prediction/Train/train_sequences.fasta', "fasta")
train_proteins = {}

for protein in proteins:
    train_proteins[protein.id] = {'sequence': str(protein.seq), 'GO': {'BPO': [], 'CCO': [], 'MFO': []}}

In [4]:
list(train_proteins.items())[:3]

[('P20536',
  {'sequence': 'MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNKRVCVCGIDPYPKDGTGVPFESPNFTKKSIKEIASSISRLTGVIDYKGYNLNIIDGVIPWNYYLSCKLGETKSHAIYWDKISKLLLQHITKHVSVLYCLGKTDFSNIRAKLESPVTTIVGYHPAARDRQFEKDRSFEIINVLLELDNKVPINWAQGFIY',
   'GO': {'BPO': [], 'CCO': [], 'MFO': []}}),
 ('O73864',
  {'sequence': 'MTEYRNFLLLFITSLSVIYPCTGISWLGLTINGSSVGWNQTHHCKLLDGLVPDQQQLCKRNLELMHSIVRAARLTKSACTSSFSDMRWNWSSIESAPHFTPDLAKGTREAAFVVSLAAAVVSHAIARACASGDLPSCSCAAMPSEQAAPDFRWGGCGDNLRYYGLQMGSAFSDAPMRNRRSGPQDFRLMQLHNNAVGRQVLMDSLEMKCKCHGVSGSCSVKTCWKGLQDISTISADLKSKYLSATKVIPRQIGTRRQLVPREMEVRPVGENELVYLVSSPDYCTQNAKQGSLGTTDRQCNKTASGSESCGLMCCGRGYNAYTEVLVERCQCKYHWCCYVSCKTCKRTVERYVSK',
   'GO': {'BPO': [], 'CCO': [], 'MFO': []}}),
 ('O95231',
  {'sequence': 'MRLSSSPPRGPQQLSSFGSVDWLSQSSCSGPTHTPRPADFSLGSLPGPGQTSGAREPPQAVSIKEAAGSSNLPAPERTMAGLSKEPNTLRAPRVRTAFTMEQVRTLEGVFQHHQYLSPLERKRLAREMQLSEVQIKTWFQNRRMKHKRQMQDPQLHSPFSGSLHAPPAFYSTSSGLANGLQLLCPWAPLSGPQALMLPPGSFWGLCQVAQEALASAGASCCGQPLASHPPTPGRPSLGPALSTGPR

### Taxonomy

The file `train_taxonomy.tsv` contains a list of proteins and the species to which they belong (taxonomy ID). The first column is the protein's UniProt accession ID and the second is the taxon ID.

In [5]:
train_taxonomy = pd.read_csv('./data/cafa-5-protein-function-prediction/Train/train_taxonomy.tsv', sep='\t')

train_taxonomy.head(3)

### Gene Ontology

The functional properties of proteins are defined using the Gene Ontology (GO) with respect to Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The annotated terms (ground truths) for the protein sequences are available in the file `train_terms.tsv`, where the first column is the protein's UniProt accession ID (unique protein ID), the second is the GO Term ID, and the third is the ontology (aspect) in which the term appears.

In [6]:
train_terms = pd.read_csv('./data/cafa-5-protein-function-prediction/Train/train_terms.tsv', sep='\t')

In [7]:
train_terms.head(3)

| | EntryID | term | aspect |
|---|---|---|---|
| 0 | A0A009IHW8 | GO:0008152 | BPO |
| 1 | A0A009IHW8 | GO:0034655 | BPO |
| 2 | A0A009IHW8 | GO:0072523 | BPO |


## Embedding

### Training Proteins

Let's now convert the protein sequences from strings of letters into a vector format (embeddings) that will be used to train our model. Since a protein is a sequence of letters, we can use an approach similar to the one used to generate word embeddings for NLP models. In particular, we can use any publicly available pre-trained protein embedding model to generate the embeddings. **TODO**
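
To illustrate what "sequence in, fixed-length vector out" means, here is a minimal sketch that maps a sequence to its amino acid composition (the fraction of each of the 20 standard residues). This is a simple hand-crafted baseline, not a pre-trained embedding model. The function name `composition_embedding` is an assumption made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_embedding(sequence):
    """Map a sequence to a 20-dimensional vector of amino acid frequencies."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    known = 0
    for residue in sequence.upper():
        if residue in counts:  # skip non-standard or ambiguous codes such as X
            counts[residue] += 1
            known += 1
    if known == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / known for aa in AMINO_ACIDS]


# Prefix of the first training sequence, used only as a short example input.
embedding = composition_embedding("MNSVTVSHAPYTITYHDDWEPVMSQL")
```

Whatever the sequence length, the result has 20 entries, which is the property a downstream model needs. A pre-trained model would play the same role but produce a learned, higher-dimensional vector.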

### Test Proteins

Let's do the same embedding for the test protein sequences. **TODO**

## Integrating Data

Now that our protein sequences are prepared, we can integrate the data into a single table that contains every protein sequence together with its taxonomy ID and its GO Term IDs for each aspect (MF, BP, and CC).

In [9]:
for _, row in train_taxonomy.iterrows():
    if row['EntryID'] in train_proteins:
        train_proteins[row['EntryID']]['taxonomyID'] = row['taxonomyID']

In [33]:
for _, row in train_terms.iterrows():
    if row['EntryID'] in train_proteins:
        train_proteins[row['EntryID']]['GO'][row['aspect']].append(row['term'])
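
Row-by-row iteration with `iterrows` can be slow on large tables. As a hedged alternative, the sketch below groups the terms by protein and aspect first and then fills the dictionary once per group. `toy_terms` and `toy_proteins` are hypothetical stand-ins that mirror the structure of `train_terms` and `train_proteins`, with IDs taken from the sample rows in this notebook.

```python
import pandas as pd

# Toy stand-in for train_terms (same column names as the real file).
toy_terms = pd.DataFrame({
    "EntryID": ["P20536", "P20536", "P20536", "A0A009IHW8"],
    "term": ["GO:0008152", "GO:0071897", "GO:0005515", "GO:0008152"],
    "aspect": ["BPO", "BPO", "MFO", "BPO"],
})

# Toy stand-in for train_proteins; A0A009IHW8 is deliberately absent.
toy_proteins = {
    "P20536": {"GO": {"BPO": [], "CCO": [], "MFO": []}},
}

# Collect all terms of each (protein, aspect) pair into one list.
grouped = toy_terms.groupby(["EntryID", "aspect"], sort=False)["term"].agg(list)

for (entry_id, aspect), go_terms in grouped.items():
    if entry_id in toy_proteins:  # ignore terms for proteins without a sequence
        toy_proteins[entry_id]["GO"][aspect].extend(go_terms)
```

The loop now runs once per (protein, aspect) pair instead of once per annotation row, and the resulting dictionary has the same shape as the one built by the cells above.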

In [37]:
data_list = []

for protein_id, data in train_proteins.items():
    data_list.append([
        protein_id,
        data.get('taxonomyID'),  # None if the protein has no taxonomy entry
        data['sequence'],
        data['GO']['BPO'],
        data['GO']['CCO'],
        data['GO']['MFO'],
    ])

train_df = pd.DataFrame(data_list, columns=['ProteinID', 'TaxonomyID', 'Sequence', 'BPO', 'CCO', 'MFO'])
train_df.set_index('ProteinID', inplace=True)
train_df.index.name = None

In [38]:
train_df.head()

| | TaxonomyID | Sequence | BPO | CCO | MFO |
|---|---|---|---|---|---|
| P20536 | 10249 | MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIP... | [GO:0008152, GO:0071897, GO:0044249, GO:000625... | [] | [GO:0005515, GO:0005488, GO:0003674] |
| O73864 | 7955 | MTEYRNFLLLFITSLSVIYPCTGISWLGLTINGSSVGWNQTHHCKL... | [GO:0061371, GO:0048589, GO:0051641, GO:004885... | [GO:0071944, GO:0005575, GO:0110165, GO:001602... | [GO:0046982, GO:0003674, GO:0005488, GO:000551... |
| O95231 | 9606 | MRLSSSPPRGPQQLSSFGSVDWLSQSSCSGPTHTPRPADFSLGSLP... | [GO:0006357, GO:0010557, GO:0045935, GO:006500... | [GO:0005622, GO:0031981, GO:0043229, GO:004322... | [GO:0003676, GO:1990837, GO:0001216, GO:000548... |
| A0A0B4J1F4 | 10090 | MGGEAGADGPRGRVKSLGLVFEDESKGCYSSGETVAGHVLLEAAEP... | [GO:0008152, GO:0051234, GO:0036211, GO:007072... | [GO:0065010, GO:0043226, GO:1903561, GO:011016... | [GO:0030674, GO:0003674, GO:1990756, GO:014076... |
| P54366 | 7227 | MVETNSPPAGYTLKRSPSDLGEQQQPPRQISRSPGNTAAYHLTTAM... | [] | [GO:0005622, GO:0043229, GO:0043226, GO:011016... | [GO:0001227, GO:0042803, GO:0042802, GO:000548... |


## Store 

Let's save the embedded data as CSV files for both training and testing so that we can easily access it later. **TODO**
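
One detail to plan for: the GO columns hold Python lists, and a CSV file stores them only as text. The sketch below shows one possible convention, joining each list with `;` before writing and splitting it again after reading. The helper names `to_csv_text` and `from_csv_text`, the `;` separator, and the toy table are assumptions for illustration. An in-memory buffer is used instead of a file so that the example is self-contained.

```python
import io

import pandas as pd

GO_COLUMNS = ["BPO", "CCO", "MFO"]


def to_csv_text(df):
    """Serialize the table to CSV text, storing each GO list as 'a;b;c'."""
    out = df.copy()
    for col in GO_COLUMNS:
        out[col] = out[col].apply(";".join)
    buffer = io.StringIO()
    out.to_csv(buffer)
    return buffer.getvalue()


def from_csv_text(text):
    """Read the CSV text back and restore the GO columns to lists."""
    # keep_default_na=False keeps empty fields as '' instead of NaN.
    df = pd.read_csv(io.StringIO(text), index_col=0, keep_default_na=False)
    for col in GO_COLUMNS:
        df[col] = df[col].apply(lambda s: s.split(";") if s else [])
    return df


# Toy table shaped like train_df (sequences and term lists shortened for the example).
toy_df = pd.DataFrame(
    {
        "TaxonomyID": [10249, 7227],
        "Sequence": ["MNSVTVSHAPYTITYHDDWEPVMSQL", "MVETNSPPAGYTLKRSPSDLGEQQQ"],
        "BPO": [["GO:0008152", "GO:0071897"], []],
        "CCO": [[], ["GO:0005622", "GO:0043229"]],
        "MFO": [["GO:0005515", "GO:0005488", "GO:0003674"], ["GO:0001227"]],
    },
    index=["P20536", "P54366"],
)

restored = from_csv_text(to_csv_text(toy_df))
```

With a real file, the same idea applies by passing a path to `to_csv` and `read_csv` instead of a buffer.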