# Feature Extraction with ProtT5

Note that feature extraction is computationally intensive. Even with a GPU, the whole process can take up to a few hours for PDB14189.

Since GPU RAM is limited, you will most likely need to convert the proteins in PDB14189_shuffled.csv into feature vectors in batches. I manually divided the dataset into 11 batches. On Google Colab, you will want to clear and restart the session after you finish each batch so that memory is released. See the section below on PDB14189 for more detail.

In [None]:
import pandas as pd
import numpy as np
import pickle
import re
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab Notebooks/stale pickles
%ls

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/stale pickles
 1075.pkl                       test_data_shuffled.csv
 1075_shuffled.csv              train_data_10.pkl
 186.pkl                        train_data_11.pkl
 186_shuffled.csv               train_data_1.pkl
'Copy of classifier.ipynb'      train_data_2.pkl
'Copy of Preprocessing.ipynb'   train_data_3.pkl
 mixPosNeg.csv                  train_data_4.pkl
 PDB14189_N.txt                 train_data_5.pkl
 PDB14189_P.txt                 train_data_6.pkl
 PDB2272_N.txt                  train_data_7.pkl
 PDB2272_P.txt                  train_data_8.pkl
 pr.jpg                         train_data_9.pkl
 roc.jpg                        train_data_shuffled.csv
 testData186.csv                train_to_pickle.ipynb
 test_data.pkl                  val_data_shuffled.csv


In [None]:
!pip install -q SentencePiece transformers
import pickle
import torch
from transformers import T5EncoderModel, T5Tokenizer, pipeline
import numpy as np
import gc
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False )
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")



Some weights of the model checkpoint at Rostlab/prot_t5_xl_uniref50 were not used when initializing T5EncoderModel: ['decoder.block.3.layer.1.EncDecAttention.k.weight', 'decoder.block.1.layer.1.EncDecAttention.v.weight', 'decoder.block.17.layer.0.SelfAttention.k.weight', ...

> The code below converts 1075_shuffled.csv into feature vectors. To convert 186_shuffled.csv, change the file name inside the pd.read_csv() statement to "186_shuffled.csv" and the pickle file name to "186.pkl".

In [None]:
# device=0 runs the pipeline on the first GPU
fe = pipeline('feature-extraction', model=model, tokenizer=tokenizer, device=0)

In [None]:
train = pd.read_csv('1075_shuffled.csv') # or 186_shuffled.csv / 14189_shuffled.csv / 2272_shuffled.csv

In [None]:
X_train=train["ProteinSequence"]
y_train=train["Class"]

In [None]:
# Add a space between amino acids
seq = [" ".join(s) for s in X_train]

# Replace rare or ambiguous amino acids (U, Z, O, B, X) with the <unk> token
seq = [re.sub(r"[UZOBX]", "<unk>", sequence) for sequence in seq]
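As a quick sanity check of the preprocessing above, the sketch below applies the same two steps to a single made-up sequence. The helper name `prepare_sequence` and the toy input are illustrative assumptions, not part of the original notebook.

```python
import re

def prepare_sequence(s):
    """Space-separate the residues, then map U, Z, O, B and X to <unk>."""
    spaced = " ".join(s)  # "MKA" -> "M K A"
    return re.sub(r"[UZOBX]", "<unk>", spaced)

toy = "MKXA"  # hypothetical sequence, for illustration only
print(prepare_sequence(toy))  # prints: M K <unk> A
```

Running this on one or two real rows before embedding the whole dataset is a cheap way to confirm that the input looks the way you expect.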

In [None]:
 
embedding = fe(seq) 

In [None]:
# Each pipeline output e is a nested list of shape [1][num_tokens][hidden_size];
# e[0][0] is the embedding of the first token of that sequence.
cls_embedding = np.asarray([e[0][0] for e in embedding])

In [None]:

with open('1075.pkl', 'wb') as f: # or 186.pkl / 14189_1.pkl / 2272.pkl
  pickle.dump(cls_embedding, f, protocol=4)

In [None]:
# Drop the references first, then collect garbage and release cached GPU memory
del cls_embedding
del embedding
gc.collect()
torch.cuda.empty_cache()

## How to convert PDB14189 and PDB2272 into feature vectors

The setup is the same. You only need to make a few small changes to the code above.

First, write train = pd.read_csv('14189_shuffled.csv') to read in PDB14189_shuffled.csv. Then change the name of the pickle file you write to (e.g., "14189_1.pkl"). These two steps are straightforward.

If you only make these changes, you will likely run out of GPU memory. The solution is to manually divide PDB14189 into batches by slicing seq. See the example below:

embedding = fe(seq[:1500]) <- This converts only the first 1500 proteins in PDB14189, which keeps the GPU from running out of memory. The next batch would be fe(seq[1500:3000]), and so on.

If your GPU still complains, reduce the batch size further until it doesn't. This process can take a bit of patience. When you save your pickle files, remember to number them (that's why I have 11 pickle files for PDB14189: I divided PDB14189 into 11 batches).
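The manual slicing described above can also be written as a loop. The following is a minimal sketch, not the workflow actually used here (which restarted the Colab session between batches). The function name `embed_in_batches`, its parameters, and the `<prefix>_<n>.pkl` naming are assumptions for illustration; `extract_fn` stands in for the `fe` pipeline.

```python
import pickle
import numpy as np

def embed_in_batches(seqs, extract_fn, batch_size=1500, prefix="14189"):
    """Embed seqs slice by slice and write one numbered pickle per slice.

    extract_fn is assumed to behave like the fe pipeline: it takes a list
    of strings and returns one nested list per sequence.
    """
    paths = []
    starts = range(0, len(seqs), batch_size)
    for i, start in enumerate(starts, start=1):
        batch = seqs[start:start + batch_size]
        embedding = extract_fn(batch)
        # Keep the first-token vector of each sequence
        first_token = np.asarray([e[0][0] for e in embedding])
        path = f"{prefix}_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(first_token, f, protocol=4)
        paths.append(path)
        # Drop references before processing the next slice
        del embedding, first_token
    return paths
```

With the objects from the cells above, a call might look like `embed_in_batches(seq, fe, batch_size=1500)`. If memory still fills up across iterations, fall back to running one slice per session.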

The same goes for PDB2272. On Google Colab, I was able to convert PDB2272 in one shot, but whether that works for you depends on your GPU.
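Once a dataset has been saved as several numbered pickle files, the pieces have to be put back together in the original order before they can be paired with the labels. The sketch below shows one way to do that. The function name `load_batches` and the `<prefix>_<n>.pkl` naming are assumptions, so pass whatever prefix your files actually use.

```python
import pickle
import numpy as np

def load_batches(prefix, n_batches):
    """Load prefix_1.pkl ... prefix_<n_batches>.pkl and stack them row-wise."""
    parts = []
    # Iterate by number so that batch 10 comes after batch 9, not after batch 1
    for i in range(1, n_batches + 1):
        with open(f"{prefix}_{i}.pkl", "rb") as f:
            parts.append(pickle.load(f))
    return np.concatenate(parts, axis=0)
```

For example, `X = load_batches("14189", 11)` would rebuild the full feature matrix from 11 files named 14189_1.pkl through 14189_11.pkl. Checking that `X.shape[0]` equals the number of rows in the CSV is a useful guard against a missing or duplicated batch.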