#**Description of Notebook:**
This notebook uses the pretrained ProtTrans models described in https://www.biorxiv.org/content/10.1101/2020.07.12.199554v1.full.pdf and available at https://github.com/agemagician/ProtTrans to create embedding vectors for each sequence in the input dataset. Word embeddings are created below for both the dataset from Veltri et al.'s paper and my own independent dataset.

#**Imports:**

In [1]:
!pip install -q transformers sentencepiece

In [2]:
import torch
from transformers import BertModel, BertTokenizer, XLNetModel, XLNetTokenizer, T5EncoderModel, T5Tokenizer
import re
import os
import requests
import gc
from tqdm.auto import tqdm
import pandas as pd
import numpy as np

In [3]:
print('Mounting google drive...')
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive/Dissertation_Embeddings"

Mounting google drive...
Mounted at /content/drive
/content/drive/MyDrive/Dissertation_Embeddings


#**Class To Extract Word Embeddings:**

In [4]:
class LM_EMBED:
  def __init__(self, language_model, max_len, rare_aa):
    self.lang_model = language_model
    self.max_len = max_len
    self.rare_aa = rare_aa

    # Import tokenizer and model from ProtTrans Pre-Trained Rostlab:
    if self.lang_model == 'BERT-BFD':
      self.tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False)
      self.model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
    elif self.lang_model == 'BERT':
      self.tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
      self.model = BertModel.from_pretrained("Rostlab/prot_bert")
    elif self.lang_model == 'T5-XL-BFD':
      self.tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False )
      self.model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")
      gc.collect()
    elif self.lang_model == 'T5-XL-UNI':
      self.tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False )
      self.model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
      gc.collect()
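    elif self.lang_model == 'XLNET':
      self.tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)
      self.model = XLNetModel.from_pretrained("Rostlab/prot_xlnet", mem_len=512)
    else:
      # Fail early instead of leaving self.tokenizer and self.model undefined:
      raise ValueError(f"Unknown language model: {self.lang_model}")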


  # Function to use the specified model and tokenizer to create word embedding array:
  def extract_word_embs(self, seq_df, filename):

    # Setting device to GPU if available:
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Assigning model to GPU if available, and setting to eval mode:
    self.model = self.model.to(device)
    self.model = self.model.eval()

    # Making a list of sequences from the df:
    seqs_list = seq_df.Sequence.to_list()

    # Adding spaces in between sequence letters (amino acids):
    seqs_spaced = self.add_spaces(seqs_list)

    # Map rarely occurring amino acids (U, Z, O, B) to X if they are present in the dataset:
    if self.rare_aa:
      seqs_spaced = [re.sub(r"[UZOB]", "X", sequence) for sequence in seqs_spaced]

    # ID list tokenized:
    ids = self.tokenizer.batch_encode_plus(seqs_spaced, add_special_tokens=True, padding = 'max_length', max_length = self.max_len)

    # Retrieving the input IDs and mask for attention as tensors:
    input_ids = torch.tensor(ids['input_ids']).to(device)
    attention_mask = torch.tensor(ids['attention_mask']).to(device)

    # Emptying cache to ensure enough memory:
    torch.cuda.empty_cache()

    # Loop to process the sequences into embeddings in batches of 10:
    for i in range(10, len(input_ids)+10, 10):
      if i%100 == 0:
        print("Initial Embedding Batch Ending with...", i)
      with torch.no_grad():
        embeddings = self.model(input_ids=input_ids[i-10:i],
                                attention_mask=attention_mask[i-10:i])[0]
      emb_array = embeddings.cpu().numpy()

      # Creating initial array or concatenating to existing array:
      if i==10:
        embedding_res = emb_array
      else:
        embedding_res = np.concatenate((embedding_res, emb_array))

    # Extracting features using the function below:
    features = self.extract_features(embedding_res, attention_mask)

    # Padding these features to a specified max length with zeros:
    padded_arr = self.pad(features)

    # Ensuring in correct location to save embeddings:
    %cd "/content/drive/MyDrive/Dissertation_Embeddings/Embeddings"

    # Saving array:
    print("Saving Embeddings...")
    np.save(filename, padded_arr)


  # Function to add spaces between the amino acids in each sequence:
  def add_spaces(self, df_col):
    return [" ".join(x) for x in df_col]

  # Function to remove any CLS or SEP tokens, just leaving features:
  def extract_features(self, emb_res, att_msk):
    features = [] 

    for seq_num in range(len(emb_res)):
      seq_len = (att_msk[seq_num] == 1).sum()

      if self.lang_model in ['BERT-BFD', 'BERT']:
        seq_emd = emb_res[seq_num][1:seq_len-1]

      elif self.lang_model in ['T5-XL-BFD', 'T5-XL-UNI']:
        seq_emd = emb_res[seq_num][:seq_len-1]

      elif self.lang_model == 'XLNET':
        padded_seq_len = len(att_msk[seq_num])
        seq_emd = emb_res[seq_num][padded_seq_len-seq_len:padded_seq_len-2]

      features.append(seq_emd)
    
    features_arr = np.array(features, dtype=object)

    return features_arr

  # Function to add zeros to pad all features to max length:
  def pad(self, features):
    dim1 = self.max_len-2   # reducing by 2 for CLS and SEP tokens which have already been removed
    dim2 = features[0].shape[1]

    for i in range(len(features)):
      if i%100 == 0:
        print("Padding Batch: ", i)

      all_zeros = np.zeros((dim1, dim2))
      all_zeros[:features[i].shape[0], :features[i].shape[1]] = features[i]

      if i==0:
        padded_arr = all_zeros
      elif i==1:
        padded_arr = np.stack((padded_arr, all_zeros), axis=0)
      else:
        reshaped_arr = all_zeros.reshape(1, all_zeros.shape[0], all_zeros.shape[1])
        padded_arr = np.vstack((padded_arr, reshaped_arr))
    
    return padded_arr
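
The two post-processing steps in the class (`extract_features`, then `pad`) can be checked on toy data without loading a model. The sketch below is a standalone NumPy illustration, not part of the class: `strip_special_tokens`, `pad_features`, and the family labels are hypothetical names. It assumes the token layouts that the slices in `extract_features` imply: BERT wraps the residues in a leading and trailing special token with padding on the right, T5 appends one end-of-sequence token with padding on the right, and XLNet pads on the left and ends with two special tokens.

```python
import numpy as np

def strip_special_tokens(emb, mask, family):
    """Return only the per-residue rows of one padded embedding.

    emb:    (padded_len, hidden) array for a single sequence
    mask:   (padded_len,) attention mask, 1 for real tokens, 0 for padding
    family: 'bert', 't5' or 'xlnet' (hypothetical labels for this sketch)
    """
    n_tokens = int(np.sum(mask == 1))  # residues plus special tokens
    padded_len = len(mask)
    if family == "bert":    # [CLS] r1..rn [SEP] pad...
        return emb[1:n_tokens - 1]
    if family == "t5":      # r1..rn </s> pad...
        return emb[:n_tokens - 1]
    if family == "xlnet":   # pad... r1..rn <sep> <cls>
        return emb[padded_len - n_tokens:padded_len - 2]
    raise ValueError(f"unknown family: {family}")

def pad_features(features, max_len):
    """Zero-pad variable-length (length_i, hidden) arrays into one 3-D array."""
    target_len = max_len - 2  # same convention as `pad` in the class
    hidden = features[0].shape[1]
    padded = np.zeros((len(features), target_len, hidden))
    for i, feat in enumerate(features):
        padded[i, :feat.shape[0], :] = feat  # copy real rows, the rest stays zero
    return padded

# Toy example: 3 residues, padded length 6, hidden size 1.
# Each row stores its own position so the effect of each slice is easy to read.
emb = np.arange(6, dtype=float).reshape(6, 1)
masks = {
    "bert": np.array([1, 1, 1, 1, 1, 0]),   # CLS + 3 residues + SEP, 1 pad
    "t5": np.array([1, 1, 1, 1, 0, 0]),     # 3 residues + EOS, 2 pads
    "xlnet": np.array([0, 1, 1, 1, 1, 1]),  # 1 pad, 3 residues + sep + cls
}
stripped = {name: strip_special_tokens(emb, mask, name) for name, mask in masks.items()}
padded = pad_features(list(stripped.values()), max_len=6)
```

Preallocating the output once avoids re-copying the whole array on every iteration, which is what repeated `np.vstack` calls do. Unlike `pad`, which returns a 2-D array when given a single sequence, this sketch always returns a 3-D array.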
    

#**Create Embeddings Using Class Above:**

##**Loading Datasets:**

###**Veltri Dataset:**

In [5]:
# Load Dataset:
X_train_Velt = pd.read_csv('Datasets/VELTRI/X_train_Velt.csv')
X_val_Velt = pd.read_csv('Datasets/VELTRI/X_val_Velt.csv')
X_test_Velt = pd.read_csv('Datasets/VELTRI/X_test_Velt.csv')

###**Independent Dataset:**

In [None]:
X_train_INDEP = pd.read_csv('Datasets/INDEP/X_train.csv')
X_val_INDEP = pd.read_csv('Datasets/INDEP/X_val.csv')
X_test_INDEP = pd.read_csv('Datasets/INDEP/X_test.csv')
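
Before embedding, it can be useful to preview what the class's preprocessing will do to the loaded sequences. The sketch below mirrors the two steps from `extract_word_embs` (space-separating the residues and mapping U, Z, O, B to X) as a standalone function. `preview_preprocessing` is a hypothetical helper, and the two input strings are made up for illustration.

```python
import re

RARE_AA = re.compile(r"[UZOB]")

def preview_preprocessing(sequences, rare_aa=True):
    """Space-separate residues and, if rare_aa is set, map U/Z/O/B to X.

    Also counts how many input sequences contain at least one rare residue.
    """
    spaced = [" ".join(seq) for seq in sequences]
    n_with_rare = sum(1 for seq in sequences if RARE_AA.search(seq))
    if rare_aa:
        spaced = [RARE_AA.sub("X", seq) for seq in spaced]
    return spaced, n_with_rare

# Made-up sequences; in the notebook one could pass e.g. X_train_Velt.Sequence.
spaced, n_with_rare = preview_preprocessing(["MKV", "GUZB"])
```

A non-zero `n_with_rare` indicates that the `rare_aa` flag changes the model input for that dataset.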

##**BERT BFD Language Model:**

In [None]:
# Max sequence length: the longest sequence across the datasets (183 for Veltri, 255 for the Independent Dataset) plus 2 for the special tokens added by the language models:
max_seq_len = 257
BERT_BFD_EMBED = LM_EMBED('BERT-BFD', max_seq_len, True)
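
The hard-coded value could also be derived from the data. The sketch below is one way to do that, assuming the lengths quoted in the comment above (183 and 255); `tokenizer_max_len` is a hypothetical helper and the stand-in sequences are synthetic.

```python
def tokenizer_max_len(*sequence_collections, n_special_tokens=2):
    """Longest sequence across all collections plus room for special tokens."""
    longest = max(len(seq) for seqs in sequence_collections for seq in seqs)
    return longest + n_special_tokens

# Synthetic stand-ins with the lengths quoted in the notebook comment.
veltri_like = ["A" * 183, "MKV"]
indep_like = ["G" * 255, "ACD"]

max_seq_len = tokenizer_max_len(veltri_like, indep_like)
veltri_only_len = tokenizer_max_len(veltri_like)
```

In the notebook, the arguments would be the `Sequence` columns of the train, validation, and test frames. Using one shared value for both datasets keeps all saved arrays the same shape.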

###**Embedding Veltri Dataset:**

In [None]:
BERT_BFD_EMBED.extract_word_embs(X_train_Velt, "BERT_BFD/BERT_BFD_VELTRI_X_TRAIN.npy")
BERT_BFD_EMBED.extract_word_embs(X_val_Velt, "BERT_BFD/BERT_BFD_VELTRI_X_VAL.npy")
BERT_BFD_EMBED.extract_word_embs(X_test_Velt, "BERT_BFD/BERT_BFD_VELTRI_X_TEST.npy")

###**Embedding Independent Dataset:**

In [None]:
BERT_BFD_EMBED.extract_word_embs(X_train_INDEP, "BERT_BFD/BERT_BFD_INDEP_X_TRAIN.npy")
BERT_BFD_EMBED.extract_word_embs(X_val_INDEP, "BERT_BFD/BERT_BFD_INDEP_X_VAL.npy")
BERT_BFD_EMBED.extract_word_embs(X_test_INDEP, "BERT_BFD/BERT_BFD_INDEP_X_TEST.npy")

##**BERT Language Model:**

In [None]:
max_seq_len = 257
BERT_EMBED = LM_EMBED('BERT', max_seq_len, True)

Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


###**Embedding Veltri Dataset:**

In [None]:
BERT_EMBED.extract_word_embs(X_train_Velt, "BERT/BERT_VELTRI_X_TRAIN.npy")
BERT_EMBED.extract_word_embs(X_val_Velt, "BERT/BERT_VELTRI_X_VAL.npy")
BERT_EMBED.extract_word_embs(X_test_Velt, "BERT/BERT_VELTRI_X_TEST.npy")

###**Embedding Independent Dataset:**

In [None]:
BERT_EMBED.extract_word_embs(X_train_INDEP, "BERT/BERT_INDEP_X_TRAIN.npy")
BERT_EMBED.extract_word_embs(X_val_INDEP, "BERT/BERT_INDEP_X_VAL.npy")
BERT_EMBED.extract_word_embs(X_test_INDEP, "BERT/BERT_INDEP_X_TEST.npy")

##**T5XL BFD Language Model:**

In [None]:
max_seq_len = 257
T5XL_BFD_EMBED = LM_EMBED('T5-XL-BFD', max_seq_len, True)

###**Embedding Veltri Dataset:**

In [None]:
T5XL_BFD_EMBED.extract_word_embs(X_train_Velt, "T5XL_BFD/T5XL_BFD_VELTRI_X_TRAIN.npy")
T5XL_BFD_EMBED.extract_word_embs(X_val_Velt, "T5XL_BFD/T5XL_BFD_VELTRI_X_VAL.npy")
T5XL_BFD_EMBED.extract_word_embs(X_test_Velt, "T5XL_BFD/T5XL_BFD_VELTRI_X_TEST.npy")

###**Embedding Independent Dataset:**

In [None]:
T5XL_BFD_EMBED.extract_word_embs(X_train_INDEP, "T5XL_BFD/T5XL_BFD_INDEP_X_TRAIN.npy")
T5XL_BFD_EMBED.extract_word_embs(X_val_INDEP, "T5XL_BFD/T5XL_BFD_INDEP_X_VAL.npy")
T5XL_BFD_EMBED.extract_word_embs(X_test_INDEP, "T5XL_BFD/T5XL_BFD_INDEP_X_TEST.npy")

##**T5XL UniRef50 Language Model:**

In [None]:
max_seq_len = 257
T5XL_UNI_EMBED = LM_EMBED('T5-XL-UNI', max_seq_len, True)

Some weights of the model checkpoint at Rostlab/prot_t5_xl_uniref50 were not used when initializing T5EncoderModel: ['decoder.block.19.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.v.weight', 'decoder.block.7.layer.1.EncDecAttention.k.weight', 'decoder.block.21.layer.1.EncDecAttention.v.weight', 'decoder.block.19.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.0.SelfAttention.q.weight', 'decoder.block.3.layer.2.DenseReluDense.wi.weight', 'decoder.block.18.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.o.weight', 'decoder.block.12.layer.0.SelfAttention.q.weight', 'decoder.block.5.layer.2.DenseReluDense.wi.weight', 'decoder.block.22.layer.1.layer_norm.weight', 'decoder.block.10.layer.2.layer_norm.weight', 'decoder.block.13.layer.1.EncDecAttention.k.weight', 'decoder.block.19.layer.2.DenseReluDense.wi.weight', 'decoder.block.10.layer.2.DenseRe

In [6]:
max_seq_len = 185
T5XL_UNI_EMBED = LM_EMBED('T5-XL-UNI', max_seq_len, True)

Some weights of the model checkpoint at Rostlab/prot_t5_xl_uniref50 were not used when initializing T5EncoderModel: ['decoder.block.1.layer.1.EncDecAttention.q.weight', 'decoder.block.23.layer.0.SelfAttention.v.weight', 'decoder.block.21.layer.1.layer_norm.weight', 'decoder.block.10.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.0.SelfAttention.q.weight', 'lm_head.weight', 'decoder.block.19.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.23.layer.2.layer_norm.weight', 'decoder.block.17.layer.1.EncDecAttention.o.weight', 'decoder.block.12.layer.0.SelfAttention.o.weight', 'decoder.block.16.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.20.layer.1.EncDecAttention.k.weight', 'decoder.block.14.layer.2.DenseReluDense.wo.weight', 'decoder.block.17.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.SelfAttention.o.weight', 'decoder.block.20.layer.2.DenseReluDense.wi.weight', '

###**Embedding Veltri Dataset:**

In [7]:
#T5XL_UNI_EMBED.extract_word_embs(X_train_Velt, "T5XL_UNI/T5XL_UNI_VELTRI_X_TRAIN.npy")
#T5XL_UNI_EMBED.extract_word_embs(X_val_Velt, "T5XL_UNI/T5XL_UNI_VELTRI_X_VAL.npy")
T5XL_UNI_EMBED.extract_word_embs(X_test_Velt, "T5XL_UNI/T5XL_UNI_VELTRI_X_TEST.npy")

Initial Embedding Batch Ending with... 100
Initial Embedding Batch Ending with... 200
Initial Embedding Batch Ending with... 300
Initial Embedding Batch Ending with... 400
Initial Embedding Batch Ending with... 500
Initial Embedding Batch Ending with... 600
Initial Embedding Batch Ending with... 700
Initial Embedding Batch Ending with... 800
Initial Embedding Batch Ending with... 900
Initial Embedding Batch Ending with... 1000
Initial Embedding Batch Ending with... 1100
Initial Embedding Batch Ending with... 1200
Initial Embedding Batch Ending with... 1300
Initial Embedding Batch Ending with... 1400
Padding Batch:  0
Padding Batch:  100
Padding Batch:  200
Padding Batch:  300
Padding Batch:  400
Padding Batch:  500
Padding Batch:  600
Padding Batch:  700
Padding Batch:  800
Padding Batch:  900
Padding Batch:  1000
Padding Batch:  1100
Padding Batch:  1200
Padding Batch:  1300
Padding Batch:  1400
/content/drive/MyDrive/Dissertation_Embeddings/Embeddings
Saving Embeddings...
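
Once a file has been saved, a quick sanity check on its shape can catch a mismatched `max_seq_len`. The sketch below assumes the saved array has shape `(n_sequences, max_len - 2, hidden_size)`, as produced by `pad`, and that real embedding rows are never exactly all zeros. `check_embeddings` is a hypothetical helper, and the demo writes a small synthetic file rather than reading a real one.

```python
import os
import tempfile

import numpy as np

def check_embeddings(path, max_len):
    """Load a saved embedding array and report its shape and per-sequence lengths."""
    arr = np.load(path)
    if arr.ndim != 3 or arr.shape[1] != max_len - 2:
        raise ValueError(f"unexpected shape {arr.shape} for max_len={max_len}")
    # A row counts as real if any of its values is non-zero (padding rows are all zero).
    lengths = np.any(arr != 0, axis=2).sum(axis=1)
    return arr.shape, lengths

# Synthetic stand-in: 2 sequences with 3 and 5 real rows, padded to 6 rows.
demo = np.zeros((2, 6, 4))
demo[0, :3] = 1.0
demo[1, :5] = 1.0

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npy")
    np.save(demo_path, demo)
    shape, lengths = check_embeddings(demo_path, max_len=8)
```

The recovered lengths can be compared against the sequence lengths in the source CSV to confirm that no residues were dropped.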


In [None]:
# Embedding the Veltri test set so that a T5XL-UNI model trained on the Independent training and validation data can predict on it:
T5XL_UNI_EMBED.extract_word_embs(X_test_Velt, "T5XL_UNI/T5XL_UNI_VELTRI_X_TEST_INDEP.npy")

###**Embedding Independent Dataset:**



In [None]:
T5XL_UNI_EMBED.extract_word_embs(X_train_INDEP, "T5XL_UNI/T5XL_UNI_INDEP_X_TRAIN.npy")
T5XL_UNI_EMBED.extract_word_embs(X_val_INDEP, "T5XL_UNI/T5XL_UNI_INDEP_X_VAL.npy")
T5XL_UNI_EMBED.extract_word_embs(X_test_INDEP, "T5XL_UNI/T5XL_UNI_INDEP_X_TEST.npy")

##**XLNET Language Model:**

In [None]:
max_seq_len = 257
XLNET_EMBED = LM_EMBED('XLNET', max_seq_len, True)

Some weights of the model checkpoint at Rostlab/prot_xlnet were not used when initializing XLNetModel: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


###**Embedding Veltri Dataset:**

In [None]:
XLNET_EMBED.extract_word_embs(X_train_Velt, "XLNET/XLNET_VELTRI_X_TRAIN.npy")
XLNET_EMBED.extract_word_embs(X_val_Velt, "XLNET/XLNET_VELTRI_X_VAL.npy")
XLNET_EMBED.extract_word_embs(X_test_Velt, "XLNET/XLNET_VELTRI_X_TEST.npy")

###**Embedding Independent Dataset:**


In [None]:
XLNET_EMBED.extract_word_embs(X_train_INDEP, "XLNET/XLNET_INDEP_X_TRAIN.npy")
XLNET_EMBED.extract_word_embs(X_val_INDEP, "XLNET/XLNET_INDEP_X_VAL.npy")
XLNET_EMBED.extract_word_embs(X_test_INDEP, "XLNET/XLNET_INDEP_X_TEST.npy")

Initial Embedding Batch Ending with... 100
Initial Embedding Batch Ending with... 200
Initial Embedding Batch Ending with... 300
Initial Embedding Batch Ending with... 400
Initial Embedding Batch Ending with... 500
Initial Embedding Batch Ending with... 600
Initial Embedding Batch Ending with... 700
Initial Embedding Batch Ending with... 800
Initial Embedding Batch Ending with... 900
Initial Embedding Batch Ending with... 1000
Initial Embedding Batch Ending with... 1100
Initial Embedding Batch Ending with... 1200
Initial Embedding Batch Ending with... 1300
Initial Embedding Batch Ending with... 1400
Initial Embedding Batch Ending with... 1500
Initial Embedding Batch Ending with... 1600
Initial Embedding Batch Ending with... 1700
Initial Embedding Batch Ending with... 1800
Initial Embedding Batch Ending with... 1900
Initial Embedding Batch Ending with... 2000
Initial Embedding Batch Ending with... 2100
Initial Embedding Batch Ending with... 2200
Initial Embedding Batch Ending with... 23