#**Description of Notebook:**
This notebook uses the pretrained ProtTrans models described in https://www.biorxiv.org/content/10.1101/2020.07.12.199554v1.full.pdf and available at https://github.com/agemagician/ProtTrans to create an embedding vector for each sequence in the input dataset. The example further down creates word embeddings with the T5 UniRef50 pre-trained model, using the `LM_EMBED` class defined in this notebook.

#**Imports:**

In [None]:
!pip install -q transformers sentencepiece

In [2]:
import torch
from transformers import BertModel, BertTokenizer, XLNetModel, XLNetTokenizer, T5EncoderModel, T5Tokenizer
import re
import os
import requests
import gc
from tqdm.auto import tqdm
import pandas as pd
import numpy as np

In [None]:
print('Mounting google drive...')
from google.colab import drive
drive.mount('/content/drive')
%cd "INSERT_GOOGLE_DRIVE_LOC"

#**Function To Extract Word Embeddings:**

In [5]:
class LM_EMBED:

  def __init__(self, language_model, max_len, rare_aa):
    self.lang_model = language_model
    self.max_len = max_len
    self.rare_aa = rare_aa

    # Import tokenizer and model from ProtTrans Pre-Trained Rostlab:
    if self.lang_model == 'BERT-BFD':
      self.tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False)
      self.model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
    elif self.lang_model == 'BERT':
      self.tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
      self.model = BertModel.from_pretrained("Rostlab/prot_bert")
    elif self.lang_model == 'T5-XL-BFD':
      self.tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False )
      self.model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")
      gc.collect()
    elif self.lang_model == 'T5-XL-UNI':
      self.tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False )
      self.model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
      gc.collect()
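    elif self.lang_model == 'XLNET':
      self.tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)
      self.model = XLNetModel.from_pretrained("Rostlab/prot_xlnet", mem_len=512)
    else:
      # Failing early, otherwise self.model and self.tokenizer would be left undefined:
      raise ValueError(f"Unsupported language model: {self.lang_model}")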


  # Function to use the specified model and tokenizer to create word embedding array:
  def extract_word_embs(self, seq_df, filename):

    # Setting device to GPU if available:
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Assigning model to GPU if available, and setting to eval mode:
    self.model = self.model.to(device)
    self.model = self.model.eval()

    # Making a list of sequences from the df:
    seqs_list = seq_df.Sequence.to_list()

    # Adding spaces in between sequence letters (amino acids):
    seqs_spaced = self.add_spaces(seqs_list)

    # Map rarely occurring amino acids (U, Z, O, B) to (X) if they are present in the dataset:
    if self.rare_aa:
      seqs_spaced = [re.sub(r"[UZOB]", "X", sequence) for sequence in seqs_spaced]

    # ID list tokenized:
    ids = self.tokenizer.batch_encode_plus(seqs_spaced, add_special_tokens=True, padding = 'max_length', max_length = self.max_len)

    # Retrieving the input IDs and mask for attention as tensors:
    input_ids = torch.tensor(ids['input_ids']).to(device)
    attention_mask = torch.tensor(ids['attention_mask']).to(device)

    # Emptying cache to ensure enough memory:
    torch.cuda.empty_cache()

    # Loop to process the sequences into embeddings in batches of 10:
    batch_size = 10
    emb_batches = []
    for start in range(0, len(input_ids), batch_size):
      end = start + batch_size
      if end%100 == 0:
        print("Initial Embedding Batch Ending with...", end)
      with torch.no_grad():
        embeddings = self.model(input_ids=input_ids[start:end],
                                attention_mask=attention_mask[start:end])[0]
      # Moving each batch to the CPU as a NumPy array:
      emb_batches.append(embeddings.cpu().numpy())

    # Concatenating all batches once, rather than re-copying the array after every batch:
    embedding_res = np.concatenate(emb_batches)

    # Extracting features using the function below:
    features = self.extract_features(embedding_res, attention_mask) 

    # Padding these features to a specified max length with zeros:
    padded_arr = self.pad(features)

    # Ensuring we are in the correct location to save the embeddings:
    %cd "INSERT_EMBEDDINGS_FOLDER_LOC"

    # Saving array:
    print("Saving Embeddings...")
    np.save(filename, padded_arr)


  # Function to add spaces between the amino acids in each sequence:
  def add_spaces(self, df_col):
    return [" ".join(x) for x in df_col]

  # Function to remove any CLS or SEP tokens, just leaving features:
  def extract_features(self, emb_res, att_msk):
    features = [] 

    for seq_num in range(len(emb_res)):
      # Number of real (non-padding) tokens, converted to a plain int for slicing:
      seq_len = int((att_msk[seq_num] == 1).sum())

      if self.lang_model in ['BERT-BFD', 'BERT']:
        seq_emd = emb_res[seq_num][1:seq_len-1]

      elif self.lang_model in ['T5-XL-BFD', 'T5-XL-UNI']:
        seq_emd = emb_res[seq_num][:seq_len-1]

      elif self.lang_model == 'XLNET':
        padded_seq_len = len(att_msk[seq_num])
        seq_emd = emb_res[seq_num][padded_seq_len-seq_len:padded_seq_len-2]

      features.append(seq_emd)
    
    features_arr = np.array(features, dtype=object)

    return features_arr

  # Function to add zeros to pad all features to max length:
  def pad(self, features):
    dim1 = self.max_len-2   # reducing by 2 for CLS and SEP tokens which have already been removed
    dim2 = features[0].shape[1]

    # Preallocating the full output once; this also keeps the result 3-D when only one sequence is given:
    padded_arr = np.zeros((len(features), dim1, dim2))

    for i in range(len(features)):
      if i%100 == 0:
        print("Padding Batch: ", i)

      # Copying the features into the top-left corner, leaving the remaining rows as zeros:
      padded_arr[i, :features[i].shape[0], :features[i].shape[1]] = features[i]
    
    return padded_arr
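
The slicing in `extract_features` depends on where each tokenizer places its special tokens. The sketch below illustrates the three cases on toy arrays. It assumes, as the class above does, that the BERT models wrap the residues in a leading and a trailing special token with right padding, that the T5 models append a single end-of-sequence token with right padding, and that XLNet pads on the left and appends two special tokens at the end. `strip_special_tokens` and its `layout` labels are hypothetical names for this illustration, not part of the notebook or of ProtTrans.

```python
import numpy as np

def strip_special_tokens(emb, mask, layout):
    """Return only the residue rows of one sequence's embedding matrix.

    emb: (padded_len, hidden) array; mask: (padded_len,) array of 0/1.
    layout: 'bert', 't5', or 'xlnet' (hypothetical labels for this sketch).
    """
    seq_len = int(mask.sum())          # number of real (non-padding) tokens
    if layout == 'bert':               # [CLS] r1 ... rn [SEP] pad ...
        return emb[1:seq_len - 1]
    if layout == 't5':                 # r1 ... rn </s> pad ...
        return emb[:seq_len - 1]
    if layout == 'xlnet':              # pad ... r1 ... rn <sep> <cls>
        padded_len = len(mask)
        return emb[padded_len - seq_len:padded_len - 2]
    raise ValueError(f"Unknown layout: {layout}")

# Toy input: 3 residues, padded to 7 positions, hidden size 2.
# Row k of `emb` is filled with the value k so the kept rows are easy to see.
emb = np.repeat(np.arange(7, dtype=float)[:, None], 2, axis=1)

bert_mask  = np.array([1, 1, 1, 1, 1, 0, 0])   # 2 special tokens + 3 residues
t5_mask    = np.array([1, 1, 1, 1, 0, 0, 0])   # 3 residues + 1 special token
xlnet_mask = np.array([0, 0, 1, 1, 1, 1, 1])   # left-padded, 3 residues + 2 special tokens

bert_rows  = strip_special_tokens(emb, bert_mask, 'bert')[:, 0]
t5_rows    = strip_special_tokens(emb, t5_mask, 't5')[:, 0]
xlnet_rows = strip_special_tokens(emb, xlnet_mask, 'xlnet')[:, 0]
print(bert_rows, t5_rows, xlnet_rows)
```

In each case exactly three rows survive, one per residue, even though they sit at different positions of the padded matrix.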
    

#**Create Embeddings Using Function Above:**

##**Loading Datasets:**

###**Veltri Dataset:**

In [6]:
# Load Dataset:
X_train_Velt = pd.read_csv('Veltri_Dataset/X_train_Velt.csv')
X_val_Velt = pd.read_csv('Veltri_Dataset/X_val_Velt.csv')
X_test_Velt = pd.read_csv('Veltri_Dataset/X_test_Velt.csv')

###**LMPred Dataset:**

In [7]:
X_train = pd.read_csv('LM_Pred_Dataset/X_train.csv')
X_val = pd.read_csv('LM_Pred_Dataset/X_val.csv')
X_test = pd.read_csv('LM_Pred_Dataset/X_test.csv')
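
`extract_word_embs` reads the sequences from a column named `Sequence`, so each CSV loaded above is expected to provide one. As a quick check before embedding, the sketch below reproduces the two text preprocessing steps (spacing the residues and mapping U, Z, O, B to X) on made-up sequences and derives a maximum length from the longest one. The helper name `preprocess` and the toy sequences are assumptions for illustration only.

```python
import re

def preprocess(seqs, rare_aa=True):
    """Space-separate residues and optionally map rare amino acids to X."""
    spaced = [" ".join(s) for s in seqs]
    if rare_aa:
        spaced = [re.sub(r"[UZOB]", "X", s) for s in spaced]
    return spaced

# Made-up sequences; in the notebook these would come from
# a call such as X_train.Sequence.to_list().
toy_seqs = ["GLFDIVKK", "KWKLUB"]

spaced = preprocess(toy_seqs)
print(spaced)   # ['G L F D I V K K', 'K W K L X X']

# Longest raw sequence plus 2 positions reserved for special tokens,
# the same allowance the notebook uses when setting max_seq_len.
max_len = max(len(s) for s in toy_seqs) + 2
print(max_len)  # 10
```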

# **Example Embeddings:**
- The example below shows how language model word embeddings were created using the T5-XL model pre-trained on the UniRef50 dataset of proteins.

##**T5XL UniRef50 Language Model:**

In [None]:
# Specifying the max sequence length in the given dataset (255 for the LMPred Dataset, then adding 2 to account for special [CLS, SEP] tokens added by the language models):
max_seq_len = 257
T5XL_UNI_EMBED = LM_EMBED('T5-XL-UNI', max_seq_len, True)
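
With `max_seq_len = 257`, `pad` produces arrays whose second dimension is `max_seq_len - 2 = 255`, and shorter sequences are filled with zero rows. A minimal sketch of that shape arithmetic on toy data follows. `pad_features` and the tiny hidden size are hypothetical stand-ins for the class method and the real model width.

```python
import numpy as np

def pad_features(features, max_len):
    """Zero-pad variable-length (length, hidden) arrays to (max_len - 2, hidden)."""
    dim1 = max_len - 2
    dim2 = features[0].shape[1]
    out = np.zeros((len(features), dim1, dim2))
    for i, feat in enumerate(features):
        out[i, :feat.shape[0], :] = feat
    return out

# Two toy "embeddings" of 3 and 5 residues with hidden size 4.
features = [np.ones((3, 4)), np.ones((5, 4))]
padded = pad_features(features, max_len=257)

print(padded.shape)              # (2, 255, 4)
print(padded[0, :, 0].sum())     # 3.0, only the first 3 rows are non-zero
```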

###**Embedding Veltri Dataset:**

In [None]:
T5XL_UNI_EMBED.extract_word_embs(X_train_Velt, "Veltri_Embeddings/T5XL_UNI_VELTRI_X_TRAIN.npy")
T5XL_UNI_EMBED.extract_word_embs(X_val_Velt, "Veltri_Embeddings/T5XL_UNI_VELTRI_X_VAL.npy")
T5XL_UNI_EMBED.extract_word_embs(X_test_Velt, "Veltri_Embeddings/T5XL_UNI_VELTRI_X_TEST.npy")

###**Embedding LMPred Dataset:**

In [None]:
T5XL_UNI_EMBED.extract_word_embs(X_train, "LMPred_Embeddings/T5XL_UNI_INDEP_X_TRAIN.npy")
T5XL_UNI_EMBED.extract_word_embs(X_val, "LMPred_Embeddings/T5XL_UNI_INDEP_X_VAL.npy")
T5XL_UNI_EMBED.extract_word_embs(X_test, "LMPred_Embeddings/T5XL_UNI_INDEP_X_TEST.npy")
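
The saved `.npy` files can be read back with `np.load`. Because `pad` fills unused positions with zeros, the original sequence lengths can usually be recovered by counting the non-zero rows. The sketch below demonstrates this on a small synthetic array written to a temporary file. The file name and array contents are placeholders, not the notebook's outputs.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-in for a saved embedding array:
# 2 sequences, 6 positions, hidden size 3.
toy = np.zeros((2, 6, 3))
toy[0, :4] = 0.5   # first sequence has 4 residues
toy[1, :2] = 0.5   # second sequence has 2 residues

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")  # placeholder file name
    np.save(path, toy)
    loaded = np.load(path)

# A position counts as padding if every value in its embedding row is zero.
is_residue = np.any(loaded != 0, axis=-1)
lengths = is_residue.sum(axis=1)

print(loaded.shape)   # (2, 6, 3)
print(lengths)        # [4 2]
```

The boolean `is_residue` array can also serve as a mask when a downstream model should ignore padded positions.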