<a href="https://colab.research.google.com/github/schatz06/Thesis/blob/Bio_embeddings/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install and import embedders

In [2]:
pip install bio-embeddings[all]



In [3]:
from bio_embeddings.embed import seqvec_embedder, prottrans_bert_bfd_embedder

In [4]:
embedder_seqvec = seqvec_embedder.SeqVecEmbedder()
embedder_protbert= prottrans_bert_bfd_embedder.ProtTransBertBFDEmbedder()
max_seq = 1538 # length of the longest sequence in the datasets

weights.hdf5: 374MB [00:16, 22.8MB/s]                           
options.json: 8.19kB [00:00, 23.5kB/s]
prottrans_bert_bfd.zip: 1.56GB [01:02, 24.8MB/s]                            


Import packages

In [5]:
import numpy as np
from numpy import asarray
from numpy import save
import os

Generic method to extract ProtBert Embeddings

In [6]:
def extract_protBert(max_seq, lines, path, embedder_protbert):
  max_sequence = max_seq
  count = 1 # current line number (1-based)
  target_name = 1 # next line that holds the name of the protein
  target_primary = 2 # next line that holds the primary structure
  for line in lines:
    if count == target_name:
      line = line.replace("\n", "") # remove the newline character at the end of the string
      path_to_save = path + '/' + line + ".npy"
      target_name += 3
    if count == target_primary:
      line = line.replace("\n", "") # remove the newline character at the end of the string
      embedding_protbert = embedder_protbert.embed(line) # extract the embeddings, one row per residue
      padding = max_sequence - embedding_protbert.shape[0] # negative if the sequence is longer than max_sequence; np.pad then raises
      embedding_protbert = np.pad(embedding_protbert, ((0, padding), (0, 0)), mode='constant', constant_values=0) # zero-pad the rows up to max_sequence
      save(path_to_save, embedding_protbert) # save the padded embedding as a .npy file in the target directory
      target_primary += 3
    count += 1
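
The counter logic above assumes that each protein occupies three consecutive lines: the name, the primary structure, and a third line that is skipped. A minimal sketch of the same walk written as a generator is shown below. `iter_records` and the sample lines are hypothetical, and the content of the third line (shown here as a structure string) is an assumption. The same pass can also be used to check a value such as `max_seq`.

```python
def iter_records(lines, record_size=3):
    """Yield (name, sequence) pairs from lines grouped in fixed-size records.

    Assumes line 1 of each record is the protein name and line 2 is the
    primary structure; any remaining lines of the record are ignored.
    """
    for start in range(0, len(lines) - 1, record_size):
        name = lines[start].rstrip("\n")
        sequence = lines[start + 1].rstrip("\n")
        yield name, sequence


# Hypothetical sample in the assumed three-line layout.
sample = ["> 1abc_A\n", "MKV\n", "CHE\n", "> 2xyz_B\n", "GGSA\n", "CCHH\n"]
records = list(iter_records(sample))
longest = max(len(seq) for _, seq in records)  # length of the longest sequence
print(records, longest)
```

Iterating over `(name, sequence)` pairs keeps the file-layout assumption in one place instead of repeating the counters in every extraction function.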

Generic method to extract SeqVec Embeddings 

In [7]:
def extract_seqVec(max_seq, lines, path, embedder_seqvec):
  max_sequence = max_seq
  count = 1 # current line number (1-based)
  target_name = 1 # next line that holds the name of the protein
  target_primary = 2 # next line that holds the primary structure
  for line in lines:
    if count == target_name:
      line = line.replace("\n", "") # remove the newline character at the end of the string
      path_to_save = path + '/' + line + ".npy"
      target_name += 3
    if count == target_primary:
      line = line.replace("\n", "") # remove the newline character at the end of the string
      embedding_seqvec = embedder_seqvec.embed(line) # extract the embeddings (3D array, sequence length on axis 1)
      padding = max_sequence - embedding_seqvec.shape[1] # negative if the sequence is longer than max_sequence; np.pad then raises
      embedding_seqvec = np.pad(embedding_seqvec, ((0, 0), (0, padding), (0, 0)), mode='constant', constant_values=0) # zero-pad axis 1 up to max_sequence
      embedding_seqvec = embedding_seqvec.reshape(embedding_seqvec.shape[0], -1) # flatten the 3D array to 2D (np.save would also accept the 3D array)
      save(path_to_save, embedding_seqvec) # save the padded embedding as a .npy file in the target directory
      target_primary += 3
    count += 1
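
To make the padding and flattening step easier to follow, here is a small sketch on a stand-in array. It assumes the embedder output has the layout (layers, sequence length, features), which is what the padding specification above implies. `pad_and_flatten` is a hypothetical helper, and the tiny dimensions are chosen only for readability.

```python
import numpy as np


def pad_and_flatten(embedding, max_len):
    """Zero-pad axis 1 of a (layers, length, dim) array and flatten to 2D."""
    length = embedding.shape[1]
    if length > max_len:
        raise ValueError(f"sequence length {length} exceeds max_len {max_len}")
    padded = np.pad(embedding, ((0, 0), (0, max_len - length), (0, 0)),
                    mode="constant", constant_values=0)
    return padded.reshape(padded.shape[0], -1)


# Stand-in for an embedder output: 3 layers, 5 residues, 4 features.
fake = np.arange(3 * 5 * 4, dtype=np.float32).reshape(3, 5, 4)
flat = pad_and_flatten(fake, max_len=8)
restored = flat.reshape(3, 8, 4)  # undo the flattening after loading
print(flat.shape, restored.shape)
```

Restoring the 3D shape requires knowing the feature dimension and the padded length, so both should be recorded alongside the saved files.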

Create new directories to save the embeddings 

In [8]:
path = os.getcwd() # get the current directory
path_CASP_protBert = path + '/CASP13_ProtBert' 
path_CASP_seqVec = path + '/CASP13_SeqVec' 
path_CB513_protBert = path + '/CB513_ProtBert' 
path_CB513_seqVec = path + '/CB513_SeqVec' 
path_PISCES_protBert = path + '/PISCES_ProtBert' 
path_PISCES_seqVec = path + '/PISCES_SeqVec'  
for directory in (path_CASP_protBert, path_CASP_seqVec, path_CB513_protBert,
                  path_CB513_seqVec, path_PISCES_protBert, path_PISCES_seqVec):
  os.makedirs(directory, exist_ok=True) # does not fail if the directory already exists (e.g. when the cell is re-run)

Open files / create file handles

In [9]:
casp = open("CASP13_sorted.txt","r")
cb513 = open("CB513_sorted.txt","r")
pisces = open("PISCES_sorted.txt","r")

Read lines

In [10]:
casp_lines = casp.readlines()
cb513_lines = cb513.readlines()
pisces_lines = pisces.readlines()
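
The handles opened above are never closed explicitly. A possible alternative, sketched here on a throwaway file rather than the real datasets, reads the lines inside a `with` block so the file is closed automatically. `read_lines` and `demo_sorted.txt` are hypothetical names.

```python
import os
import tempfile


def read_lines(file_path):
    """Return all lines of a text file, closing the handle afterwards."""
    with open(file_path, "r") as handle:
        return handle.readlines()


# Demonstration on a temporary file in the assumed three-line layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_sorted.txt")
    with open(demo_path, "w") as handle:
        handle.write("> 1abc_A\nMKV\nCHE\n")
    demo_lines = read_lines(demo_path)

print(demo_lines)
```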

Extract SeqVec and ProtBert embeddings for CASP13

In [11]:
extract_protBert(max_seq,casp_lines,path_CASP_protBert,embedder_protbert)


The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).



In [12]:
extract_seqVec(max_seq,casp_lines,path_CASP_seqVec,embedder_seqvec)

Extract SeqVec and ProtBert embeddings for CB513




In [13]:
extract_protBert(max_seq,cb513_lines,path_CB513_protBert,embedder_protbert)





In [14]:
extract_seqVec(max_seq,cb513_lines,path_CB513_seqVec,embedder_seqvec)

Extract SeqVec and ProtBert embeddings for PISCES

In [None]:
extract_protBert(max_seq,pisces_lines,path_PISCES_protBert,embedder_protbert)





In [None]:
extract_seqVec(max_seq,pisces_lines,path_PISCES_seqVec,embedder_seqvec)

How to load a NumPy array saved as text

In [None]:
from numpy import loadtxt
# load array
data = loadtxt('/content/> 6btc_1_protBert.txt', delimiter=',')
print(data.shape)

(85, 1024)
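
The extraction functions above write `.npy` files with `numpy.save`, and that binary format is read with `numpy.load` rather than `loadtxt`. A minimal round-trip sketch follows, using a temporary directory and a small stand-in array. The file name `example.npy` is hypothetical, and treating all-zero rows as padding is an assumption that only holds if no real residue has an all-zero embedding.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "example.npy")  # hypothetical file name
    original = np.zeros((6, 4), dtype=np.float32)
    original[:3] = 1.0  # 3 "real" rows followed by 3 rows of zero padding
    np.save(npy_path, original)
    loaded = np.load(npy_path)

# Rows that are entirely zero are treated as padding here.
real_rows = int(np.any(loaded != 0, axis=1).sum())
print(loaded.shape, real_rows)
```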


Save a 3D array as 2D text and restore the 3D shape after loading

In [None]:

import numpy as np

arr = np.random.rand(5, 4, 3)
print(arr)

# reshape the 3D array to 2D, because savetxt only handles 1D and 2D arrays
arr_reshaped = arr.reshape(arr.shape[0], -1)

# save the reshaped array to a text file
np.savetxt("geekfile.txt", arr_reshaped)

# read the data back from the file
loaded_arr = np.loadtxt("geekfile.txt")

# loaded_arr is 2D, so reshape it back to the original 3D shape
load_original_arr = loaded_arr.reshape(
    loaded_arr.shape[0], loaded_arr.shape[1] // arr.shape[2], arr.shape[2])

# check the shapes
print("shape of arr: ", arr.shape)
print("shape of load_original_arr: ", load_original_arr.shape)

# check whether both arrays are identical
if (load_original_arr == arr).all():
    print("Yes, both the arrays are same")
else:
    print("No, both the arrays are not same")

Zip and download folders

In [None]:
!zip -r /content/CASP13_ProtBert.zip /content/CASP13_ProtBert
!zip -r /content/CASP13_SeqVec.zip /content/CASP13_SeqVec
!zip -r /content/CB513_ProtBert.zip /content/CB513_ProtBert
!zip -r /content/CB513_SeqVec.zip /content/CB513_SeqVec
!zip -r /content/PISCES_ProtBert.zip /content/PISCES_ProtBert
#!zip -r /content/PISCES_SeqVec.zip /content/PISCES_SeqVec

from google.colab import files

files.download("/content/CASP13_ProtBert.zip")
files.download("/content/CASP13_SeqVec.zip")
files.download("/content/CB513_ProtBert.zip")
files.download("/content/CB513_SeqVec.zip")
files.download("/content/PISCES_ProtBert.zip")
#files.download("/content/PISCES_SeqVec.zip")
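
Outside Colab, where the `!zip` shell command may not be available, the same archives could be produced from Python with the standard library. The sketch below runs on a temporary stand-in folder; the placeholder file is not a real embedding.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "CASP13_ProtBert")  # stand-in for a real output folder
    os.makedirs(folder)
    with open(os.path.join(folder, "example.npy"), "wb") as handle:
        handle.write(b"placeholder")

    # Writes <tmp>/CASP13_ProtBert.zip containing the folder and its files.
    archive_path = shutil.make_archive(
        base_name=folder, format="zip", root_dir=tmp, base_dir="CASP13_ProtBert"
    )
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(os.path.basename(archive_path), names)
```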