# Utilizing ProteinEmbedding and Dataset Formation

The _ProteinEmbedding_ object stores a protein sequence together with its embedding (vector) representation,
so that it can be manipulated with standard operations from the __skbio__ library.

To demonstrate this process, we will read in a list of protein sequences, embed them, and then attempt to write
the embeddings to a dataset file that can be streamed back in with the standard skbio.read.

In [6]:
# Necessary imports
from skbio.embedding import ProteinEmbedding
from skbio.sequence import Protein
from tqdm import tqdm
import numpy as np
import argparse
import skbio

## Protein Sequences

We will now create a function to generate a list of n_prots random proteins, each 20 to 99 residues long, to be embedded.

In [7]:
def generate_proteins(n_prots):
    import numpy as np
    PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
    np.random.seed(42)
    proteins = []
    for _ in range(n_prots):
        prot = "".join(
            np.random.choice(list(PROTEIN_ALPHABET),
                             size=np.random.randint(20, 100)))
        proteins.append(prot)
    return proteins
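
The generator above reseeds NumPy's global random state on every call. As a minimal sketch of an alternative, the version below draws from a local `numpy.random.Generator` so that other code relying on `np.random` is unaffected. The helper name `generate_proteins_rng` and the `seed`, `min_len` and `max_len` parameters are hypothetical and not part of the original notebook.

```python
import numpy as np

PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def generate_proteins_rng(n_prots, seed=42, min_len=20, max_len=100):
    """Hypothetical variant of generate_proteins that uses a local RNG."""
    rng = np.random.default_rng(seed)
    residues = list(PROTEIN_ALPHABET)
    proteins = []
    for _ in range(n_prots):
        # integers() excludes the upper bound, so lengths fall in [min_len, max_len)
        length = int(rng.integers(min_len, max_len))
        proteins.append("".join(rng.choice(residues, size=length)))
    return proteins


proteins = generate_proteins_rng(3)
print([len(p) for p in proteins])
```

Calling the function twice with the same `seed` returns the same list, which keeps the tutorial reproducible without touching global state.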

## Loading FASTA File

_bagel.fa_ is a file containing bacteriocin sequences stored in FASTA format. We will parse this
file and use the resulting sequences to create our __ProteinEmbedding__ objects.

In [8]:
# Function to read a FASTA file and return a list of at most n_sequences sequences
def ReadFastaFile(filename, n_sequences):
  sequences = []
  seqFragments = []
  # the context manager closes the file even when we stop reading early
  with open(filename, 'r') as fileObj:
    for line in fileObj:
      if line.startswith('>'):
        # a new header ends the previous record, if there was one
        if seqFragments:
          sequences.append(''.join(seqFragments))
        seqFragments = []
        # stop once enough sequences have been collected
        if len(sequences) == n_sequences:
          break
      else:
        seqFragments.append(line.strip())
  # store the final record if the limit has not been reached yet
  if seqFragments and len(sequences) < n_sequences:
    sequences.append(''.join(seqFragments))
  return sequences
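
To see what the parser is expected to return, the following self-contained sketch writes a three-record FASTA file to a temporary directory and reads back the first two records. The helper name `iter_fasta_sequences` and the example records are hypothetical and are not taken from _bagel.fa_; the generator form is just one way to express the same header/fragment logic, with `itertools.islice` limiting the number of records.

```python
import itertools
import os
import tempfile


def iter_fasta_sequences(path):
    """Yield each sequence in a FASTA file as one string (hypothetical helper)."""
    fragments = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if fragments:
                    yield "".join(fragments)
                fragments = []
            elif line:
                fragments.append(line)
    # Emit the final record, which has no header after it
    if fragments:
        yield "".join(fragments)


# Made-up records; the first sequence is wrapped over two lines
fasta_text = ">seq1\nACDEFG\nHIKL\n>seq2\nMNPQRS\n>seq3\nTVWY\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.fa")
    with open(path, "w") as handle:
        handle.write(fasta_text)
    first_two = list(itertools.islice(iter_fasta_sequences(path), 2))

print(first_two)  # ['ACDEFGHIKL', 'MNPQRS']
```

Wrapped sequence lines are joined into a single string, and headers are discarded because only the residues are needed for embedding.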

## Loading Embeddings

This function takes a protein sequence and feeds it through an embedding model (ProtT5),
returning the embeddings from the encoder's last hidden layer.

In [9]:
def load_protein_t5_embedding(sequence, model_name, tokenizer_name):
    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel
    # (In case we want to use ONNX model)
    # from optimum.onnxruntime import ORTModel
    tokenizer = T5Tokenizer.from_pretrained(tokenizer_name)
    model = T5EncoderModel.from_pretrained(model_name)

    # ProtT5 expects residues separated by spaces, with rare residues (U, Z, O, B) mapped to X
    spaced_sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

    # tokenize the sequence as a batch of one (passing a bare string would split it into characters)
    ids = tokenizer.batch_encode_plus([spaced_sequence], add_special_tokens=True, padding="longest")
    input_ids = torch.tensor(ids['input_ids'])
    attention_mask = torch.tensor(ids['attention_mask'])

    # generate embeddings
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

    # drop the trailing end-of-sequence token so that there is one row per residue
    return embedding_repr.last_hidden_state[0, :len(sequence)]
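
When several sequences are embedded in one padded batch, the encoder output has shape `(batch, tokens, hidden)`, and each row still contains an end-of-sequence token plus padding. The sketch below uses random NumPy arrays in place of real model output to show how per-residue embeddings could be sliced out. The helper name `trim_batch_embeddings`, the hidden size of 8, and the assumption that the tokenizer emits one token per residue followed by a single end-of-sequence token are illustrative and not part of the original notebook.

```python
import numpy as np


def trim_batch_embeddings(hidden_states, sequences):
    """Slice a padded (batch, tokens, hidden) array into per-sequence arrays.

    Assumes one token per residue, followed by a single end-of-sequence
    token and then padding (hypothetical helper for illustration).
    """
    trimmed = []
    for row, seq in zip(hidden_states, sequences):
        # Keep only the first len(seq) token positions: one row per residue
        trimmed.append(row[: len(seq)])
    return trimmed


sequences = ["ACDE", "MKVLAG"]
hidden_size = 8
max_tokens = max(len(s) for s in sequences) + 1  # +1 for the end-of-sequence token

# Stand-in for model(...).last_hidden_state converted to NumPy
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(len(sequences), max_tokens, hidden_size))

per_residue = trim_batch_embeddings(hidden_states, sequences)
for seq, emb in zip(sequences, per_residue):
    print(seq, emb.shape)
```

Each resulting array has as many rows as its sequence has residues, which is the shape a per-residue embedding object needs.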

## Passing to File

Finally, we can write the embeddings to a "bagel.h5" file, which can be used further,
as will be demonstrated in other scikit-bio tutorials.

In [10]:
# parse arguments
parser = argparse.ArgumentParser()

# Modify the default values of the arguments to match the desired values
parser.add_argument("--n_sequences", type=int, default=2)
parser.add_argument("--model_name", type=str, default="Rostlab/prot_t5_xl_uniref50")
parser.add_argument("--tokenizer_name", type=str, default="Rostlab/prot_t5_xl_uniref50")
# pass an empty argument list so that argparse ignores the notebook's own command line
args = parser.parse_args([])

# Parse bagel.fa
sequence_list = ReadFastaFile("bagel.fa", args.n_sequences)
embed_list = []
print("Accepted protein sequences: ", sequence_list, "\n")

# Embed each protein sequence
for sequence in tqdm(sequence_list):
    print("Accepted protein sequence: ", sequence, "\n")
    test_embed = load_protein_t5_embedding(sequence, args.model_name, args.tokenizer_name).numpy()
    # flatten to 2D (one row per sequence position) to fit the skbio format
    embed_list.append(test_embed.reshape(test_embed.shape[0], -1))

# wrap each (embedding, sequence) pair in a ProteinEmbedding object and write them to HDF5
embed_objs = (ProteinEmbedding(embedding, sequence)
              for embedding, sequence in zip(embed_list, sequence_list))
skbio.write(embed_objs, format='embed', into="bagel.h5")

# test if the file was written correctly by reading it back and printing the embeddings
read_embed = skbio.read("bagel.h5", format='embed', constructor=ProteinEmbedding)
for item in read_embed:
    print(item.embedding)



Accepted protein sequences:  ['CLGVGSCNNFAGCGYAIVCFW', 'MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN'] 



  0%|          | 0/2 [00:00<?, ?it/s]

Accepted protein sequence:  CLGVGSCNNFAGCGYAIVCFW 



 50%|█████     | 1/2 [01:22<01:22, 82.54s/it]

Accepted protein sequence:  MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN 



100%|██████████| 2/2 [03:27<00:00, 103.67s/it]


AttributeError: 'ProteinEmbedding' object has no attribute 'metadata'