# Using ProteinEmbedding and Creating a Dataset

The _ProteinEmbedding_ object stores a protein sequence together with its embedding, i.e. its vector representation,
which can then be manipulated using standard operations from the __skbio__ library.

To demonstrate this process, we will read in a list of protein sequences, embed them, and write the embeddings
to a dataset file that can be streamed back in with the standard skbio.read.

In [16]:
# Necessary imports
from skbio.embedding import ProteinEmbedding
from utils import load_protein_t5_embedding
from tqdm import tqdm
import numpy as np
import argparse
import skbio

## Loading FASTA File

_bagel.fa_ is a file containing bacteriocin sequences stored in FASTA format. We will parse this
file and use the resulting sequences to create our __ProteinEmbedding__ objects.

In [18]:
# Function to read a FASTA file and return a list of at most n_sequences sequences
def ReadFastaFile(filename, n_sequences):
  sequences = []
  seqFragments = []
  with open(filename, 'r') as fileObj:
    for line in fileObj:
      if line.startswith('>'):
        # A header line closes the previous record, if there is one
        if seqFragments:
          sequences.append(''.join(seqFragments))
          seqFragments = []
        # Stop once the requested number of sequences has been collected
        if len(sequences) >= n_sequences:
          break
      else:
        seqFragments.append(line.rstrip())
  # Flush the last record when the file ends before the limit is reached
  if seqFragments and len(sequences) < n_sequences:
    sequences.append(''.join(seqFragments))
  return sequences
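
To make the record-joining logic concrete, here is a minimal sketch of the same idea written as a generator, run on a small in-memory FASTA string instead of _bagel.fa_. The name `iter_fasta_sequences` and the three toy records are made up for illustration and are not part of the notebook or of skbio.

```python
import io
from itertools import islice

def iter_fasta_sequences(handle):
    """Yield one joined sequence string per FASTA record (hypothetical helper)."""
    fragments = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if seen_header:
                yield "".join(fragments)
            fragments = []
            seen_header = True
        else:
            fragments.append(line)
    # Emit the final record once the input is exhausted
    if seen_header:
        yield "".join(fragments)

# Toy input: the first record is split across two lines
demo = io.StringIO(">seq1\nCLGVGSC\nNNFAG\n>seq2\nMVRLLAK\n>seq3\nMEPQ\n")

# islice limits the result to the first two records, like n_sequences above
first_two = list(islice(iter_fasta_sequences(demo), 2))
print(first_two)  # ['CLGVGSCNNFAG', 'MVRLLAK']
```

Because the generator is lazy, `islice` stops reading as soon as the requested number of records has been produced, which may be convenient for large FASTA files.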

## Loading Embeddings

This function takes the input protein sequences and feeds them through an embedding model (prot-t5),
returning the generated embeddings.

In [19]:
def load_protein_t5_embedding(sequence, model_name, tokenizer_name):
    import torch
    from transformers import T5Tokenizer, T5EncoderModel
    # (In case we want to use ONNX model)
    # from optimum.onnxruntime import ORTModel
    tokenizer = T5Tokenizer.from_pretrained(tokenizer_name)
    model = T5EncoderModel.from_pretrained(model_name)

    # tokenize sequences and pad up to the longest sequence in the batch
    ids = tokenizer.batch_encode_plus(sequence, add_special_tokens=True, padding="longest")
    input_ids = torch.tensor(ids['input_ids'])
    attention_mask = torch.tensor(ids['attention_mask'])

    # generate embeddings
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

    return embedding_repr.last_hidden_state
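
The model card for Rostlab/prot_t5_xl_uniref50 preprocesses each sequence before tokenization: the rare residues U, Z, O and B are mapped to X, and residues are separated by spaces. The function above passes its input to the tokenizer unchanged, so the following is only a minimal sketch of that preprocessing step, under the assumption that a whole sequence is meant to be tokenized as a single input. `prepare_for_prot_t5` is a hypothetical helper, and the two short sequences are toy examples.

```python
import re

def prepare_for_prot_t5(sequence):
    """Return a ProtT5-style input string for one protein sequence (hypothetical helper)."""
    # Map the rare or ambiguous residues U, Z, O and B to X
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    # Separate residues with spaces so that each one becomes its own token
    return " ".join(cleaned)

# Toy sequences for illustration only
prepared = [prepare_for_prot_t5(s) for s in ["CLGVGSC", "MUBK"]]
print(prepared)  # ['C L G V G S C', 'M X X K']
```

A list of strings prepared this way could then be passed to `tokenizer.batch_encode_plus` in place of the raw input.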

## Passing to file

Finally, we can write the embeddings to a "bagel.h5" file, which can be used further,
as will be demonstrated in other scikit-bio tutorials.

In [20]:
# parse arguments
parser = argparse.ArgumentParser()

# Modify the default values of the arguments to match the desired values
parser.add_argument("--n_sequences", type=int, default=20)
parser.add_argument("--model_name", type=str, default="Rostlab/prot_t5_xl_uniref50")
parser.add_argument("--tokenizer_name", type=str, default="Rostlab/prot_t5_xl_uniref50")
# Pass an empty list so argparse ignores the notebook kernel's own command-line arguments
args = parser.parse_args([])

# Parse bagel.fa
sequence_list = ReadFastaFile("bagel.fa", args.n_sequences)
embed_list = []
print("Accepted protein sequences: ", sequence_list, "\n")

# Embed each protein sequence in turn
for sequence in tqdm(sequence_list):
    print("Accepted protein sequence: ", sequence, "\n")
    test_embed = load_protein_t5_embedding(sequence, args.model_name, args.tokenizer_name).numpy()
    # Flatten the trailing dimensions into a 2-D array, as expected by ProteinEmbedding
    embed_list.append(test_embed.reshape(test_embed.shape[0], -1))

# Wrap each (embedding, sequence) pair in a ProteinEmbedding object and write them all to an HDF5 file
embed_objs = (ProteinEmbedding(embedding, sequence)
              for embedding, sequence in zip(embed_list, sequence_list))
skbio.write(embed_objs, format='embed', into="bagel.h5")

# Read the file back to check that it was written correctly, then print each embedding
read_embed = iter(skbio.read("bagel.h5", format='embed', constructor=ProteinEmbedding))
for item in read_embed:
    print(item.embedding)



Accepted protein sequences:  ['CLGVGSCNNFAGCGYAIVCFW', 'MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN', 'MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF', 'MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC', 'MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP', 'MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK', 'MKSKKVNTEIDTLEFEIDNQELNGTSGSGWWYTAFKMTLAGRCGLCFTCSYECTTNNVHC', 'MAKNTSRPEIDSLSFEVENQELSGKSGAGWFTAVQLTLAGRCGRWFTGSFECTTNNVKCG', 'MKQDNFEIDSLDYEINSQELNGKSAAGWYTAVRLTVQGRCGWWFTHSYECTSPNVRCG', 'MKNPTLLPKLTAPVERPAVTSSDLKQASSVDAAWLNGDNNWSTPFAGVNAAWLNGDNNWSTPFAGVNAAWLNGDNNWSTPFAADGAE', 'MKEQKLEKITGLIPESELEEHLSGESSGAGTPAITTAISAIIGATAQSPCPTSACSKSCNK', 'MLDVIKNRKKIEEKLELPEILLEEVEEHSAMGGINTWNTTATSTSIIISETFGNKGKVCTYTVECVNNCRG', 'MLKEEKLEKITGLIPESELEEHLSGESSGAGTPAITTAISAIIAATAQSPCPTSACSKSCNK', 'MEKILDLDVQVKAQKESNDSSGDERITSFSLCTPGCAKTGSFNSYCC', 'MSMTMTLQQAVVDDEFRSVLLADPAAFGLSVESLPGAVERQDHEAIEAFTEAVVASEIYACASTCSFGPFTIACDGTTK', 'MNKKNILPQQGQPVIRLTAGQLSSQLAELS

  0%|          | 0/20 [00:00<?, ?it/s]

Accepted protein sequence:  CLGVGSCNNFAGCGYAIVCFW 



  5%|▌         | 1/20 [00:04<01:22,  4.32s/it]

Accepted protein sequence:  MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN 



 10%|█         | 2/20 [00:07<01:09,  3.85s/it]

Accepted protein sequence:  MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF 



 15%|█▌        | 3/20 [00:11<01:02,  3.69s/it]

Accepted protein sequence:  MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC 



 20%|██        | 4/20 [00:14<00:58,  3.64s/it]

Accepted protein sequence:  MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP 



 25%|██▌       | 5/20 [00:18<00:53,  3.54s/it]

Accepted protein sequence:  MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK 



 30%|███       | 6/20 [00:21<00:49,  3.55s/it]

Accepted protein sequence:  MKSKKVNTEIDTLEFEIDNQELNGTSGSGWWYTAFKMTLAGRCGLCFTCSYECTTNNVHC 



 35%|███▌      | 7/20 [00:25<00:45,  3.52s/it]

Accepted protein sequence:  MAKNTSRPEIDSLSFEVENQELSGKSGAGWFTAVQLTLAGRCGRWFTGSFECTTNNVKCG 



 40%|████      | 8/20 [00:28<00:41,  3.49s/it]

Accepted protein sequence:  MKQDNFEIDSLDYEINSQELNGKSAAGWYTAVRLTVQGRCGWWFTHSYECTSPNVRCG 



 45%|████▌     | 9/20 [00:32<00:38,  3.46s/it]

Accepted protein sequence:  MKNPTLLPKLTAPVERPAVTSSDLKQASSVDAAWLNGDNNWSTPFAGVNAAWLNGDNNWSTPFAGVNAAWLNGDNNWSTPFAADGAE 



 50%|█████     | 10/20 [00:35<00:35,  3.51s/it]

Accepted protein sequence:  MKEQKLEKITGLIPESELEEHLSGESSGAGTPAITTAISAIIGATAQSPCPTSACSKSCNK 



 55%|█████▌    | 11/20 [00:39<00:31,  3.50s/it]

Accepted protein sequence:  MLDVIKNRKKIEEKLELPEILLEEVEEHSAMGGINTWNTTATSTSIIISETFGNKGKVCTYTVECVNNCRG 



 60%|██████    | 12/20 [00:42<00:28,  3.52s/it]

Accepted protein sequence:  MLKEEKLEKITGLIPESELEEHLSGESSGAGTPAITTAISAIIAATAQSPCPTSACSKSCNK 



 65%|██████▌   | 13/20 [00:46<00:24,  3.51s/it]

Accepted protein sequence:  MEKILDLDVQVKAQKESNDSSGDERITSFSLCTPGCAKTGSFNSYCC 



 70%|███████   | 14/20 [00:49<00:20,  3.47s/it]

Accepted protein sequence:  MSMTMTLQQAVVDDEFRSVLLADPAAFGLSVESLPGAVERQDHEAIEAFTEAVVASEIYACASTCSFGPFTIACDGTTK 



 75%|███████▌  | 15/20 [00:53<00:17,  3.56s/it]

Accepted protein sequence:  MNKKNILPQQGQPVIRLTAGQLSSQLAELSEEALGDAGLEASVAACITFCAYDGVEPSCTLCCTLCAYDGE 



 80%|████████  | 16/20 [00:57<00:14,  3.59s/it]

Accepted protein sequence:  MNKKNILPQQGQPVIRLTAGQLSSQLAELSEEALGDAGLEASVTACITFCAYDGVEPSCTLCCALCAYDGE 



 85%|████████▌ | 17/20 [01:00<00:10,  3.60s/it]

Accepted protein sequence:  MNKKNILPQQGQPVIRLTAGQLSSQLAELSEEALGDAGLEASVTACITFCAYDGVEPSCTLCCTLCAYDGE 



 90%|█████████ | 18/20 [01:04<00:07,  3.61s/it]

Accepted protein sequence:  MNKKNILPQQGQPVIRLTAGQLSSQLAELSEEALGDAGLEASLTACITFCAYDGVEPSCTLCCTLCAYDGE 



 95%|█████████▌| 19/20 [01:07<00:03,  3.62s/it]

Accepted protein sequence:  MEMVLELQELDAPNELAYGDPSHGGGSNLSLLASCANSTVSLLTCH 



100%|██████████| 20/20 [01:11<00:00,  3.57s/it]


[[ 0.11630674 -0.1246348  -0.40021622 ... -0.01968804  0.0678182
   0.09913086]
 [ 0.16261135 -0.16866495 -0.32378194 ...  0.01031972  0.07828719
   0.084803  ]
 [ 0.1718336  -0.25992712 -0.3315824  ...  0.0142533   0.04639739
   0.06483438]
 ...
 [ 0.11630684 -0.12463491 -0.40021622 ... -0.01968808  0.06781818
   0.09913088]
 [ 0.14789036 -0.12967137 -0.32749158 ...  0.02746485  0.102824
   0.08326133]
 [ 0.16279021 -0.09277944 -0.21457732 ...  0.01554004  0.04875093
   0.04010428]]
[[ 0.14986856 -0.30113626 -0.39250177 ...  0.02012079  0.06142804
   0.09335212]
 [ 0.15781517 -0.22841084 -0.37898162 ...  0.00430088  0.06391165
   0.08327632]
 [ 0.13699366 -0.18653494 -0.16458543 ...  0.01036704  0.03331704
   0.05194495]
 ...
 [ 0.1718335  -0.25992715 -0.33158228 ...  0.0142533   0.04639742
   0.06483437]
 [ 0.14789064 -0.12967144 -0.32749185 ...  0.02746486  0.10282397
   0.08326131]
 [ 0.06867981 -0.23076493 -0.2405295  ...  0.0383425   0.04962576
   0.10104001]]
[[ 0.14986856 -0.30