### Setup

We use the BERT model and tokenizer from Hugging Face's `transformers` library, plus Python's `re` (regular expressions) module to preprocess the protein sequence into the input format the model expects.

We first load the protein-sequence tokenizer from the model hub, then the pre-trained protein BERT model (`Rostlab/prot_bert`), which acts as our encoder.

In [8]:
from transformers import BertModel, BertTokenizer, BertConfig
import re

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

In [10]:
config = BertConfig.from_pretrained("Rostlab/prot_bert")
config.max_position_embeddings


40000

### Preprocessing

We'll define a sample protein sequence and preprocess it. The model expects uppercase amino-acid letters separated by spaces. The amino acids U, Z, O, and B are much less common than the others, so to standardize the input they are replaced by "X".

In [3]:
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
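When sequences arrive as plain strings without spaces, the same substitution can be wrapped in a small helper. The sketch below is an illustration only: `preprocess_sequence` is a hypothetical name, and the uppercasing step is an assumption based on the tokenizer being loaded with `do_lower_case=False`.

```python
import re

def preprocess_sequence(raw: str) -> str:
    """Hypothetical helper: normalize a raw protein string for the tokenizer."""
    # Drop any existing whitespace and uppercase the residues (assumption)
    residues = re.sub(r"\s+", "", raw).upper()
    # Map the rare amino acids U, Z, O, B to the placeholder "X"
    residues = re.sub(r"[UZOB]", "X", residues)
    # Separate residues with single spaces
    return " ".join(residues)

print(preprocess_sequence("AETCZAO"))  # A E T C X A X
```

Applying the helper to every sequence before tokenizing keeps the substitution consistent across a whole batch.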

### Tokenizing

Now we tokenize the sequences to convert them into the input format the BERT model expects. Passing `return_tensors='pt'` makes the tokenizer return PyTorch tensors, which is what the model consumes.

In [6]:
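tokens = tokenizer(
    [
        sequence_Example,
        "A E T C Z A X",
        "A E T C Z A X X X X X X X",
        "A E T C Z A X X X X X",
        "A E T C Z A X X X X X X X",
    ],
    padding=True,  # pad shorter sequences so the batch stacks into one tensor
    truncation=True,
    max_length=512,
    return_tensors='pt',
)

Without `padding=True`, this call fails because the sequences differ in length:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

The `tokens` output shown next corresponds to a batch of only the first two sequences, which have equal length. The second one is a literal string that bypasses the `re.sub` step, so its `Z` keeps its own token ID (28) instead of the ID for `X` (25).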

In [5]:
tokens

{'input_ids': tensor([[ 2,  6,  9, 15, 23, 25,  6, 25,  3],
        [ 2,  6,  9, 15, 23, 28,  6, 25,  3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}
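To illustrate what padding does to a batch of unequal-length sequences (the situation behind the `ValueError` above), here is a plain-Python sketch. It is not the tokenizer's implementation: `pad_batch` is a hypothetical helper, and `pad_id=0` and right-side padding are assumptions about this tokenizer's defaults.

```python
def pad_batch(batch, pad_id=0):
    """Pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    # Append pad_id on the right until every row has max_len entries
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

short = [2, 6, 9, 15, 23, 25, 6, 25, 3]  # first row of the output above
longer = short[:-1] + [25, 25, 3]        # hypothetical sequence with two extra X
ids, mask = pad_batch([short, longer])
print(ids[0])   # [2, 6, 9, 15, 23, 25, 6, 25, 3, 0, 0]
print(mask[0])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Once every row has the same length, the batch can be stacked into a single rectangular tensor, and the attention mask records which positions are real tokens.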

### Fetching Embeddings from BERT

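With the tokenized batch, we fetch the representations from the BERT model. `last_hidden_state` holds one 1024-dimensional vector per token; averaging over the token dimension (`dim=1`) yields a single fixed-size embedding per sequence. These embeddings can be used for further analysis, such as classification.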

In [56]:

encodings = model(**tokens).last_hidden_state.mean(dim=1)

In [57]:

len(encodings[1])

1024

In [55]:
encodings[1]

tensor([ 0.0639,  0.0582, -0.0569,  ..., -0.0532, -0.0593,  0.0873],
       grad_fn=<SelectBackward0>)
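A plain `.mean(dim=1)` averages over every position, including any padding added to a batch. One common alternative is to weight the average by the attention mask. The sketch below is a minimal illustration under that assumption, not the notebook's method: `masked_mean_pool` is a hypothetical helper, and the toy tensors stand in for real model output.

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)         # (batch, 1), avoids division by zero
    return summed / counts

# Toy stand-in for model output: one sequence, three positions, hidden size 2
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])  # the last position is padding
print(masked_mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

With the objects from this notebook, the call would look like `masked_mean_pool(model(**tokens).last_hidden_state, tokens["attention_mask"])`. Wrapping the forward pass in `torch.no_grad()` skips gradient tracking when the embeddings are only used for inference; the `grad_fn` in the output above shows that gradients are tracked by default. The `[CLS]` and `[SEP]` tokens still contribute to this average, since their mask value is 1.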