### Setup

We use the `BertModel` and `BertTokenizer` classes from Hugging Face's `transformers` library, plus Python's `re` (regular expressions) module to preprocess the protein sequence into the input format the model expects.

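We first load a tokenizer for protein sequences (`Rostlab/prot_bert`) from the model hub, then load the pre-trained protein sequence BERT model, which acts as our encoder. Passing `do_lower_case=False` keeps the amino-acid letters uppercase, matching the tokenizer's vocabulary.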

In [12]:
from transformers import BertModel, BertTokenizer
import re

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

### Preprocessing

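We'll define a sample protein sequence and preprocess it. The sequence is written as uppercase one-letter amino-acid codes separated by spaces, so that each residue becomes one token. The codes U (selenocysteine), O (pyrrolysine), B (aspartate or asparagine), and Z (glutamate or glutamine) are rare or ambiguous, so to standardize the model's input they are replaced by "X", the code for an unknown residue.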

In [13]:
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
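
Sequences often arrive as a plain string without spaces, for example from a FASTA file. The following is a minimal sketch of a helper that produces the spaced, standardized form used above. `prepare_sequence` is a hypothetical name, and the whitespace stripping and uppercasing are assumptions about the raw input rather than steps taken in this walkthrough.

```python
import re

def prepare_sequence(raw: str) -> str:
    """Return an uppercase, space-separated sequence with rare residues mapped to X."""
    # Drop all whitespace (existing spaces, newlines) and uppercase the letters
    compact = re.sub(r"\s+", "", raw).upper()
    # Replace the rare or ambiguous codes U, Z, O, B with X
    compact = re.sub(r"[UZOB]", "X", compact)
    # Put a single space between residues
    return " ".join(compact)

sequence_example = prepare_sequence("AETCZAO")
print(sequence_example)  # A E T C X A X
```

The result is the same string that the two-line cell above produces from the already spaced input.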

### Tokenizing

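Next, we tokenize the sequence, converting it into the integer token IDs that the BERT model consumes. Setting `return_tensors='pt'` returns PyTorch tensors, the format the PyTorch model expects. The tokenizer also adds a special token at the start and at the end, so the seven residues produce nine IDs.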

In [14]:
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
encoded_input

{'input_ids': tensor([[ 2,  6,  9, 15, 23, 25,  6, 25,  3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
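
To make this output easier to read, the IDs can be mapped back to tokens. The sketch below uses a small lookup table inferred only from this one output. It assumes that 2 and 3 are the `[CLS]` and `[SEP]` special tokens that BERT tokenizers add, and it is not the full vocabulary. `ID_TO_TOKEN` and `decode_ids` are illustrative names.

```python
# Partial ID-to-token table inferred from the output above
# (an assumption for illustration, not the tokenizer's full vocabulary)
ID_TO_TOKEN = {2: "[CLS]", 3: "[SEP]", 6: "A", 9: "E", 15: "T", 23: "C", 25: "X"}

def decode_ids(ids):
    """Map token IDs back to tokens, marking IDs outside the table as unknown."""
    return [ID_TO_TOKEN.get(i, "<?>") for i in ids]

input_ids = [2, 6, 9, 15, 23, 25, 6, 25, 3]
tokens = decode_ids(input_ids)
print(tokens)
```

With the tokenizer loaded, `tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0].tolist())` should give the same listing from the real vocabulary.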

### Fetching Embeddings from BERT

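With the tokenized sequence, we run the BERT model and read its `pooler_output`: a single fixed-size vector for the whole sequence, computed by passing the final hidden state of the first token through a linear layer and a tanh activation (hence the `TanhBackward0` in the output). This embedding can be used for further analysis, such as classification.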

In [10]:
model(**encoded_input).pooler_output

tensor([[-0.2487,  0.2626, -0.2367,  ...,  0.2503,  0.2339, -0.2556]],
       grad_fn=<TanhBackward0>)
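
`pooler_output` is one option for a sequence-level vector. Another common choice is to average the per-token vectors in `last_hidden_state`. The sketch below uses stand-in tensors rather than real model output, so it runs without downloading the model. `mean_pool` is an illustrative helper, not part of `transformers`, and the hidden size of 4 is arbitrary.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over positions where attention_mask is 1.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Stand-in tensors: 1 sequence, 9 tokens (as in this walkthrough), hidden size 4
hidden = torch.arange(36, dtype=torch.float32).reshape(1, 9, 4)
mask = torch.ones(1, 9, dtype=torch.long)
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([1, 4])
```

With the real model, this would be called as `mean_pool(output.last_hidden_state, encoded_input['attention_mask'])`, where `output = model(**encoded_input)`, ideally inside `torch.no_grad()` when gradients are not needed. This average includes the two special tokens; whether to exclude them is a design choice.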