
How to use embeddings from a pretrained BERT model for my classification task? #133

niloole opened this issue Dec 16, 2022 · 0 comments

niloole commented Dec 16, 2022

Hello,
I want to use a pretrained BERT model to get embeddings, and then use those embeddings with an SVM for binary classification.

  1. How can I get the embeddings? Is my code below correct for getting them? Which one is the embedding: sequence_output, pooled_output, or the mean-pooled vector?
import torch
from tape import ProteinBertModel, TAPETokenizer
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # iupac is the vocab for TAPE models, use unirep for the UniRep model
# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)

sequence_output = output[0]  # per-token embeddings
pooled_output = output[1]

# NOTE: pooled_output is *not* trained for the transformer, do not use
# w/o fine-tuning. A better option for now is to simply take a mean of
# the sequence output

# Mean over the token axis: len(sequence) + 2 tokens, counting the
# start/end special tokens; equivalent to sum(...) / (len(sequence) + 2).
embedding = sequence_output[0].mean(dim=0)

print(sequence_output.size())  # Result of run: torch.Size([1, 38, 768])
print(pooled_output.size())    # Result of run: torch.Size([1, 768])
print(embedding.size())        # Result of run: torch.Size([768])
  2. How can I use this code for 100 protein sequences? Should I use a for loop?
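For question 2, a plain Python loop over the sequences works, since the model accepts one sequence at a time. The sketch below shows just the loop and mean-pooling structure; `embed_tokens` is a dummy stand-in for the real TAPE call (`sequence_output = model(torch.tensor([tokenizer.encode(seq)]))[0][0]`), and the second sequence is a hypothetical example, so the snippet runs on its own:

```python
def embed_tokens(sequence):
    # Dummy stand-in for the model: one 4-dim vector per token,
    # counting the 2 special tokens the tokenizer adds.
    n_tokens = len(sequence) + 2
    return [[float(i)] * 4 for i in range(n_tokens)]

def mean_pool(per_token_vectors):
    # Average over the token axis -> one fixed-size embedding per sequence.
    n = len(per_token_vectors)
    dim = len(per_token_vectors[0])
    return [sum(v[d] for v in per_token_vectors) / n for d in range(dim)]

sequences = [
    'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ',
    'MKTAYIAKQR',  # hypothetical second sequence
]

embeddings = []
for seq in sequences:
    # Real version: token_ids = torch.tensor([tokenizer.encode(seq)])
    #               per_token = model(token_ids)[0][0]
    per_token = embed_tokens(seq)
    embeddings.append(mean_pool(per_token))

print(len(embeddings), len(embeddings[0]))  # 2 4
```

With the real model, the only change is replacing `embed_tokens` with the tokenizer-plus-model call, so each sequence produces one fixed-size vector regardless of its length.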

Thank you in advance!
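For the downstream step described at the top (feeding the pooled embeddings to an SVM for binary classification), a minimal sketch assuming scikit-learn is available; the 2-dim embeddings and labels here are toy placeholders, not real model output:

```python
from sklearn.svm import SVC

# Toy placeholder embeddings; in practice use one mean-pooled 768-dim
# vector per sequence (e.g. embedding.detach().numpy()) and its label.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 1, 1]

clf = SVC(kernel='rbf')  # default RBF kernel; 'linear' is also common here
clf.fit(X_train, y_train)

preds = clf.predict([[0.0, 0.5], [5.0, 5.5]])
print(preds)
```

The same fit/predict pattern applies once the placeholder rows are replaced by the per-sequence embeddings collected in the loop above.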
