
How to use embeddings from a pretrained BERT model for my classification task? #133

niloole opened this issue Dec 16, 2022 · 0 comments

niloole commented Dec 16, 2022

Hello,
I want to use a pretrained BERT model to get embeddings, and then use those embeddings with an SVM for binary classification.

  1. How can I get the embeddings? Is my code below correct for getting them? Which one is the embedding: sequence_output, pooled_output, or the mean-pooled vector?
import torch
from tape import ProteinBertModel, TAPETokenizer
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # iupac is the vocab for TAPE models, use unirep for the UniRep model
# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)

sequence_output = output[0]  # per-token embeddings
pooled_output = output[1]

# NOTE: pooled_output is *not* trained for the transformer, do not use
# w/o fine-tuning. A better option for now is to simply take a mean of
# the sequence output

# Mean over the token axis: len(sequence) + 2 tokens, counting the
# start/end special tokens; equivalent to sum(...) / (len(sequence) + 2).
embedding = sequence_output[0].mean(dim=0)

print(sequence_output.size())  # Result of run: torch.Size([1, 38, 768])
print(pooled_output.size())    # Result of run: torch.Size([1, 768])
print(embedding.size())        # Result of run: torch.Size([768])
  2. How can I use this code for 100 protein sequences? Should I use a for loop?
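For question 2, a plain Python loop over the sequences works, since the model accepts one sequence at a time. The sketch below shows just the loop and mean-pooling structure; `embed_tokens` is a dummy stand-in for the real TAPE call (`sequence_output = model(torch.tensor([tokenizer.encode(seq)]))[0][0]`), and the second sequence is a hypothetical example, so the snippet runs on its own:

```python
def embed_tokens(sequence):
    # Dummy stand-in for the model: one 4-dim vector per token,
    # counting the 2 special tokens the tokenizer adds.
    n_tokens = len(sequence) + 2
    return [[float(i)] * 4 for i in range(n_tokens)]

def mean_pool(per_token_vectors):
    # Average over the token axis -> one fixed-size embedding per sequence.
    n = len(per_token_vectors)
    dim = len(per_token_vectors[0])
    return [sum(v[d] for v in per_token_vectors) / n for d in range(dim)]

sequences = [
    'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ',
    'MKTAYIAKQR',  # hypothetical second sequence
]

embeddings = []
for seq in sequences:
    # Real version: token_ids = torch.tensor([tokenizer.encode(seq)])
    #               per_token = model(token_ids)[0][0]
    per_token = embed_tokens(seq)
    embeddings.append(mean_pool(per_token))

print(len(embeddings), len(embeddings[0]))  # 2 4
```

With the real model, the only change is replacing `embed_tokens` with the tokenizer-plus-model call, so each sequence produces one fixed-size vector regardless of its length.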

Thank you in advance!
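For the downstream step described at the top (feeding the pooled embeddings to an SVM for binary classification), a minimal sketch assuming scikit-learn is available; the 2-dim embeddings and labels here are toy placeholders, not real model output:

```python
from sklearn.svm import SVC

# Toy placeholder embeddings; in practice use one mean-pooled 768-dim
# vector per sequence (e.g. embedding.detach().numpy()) and its label.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 1, 1]

clf = SVC(kernel='rbf')  # default RBF kernel; 'linear' is also common here
clf.fit(X_train, y_train)

preds = clf.predict([[0.0, 0.5], [5.0, 5.5]])
print(preds)
```

The same fit/predict pattern applies once the placeholder rows are replaced by the per-sequence embeddings collected in the loop above.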
