How to convert Tokens to ids correctly #140

Open
nuistZPZ opened this issue May 27, 2024 · 0 comments

I am using self.tokenizer.convert_tokens_to_ids to try to convert the model's text_feat output into input_ids and then into text. The code is as follows:

text_output = self.text_encoder.bert(text.input_ids, attention_mask=text.attention_mask,
                                     return_dict=True, mode='text')

text_embeds = text_output.last_hidden_state
text_feat = F.normalize(self.text_proj(text_embeds[:, 0, :]), dim=-1)

input_ids = self.tokenizer.convert_tokens_to_ids(text_feat[0])
# Convert `input_ids` to text
decoded_text = self.tokenizer.decode(input_ids, skip_special_tokens=True)
print('decoded_text', decoded_text)

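However, the output is always wrong: I get either all [PAD] or [100, 100]. I checked the token values and found that they are not identical, so I think my code is at fault. What is the correct way to do this?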
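
A note on the likely cause, as a minimal sketch rather than a confirmed diagnosis: convert_tokens_to_ids looks up token strings in the vocabulary, whereas text_feat is a normalized float embedding of the [CLS] position. In bert-base-uncased, id 0 is [PAD] and id 100 is [UNK], so [100, 100] would be consistent with every lookup falling back to the unknown token. The toy tokenizer below is a hypothetical stand-in (TOY_VOCAB and its two helper functions are made up for illustration and are not the transformers implementation); it assumes a BERT-style vocabulary and only reproduces the fallback behavior.

```python
# Toy stand-in for a BERT-style tokenizer. Everything here is hypothetical
# illustration code, not the transformers library.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102  # special ids as in bert-base-uncased

# The ids for "a" and "dog" are arbitrary toy values.
TOY_VOCAB = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "[CLS]": CLS_ID, "[SEP]": SEP_ID,
             "a": 1, "dog": 2}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def convert_tokens_to_ids(tokens):
    """Map each token to its id; anything not in the vocabulary becomes [UNK]."""
    return [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]


def decode(ids, skip_special_tokens=True):
    """Map ids back to tokens, optionally dropping the special ones."""
    special = {PAD_ID, UNK_ID, CLS_ID, SEP_ID}
    kept = [i for i in ids if not (skip_special_tokens and i in special)]
    return " ".join(ID_TO_TOKEN.get(i, "[UNK]") for i in kept)


# Token strings round-trip through ids and back to text.
ids = convert_tokens_to_ids(["[CLS]", "a", "dog", "[SEP]"])
text = decode(ids)

# Float feature values are not vocabulary entries, so every lookup falls back to [UNK].
feature_ids = convert_tokens_to_ids([0.12, -0.57])

print(ids, repr(text), feature_ids, repr(decode(feature_ids)))
```

If the goal is only to see the text that went into the encoder, decoding the original ids, for example self.tokenizer.decode(text.input_ids[0], skip_special_tokens=True), should work, assuming text is the tokenizer's batch output. A pooled sentence feature such as text_feat has no token ids to recover, so turning it back into text would need a separate decoder or a nearest-neighbor search over candidate sentences.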
