
How should token_type_id be set during fine-tuning? #48

Closed

AOZMH opened this issue Aug 9, 2020 · 4 comments


AOZMH commented Aug 9, 2020

In BERT, for tasks whose input is a pair of sentences (e.g. semantic similarity scoring), the token_type_embedding is commonly used to give the two sentences different embeddings; with the transformers API this only requires setting token_type_ids (all 0 for one sentence, all 1 for the other).
However, ELECTRA's pre-training seems to have dropped the NSP task, so presumably this sentence embedding was never trained (sorry, I'm not entirely sure; the paper doesn't seem to mention it 😂). So I'd like to ask: does the token_type_id approach still work for ELECTRA? If not, what is the recommended way to handle two-sentence inputs? Thanks!


ymcui (Owner) commented Aug 9, 2020

TF version: https://github.com/ymcui/Chinese-ELECTRA/blob/master/finetune/classification/classification_tasks.py#L65
PT version: https://huggingface.co/transformers/model_doc/electra.html#electraforsequenceclassification

See the TF version: it is actually no different from BERT, and you still set token_type_ids. For the PyTorch version, see the corresponding API in the transformers library.
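For the PyTorch side, a minimal sketch of what this looks like with the transformers API (the checkpoint name, sentences, and label below are placeholders, not from this thread): the tokenizer produces token_type_ids for a sentence pair exactly as it does for BERT, and they are passed to ElectraForSequenceClassification unchanged.

```python
# Minimal sketch, assuming a Chinese-ELECTRA discriminator checkpoint
# ("hfl/chinese-electra-base-discriminator" is used here as a placeholder).
import torch
from transformers import ElectraTokenizer, ElectraForSequenceClassification

tokenizer = ElectraTokenizer.from_pretrained("hfl/chinese-electra-base-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "hfl/chinese-electra-base-discriminator", num_labels=2
)

# Encoding a sentence pair: token_type_ids come back as 0 for the first
# sentence (plus [CLS] and the first [SEP]) and 1 for the second, as in BERT.
inputs = tokenizer("今天天气很好", "明天应该也不错", return_tensors="pt")
print(inputs["token_type_ids"])

# The ids are fed to the model as-is; no extra handling is needed for ELECTRA.
outputs = model(**inputs, labels=torch.tensor([1]))
loss, logits = outputs[:2]
print(loss, logits)
```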


AOZMH commented Aug 9, 2020

Got it, thanks for the reply! One thing I'm curious about, though: at which stage was the embedding corresponding to token_type_ids pre-trained?


ymcui (Owner) commented Aug 9, 2020

See the pre-training code here: https://github.com/ymcui/Chinese-ELECTRA/blob/master/build_pretraining_dataset.py#L57
There are in fact two segments during pre-training.
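To illustrate what the two segments look like, here is a hedged sketch (not the actual code in build_pretraining_dataset.py; the function name and token lists are illustrative only) of how a [CLS] A [SEP] B [SEP] example and its segment ids, i.e. token_type_ids, are typically assembled, which is where the token type embedding gets trained during pre-training:

```python
# Illustrative sketch only; the repository's build_pretraining_dataset.py has its own logic.
def make_segments(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment (token type) ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment id 0 covers [CLS], segment A, and the first [SEP];
    # segment id 1 covers segment B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = make_segments(["今天", "天气", "很", "好"], ["明天", "也", "不错"])
print(tokens)       # ['[CLS]', '今天', '天气', '很', '好', '[SEP]', '明天', '也', '不错', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```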


AOZMH commented Aug 10, 2020

Got it, thanks for the help!

AOZMH closed this as completed Aug 10, 2020