RoBERTa and 514 #1187

Closed

LysandreJik opened this issue Sep 26, 2019 · 4 comments

Comments
@LysandreJik

Hi! We're scratching our heads with RoBERTa and the way it handles its inputs.

The following matrix is of size 514x768:

from fairseq.models.roberta import RobertaModel

# Load the pretrained RoBERTa base checkpoint from a local directory.
model = RobertaModel.from_pretrained("../roberta.base")

# Inspect the learned positional embedding table.
print(model.model.decoder.sentence_encoder.embed_positions.weight.size())
# torch.Size([514, 768])

Why does this differ from the maximum sequence length, which is 512? Furthermore, we observe that the second row of this matrix (index 1) is full of zeros:

print(model.model.decoder.sentence_encoder.embed_positions.weight[1, :])
# tensor([0., 0., 0., 0., 0., 0., 0., 0. ... 

Why is that? Thank you.

@lematt1991
Contributor

Yes, padding_idx is 1, so the embedding at that index should always be a vector of all zeros. The positional embeddings then start at padding_idx + 1, i.e. rows 2 through 513 of the 514-row table cover the 512 positions. Hope this clears it up!
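
For intuition, here's a minimal standalone PyTorch sketch (not fairseq's actual module; only the sizes mirror RoBERTa's) of where the 514 comes from and why the row at padding_idx is all zeros:

```python
import torch.nn as nn

# Index 0 is unused, index 1 is padding, and indices 2..513 hold the
# 512 usable positions: 512 + 1 + 1 = 514 rows in total.
max_positions, padding_idx, dim = 512, 1, 768
embed_positions = nn.Embedding(
    max_positions + padding_idx + 1,  # 514 rows
    dim,
    padding_idx=padding_idx,
)

print(embed_positions.weight.size())  # torch.Size([514, 768])

# nn.Embedding zero-initializes the row at padding_idx and keeps its
# gradient at zero during training, so it stays all zeros.
print(embed_positions.weight.detach()[padding_idx].abs().sum())  # tensor(0.)
```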

@LysandreJik
Author

Ah okay, that makes sense. In that case, what is the first row? Is it full of randomly initialized values? If so, why not use padding_idx = 0? Thank you for your answer.

@lematt1991
Contributor

Yeah, the first row contains randomly initialized values that never get used. There's really no particular reason why padding_idx is 1 instead of 0, other than it being the first token added to the dictionary. We need to use the same padding_idx value for both embed_tokens and embed_positions.
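
To make the indexing concrete, here's a rough sketch of the position numbering fairseq applies (modeled on its make_positions helper; the token ids below are made up for illustration):

```python
import torch

def make_positions(tokens: torch.Tensor, padding_idx: int) -> torch.Tensor:
    """Number non-pad tokens from padding_idx + 1; map pad tokens to padding_idx."""
    mask = tokens.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

# One sentence followed by two pad tokens.
tokens = torch.tensor([[0, 31414, 232, 2, 1, 1]])
print(make_positions(tokens, padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])
```

Position 0 is never produced, which is why the randomly initialized first row is harmless.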

@LysandreJik
Author

Okay, I understand. Thank you very much for your help!

facebook-github-bot pushed a commit that referenced this issue Aug 28, 2020
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Adds some fantastic work done by Zeming Lin (ebetica). FASTA is the predominant format used by biologists for DNA, RNA, and proteins. It looks something like this:
```
>name of your protein1
MSHFAHSDFAHSDFHWEHJW
FHDSJFASJDAHASFASDFIAA
>name of your protein2
MAHASDFMASFJADSFMSMSM
MASDFJASDJ
```

There's no need for BPE or other fancy preprocessing, so we can read the FASTA file directly in fairseq with no speed hit compared to binarized data. Building the index takes time, but we can just cache it, similar to the other cached indexed datasets.
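
As a rough illustration of the offset-index idea (not the actual fairseq implementation; the function names here are made up), a cached FASTA index might look like this:

```python
import os
import pickle

def build_or_load_fasta_index(path):
    """Byte offset of each '>' header line, cached next to the FASTA file."""
    cache = path + ".idx.pkl"
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        for line in iter(f.readline, b""):
            if line.startswith(b">"):
                offsets.append(pos)
            pos = f.tell()
    with open(cache, "wb") as f:
        pickle.dump(offsets, f)
    return offsets

def get_record(path, offsets, i):
    """Return (header, sequence) for record i with a single seek."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        header = f.readline().decode().strip()
        seq = []
        for line in iter(f.readline, b""):
            if line.startswith(b">"):
                break
            seq.append(line.decode().strip())
    return header, "".join(seq)
```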

We hope this reduces the barrier for biologists to use fairseq, making this great framework even more accessible to the computational biology community!

This dataset is used internally in proteinseq, [see here for an example](https://github.com/fairinternal/proteinseq/pull/178/files).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: fairinternal/fairseq-py#1187

Reviewed By: myleott

Differential Revision: D22020223

Pulled By: ebetica

fbshipit-source-id: 372ebc199c0c9200645c79fa7722aded931e9038
sshleifer pushed a commit that referenced this issue Apr 7, 2021
yfyeung pushed a commit to yfyeung/fairseq that referenced this issue Dec 6, 2023