RoBERTa and 514 #1187

Closed

LysandreJik opened this issue Sep 26, 2019 · 4 comments

Comments
@LysandreJik

Hi! We're scratching our heads with RoBERTa and the way it handles its inputs.

The following matrix is of size 514x768:

from fairseq.models.roberta import RobertaModel

# Load the pretrained RoBERTa base checkpoint from a local directory.
model = RobertaModel.from_pretrained("../roberta.base")

# Inspect the learned positional embedding table.
print(model.model.decoder.sentence_encoder.embed_positions.weight.size())
# torch.Size([514, 768])

Why does this differ from the maximum sequence length, which is 512? Furthermore, we observe that the second row of this matrix (index 1) is full of zeros:

print(model.model.decoder.sentence_encoder.embed_positions.weight[1, :])
# tensor([0., 0., 0., 0., 0., 0., 0., 0. ... 

Why is that? Thank you.

@lematt1991
Contributor

Yes, padding_idx is 1, so the embedding at that index should always be a vector of all zeros. The positional embeddings then start at padding_idx + 1, i.e. rows 2 through 513 of the 514-row table cover the 512 positions. Hope this clears it up!
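
For intuition, here's a minimal standalone PyTorch sketch (not fairseq's actual module; only the sizes mirror RoBERTa's) of where the 514 comes from and why the row at padding_idx is all zeros:

```python
import torch.nn as nn

# Index 0 is unused, index 1 is padding, and indices 2..513 hold the
# 512 usable positions: 512 + 1 + 1 = 514 rows in total.
max_positions, padding_idx, dim = 512, 1, 768
embed_positions = nn.Embedding(
    max_positions + padding_idx + 1,  # 514 rows
    dim,
    padding_idx=padding_idx,
)

print(embed_positions.weight.size())  # torch.Size([514, 768])

# nn.Embedding zero-initializes the row at padding_idx and keeps its
# gradient at zero during training, so it stays all zeros.
print(embed_positions.weight.detach()[padding_idx].abs().sum())  # tensor(0.)
```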

@LysandreJik
Author

Ah okay, that makes sense. In that case, what is the first row? Is it full of randomly initialized values? If so, why not use padding_idx = 0? Thank you for your answer.

@lematt1991
Contributor

Yeah, the first row contains randomly initialized values that never get used. There's really no particular reason why padding_idx is 1 instead of 0, other than it being the first token added to the dictionary. We need to use the same padding_idx value for both embed_tokens and embed_positions.
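
To make the indexing concrete, here's a rough sketch of the position numbering fairseq applies (modeled on its make_positions helper; the token ids below are made up for illustration):

```python
import torch

def make_positions(tokens: torch.Tensor, padding_idx: int) -> torch.Tensor:
    """Number non-pad tokens from padding_idx + 1; map pad tokens to padding_idx."""
    mask = tokens.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

# One sentence followed by two pad tokens.
tokens = torch.tensor([[0, 31414, 232, 2, 1, 1]])
print(make_positions(tokens, padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])
```

Position 0 is never produced, which is why the randomly initialized first row is harmless.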

@LysandreJik
Author

Okay, I understand. Thank you very much for your help!

facebook-github-bot pushed a commit that referenced this issue Aug 28, 2020
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Adds some fantastic work done by Zeming Lin (ebetica). FASTA is the predominant format used by biologists for DNA, RNA, and proteins. It looks something like this:
```
>name of your protein1
MSHFAHSDFAHSDFHWEHJW
FHDSJFASJDAHASFASDFIAA
>name of your protein2
MAHASDFMASFJADSFMSMSM
MASDFJASDJ
```

There's no need for BPE or other fancy preprocessing, so we can read the FASTA file directly in fairseq with no speed hit compared to binarized data. Building the index takes time, but we can just cache it, similar to the other cached indexed datasets.
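
As a rough illustration of the offset-index idea (not the actual fairseq implementation; the function names here are made up), a cached FASTA index might look like this:

```python
import os
import pickle

def build_or_load_fasta_index(path):
    """Byte offset of each '>' header line, cached next to the FASTA file."""
    cache = path + ".idx.pkl"
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        for line in iter(f.readline, b""):
            if line.startswith(b">"):
                offsets.append(pos)
            pos = f.tell()
    with open(cache, "wb") as f:
        pickle.dump(offsets, f)
    return offsets

def get_record(path, offsets, i):
    """Return (header, sequence) for record i with a single seek."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        header = f.readline().decode().strip()
        seq = []
        for line in iter(f.readline, b""):
            if line.startswith(b">"):
                break
            seq.append(line.decode().strip())
    return header, "".join(seq)
```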

We hope this reduces the barrier for biologists to use fairseq, making this great framework even more accessible to the computational biology community!

This dataset is used internally in proteinseq, [see here for an example](https://github.com/fairinternal/proteinseq/pull/178/files).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: fairinternal/fairseq-py#1187

Reviewed By: myleott

Differential Revision: D22020223

Pulled By: ebetica

fbshipit-source-id: 372ebc199c0c9200645c79fa7722aded931e9038
sshleifer pushed a commit that referenced this issue Apr 7, 2021
yfyeung pushed a commit to yfyeung/fairseq that referenced this issue Dec 6, 2023