RoBERTa and 514 #1187
Comments
Yes
Ah okay, that makes sense. In this case, what is the first column, is it full of randomly initialized values? In that case, why not use
Yeah the first vector is randomly initialized values that never get used. There's really no particular reason why
Okay, I understand. Thank you very much for your help!
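For anyone landing here later, this is one way to make the numbers in this thread add up. It is only a sketch in plain Python, not the actual fairseq code, and it rests on two assumptions that the comments above do not spell out: the padding index is 1, and real tokens get position ids starting at padding index + 1. The helper names `position_ids` and `embedding_rows` are made up for illustration.

```python
PADDING_IDX = 1      # assumption: id of the padding token
MAX_POSITIONS = 512  # maximum sequence length mentioned in the question


def position_ids(token_ids, padding_idx=PADDING_IDX):
    """Hypothetical helper: padding tokens map to padding_idx,
    real tokens count up from padding_idx + 1."""
    positions = []
    seen = 0  # number of non-padding tokens seen so far
    for token in token_ids:
        if token == padding_idx:
            positions.append(padding_idx)
        else:
            seen += 1
            positions.append(padding_idx + seen)
    return positions


def embedding_rows(max_positions=MAX_POSITIONS, padding_idx=PADDING_IDX):
    """Rows needed so that the largest position id is a valid index."""
    return max_positions + padding_idx + 1


# Made-up token ids, right-padded with the padding id 1.
tokens = [0, 31414, 232, 2, 1, 1]
print(position_ids(tokens))  # [2, 3, 4, 5, 1, 1]
print(embedding_rows())      # 514
```

Under these assumptions index 0 is never produced, which would match the randomly initialized vector that never gets used, index 1 is reserved for padding, which would match the all-zero vector, and indices 2 through 513 cover the 512 real positions.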
Summary:

# Before submitting

- [ ] Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?

Adds some fantastic work done by Zeming Lin (ebetica).

FASTA is the predominant format used by biologists for DNA, RNA and proteins. It looks something like this:

```
>name of your protein1
MSHFAHSDFAHSDFHWEHJW
FHDSJFASJDAHASFASDFIAA
>name of your protein2
MAHASDFMASFJADSFMSMSM
MASDFJASDJ
```

There's no need for BPE or other fancy preprocessing, so we can read the FASTA file directly in fairseq with no speed hit compared to binarized data. Building the index is important, but we can just cache that, similar to the other cached indexed datasets.

We hope this reduces the barrier for biologists to use fairseq, making this great framework even more accessible to the computational biology community!

This dataset is used internally in proteinseq, [see here for an example](https://github.com/fairinternal/proteinseq/pull/178/files).

## PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

## Did you have fun?

Make sure you had fun coding

Pull Request resolved: fairinternal/fairseq-py#1187
Reviewed By: myleott
Differential Revision: D22020223
Pulled By: ebetica
fbshipit-source-id: 372ebc199c0c9200645c79fa7722aded931e9038
Hi! We're scratching our heads with RoBERTa and the way it handles its inputs.
The following matrix is of size 514x768:
Why is it different from the maximum embedding size, which is 512? Furthermore, we observe that the second column of this matrix is full of zeros:
Why is that? Thank you.