Fix dataset loading when there are multiple valid subsets #835
When there are multiple valid subsets, say `valid`, `valid1` and `valid2`, and `combine=True`, loading the `valid` subset tries to locate and load `valid`, `valid1`, `valid2`... and then combines them into one dataset. Setting `combine` to `False` solves this issue.
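To make the failure mode concrete, here is a minimal sketch of the shard lookup described above. It is a simplified model, not the actual fairseq code: `resolve_shards` and the `available` set are hypothetical names used only for illustration, and the numeric-suffix probing (`valid`, `valid1`, `valid2`, ...) is assumed from the description.

```python
import itertools


def resolve_shards(split, available, combine):
    """Return the subset names that would be loaded for `split`.

    Hypothetical helper: models the lookup as probing `split`,
    `split1`, `split2`, ... until a name is missing.
    """
    loaded = []
    for k in itertools.count():
        # The first probe is the bare split name; later probes add a suffix.
        name = split + (str(k) if k > 0 else "")
        if name not in available:
            break
        loaded.append(name)
        if not combine:
            # Without combining, only the requested subset itself is loaded.
            break
    return loaded


available = {"valid", "valid1", "valid2"}
requested = ["valid", "valid1", "valid2"]  # as in --valid-subset valid,valid1,valid2

for combine in (True, False):
    for split in requested:
        print(f"combine={combine} {split} -> {resolve_shards(split, available, combine)}")
```

Under these assumptions, requesting `valid` with `combine=True` also picks up `valid1` and `valid2`, which are then loaded again when they are requested by name. With `combine=False`, each requested subset is loaded exactly once.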
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
facebook-github-bot left a comment:
@myleott is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Good catch, thanks!
Summary: When there are multiple valid subsets, say `valid`, `valid1` and `valid2`, and `combine=True`, loading the `valid` subset tries to locate and load `valid`, `valid1`, `valid2`... and then combines them into one dataset. Setting `combine` to `False` solves this issue.

In my experiment, I have 3 valid subsets with 3000, 5000 and 8701 examples. With the argument `--valid-subset valid,valid1,valid2`, the log is as follows:

```
......
| ./mix_data/bin valid src-trg 3000 examples
| ./mix_data/bin valid1 src-trg 5000 examples
| ./mix_data/bin valid2 src-trg 7801 examples
| ./mix_data/bin valid1 src-trg 5000 examples
| ./mix_data/bin valid2 src-trg 7801 examples
......
```

As shown above, the `valid1` and `valid2` subsets are incorrectly loaded twice.

Pull Request resolved: facebookresearch/fairseq#835
Differential Revision: D16006343
Pulled By: myleott
fbshipit-source-id: ece7fee3a00f97a6b3409defbf7f7ffaf0a54fdc