Multi30k and Dataset download refactor #116
Maybe factor out the tar.gz handling into a ...
It's a little weird since the class has multiple urls/tar.gz files.
Okay, I've got a way to unify the tarfile/gzip datasets, so will push in a few minutes after tests run locally.
Okay, I changed up the ZipDataset so that I could use it for all our datasets that download a zip file or tar.gz file and got rid of the cls.filename field, since it isn't used.
So, it annoyed me that I couldn't get the Zipfile functionality to work the way I wanted for all the datasets. I ended up getting rid of Zipfile and adding a generic download function to the Dataset class. Datasets that want to use ... The upside is that any dataset that needs to download files can do so using the common ...
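For readers following along, here is a rough sketch of what a generic download helper on a base class could look like. It is not the code from this PR: the attribute names `urls`, `name`, and `dirname`, the `extract_archive` helper, and the `download(root)` signature are all assumptions made for illustration.

```python
import gzip
import os
import shutil
import tarfile
import urllib.request
import zipfile


def extract_archive(path, dest):
    """Unpack a .zip, .tar.gz/.tgz, or plain .gz file into dest (hypothetical helper)."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)
    elif path.endswith((".tar.gz", ".tgz")):
        with tarfile.open(path, "r:gz") as tf:
            tf.extractall(dest)
    elif path.endswith(".gz"):
        # A single gzip-compressed file: drop the .gz suffix for the output name.
        out = os.path.join(dest, os.path.basename(path)[:-3])
        with gzip.open(path, "rb") as src, open(out, "wb") as dst:
            shutil.copyfileobj(src, dst)
    # Anything else is assumed to be a plain file that needs no unpacking.


class Dataset:
    # Assumed class attributes; each dataset subclass would override them.
    urls = []      # one or more archive URLs
    name = ""      # directory for this dataset under root
    dirname = ""   # directory expected to exist once extraction is done

    @classmethod
    def download(cls, root):
        """Fetch and unpack every URL in cls.urls unless the data is already there."""
        base = os.path.join(root, cls.name)
        target = os.path.join(base, cls.dirname)
        if not os.path.isdir(target):
            os.makedirs(base, exist_ok=True)
            for url in cls.urls:
                archive = os.path.join(base, os.path.basename(url))
                if not os.path.isfile(archive):
                    urllib.request.urlretrieve(url, archive)
                extract_archive(archive, base)
        return target
```

With something like this, a dataset with several archives only has to list them in `urls`; the loop handles each one the same way, whether it is a zip, a tar.gz, or a single gzipped file.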
I hadn't read through this yet; this is such a good simplification! I'm still amazed how little code you can get away with. |
This is a trivial question, but isn't the Multi30k dataset from the WMT 2016 shared task, not 2017?
The shared task ran in both years, but yeah, it would probably be more accurate for the docstring to say WMT 2016 (since the splits are from that year).
* adding Multi30k wrapper
* abstracting tar.gz compression
* removing cls.filename
* refactor datasets for downloading
* bug in sst tree examples