Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi30k and Dataset download refactor #116

Merged
merged 5 commits into from
Sep 14, 2017
Merged

Multi30k and Dataset download refactor #116

merged 5 commits into from
Sep 14, 2017

Conversation

bmccann
Copy link
Contributor

@bmccann bmccann commented Sep 11, 2017

No description provided.

@jekbradbury
Copy link
Contributor

Maybe factor out the targz handling into a GZipDataset (or a ZipDataset with ext=’tar.gz’)?

@bmccann
Copy link
Contributor Author

bmccann commented Sep 11, 2017

It's a little weird since the class has multiple urls/tar.gz files.

@bmccann
Copy link
Contributor Author

bmccann commented Sep 11, 2017

Okay, I've got a way to unify the tarfile/gzip datasets, so will push in a few minutes after tests run locally.

@bmccann
Copy link
Contributor Author

bmccann commented Sep 11, 2017

Okay, I changed up the ZipDataset so that I could use it for all our datasets that download a zip file or tar.gz file and got rid of the cls.filename field, since it isn't used.

@bmccann bmccann changed the title adding Multi30k wrapper Multi30k and Dataset download refactor Sep 11, 2017
@bmccann
Copy link
Contributor Author

bmccann commented Sep 11, 2017

So, it annoyed me that I couldn't get the Zipfile functionality to work the way I wanted for all the datasets. I ended up getting rid of Zipfile and adding a generic download function to the Dataset class. Datasets that want to use download() need to have a class variable urls defined.

The upside is that any dataset that needs to download files can do so using the common download() function. Standardizes how we handle downloading and decompressing.

@jekbradbury
Copy link
Contributor

I hadn't read through this yet; this is such a good simplification! I'm still amazed how little code you can get away with.

@jekbradbury jekbradbury merged commit 6f930eb into master Sep 14, 2017
@jihunchoi
Copy link
Contributor

This is a trivial question, but isn't Multi30k dataset from WMT 2016 shared task, not 2017?
It seems that all links are pointing to the links of 2016.

@nelson-liu
Copy link
Contributor

The shared task was in both years --- but yeah it would probably be more accurate to have the docstring be WMT 2016 (since the splits are from that year).

jekbradbury pushed a commit that referenced this pull request Oct 9, 2017
* adding Multi30k wrapper

* abstracting tar.gz compression

* removing cls.filename

* refactor datasets for downloading

* bug in sst tree examples
@jekbradbury jekbradbury deleted the multi30k branch October 17, 2017 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants