Multi30k and Dataset download refactor #116

bmccann · 2017-09-11T10:31:44Z

No description provided.

jekbradbury · 2017-09-11T12:42:23Z

Maybe factor out the tar.gz handling into a GZipDataset (or a ZipDataset with ext='tar.gz')?

bmccann · 2017-09-11T13:27:50Z

It's a little weird since the class has multiple urls/tar.gz files.

bmccann · 2017-09-11T14:09:49Z

Okay, I've got a way to unify the tarfile/gzip datasets, so will push in a few minutes after tests run locally.

bmccann · 2017-09-11T14:36:53Z

Okay, I changed up the ZipDataset so that I could use it for all our datasets that download a zip file or tar.gz file and got rid of the cls.filename field, since it isn't used.

bmccann · 2017-09-11T17:57:09Z

So, it annoyed me that I couldn't get the Zipfile functionality to work the way I wanted for all the datasets. I ended up getting rid of Zipfile and adding a generic download function to the Dataset class. Datasets that want to use download() need to have a class variable urls defined.

The upside is that any dataset that needs to download files can do so using the common download() function, which standardizes how we handle downloading and decompressing.
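For readers following along, here is a rough sketch of what a shared download() classmethod driven by a urls class variable might look like. It is not the code merged in this PR: the `name` attribute, the `extract` helper, and the injectable `fetch` parameter are all assumptions made for illustration, and only the standard library is used.

```python
import gzip
import os
import shutil
import tarfile
import zipfile
from urllib.request import urlretrieve


def extract(archive, dest):
    """Hypothetical helper: unpack a .zip, .tar.gz/.tgz, or plain .gz file."""
    if archive.endswith('.zip'):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif archive.endswith(('.tar.gz', '.tgz')):
        with tarfile.open(archive, 'r:gz') as tf:
            tf.extractall(dest)
    elif archive.endswith('.gz'):
        # A single gzipped file: write it next to the archive without '.gz'.
        with gzip.open(archive, 'rb') as src, open(archive[:-3], 'wb') as dst:
            shutil.copyfileobj(src, dst)


class Dataset:
    """Illustrative base class only; not the actual torchtext Dataset."""

    urls = []   # subclasses list every archive they need
    name = ''   # assumed: subdirectory under `root` for this dataset

    @classmethod
    def download(cls, root, fetch=urlretrieve):
        """Fetch and unpack every URL in cls.urls, then return the data path.

        `fetch` is an assumed hook (url, destination) so the network call
        can be swapped out; it defaults to urllib's urlretrieve.
        """
        path = os.path.join(root, cls.name)
        os.makedirs(path, exist_ok=True)
        for url in cls.urls:
            archive = os.path.join(path, os.path.basename(url))
            if not os.path.isfile(archive):  # skip files already downloaded
                fetch(url, archive)
            extract(archive, path)
        return path
```

Under these assumptions, a dataset with several archives only has to list them in urls, which is the case that made a single-file ZipDataset awkward earlier in this thread.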

jekbradbury · 2017-09-14T19:08:04Z

I hadn't read through this yet; this is such a good simplification! I'm still amazed how little code you can get away with.

jihunchoi · 2017-09-18T01:26:15Z

This is a trivial question, but isn't the Multi30k dataset from the WMT 2016 shared task, not 2017?
It seems that all the links are the 2016 ones.

nelson-liu · 2017-09-18T02:03:53Z

The shared task was in both years, but yeah, it would probably be more accurate for the docstring to say WMT 2016 (since the splits are from that year).

* adding Multi30k wrapper
* abstracting tar.gz compression
* removing cls.filename
* refactor datasets for downloading
* bug in sst tree examples

bmccann requested a review from jekbradbury September 11, 2017 10:31

bmccann force-pushed the multi30k branch from ca134f1 to 2d33971 Compare September 11, 2017 12:12

bmccann force-pushed the multi30k branch from 2d33971 to fc05bcb Compare September 11, 2017 12:54

bmccann added 3 commits September 11, 2017 14:40

adding Multi30k wrapper

93189bd

abstracting tar.gz compression

795850d

removing cls.filename

3cacb41

bmccann force-pushed the multi30k branch from 243ac01 to 3cacb41 Compare September 11, 2017 14:40

refactor datasets for downloading

f8b9647

bmccann changed the title ~~adding Multi30k wrapper~~ Multi30k and Dataset download refactor Sep 11, 2017

bug in sst tree examples

e5f16a3

jekbradbury merged commit 6f930eb into master Sep 14, 2017

jekbradbury pushed a commit that referenced this pull request Oct 9, 2017

Multi30k and Dataset download refactor (#116)

34bfbe1


jekbradbury deleted the multi30k branch October 17, 2017 02:47
