
Migrate datasets to build on top of torchdata datapipes #1494

Closed
22 tasks done
parmeet opened this issue Jan 7, 2022 · 31 comments · Fixed by #1562

Comments

@parmeet
Contributor

parmeet commented Jan 7, 2022

🚀 Feature

Motivation

https://github.com/pytorch/data#why-composable-data-loading

User experience: TorchData datasets enable a new functional API, auto-sharding, and snapshotting support out of the box. They also enable standard flow control like batching, collation, shuffling, bucketing, and mapping/transformation using user-defined functions (UDFs) and transforms.

Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality like downloading, extracting, caching, file/stream parsing, etc.
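The composable, fluent style these points describe can be illustrated with a tiny pure-Python sketch. This is not the torchdata API itself; `Pipe` and its methods are hypothetical stand-ins for the way datapipes chain operations like shuffle, map, and batch:

```python
import random

class Pipe:
    """Hypothetical sketch of the composable style torchdata datapipes offer:
    each operation returns a new pipe, so operations chain fluently."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply a user-defined function (UDF) to every element.
        return Pipe(fn(x) for x in self.items)

    def shuffle(self, seed=0):
        # Return a new pipe with the elements in shuffled order.
        items = list(self.items)
        random.Random(seed).shuffle(items)
        return Pipe(items)

    def batch(self, size):
        # Group consecutive elements into fixed-size batches.
        return Pipe(self.items[i:i + size] for i in range(0, len(self.items), size))

    def __iter__(self):
        return iter(self.items)

# Fluent chaining: transform, then batch.
batches = list(Pipe(range(6)).map(lambda x: x * 10).batch(2))
# batches == [[0, 10], [20, 30], [40, 50]]
```

The real datapipes additionally handle sharding and snapshotting behind the same kind of chained interface.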

Reference
Examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text
TorchData: https://github.com/facebookexternal/torchdata

Backlog of datasets

Contributing

Please leave a message below if you plan to work on particular dataset(s) to avoid duplication of efforts. Also please link to the corresponding PRs.

cc: @Nayef211 , @abhinavarora , @erip , @ejguan , @VitalyFedyunin

@parmeet parmeet changed the title Migrate datasets to build on top on [torchdata datapipes](https://github.com/pytorch/data). Migrate datasets to build on top on torchdata datapipes. Jan 7, 2022
@erip
Contributor

erip commented Jan 7, 2022

This sounds great! Logistically, how do we want to organize the PRs? Should we have a branch that adds torchdata as a dependency, then branch from it and target it with our PRs?

@Nayef211
Contributor

Nayef211 commented Jan 7, 2022

@erip That's a good point. Right now we have torchdata as an optional dependency since it is not yet available on PyPI (torchdata is in the prototype phase until the next PyTorch release). I think going forward we still want to keep it as an optional dependency until torchdata is released as beta.

What do you think @parmeet @abhinavarora?

@parmeet
Contributor Author

parmeet commented Jan 7, 2022

Right, agreed with @Nayef211. Let's keep it as an optional dependency. I am working on a PR to migrate AmazonReviewPolarity (#1490) to ensure nothing breaks in the CI when doing the migration. Once this lands, we can follow the same approach to migrate the other datasets.

@erip
Contributor

erip commented Jan 7, 2022

Great, so individual PRs instead of one big one -- got it! I'll pick a couple soon and start digging.

@mthrok
Contributor

mthrok commented Jan 7, 2022

I guess it will become necessary to use some skip-if logic when dealing with the optional dependency.
In CI, I recommend making tests fail unless they are explicitly allowed to skip.
We had an issue where CUDA tests were unintentionally skipped because of a wrong CI configuration and a permissive skipIf.

Here is the update I applied recently to avoid that: pytorch/audio#2127
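The "fail unless explicitly allowed to skip" policy could be sketched roughly like this. This is a hypothetical helper, not the actual pytorch/audio code, and the environment-variable name is made up:

```python
import importlib.util
import os

def skip_or_fail_if_missing(module_name):
    """Return True (meaning: skip the test) only when skipping is explicitly
    allowed; otherwise raise so CI fails loudly instead of silently skipping.

    `ALLOW_SKIP_OPTIONAL_DEPS` is a hypothetical env var a CI job would set
    only where the optional dependency is legitimately absent."""
    if importlib.util.find_spec(module_name) is not None:
        return False  # dependency is present: run the test normally
    if os.environ.get("ALLOW_SKIP_OPTIONAL_DEPS") == "1":
        return True   # skipping is explicitly sanctioned for this CI job
    # Neither installed nor allowed to skip: fail so a misconfigured CI
    # job cannot silently skip the whole test suite.
    raise RuntimeError(
        f"{module_name} is missing and skipping is not explicitly allowed"
    )
```

The key point is the final branch: a misconfigured job that forgot to install the dependency errors out instead of reporting green.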

@parmeet parmeet changed the title Migrate datasets to build on top on torchdata datapipes. Migrate datasets to build on top of torchdata datapipes Jan 8, 2022
@erip
Contributor

erip commented Jan 8, 2022

I'll keep a moving list here. To start, I'll take...

  • AG_NEWS
  • Amazon review full
  • DBPedia
  • Sogou News
  • YelpReviewFull
  • YelpReviewPolarity
  • YahooAnswers
  • CoNLL2000Chunking
  • SQUAD1
  • SQUAD2
  • UDPOS (blocked by CoNLL2000Chunking)
  • IWSLT2016
  • IWSLT2017
  • Multi30K

@parmeet
Contributor Author

parmeet commented Jan 8, 2022

Hey @erip, thank you for picking the datasets :).

I'll keep a moving list here. To start, I'll take...

Since we already have a backlog checklist in the issue, you may constrain this list to the datasets you pick up and check them off directly in the backlog once the PRs land.

  • AmazonReviewPolarity — Parmeet has finished this

You may skip the mention of this dataset; I will check it off once I land the PR :)

@parmeet
Contributor Author

parmeet commented Jan 8, 2022

AmazonReviewPolarity [WIP] #1490. Note that this PR will also resolve issues with the existing CI dataset tests.

@parmeet
Contributor Author

parmeet commented Jan 10, 2022

I guess it will become necessary to use some skip-if logic when dealing with the optional dependency. In CI, I recommend making tests fail unless they are explicitly allowed to skip. We had an issue where CUDA tests were unintentionally skipped because of a wrong CI configuration and a permissive skipIf.

Here is the update I applied recently to avoid that: pytorch/audio#2127

Actually, in this case the problem goes beyond unit testing. The package builds (conda, wheels, docs) also rely on the availability of torchdata, because it is imported inside the modules. As of now, the import is conditioned on availability. For unit testing, we install torchdata from source. Perhaps we could do the same for the package builds?

@mthrok
Contributor

mthrok commented Jan 10, 2022

Actually, in this case the problem goes beyond unit testing. The package builds (conda, wheels, docs) also rely on the availability of torchdata, because it is imported inside the modules. As of now, the import is conditioned on availability. For unit testing, we install torchdata from source. Perhaps we could do the same for the package builds?

Isn't torchdata an optional runtime dependency? I looked into the build process and CI configuration, but I do not see any mention of torchdata, nor any build/packaging logic that changes based on its presence.

@parmeet
Contributor Author

parmeet commented Jan 10, 2022

I plan to pick up CC-100 next.

@Nayef211 Nayef211 linked a pull request Jan 10, 2022 that will close this issue
@Nayef211 Nayef211 removed a link to a pull request Jan 10, 2022
@parmeet
Contributor Author

parmeet commented Jan 11, 2022

Isn't torchdata an optional runtime dependency?

Yes, it is, as long as users do not use the migrated datasets. But they would need to install it to use those, since the underlying implementation of the migrated datasets does not fall back to the previous (non-datapipes-based) implementation if the torchdata package is not found.

Edit: We are going to make torchdata a hard dependency with the next release. I think the question is rather what we should do during this migration phase. To be safe, I think keeping it optional makes sense, so that users working directly from source/nightly do not break their workflow unless they depend on the migrated datasets (in which case they would need to install it anyway).

I looked into the build process and CI configuration, but I do not see any mention of torchdata, nor any build/packaging logic that changes based on its presence.

If we remove the conditional import, it breaks the CI build/packaging-related tests. I guess we would need to install torchdata in the appropriate locations for them to pass, like we are doing for unit testing here.

@ejguan
Contributor

ejguan commented Jan 11, 2022

Edit: We are going to make torchdata a hard dependency with the next release. I think the question is rather what we should do during this migration phase. To be safe, I think keeping it optional makes sense, so that users working directly from source/nightly do not break their workflow unless they depend on the migrated datasets (in which case they would need to install it anyway).

I second this. Before our release, TorchData is technically an unstable library, which means our APIs may change a lot. Nightly users of TorchText could encounter errors during the installation phase. That should not happen, since some users won't use anything from TorchData.

@mthrok
Contributor

mthrok commented Jan 11, 2022

Edit: We are going to make torchdata a hard dependency with the next release.

Okay, this is what I wanted to verify. As long as it is the intended move (and is going to be mentioned in the README), it sounds good to me.

If we remove the conditional import, it breaks the CI build/packaging-related tests. I guess we would need to install torchdata in the appropriate locations for them to pass, like we are doing for unit testing here.

nit: IMU, the packaging job should not be importing torchtext, so I am not entirely convinced that it's required at build time (especially since torchdata is a pure-Python package). Maybe it failed at the smoke-test step that happens right after the packaging step in the same CI job.

@parmeet
Contributor Author

parmeet commented Jan 11, 2022

nit: IMU, the packaging job should not be importing torchtext, so I am not entirely convinced that it's required at build time (especially since torchdata is a pure-Python package). Maybe it failed at the smoke-test step that happens right after the packaging step in the same CI job.

Sorry for the confusion; yes, you are right, it is not needed for packaging, but rather for testing in the same CI job (basically, importing torchtext would fail because it implicitly imports datasets that fail due to the missing torchdata package).

@parmeet
Contributor Author

parmeet commented Jan 14, 2022

Thanks @abhinavarora and @Nayef211 for your thoughts on the BC issue. Honestly, I am of a similar impression and cannot immediately think of any obvious BC issues (which is why I wrote unexpected :) ). One thing the new API doesn't support is the __next__ method, though I don't know how this might impact users, if at all.

@Nayef211 regarding your question:

Would we want to add all the datapipe backed datasets in experimental/datasets folder if we ended up maintaining both implementations?

I don't think we want to do that. One way to keep both implementations is to switch to the new one through some flag, etc. By default, users would still get the existing implementation, along with a user warning and instructions for switching to the new implementation. This surely adds some maintenance overhead, which @abhinavarora already raised concerns about.

@kevinchn
Contributor

I'd like to take a stab at the IMDB dataset.

@mthrok
Contributor

mthrok commented Jan 21, 2022

Replying to the comment from #1513:

BTW, do we want to replace the old implementation directly? I think we may want to keep the original implementation at least one release cycle.

Thanks @ejguan for this call-out. I think you made a valid point. Perhaps we could support both implementations, with the ability to switch to the new one by setting a flag or something. Let's take this discussion to the issue (#1494) directly.

If using a flag, I recommend making it part of the constructor (which should not be a big deal if the interface is the same), not a global flag. torchaudio has a global flag for its I/O backend, and it has not been a good maintenance experience. It made the codebase stateful, and in a large codebase it becomes impossible to tell which downstream project sets/expects which flag state.
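A per-instance constructor flag along these lines might look like the following sketch. The class, flag, and method names are purely hypothetical, not an actual torchtext API:

```python
class AGNewsDataset:
    """Hypothetical sketch: select the implementation per instance via a
    constructor flag, rather than mutating global state. Two instances with
    different flags can coexist, and no downstream project can change the
    behavior of someone else's instance."""

    def __init__(self, split="train", use_datapipes=True):
        self.split = split
        # Bind the chosen implementation at construction time; the choice
        # is local to this object and visible at the call site.
        self._impl = self._datapipe_impl if use_datapipes else self._legacy_impl

    def _datapipe_impl(self):
        # Stand-in for the new datapipes-based loading path.
        return f"datapipe:{self.split}"

    def _legacy_impl(self):
        # Stand-in for the previous (non-datapipes) loading path.
        return f"legacy:{self.split}"

    def load(self):
        return self._impl()
```

Contrast this with a global backend flag, where `set_backend("legacy")` anywhere in the process silently changes what every subsequent dataset does.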

@erip
Contributor

erip commented Jan 23, 2022

Quick update:

  • UDPOS is done. I can't check it off here, but maybe someone else can. 😄
  • The IWSLT datasets are a bit tricky for a couple of reasons:
    • They have nested tarballs. I have submitted a PR upstream to torchdata to add a flatmap datapipe and plan to add it to the torchtext repo until that's merged.
    • The files get cleaned and rewritten in place. This isn't a strict requirement, but it's a bit delicate, so I'm trying to proceed with caution.
    • There's a lot of conditional logic depending on the file format. I'll need to think about how best to handle this. Probably just mapping a function which abstracts the conditions... I hope it'll be serializable.

In the meantime, I'll take a stab at Multi30k -- it looks quite straightforward. I can't seem to find the CC100 and SST2 impls -- are these new datasets for torchtext?

@parmeet
Contributor Author

parmeet commented Jan 24, 2022

  • UDPOS is done. I can't check it off here, but maybe someone else can. 😄

Done :)

  • They have nested tarballs. I have submitted a PR upstream to torchdata to add a flatmap datapipe and plan to add it to the torchtext repo until that's merged.

Could you please provide more details as to where this is required?

In the meantime, I'll take a stab at Multi30k

Great, thanks!

I can't seem to find CC100 and SST2 impls -- are these new datasets for torchtext?

SST-2 is in the experimental folder. @Nayef211 is working on it and is currently blocked by some internal fbcode dependency. I am looking at CC100. It is already added as a notebook example here. Unfortunately, the host seems to be quite unresponsive, so I am currently blocked on adding this one. I am trying to contact some internal folks to figure out the path forward.

@erip
Contributor

erip commented Jan 24, 2022

Could you please help provide more details ...

Yes, it's here. We extract contents from a tarball, and inside the tarball are other tarballs. This requires something roughly like:

# [ 1.tgz, 2.tgz, ..., n.tgz]
inner_tars = outer_tar.read_from_tar()

# Yuck, an IterableDataPipe of IterableDataPipes... flatmap to the rescue
# [ 1_1.txt, 1_2.txt, 2_1.txt, 2_2.txt, 3_1.txt, ...]
inner_inner_files = inner_tars.flatmap(lambda inner_tar: FileOpener(inner_tar).read_from_tar())

@parmeet
Contributor Author

parmeet commented Jan 26, 2022

inner_inner_files = inner_tars.flatmap(lambda inner_tar: FileOpener(inner_tar).read_from_tar())

I see, but wouldn't the tar datapipe already do this recursively and return all the inner files?

@erip
Contributor

erip commented Jan 26, 2022

@parmeet oh, I'm not sure... I didn't think so, but I hadn't tried yet, so I don't know. I'll give it a shot tomorrow.

@erip
Contributor

erip commented Jan 26, 2022

@parmeet looking at the implementation of the datapipe behind the read_from_tar functional, it doesn't extract recursively. The PR in torchdata seems to be moving forward, so it will likely not be something we need to worry about much.
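The non-recursive behavior can be reproduced with the standard library alone: reading the outer tar yields the inner tarballs as opaque members, and a second, explicit pass (the "flatmap" step) is needed to reach the innermost files. A self-contained sketch using in-memory tars, with made-up file names:

```python
import io
import tarfile

def make_tar(files):
    """Build an in-memory (uncompressed) tar archive from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def read_nested(outer_buf):
    """Yield (name, bytes) for files inside tars that are inside the outer
    tar. The inner loop is the explicit 'flatmap' step that a non-recursive
    tar reader does not perform for you."""
    with tarfile.open(fileobj=outer_buf) as outer:
        for member in outer.getmembers():
            # Each outer member is itself a tar archive; open it in turn.
            inner_buf = io.BytesIO(outer.extractfile(member).read())
            with tarfile.open(fileobj=inner_buf) as inner:
                for f in inner.getmembers():
                    yield f.name, inner.extractfile(f).read()

inner = make_tar({"1_1.txt": b"a", "1_2.txt": b"b"})
outer = make_tar({"1.tgz": inner.getvalue()})
files = dict(read_nested(outer))
# files == {"1_1.txt": b"a", "1_2.txt": b"b"}
```

A single `read_from_tar`-style pass over `outer` would stop at `1.tgz`; the flatmap-style inner pass is what surfaces `1_1.txt` and `1_2.txt`.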

@parmeet
Contributor Author

parmeet commented Jan 27, 2022

The PR in torchdata seems to be moving forward, so it will likely not be something we need to worry about much.

Awesome! Thanks so much @erip for taking it forward. I guess these are the only two datasets left (apart from CC-100 :) ). Please feel free to post if there are any blockers :).

facebook-github-bot pushed a commit to facebookresearch/recipes that referenced this issue Jan 28, 2022
Summary:
## Summary
- Updated datamodule to work with any torchtext dataset
   - No longer checking to see whether the dataset is an instance of the SST2Dataset
   - Updated `DocClassificationDataModuleConf` dataclass to take in user provided `columns` and `label_column` fields since different datasets have different column orderings
- Updated tests to use patching for testing with mocked datasets similar to what is done in OSS for the [AmazonReviewPolarity dataset test](pytorch/text#1532)
- Removed dataset test from torchrecipe since the torchtext repo unittests provide adequate coverage

## Followup Items
- [ ] Update instantiation call for datasets to work with functional API as opposed to class API once the SST2 dataset has been migrated out of experimental ([reference GH issue](pytorch/text#1494))

Reviewed By: abhinavarora, mthrok, parmeet

Differential Revision: D33775443

fbshipit-source-id: 1e6545949808ec5bd0e13cf3f9e7aaea08d68a59
@erip erip mentioned this issue Jan 31, 2022
@parmeet
Contributor Author

parmeet commented Feb 4, 2022

This issue is officially closed now. For any improvements, bug fixes, CI dependencies, or new datasets, we can follow up in separate threads. Thank you everyone for your contributions in realizing this issue sooner than I expected :).

cc: @erip, @Nayef211, @abhinavarora, @ejguan, @VirgileHlav

@Nayef211
Contributor

Nayef211 commented Feb 4, 2022

Thanks for leading the efforts on this migration @parmeet! It was great to see how fast we worked together to get this done 😄

7 participants