
Migrate datasets to build on top of torchdata datapipes #1494

Closed
22 tasks done
parmeet opened this issue Jan 7, 2022 · 31 comments · Fixed by #1562

Comments

@parmeet
Contributor

parmeet commented Jan 7, 2022

🚀 Feature

Motivation

https://github.com/pytorch/data#why-composable-data-loading

User experience: TorchData datasets enable a new functional API, auto-sharding, and snapshotting support out of the box. They also enable standard flow control like batching, collation, shuffling, bucketing, and mapping/transformation using user-defined functions (UDFs) and transforms.

Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality like downloading, extracting, caching, file/stream parsing, etc.
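The composable, fluent style these points describe can be illustrated with a tiny pure-Python sketch. This is not the torchdata API itself; `Pipe` and its methods are hypothetical stand-ins for the way datapipes chain operations like shuffle, map, and batch:

```python
import random

class Pipe:
    """Hypothetical sketch of the composable style torchdata datapipes offer:
    each operation returns a new pipe, so operations chain fluently."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply a user-defined function (UDF) to every element.
        return Pipe(fn(x) for x in self.items)

    def shuffle(self, seed=0):
        # Return a new pipe with the elements in shuffled order.
        items = list(self.items)
        random.Random(seed).shuffle(items)
        return Pipe(items)

    def batch(self, size):
        # Group consecutive elements into fixed-size batches.
        return Pipe(self.items[i:i + size] for i in range(0, len(self.items), size))

    def __iter__(self):
        return iter(self.items)

# Fluent chaining: transform, then batch.
batches = list(Pipe(range(6)).map(lambda x: x * 10).batch(2))
# batches == [[0, 10], [20, 30], [40, 50]]
```

The real datapipes additionally handle sharding and snapshotting behind the same kind of chained interface.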

Reference
Examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text
TorchData: https://github.com/facebookexternal/torchdata

Backlog of datasets

Contributing

Please leave a message below if you plan to work on particular dataset(s) to avoid duplication of efforts. Also please link to the corresponding PRs.

cc: @Nayef211 , @abhinavarora , @erip , @ejguan , @VitalyFedyunin

@parmeet parmeet changed the title Migrate datasets to build on top on [torchdata datapipes](https://github.com/pytorch/data). Migrate datasets to build on top on torchdata datapipes. Jan 7, 2022
@erip
Contributor

erip commented Jan 7, 2022

This sounds great! Logistically, how do we want to organize the PRs? Should we have a branch that adds torchdata as a dependency, then branch from it and target it with our PRs?

@Nayef211
Contributor

Nayef211 commented Jan 7, 2022

@erip That's a good point. Right now we have torchdata as an optional dependency since it is not yet available on PyPI (torchdata is in the prototype phase until the next PyTorch release). I think going forward we still want to keep it as an optional dependency until torchdata is released as beta.

What do you think @parmeet @abhinavarora?

@parmeet
Contributor Author

parmeet commented Jan 7, 2022

Right, agreed with @Nayef211. Let's keep it as an optional dependency. I am working on a PR to migrate AmazonReviewPolarity (#1490) to ensure nothing breaks in the CI when doing the migration. Once this lands, we can follow the same approach to migrate the other datasets.

@erip
Contributor

erip commented Jan 7, 2022

Great, so individual PRs instead of one big one -- got it! I'll pick a couple soon and start digging.

@mthrok
Contributor

mthrok commented Jan 7, 2022

I guess it will become necessary to use some skip-if logic when dealing with the optional dependency.
In CI, I recommend making tests fail unless they are explicitly allowed to skip.
We had an issue where CUDA tests were unintentionally skipped because of a wrong CI configuration and a permissive skipIf.

Here is the update I applied recently to avoid that: pytorch/audio#2127
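The "fail unless explicitly allowed to skip" policy could be sketched roughly like this. This is a hypothetical helper, not the actual pytorch/audio code, and the environment-variable name is made up:

```python
import importlib.util
import os

def skip_or_fail_if_missing(module_name):
    """Return True (meaning: skip the test) only when skipping is explicitly
    allowed; otherwise raise so CI fails loudly instead of silently skipping.

    `ALLOW_SKIP_OPTIONAL_DEPS` is a hypothetical env var a CI job would set
    only where the optional dependency is legitimately absent."""
    if importlib.util.find_spec(module_name) is not None:
        return False  # dependency is present: run the test normally
    if os.environ.get("ALLOW_SKIP_OPTIONAL_DEPS") == "1":
        return True   # skipping is explicitly sanctioned for this CI job
    # Neither installed nor allowed to skip: fail so a misconfigured CI
    # job cannot silently skip the whole test suite.
    raise RuntimeError(
        f"{module_name} is missing and skipping is not explicitly allowed"
    )
```

The key point is the final branch: a misconfigured job that forgot to install the dependency errors out instead of reporting green.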

@parmeet parmeet changed the title Migrate datasets to build on top on torchdata datapipes. Migrate datasets to build on top of torchdata datapipes Jan 8, 2022
@erip
Contributor

erip commented Jan 8, 2022

I'll keep a moving list here. To start, I'll take...

  • AG_NEWS
  • Amazon review full
  • DBPedia
  • Sogou News
  • YelpReviewFull
  • YelpReviewPolarity
  • YahooAnswers
  • CoNLL2000Chunking
  • SQUAD1
  • SQUAD2
  • UDPOS (blocked by CoNLL2000Chunking)
  • IWSLT2016
  • IWSLT2017
  • Multi30K

@parmeet
Contributor Author

parmeet commented Jan 8, 2022

Hey @erip, thank you for picking the datasets :).

I'll keep a moving list here. To start, I'll take...

Since we already have a backlog checklist in the issue, you may constrain this list to the datasets you pick up and check them off directly in the backlog once the PRs land.

  • AmazonReviewPolarity — Parmeet has finished this

You may skip the mention of this dataset; I will check it off once I land the PR :)

@parmeet
Contributor Author

parmeet commented Jan 8, 2022

AmazonReviewPolarity [WIP] #1490. Note that this PR will also resolve issues with the existing CI dataset tests.

@parmeet
Contributor Author

parmeet commented Jan 10, 2022

I guess it will become necessary to use some skip-if logic when dealing with the optional dependency. In CI, I recommend making tests fail unless they are explicitly allowed to skip. We had an issue where CUDA tests were unintentionally skipped because of a wrong CI configuration and a permissive skipIf.

Here is the update I applied recently to avoid that: pytorch/audio#2127

Actually, in this case the problem goes beyond unit testing. The package builds (conda, wheels, docs) also rely on the availability of torchdata, because it is imported inside the modules. As of now, the import is conditioned on availability. For unit testing, we install torchdata from source. Perhaps we could do the same for the package builds?

@mthrok
Contributor

mthrok commented Jan 10, 2022

Actually, in this case the problem goes beyond unit testing. The package builds (conda, wheels, docs) also rely on the availability of torchdata, because it is imported inside the modules. As of now, the import is conditioned on availability. For unit testing, we install torchdata from source. Perhaps we could do the same for the package builds?

Isn't torchdata an optional runtime dependency? I looked into the build process and CI configuration, but I do not see any mention of torchdata, nor any build/packaging logic that changes based on its presence.

@parmeet
Contributor Author

parmeet commented Jan 10, 2022

I plan to pick up CC-100 next.

@Nayef211 Nayef211 linked a pull request Jan 10, 2022 that will close this issue
@Nayef211 Nayef211 removed a link to a pull request Jan 10, 2022
@parmeet
Contributor Author

parmeet commented Jan 11, 2022

Isn't torchdata an optional runtime dependency?

Yes, it is, as long as users do not use the migrated datasets. But they would need to install it to use those, since the underlying implementation of the migrated datasets does not fall back to the previous (non-datapipes-based) implementation if the torchdata package is not found.

Edit: We are going to make torchdata a hard dependency with the next release. I think the question is rather what we should do during this migration phase. To be safe, I think keeping it optional makes sense, so that users working directly from source/nightly do not break their workflow unless they depend on the migrated datasets (in which case they would need to install it anyway).

I looked into the build process and CI configuration, but I do not see any mention of torchdata, nor any build/packaging logic that changes based on its presence.

If we remove the conditional import, it breaks the CI build/packaging-related tests. I guess we would need to install torchdata in the appropriate locations for them to pass, like we are doing for unit testing here.

@ejguan
Contributor

ejguan commented Jan 11, 2022

Edit: We are going to make torchdata a hard dependency with the next release. I think the question is rather what we should do during this migration phase. To be safe, I think keeping it optional makes sense, so that users working directly from source/nightly do not break their workflow unless they depend on the migrated datasets (in which case they would need to install it anyway).

I second this. Before our release, TorchData is technically an unstable library, which means our APIs may change a lot. Nightly users of TorchText could encounter errors during the installation phase. That should not happen, since some users won't use anything from TorchData.

@mthrok
Contributor

mthrok commented Jan 11, 2022

Edit: We are going to make torchdata a hard dependency with the next release.

Okay, this is what I wanted to verify. As long as it is the intended move (and is going to be mentioned in the README), it sounds good to me.

If we remove the conditional import, it breaks the CI build/packaging-related tests. I guess we would need to install torchdata in the appropriate locations for them to pass, like we are doing for unit testing here.

nit: IMU, the packaging job should not be importing torchtext, so I am not entirely convinced that it's required at build time (especially since torchdata is a pure-Python package). Maybe it failed at the smoke-test step that happens right after the packaging step in the same CI job.

@parmeet
Contributor Author

parmeet commented Jan 11, 2022

nit: IMU, the packaging job should not be importing torchtext, so I am not entirely convinced that it's required at build time (especially since torchdata is a pure-Python package). Maybe it failed at the smoke-test step that happens right after the packaging step in the same CI job.

Sorry for the confusion; yes, you are right, it is not needed for packaging, but rather for testing in the same CI job (basically, importing torchtext would fail because it implicitly imports datasets that fail due to the missing torchdata package).

@parmeet
Contributor Author

parmeet commented Jan 14, 2022

Thanks @abhinavarora and @Nayef211 for your thoughts on the BC issue. Honestly, I am of a similar impression and cannot immediately think of any obvious BC issues (which is why I wrote unexpected :) ). One thing the new API doesn't support is the __next__ method, though I don't know how this might impact users, if at all.

@Nayef211 regarding your question:

Would we want to add all the datapipe backed datasets in experimental/datasets folder if we ended up maintaining both implementations?

I don't think we want to do that. One way to keep both implementations is to switch to the new one through some flag, etc. By default, users would still get the existing implementation, along with a user warning and instructions for switching to the new implementation. This surely adds some maintenance overhead, which @abhinavarora already raised concerns about.

@kevinchn
Contributor

I'd like to take a stab at the IMDB dataset.

@mthrok
Contributor

mthrok commented Jan 21, 2022

Replying to the comment from #1513:

BTW, do we want to replace the old implementation directly? I think we may want to keep the original implementation at least one release cycle.

Thanks @ejguan for this call-out. I think you made a valid point. Perhaps we could support both implementations, with the ability to switch to the new one by setting a flag or something. Let's take this discussion to the issue (#1494) directly.

If using a flag, I recommend making it part of the constructor (which should not be a big deal if the interface is the same), not a global flag. torchaudio has a global flag for its I/O backend, and it has not been a good maintenance experience. It made the codebase stateful, and in a large codebase it becomes impossible to tell which downstream project sets/expects which flag state.
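A per-instance constructor flag along these lines might look like the following sketch. The class, flag, and method names are purely hypothetical, not an actual torchtext API:

```python
class AGNewsDataset:
    """Hypothetical sketch: select the implementation per instance via a
    constructor flag, rather than mutating global state. Two instances with
    different flags can coexist, and no downstream project can change the
    behavior of someone else's instance."""

    def __init__(self, split="train", use_datapipes=True):
        self.split = split
        # Bind the chosen implementation at construction time; the choice
        # is local to this object and visible at the call site.
        self._impl = self._datapipe_impl if use_datapipes else self._legacy_impl

    def _datapipe_impl(self):
        # Stand-in for the new datapipes-based loading path.
        return f"datapipe:{self.split}"

    def _legacy_impl(self):
        # Stand-in for the previous (non-datapipes) loading path.
        return f"legacy:{self.split}"

    def load(self):
        return self._impl()
```

Contrast this with a global backend flag, where `set_backend("legacy")` anywhere in the process silently changes what every subsequent dataset does.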

@erip
Contributor

erip commented Jan 23, 2022

Quick update:

  • UDPOS is done. I can't check it off here, but maybe someone else can. 😄
  • The IWSLT datasets are a bit tricky for a couple of reasons:
    • They have nested tarballs. I have submitted a PR upstream to torchdata to add a flatmap datapipe and plan to add it to the torchtext repo until that's merged.
    • The files get cleaned and rewritten in place. This isn't a strict requirement, but it's a bit delicate, so I'm trying to proceed with caution.
    • There's a lot of conditional logic depending on the file format. I'll need to think about how best to handle this. Probably just mapping a function which abstracts the conditions... I hope it'll be serializable.

In the meantime, I'll take a stab at Multi30k -- it looks quite straightforward. I can't seem to find the CC100 and SST2 impls -- are these new datasets for torchtext?

@parmeet
Contributor Author

parmeet commented Jan 24, 2022

  • UDPOS is done. I can't check it off here, but maybe someone else can. 😄

Done :)

  • They have nested tarballs. I have submitted a PR upstream to torchdata to add a flatmap datapipe and plan to add it to the torchtext repo until that's merged.

Could you please provide more details as to where this is required?

In the meantime, I'll take a stab at Multi30k

Great, thanks!

I can't seem to find CC100 and SST2 impls -- are these new datasets for torchtext?

SST-2 is in the experimental folder. @Nayef211 is working on it and is currently blocked by some internal fbcode dependency. I am looking at CC100. It is already added as a notebook example here. Unfortunately, the host seems to be quite unresponsive, so I am currently blocked on adding this one. I am trying to contact some internal folks to figure out the path forward.

@erip
Contributor

erip commented Jan 24, 2022

Could you please help provide more details ...

Yes, it's here. We extract contents from a tarball, and inside the tarball are other tarballs. This requires something roughly like:

# [ 1.tgz, 2.tgz, ..., n.tgz]
inner_tars = outer_tar.read_from_tar()

# Yuck, an IterableDataPipe of IterableDataPipes... flatmap to the rescue
# [ 1_1.txt, 1_2.txt, 2_1.txt, 2_2.txt, 3_1.txt, ...]
inner_inner_files = inner_tars.flatmap(lambda inner_tar: FileOpener(inner_tar).read_from_tar())

@parmeet
Contributor Author

parmeet commented Jan 26, 2022

inner_inner_files = inner_tars.flatmap(lambda inner_tar: FileOpener(inner_tar).read_from_tar())

I see, but wouldn't the tar datapipe already do this recursively and return all the inner files?

@erip
Contributor

erip commented Jan 26, 2022

@parmeet oh, I'm not sure... I didn't think so, but I hadn't tried yet, so I don't know. I'll give it a shot tomorrow.

@erip
Contributor

erip commented Jan 26, 2022

@parmeet looking at the implementation of the datapipe behind the read_from_tar functional, it doesn't extract recursively. The PR in torchdata seems to be moving forward, so it will likely not be something we need to worry about much.
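The non-recursive behavior can be reproduced with the standard library alone: reading the outer tar yields the inner tarballs as opaque members, and a second, explicit pass (the "flatmap" step) is needed to reach the innermost files. A self-contained sketch using in-memory tars, with made-up file names:

```python
import io
import tarfile

def make_tar(files):
    """Build an in-memory (uncompressed) tar archive from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def read_nested(outer_buf):
    """Yield (name, bytes) for files inside tars that are inside the outer
    tar. The inner loop is the explicit 'flatmap' step that a non-recursive
    tar reader does not perform for you."""
    with tarfile.open(fileobj=outer_buf) as outer:
        for member in outer.getmembers():
            # Each outer member is itself a tar archive; open it in turn.
            inner_buf = io.BytesIO(outer.extractfile(member).read())
            with tarfile.open(fileobj=inner_buf) as inner:
                for f in inner.getmembers():
                    yield f.name, inner.extractfile(f).read()

inner = make_tar({"1_1.txt": b"a", "1_2.txt": b"b"})
outer = make_tar({"1.tgz": inner.getvalue()})
files = dict(read_nested(outer))
# files == {"1_1.txt": b"a", "1_2.txt": b"b"}
```

A single `read_from_tar`-style pass over `outer` would stop at `1.tgz`; the flatmap-style inner pass is what surfaces `1_1.txt` and `1_2.txt`.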

@parmeet
Contributor Author

parmeet commented Jan 27, 2022

The PR in torchdata seems to be moving forward, so it will likely not be something we need to worry about much.

Awesome! Thanks so much @erip for taking it forward. I guess these are the only two datasets left (apart from CC-100 :) ). Please feel free to post if there are any blockers :).

facebook-github-bot pushed a commit to facebookresearch/recipes that referenced this issue Jan 28, 2022
Summary:
## Summary
- Updated datamodule to work with any torchtext dataset
   - No longer checking to see whether the dataset is an instance of the SST2Dataset
   - Updated `DocClassificationDataModuleConf` dataclass to take in user provided `columns` and `label_column` fields since different datasets have different column orderings
- Updated tests to use patching for testing with mocked datasets similar to what is done in OSS for the [AmazonReviewPolarity dataset test](pytorch/text#1532)
- Removed dataset test from torchrecipe since the torchtext repo unittests provide adequate coverage

## Followup Items
- [ ] Update instantiation call for datasets to work with functional API as opposed to class API once the SST2 dataset has been migrated out of experimental ([reference GH issue](pytorch/text#1494))

Reviewed By: abhinavarora, mthrok, parmeet

Differential Revision: D33775443

fbshipit-source-id: 1e6545949808ec5bd0e13cf3f9e7aaea08d68a59
@erip erip mentioned this issue Jan 31, 2022
@parmeet
Contributor Author

parmeet commented Feb 4, 2022

This issue is officially closed now. For any improvements, bug fixes, CI dependencies, or new datasets, we can follow up in separate threads. Thank you everyone for your contributions in realizing this issue sooner than I expected :).

cc: @erip, @Nayef211, @abhinavarora, @ejguan, @VirgileHlav

@Nayef211
Contributor

Nayef211 commented Feb 4, 2022

Thanks for leading the efforts on this migration @parmeet! It was great to see how fast we worked together to get this done 😄

7 participants