
Added Dask processor and Petastorm reader to train large datasets #970

Closed
wants to merge 79 commits

Conversation


@tgaddair tgaddair commented Oct 25, 2020

This pull request accomplishes the following goals:

  1. Add support for Dask as a (mostly) drop-in replacement for Pandas during preprocessing
  2. Add support for training on large out-of-memory datasets with Petastorm
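To illustrate goal 1, here is a minimal sketch (not Ludwig's actual API; `get_dataframe_module` and `normalize_column` are hypothetical names) of how a preprocessing pipeline can treat Dask as a mostly drop-in replacement for Pandas by selecting a backend module that exposes the same DataFrame API:

```python
import pandas as pd


def get_dataframe_module(backend="pandas"):
    """Return a module exposing a (mostly) pandas-compatible DataFrame API.

    Hypothetical helper for illustration: with ``backend="dask"`` the same
    preprocessing code runs lazily and out-of-core on large datasets.
    """
    if backend == "dask":
        import dask.dataframe as dd  # lazy import; only needed for large data
        return dd
    return pd


def normalize_column(df, col):
    """Z-score a column; works unchanged on a pandas or Dask DataFrame."""
    return (df[col] - df[col].mean()) / df[col].std()
```

With the pandas backend the expression evaluates eagerly; with Dask, `mean()` and `std()` build a lazy task graph that is only materialized on `.compute()`, which is what makes larger-than-memory preprocessing possible.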

The previous implementation of Dataset assumed all data was stored in an in-memory dictionary, and the previous implementation of Batcher assumed the dataset provided random access. This PR introduces separate datasets for in-memory data and for Parquet-file data (read with Petastorm), as well as separate batchers for random-access and iterable datasets.
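The random-access vs. iterable split described above can be sketched as follows (hedged illustration only; these are not Ludwig's actual classes, and the names are hypothetical):

```python
class RandomAccessBatcher:
    """Batches a dataset that supports slicing (e.g. in-memory arrays)."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        # Random access lets us jump directly to each batch boundary.
        for start in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[start:start + self.batch_size]


class IterableBatcher:
    """Batches a dataset that only supports sequential reads
    (e.g. a Petastorm reader streaming rows from Parquet files)."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        # No random access: accumulate rows until a batch is full.
        batch = []
        for row in self.dataset:
            batch.append(row)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch
            yield batch
```

Both batchers yield the same batches for the same data; the difference is that the iterable variant never needs to index into the dataset, so it works on streamed, out-of-memory sources.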

@tgaddair tgaddair changed the title [WIP] Initial commit of Dask processor for large datasets [WIP] Initial commit of Dask processor and Petastorm reader for large datasets Oct 25, 2020
@tgaddair tgaddair marked this pull request as ready for review November 1, 2020 00:12
@tgaddair tgaddair changed the title [WIP] Added Dask processor and Petastorm reader to train large datasets Added Dask processor and Petastorm reader to train large datasets Nov 1, 2020

tgaddair commented Jun 2, 2021

Superseded by #1090.

@tgaddair tgaddair closed this Jun 2, 2021
@tgaddair tgaddair deleted the dask branch June 2, 2021 21:26