
Added Dask processor and Petastorm reader to train large datasets #970

Closed
wants to merge 79 commits

Conversation


@tgaddair tgaddair commented Oct 25, 2020

This pull request accomplishes the following goals:

  1. Add support for Dask as a (mostly) drop-in replacement for Pandas during preprocessing
  2. Add support for training on large out-of-memory datasets with Petastorm
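To illustrate goal 1, here is a minimal sketch (not Ludwig's actual API; `get_dataframe_module` and `normalize_column` are hypothetical names) of how a preprocessing pipeline can treat Dask as a mostly drop-in replacement for Pandas by selecting a backend module that exposes the same DataFrame API:

```python
import pandas as pd


def get_dataframe_module(backend="pandas"):
    """Return a module exposing a (mostly) pandas-compatible DataFrame API.

    Hypothetical helper for illustration: with ``backend="dask"`` the same
    preprocessing code runs lazily and out-of-core on large datasets.
    """
    if backend == "dask":
        import dask.dataframe as dd  # lazy import; only needed for large data
        return dd
    return pd


def normalize_column(df, col):
    """Z-score a column; works unchanged on a pandas or Dask DataFrame."""
    return (df[col] - df[col].mean()) / df[col].std()
```

With the pandas backend the expression evaluates eagerly; with Dask, `mean()` and `std()` build a lazy task graph that is only materialized on `.compute()`, which is what makes larger-than-memory preprocessing possible.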

The previous implementation of Dataset assumed all data was stored in an in-memory dictionary, and the previous implementation of Batcher assumed the dataset provided random access. This PR introduces separate datasets for in-memory data and for Parquet-file data (read with Petastorm), as well as separate batchers for random-access and iterable datasets.
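The random-access vs. iterable split described above can be sketched as follows (hedged illustration only; these are not Ludwig's actual classes, and the names are hypothetical):

```python
class RandomAccessBatcher:
    """Batches a dataset that supports slicing (e.g. in-memory arrays)."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        # Random access lets us jump directly to each batch boundary.
        for start in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[start:start + self.batch_size]


class IterableBatcher:
    """Batches a dataset that only supports sequential reads
    (e.g. a Petastorm reader streaming rows from Parquet files)."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        # No random access: accumulate rows until a batch is full.
        batch = []
        for row in self.dataset:
            batch.append(row)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch
            yield batch
```

Both batchers yield the same batches for the same data; the difference is that the iterable variant never needs to index into the dataset, so it works on streamed, out-of-memory sources.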

@tgaddair tgaddair changed the title [WIP] Initial commit of Dask processor for large datasets [WIP] Initial commit of Dask processor and Petastorm reader for large datasets Oct 25, 2020
@tgaddair tgaddair marked this pull request as ready for review November 1, 2020 00:12
@tgaddair tgaddair changed the title [WIP] Added Dask processor and Petastorm reader to train large datasets Added Dask processor and Petastorm reader to train large datasets Nov 1, 2020

tgaddair commented Jun 2, 2021

Superseded by #1090.

@tgaddair tgaddair closed this Jun 2, 2021
@tgaddair tgaddair deleted the dask branch June 2, 2021 21:26