On BatchedDataLoader performance #740

Closed
jarandaf opened this issue Mar 14, 2022 · 8 comments

@jarandaf

Hi all,

I am trying to train a PyTorch model on a pretty big dataset (on the order of millions of samples, ~100 columns, including scalars and arrays) stored as Parquet files. After reading the docs, BatchedDataLoader seems to be the right choice.

I have been looking at the BatchedDataLoader class and, although Parquet files are read in parallel with PyArrow, batches seem to be built on demand in an iterative way. This does not keep the GPU busy: during training I don't observe GPU usage above 20%, and the usage is very unstable.

I am afraid the GPU sits idle waiting for those batches to be built. Would it be possible to build them in advance?

@selitvin
Collaborator

The batch-building implementation in BatchedDataLoader should be fairly efficient. Are you sure the slowness comes from BatchedDataLoader itself? Could it be that the data is not supplied to the BatchedDataLoader fast enough? Did you try tweaking the parameters you pass to make_batch_reader (specifically reader_pool_type and workers_count)?
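
For example, a quick experiment along those lines might look like this (the dataset URL and the concrete values here are placeholders to try, not recommendations):

from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

# Vary reader_pool_type and workers_count and compare batch throughput.
reader = make_batch_reader('file:///path/to/dataset.parquet',  # placeholder URL
                           reader_pool_type='thread',          # also try 'process'
                           workers_count=16)                    # try more/fewer workers
with BatchedDataLoader(reader, batch_size=1024) as loader:
    for batch in loader:
        pass  # consume batches and measure throughput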

@jarandaf
Author

jarandaf commented Mar 15, 2022

Hi @selitvin, thank you for your answer.

Yes, I tried both arguments and did not notice big improvements (thread vs. process pool, more/fewer workers, etc.).

> Could it be that the data is not supplied fast enough to the BatchedDataLoader?

As far as I understand, regardless of how fast the Parquet files are read in parallel and made available to the underlying results queue, the batches are still built on demand while iterating the BatchedDataLoader, right?

I profiled a piece of code that simply consumed the dataset as follows:

from tqdm import tqdm
from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

PARQUET_PATH = 'file:///Users/jarandaf/some_big_dataset.parquet'
READER_POOL_TYPE = 'thread'
N_WORKERS = 10
BATCH_SIZE = 1024
COLS2KEEP = [...]  # list of columns to load, around 100

reader = make_batch_reader(PARQUET_PATH,
                           reader_pool_type=READER_POOL_TYPE,
                           workers_count=N_WORKERS,
                           schema_fields=COLS2KEEP)
with BatchedDataLoader(reader, batch_size=BATCH_SIZE) as loader:
    for i, batch in tqdm(enumerate(loader)):
        pass

I observed a throughput of around 30 batches/s. From the profiling results it seems that more time is spent building batches than actually reading the Parquet files and converting them to the proper types (I found this quite surprising).

[Attached image: profiling results ("profile")]

Note: You can download the above image and open it with your browser to see more details.

Does all this look reasonable for such a dataset (~100 columns, a couple of them arrays), or would you expect higher reading performance? I should mention that if I only select a couple of columns, the dataset is read blazingly fast.

@selitvin
Collaborator

Got it. Interesting. Indeed, a large number of columns is tricky: since it is handled by these two loops, it might end up pretty slow.

Couple of ideas:

  1. Add pipelining (a thread + a queue): this way batch construction would be done in parallel with training. I think it's not a large undertaking and can be done externally to petastorm; alternatively, we could think of adding this as a built-in feature of BatchedDataLoader.
  2. Your proposal of moving the batching into the worker processes/threads might be doable, but I am afraid it would be a bit trickier to implement. Also, a concern: shuffling done on a per-worker basis would result in worse shuffling quality.

If you are interested, feel free to propose a PR - we can work together to get it into the petastorm codebase.

@jarandaf
Author

Could you please elaborate on 1?

@selitvin
Collaborator

In (1) I am referring to the following idea:

  • Implement a class that has the same interface as BatchedDataLoader
  • It is instantiated with an instance of petastorm's BatchedDataLoader
  • In the constructor it instantiates a queue (bounded size) and a thread
  • On the thread, we continuously read batches from BatchedDataLoader and store results in the queue
  • __iter__ of the new class returns data from the queue.

This way, batching is done on a background thread, and the main thread can drive GPU-based training while the CPU (and the GIL) is busy creating batches.
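
For concreteness, a minimal sketch of such a wrapper (class and parameter names here are illustrative, not petastorm API):

import queue
import threading

class PrefetchingLoader(object):
    """Wraps an existing BatchedDataLoader and builds batches on a background
    thread, handing them to the training loop through a bounded queue."""

    _DONE = object()  # sentinel marking the end of the wrapped loader

    def __init__(self, loader, max_prefetch=10):
        self._loader = loader
        self._queue = queue.Queue(maxsize=max_prefetch)
        self._thread = threading.Thread(target=self._producer, daemon=True)
        self._thread.start()

    def _producer(self):
        # Continuously pull batches from the wrapped loader; put() blocks when
        # the queue is full, so memory usage stays bounded.
        for batch in self._loader:
            self._queue.put(batch)
        self._queue.put(self._DONE)

    def __iter__(self):
        while True:
            batch = self._queue.get()
            if batch is self._DONE:
                return
            yield batch

The training loop would then iterate over PrefetchingLoader(loader) instead of loader directly, e.g.:

with BatchedDataLoader(reader, batch_size=BATCH_SIZE) as loader:
    for batch in PrefetchingLoader(loader):
        train_step(batch)  # train_step is a placeholder for the training code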

@jarandaf
Author

Thank you for your clarification. Yeah, I implemented this and it indeed improved GPU usage. I might submit a PR in the coming days.

@selitvin
Collaborator

I wonder if we should put this mechanism behind the current BatchedDataLoader implementation, made optional with a switch.
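
Purely as an illustration of what that switch might look like (the prefetch_batches argument is hypothetical, not an existing BatchedDataLoader parameter):

# Hypothetical API sketch; prefetch_batches is not an existing petastorm argument.
with BatchedDataLoader(reader, batch_size=1024, prefetch_batches=10) as loader:
    for batch in loader:
        pass  # with prefetch_batches > 0, batches would be built on a background thread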

@jarandaf
Author

I think it definitely makes sense!
