On BatchedDataLoader performance #740
The batch building implementation in BatchedDataLoader should be fairly efficient. Are you sure the slowness comes from […]
Hi @selitvin, thank you for your answer. Yes, I tried both arguments and did not notice big improvements (thread vs process pool, +/- workers, etc).
As far as I understand, independently of how fast the parquet files are read in parallel and the results are made available to the underlying results queue, the batches are built on demand while iterating the loader. I profiled a piece of code that simply consumed the dataset as follows:

```python
from tqdm import tqdm
from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

PARQUET_PATH = 'file:///Users/jarandaf/some_big_dataset.parquet'
READER_POOL_TYPE = 'thread'
N_WORKERS = 10
BATCH_SIZE = 1024
COLS2KEEP = [...]  # list of columns to load, around 100

reader = make_batch_reader(PARQUET_PATH, reader_pool_type=READER_POOL_TYPE,
                           workers_count=N_WORKERS, schema_fields=COLS2KEEP)
with BatchedDataLoader(reader, batch_size=BATCH_SIZE) as loader:
    for i, batch in tqdm(enumerate(loader)):
        pass
```

I observed a throughput of around ~30 batches/s. From the profiling results (attached as an image) it seems that building the batches takes more time than actually reading the parquet files and converting them to the proper types, which I found quite surprising. Does all this look reasonable for such a dataset (~100 columns, a couple of them arrays), or would you expect higher reading performance? I must mention that if I only select a couple of columns, the dataset is read blazingly fast.
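A rough way to attribute the time (a sketch reusing the constants above, not part of the original comment) is to time the bare reader against the wrapped loader under the same settings; the difference is roughly the cost of re-batching into fixed-size tensor batches:

```python
import time

from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

def time_iteration(iterable, label):
    # Iterate once, discarding items, and report item count and wall-clock time.
    start = time.time()
    count = sum(1 for _ in iterable)
    print('%s: %d items in %.1f s' % (label, count, time.time() - start))

# 1) Bare reader: each item is a chunk of rows as produced by the parquet workers.
with make_batch_reader(PARQUET_PATH, reader_pool_type=READER_POOL_TYPE,
                       workers_count=N_WORKERS, schema_fields=COLS2KEEP) as reader:
    time_iteration(reader, 'raw reader')

# 2) Reader wrapped in BatchedDataLoader: each item is a fixed-size batch of tensors.
reader = make_batch_reader(PARQUET_PATH, reader_pool_type=READER_POOL_TYPE,
                           workers_count=N_WORKERS, schema_fields=COLS2KEEP)
with BatchedDataLoader(reader, batch_size=BATCH_SIZE) as loader:
    time_iteration(loader, 'BatchedDataLoader')
```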
Got it, interesting. Indeed, a large number of columns is tricky: since it is handled by these two loops, it might end up pretty slow. A couple of ideas: […]
If you are interested, feel free to propose a PR - we can work together to get it into the petastorm codebase.
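For illustration only (hypothetical column names and data, not petastorm's actual loop), a minimal sketch of why assembling batches column by column in pure Python becomes expensive once there are ~100 columns: every batch pays Python-level overhead per column and per row, even when each individual copy is cheap.

```python
import time
import numpy as np

def build_batch_columnwise(rows, column_names):
    # Hypothetical illustration: assemble one batch by looping over every
    # column (outer loop) and every row in the batch (inner loop).
    batch = {}
    for name in column_names:                                   # loop over columns
        batch[name] = np.asarray([row[name] for row in rows])   # loop over rows
    return batch

column_names = ['col_%d' % i for i in range(100)]
rows = [{name: 1.0 for name in column_names} for _ in range(1024)]

start = time.time()
n_batches = 10
for _ in range(n_batches):
    build_batch_columnwise(rows, column_names)
print('~%.3f s per 1024-row batch with 100 columns' % ((time.time() - start) / n_batches))
```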
Could you please elaborate on (1)?
In (1) I am referring to the following idea: […] This way, the batching will be done on a background thread, so the main thread can drive GPU-based training while the CPU/GIL is busy creating batches.
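A minimal sketch of that idea, assuming a generic wrapper around the loader (the PrefetchingLoader class, its queue_size parameter, and train_step are illustrative names, not petastorm API): a daemon thread pre-builds batches and hands them to the training loop through a bounded queue.

```python
import queue
import threading

class PrefetchingLoader:
    """Illustrative sketch only: pre-build batches on a background thread and
    hand them to the training loop via a bounded queue."""

    _SENTINEL = object()

    def __init__(self, loader, queue_size=10):
        self._loader = loader
        self._queue = queue.Queue(maxsize=queue_size)
        self._thread = threading.Thread(target=self._worker, daemon=True)

    def _worker(self):
        # Batch building (and the Python/GIL work it implies) happens here,
        # overlapping with the GPU compute driven by the main thread.
        for batch in self._loader:
            self._queue.put(batch)
        self._queue.put(self._SENTINEL)

    def __iter__(self):
        self._thread.start()
        while True:
            batch = self._queue.get()
            if batch is self._SENTINEL:
                return
            yield batch

# Hypothetical usage, wrapping the loader from the snippet above:
# with BatchedDataLoader(reader, batch_size=BATCH_SIZE) as loader:
#     for batch in PrefetchingLoader(loader, queue_size=10):
#         train_step(batch)  # train_step is a placeholder for the training loop body
```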
Thank you for your clarification. Yeah, I implemented this and it indeed improved GPU usage. I might submit a PR in the coming days.
I wonder if we should put this mechanism behind the current […]
I think it definitely makes sense!
Hi all,
I am trying to train a PyTorch model with a pretty big dataset (on the order of millions of samples, ~100 columns, including scalars and arrays) stored as Parquet files. After reading the docs, it seems BatchedDataLoader should be the right choice. I have been having a look at the BatchedDataLoader class, and despite the parquet files being read in parallel with PyArrow, the batches seem to be built on demand in an iterative way. This does not fully leverage the processing power of the GPU: during training I don't observe GPU usage above 20%, and the usage is very unstable. I am afraid the GPU is idle, waiting for all those batches to be built. Would it be possible to build them in advance?