MultiProcDataset, high memory usage #1443
Note, to debug this, you can now use [...]
Some log:
The increase in USS (unique set size) combined with rather low values for shared memory is pretty much in line with the blog post, right?
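(For reference, a minimal way to inspect these numbers per process with psutil, independent of whatever RETURNN itself provides here; the exact fields assume Linux, and the helper name is just illustrative:)

```python
import psutil

def log_memory(pid=None):
    """Print RSS, shared and USS memory of a process (Linux fields via psutil)."""
    proc = psutil.Process(pid)       # default: current process
    info = proc.memory_info()        # rss, vms, shared, ...
    full = proc.memory_full_info()   # additionally uss, pss, swap
    print(
        f"pid={proc.pid} rss={info.rss / 2**20:.1f}MB "
        f"shared={info.shared / 2**20:.1f}MB uss={full.uss / 2**20:.1f}MB"
    )
```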
Yes. We don't use fork here, so these are all separate processes without shared memory from the beginning. And they all consume quite a bit of memory because each of them independently loads the ZIP structure and everything else into memory. It's about 700MB per process, so not too much on its own, but this is times 5 (4 workers + 1 seq order worker) times 3 (train, dev, devtrain). I think with Python 3.10 it was probably a bit less (maybe 600MB), but that difference caused the OOM for me.

So there is not really a bug here; it's more a fundamental design question, and I don't really have a good idea how to solve it. Maybe OggZipDataset is just not optimal here because of the use of ZIP files, and we should use some other dataset which requires less memory per instance, maybe HDFDataset. (Btw, we could also embed Ogg files into HDF; I want to set this up later.) Or we manage that OggZipDataset shares the memory for the ZIP structure. But this is tricky to do in a way that does not run into the problems also described in the blog post. I think we cannot do this in pure Python, as Python objects always have refcounts, which are constantly modified, so the pages would not stay shared. We could implement this in C++; that would probably work, but it's quite a complex solution.

My current solution is that I have just increased the memory requirement. I had 15GB before, but that was slightly too little, and now I have set it to 30GB. Our gpu_24gb nodes have 4 GPUs and 251GB memory, so using 30GB is still not really that much there.
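(As a rough illustration of the "embed Ogg files into HDF" idea, not RETURNN's actual HDFDataset format: the raw Ogg bytes could be stored as variable-length uint8 arrays with h5py, e.g.:)

```python
import h5py
import numpy as np

def pack_oggs_to_hdf(ogg_filenames, hdf_filename):
    """Store raw Ogg bytes as variable-length uint8 arrays in one HDF5 file.
    This is only a sketch of the idea, not RETURNN's HDF format."""
    vlen_bytes = h5py.vlen_dtype(np.dtype("uint8"))
    with h5py.File(hdf_filename, "w") as f:
        ds = f.create_dataset("ogg_raw", shape=(len(ogg_filenames),), dtype=vlen_bytes)
        for i, fn in enumerate(ogg_filenames):
            with open(fn, "rb") as ogg_file:
                ds[i] = np.frombuffer(ogg_file.read(), dtype=np.uint8)
```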
When using PyTorch with the DataLoader option [...]. So, what does this mean in practice? E.g. with DataLoader [...]:
So it means we have 34 procs. (With the fork start method in the DataLoader, it was 16 + 3 (DataLoader workers), so 19.) In the case of multi-GPU training with 4 workers, all of this is again multiplied by 4. The number of procs is not necessarily a problem, though; the problem is that each of them requires so much memory. Two optimizations we can do: [...]
Both items from the last comment are implemented now. I have also closed the issue, because I think that is all we can do for MultiProcDataset. The issue of high memory usage is otherwise more a problem of the underlying dataset. And if the underlying dataset wants to share some data across workers, I also don't think that MultiProcDataset can do this.

Maybe one other idea, although I'm not sure whether we should do this: we could still try to share data automatically. Instead of creating separate procs, we would only create one worker in the beginning, and then, after the first init_seq_order, fork there to create the other workers (see the sketch below).
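(A minimal sketch of that fork-after-init idea, independent of the actual MultiProcDataset code; make_dataset and get_seq are just placeholders here:)

```python
import multiprocessing as mp

def _worker_loop(dataset, worker_idx, conn):
    """Serve requests on an already-initialized dataset copy (shared via fork CoW)."""
    while True:
        request = conn.recv()
        if request is None:
            break
        conn.send(dataset.get_seq(request))  # placeholder API

def start_workers_after_init(make_dataset, num_workers):
    """Load the dataset once, then fork, so the loaded structures start out as
    shared copy-on-write pages instead of being loaded once per worker."""
    dataset = make_dataset()          # heavy init happens exactly once
    dataset.init_seq_order(epoch=1)   # placeholder for the first init_seq_order
    ctx = mp.get_context("fork")      # fork is required for the sharing to work
    workers = []
    for idx in range(num_workers):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=_worker_loop, args=(dataset, idx, child_conn))
        proc.start()
        workers.append((proc, parent_conn))
    return workers
```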
While those two changes have certainly improved the situation, the memory usage is still high. And now I also got an OOM again, in a single-node multi-GPU training. The machine has a bit more than 60GB of memory, and I occupied all of it (my Slurm limit was 60GB). Memory log before the OOM:
Looking at one individual TDL (Torch DataLoader) worker:
Looking at one MPD (MultiProcDataset) worker:
Maybe we really have to think about some of the other solutions. E.g., there are also TorchSerializedList and SharedList.
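(For context, the core trick behind such serialized lists, as also described in the blog post, roughly looks like this; a simplified sketch, not the actual TorchSerializedList or SharedList implementation:)

```python
import pickle
import numpy as np

class SerializedList:
    """Store a list of picklable items as one big numpy byte buffer plus offsets.
    Since there are no per-item Python objects left, forked workers never touch
    per-item refcounts, so the copy-on-write pages stay shared."""

    def __init__(self, items):
        serialized = [pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL) for item in items]
        self._offsets = np.cumsum([len(s) for s in serialized])
        self._buffer = np.frombuffer(b"".join(serialized), dtype=np.uint8)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._offsets[idx - 1])
        end = int(self._offsets[idx])
        return pickle.loads(self._buffer[start:end].tobytes())
```

The point is that only two flat arrays remain per list, instead of millions of small Python objects whose refcounts get modified on every access.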
There is one (relatively) simple thing we can do in the case of PyTorch DataLoader multiprocessing: we know that the original dataset in the main proc is not going to be used anymore, so we can free that one.
This is also implemented now, specifically for MultiProcDataset (and additionally for OggZipDataset; it can later be extended, but that is not needed when MultiProcDataset is used). I think this is really all we can realistically do for MultiProcDataset itself. For further discussion on how to improve the overall situation that dataset memory consumption is high (caused maybe by MultiProcDataset, but also by PyTorch DataLoader multiprocessing, distributed training, etc.), see the new issue #1498.
Example config:
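(The original config is not reproduced here; a minimal sketch of a MultiProcDataset-over-OggZipDataset setup in a RETURNN config might look like the following. The paths and specific option values are only illustrative:)

```python
# Sketch of a RETURNN train dataset definition wrapping OggZipDataset
# in MultiProcDataset. Paths and option values are illustrative only.
train = {
    "class": "MultiProcDataset",
    "num_workers": 4,
    "buffer_size": 10,
    "dataset": {
        "class": "OggZipDataset",
        "path": "data/train.ogg.zip",
        "audio": {"features": "raw", "sample_rate": 16000},
        "targets": {"class": "SentencePieces", "model_file": "spm.model"},
        "partition_epoch": 20,
        "seq_ordering": "laplace:.1000",
    },
}
```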
Now this runs out of CPU memory.
(Actually I don't exactly see where the memory goes, but I suspect it is MultiProcDataset.)
I don't know how relevant it is that this is with OggZipDataset.
Strangely, this happens only after I upgraded my Python environment, and did not happen before. Before I had Python 3.10 with PyTorch 2.0.1, and now I have Python 3.11 with PyTorch 2.1.0 (reason: pytorch/pytorch#111764). But maybe I was already very close to the memory limit and Python 3.11 added a little bit more, and this is now too much. So probably the problem with lots of memory usage already existed before.
@vieting (or his student assistant) independently also just stumbled upon this problem now. But his buffer size was much larger, so maybe that was the main problem there. In his case, it is also with OggZipDataset.
I know that people have reported similar memory consumption problems with the PyTorch DataLoader when multiple workers are used. The implementation and situation are very similar to MultiProcDataset, so maybe it's the same problem. In that case, it is just a fundamental property of Python, and not really a bug, so we can only optimize it a bit more. (I will add some links/references when I find them.)
Edit: One such issue is this: pytorch/pytorch#13246 (comment). You will find many users reporting there. There are multiple issues. One of them occurs when just using fork as the start method. In that case, the memory consumption is low at first, because fork duplicates all memory pages only lazily via copy-on-write logic, i.e. a memory page is copied only when it is modified. Depending on your code, Python will modify that memory anyway (e.g. merely accessing Python objects changes their refcount), which leads to a real copy of the memory page. Thus, over time, all memory pages get copied and the memory consumption grows. It looks like a memory leak, but it is actually just shared memory being converted into non-shared memory. Note that this is not the issue we have, because our subprocesses don't share memory with the parent anyway. However, there are still other memory consumption issues with PyTorch DataLoader multiprocessing which are more relevant to us.

A blog post describing the issues: ppwwyyxx.com: Demystify RAM Usage in Multi-Process Data Loaders
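(A tiny demonstration of this copy-on-write effect, assuming Linux and psutil available; the exact numbers will vary by machine:)

```python
import os
import psutil

def child_uss_mb():
    """USS (unique set size): memory that is NOT shared with any other process."""
    return psutil.Process(os.getpid()).memory_full_info().uss / 2**20

# A big structure made of many small Python objects (each carries a refcount).
data = [str(i) for i in range(5_000_000)]

pid = os.fork()
if pid == 0:
    # Child: right after fork, almost all pages are still shared (low USS).
    print("child USS before touching: %.1f MB" % child_uss_mb())
    # Merely iterating touches every object's refcount, which dirties the
    # pages those objects live on, so copy-on-write turns into real copies.
    for item in data:
        pass
    print("child USS after touching:  %.1f MB" % child_uss_mb())
    os._exit(0)
os.waitpid(pid, 0)
```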