MultiProcDataset, high memory usage #1443
Note, to debug this, you can now use [...]
Some log:
The increase in USS (unique set size) combined with rather low values for shared memory is pretty much in line with the blog post, right?
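(For reference, a minimal way to inspect these numbers per process with psutil, independent of whatever RETURNN itself provides here; the exact fields assume Linux, and the helper name is just illustrative:)

```python
import psutil

def log_memory(pid=None):
    """Print RSS, shared and USS memory of a process (Linux fields via psutil)."""
    proc = psutil.Process(pid)       # default: current process
    info = proc.memory_info()        # rss, vms, shared, ...
    full = proc.memory_full_info()   # additionally uss, pss, swap
    print(
        f"pid={proc.pid} rss={info.rss / 2**20:.1f}MB "
        f"shared={info.shared / 2**20:.1f}MB uss={full.uss / 2**20:.1f}MB"
    )
```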
Yes. We don't use fork here, so these are all separate processes without shared memory from the beginning. And they all consume quite a bit of memory because each of them independently loads the ZIP structure and everything else into memory. It's about 700MB per process, so not too much on its own, but this is times 5 (4 workers + 1 seq order worker) times 3 (train, dev, devtrain). I think with Python 3.10 it was probably a bit less (maybe 600MB), but that difference caused the OOM for me.

So there is not really a bug here; it's more a fundamental design question, and I don't really have a good idea how to solve it. Maybe OggZipDataset is just not optimal here because of the use of ZIP files, and we should use some other dataset which requires less memory per instance, maybe HDFDataset. (Btw, we could also embed Ogg files into HDF; I want to set this up later.) Or we manage that OggZipDataset shares the memory for the ZIP structure. But this is tricky to do in a way that does not run into the problems also described in the blog post. I think we cannot do this in pure Python, as Python objects always have refcounts, which are constantly modified, so the pages would not stay shared. We could implement this in C++; that would probably work, but it's quite a complex solution.

My current solution is that I have just increased the memory requirement. I had 15GB before, but that was slightly too little, and now I have set it to 30GB. Our gpu_24gb nodes have 4 GPUs and 251GB memory, so using 30GB is still not really that much there.
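(As a rough illustration of the "embed Ogg files into HDF" idea, not RETURNN's actual HDFDataset format: the raw Ogg bytes could be stored as variable-length uint8 arrays with h5py, e.g.:)

```python
import h5py
import numpy as np

def pack_oggs_to_hdf(ogg_filenames, hdf_filename):
    """Store raw Ogg bytes as variable-length uint8 arrays in one HDF5 file.
    This is only a sketch of the idea, not RETURNN's HDF format."""
    vlen_bytes = h5py.vlen_dtype(np.dtype("uint8"))
    with h5py.File(hdf_filename, "w") as f:
        ds = f.create_dataset("ogg_raw", shape=(len(ogg_filenames),), dtype=vlen_bytes)
        for i, fn in enumerate(ogg_filenames):
            with open(fn, "rb") as ogg_file:
                ds[i] = np.frombuffer(ogg_file.read(), dtype=np.uint8)
```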
When using PyTorch with the DataLoader option [...]. So, what does this mean in practice? E.g. with DataLoader [...]:
So it means we have 34 procs. (With the fork start method in the DataLoader, it was 16 + 3 (DataLoader workers), so 19.) In the case of multi-GPU training with 4 workers, all of this is again multiplied by 4. The number of procs is not necessarily a problem, though; the problem is that each of them requires so much memory. Two optimizations we can do: [...]
Both items from the last comment are implemented now. I have also closed the issue, because I think that is all we can do for MultiProcDataset. The issue of high memory usage is otherwise more a problem of the underlying dataset. And if the underlying dataset wants to share some data across workers, I also don't think that MultiProcDataset can do this.

Maybe one other idea, although I'm not sure whether we should do this: we could still try to share data automatically. Instead of creating separate procs, we would only create one worker in the beginning, and then, after the first init_seq_order, fork there to create the other workers (see the sketch below).
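(A minimal sketch of that fork-after-init idea, independent of the actual MultiProcDataset code; make_dataset and get_seq are just placeholders here:)

```python
import multiprocessing as mp

def _worker_loop(dataset, worker_idx, conn):
    """Serve requests on an already-initialized dataset copy (shared via fork CoW)."""
    while True:
        request = conn.recv()
        if request is None:
            break
        conn.send(dataset.get_seq(request))  # placeholder API

def start_workers_after_init(make_dataset, num_workers):
    """Load the dataset once, then fork, so the loaded structures start out as
    shared copy-on-write pages instead of being loaded once per worker."""
    dataset = make_dataset()          # heavy init happens exactly once
    dataset.init_seq_order(epoch=1)   # placeholder for the first init_seq_order
    ctx = mp.get_context("fork")      # fork is required for the sharing to work
    workers = []
    for idx in range(num_workers):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=_worker_loop, args=(dataset, idx, child_conn))
        proc.start()
        workers.append((proc, parent_conn))
    return workers
```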
While those two changes have certainly improved the situation, the memory usage is still high. And now I also got an OOM again, in a single-node multi-GPU training. The machine has a bit more than 60GB of memory, and I occupied all of it (my Slurm limit was 60GB). Memory log before the OOM:
Looking at one individual TDL (Torch DataLoader) worker:
Looking at one MPD (MultiProcDataset) worker:
Maybe we really have to think about some of the other solutions. E.g., there are also TorchSerializedList and SharedList.
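(For context, the core trick behind such serialized lists, as also described in the blog post, roughly looks like this; a simplified sketch, not the actual TorchSerializedList or SharedList implementation:)

```python
import pickle
import numpy as np

class SerializedList:
    """Store a list of picklable items as one big numpy byte buffer plus offsets.
    Since there are no per-item Python objects left, forked workers never touch
    per-item refcounts, so the copy-on-write pages stay shared."""

    def __init__(self, items):
        serialized = [pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL) for item in items]
        self._offsets = np.cumsum([len(s) for s in serialized])
        self._buffer = np.frombuffer(b"".join(serialized), dtype=np.uint8)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._offsets[idx - 1])
        end = int(self._offsets[idx])
        return pickle.loads(self._buffer[start:end].tobytes())
```

The point is that only two flat arrays remain per list, instead of millions of small Python objects whose refcounts get modified on every access.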
There is one (relatively) simple thing we can do in the case of PyTorch DataLoader multiprocessing: we know that the original dataset in the main proc is not going to be used anymore, so we can free that one.
This is also implemented now, specifically for MultiProcDataset (and additionally for OggZipDataset; it can later be extended, but that is not needed when MultiProcDataset is used). I think this is really all we can realistically do for MultiProcDataset itself. For further discussion on how to improve the overall situation that dataset memory consumption is high (caused maybe by MultiProcDataset, but also by PyTorch DataLoader multiprocessing, distributed training, etc.), see the new issue #1498.
Example config:
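(The original config is not reproduced here; a minimal sketch of a MultiProcDataset-over-OggZipDataset setup in a RETURNN config might look like the following. The paths and specific option values are only illustrative:)

```python
# Sketch of a RETURNN train dataset definition wrapping OggZipDataset
# in MultiProcDataset. Paths and option values are illustrative only.
train = {
    "class": "MultiProcDataset",
    "num_workers": 4,
    "buffer_size": 10,
    "dataset": {
        "class": "OggZipDataset",
        "path": "data/train.ogg.zip",
        "audio": {"features": "raw", "sample_rate": 16000},
        "targets": {"class": "SentencePieces", "model_file": "spm.model"},
        "partition_epoch": 20,
        "seq_ordering": "laplace:.1000",
    },
}
```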
Now this runs out of CPU memory.
(Actually I don't exactly see where the memory goes, but I suspect it is MultiProcDataset.)
I don't know how relevant it is that this is with OggZipDataset.
Strangely, this happens only after I upgraded my Python environment, and did not happen before. Before I had Python 3.10 with PyTorch 2.0.1, and now I have Python 3.11 with PyTorch 2.1.0 (reason: pytorch/pytorch#111764). But maybe I was already very close to the memory limit and Python 3.11 added a little bit more, and this is now too much. So probably the problem with lots of memory usage already existed before.
@vieting (or his student assistant) independently also just stumbled upon this problem now. But his buffer size was much larger, so maybe that was the main problem there. In his case, it is also with OggZipDataset.
I know that people have reported similar memory consumption problems with the PyTorch DataLoader when multiple workers are used. The implementation and situation are very similar to MultiProcDataset, so maybe it's the same problem. In that case, it is just a fundamental property of Python, and not really a bug, so we can only optimize it a bit more. (I will add some links/references when I find them.)
Edit: One such issue is this: pytorch/pytorch#13246 (comment). You will find many users reporting there. There are multiple issues. One of them occurs when just using fork as the start method. In that case, the memory consumption is low at first, because fork duplicates all memory pages only lazily via copy-on-write logic, i.e. a memory page is copied only when it is modified. Depending on your code, Python will modify that memory anyway (e.g. merely accessing Python objects changes their refcount), which leads to a real copy of the memory page. Thus, over time, all memory pages get copied and the memory consumption grows. It looks like a memory leak, but it is actually just shared memory being converted into non-shared memory. Note that this is not the issue we have, because our subprocesses don't share memory with the parent anyway. However, there are still other memory consumption issues with PyTorch DataLoader multiprocessing which are more relevant to us.

A blog post describing the issues: ppwwyyxx.com: Demystify RAM Usage in Multi-Process Data Loaders
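(A tiny demonstration of this copy-on-write effect, assuming Linux and psutil available; the exact numbers will vary by machine:)

```python
import os
import psutil

def child_uss_mb():
    """USS (unique set size): memory that is NOT shared with any other process."""
    return psutil.Process(os.getpid()).memory_full_info().uss / 2**20

# A big structure made of many small Python objects (each carries a refcount).
data = [str(i) for i in range(5_000_000)]

pid = os.fork()
if pid == 0:
    # Child: right after fork, almost all pages are still shared (low USS).
    print("child USS before touching: %.1f MB" % child_uss_mb())
    # Merely iterating touches every object's refcount, which dirties the
    # pages those objects live on, so copy-on-write turns into real copies.
    for item in data:
        pass
    print("child USS after touching:  %.1f MB" % child_uss_mb())
    os._exit(0)
os.waitpid(pid, 0)
```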