System memory usage increase in training #12

Closed
Alt216 opened this issue Jan 19, 2022 · 5 comments

Alt216 commented Jan 19, 2022

Hi, when I run `python train.py ./configs/train/train_weakly_supervised.yaml` to train the network from scratch on our dataset, system memory usage slowly increases until it maxes out and the training crashes. I have 16 GB of system memory, and training only gets through a little more than one epoch with ~16,000 training samples. I tried lowering `num_workers` to 4 and the batch size to 2, but neither resolved the issue.
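A minimal way to confirm that it is really host RAM of the training process that grows (rather than GPU memory) is to log the resident set size every few steps with psutil; the loop placement sketched below is an assumption, not the actual structure of train.py:

```python
import os
import psutil  # assumption: psutil is available in the environment

process = psutil.Process(os.getpid())

def log_rss(step):
    """Print the resident set size (RSS) of this process in MiB."""
    rss_mib = process.memory_info().rss / (1024 ** 2)
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")

# Hypothetical placement inside the training loop of train.py:
# for step, batch in enumerate(train_loader):
#     ...
#     if step % 100 == 0:
#         log_rss(step)
```

If the RSS keeps climbing across epochs even with a small batch size, that points to something accumulating on the Python/CPU side rather than normal caching.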

zgojcic (Owner) commented Jan 19, 2022

Hi,

I have never observed this issue, but it is also true that I have always had more than 16 GB of memory. Can you try to track down where the memory leak is, or train on a machine with 32 GB of RAM?

For the other issue that you posted: I have uploaded the preprocessing scripts that we used for semantic_kitti and the other datasets.
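One way to act on the "find where the memory leak is" suggestion, at least for Python-side allocations (tracemalloc does not see memory held by C/C++ extensions such as tensor storage), is to diff two snapshots taken an epoch apart. This is a sketch; the epoch call is a placeholder for the real loop:

```python
import tracemalloc

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()

# ... run one training epoch here (placeholder for the real loop) ...

snapshot_after = tracemalloc.take_snapshot()

# Show the ten source lines whose Python allocations grew the most.
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)
```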

Alt216 (Author) commented Jan 19, 2022

Thank you very much @zgojcic! I will look into the possible memory leaks, and thanks for the preprocessing scripts.

For the memory issue: I added `.detach()` on lines 145 and 151 in train.py, following a suggestion on the PyTorch forum about storing the complete computation graph when accumulating losses. I am still unsure whether this is the cause, but I will try running the training with this modification.

The lines of code

suggestion I found
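For context, a minimal self-contained sketch of the pattern that suggestion is about (toy model and data, not the actual train.py code): accumulating a loss tensor that is still attached to the autograd graph keeps every iteration's graph alive, while `.detach()` or `.item()` lets it be freed.

```python
import torch

# Toy setup (stands in for the real model/data in train.py).
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(4, 10)
    target = torch.randn(4, 1)

    loss = torch.nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # running_loss += loss            # BAD: keeps each step's autograd graph alive
    running_loss += loss.detach()     # OK: accumulates a graph-free tensor
    # or: running_loss += loss.item() # OK: accumulates a plain Python float
```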

Alt216 (Author) commented Jan 20, 2022

After some more time searching the web, I found this, which could be a possible explanation. Maybe it has to do with the DataLoader workers iterating over Python lists and dicts, which adds up over time? The suggested solution is to replace them with numpy arrays.
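To illustrate the workaround being referenced: with forked DataLoader workers, plain Python containers held by the dataset can gradually be copied into each worker, because reference-count updates touch the objects and defeat copy-on-write, which looks like a slow leak. A hedged sketch of storing per-sample metadata in numpy arrays instead (class and field names here are hypothetical, not the repository's actual dataset):

```python
import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Hypothetical dataset sketch: keep per-sample metadata in numpy arrays
    rather than Python lists/dicts so that forked DataLoader workers do not
    slowly duplicate the objects via reference-count updates."""

    def __init__(self, file_paths):
        # A Python list of strings is touched on every access; a fixed-width
        # bytes array is a single buffer that stays shared between workers.
        self.file_paths = np.array(file_paths).astype(np.bytes_)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        path = self.file_paths[idx].decode("utf-8")
        # ... load and return the actual sample from `path` (omitted) ...
        return path
```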

zgojcic (Owner) commented Jan 25, 2022

Hi @Alt216, this could indeed be the case. At the moment I do not have time to investigate it (especially as it works fine on machines with more RAM), but if you find a solution, it would be great if you could open a PR.

Best
Zan

zgojcic (Owner) commented Mar 5, 2022

Closing due to inactivity.

zgojcic closed this as completed on Mar 5, 2022