System memory usage increase in training #12

Closed
Alt216 opened this issue Jan 19, 2022 · 5 comments

Alt216 commented Jan 19, 2022

Hi, when I run `python train.py ./configs/train/train_weakly_supervised.yaml` to train the network from scratch on our dataset, system memory usage slowly increases until it maxes out and the training crashes. I have 16 GB of system memory, and training only gets through a little more than one epoch with ~16,000 training samples. I tried lowering `num_workers` to 4 and the batch size to 2, but neither resolved the issue.
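A minimal way to confirm that it is really host RAM of the training process that grows (rather than GPU memory) is to log the resident set size every few steps with psutil; the loop placement sketched below is an assumption, not the actual structure of train.py:

```python
import os
import psutil  # assumption: psutil is available in the environment

process = psutil.Process(os.getpid())

def log_rss(step):
    """Print the resident set size (RSS) of this process in MiB."""
    rss_mib = process.memory_info().rss / (1024 ** 2)
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")

# Hypothetical placement inside the training loop of train.py:
# for step, batch in enumerate(train_loader):
#     ...
#     if step % 100 == 0:
#         log_rss(step)
```

If the RSS keeps climbing across epochs even with a small batch size, that points to something accumulating on the Python/CPU side rather than normal caching.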

zgojcic (Owner) commented Jan 19, 2022

Hi,

I have never observed this issue, but it is also true that I have always had more than 16 GB of memory. Can you try to track down where the memory leak is, or train on a machine with 32 GB of RAM?

For the other issue that you posted: I have uploaded the preprocessing scripts that we used for semantic_kitti and the other datasets.
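One way to act on the "find where the memory leak is" suggestion, at least for Python-side allocations (tracemalloc does not see memory held by C/C++ extensions such as tensor storage), is to diff two snapshots taken an epoch apart. This is a sketch; the epoch call is a placeholder for the real loop:

```python
import tracemalloc

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()

# ... run one training epoch here (placeholder for the real loop) ...

snapshot_after = tracemalloc.take_snapshot()

# Show the ten source lines whose Python allocations grew the most.
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)
```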

Alt216 (Author) commented Jan 19, 2022

Thank you very much @zgojcic! I will look into the possible memory leaks, and thanks for the preprocessing scripts.

For the memory issue: I added `.detach()` on lines 145 and 151 in train.py, following a suggestion on the PyTorch forum about storing the complete computation graph when accumulating losses. I am still unsure whether this is the cause, but I will try running the training with this modification.

The lines of code

suggestion I found
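For context, a minimal self-contained sketch of the pattern that suggestion is about (toy model and data, not the actual train.py code): accumulating a loss tensor that is still attached to the autograd graph keeps every iteration's graph alive, while `.detach()` or `.item()` lets it be freed.

```python
import torch

# Toy setup (stands in for the real model/data in train.py).
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(4, 10)
    target = torch.randn(4, 1)

    loss = torch.nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # running_loss += loss            # BAD: keeps each step's autograd graph alive
    running_loss += loss.detach()     # OK: accumulates a graph-free tensor
    # or: running_loss += loss.item() # OK: accumulates a plain Python float
```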

Alt216 (Author) commented Jan 20, 2022

After some more time searching the web, I found this, which could be a possible explanation. Maybe it has to do with the DataLoader workers iterating over Python lists and dicts, which adds up over time? The suggested solution is to replace them with numpy arrays.
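To illustrate the workaround being referenced: with forked DataLoader workers, plain Python containers held by the dataset can gradually be copied into each worker, because reference-count updates touch the objects and defeat copy-on-write, which looks like a slow leak. A hedged sketch of storing per-sample metadata in numpy arrays instead (class and field names here are hypothetical, not the repository's actual dataset):

```python
import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Hypothetical dataset sketch: keep per-sample metadata in numpy arrays
    rather than Python lists/dicts so that forked DataLoader workers do not
    slowly duplicate the objects via reference-count updates."""

    def __init__(self, file_paths):
        # A Python list of strings is touched on every access; a fixed-width
        # bytes array is a single buffer that stays shared between workers.
        self.file_paths = np.array(file_paths).astype(np.bytes_)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        path = self.file_paths[idx].decode("utf-8")
        # ... load and return the actual sample from `path` (omitted) ...
        return path
```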

zgojcic (Owner) commented Jan 25, 2022

Hi @Alt216, this could indeed be the case. At the moment I do not have time to investigate it (especially as it works fine on machines with more RAM), but if you find a solution, it would be great if you could open a PR.

Best
Zan

zgojcic (Owner) commented Mar 5, 2022

Closing due to inactivity.

zgojcic closed this as completed on Mar 5, 2022