
CUDA out of memory issue when training the model with single GPU #12

Closed
joeljosejjc opened this issue Jul 19, 2022 · 7 comments

@joeljosejjc

Hello Sir. Firstly, thanks for your contribution through this work.

I would like to enquire about the GPU requirements for training the GANav model, especially when using a single GPU. I am attempting to train the model on a single GPU (Nvidia RTX 2060) but I am facing the error: RuntimeError: CUDA out of memory.
To be more specific, I am running the following command after setting up the GANav environment and processing the dataset as per the readme instructions:
python -m torch.distributed.launch ./tools/train.py --launcher='pytorch' ./configs/ours/ganav_group6_rugd.py
and I am facing the error below:
[Screenshot of the CUDA out of memory error]

PC specs and package versions & configuration used in environment for GANav:

GPU: Nvidia RTX 2060
CPU: AMD Ryzen 7 3700X
Python Version: 3.7.13
PyTorch version: 1.6.0
cudatoolkit version: 10.1
mmcv-full version: 1.6.0
Dataset used: RUGD Dataset
No. of Annotation Groups: 6

Also, could you suggest some workarounds for memory management when training the model on a single GPU?
Thanks.

@LancerXE

Same issue here

@rayguan97
Owner

We used a GeForce RTX 2080 with 8 GB of memory. In this case, I recommend using a smaller batch size (samples_per_gpu), which can be set in rugd_group6.py or ganav_group6_rugd.py.

However, RELLIS-3D already uses samples_per_gpu=1, so I'm not sure whether it can be run on an Nvidia RTX 2060. Another option is to reduce the crop size, although then you might not be able to use the released trained model. A rough illustration of these config fields follows.
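
The relevant fields look roughly like this in an mmseg-style config (the numbers below are only placeholders, not the actual defaults shipped in the repo):

```python
# Illustrative values only -- check rugd_group6.py for the real ones.
crop_size = (300, 375)      # smaller crops use less memory, but the released
                            # pretrained weights may no longer match

data = dict(
    samples_per_gpu=2,      # per-GPU batch size; lower this first when hitting OOM
    workers_per_gpu=2,
    # train/val/test dataset entries omitted here
)
```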

@joeljosejjc
Author

Thank you sir for your quick reply regarding this issue. I reduced the samples_per_gpu parameter in GANav-offroad/configs/ours/ganav_group6_rugd.py as per your suggestion. The out-of-memory error still persisted with samples_per_gpu set to 2 or 3, but fortunately, training commenced successfully with samples_per_gpu=1.
Once again for reference, I executed the following command to run the training phase.

python -m torch.distributed.launch ./tools/train.py --launcher='pytorch' ./configs/ours/ganav_group6_rugd.py

The training phase progressed until the 10% checkpoint, at which point it produced another error, shown below:

[Screenshot of the error at the 10% checkpoint]

I googled and found a solution to the error, which was to set the priority of DistEvalHook to LOW in mmseg/apis/train.py:
open-mmlab/mmpretrain#488
After doing so, the model training completed successfully.
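
For reference, the change amounts to roughly the following (the exact surrounding code depends on the mmsegmentation version bundled with this repo):

```python
# Inside train_segmentor() in mmseg/apis/train.py (approximate):
eval_hook = DistEvalHook if distributed else EvalHook
runner.register_hook(
    eval_hook(val_dataloader, **eval_cfg),
    priority='LOW')  # previously registered with the default priority;
                     # 'LOW' makes the eval hook run after the logging hooks
```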

But I observed that the evaluation metrics were not improving throughout training; at every checkpoint they stayed the same as those of the 1st checkpoint. The evaluation metrics for all checkpoints were as given below:

[Screenshot of the evaluation metrics, identical at every checkpoint]

The evaluation results from the testing phase using the trained model were similar as well.

I am not sure what is causing this, and I have not altered any other parameters in the repository. Could you suggest what may be causing this issue?

@rayguan97
Owner

Have you checked the number of classes? Since the prediction is all 0 (background), I imagine there is something wrong with the setup.

  1. Can you try evaluating on the training set and make sure the model is fitting the training data?
  2. Can you check the output of the model? The problem could be in the eval code or in the model's inference (see the sanity-check sketch after this list).
  3. I have never seen this issue before, so it's possible that the solution you found for the "date_time" key error is leading to this error.
  4. There is no need to use "-m torch.distributed.launch" since you are only using 1 GPU. Have you tried running the exact command from the instructions? That would be a good starting point for narrowing down the problem, since I have not run into this issue with this command: python ./tools/train.py ./configs/ours/ganav_group6_rugd.py
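
For point 2, something like the following can quickly confirm whether the model has collapsed to a single class (the checkpoint path is just an example):

```python
# Load the trained model and inspect the predicted labels for one training
# image; if np.unique reports a single class id, the model predicts one class
# everywhere.
import numpy as np
from mmseg.apis import init_segmentor, inference_segmentor

config_file = './configs/ours/ganav_group6_rugd.py'
checkpoint_file = './work_dirs/ganav_group6_rugd/latest.pth'  # example path

model = init_segmentor(config_file, checkpoint_file, device='cuda:0')
pred = inference_segmentor(model, 'path/to/one/training_image.png')[0]
print(np.unique(pred, return_counts=True))
```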

@joeljosejjc
Author

To answer some of the above queries:

  1. The number of classes I have used is 6, and to the best of my knowledge, I have not changed any parameter in the code that affects the segmentation groups. This will become more evident as I present the results from the GANav model that I trained a second time, following your suggestions above as closely as possible.
  2. The output of the model showed every image colour-coded entirely as obstacles (which could be the reason for the high accuracy but low IoU of the obstacle category in the evaluation).
  3. I had initially used the exact command given in the readme of the repo:
    python ./tools/train.py ./configs/ours/ganav_group6_rugd.py

But this command generated the following error:
[Screenshot of the error produced by the non-distributed command]

I researched and found that using SyncBN for batch normalisation requires distributed training, and hence requires launching via python -m torch.distributed.launch to set the required parameters for distributed training.
I found that changing this to BN in configs/base/model/ours_class_att.py removes this requirement and allows me to run the original command with no errors.
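
For reference, the edit is just the norm_cfg type in that config file (field name as in the usual mmseg convention; the exact line in this repo may be laid out slightly differently):

```python
# norm_cfg = dict(type='SyncBN', requires_grad=True)  # original: requires a distributed launcher
norm_cfg = dict(type='BN', requires_grad=True)        # plain BatchNorm works for a single-GPU, non-distributed run
```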
I then tried training the model again using BN as the batch normalisation function, and to my surprise the initial evaluation metrics (at the first checkpoint) were not too bad:

[Screenshot of 1st checkpoint evaluation]
But the error related to the 'date_time' key showed up once again, and I found no solution other than the one I mentioned in my comment above.

The evaluation metrics seemed to be improving up until the 4th checkpoint; from the 5th checkpoint onwards, the metrics plummeted (below 10%) and stayed in that range until the last checkpoint. By the end of training, only the L2 Navigable terrain class had mediocre results (25% IoU and 48% accuracy), and the rest of the classes had metrics below 10%.

[Screenshot of 2nd checkpoint evaluation (32000 iterations)]

[Screenshot of 3rd checkpoint evaluation (48000 iterations)]

[Screenshot of 5th checkpoint evaluation (80000 iterations)]

[Screenshot of final checkpoint evaluation (160000 iterations)]

The evaluation metrics from the testing phase were similar to those of the last checkpoint:
[Screenshot of testing-phase evaluation metrics]

Finally, I also tried running the testing phase on the training data, which produced similar results to the above:
[Screenshot of evaluation on the training data]

Another strange thing I observed is that each output image from the testing phase was coloured entirely in one of the annotation group colours. I have added screenshots of the directory containing the segmented images to show what I mean:

[Screenshots of the output directory: each segmented image is a single solid colour]

From the above observations, I have the following doubts about training the GANav model:

  1. Is it possible that the poor and strange training performance stems from using samples_per_gpu=1? If so, would training performance improve with a better GPU (like the one you used, an RTX 2080 with 8 GB) and, correspondingly, a higher samples_per_gpu?
  2. Should more than one GPU be used (implementing distributed training) to improve the training of the model?
  3. Or could the bad performance be a result of the workaround for the date_time key error, or some other unforeseen issue? If so, could you suggest some possible troubleshooting methods I could try to resolve this?

Once again, thank you sir for your suggestions and help in resolving these issues.

@rayguan97
Owner

  1. Regarding BN and SyncBN: thank you so much for noticing and pointing this out! I forgot to change it back after using distributed training; it is now updated in the config file.
  2. Regarding the poor performance: yes, in this case I believe it might be an issue with the batch size. I recommend lowering the learning rate slightly if you do not have access to an RTX 2080, or switching to a larger batch size on a better GPU. I have never used a batch size of 1 on this dataset.
  3. If all the pixels are predicted as the same class, maybe it has something to do with the labels and the processing script? But that is highly unlikely, and I cannot see a reason why it would fail if nothing there was changed.
  4. Regarding more than one GPU: I do not see much improvement/degradation in performance with 1 or more GPUs (there was some improvement using 2 GPUs instead of 1, to the best of my recollection).
  5. I don't see a reason why the date_time key error would be an issue.

To sum up:

  1. Try using a GPU with more memory and the provided batch size.
  2. Try lowering the learning rate if using a smaller batch size. I imagine you would only observe this behaviour if the learning rate has gone horribly wrong; a rough scaling sketch follows below.
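
As a rule of thumb (the numbers below are placeholders, not the values shipped in the config), scaling the learning rate linearly with the batch size is a reasonable starting point:

```python
# Linear scaling heuristic: if the provided schedule was tuned for a larger
# per-GPU batch size, scale the learning rate down proportionally.
base_lr = 0.01              # placeholder for the learning rate in the released config
base_batch_size = 4         # placeholder for the batch size that lr was tuned for
my_batch_size = 1           # what fits on the smaller GPU

optimizer = dict(
    type='SGD',
    lr=base_lr * my_batch_size / base_batch_size,  # e.g. 0.0025 for batch size 1
    momentum=0.9,
    weight_decay=0.0005)
```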

The current version of the code corresponds to this version of the paper (https://arxiv.org/abs/2103.04233v2), which is not the latest version described in the accepted RAL paper. I will update the code in the next couple of days and see if there are any obvious bugs that might cause this issue.

@rayguan97
Owner

Hey, just a quick note: I have uploaded the new code for the latest paper. Please open another issue if you still have training problems.
