Running out of GPU memory after several minutes training #7

Open
ganzhi opened this issue Feb 21, 2021 · 5 comments

ganzhi commented Feb 21, 2021

Hi,

I got a CUDA out-of-memory error after a few minutes of training. Is there a way to fix it?

(py38) C:\Src\GitHub\MadMario>python main.py
Loading model at checkpoints\2021-02-20T16-13-06\trained_mario.chkpt with exploration rate 0.1
Episode 0 - Step 660 - Epsilon 0.1 - Mean Reward 2990.0 - Mean Length 660.0 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 10.198 - Time 2021-02-20T16:29:03
Episode 20 - Step 5262 - Epsilon 0.1 - Mean Reward 1311.095 - Mean Length 250.571 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 61.936 - Time 2021-02-20T16:30:05
Episode 40 - Step 9888 - Epsilon 0.1 - Mean Reward 1149.829 - Mean Length 241.171 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 62.843 - Time 2021-02-20T16:31:08
Episode 60 - Step 13407 - Epsilon 0.1 - Mean Reward 1072.361 - Mean Length 219.787 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 47.898 - Time 2021-02-20T16:31:56
Episode 80 - Step 19197 - Epsilon 0.1 - Mean Reward 1144.407 - Mean Length 237.0 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 77.715 - Time 2021-02-20T16:33:14
Episode 100 - Step 22474 - Epsilon 0.1 - Mean Reward 1060.12 - Mean Length 218.14 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 44.237 - Time 2021-02-20T16:33:58
Episode 120 - Step 26864 - Epsilon 0.1 - Mean Reward 1015.29 - Mean Length 216.02 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 58.86 - Time 2021-02-20T16:34:57
Episode 140 - Step 32109 - Epsilon 0.1 - Mean Reward 1094.56 - Mean Length 222.21 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 71.322 - Time 2021-02-20T16:36:08
Traceback (most recent call last):
  File "main.py", line 59, in <module>
    action = mario.act(state)
  File "C:\Src\GitHub\MadMario\agent.py", line 57, in act
    state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.00 GiB total capacity; 7.56 GiB already allocated; 0 bytes free; 7.74 GiB reserved in total by PyTorch)

@oldschooler-dev

Hi,
I also hit this error when giving it a try. I tweaked the memory setting in the agent: self.memory = deque(maxlen=20000) at agent.py line 13, down from 100000.

My guess is it's the torch.FloatTensors: the cached experiences are never freed, which I think is by design, since keeping them on the GPU makes them fast to access during training (e.g. the memory for experiences). I'm not 100% sure, as I'm on a journey of trying to fit this onto an RTX 2080 on Windows, PyTorch 1.7.1 with CUDA. I tried the latest dev branch with the same results, so I think it comes down to not having 32 GB on the GPU. After shrinking the memory as above I'm now up to 10,000 episodes. Fingers crossed; I'd imagine it will take another 24 hours, and I'm not sure how many episodes it takes to reach the results of the provided model.
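For illustration, a minimal sketch of one way to keep the replay buffer off the GPU entirely: store experiences on the CPU and move only the sampled mini-batch to the GPU at learning time. Method names follow the cache/recall structure of the upstream agent.py; the batch size and other details here are assumptions, not the actual repo code:

```python
import random
from collections import deque

import torch


class Mario:
    def __init__(self, batch_size=32):
        self.use_cuda = torch.cuda.is_available()
        # Smaller CPU-side buffer, as tweaked above (upstream default: 100000).
        self.memory = deque(maxlen=20000)
        self.batch_size = batch_size

    def cache(self, state, next_state, action, reward, done):
        # Store every experience as a CPU tensor so the replay buffer
        # can never exhaust GPU memory, no matter how long training runs.
        self.memory.append((
            torch.FloatTensor(state),
            torch.FloatTensor(next_state),
            torch.LongTensor([action]),
            torch.DoubleTensor([reward]),
            torch.BoolTensor([done]),
        ))

    def recall(self):
        # Sample a mini-batch and move only that batch to the GPU.
        batch = random.sample(self.memory, self.batch_size)
        state, next_state, action, reward, done = map(torch.stack, zip(*batch))
        if self.use_cuda:
            state, next_state = state.cuda(), next_state.cuda()
            action, reward, done = action.cuda(), reward.cuda(), done.cuda()
        return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
```

The trade-off is an extra host-to-device copy per training step, but GPU usage then stays bounded by the model and one batch rather than growing with the buffer.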

@oldschooler-dev

I also changed the burn-in: self.burnin = 1e4
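Taken together, the two tweaks would look roughly like this in Mario.__init__ in agent.py. The 20000 and 1e4 values are the ones from this thread; as I understand it, burnin is the minimum number of experiences collected before learning begins:

```python
# In Mario.__init__ (agent.py):
self.memory = deque(maxlen=20000)  # replay buffer, reduced from the default 100000
self.burnin = 1e4                  # min. experiences to collect before learning starts
```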

ganzhi commented Mar 3, 2021

Created a PR to address the issue here: #8

@oldschooler-dev, you can try my fix by cloning this repo: https://github.com/ganzhi/MadMario

@oldschooler-dev

Seems to work OK, no memory errors... Cheers!

@LI-SUSTech

> Seems to work OK, no memory errors... Cheers!

How did you fix the problem? There seems to be no difference between @ganzhi's fork and master...
