Parallelisation #5
Hello @stellarpower,
I think I don't, although if the sequences in the batch are independent, could it be done in such a way that they don't all consume the resources at once, i.e. loop through them one by one (or several at a time) instead, if there isn't room for the whole batch? For example, if I have a batch size of 16 and there's only memory to compute 8 at a time, can we do two iterations and then just sum the results (see the sketch below)? The problem I hit after that, which was more of a concern, was running out of resources for longer sequences: the overall GPU memory was fine, but I assume the thread limit or the memory available to the kernel was exhausted, even with a batch size of 1.
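Something like this is what I have in mind (a rough hypothetical sketch, assuming `sdtw` is the batched loss from this repository and returns one loss per sequence):

```python
import torch

def chunked_sdtw(sdtw, y_true, y_pred, chunk_size=8):
    # Pass the batch through the loss a few sequences at a time, so that
    # only `chunk_size` sequences are handed to the kernel at once.
    per_sequence_losses = []
    for start in range(0, y_pred.shape[0], chunk_size):
        stop = start + chunk_size
        per_sequence_losses.append(sdtw(y_true[start:stop], y_pred[start:stop]))
    # Concatenate back to one loss per sequence; call .sum() for a scalar.
    return torch.cat(per_sequence_losses)
```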
With regards to splitting the batched input depending on the memory available on the host machine, that will not be implemented here. For a longer sequence and a batch of one, it should also be the GPU memory that is the limitation. Unless that is not the case, I'll close this issue?
Afraid I'm not sure what you mean by an adapted batch size; does Torch change the batch size dynamically at times? I have not seen this in TensorFlow. But this is what I was wondering: is there a motivating reason for handling all the sequences in the batch inside the kernel? I would have thought that if the kernel just handles one, the framework would be able to make sensible decisions about how to parallelise them and queue them up in pipelines, which would stop this being an issue.

I haven't been able to stress-test the Keras one yet, but I am hoping that, as I handle each sequence separately and TensorFlow works on the atomic unit of "ops", it will be smart enough to queue everything up in a way that prevents memory exhaustion when it knows it can parallelise. At the top level I have a map over the sequences in the batch (roughly as in the sketch below), so it seems reasonably hopeful that will be the case.

Are longer sequences limited by the CUDA kernel/thread memory limits mentioned in Maghoumi's version, or just by the total available memory? For my particular network, a sequence length of 512 will be quite limiting, so I may still look into writing a kernel in C++ if it learns but needs further optimisations.
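To illustrate the map I mean (a hypothetical sketch, not my actual implementation; `per_sequence_loss` is just a stand-in for a loss that handles a single sequence):

```python
import tensorflow as tf

def per_sequence_loss(y_true_seq, y_pred_seq):
    # Stand-in for a single-sequence soft-DTW loss, so the sketch runs on its own.
    return tf.reduce_sum(tf.square(y_true_seq - y_pred_seq))

def batched_loss(y_true, y_pred):
    # Map the single-sequence loss over the batch dimension and let the
    # framework decide how many sequences to evaluate at once
    # (parallel_iterations controls this when running inside a tf.function).
    return tf.map_fn(
        lambda pair: per_sequence_loss(pair[0], pair[1]),
        (y_true, y_pred),
        fn_output_signature=tf.float32,
    )
```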
Feel free; I don't know if you want to enable the Q&A feature on the repository, as this was more of a question than a request right now, and it could be moved over there. I think a limited batch length could be an issue for my project, but as I'm mostly using TensorFlow I don't have as much reason to need it myself for the meantime.

BTW, whilst I am here, do you think it would be possible to add a quick example of obtaining the full gradients to the readme? I am chasing a difficult bug where my implementation differs from yours and the other Torch versions in the gradients, but only when gamma != 1.0. The intermediate matrices all agree to within numerical precision, so I'm not sure if I am using the gradient tape properly. This is what I have; I am looking at the tests right now to see if it looks the same:

```python
lossesPerSequenceGraph = sdtw(y_true, y_pred)
lossesPerSequence = lossesPerSequenceGraph.detach().numpy()

# Have to call sum before running the backwards pass.
lossesPerSequenceGraph.sum().backward()

# We only want the gradient on y_pred; y_true is constant and thus doesn't affect them.
torchGradients = y_pred.grad
```

Thanks
Edit: Looks like I have to call
Am I also missing something about the memory use? Let's say I have data of shape (16, 512, 1). For each sequence, with float32, that's then:
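Roughly, as a back-of-the-envelope (assuming the kernel keeps a full T × T cost matrix plus one similar-sized intermediate matrix per sequence, which is my guess at what it stores):

```python
T = 512
entries_per_matrix = T * T                  # 262,144 entries in a T x T matrix
bytes_per_matrix = entries_per_matrix * 4   # float32 -> 1 MiB per matrix
bytes_per_sequence = 2 * bytes_per_matrix   # cost + accumulated-cost matrix ~ 2 MiB
bytes_per_batch = 16 * bytes_per_sequence   # ~32 MiB for the whole batch
```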
So if we do everything at the same time, that's on the order of megabytes per batch, and yet I'm able to exhaust the 24 GiB of memory on my 3090 by increasing either the batch size or the sequence length. So presumably I've missed something in this model of what the algorithm is doing(?)
Hi,
Was wondering: for each element in the batch, does the current algorithm automatically parallelise? I have an RTX 3090 (with 24 GiB) and I run out of memory instantly for any sequence longer than 512 samples.
I was wondering if CUDA is trying to parallelise across each sequence in the batch automatically. If so, I think it'd be good to run them in series when there isn't enough memory, seeing as they should be independent. It seems DTW inherently has high memory use, and I'd rather have the loss take longer than be limited in my sequence length, if this is the case and that's possible.
Cheers