Running batch transforms (e.g. torch.nn.functional.grid_sample) is slower on TPU vs CPU #2405
Comments
Hi @butchland, thanks for reporting! Could you follow the instructions here to run a debug run? This way we can know what exactly happened. My guess would be that xla currently does not lower this op.
debug_run.tar.gz — if you need anything else or want extra parameters, let us know and we can run it again.
The final Python code we executed boils down to a single call to F.grid_sample on the XLA device.
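A minimal sketch of that kind of call (shapes and variable names here are assumptions, not our original code):

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# A batch of images plus an identity sampling grid; sizes are illustrative.
imgs = torch.randn(64, 3, 224, 224, device=device)
theta = torch.eye(2, 3, device=device).unsqueeze(0).expand(64, -1, -1)
grid = F.affine_grid(theta, imgs.size(), align_corners=False)
out = F.grid_sample(imgs, grid, align_corners=False)
```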
Oh, so it looks like you are running a small code snippet that doesn't finish a full step, so the metrics report is not generated. Do you mind running it again with a metrics report printed at the end? More detail can be found here. This report will be super helpful in telling us where the slowness is coming from.
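For concreteness, the metrics report comes from torch_xla's debug module; printing it after at least one full step looks like this:

```python
import torch_xla.debug.metrics as met

# Dumps counters and timers accumulated so far; aten::* counters
# flag ops that fell back to the CPU for lack of an XLA lowering.
print(met.metrics_report())
```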
Oh yes, we tried to remove all the interference from extra code and limit the snippet to just what is causing the slowness. By the way, I see I can add --hlo and maybe generate something with grab_graph.py? Also, what was missing in our sample to generate this report, so that we don't have to print it manually? We found that the aten:: counters are the calls forwarded to the CPU because they are not implemented on TPU, so I pasted them from the tgz for easy access.
(pasted aten::* counters from debug_run.tar.gz)
Yup, you are right. For your other questions: yes, you can set that up.
Good, I see, thanks! So we should wait for the lowering of these two calls, but what about the remaining aten:: counter?
Yup, I will update this thread when I make any progress on lowering these two ops. We have a section in here talking about that remaining counter.
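As background, those aten::* fallbacks can also be listed programmatically from the same metrics module; a quick check (variable names are ours) looks like:

```python
import torch_xla.debug.metrics as met

# Counters whose names start with "aten::" are ops that had no XLA
# lowering and were routed to the CPU instead.
fallbacks = [n for n in met.counter_names() if n.startswith("aten::")]
for name in fallbacks:
    print(name, met.counter_value(name))
```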
Hi there @JackCaoG, I'm back on this issue, so I will give it a try!
Hi, is there a solution yet? I just hit the same issue when training a GPT-Neo model using TPUs on Colab.
I am using Resize inside a PyTorch Lightning training step and it makes my code terribly slow. Is there a solution for this?
affine_grid should be supported now. To get a better understanding of the problem, printing a metrics report after a step will help.
@JackCaoG I am using Resize in my training step; more specifically, I am using the metrics report function, but training seems to be stuck at the resize(tensor) operation, so the code never reaches that step.
Hi @JackCaoG @butchland @tyoc213, I'm wondering if you have found a solution for speeding up the F.grid_sample method. I'm also running into the same issue. Any help will be much appreciated.
@dhruvrnaik, if you have a small repro I might be able to take a look. It depends on what op is falling back.
Taking another look at the CPU implementation.
Thank you for looking into it, @JackCaoG. I'm working with a model for learning pixel-conditioned Neural Radiance Fields (paper, code). Many radiance field models rely heavily on F.grid_sample.
I don't think we have anything similar to grid_sample. PyTorch/XLA supports the ops that have XLA lowerings; anything else falls back to the CPU.
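For readers hitting this today, one workaround while the lowering is pending is to express bilinear grid sampling with ops that do lower natively (gather plus elementwise math). A rough sketch, assuming align_corners=True semantics and border padding via clamping — an illustration, not PyTorch's exact algorithm:

```python
import torch

def grid_sample_bilinear(input, grid):
    # input: (N, C, H, W); grid: (N, Hg, Wg, 2) with values in [-1, 1].
    # Approximates F.grid_sample(..., mode="bilinear",
    # padding_mode="border", align_corners=True) using only gather
    # and elementwise ops.
    N, C, H, W = input.shape
    ix = ((grid[..., 0] + 1) / 2 * (W - 1)).clamp(0, W - 1)  # x -> column
    iy = ((grid[..., 1] + 1) / 2 * (H - 1)).clamp(0, H - 1)  # y -> row

    ix0, iy0 = ix.floor(), iy.floor()
    ix1 = (ix0 + 1).clamp(max=W - 1)
    iy1 = (iy0 + 1).clamp(max=H - 1)
    wx, wy = ix - ix0, iy - iy0  # bilinear weights in [0, 1]

    def gather(row, col):
        # Flatten the spatial dims and gather one corner per output pixel.
        idx = (row * W + col).long().view(N, 1, -1).expand(-1, C, -1)
        vals = input.reshape(N, C, H * W).gather(2, idx)
        return vals.view(N, C, *grid.shape[1:3])

    v00, v01 = gather(iy0, ix0), gather(iy0, ix1)
    v10, v11 = gather(iy1, ix0), gather(iy1, ix1)

    wx, wy = wx.unsqueeze(1), wy.unsqueeze(1)  # broadcast over channels
    top = v00 * (1 - wx) + v01 * wx
    bottom = v10 * (1 - wx) + v11 * wx
    return top * (1 - wy) + bottom * wy
```

Whether this beats the CPU fallback depends on how well gather lowers for the shapes involved, so it is worth validating against the metrics report discussed above.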
🐛 Bug
Executing batch transforms that use the torch.nn.functional.grid_sample function
runs slower on a single TPU core than on the CPU.
To Reproduce
We encountered this weird behavior where batch transforms run slower on a single TPU core than on a GPU (which we somewhat expected), but we also found that they run even slower than on the CPU!
Here are some notebooks showing the results for a single transform (Flip):
GPU (fastest) - avg time: 0.021 secs
CPU (middle) - avg time: 1.227 secs
TPU (slowest) - avg time: 7.341 secs
For the torch.nn.functional F.grid_sample method itself, the times were:
GPU - avg time: 0.000 secs (not measurable as a time.time() diff)
CPU - avg time: 0.821 secs
TPU - avg time: 4.247 secs
This is not even using gradients, just pure parallel tensor computations...
The notebooks below have a "run on Colab" link in them, so you can validate
the stats produced above.
https://github.com/butchland/fastai_xla_extensions/blob/master/archive_nbs/09_vision_augment_experiments_profile_CPU.ipynb
https://github.com/butchland/fastai_xla_extensions/blob/master/archive_nbs/09_vision_augment_experiments_profile_GPU.ipynb
https://github.com/butchland/fastai_xla_extensions/blob/master/archive_nbs/09_vision_augment_experiments_profile_TPU.ipynb
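A caveat on the timings above: both CUDA and XLA execute asynchronously (which is also why the GPU number shows as 0.000), so a bare time.time() diff can be misleading. A sketch of a synchronized measurement on the XLA side (the timed helper is ours, not a library API; shapes are illustrative):

```python
import time
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
imgs = torch.randn(64, 3, 224, 224, device=device)
theta = torch.eye(2, 3, device=device).unsqueeze(0).expand(64, -1, -1)
grid = F.affine_grid(theta, imgs.size(), align_corners=False)

def timed(fn):
    # Flush pending work before and after, so the wall-clock diff
    # measures device execution rather than lazy graph building.
    # (On CUDA the equivalent barrier is torch.cuda.synchronize().)
    xm.mark_step()
    xm.wait_device_ops()
    start = time.time()
    out = fn()
    xm.mark_step()
    xm.wait_device_ops()
    return out, time.time() - start

_, secs = timed(lambda: F.grid_sample(imgs, grid, align_corners=False))
print(f"grid_sample: {secs:.3f} secs")
```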
Expected behavior
We expect the TPU to run the transforms much faster than the CPU.
Environment
Colab
Additional context
Lastly, we noticed that other batch data augmentations (brightness and contrast) don't slow the TPU down as much; in fact, they run faster on a TPU than on a CPU.
We (@butchland and @tyoc213) are building an extension library to enable the fastai library to run on TPUs.
If you have suggestions to speed it up (e.g. alternative algorithms for the batch transforms used in data augmentation), we'd appreciate it!