Lowering unfold
#2239
Comments
Thanks for reporting @ibeltagy, we will take a look.
Thanks, @dlibenzi.
While-loop-based patch extraction is likely slower than convolution tricks:
@dlibenzi, sorry, I am not sure I am following how this link is related to unfold.
I see. This is the C++ TensorFlow version of
It cannot be called directly, but we can use the same idea (convolutions using kernels that pick up one element at a time) for the forward pass.
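To make the convolution idea concrete, here is a hedged sketch (not the actual XLA lowering) of extracting overlapping windows with a 1-D convolution whose kernels are one-hot vectors, each selecting one element of the window. `unfold_via_conv` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def unfold_via_conv(x, size, step):
    # Emulate x.unfold(-1, size, step) for a 2-D input [N, L] using a
    # conv1d whose `size` one-hot kernels each pick out one element of
    # the window. (Illustrative sketch only, not the pytorch/xla code.)
    weight = torch.eye(size, dtype=x.dtype).unsqueeze(1)  # [size, 1, size]
    out = F.conv1d(x.unsqueeze(1), weight, stride=step)   # [N, size, n_windows]
    return out.transpose(1, 2)                            # [N, n_windows, size]

x = torch.arange(10.).unsqueeze(0)                        # [1, 10]
print(torch.equal(unfold_via_conv(x, 4, 2), x.unfold(-1, 4, 2)))  # True
```

Each output channel of the conv corresponds to one position inside the window, so transposing the channel and window dimensions reproduces the `unfold` layout.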
Hey @JackCaoG, I am just curious if there are updates here.
Hi @ibeltagy, I am working on the lowering part but it is a bit tricky. You will see the PR linked in this issue when it is ready 😄.
Thanks, @JackCaoG, for the forward function in your PR here. I ran your code and I successfully get
Thanks
Hi @ibeltagy, I am not sure if 1 hour is too long, it really depends on your model size. Do you remember how much time it took prior to the

For the second question, I think I have an idea. During the lowering of unfold, for an input with shape [12, 2048, 64], size=512, step=256, it will generate two iota vectors of size [12 * 2048 * 64 - 512, 1, 12 * 2048 * 64 - 512] and a filter of the same size. It will then use

The reason I chose this lowering is that the convolution trick is likely much faster than the loop-based approach. For native PyTorch on GPU, unfold just plays with the pointer and the stride, but for XLA we actually need to compute the output and store it (unfold is not a view op on XLA). This is the downside of not being able to access the storage. Does this OOM issue block you from using XLA on this model?
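A quick back-of-the-envelope check of the intermediate shape mentioned above (assuming float32 elements) shows why this runs out of memory:

```python
# Rough size of one [12*2048*64 - 512, 1, 12*2048*64 - 512] float32
# tensor from the lowering described above (back-of-the-envelope only).
n = 12 * 2048 * 64 - 512          # 1572352
size_bytes = n * 1 * n * 4        # float32 = 4 bytes per element
print(size_bytes / 2**40)         # roughly 9 TiB
```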
around 5 minutes
Yeah, this is huge and won't work.
Can you elaborate on what the loop-based approach is? Is it a loop with multiple
Yes, and the actual input is even larger, something like
Hi @ibeltagy, 5 minutes to 1 hour seems like a big jump. One possibility is that

For the loop-based approach, yes, I was thinking about multiple
If it is possible to implement
Do you mind trying out the idea of splitting the tensor before unfold and concatenating the results afterward? Something like
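A hedged sketch of that split-then-concat idea (hypothetical helper; chunk boundaries are aligned to `step`, with enough overlap so every window of the original tensor is reproduced):

```python
import torch

def chunked_unfold(x, size, step, windows_per_chunk=8):
    # Emulate x.unfold(-1, size, step) by slicing overlapping chunks,
    # unfolding each small chunk, and concatenating along the window
    # dimension. (Hypothetical workaround sketch, not pytorch/xla code.)
    n_windows = (x.shape[-1] - size) // step + 1
    outs = []
    for w0 in range(0, n_windows, windows_per_chunk):
        w1 = min(w0 + windows_per_chunk, n_windows)
        start, stop = w0 * step, (w1 - 1) * step + size
        outs.append(x[..., start:stop].unfold(-1, size, step))
    return torch.cat(outs, dim=-2)

x = torch.randn(2, 100)
print(torch.equal(chunked_unfold(x, 16, 8), x.unfold(-1, 16, 8)))  # True
```

Each chunk only materializes `windows_per_chunk` windows at a time, which bounds the peak size of the unfold intermediates.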
I will try to see if I can reduce the memory usage of the current implementation and think a bit more about the slice approach.
I pushed a new change to the unfold PR; the peak memory usage should be reduced to 1/3 when step > 3.
Will try both and let you know. Thanks.
If you guys can post a simple repro and dump the HLO graph, we could see what is going on: `print(torch_xla._XLAC._get_xla_tensors_hlo([unfold_result]))`
I tried the iterative slicing that you suggested and found it to work well. The memory usage is low enough that I can run the model on long sequences, and the model is fast enough (1.7x slower than a GPU that uses

Here's another thing that could use your help, and please let me know if I should move it to a separate issue. Right now the model is 1.7x slower than GPU. If you guys have any insights on how to make it faster, that would be great. And I don't think the iterative unfold vs. as_strided is the main contributor to the slowdown; I tried the model with this part of the code removed and it was still slower than on a GPU.
Hi @ibeltagy, glad to hear that you got the
Sure. I will move the model optimization to a separate issue. One thing that's still relevant here is finding out if
For sure, we still want
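For reference, on backends with direct storage access, `unfold` along the last dimension is just a strided view. A sketch of that equivalence (illustrative only; this is exactly the view trick XLA cannot use without materializing the result):

```python
import torch

def unfold_last_dim_view(x, size, step):
    # Express x.unfold(-1, size, step) as an as_strided view: append a
    # window dimension of stride step * stride(-1). (Illustrative sketch;
    # only valid where storage and strides are directly accessible.)
    n_windows = (x.shape[-1] - size) // step + 1
    shape = x.shape[:-1] + (n_windows, size)
    stride = x.stride()[:-1] + (step * x.stride(-1), x.stride(-1))
    return x.as_strided(shape, stride)

x = torch.randn(3, 50)
print(torch.equal(unfold_last_dim_view(x, 8, 4), x.unfold(-1, 8, 4)))  # True
```

Because no data is copied, this is essentially free on CPU/GPU, which is why native `unfold` there "just plays with the pointer and the stride" as described above.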
Hi @ibeltagy, I have similar issues when using unfold. Do you mind elaborating on how iterative slicing works? Maybe via an example?
@JunwenBai I believe it is this function
Hi, is
|
@coleridge72 I think
🚀 Feature

Add a lowering for `unfold`.

Motivation

I want to run Longformer (model code on the HF repo) on pytorch-xla, and this requires an overlapping sliding-window operation, which needs a lowering for `unfold`.

Pitch

Add a lowering for `unfold`.

Alternatives

Use `as_strided`, but the current implementation is limited as discussed in this issue.

Additional context

Below is the metric report for the forward pass of Longformer with `unfold`. It has entries for `aten::unfold`.