Lowering for upsample_bicubic2d #1270
Conversation
@jansel could you PTAL?
Not sure why this is slower. What does the eager-mode version do? Are there any tricks hiding in that version?
Perhaps it is a benchmark setup issue? Have you tried disabling cudagraphs for both legs of the experiment?
Maybe @ngimel has ideas.
Part of it is 64-bit vs 32-bit indexing; check the linked issue.
We should be able to use 32-bit indexing in the lowering version.
The lowering uses 32-bit indexing, so I don't think that's the reason. But yeah, looking at the raw kernel times would be helpful, as CUDA graphs incur additional copies.
torchinductor/lowering.py (Outdated)

```python
ix = ops.indirect_indexing(clamp(fx, 0, iW - 1))
return x_loader([n, c, iy, ix])
...
iy = ops.to_dtype(in_y, torch.int32)
```
Would this break if the x or y dimension is larger than INT_MAX? (That's a very rare situation, but our users tend to poke holes like this, even if only to file an issue.)
That is a good point, and it will definitely break. Note that the CUDA implementation has the same issue though: it uses `int` for those indices. Probably there are a few other kernels with this issue too.

In the lowering, it is easy enough to add a check and use `int32` or `int64` accordingly, but if we are worrying about that, we should probably also worry about `float32` (used for all the floating-point index calculations) not being precise enough when the x/y dimensions are large. Should we also be picking the floating-point size according to those dimensions? The CUDA implementation uses `at::acc_type<scalar_t, true>`, which doesn't make much sense to me.

I will add a check for the int dtype, since that seems like the bigger problem, but let me know what you think about the float type issue.
I wouldn't worry about fixing this. I think this should be fixed at the compiler level, as discussed in #1293
Actually, yeah, your solution of just adding a check for now SGTM.
Yes, ideally it should be handled at the compiler level, but meanwhile we can get significantly better performance by having the explicit check.
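A minimal sketch of the kind of explicit check being discussed, assuming `iH`/`iW` stand for the input spatial extents (my reconstruction, not the PR's exact code):

```python
import torch

INT32_MAX = torch.iinfo(torch.int32).max

def pick_index_dtype(iH: int, iW: int) -> torch.dtype:
    # Prefer 32-bit indices for speed; fall back to 64-bit only when a
    # flattened index over the spatial extent could overflow int32.
    if iH * iW >= INT32_MAX:
        return torch.int64
    return torch.int32
```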
```python
coeffs_x = tuple((load_bounded(y, x) for x in ixs_ofs))
return cubic_interp1d(coeffs_x, t_x)
...
coeffs_y = tuple(get_x_interp(y) for y in iys_ofs)
```
Where are `coeffs_x`?

Edit: ah ok, it's rolled into the `get_x_interp` function, but that's pretty confusing, no?
There is an asymmetry between `coeffs_x` and `coeffs_y`. The `coeffs_x` are direct memory reads from the input tensor, and there are actually 4 sets of them, one for each of the 4 y offsets. The `coeffs_y` are the result of applying `cubic_interp1d` to each of the 4 sets of `coeffs_x`.
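To make the nesting concrete, here is a rough sketch of the structure (names follow the snippet above; `t_y` and the final reduction are my assumption of how the lowering finishes, not quoted code):

```python
def get_x_interp(y):
    # One input row: 4 clamped loads along x, collapsed by a 1D cubic interp.
    coeffs_x = tuple(load_bounded(y, x) for x in ixs_ofs)
    return cubic_interp1d(coeffs_x, t_x)

# Four x-interpolated rows, then one cubic interpolation along y.
coeffs_y = tuple(get_x_interp(y) for y in iys_ofs)
result = cubic_interp1d(coeffs_y, t_y)  # assumed final step
```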
(Replying to @jansel above)
Not that I can see? I largely based this lowering on the CUDA implementation, which seems pretty straightforward.

This is what you get without and with cudagraphs (the +CG columns). I also included the results for the decomp in pytorch/pytorch#85403.

So it doesn't seem like it's a cudagraphs thing. By the way, I disabled cudagraphs for the torchinductor versions by passing
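The exact flag is cut off above. For reference, torchinductor at the time exposed a config toggle along these lines; treat the exact attribute path as an assumption rather than a quote of what was passed:

```python
import torchinductor.config

# Assumed toggle: disable CUDA graph capture so raw kernel times
# can be compared between eager and inductor.
torchinductor.config.triton.cudagraphs = False
```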
Yeah, it's ok to merge this.
Yes, we can merge this.
This reverts commit 1d4794d.
Sorry, had to revert, as master tests started failing.
@fdrocha the failure is related to not handling None inputs. In this case:
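The failing input itself is not preserved here, but the ATen signature `upsample_bicubic2d(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None)` shows where a None can sneak in. A hypothetical guard, mirroring how the eager kernels fall back when no explicit scale is given (my sketch, not the fix that landed):

```python
def compute_scale(in_size: int, out_size: int, align_corners: bool, scale=None):
    # scale may be None; derive it from the sizes in that case,
    # as the eager kernels do.
    if align_corners:
        return (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
    if scale is not None and scale > 0:
        return 1.0 / scale
    return in_size / out_size
```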
Added a lowering for upsample_bicubic2d.
Currently performance is about 40% worse than eager.
Not sure why this is happening; would appreciate suggestions on how to address it. @Chillee @ezyang @jansel @lezcano
Benchmarks
The generated Triton looks reasonable to me.
Code used for benchmarking
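The script itself is collapsed in the PR; a minimal benchmark along these lines would reproduce the comparison (my reconstruction, with assumed shapes and iteration counts):

```python
import torch
import torch.nn.functional as F

def bench(fn, x, iters=100):
    # Warm up, then time with CUDA events for accurate GPU measurement.
    for _ in range(10):
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

x = torch.randn(16, 3, 256, 256, device="cuda")
upsample = lambda t: F.interpolate(t, scale_factor=2.0, mode="bicubic")
print("eager ms/iter:", bench(upsample, x))
```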