
UpSample-nearest cuda kernel update #21694

Closed
wants to merge 5 commits

Conversation

jjsjann123 (Collaborator)

updating upsampling kernel:

  1. avoids atomicAdd for better fp16 performance (a gather-style sketch of this idea follows below).
  2. better launch configuration for 2D input.
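For illustration only: a minimal, self-contained CUDA sketch of the gather-style backward idea behind item 1, not the PR's actual kernel. It assumes a contiguous NCHW layout, an integer scale factor, and float data (the real kernels are templated over dtype, including fp16); the kernel and variable names are made up for this example. Each thread owns one grad_input element and sums the scale x scale block of grad_output that was copied from it, so every input-gradient element is written exactly once and no atomicAdd is needed.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void upsample_nearest2d_backward_gather(
    const float* __restrict__ grad_out,  // [N, C, H*scale, W*scale]
    float* __restrict__ grad_in,         // [N, C, H, W]
    int nc, int h_in, int w_in, int scale) {
  const int w_out = w_in * scale;
  const int total = nc * h_in * w_in;
  // Grid-stride loop over grad_input elements; one uncontended write each.
  for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total;
       idx += gridDim.x * blockDim.x) {
    const int w = idx % w_in;
    const int h = (idx / w_in) % h_in;
    const int plane = idx / (w_in * h_in);  // flattened N*C index
    float acc = 0.f;
    // Gather the scale x scale block of grad_output produced from this pixel.
    for (int dy = 0; dy < scale; ++dy) {
      for (int dx = 0; dx < scale; ++dx) {
        const int oh = h * scale + dy;
        const int ow = w * scale + dx;
        acc += grad_out[(plane * h_in * scale + oh) * w_out + ow];
      }
    }
    grad_in[idx] = acc;  // no atomics: this thread is the only writer
  }
}

int main() {
  // Tiny smoke test: one N*C plane, 2x2 input upsampled by 2 to 4x4.
  const int nc = 1, h = 2, w = 2, scale = 2;
  const int in_elems = nc * h * w, out_elems = in_elems * scale * scale;
  float *g_out, *g_in;
  cudaMallocManaged(&g_out, out_elems * sizeof(float));
  cudaMallocManaged(&g_in, in_elems * sizeof(float));
  for (int i = 0; i < out_elems; ++i) g_out[i] = 1.f;
  upsample_nearest2d_backward_gather<<<1, 128>>>(g_out, g_in, nc, h, w, scale);
  cudaDeviceSynchronize();
  for (int i = 0; i < in_elems; ++i) printf("%.0f ", g_in[i]);  // expect: 4 4 4 4
  printf("\n");
  cudaFree(g_out);
  cudaFree(g_in);
  return 0;
}

The scatter alternative (each grad_output thread atomicAdd-ing into grad_in) needs atomics because many output elements map to the same input element, and half-precision atomicAdd is particularly slow on many GPUs, which is what item 1 avoids.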

@pytorchbot added the module: cuda and module: operators labels Jun 12, 2019
@jjsjann123 (Collaborator, Author)

Perf numbers/scripts will be posted shortly.
I'll have another PR for bilinear upsampling coming soon as well.

cc @ngimel

@jjsjann123 (Collaborator, Author)

[image: benchmark results]

The fp16 forward perf numbers have been observed to be all over the place, especially for tiny inputs. :/
The thing to notice here is the speedup on the fp16 backward path.

Here's the script for the benchmark

import time
from itertools import product

import torch

nrep = 300              # timed iterations per configuration
sample_mode = "nearest"


def bench(size, fn, factor, half=False, b=2, c=32, dim=2):
    # Time the forward pass of fn (interpolate) on a contiguous ones tensor.
    x = torch.ones([b * c * (size ** dim)], device='cuda', dtype=torch.float)
    if half:
        x = x.half()
    if dim == 1:
        x = x.view(b, c, size).requires_grad_()
    elif dim == 2:
        x = x.view(b, c, size, size).requires_grad_()
    elif dim == 3:
        x = x.view(b, c, size, size, size).requires_grad_()
    torch.cuda.synchronize()
    start = time.time()
    for i in range(nrep):
        out = fn(x, scale_factor=factor, mode=sample_mode)
    torch.cuda.synchronize()
    end = time.time()
    inp_size = x.size()
    out_size = out.size()
    del x, out
    return ((end - start) / nrep, inp_size, out_size)


def bench_back(size, fn, factor, half=False, b=2, c=32, dim=2):
    # Time the backward pass: run one forward, then call backward nrep times.
    x = torch.ones([b * c * (size ** dim)], device='cuda', dtype=torch.float)
    if half:
        x = x.half()
    if dim == 1:
        x = x.view(b, c, size).requires_grad_()
    elif dim == 2:
        x = x.view(b, c, size, size).requires_grad_()
    elif dim == 3:
        x = x.view(b, c, size, size, size).requires_grad_()
    torch.cuda.synchronize()
    out = fn(x, scale_factor=factor, mode=sample_mode)
    grad = torch.randn_like(out)
    start = time.time()
    for i in range(nrep):
        # retain_graph keeps the autograd graph alive across repeated backward calls
        out.backward(grad, retain_graph=True)
    torch.cuda.synchronize()
    end = time.time()
    inp_size = x.size()
    out_size = out.size()
    del x, out, grad
    return ((end - start) / nrep, inp_size, out_size)


spatial_size = [2 ** i for i in range(5, 12)]
batch = [8]
channel = [32]
dim = [1, 2, 3]
scale_factor = [2]
bool_flag = [False, True]   # fp32, then fp16
cap = 2 ** 31               # skip configs whose output tensor would reach 2 GiB

for d, b, c, s, f, half_flag in product(dim, batch, channel, spatial_size, scale_factor, bool_flag):
    if ((s * f) ** d) * b * c * (2 if half_flag else 4) < cap:
        (fw_time, inp_size, out_size) = bench(s, torch.nn.functional.interpolate, f, half_flag, b, c, d)
        (bw_time, inp_size, out_size) = bench_back(s, torch.nn.functional.interpolate, f, half_flag, b, c, d)
        # columns: input size, scale factor, fp16 flag, forward and backward throughput (iters/sec)
        print(inp_size, f, half_flag, 1. / fw_time, 1. / bw_time)


@jjsjann123 (Collaborator, Author)

Removed the specialized 2d kernel, as the speedup was only sporadic. Caching already seems to handle the memory access pattern well.

I don't think I can justify having a dedicated kernel there 😢

@ezyang ezyang requested a review from ngimel June 14, 2019 17:43
@ezyang (Contributor) commented Jun 14, 2019

@ngimel happy to merge this if you give it the OK

@zhangguanheng66 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 14, 2019
@ezyang ezyang self-requested a review June 14, 2019 18:40
@ngimel (Collaborator) left a comment

Make sure there are checks against empty tensors, and make sure you are not excessively zeroing outputs. Those checks might already be somewhere in the code and I might be missing them, in which case it is good to go.
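A hedged, standalone sketch of the two points above, not the PR's code: guard against empty tensors before launching, and skip a separate zero-fill when the kernel itself writes every output element. The wrapper name and raw-pointer interface are invented for the example; the real op works on at::Tensor with the usual ATen checks.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void write_all_kernel(float* out, int n, float value) {
  // Writes every output element exactly once, so no prior zeroing is needed.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    out[i] = value;
  }
}

void launch_write_all(float* out, int n, float value) {
  if (n == 0) {
    // Empty tensor: nothing to compute, and a zero-sized grid launch would
    // fail with cudaErrorInvalidConfiguration, so return early.
    return;
  }
  // No cudaMemset / output.zero_() pass here: the kernel covers every element.
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  write_all_kernel<<<blocks, threads>>>(out, n, value);
}

int main() {
  launch_write_all(nullptr, 0, 0.f);  // empty case: no launch, no error
  float* d = nullptr;
  cudaMalloc(&d, 8 * sizeof(float));
  launch_write_all(d, 8, 3.f);
  cudaDeviceSynchronize();
  float h[8];
  cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
  printf("%.0f %.0f\n", h[0], h[7]);  // expect: 3 3
  cudaFree(d);
  return 0;
}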

Inline review threads (outdated, resolved) on:
aten/src/ATen/native/cuda/UpSample.cuh
aten/src/ATen/native/cuda/UpSampleNearest1d.cu (two threads)
aten/src/ATen/native/cuda/UpSampleNearest2d.cu (two threads)
@jjsjann123 (Collaborator, Author) commented Jun 17, 2019

Addressed review comments. Should be good to go once tests pass.

@ngimel (Collaborator) left a comment

Review comments are addressed, great job, Jie!

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 18, 2019
Summary:
updating upsampling kernel:
1. avoids atomicAdd for better fp16 performance.
2. better launch configuration for 2D input.
Pull Request resolved: pytorch/pytorch#21694

Differential Revision: D15875791

Pulled By: ezyang

fbshipit-source-id: 426fc5d5f0c0cdf58bfa1a2b564f17a9ea286fa4
@facebook-github-bot (Contributor)

@ezyang merged this pull request in c471a63.

Labels
Merged, module: cuda, open source, triaged

7 participants