Potential improvements to jpeg decoding on GPU #3848

Closed
NicolasHug opened this issue May 17, 2021 · 6 comments

Comments

@NicolasHug
Member

A minimal version of jpeg decoding on GPUs was implemented in #3792. Here's a list of potential future improvements:

@cceyda

cceyda commented Jun 15, 2021

I have just tested the new v0.10.0 release with beta support for nvjpeg, but I found GPU decoding to be about 2x slower than CPU decoding.

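import os
import numpy as np
import torch
from torchvision.io import decode_jpeg

# `folder` is the directory containing the .jpg files to decode
images_bytes = [
    np.frombuffer(open(os.path.join(folder, a), 'rb').read(), dtype=np.uint8)
    for a in os.listdir(folder) if a.endswith('jpg')
]

#%%timeit -n 1 -r 100
for img_bytes in images_bytes:
    z = torch.from_numpy(img_bytes)
    z = decode_jpeg(z, device='cuda')  # CPU baseline: z = decode_jpeg(z)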

Using:

  • Titan RTX
  • Cuda 10.2
  • python 3.6.9

benchmarking code: https://github.com/cceyda/image-checker/blob/master/examples/benchmark_jpeg_decode_extended.ipynb

I also kept getting the error below with CUDA 11.1:

~/.local/lib/python3.6/site-packages/torchvision/io/image.py in decode_jpeg(input, mode, device)
     174     device = torch.device(device)
     175     if device.type == 'cuda':
 --> 176         output = torch.ops.image.decode_jpeg_cuda(input, mode.value, device)
     177     else:
     178         output = torch.ops.image.decode_jpeg(input, mode.value)

 RuntimeError: nvjpegDecode failed: 5

@NicolasHug
Member Author

Hi @cceyda, GPU benchmarks like the ones you're reporting should call something like torch.cuda.synchronize between each run to get accurate results. For more comparable results, would you mind using something like the code in #2786 (comment)? You can find it by clicking on the "Benchmark code for ref" part.
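As a rough illustration of the synchronization point above, here is a minimal timing sketch. The helper name `time_op` and its parameters are hypothetical (they are not part of torchvision or of the benchmark referenced in #2786); the idea is only that a synchronization call brackets each timed run, so that asynchronously queued GPU work is counted.

```python
import statistics
import time


def time_op(fn, sync=lambda: None, warmup=3, runs=20):
    """Return (mean_ms, median_ms) for fn(), bracketing each run with sync().

    For CUDA work, pass sync=torch.cuda.synchronize so that kernels queued
    asynchronously are finished before the clock is started and stopped.
    """
    for _ in range(warmup):  # warm-up runs absorb one-time initialization costs
        fn()
    samples_ms = []
    for _ in range(runs):
        sync()  # make sure no earlier work is still pending
        start = time.perf_counter()
        fn()
        sync()  # wait for the work launched by fn() to finish
        samples_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples_ms), statistics.median(samples_ms)


# CPU-only demonstration with a stand-in workload and the default no-op sync
mean_ms, median_ms = time_op(lambda: sum(range(10_000)))
print(f"mean: {mean_ms:.3f} ms, median: {median_ms:.3f} ms")
```

With torchvision, this could presumably be called as `time_op(lambda: decode_jpeg(data, device='cuda'), sync=torch.cuda.synchronize)`, where `data` is the tensor of encoded JPEG bytes.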

Also, please note that this issue is for tracking potential improvements to the GPU decoding. Could you please report the failure as a separate issue? It would be easier to keep track of it.

@cceyda

cceyda commented Jun 16, 2021

Even with the benchmark code I adapted from nvjpeg_bench.py used in #2786 (comment), I always get slower results with CUDA decoding. I have tried many different versions of the benchmark.

nvjpeg_bench.py below:

from torch.utils.benchmark import Timer
from torchvision.io.image import decode_jpeg, read_file

img_path = './grace_hopper_517x606.jpg'
data = read_file(img_path)
img = decode_jpeg(data)

def sumup(name, mean, median, throughput, fps):
    print(
        f"{name:<10} mean: {mean:.3f} ms, median: {median:.3f} ms, "
        f"Throughput = {throughput:.3f} Megapixel / sec, "
        f"{fps:.3f} fps"
    )

print(f"img.shape = {img.shape}")
print(f"data.shape = {data.shape}")
height, width = img.shape[-2:]

num_pixels = height * width
num_runs = 100

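# For the CPU run, the extra .to('cuda:0') includes the host-to-device copy in the timing;
# for the CUDA run it is effectively a no-op because the output is already on the GPU.
stmt = "a=decode_jpeg(data, device='{}')\na=a.to(device='cuda:0')"
setup = 'from torchvision.io.image import decode_jpeg'
timer_globals = {'data': data}  # renamed to avoid shadowing the built-in globals()

for device in ('cpu', 'cuda'):
    # timeit(num_runs) returns a single averaged measurement, so mean and median coincide
    t = Timer(stmt=stmt.format(device), setup=setup, globals=timer_globals).timeit(num_runs)
    sumup(device, t.mean * 1000, t.median * 1000, num_pixels / 1e6 / t.median, 1 / t.median)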

Server 1 ENV:

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.14.4
Libc version: glibc-2.25

Python version: 3.6 (64-bit runtime)
Python platform: Linux-4.15.0-108-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: 
GPU 0: TITAN RTX

Nvidia driver version: 460.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] pytorch-lightning==1.4.0.dev0
[pip3] torch==1.9.0
[pip3] torch-model-archiver==0.2.0
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchgeometry==0.1.2
[pip3] torchmetrics==0.3.2
[pip3] torchserve==0.4.0
[pip3] torchserve-dashboard==0.3.2
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.14           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] numpy                     1.17.2           py37haad9e8e_0  
[conda] numpy-base                1.17.2           py37hde5b4d6_0  
[conda] numpydoc                  0.9.1                      py_0

Server 1 results (CUDA roughly 2x slower):

#run 1 python3 nvjpeg_bench.py 
cpu        mean: 2.071 ms, median: 2.071 ms, Throughput = 151.248 Megapixel / sec, 482.753 fps
cuda       mean: 4.988 ms, median: 4.988 ms, Throughput = 62.816 Megapixel / sec, 200.497 fps
#run 2
cpu        mean: 2.157 ms, median: 2.157 ms, Throughput = 145.254 Megapixel / sec, 463.624 fps
cuda       mean: 4.417 ms, median: 4.417 ms, Throughput = 70.937 Megapixel / sec, 226.417 fps
#run 3
cpu        mean: 2.182 ms, median: 2.182 ms, Throughput = 143.612 Megapixel / sec, 458.381 fps
cuda       mean: 3.836 ms, median: 3.836 ms, Throughput = 81.682 Megapixel / sec, 260.712 fps
#run 4
cpu        mean: 2.178 ms, median: 2.178 ms, Throughput = 143.874 Megapixel / sec, 459.217 fps
cuda       mean: 3.725 ms, median: 3.725 ms, Throughput = 84.108 Megapixel / sec, 268.455 fps
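The columns in the output above are tied together by the formulas in the script: fps is 1 / median, and throughput is num_pixels / 1e6 / median. The following is a small sanity check of those relationships using the run 1 numbers. It assumes the image is 517 x 606 pixels, as the filename grace_hopper_517x606.jpg suggests (the printed img.shape is not shown in the thread), and the helper names are illustrative only.

```python
# Assumption: image dimensions taken from the filename grace_hopper_517x606.jpg
NUM_PIXELS = 517 * 606


def throughput_from_fps(fps, num_pixels=NUM_PIXELS):
    """Megapixels per second; equivalent to num_pixels / 1e6 / median in the script."""
    return num_pixels / 1e6 * fps


def median_ms_from_fps(fps):
    """Median time per image in milliseconds, since fps = 1 / median."""
    return 1000 / fps


# fps values reported for Server 1, run 1
for device, fps in (("cpu", 482.753), ("cuda", 200.497)):
    print(f"{device:<5} median ~ {median_ms_from_fps(fps):.3f} ms, "
          f"throughput ~ {throughput_from_fps(fps):.3f} Megapixel / sec")
```

The recomputed values agree with the reported ones to within rounding of the printed fps.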

The CUDA 11.1 bug disappeared mysteriously; ipykernel must have been reconnecting to an old kernel despite restarts 🤷 I'll open a separate issue if I re-encounter and isolate it.

So I also ran the benchmarks on an A100 with CUDA 11.1.

Server 2 ENV:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-70-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: 
GPU 0: A100-PCIE-40GB

Nvidia driver version: 460.73.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0+cu111
[pip3] torch-model-archiver==0.3.0b20210517
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchgeometry==0.1.2
[pip3] torchserve==0.3.0b20210517
[pip3] torchserve-dashboard==0.4.0
[pip3] torchtext==0.8.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

Server 2 results (CUDA roughly 5x slower):

#run 1 python3 nvjpeg_bench.py 
cpu        mean: 1.703 ms, median: 1.703 ms, Throughput = 184.022 Megapixel / sec, 587.362 fps
cuda       mean: 8.776 ms, median: 8.776 ms, Throughput = 35.701 Megapixel / sec, 113.949 fps
#run 2
cpu        mean: 1.765 ms, median: 1.765 ms, Throughput = 177.545 Megapixel / sec, 566.688 fps
cuda       mean: 8.709 ms, median: 8.709 ms, Throughput = 35.975 Megapixel / sec, 114.825 fps
#run 3
cpu        mean: 1.741 ms, median: 1.741 ms, Throughput = 179.986 Megapixel / sec, 574.481 fps
cuda       mean: 8.586 ms, median: 8.586 ms, Throughput = 36.492 Megapixel / sec, 116.474 fps
#run 4
cpu        mean: 1.735 ms, median: 1.735 ms, Throughput = 180.537 Megapixel / sec, 576.239 fps
cuda       mean: 8.950 ms, median: 8.950 ms, Throughput = 35.005 Megapixel / sec, 111.728 fps

(Nothing else was running on the gpu during benchmarks)
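To put numbers on the "x2" and "x5" summaries, the slowdown can be computed as the ratio of the CUDA median to the CPU median for each run. The sketch below simply transcribes the medians reported above; the variable and function names are illustrative.

```python
# (cpu_median_ms, cuda_median_ms) for runs 1-4, copied from the results above
server1 = [(2.071, 4.988), (2.157, 4.417), (2.182, 3.836), (2.178, 3.725)]
server2 = [(1.703, 8.776), (1.765, 8.709), (1.741, 8.586), (1.735, 8.950)]


def slowdowns(runs):
    """CUDA-to-CPU time ratio per run; values above 1 mean CUDA was slower."""
    return [cuda / cpu for cpu, cuda in runs]


for name, runs in (("Server 1 (TITAN RTX)", server1), ("Server 2 (A100)", server2)):
    ratios = slowdowns(runs)
    print(f"{name}: min {min(ratios):.2f}x, max {max(ratios):.2f}x")
```

On these numbers the TITAN RTX slowdown ranges from roughly 1.7x to 2.4x across runs, while the A100 stays close to 5x.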

@NicolasHug
Member Author

Thanks for the details, we will keep that in mind.

The CUDA 11.1 bug disappeared mysteriously; ipykernel must have been reconnecting to an old kernel despite restarts 🤷 I'll open a separate issue if I re-encounter and isolate it.

Sounds good!

So I also ran the benchmarks on an A100 with CUDA 11.1.

Just note that while the code runs on the A100, we haven't implemented full A100 support yet, so we can't take advantage of the dedicated hardware instructions that the A100 has. We'll look into it in the future, and this is one of the items of this issue, but I don't have access to an A100 at the moment.

@cceyda

cceyda commented Jun 16, 2021

I just ran it on Colab and CUDA is slightly faster... I don't know what is wrong with my local setup :/

image

@NicolasHug
Member Author

I think most of these items have been addressed in #8496, so I'll close this issue. Feel free to open follow-up issues for any feedback on the JPEG GPU decoder.
