Potential improvements to jpeg decoding on GPU #3848
A minimal version of jpeg decoding on GPUs was implemented in #3792. This issue tracks a list of potential future improvements.
I have just tested the new v0.10.0 release with beta support for nvjpeg, but I found it to be about 2x slower than CPU decoding. Using:

```python
import os

import numpy as np
import torch
from torchvision.io.image import decode_jpeg

# folder: directory containing the .jpg files
images_bytes = [
    np.frombuffer(open(os.path.join(folder, a), 'rb').read(), dtype=np.uint8)
    for a in os.listdir(folder) if a.endswith('jpg')
]

# %%timeit -n 1 -r 100
for img_bytes in images_bytes:
    z = torch.from_numpy(img_bytes)
    z = decode_jpeg(z, device='cuda')  # CPU baseline: z = decode_jpeg(z)
```

Full benchmarking code: https://github.com/cceyda/image-checker/blob/master/examples/benchmark_jpeg_decode_extended.ipynb

I also kept getting an error with CUDA 11.1.
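One caveat when timing a loop like the one above: `decode_jpeg(..., device='cuda')` launches asynchronous CUDA work, so a plain wall-clock loop can under-count the GPU time. Below is a minimal sketch of a synchronized measurement; the function name and `runs` parameter are illustrative, not from the original notebook:

```python
import time

import torch
from torchvision.io.image import decode_jpeg

def time_gpu_decode(images_bytes, runs=100):
    # Warm up once so CUDA context creation and library
    # initialization are not counted in the measurement.
    decode_jpeg(torch.from_numpy(images_bytes[0]), device='cuda')
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        for img_bytes in images_bytes:
            decode_jpeg(torch.from_numpy(img_bytes), device='cuda')
    # Wait for all queued GPU work to finish before stopping the clock.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```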
hi @cceyda , the GPU benchmarks you're reporting should be using something like the Timer-based script in nvjpeg_bench.py. Also, please note that this issue is for tracking potential improvements to the GPU decoding. Could you please submit the bug failure as a separate issue? It would be easier to keep track of it.
Even with the benchmark code I adapted from nvjpeg_bench.py below:

```python
import torch
from torch.utils.benchmark import Timer
from torchvision.io.image import decode_jpeg, read_file

img_path = './grace_hopper_517x606.jpg'
data = read_file(img_path)
img = decode_jpeg(data)

def sumup(name, mean, median, throughput, fps):
    print(
        f"{name:<10} mean: {mean:.3f} ms, median: {median:.3f} ms, "
        f"Throughput = {throughput:.3f} Megapixel / sec, "
        f"{fps:.3f} fps"
    )

print(f"img.shape = {img.shape}")
print(f"data.shape = {data.shape}")

height, width = img.shape[-2:]
num_pixels = height * width
num_runs = 100

# .to(device) added to account for the time spent moving the
# CPU-decoded image to the GPU.
stmt = "a = decode_jpeg(data, device='{}')\na = a.to(device='cuda:0')"
setup = 'from torchvision.io.image import decode_jpeg'
globals = {'data': data}

for device in ('cpu', 'cuda'):
    t = Timer(stmt=stmt.format(device), setup=setup, globals=globals).timeit(num_runs)
    sumup(device, t.mean * 1000, t.median * 1000, num_pixels / 1e6 / t.median, 1 / t.median)
```

Server 1 ENV:

Server 1 results: (CUDA ~2x slower)

The CUDA 11.1 bug disappeared mysteriously; ipykernel must have been reconnecting to an old kernel despite restarts 🤷 I'll open a separate issue if I re-encounter & isolate it. So I also ran the benchmarks on an A100 with CUDA 11.1.

Server 2 ENV:

Server 2 results: (CUDA ~5x slower)

(Nothing else was running on the GPU during the benchmarks.)
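One thing worth noting about these numbers: decoding a single small image per call leaves the GPU mostly idle, so per-call launch and transfer overhead dominates, which is one reason the GPU path can lose to CPU decoding here. Batched decoding is the scenario where the GPU is expected to win. A sketch, assuming a torchvision release (after this thread; see #8496 below) where `decode_jpeg` accepts a list of encoded tensors, and with hypothetical file paths:

```python
import torch
from torchvision.io import decode_jpeg, read_file

# Hypothetical paths; assumes a torchvision build where decode_jpeg
# accepts a list of encoded tensors for batched GPU decoding.
paths = ['img0.jpg', 'img1.jpg', 'img2.jpg']
encoded = [read_file(p) for p in paths]
decoded = decode_jpeg(encoded, device='cuda')  # list of CHW uint8 CUDA tensors
```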
Thanks for the details, we will keep that in mind.
Sounds good!
Just note that while the code runs on the A100, we haven't implemented full A100 support yet, so we can't take advantage of the dedicated hardware instructions that the A100 has. We'll look into it in the future, and this is one of the items of this issue, but I don't have access to an A100 ATM.
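For benchmarks like the ones above, it can help to record which hardware they actually ran on. A quick check using standard PyTorch APIs (the A100 reports compute capability 8.0, i.e. sm_80):

```python
import torch

# Confirms whether the benchmark ran on Ampere (A100) hardware.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")
```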
I think most of these items have been addressed in #8496, so I'll close this issue. Feel free to open follow-up issues for any feedback on the jpeg GPU decoder.