
Use libjpeg-turbo in CI instead of libjpeg #5941

Closed · wants to merge 9 commits

Conversation

NicolasHug (Member) commented May 4, 2022

This PR makes our CI rely on libjpeg-turbo instead of libjpeg. The main benefit of libjpeg-turbo is decoding speed, but it will also allow us to clean up our tests and make them more robust.

Note: This PR only concerns the CI tests, not the packaging for conda or PyPI. I'll look into that in a separate PR.

We had numerous issues and headaches in the past because of differences between PIL and torchvision and their underlying implementation (libjpeg vs libjpeg-turbo), e.g. #3913, #5910 or #5162.

PIL has shipped with libjpeg-turbo on Windows for a while (python-pillow/Pillow#3833 (comment)), and since PIL 9 the Linux and macOS wheels are also linked against turbo.

Shipping with libjpeg-turbo instead of libjpeg will allow us to be fully aligned with PIL and to greatly simplify and clean up our tests, which had become a bit messy (see the amount of removed code in this PR). Note that internally, we already rely on turbo.
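As a side note, you can check at runtime which JPEG backend a given Pillow install reports. A minimal sketch (the helper name is ours; it assumes `PIL.features.check_feature("libjpeg_turbo")`, which recent Pillow versions expose, and degrades gracefully otherwise):

```python
def pillow_jpeg_backend():
    """Report which JPEG library Pillow appears to be linked against.

    Returns "libjpeg-turbo", "libjpeg", or "unknown" (e.g. Pillow is
    missing, or too old to expose the libjpeg_turbo feature flag).
    """
    try:
        from PIL import features
        turbo = features.check_feature("libjpeg_turbo")
    except Exception:
        return "unknown"
    if turbo is None:  # feature flag not known to this Pillow version
        return "unknown"
    return "libjpeg-turbo" if turbo else "libjpeg"
```

On a PIL 9 wheel this is expected to report "libjpeg-turbo".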

Closes #5184
Closes #5162
Closes #3913
Fixes #5910

@NicolasHug NicolasHug marked this pull request as draft May 4, 2022 13:56
pmeier (Collaborator) commented May 4, 2022

If this works, we can probably close #5184, correct?

NicolasHug (Member, Author)

yup

NicolasHug (Member, Author) commented May 4, 2022

CI looks happy so far (it looks red, but the relevant tests passed). @andfoy @fmassa, do you remember by any chance why we didn't rely on libjpeg-turbo from the start? Perhaps because at the time PIL was not linked against libjpeg-turbo?

andfoy (Contributor) commented May 4, 2022

Yes indeed, that was the reason

pmeier (Collaborator) left a comment

Since CI seems to be happy, I think this is good to go. Thanks Nicolas!

Review thread on test/test_image.py (outdated, resolved)
@NicolasHug NicolasHug changed the title [WIP] Use libjpeg-turbo instead of libjpeg Use libjpeg-turbo instead of libjpeg May 5, 2022
@NicolasHug NicolasHug changed the title Use libjpeg-turbo instead of libjpeg Use libjpeg-turbo in CI instead of libjpeg May 5, 2022
@NicolasHug NicolasHug marked this pull request as ready for review May 5, 2022 08:59
datumbox (Contributor) previously approved these changes May 9, 2022

LGTM, thanks!

@@ -1,13 +1,14 @@
channels:
- pytorch
- defaults
- conda-forge
Contributor comment on the diff:

cc @malfet as previously you've tried to phase out conda-forge from TorchVision.

malfet (Contributor):

There are two issues I'm slightly concerned about:

  • Would it mean that we are testing something different than what we ship?
  • This might unintentionally pull in dependencies that we think exist in conda but do not.

If this is only needed for testing, we can build and host libjpeg-turbo for all the platforms we care about in the pytorch or pytorch-nightly channels.

NicolasHug (Member, Author) replied:

Thanks for your feedback Nikita

Would it mean that we are testing something different than what we ship?

Yes, but only until #5951 is merged. #5951's goal is to ship torchvision with libjpeg-turbo.

It's also not as bad as it sounds: it's fair at this point to assume that libjpeg-turbo's libjpeg.h is compatible with that of libjpeg. In terms of decoding results, both are JPEG-compliant as well.

If this is only needed for testing, we can build and host libjpeg-turbo for all the platforms we care about in pytorch or pytorch-nightly channels

Ultimately we don't want this just for testing: we also want to ship torchvision with libjpeg-turbo in #5951. But if we're happy to host libjpeg-turbo in the pytorch channel, that might make all of this much easier. WDYT @malfet?

Comment on lines +100 to +102
if mode == ImageReadMode.GRAY:
abs_mean_diff = (img_ljpeg.type(torch.float32) - img_pil).abs().mean().item()
assert abs_mean_diff < 1
NicolasHug (Member, Author) commented May 9, 2022:
Not too sure why this happens, TBH. Maybe there are some slight differences in our decoding C++ code? But this is not a regression; in fact, we're now making this check much stricter than it previously was.
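For intuition, the check above boils down to an average absolute pixel difference between the two decoders' outputs. A pure-Python sketch of the same logic (illustrative only; the real test operates on torch tensors via `.abs().mean()`):

```python
def mean_abs_diff(img_a, img_b):
    """Average absolute difference between two images given as
    nested lists of pixel values (same shape assumed)."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    assert len(flat_a) == len(flat_b), "images must have the same shape"
    return sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)

# Two decoders may disagree slightly on individual pixels; the test
# accepts the result as long as the mean difference stays below 1.
decoded_ours = [[12, 40], [200, 255]]
decoded_pil = [[13, 40], [199, 255]]
assert mean_abs_diff(decoded_ours, decoded_pil) < 1
```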

datumbox (Contributor) commented May 9, 2022

If we merge this, we should verify that it has no significant effect on model accuracies. We don't need to test all the models; a sample of 2-3 cases will do.

NicolasHug (Member, Author) commented May 9, 2022

@datumbox I think you meant to comment on #5951? This PR should have zero impact on user-facing features

datumbox (Contributor) commented May 9, 2022

@NicolasHug Yes, I meant to comment on the other PR. :( Sorry for the confusion. Should I post it again on the other one, or are we good?

@datumbox datumbox dismissed their stale review September 26, 2022 10:41

Minor changes detected in accuracy; needs more discussion.

datumbox (Contributor) commented:

@NicolasHug I ran a few benchmark checks to see what the effect of this change would be. I do factor in the argument that PIL changed its backend in 9.x, but I think we should carefully measure the effect on existing pre-trained models, the speed improvements we get by switching, and potential alternative approaches. This is one of those changes that is trivial to make on the code side but may have effects that need to be investigated.

Accuracy Benchmarks

First, some good news: the effect on the models is detectable but very small. Here is how the model accuracies change between JPEG (libjpeg) and JPEG-turbo (libjpeg-turbo):

Model                                     | JPEG (Acc@1 / Acc@5) | JPEG-turbo (Acc@1 / Acc@5)
ResNet50_Weights.IMAGENET1K_V1            | 76.130 / 92.864      | 76.148 / 92.876
ResNet50_Weights.IMAGENET1K_V2            | 80.854 / 95.434      | 80.844 / 95.436
MobileNet_V3_Large_Weights.IMAGENET1K_V1  | 74.044 / 91.322      | 74.056 / 91.318

The above were executed on a single GPU with batch=1 to minimize variations. From the tests your PR removes, we know there are differences between the two implementations (noted also in the PIL release notes you reference), but in practice it's mostly noise. It would be worth checking other model families such as object detection, semantic segmentation, and optical flow to see whether any of them is more sensitive to the change. If the differences remain this small and the speed gains are significant, I think there is a compelling case for introducing this minor BC-breaking change. Another argument for doing so is the stability we would gain across platforms in our unit tests, and the ability to compare PIL and TorchVision image-decoding results more easily.
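For reference, Acc@1 / Acc@5 above are top-1 / top-5 accuracies. A minimal plain-Python sketch of how a single prediction is scored (not the actual benchmark code):

```python
def topk_correct(scores, true_label, k):
    """Return True if true_label is among the k highest-scoring classes.

    scores: per-class scores for one image (index = class id).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:k]

# Example: class 2 has the second-highest score, so the prediction
# counts toward Acc@5 but not toward Acc@1.
scores = [0.05, 0.60, 0.25, 0.10]
assert not topk_correct(scores, 2, 1)
assert topk_correct(scores, 2, 5)
```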

Alternative approaches

There are alternative routes we can take, some of which are temporary. None of them is "free", and we should consider them only if there is a massive effect on the accuracy of the existing models:

  1. We can temporarily patch the pil_loader() method in our datasets (until Datasets V2 is stable) to remove PIL's JPEG decoding from the equation and use TorchVision's read_image(), which relies on libjpeg. Here is a quick-and-dirty patch for this that I tested; it works as expected, giving identical accuracy.
index 40d5e26d..982dd2ba 100644
--- a/torchvision/datasets/folder.py
+++ b/torchvision/datasets/folder.py
@@ -243,9 +243,15 @@ IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tif
 
 def pil_loader(path: str) -> Image.Image:
     # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
+    """
     with open(path, "rb") as f:
         img = Image.open(f)
         return img.convert("RGB")
+    """
+    from torchvision.io.image import read_image
+    from torchvision.transforms.functional import to_pil_image
+    img = read_image(path)
+    return to_pil_image(img).convert("RGB")

It's worth noting that this method is not used by most of the datasets right now (they opt for calling Image.open(f).convert("RGB") directly), but in this scenario we could easily change that without BC-breaking problems. This only partially mitigates the issue, as users who read images directly with PIL (outside of our datasets) would still be affected. At least it gives us an option to ensure that TorchVision doesn't break BC in the components it controls.

  2. Pin PIL temporarily to an 8.x version, reach out to the PIL maintainers, and discuss options (potentially offering a way to switch backends). We can investigate options to address the dependency issues this would cause.

  3. Do nothing to mitigate the issue on the PIL side, flag the problem to our users, and push them to use the Tensor transforms backend while ensuring the accuracy is identical if they switch to it. This might mean aggressively aligning Tensor transforms with PIL on every difference (for example antialias=True), which is an issue of its own and needs to be discussed.

I hope that none of the above will be necessary and that the difference in accuracy will remain minor, allowing us to get away with this minor BC breakage. I would also be OK with merging a PR like this immediately after the upcoming release, to give users plenty of time to detect and flag potential issues. It's highly likely we would need to rerun inference jobs for all models to correct the metadata and documentation, so this needs to be factored into the amount of work this switch entails. It might be worth starting a project doc or an RFC on GitHub to record all of this (instead of having these discussions spread across PRs and issues) and keep track of what we need to do. Happy to chat more about this.

NicolasHug (Member, Author):

Thanks a lot for the benchmarks @datumbox . Just sharing initial thoughts below:

Here is how the model accuracies change with JPEG and JPEG-turbo

Could you share the exact setup you used to compare libjpeg and libjpeg-turbo? Did you rely on decode_jpeg(), or did you instead compare PIL 8 vs PIL 9? I'm wondering whether PIL 8 vs PIL 9 may flag differences that come e.g. from a difference in transform results, on top of the different decoders. I would guess the transforms always give the same results across PIL versions, but who knows?

We can temporarily patch the pil_loader() method on our Datasets (until Datasets V2 is stable), to remove PIL's jpeg-decoding from the equation and use TorchVision's read_image() which relies on LibJPEG

Unfortunately, our read_image() (and particularly decode_jpeg()) isn't as complete as PIL's decoders. For example we currently don't support decoding CMYK -> RGB jpegs, which makes it impossible to decode some of the ImageNet samples #6538 with read_image()

datumbox (Contributor):

@NicolasHug

Could you share the exact setup you used to compare libjpeg and libjpeg-turbo?

I tried to isolate the effects of PIL versions as follows:

  1. I used the patch to completely circumvent PIL and linked against different JPEG backends. So differences on PIL versions shouldn't factor in here.
  2. I compared models (MobileNetV3) for which I know the full training history, which was done prior to the PIL 9 changes. The accuracy reported in our release notes and metadata matches what I get. This way I know the old PIL backend aligns extremely closely with our IO. Worth noting that we had (flaky) tests checking this, so historically, prior to the backend change, we ensured we were aligned.

From the above, I'm fairly confident that our IO read_image() closely aligns with PIL v8 and that my benchmarks isolate only the effect of JPEG vs JPEG-turbo. If you spot anything weird in my logic let me know.

don't support decoding CMYK -> RGB jpegs

That's a good callout. There are probably no CMYK images in the validation split, which is why I didn't get an error during inference. I think CMYK is something we should look into supporting eventually; I added a TODO when I worked on the read_image() API with some references, and it didn't look too hard at the time, but we didn't have the time to do it. This might mean that if we get a substantial accuracy difference in any of the models and have to go with the first alternative option proposed above, we would have more work to do.


Successfully merging this pull request may close these issues.

- fix JPEG reference tests
- Fix jpeg encoding tests on windows
6 participants