Add script to mass convert images to webp. #6594

Pentarctagon · 2022-03-28T05:05:35Z

This is a script that converts png to lossless webp and jpg to lossy webp (quality 90), as well as reconverts existing webp images using lossless conversion. This PR is not for actually converting any images.

Remaining work:

Some TODO comments.
See if there's some way to tell if a webp image was created with lossy vs lossless compression.

conversion-good.csv has a list of the files that had their size reduced by more than 10%.
conversion-bad.csv has a list of files that did not have their size reduced by more than 10%.

A quick summary:

Out of 13832 files, 13345 files had their size reduced by more than 10% while 487 did not.
Of the 13345 files, the average size reduction was ~30%.
The total size reduction for the 13345 files was 71.8 MBs.

Wedge009 · 2022-03-28T05:11:27Z

I was wondering about this. Was there, by any chance, conversions that increased the file size (regardless of how much the increase was)?

Pentarctagon · 2022-03-28T05:15:40Z

Some, but not many, and not by much. They're included in conversion-bad.csv.

CelticMinstrel · 2022-03-28T05:35:12Z

I see there are quite a few in "bad", mainly unit and terrain images. I wonder how much sense it would make to just state that unit and terrain images stay as PNG? I do notice a lot of unit and terrain images are also listed in "good" though… mostly with percentages above 70%, though there are also some outliers around 30% and I even spotted one as low as 9% which seems quite weird.

Elvish-Hunter · 2022-03-28T08:39:39Z

See if there's some way to tell if a webp image was created with lossy vs lossless compression.

@Pentarctagon Actually, there is a way. A few days ago, I was reading the WebP specifications to see if I could update wmlscope's incorrectlysized() function to support them; it turns out that there are three variants of WebP files, which are lossy, lossless and extended. Long short story, it should be enough to read the first 16 bytes of a file, check if the first 12 are an actual WebP header and check if the last 4 are VP8 (lossy), VP8L (lossless) or VP8X (extended, in which case further checks are needed because it might be either lossy or lossless).

Wedge009 · 2022-03-28T09:05:50Z

It looks like change_in_percent is actually ratio of new size to old size. How is it that files that are already in webp format are increasing so much?

Pentarctagon · 2022-03-28T14:17:54Z

It looks like change_in_percent is actually ratio of new size to old size. How is it that files that are already in webp format are increasing so much?

That I don't know. I had been thinking that webp -> webp might show improvements as improvements are made to libwebp/cwebp, but it might be that for whatever reason that doesn't work well.

@Pentarctagon Actually, there is a way. A few days ago, I was reading the WebP specifications to see if I could update wmlscope's incorrectlysized() function to support them; it turns out that there are three variants of WebP files, which are lossy, lossless and extended. Long short story, it should be enough to read the first 16 bytes of a file, check if the first 12 are an actual WebP header and check if the last 4 are VP8 (lossy), VP8L (lossless) or VP8X (extended, in which case further checks are needed because it might be either lossy or lossless).

That sounds like it wouldn't fully be able to check due to VP8X though?

Elvish-Hunter · 2022-03-28T17:02:21Z

That sounds like it wouldn't fully be able to check due to VP8X though?

Not quite. A VP8X file can contain various types of chunks (alpha, animation frames...) and it also contains at least a VP8 or VP8L chunk. I think that, for our purposes, we can safely assume that our files contain only one frame and no animations, so searching for the first VP8 or VP8L chunk header after the VP8X one should be enough.

Pentarctagon · 2022-03-28T18:09:45Z

It looks like the webp images generated both by gimp and by cwebp say they are VP8X for lossy conversion. cwebp for lossless has VP8L. So that might work actually.

Elvish-Hunter · 2022-03-28T19:14:39Z

If we want to consider VP8X lossy by default, indeed this simplifies a lot of things.
So, what would you like to do with this lossy/lossless detection? Am I correct in thinking that you want to improve the if check at line 56, so webp files aren't handled as lossless by default, but only if they're already lossless?

Pentarctagon · 2022-03-28T21:24:11Z

Yeah. I assume checking whether a lossless version of a lossy webp image is smaller is pointless, and even if there are some size savings at some point we shouldn't keep using lossy encoding again and again on the same image.

Elvish-Hunter · 2022-03-28T22:10:31Z

You can start replacing the if check with this one (I'm still testing it, but I want to let you have a look anyway):

        if filetype == ".png":
            subprocess_args = lossless_args
        elif filetype == ".jpg" or filetype == ".jpeg":
            subprocess_args = lossy_args
        elif filetype == ".webp":
            # detect if the file is lossy or lossless
            with open(os.path.join(image_root, filename), "rb") as img:
                # discard first 12 bytes of WebP "magic number" and file size
                img.read(12)
                # read type of WebP file
                webp_type = img.read(4)
                if webp_type == b"VP8L":
                    subprocess_args = lossless_args
                # assume that extended WebP files are lossy
                elif webp_type == b"VP8 " or webp_type == b"VP8X":
                    subprocess_args = lossy_args
                else:
                    print(filename, "is not a valid WebP file", file=sys.stderr)
                    sys.exit(1)
        else:
            continue

Some other random thoughts:

Don't use string concatenation to create a path from two strings, use os.path.join() instead; it's useful if you want to run the script on Windows.
You can increase the counter with count += 1.
Instead of repeated string concatenations, you can use the format() method on string objects; you'll have better performances and control on the output.
In the last for cycle, you're effectively opening and closing the csv files for each file. Since I/O operations on disk are expensive, can you open the files just once and move the for cycle inside their context managers (with .. as blocks)?
Commas are valid characters in filenames, so tabs (\t) are probably a better choice as field separators.
When doing time calculations, you can also use the modulo / remainder of division operator (example: seconds_duration = int(duration % 60)).
Perhaps you should add a try ... except block to intercept KeyboardInterrupt exceptions and do some cleanup, like removing the temporary files that might still be there.

Pentarctagon · 2022-03-29T17:15:35Z

Commas are valid characters in filenames, so tabs (\t) are probably a better choice as field separators.

I don't follow this one - what does commas being a valid character for a filename have to do with using them in the contents of the file?

Elvish-Hunter · 2022-03-29T18:09:37Z

In your CSV files, you're using commas as field separators and no character as field qualifier. In this situation, if a file path contains a comma, this one will be interpreted as a field separator rather than as character, causing issues (stuff placed in the wrong columns) when you attempt to import the resulting CSV file into a spreadsheet.
This is why I recommended using tabs as field separators: on Windows they aren't allowed at all in filenames and I've never seen anyone using them in other systems. Alternatively, you can wrap each field into a pair of double quotes, so only commas outside of quotes are recognized as separators.

Pentarctagon · 2022-03-31T02:08:43Z

Alright, the above comments have been addressed.

Pentarctagon · 2022-04-01T20:29:55Z

And should be good to go. The only question I have being whether it really makes sense to try reconverting lossy webp images at all given the conversion is, well, lossy.

Wedge009 · 2022-04-02T00:27:58Z

Generally I would avoid conversion of lossy images at all, but the earlier discussion seemed to suggest that it wouldn't be easy to tell if an image was using lossy or lossless compression.

Pentarctagon · 2022-04-02T01:33:04Z

I think we have a reasonable way of telling apart lossy from lossless. The script is assuming VP8X is lossy when it can contain either, but I'm not sure how to tell which it really is for that variation.

Elvish-Hunter · 2022-04-02T19:04:47Z

utils/optiwebp.py

+    print("cwebp executable not found in PATH, exiting.")
+    sys.exit(1)
+
+image_dirs = [


Here you should use os.path.join() instead of slashes.

Elvish-Hunter · 2022-04-02T19:06:57Z

utils/optiwebp.py

+                    else:
+                        print(filename, "is not a valid WebP file", file=sys.stderr)
+                        continue
+            elif filetype == ".jpg":


JPEG files can also have the .jpeg extension, so you should add or filetype == ".jpeg" (or you can use elif filetype in (".jpg", ".jpeg"):.

Elvish-Hunter · 2022-04-02T19:13:27Z

utils/optiwebp.py

+duration = time.time() - start
+
+hours_duration = int(duration / 3600)
+minutes_duration = int((duration - (hours_duration * 3600)) / 60)


Again, using the modulo operator is better: minutes_duration = int((duration % 3600) / 60).
If you want, you can also remove the int() casting and use the integer division operator // (example: hours_duration = duration // 3600

Elvish-Hunter · 2022-04-02T19:16:43Z

utils/optiwebp.py

+if os.path.exists(good_conversion):
+    os.remove(good_conversion)
+with open(good_conversion, "a") as f:
+    f.write("filename,old_size,new_size,change_in_percent,change_in_bytes\n")


You can move this line and line 134 inside the with ... as block at line 139.

Elvish-Hunter · 2022-04-02T19:23:01Z

utils/optiwebp.py

+initial_total_size = 0
+final_total_size = 0
+
+with open(good_conversion, "a") as good_f, open(bad_conversion, "a") as bad_f:


If you move lines 129 and 134 here, you can replace the opening mode "a" with "w" and overwrite the files even if they already exist (no need to remove a file then append to an empty file).
You should also wrap this block inside a try ... except OSError: block, in case the user cannot write to the files for any reason (like missing permissions).

Elvish-Hunter · 2022-04-02T19:30:45Z

I think we have a reasonable way of telling apart lossy from lossless. The script is assuming VP8X is lossy when it can contain either, but I'm not sure how to tell which it really is for that variation.

With the file already open and read into a variable, it is possible to use the bytes.find() method to see if there's either a VP8 or VP8L somewhere (those are also chunk headers, not just type identifiers) and raise an error if both are found (it shouldn't happen, but the documentation I read doesn't say that it cannot happen in a file with multiple frames either).

Pentarctagon · 2022-04-03T04:39:06Z

At that point I kinda start feeling "if they don't clearly define in their spec how to tell them apart, then I don't want to try guessing at it unless I really have to".

Pentarctagon · 2022-04-03T19:05:25Z

All review comments have been addressed, I believe.

Wedge009 added the Graphics Issues that involve the graphics engine or assets. label Mar 29, 2022

Pentarctagon force-pushed the webp-convert branch from 50ebd6f to dc335ee Compare March 31, 2022 02:08

Pentarctagon force-pushed the webp-convert branch from dc335ee to 2fe7fe3 Compare April 1, 2022 20:27

Pentarctagon marked this pull request as ready for review April 1, 2022 20:28

Pentarctagon requested a review from Elvish-Hunter April 1, 2022 20:29

Elvish-Hunter reviewed Apr 2, 2022

View reviewed changes

Add script to mass convert images to webp.

6e0b34a

Pentarctagon force-pushed the webp-convert branch from 2fe7fe3 to 6e0b34a Compare April 3, 2022 15:14

Elvish-Hunter approved these changes Apr 3, 2022

View reviewed changes

Pentarctagon merged commit 25e763f into wesnoth:master Apr 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add script to mass convert images to webp. #6594

Add script to mass convert images to webp. #6594

Pentarctagon commented Mar 28, 2022

Wedge009 commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022

CelticMinstrel commented Mar 28, 2022 •

edited

Elvish-Hunter commented Mar 28, 2022

Wedge009 commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022 •

edited

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022 •

edited

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 29, 2022

Elvish-Hunter commented Mar 29, 2022

Pentarctagon commented Mar 31, 2022

Pentarctagon commented Apr 1, 2022

Wedge009 commented Apr 2, 2022

Pentarctagon commented Apr 2, 2022

Elvish-Hunter Apr 2, 2022

Pentarctagon Apr 3, 2022

Elvish-Hunter Apr 2, 2022

Pentarctagon Apr 3, 2022

Elvish-Hunter Apr 2, 2022

Pentarctagon Apr 3, 2022

Elvish-Hunter Apr 2, 2022

Pentarctagon Apr 3, 2022

Elvish-Hunter Apr 2, 2022

Pentarctagon Apr 3, 2022

Elvish-Hunter commented Apr 2, 2022

Pentarctagon commented Apr 3, 2022

Pentarctagon commented Apr 3, 2022

Add script to mass convert images to webp. #6594

Add script to mass convert images to webp. #6594

Conversation

Pentarctagon commented Mar 28, 2022

Wedge009 commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022

CelticMinstrel commented Mar 28, 2022 • edited

Elvish-Hunter commented Mar 28, 2022

Wedge009 commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022 • edited

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022 • edited

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 28, 2022

Elvish-Hunter commented Mar 28, 2022

Pentarctagon commented Mar 29, 2022

Elvish-Hunter commented Mar 29, 2022

Pentarctagon commented Mar 31, 2022

Pentarctagon commented Apr 1, 2022

Wedge009 commented Apr 2, 2022

Pentarctagon commented Apr 2, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Elvish-Hunter commented Apr 2, 2022

Pentarctagon commented Apr 3, 2022

Pentarctagon commented Apr 3, 2022

CelticMinstrel commented Mar 28, 2022 •

edited

Pentarctagon commented Mar 28, 2022 •

edited

Pentarctagon commented Mar 28, 2022 •

edited