Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to mass convert images to webp. #6594

Merged
merged 1 commit into from Apr 3, 2022

Conversation

Pentarctagon
Copy link
Member


This is a script that converts png to lossless webp and jpg to lossy webp (quality 90), as well as reconverts existing webp images using lossless conversion. This PR is not for actually converting any images.

Remaining work:

  • Some TODO comments.
  • See if there's some way to tell if a webp image was created with lossy vs lossless compression.

conversion-good.csv has a list of the files that had their size reduced by more than 10%.
conversion-bad.csv has a list of files that did not have their size reduced by more than 10%.

A quick summary:

  • Out of 13832 files, 13345 files had their size reduced by more than 10% while 487 did not.
  • Of the 13345 files, the average size reduction was ~30%.
  • The total size reduction for the 13345 files was 71.8 MBs.

@Wedge009
Copy link
Member

I was wondering about this. Was there, by any chance, conversions that increased the file size (regardless of how much the increase was)?

@Pentarctagon
Copy link
Member Author

Some, but not many, and not by much. They're included in conversion-bad.csv.

@CelticMinstrel
Copy link
Member

CelticMinstrel commented Mar 28, 2022

I see there are quite a few in "bad", mainly unit and terrain images. I wonder how much sense it would make to just state that unit and terrain images stay as PNG? I do notice a lot of unit and terrain images are also listed in "good" though… mostly with percentages above 70%, though there are also some outliers around 30% and I even spotted one as low as 9% which seems quite weird.

@Elvish-Hunter
Copy link
Contributor

See if there's some way to tell if a webp image was created with lossy vs lossless compression.

@Pentarctagon Actually, there is a way. A few days ago, I was reading the WebP specifications to see if I could update wmlscope's incorrectlysized() function to support them; it turns out that there are three variants of WebP files, which are lossy, lossless and extended. Long short story, it should be enough to read the first 16 bytes of a file, check if the first 12 are an actual WebP header and check if the last 4 are VP8 (lossy), VP8L (lossless) or VP8X (extended, in which case further checks are needed because it might be either lossy or lossless).

@Wedge009
Copy link
Member

It looks like change_in_percent is actually ratio of new size to old size. How is it that files that are already in webp format are increasing so much?

@Pentarctagon
Copy link
Member Author

Pentarctagon commented Mar 28, 2022

It looks like change_in_percent is actually ratio of new size to old size. How is it that files that are already in webp format are increasing so much?

That I don't know. I had been thinking that webp -> webp might show improvements as improvements are made to libwebp/cwebp, but it might be that for whatever reason that doesn't work well.

@Pentarctagon Actually, there is a way. A few days ago, I was reading the WebP specifications to see if I could update wmlscope's incorrectlysized() function to support them; it turns out that there are three variants of WebP files, which are lossy, lossless and extended. Long short story, it should be enough to read the first 16 bytes of a file, check if the first 12 are an actual WebP header and check if the last 4 are VP8 (lossy), VP8L (lossless) or VP8X (extended, in which case further checks are needed because it might be either lossy or lossless).

That sounds like it wouldn't fully be able to check due to VP8X though?

@Elvish-Hunter
Copy link
Contributor

That sounds like it wouldn't fully be able to check due to VP8X though?

Not quite. A VP8X file can contain various types of chunks (alpha, animation frames...) and it also contains at least a VP8 or VP8L chunk. I think that, for our purposes, we can safely assume that our files contain only one frame and no animations, so searching for the first VP8 or VP8L chunk header after the VP8X one should be enough.

@Pentarctagon
Copy link
Member Author

Pentarctagon commented Mar 28, 2022

It looks like the webp images generated both by gimp and by cwebp say they are VP8X for lossy conversion. cwebp for lossless has VP8L. So that might work actually.

@Elvish-Hunter
Copy link
Contributor

If we want to consider VP8X lossy by default, indeed this simplifies a lot of things.
So, what would you like to do with this lossy/lossless detection? Am I correct in thinking that you want to improve the if check at line 56, so webp files aren't handled as lossless by default, but only if they're already lossless?

@Pentarctagon
Copy link
Member Author

Yeah. I assume checking whether a lossless version of a lossy webp image is smaller is pointless, and even if there are some size savings at some point we shouldn't keep using lossy encoding again and again on the same image.

@Elvish-Hunter
Copy link
Contributor

You can start replacing the if check with this one (I'm still testing it, but I want to let you have a look anyway):

        if filetype == ".png":
            subprocess_args = lossless_args
        elif filetype == ".jpg" or filetype == ".jpeg":
            subprocess_args = lossy_args
        elif filetype == ".webp":
            # detect if the file is lossy or lossless
            with open(os.path.join(image_root, filename), "rb") as img:
                # discard first 12 bytes of WebP "magic number" and file size
                img.read(12)
                # read type of WebP file
                webp_type = img.read(4)
                if webp_type == b"VP8L":
                    subprocess_args = lossless_args
                # assume that extended WebP files are lossy
                elif webp_type == b"VP8 " or webp_type == b"VP8X":
                    subprocess_args = lossy_args
                else:
                    print(filename, "is not a valid WebP file", file=sys.stderr)
                    sys.exit(1)
        else:
            continue

Some other random thoughts:

  • Don't use string concatenation to create a path from two strings, use os.path.join() instead; it's useful if you want to run the script on Windows.
  • You can increase the counter with count += 1.
  • Instead of repeated string concatenations, you can use the format() method on string objects; you'll have better performances and control on the output.
  • In the last for cycle, you're effectively opening and closing the csv files for each file. Since I/O operations on disk are expensive, can you open the files just once and move the for cycle inside their context managers (with .. as blocks)?
  • Commas are valid characters in filenames, so tabs (\t) are probably a better choice as field separators.
  • When doing time calculations, you can also use the modulo / remainder of division operator (example: seconds_duration = int(duration % 60)).
  • Perhaps you should add a try ... except block to intercept KeyboardInterrupt exceptions and do some cleanup, like removing the temporary files that might still be there.

@Pentarctagon
Copy link
Member Author

Commas are valid characters in filenames, so tabs (\t) are probably a better choice as field separators.

I don't follow this one - what does commas being a valid character for a filename have to do with using them in the contents of the file?

@Elvish-Hunter
Copy link
Contributor

In your CSV files, you're using commas as field separators and no character as field qualifier. In this situation, if a file path contains a comma, this one will be interpreted as a field separator rather than as character, causing issues (stuff placed in the wrong columns) when you attempt to import the resulting CSV file into a spreadsheet.
This is why I recommended using tabs as field separators: on Windows they aren't allowed at all in filenames and I've never seen anyone using them in other systems. Alternatively, you can wrap each field into a pair of double quotes, so only commas outside of quotes are recognized as separators.

@Wedge009 Wedge009 added the Graphics Issues that involve the graphics engine or assets. label Mar 29, 2022
@Pentarctagon
Copy link
Member Author

Alright, the above comments have been addressed.

@Pentarctagon Pentarctagon marked this pull request as ready for review April 1, 2022 20:28
@Pentarctagon
Copy link
Member Author

And should be good to go. The only question I have being whether it really makes sense to try reconverting lossy webp images at all given the conversion is, well, lossy.

@Wedge009
Copy link
Member

Wedge009 commented Apr 2, 2022

Generally I would avoid conversion of lossy images at all, but the earlier discussion seemed to suggest that it wouldn't be easy to tell if an image was using lossy or lossless compression.

@Pentarctagon
Copy link
Member Author

I think we have a reasonable way of telling apart lossy from lossless. The script is assuming VP8X is lossy when it can contain either, but I'm not sure how to tell which it really is for that variation.

print("cwebp executable not found in PATH, exiting.")
sys.exit(1)

image_dirs = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should use os.path.join() instead of slashes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

else:
print(filename, "is not a valid WebP file", file=sys.stderr)
continue
elif filetype == ".jpg":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JPEG files can also have the .jpeg extension, so you should add or filetype == ".jpeg" (or you can use elif filetype in (".jpg", ".jpeg"):.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

duration = time.time() - start

hours_duration = int(duration / 3600)
minutes_duration = int((duration - (hours_duration * 3600)) / 60)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, using the modulo operator is better: minutes_duration = int((duration % 3600) / 60).
If you want, you can also remove the int() casting and use the integer division operator // (example: hours_duration = duration // 3600

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if os.path.exists(good_conversion):
os.remove(good_conversion)
with open(good_conversion, "a") as f:
f.write("filename,old_size,new_size,change_in_percent,change_in_bytes\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move this line and line 134 inside the with ... as block at line 139.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

initial_total_size = 0
final_total_size = 0

with open(good_conversion, "a") as good_f, open(bad_conversion, "a") as bad_f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move lines 129 and 134 here, you can replace the opening mode "a" with "w" and overwrite the files even if they already exist (no need to remove a file then append to an empty file).
You should also wrap this block inside a try ... except OSError: block, in case the user cannot write to the files for any reason (like missing permissions).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Elvish-Hunter
Copy link
Contributor

I think we have a reasonable way of telling apart lossy from lossless. The script is assuming VP8X is lossy when it can contain either, but I'm not sure how to tell which it really is for that variation.

With the file already open and read into a variable, it is possible to use the bytes.find() method to see if there's either a VP8 or VP8L somewhere (those are also chunk headers, not just type identifiers) and raise an error if both are found (it shouldn't happen, but the documentation I read doesn't say that it cannot happen in a file with multiple frames either).

@Pentarctagon
Copy link
Member Author

At that point I kinda start feeling "if they don't clearly define in their spec how to tell them apart, then I don't want to try guessing at it unless I really have to".

@Pentarctagon
Copy link
Member Author

All review comments have been addressed, I believe.

@Pentarctagon Pentarctagon merged commit 25e763f into wesnoth:master Apr 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Graphics Issues that involve the graphics engine or assets.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants