
Files with non-ASCII filenames

Raphael Das Gupta edited this page Apr 23, 2015 · 9 revisions

Note: This reflects my current understanding of this issue and may be incorrect or incomplete – feel free to improve!

TL;DR: Avoid non-ASCII filenames inside zip archives if possible, especially when working cross-platform.

Background information

The zip format stores filenames as a sequence of bytes; it’s up to the zip handling tool to decide how to interpret them. Most modern operating systems use UTF-8 for filename encoding. However, Windows traditionally doesn’t (at least for the contents of zip archives).

So, if you create a zip archive containing a file named ä.txt on Mac OS or Linux and then extract this archive on Windows (using Windows’ native support for “compressed folders”, not an external tool like WinZip), it will produce a file named something like ├ñ.txt (the exact name depends on the current codepage). Microsoft has released a hotfix for Windows 7 that is intended to fix this bug, but few computers with Windows 7 will actually have that hotfix installed.
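The garbled name can be reproduced without any zip tool. The following is a minimal sketch, assuming the extracting side reads the UTF-8 bytes using a legacy codepage; `misread_as` is a hypothetical helper name, and the two codepages shown are only examples (as noted above, the actual result depends on the codepage in use).

```ruby
# Reinterpret the UTF-8 bytes of a filename as if they were written in a
# legacy single-byte codepage. (misread_as is a hypothetical helper.)
def misread_as(name, codepage)
  raw = name.encode("UTF-8").b          # the bytes as stored in the archive
  raw.force_encoding(codepage)          # pretend they are in the codepage
     .encode("UTF-8")                   # convert back for display
end

name = "ä.txt"
p name.bytes                            # [195, 164, 46, 116, 120, 116]

puts misread_as(name, "IBM437")         # ├ñ.txt
puts misread_as(name, "Windows-1252")   # Ã¤.txt
```

The two bytes 0xC3 0xA4 that encode ä in UTF-8 are each shown as a separate character, which is why a one-character name turns into two unrelated symbols.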

Recent versions of WinZip (I tested with v17.5) and 7-Zip (tested with 9.20) seem to handle those filenames correctly, though. (This is probably also true for other tools, like WinRAR etc., but I did not test any of those.)

Unicode filenames and the general purpose flags field

Recent versions of the zip format specification support Unicode filenames explicitly:

Names must be encoded in UTF-8, and the 11th bit in the general purpose flags field (2 bytes at offset 6) must be set. –Source
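To make the quoted rule concrete, here is a rough sketch of checking that flag in the first eight bytes of a local file header. It assumes "the 11th bit" means bit 11 counting from zero (mask 0x0800) and that the field is little-endian, like the other fields of the zip format; `utf8_names_flag_set?` and `UTF8_NAME_FLAG` are names made up for this example, not part of RubyZip.

```ruby
LOCAL_HEADER_SIGNATURE = 0x04034b50  # "PK\x03\x04" read as little-endian
UTF8_NAME_FLAG         = 1 << 11     # 0x0800, assumed meaning of "11th bit"

# Returns true if the general purpose flags field (2 bytes at offset 6)
# has the UTF-8 filename bit set. (Hypothetical helper for illustration.)
def utf8_names_flag_set?(header_bytes)
  raise ArgumentError, "header too short" if header_bytes.bytesize < 8

  # V = 32-bit little-endian signature, v = 16-bit little-endian fields
  signature, _version_needed, flags = header_bytes.unpack("Vvv")
  raise ArgumentError, "not a local file header" unless signature == LOCAL_HEADER_SIGNATURE

  (flags & UTF8_NAME_FLAG) != 0
end

# Hand-built header prefixes, just for demonstration:
with_flag    = [LOCAL_HEADER_SIGNATURE, 20, UTF8_NAME_FLAG].pack("Vvv")
without_flag = [LOCAL_HEADER_SIGNATURE, 20, 0].pack("Vvv")

p utf8_names_flag_set?(with_flag)     # true
p utf8_names_flag_set?(without_flag)  # false
```

For a real archive whose first entry starts at the beginning of the file, the same check could be run on `File.binread(path, 8)`.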

Since 317fdd0 (not yet released) you can tell RubyZip to set this flag for Unicode filenames with Zip.unicode_names = true. This makes it possible to extract files with non-ASCII filenames on Windows 8 without any external tools. Unfortunately, this does not work for older versions (Windows 7, Windows Vista, Windows XP): they just seem to ignore this flag.

Summary

Avoid non-ASCII filenames in zip archives, if possible. If you really must use them, and you’re creating the archives on Mac OS X or Linux with a recent RubyZip version, do set Zip.unicode_names = true. Extracting those archives will work correctly on Mac OS X, Linux and Windows 8. It will not work on older versions of Windows, unless an external tool like WinZip is used.