UnicodeDecodeError when creating tar.gz with unicode name #57848

jaraco · 2011-12-20T01:23:23Z

BPO	13639
Nosy	@terryjreedy, @jaraco, @gustaebel, @vstinner
Files	tarfile-stream-gzip-unicode-fix.diff smime.p7s

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/gustaebel'
closed_at = <Date 2011-12-26.07:46:51.099>
created_at = <Date 2011-12-20.01:23:23.392>
labels = ['type-bug']
title = 'UnicodeDecodeError when creating tar.gz with unicode name'
updated_at = <Date 2011-12-26.17:22:36.092>
user = 'https://github.com/jaraco'

bugs.python.org fields:

activity = <Date 2011-12-26.17:22:36.092>
actor = 'python-dev'
assignee = 'lars.gustaebel'
closed = True
closed_date = <Date 2011-12-26.07:46:51.099>
closer = 'terry.reedy'
components = []
creation = <Date 2011-12-20.01:23:23.392>
creator = 'jaraco'
dependencies = []
files = ['24066', '24090']
hgrepos = []
issue_num = 13639
keywords = ['patch']
message_count = 15.0
messages = ['149896', '149974', '149985', '150032', '150035', '150037', '150038', '150203', '150228', '150237', '150248', '150249', '150259', '150260', '150268']
nosy_count = 5.0
nosy_names = ['terry.reedy', 'jaraco', 'lars.gustaebel', 'vstinner', 'python-dev']
pr_nums = []
priority = 'low'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13639'
versions = ['Python 2.7']

jaraco · 2011-12-20T01:23:21Z

python -c "import tarfile; tarfile.open(u'hello.tar.gz', 'w|gz')"

produces

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\jaraco\projects\public\cpython\Lib\tarfile.py", line 1687, in open
    _Stream(name, filemode, comptype, fileobj, bufsize),
  File "C:\Users\jaraco\projects\public\cpython\Lib\tarfile.py", line 431, in __init__
    self._init_write_gz()
  File "C:\Users\jaraco\projects\public\cpython\Lib\tarfile.py", line 459, in _init_write_gz
    self.__write(self.name + NUL)
  File "C:\Users\jaraco\projects\public\cpython\Lib\tarfile.py", line 475, in __write
    self.buf += s
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

Remove the compression ('|gz') or remove the unicode name or run under Python 3 and the command completes without error.

The error does not occur under Python 3 (even with non-ascii characters), so it should be possible to create a tarfile with a unicode filename on Python 2.7.

This failure is the underlying cause of bpo-11638.
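A rough sketch of why the traceback ends at `self.buf += s`: in Python 2, adding a unicode object to a byte string implicitly decodes the byte string as ASCII. By the time the name is written, the buffer presumably already holds the start of the gzip header, whose second byte is 0x8b. The snippet below runs on Python 3 and performs that decode explicitly; `implicit_ascii_decode` is a hypothetical helper name, not part of tarfile.

```python
# First four bytes of a gzip header: magic (0x1f 0x8b), deflate method, FNAME flag.
GZIP_HEADER_START = b"\x1f\x8b\x08\x08"


def implicit_ascii_decode(buf):
    """Do explicitly what Python 2 does implicitly for `str + unicode`."""
    return buf.decode("ascii")


error = None
try:
    implicit_ascii_decode(GZIP_HEADER_START)
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is unbound after the except block
    print(exc)
```

The failing byte and position match the report: byte 0x8b at position 1.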

gustaebel · 2011-12-21T09:42:42Z

tarfile under Python 2.x is not particularly designed to support unicode filenames (the gzip module does not support them either), but that should not be too hard to fix.

jaraco · 2011-12-21T13:33:50Z

That looks like a good patch to me. Do you want to commit it, or would you rather I do?

python-dev · 2011-12-21T18:28:23Z

New changeset a60a3610a97b by Lars Gustäbel in branch '2.7':
Issue bpo-13639: Accept unicode filenames in tarfile.open(mode="w|gz").
http://hg.python.org/cpython/rev/a60a3610a97b

vstinner · 2011-12-21T19:07:23Z

+ self.name = self.name.encode("iso-8859-1", "replace")

Why did you choose ISO-8859-1? I think that the filesystem encoding should be used instead:

- self.name = self.name.encode("iso-8859-1", "replace")
+ self.name = self.name.encode(ENCODING, "replace")

gustaebel · 2011-12-21T19:31:14Z

See http://bugs.python.org/issue11638#msg150029

vstinner · 2011-12-21T19:36:39Z

"The gzip format (defined in RFC 1952) allows storing the original filename (without the .gz suffix) in an additional field in the header (the FNAME field). Latin-1 (iso-8859-1) is required."

Hum, it looks like the author of the gzip program (on Linux Fedora 16) didn't read the RFC!

$ tar -cvf hého.tar README
README
$ gzip hého.tar 
$ hachoir-urwid ~/prog/python/default/hého.tar.gz 
0) file:/home/haypo/prog/python/default/hého.tar.gz: ...
   0) signature= "\x1f\x8b": GZip file signature (\x1F\x8B) (2 bytes)
   2) compression= deflate: Compression method (1 byte)
   3.0) is_text= False: File content is probably ASCII text (1 bit)
   3.1) has_crc16= False: Header CRC16 (1 bit)
   3.2) has_extra= False: Extra informations (variable size) (1 bit)
   3.3) has_filename= True: Contains filename? (1 bit)
   3.4) has_comment= False: Contains comment? (1 bit)
   3.5) reserved[0]= <null> (3 bits)
   4) mtime= 2011-12-21 19:34:54: Modification time (4 bytes)
   8.0) reserved[1]= <null> (1 bit)
   8.1) slowest= False: Compressor used maximum compression (slowest) (1 bit)
   8.2) fastest= False: Compressor used the fastest compression (1 bit)
   8.3) reserved[2]= <null> (5 bits)
   9) os= Unix: Operating system (1 byte)
   10) filename= "hÃ©ho.tar": Filename (10 bytes)

Raw display:

filename= "h\xc3\xa9ho.tar\0": Filename (10 bytes)
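For readers without hachoir, the FNAME field can be pulled out with a few lines of Python 3. This is a minimal sketch that follows the RFC 1952 header layout; `read_fname` is a hypothetical helper, and the header bytes are built by hand to mirror the raw display above (the mtime is zeroed, which is an assumption made for brevity).

```python
import struct

FEXTRA, FNAME = 0x04, 0x08  # flag bits defined by RFC 1952


def read_fname(data):
    """Return the raw FNAME bytes of a gzip header, or None if the flag is unset."""
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", data[:10])
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    pos = 10
    if flags & FEXTRA:  # skip the optional extra field, if any
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    if not flags & FNAME:
        return None
    end = data.index(b"\x00", pos)  # FNAME is zero-terminated
    return data[pos:end]


# Magic, deflate, FNAME flag, zeroed mtime, XFL=0, OS=3 (Unix), then the name.
header = b"\x1f\x8b\x08\x08" + b"\x00" * 4 + b"\x00\x03" + b"h\xc3\xa9ho.tar\x00"

raw = read_fname(header)
as_utf8 = raw.decode("utf-8")         # what gzip actually stored
as_latin1 = raw.decode("iso-8859-1")  # what a strict RFC 1952 reader would see
```

Decoding the same bytes as Latin-1 gives the doubled characters that hachoir displays, which is consistent with gzip having written the name as UTF-8.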

terryjreedy · 2011-12-24T02:10:46Z

2.7 is closed to new features. This looks like it might be one. The 2.7 doc for tarfile.open says "Return a TarFile object for the pathname name." Does the meaning of 'pathname' in 2.7 generally include unicode as well as str objects? (It is not in the Glossary.)

"The error does not occur under Python 3 (even with non-ascii characters), so it should be possible to create a tarfile with a unicode filename on Python 2.7."

Python 3 has many new features that are not in 2.7, so 'possible' is not exactly the point ;-).

gustaebel · 2011-12-24T11:13:00Z

I thought about that myself, too. It is clearly not a new feature; it is really more of a fix.

Unicode pathnames given to tarfile.open() are just passed through to the open() function, which is why this has always worked, except in this particular case. There are 6 different possible write modes: "w:", "w:gz", "w:bz2", "w|", "w|gz" and "w|bz2". The only one that does not work with a unicode pathname is "w|gz". Although admittedly tarfile.open() is not supposed to be used with a unicode path, people do it anyway, because they don't care, and because it works. The patch does not add broad new functionality; it merely harmonises the way the six write modes work.
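The six write modes can be exercised side by side. The sketch below targets Python 3, where every `str` path is unicode, so it shows the harmonised behaviour rather than the 2.7 failure; the `roundtrip` helper and the file and member names are made up for the illustration, and it assumes the bz2 module is available.

```python
import io
import os
import tarfile
import tempfile

WRITE_MODES = ["w:", "w:gz", "w:bz2", "w|", "w|gz", "w|bz2"]


def roundtrip(mode, directory):
    """Write a one-member archive in the given mode and return its member names."""
    path = os.path.join(directory, "hello.tar")  # hypothetical file name
    payload = b"hi\n"
    info = tarfile.TarInfo("README")
    info.size = len(payload)
    with tarfile.open(path, mode) as archive:
        archive.addfile(info, io.BytesIO(payload))
    # The default read mode detects the compression transparently.
    with tarfile.open(path) as archive:
        return archive.getnames()


with tempfile.TemporaryDirectory() as tmp:
    results = {mode: roundtrip(mode, tmp) for mode in WRITE_MODES}
```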

Neither can we retroactively enforce using string pathnames at this point, nor should we let a user run into this strange error. The patch is very small and minimally invasive. The error message you get without the patch is completely incomprehensible.

terryjreedy · 2011-12-24T21:42:54Z

With that explanation, that it is one case out of six that fails, for whatever reason, I agree.

That leaves the issue of whether the fix is the right one. I currently agree with Victor that we should do what the rest of Python does and what is most universally useful. The fact that an old standard requires a *storage* encoding for a nearly unused field for .gz files that (I believe) only works for Western Europe, does not mean we should use it for *opening* .tar files. WestEuro-centrism is as bad as Anglo-centrism. If the unicode filename cannot be Latin-1 encoded, the filename field should be left blank. But it seems to me that the filename should be converted to the bytes that the user wants, expects, and can use.

gustaebel · 2011-12-25T11:26:25Z

I think we should wrap this up as soon as possible, because it has already absorbed too much of our time. The issue we discuss here is a tiny glitch triggered by a corner-case. My original idea was to fix it in a minimal sort of way that is backwards-compatible.

There are at least 4 different solutions now:

1. Keep the patch.
2. Revert the patch, leave everything as it was as wontfix.
3. Don't write an FNAME field at all if the filename that is passed is a unicode string.
4. Rewrite the FNAME code the way Terry suggests. This seems to me like the most complex solution, because we have to fix gzip.py as well, because the code in question was originally taken from the gzip module. (BTW, both the tarfile and gzip module discard the FNAME field when a file is opened for reading.)

My favorites are 1 and 3 ;-)
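To make the difference between options 1 and 3 concrete, here is a rough Python 3 sketch of a gzip header builder. It is not the tarfile code: `gzip_header` and `gzip_stream` are hypothetical helpers, and the XFL and OS byte values are assumptions. With a name, it writes FNAME encoded as ISO-8859-1 with "replace" (option 1); without one, it clears the FNAME flag and writes no name at all (option 3).

```python
import gzip
import struct
import zlib


def gzip_header(name=None, mtime=0):
    """Build a minimal RFC 1952 header; FNAME is written only if a name is given."""
    flags = 0x08 if name else 0x00
    # XFL=2 and OS=255 (unknown) are assumed values for this sketch.
    header = b"\x1f\x8b\x08" + bytes([flags]) + struct.pack("<L", mtime) + b"\x02\xff"
    if name:
        header += name.encode("iso-8859-1", "replace") + b"\x00"
    return header


def gzip_stream(payload, name=None):
    """Wrap payload in a complete gzip member: header, raw deflate data, trailer."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<LL", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return gzip_header(name) + body + trailer


with_name = gzip_stream(b"data", "h\u00e9ho.tar")  # option 1: Latin-1 FNAME
without_name = gzip_stream(b"data")                # option 3: no FNAME field
```

Both streams decompress to the same payload, which fits the remark that the FNAME field is discarded on reading.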

jaraco · 2011-12-25T17:00:46Z

I also feel (1) or (3) is best for this issue. If there is a _better_
implementation, it should be reserved for a separate improvement to Python
3.2+.

I lean slightly toward (3) because it would support filenames with Unicode
characters other than latin-1 (as long as the file system allows it to be
saved), because I suspect it would enable tests such as this to pass:
https://bitbucket.org/jaraco/cpython-issue11638/changeset/9e9ea96eb0dd#chg-Lib/distutils/tests/test_archive_util.py

terryjreedy · 2011-12-26T07:26:24Z

As I understand the patched code, it only fixes the issue for unicode names that can be latin-1 encoded; other unicode names will raise the same exception with 'latin-1' (or equivalent) substituted for 'ascii'. So it is easy for me to anticipate a new issue reporting this someday.

I would prefer a more complete fix. If 3 is easier than 4, fine with me.

terryjreedy · 2011-12-26T07:46:51Z

I just took a look at the 3.2 tarfile code and see that it always (because self.name is always unicode) does the same encoding, with 'replace', referencing RFC1952. Although there are a few other differences, they appear inconsequential, so that the code otherwise should behave the same. Reading further on codec error handling, I gather that my previous understanding was off; non-Latin1 chars will just all appear as '?' instead of raising an exception. While that is normally useless, it does not matter since the result is not used. So I agree to call this fixed.
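The effect of the 'replace' error handler is easy to check on Python 3. In this small sketch, `fname_bytes` is a hypothetical wrapper around the same `encode` call that appears in the patch.

```python
def fname_bytes(name):
    """Encode a filename the way the patched code does for the FNAME field."""
    return name.encode("iso-8859-1", "replace")


latin = fname_bytes("h\u00e9ho.tar")     # e-acute is representable in Latin-1
other = fname_bytes("\u65e5\u672c.tar")  # two CJK characters are not
print(latin, other)
```

The Latin-1 name keeps its accented character as the single byte 0xe9, and each unencodable character becomes '?' rather than raising.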

python-dev · 2011-12-26T17:22:36Z

New changeset dc1045d08bd8 by Jason R. Coombs in branch '2.7':
Issue bpo-11638: Adding test to ensure .tar.gz files can be generated by sdist command with unicode metadata, based on David Barnett's patch.
http://hg.python.org/cpython/rev/dc1045d08bd8

terryjreedy added the type-bug An unexpected behavior, bug, or error label Dec 24, 2011

gustaebel mannequin self-assigned this Dec 25, 2011

terryjreedy closed this as completed Dec 26, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022
