New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tarfile extract fails when Unicode in pathname #61355
Comments
The attached file failing.tar.gz contains a path with UTF-8-encoded Unicode. This causes extractall() to fail, but only when the destination path is Unicode. That's because it leads to a implicit str->unicode conversion using ASCII. Test script: import shutil, tarfile, tempfile
tf = tarfile.open('failing.tar.gz', 'r:gz')
workdir = tempfile.mkdtemp()
try:
# N.B. ensure dest path is Unicode to trigger the failure
tf.extractall(unicode(workdir))
finally:
shutil.rmtree(workdir) Result: $ python untar.py
Traceback (most recent call last):
File "untar.py", line 8, in <module>
tf.extractall(unicode(workdir))
File "/usr/lib/python2.7/tarfile.py", line 2046, in extractall
self.extract(tarinfo, path)
File "/usr/lib/python2.7/tarfile.py", line 2083, in extract
self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
File "/usr/lib/python2.7/posixpath.py", line 71, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128) |
@Lars can we have a response on this issue please? |
IIRC, tarfile under 2.7 has never been explicitly unicode-safe, support for unicode objects is heterogeneous at best. The obvious work-around is to work exclusively with str objects. What we can't do is to decode the utf-8 pathname from the archive to a unicode object, because we have no way to detect an archive's encoding. We can either emit a warning if the user passes a unicode object to extract() or we implicitly encode the passed unicode object using TarFile.encoding, so that the os.path.join() succeeds. Unfortunately, I am not entirely sure if there was possibly a rationale behind the current behaviour of extract(). This needs more inspection. |
So... The bug persists in 3.5 ad 3.6. It prevents from e.g. unpacking tarballs coming from GitHub repos with Unicode file names. |
Relevant issue in pip: pypa/setuptools#710 |
Could you point to some suitable projects from GitHub whose tarballs fail on 3.5 / 3.6? My script in the first post, with the replacing of "unicode(...)" with "str(...)" and my original failing archive, works on Python 3.5 and 3.6 on Linux. Which platform have you seen failures on? |
Python 2.7 is no longer supported, so I think this issue should be closed. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: