tarfile extract fails when Unicode in pathname #61355

Closed
vsajip opened this issue Feb 7, 2013 · 7 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@vsajip
Member

vsajip commented Feb 7, 2013

BPO 17153
Nosy @vsajip, @gustaebel, @ezio-melotti, @hynek, @ZackerySpytz
Files
  • failing.tar.gz: Failing archive

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-05-31.22:27:36.008>
    created_at = <Date 2013-02-07.16:43:21.781>
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'tarfile extract fails when Unicode in pathname'
    updated_at = <Date 2021-05-31.22:27:36.008>
    user = 'https://github.com/vsajip'

    bugs.python.org fields:

    activity = <Date 2021-05-31.22:27:36.008>
    actor = 'vinay.sajip'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-05-31.22:27:36.008>
    closer = 'vinay.sajip'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2013-02-07.16:43:21.781>
    creator = 'vinay.sajip'
    dependencies = []
    files = ['28988']
    hgrepos = []
    issue_num = 17153
    keywords = []
    message_count = 7.0
    messages = ['181631', '221135', '222553', '272329', '272330', '272370', '394828']
    nosy_count = 6.0
    nosy_names = ['vinay.sajip', 'lars.gustaebel', 'ezio.melotti', 'hynek', 'Vadim Markovtsev2', 'ZackerySpytz']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17153'
    versions = ['Python 2.7']

    @vsajip
    Member Author

    vsajip commented Feb 7, 2013

    The attached file failing.tar.gz contains a path with UTF-8-encoded Unicode. This causes extractall() to fail, but only when the destination path is Unicode. That's because joining the Unicode destination with the byte-string member name leads to an implicit str->unicode conversion using ASCII.

    Test script:

    import shutil, tarfile, tempfile
    
    tf = tarfile.open('failing.tar.gz', 'r:gz')
    workdir = tempfile.mkdtemp()
    try:
        # N.B. ensure dest path is Unicode to trigger the failure
        tf.extractall(unicode(workdir))
    finally:
        shutil.rmtree(workdir)

    Result:

    $ python untar.py
    Traceback (most recent call last):
      File "untar.py", line 8, in <module>
        tf.extractall(unicode(workdir))
      File "/usr/lib/python2.7/tarfile.py", line 2046, in extractall
        self.extract(tarinfo, path)
      File "/usr/lib/python2.7/tarfile.py", line 2083, in extract
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
      File "/usr/lib/python2.7/posixpath.py", line 71, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128)
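
The failure mode in the traceback can be reproduced in isolation. The following is a minimal sketch in Python 3, where the decode that Python 2 performed implicitly has to be written out by hand. The function name `join_like_python2` and the sample member name are hypothetical and are not taken from tarfile or from the attached archive.

```python
def join_like_python2(dest, member_name):
    """Mimic what Python 2 did when a unicode path met a byte-string name.

    dest is text (like unicode(workdir)); member_name is raw bytes as read
    from the archive. Python 2 coerced the bytes to unicode with the ASCII
    codec before concatenating; here that step is explicit.
    """
    return dest + "/" + member_name.decode("ascii")


def capture_failure():
    # Hypothetical member name containing UTF-8-encoded non-ASCII text.
    utf8_name = "caf\u00e9.txt".encode("utf-8")
    try:
        join_like_python2("/tmp/work", utf8_name)
    except UnicodeDecodeError as exc:
        return exc
    return None


error = capture_failure()
print(type(error).__name__, error.encoding, hex(error.object[error.start]))
```

A pure-ASCII member name passes through the same function without error, which is consistent with the report that only archives containing non-ASCII paths fail.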

    @vsajip vsajip added stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Feb 7, 2013
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jun 20, 2014

    @Lars can we have a response on this issue please?

    @gustaebel
    Mannequin

    gustaebel mannequin commented Jul 8, 2014

    IIRC, tarfile under 2.7 has never been explicitly unicode-safe; support for unicode objects is heterogeneous at best. The obvious work-around is to work exclusively with str objects.

    What we can't do is decode the utf-8 pathname from the archive to a unicode object, because we have no way to detect an archive's encoding. We can either emit a warning if the user passes a unicode object to extract(), or implicitly encode the passed unicode object using TarFile.encoding so that the os.path.join() succeeds.

    Unfortunately, I am not entirely sure if there was possibly a rationale behind the current behaviour of extract(). This needs more inspection.
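
The second option mentioned above (encode the destination so both sides of the join have the same type) can be illustrated with a short sketch. It is written in Python 3, where byte strings play the role of Python 2's str; the helper `join_as_bytes` and its `encoding` parameter are assumptions made for illustration and do not describe tarfile's actual code. `posixpath` is used directly because that is the module shown in the traceback.

```python
import posixpath


def join_as_bytes(dest, member_name, encoding="utf-8"):
    """Join a destination path with a byte-string archive member name.

    If dest is text, encode it first so that the join operates on two
    byte strings and no implicit decode of member_name is ever needed.
    The encoding default is an assumption; a real fix would take it from
    the archive object or the filesystem encoding.
    """
    if isinstance(dest, str):
        dest = dest.encode(encoding)
    return posixpath.join(dest, member_name)


# The member name stays as raw bytes; only the destination is converted.
joined = join_as_bytes("/tmp/work", b"caf\xc3\xa9.txt")
print(joined)
```

The design choice here is to leave the archive's bytes untouched, since, as noted above, the archive's encoding cannot be detected reliably.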

    @VadimMarkovtsev2
    Mannequin

    VadimMarkovtsev2 mannequin commented Aug 10, 2016

    So... The bug persists in 3.5 and 3.6. It prevents, e.g., unpacking tarballs coming from GitHub repos with Unicode file names.

    @VadimMarkovtsev2
    Mannequin

    VadimMarkovtsev2 mannequin commented Aug 10, 2016

    Relevant issue in pip: pypa/setuptools#710

    @vsajip
    Member Author

    vsajip commented Aug 10, 2016

    Could you point to some suitable projects from GitHub whose tarballs fail on 3.5 / 3.6? My script in the first post, with "unicode(...)" replaced by "str(...)" and run against my original failing archive, works on Python 3.5 and 3.6 on Linux. Which platform have you seen failures on?
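
For readers who want to check Python 3 behaviour without the attached archive, here is a self-contained sketch that builds a small gzip-compressed tarball in memory with a non-ASCII member name and extracts it to a temporary directory. The function name, the member name, and the payload are hypothetical; this is not the original failing.tar.gz, and the result can depend on the platform's filesystem encoding.

```python
import io
import os
import shutil
import tarfile
import tempfile


def roundtrip_unicode_member(name="caf\u00e9.txt", payload=b"hello"):
    """Write one member with a non-ASCII name, extract it, return its bytes."""
    # Build the archive in memory.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)

    workdir = tempfile.mkdtemp()
    try:
        # In Python 3 the destination path is already text (str).
        with tarfile.open(fileobj=buf, mode="r:gz") as tf:
            tf.extractall(workdir)
        with open(os.path.join(workdir, name), "rb") as fh:
            return fh.read()
    finally:
        shutil.rmtree(workdir)


extracted = roundtrip_unicode_member()
print(extracted)
```

Recent Python releases may emit a DeprecationWarning when extractall() is called without a `filter` argument; the sketch omits it so that it also runs on older 3.x versions.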

    @ZackerySpytz
    Mannequin

    ZackerySpytz mannequin commented May 31, 2021

    Python 2.7 is no longer supported, so I think this issue should be closed.

    @vsajip vsajip closed this as completed May 31, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022