tarball header mods choke on non-ascii file names #335

sosiouxme · 2019-03-28T13:23:49Z

I ran tito on a project that introduced a file with the name utf8_tést_app.rb. When tito went to edit the source tarball, it choked on this filename being in the headers:

  File "/bin/tito", line 23, in <module>
    CLI().main(sys.argv[1:])
  File "/usr/lib/python2.7/site-packages/tito/cli.py", line 202, in main
    return module.main(argv)
  File "/usr/lib/python2.7/site-packages/tito/cli.py", line 593, in main
    scratch=self.options.scratch)
  File "/usr/lib/python2.7/site-packages/tito/release/distgit.py", line 73, in release
    self._git_release()
  File "/usr/lib/python2.7/site-packages/tito/release/distgit.py", line 90, in _git_release
    self.builder.tgz()
  File "/usr/lib/python2.7/site-packages/tito/builder/main.py", line 484, in tgz
    self._setup_sources()
  File "/usr/lib/python2.7/site-packages/tito/builder/main.py", line 519, in _setup_sources
    os.path.join(self.rpmbuild_sourcedir, self.tgz_filename))
  File "/usr/lib/python2.7/site-packages/tito/common.py", line 972, in create_tgz
    tarfixer.fix()
  File "/usr/lib/python2.7/site-packages/tito/tar.py", line 331, in fix
    self.process_chunk(chunk)
  File "/usr/lib/python2.7/site-packages/tito/tar.py", line 314, in process_chunk
    self.process_header(chunk_props)
  File "/usr/lib/python2.7/site-packages/tito/tar.py", line 203, in process_header
    chunk_props['checksum'] = self.calculate_checksum(chunk_props)
  File "/usr/lib/python2.7/site-packages/tito/tar.py", line 241, in calculate_checksum
    values = self.encode_header(chunk_props)
  File "/usr/lib/python2.7/site-packages/tito/tar.py", line 198, in encode_header
    pack_values.append(chunk_props[member].encode("utf8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 75: ordinal not in range(128)

In Python 2, calling encode on a byte string triggers an implicit decode("ascii"), which is what it's complaining about. The call is equivalent to:

    pack_values.append(chunk_props[member].decode("ascii").encode("utf8"))

If I assume that the text is utf8 (which includes ascii) and explicitly decode("utf8") then this code does not raise an exception, but tar does not like the output on the next step.

Error running command: [...] tar xzf openshift-git-0.9896b19.tar.gz

Status code: 512

Command output: tar: Skipping to next header
tar: Exiting with failure status due to previous errors

So it seems likely that the text needs to be decoded into unicode at some point before this code, but it's not clear to me where. Also I'm not sure what encoding tar assumes.
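The implicit decode described above can be reproduced explicitly. This is a minimal Python 3 sketch, not tito's code: `py2_style_encode` is a hypothetical helper that spells out what Python 2 does when `.encode("utf8")` is called on a byte string, and the file name is the one from this report.

```python
# Bytes as they would be read from the tarball header (UTF-8 encoded name).
# "\u00e9" is the precomposed "é", which encodes to 0xc3 0xa9.
name_bytes = "utf8_t\u00e9st_app.rb".encode("utf8")


def py2_style_encode(raw):
    """Hypothetical helper: mimic Python 2's str.encode("utf8").

    Python 2 first decodes the byte string as ASCII to obtain a unicode
    object, then encodes that object with the requested codec.
    """
    return raw.decode("ascii").encode("utf8")


try:
    py2_style_encode(name_bytes)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure comes from the hidden ASCII decode, not from the encode.
print(error)

# Decoding explicitly as UTF-8 round-trips without an exception.
name_text = name_bytes.decode("utf8")
assert name_text.encode("utf8") == name_bytes

# The text is one unit shorter than its byte form, which matters for any
# fixed-width binary layout that is packed or summed afterwards.
print(len(name_text), len(name_bytes))
```

The length difference is only a hint at why output produced after an explicit decode might still be rejected by tar. The sketch does not establish where tito's header handling goes wrong.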

sosiouxme · 2019-03-28T13:25:51Z

@awood @xsuchy

…racters in the name. Following the principles of https://nedbatchelder.com/text/unipain.html our goal is to decode bytes into unicode as soon as we read them and encode unicode data to bytes at the last second. The specific problem we were seeing was caused by calling "encode" on a byte string rather than a unicode string. Python attempts to be "helpful" and tries to decode the bytes as ASCII in order to provide a unicode string to the encode function. Since the bytes aren't ASCII, the decode fails and we get the UnicodeDecodeError despite the fact that we never explicitly asked for a decode at all. Also, calculate checksums correctly for tarballs that have files with UTF8 characters in the file name.
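The decode-early, encode-late approach and the checksum point can be sketched as follows. This is an illustration under stated assumptions, not the code from the fix: it uses Python 3 and the standard `tarfile` module to build a reference header, `header_checksum` is a hypothetical helper, and the field offsets (name at bytes 0-99, checksum at bytes 148-155) are those of the ustar/GNU header layout.

```python
import tarfile

NAME = "utf8_t\u00e9st_app.rb"  # file name from the report, "é" as U+00E9


def header_checksum(header):
    """Hypothetical helper: checksum of a 512-byte tar header.

    The checksum is the sum of all header bytes, with the 8-byte checksum
    field itself counted as ASCII spaces. It is computed over bytes, so a
    non-ASCII name contributes one value per encoded byte.
    """
    return sum(header[:148]) + 8 * ord(" ") + sum(header[156:512])


# Let the standard library build a header for the non-ASCII name.
header = tarfile.TarInfo(NAME).tobuf(format=tarfile.GNU_FORMAT, encoding="utf-8")

# Decode as soon as the bytes are read ...
name_text = header[0:100].rstrip(b"\0").decode("utf-8")

# ... work with text in between, and encode only at the last moment.
name_bytes = name_text.encode("utf-8")

# The stored checksum is an octal number padded with NUL and space.
stored = int(header[148:156].rstrip(b"\0 ").decode("ascii"), 8)

print(name_text, len(name_text), len(name_bytes))
print(stored == header_checksum(header))
```

Summing over the encoded bytes rather than over the decoded text keeps the result consistent with what tar reads back from disk.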

…ls with UTF8 characters in the name." This partially reverts commit 03509b3. It removes only the test and keeps the functionality. The test cannot be included right now because tito 0.6.11 and older will choke on it and produce a damaged tarball. The test can be added back later, when all developers have tito version 0.6.12 or higher. Resolves: rpm-software-management#337

sosiouxme mentioned this issue Mar 28, 2019

Removing utf-8 file in extended build test openshift/origin#22421

Merged

dgoodwin closed this as completed in 03509b3 Apr 10, 2019

jmrodri added a commit that referenced this issue Sep 20, 2019

Revert "Fix #335. Handle source tarballs with UTF8 characters in the …

c2c4c53

…name." This reverts commit 03509b3.

dgoodwin added a commit that referenced this issue Oct 3, 2019

Merge pull request #345 from xsuchy/tarutf8

fd6945d

Partial revert "Fix #335. Handle source tarballs with UTF8 characters…
