test_tarfile fails on cygwin (unicode decode error) #48074

ocean-city · 2008-09-09T20:42:08Z

BPO	3824
Nosy	@loewis, @amauryfa

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2016-01-13.17:39:53.619>
created_at = <Date 2008-09-09.20:42:08.273>
labels = ['type-bug']
title = 'test_tarfile fails on cygwin (unicode decode error)'
updated_at = <Date 2016-01-13.17:39:53.618>
user = 'https://bugs.python.org/ocean-city'

bugs.python.org fields:

activity = <Date 2016-01-13.17:39:53.618>
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date = <Date 2016-01-13.17:39:53.619>
closer = 'ezio.melotti'
components = []
creation = <Date 2008-09-09.20:42:08.273>
creator = 'ocean-city'
dependencies = []
files = []
hgrepos = []
issue_num = 3824
keywords = ['patch']
message_count = 15.0
messages = ['72907', '72918', '72957', '72959', '72960', '73545', '73556', '73580', '73598', '73658', '73681', '73707', '73731', '73748', '148001']
nosy_count = 4.0
nosy_names = ['loewis', 'amaury.forgeotdarc', 'ocean-city', 'rpetrov']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue3824'
versions = ['Python 3.0']

ocean-city · 2008-09-09T20:42:07Z

I noticed test_tarfile on py3k fails like this.

======================================================================
ERROR: test_directory_size (main.WriteTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "test_tarfile.py", line 598, in test_directory_size
    tarinfo = tar.gettarinfo(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1869, in
gettari
nfo
    tarinfo.gname = grp.getgrgid(tarinfo.gid)[0]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0:
unexpecte
d code byte

======================================

And I noticed PyUnicode_FromString supposes input as UTF-8, but actually
member of struct grp is MBCS or CP932 on cygwin.

After patched following workaround, test passed. I don't know how to fix
this... Does python have system encoding or something?
(I experienced similar error on test_grp and test_pwd)

Index: Modules/grpmodule.c
===================================================================

--- Modules/grpmodule.c	(revision 66345)
+++ Modules/grpmodule.c	(working copy)
@@ -32,6 +32,8 @@
 static int initialized;
 static PyTypeObject StructGrpType;
 
+#define PyUnicode_FromString(s) PyUnicode_DecodeMBCS(s, strlen(s),
"strict")
+
 static PyObject *
 mkgrent(struct group *p)
 {
@@ -83,6 +85,8 @@
     return v;
 }
 
+#undef PyUnicode_FromString
+
 static PyObject *
 grp_getgrgid(PyObject *self, PyObject *pyo_id)
 {

loewis · 2008-09-09T21:21:37Z

I think you should use the locale's encoding to process the data, ie.
either mbstowcs, then Unicode from wchar_t, or decode with the
nl_langinfo(CODESET) encoding. You might have to set the locale before
this can work (which isn't thread-safe), so it might be tricky to implement.

Python already does nl_langinfo at startup, but then restores the
locale. It should probably save the default locale's codeset somewhere,
as C code requires it in many places.

There is also a "system" encoding, but that is UTF-8 independent of the
system.

ocean-city · 2008-09-10T12:44:32Z

Sorry, probably I saw illusion... If uses cp932 codec, still
test_tarfile.py reports error. :-(

======================================================================
ERROR: test_tar_size (main.WriteTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "test_tarfile.py", line 570, in test_tar_size
    tar.add(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1953, in add
    self.addfile(tarinfo, f)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1976, in
addfile

    buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 987, in
tobuf
    return self.create_gnu_header(info, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1018, in
create_
gnu_header
    return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1107, in
_create
_header
    stn(info.get("gname", "root"), 32, encoding, errors),
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 177, in stn
    s = s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-3: ordin
al not in range(128)

amauryfa · 2008-09-10T12:57:07Z

Is PyUnicode_DecodeMBCS available on cygwin?
I get compilation errors when I try your patch.

ocean-city · 2008-09-10T13:10:30Z

Yes, when I did it last night, I thought I could compile it and saw OK
on test_tarfile.py, but probably I dreamed. :-(

#define PyUnicode_FromString(s) PyUnicode_Decode(s, strlen(s), "cp932",
"strict")

or following patch should work.

ocean-city · 2008-09-22T00:29:40Z

Sorry, the patch didn't work... I didn't understand Martin's word. And
nl_langinfo(CODESET) is useless on cygwin because it's always US-ASCII.

loewis · 2008-09-22T05:16:01Z

I didn't mean to suggest that a new codec is created; instead, mbstowcs
should be called directly in grpmodule.c.

By default, mbstowcs will use ASCII, so it is likely to fail - you would
need to call setlocale first.

ocean-city · 2008-09-22T14:48:13Z

I'm not cygwin user, but cygwin seems not to support multibyte function.
Following program outputs 5 on VC6 as expected, but 10 on cygwin.
Hmm...

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char* argv[])
{
	const char s[] = "あいうえお";
	size_t len;
	wchar_t *buf;

setlocale(LC_ALL, "");

	len = strlen(s); /* 10 */

	buf = (wchar_t*)malloc((len+1)*sizeof(wchar_t));

	len = mbstowcs(buf, s, len+1);

printf("--------\> %d\\n", len); /* should be 5 \*/

	return 0;
}

loewis · 2008-09-22T21:17:15Z

In this case, I think there is nothing we can do. Perhaps it is useful
to put a comment into the test, pointing out that this is likely to
break on Cygwin, and refer to this issue.

I don't see that as a problem: it's just a test that fails, and only on
some systems (i.e. when you have non-ASCII characters in the group
file). People running into the problem should first resolve the
underlying problem in Cygwin, and, when Cygwin actually works correctly,
come back to fixing this issue.

rpetrov · 2008-09-23T19:34:01Z

What is test result if the environment variable LANG is set to C ?

ocean-city · 2008-09-24T00:04:49Z

What is test result if the environment variable LANG is set to C ?

There is no change.

amauryfa · 2008-09-24T12:00:46Z

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

On cygwin, "mkgroup -l" is often (exclusively?) used to generate this
/etc/group, extracting the user definitions from the Windows settings.
Its source code
http://www.google.com/codesearch?q=file:mkgroup.c+package:cygwin-1.5
shows that it uses MBCS to encode strings.

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

ocean-city · 2008-09-24T16:45:10Z

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

Yes, /etc/group contains "なし" as gr_name in MBCS,（"なし" means
"nothing"）and I can print it with puts() in grpmodule.c, so it
shouldn't be translated.

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

Me neigher.

loewis · 2008-09-24T18:42:04Z

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

That certainly depends on the implementation of getgrgid. On some
systems, it uses NIS, LDAP, or a relational database in addition to
or instead of /etc/group.

I don't think POSIX specifies the charset of gr_name, except perhaps
implicitly by inheriting the notion of "execution character set" from
C, which proposes that all char* should have the same interpretation.
In POSIX, the character set comes from the locale.
The getgrid man page gives an example demonstrating that you are
supposed to be able to printf() gr_name using %s, which suggests that
it should have the locale's encoding.

In Cygwin, I have no doubt that the implementation literally copies
the encoding untranslated. I would claim this to be a bug in Cygwin.
A quality POSIX implementation would convert the /etc/group encoding
to the locale's encoding (just as it should when fetching the data
from LDAP).

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

If it is desired that we support this specific implementation aspect, we
can certainly special-case Cygwin in getgrgid. However, I would prefer
if this issue was discussed with the Cygwin people first, suggesting
that it might be a Cygwin bug (one that it certainly shares with many
other POSIX implementations - people just don't use non-ASCII characters
in /etc/group).

If Cygwin ever changes its implementation in that respect, we would need
to have a way for telling which specific implementation is being used.

amauryfa · 2011-11-20T16:37:50Z

grp.getgrgid() now calls .decode('utf8', errors="surrogateescape").
Even if cygwin does not correctly copy strings from the Windows registry, tarinfo.gname should now contain a string that will at least round trip and give the same value on disk.
Ocean-city, can you check whether this error still reproduces?

ezio-melotti closed this as completed Jan 13, 2016

ezio-melotti added the type-bug An unexpected behavior, bug, or error label Jan 13, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_tarfile fails on cygwin (unicode decode error) #48074

test_tarfile fails on cygwin (unicode decode error) #48074

ocean-city mannequin commented Sep 9, 2008

ocean-city mannequin commented Sep 9, 2008

loewis mannequin commented Sep 9, 2008

ocean-city mannequin commented Sep 10, 2008

amauryfa commented Sep 10, 2008

ocean-city mannequin commented Sep 10, 2008

ocean-city mannequin commented Sep 22, 2008

loewis mannequin commented Sep 22, 2008

ocean-city mannequin commented Sep 22, 2008

loewis mannequin commented Sep 22, 2008

rpetrov mannequin commented Sep 23, 2008

ocean-city mannequin commented Sep 24, 2008

amauryfa commented Sep 24, 2008

ocean-city mannequin commented Sep 24, 2008

loewis mannequin commented Sep 24, 2008

amauryfa commented Nov 20, 2011

test_tarfile fails on cygwin (unicode decode error) #48074

test_tarfile fails on cygwin (unicode decode error) #48074

Comments

ocean-city mannequin commented Sep 9, 2008

ocean-city mannequin commented Sep 9, 2008

loewis mannequin commented Sep 9, 2008

ocean-city mannequin commented Sep 10, 2008

amauryfa commented Sep 10, 2008

ocean-city mannequin commented Sep 10, 2008

ocean-city mannequin commented Sep 22, 2008

loewis mannequin commented Sep 22, 2008

ocean-city mannequin commented Sep 22, 2008

loewis mannequin commented Sep 22, 2008

rpetrov mannequin commented Sep 23, 2008

ocean-city mannequin commented Sep 24, 2008

amauryfa commented Sep 24, 2008

ocean-city mannequin commented Sep 24, 2008

loewis mannequin commented Sep 24, 2008

amauryfa commented Nov 20, 2011