Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_tarfile fails on cygwin (unicode decode error) #48074

Closed
ocean-city mannequin opened this issue Sep 9, 2008 · 15 comments
Closed

test_tarfile fails on cygwin (unicode decode error) #48074

ocean-city mannequin opened this issue Sep 9, 2008 · 15 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@ocean-city
Copy link
Mannequin

ocean-city mannequin commented Sep 9, 2008

BPO 3824
Nosy @loewis, @amauryfa

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2016-01-13.17:39:53.619>
created_at = <Date 2008-09-09.20:42:08.273>
labels = ['type-bug']
title = 'test_tarfile fails on cygwin (unicode decode error)'
updated_at = <Date 2016-01-13.17:39:53.618>
user = 'https://bugs.python.org/ocean-city'

bugs.python.org fields:

activity = <Date 2016-01-13.17:39:53.618>
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date = <Date 2016-01-13.17:39:53.619>
closer = 'ezio.melotti'
components = []
creation = <Date 2008-09-09.20:42:08.273>
creator = 'ocean-city'
dependencies = []
files = []
hgrepos = []
issue_num = 3824
keywords = ['patch']
message_count = 15.0
messages = ['72907', '72918', '72957', '72959', '72960', '73545', '73556', '73580', '73598', '73658', '73681', '73707', '73731', '73748', '148001']
nosy_count = 4.0
nosy_names = ['loewis', 'amaury.forgeotdarc', 'ocean-city', 'rpetrov']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue3824'
versions = ['Python 3.0']

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 9, 2008

I noticed test_tarfile on py3k fails like this.

======================================================================
ERROR: test_directory_size (main.WriteTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "test_tarfile.py", line 598, in test_directory_size
    tarinfo = tar.gettarinfo(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1869, in
gettari
nfo
    tarinfo.gname = grp.getgrgid(tarinfo.gid)[0]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0:
unexpecte
d code byte

======================================

And I noticed PyUnicode_FromString supposes input as UTF-8, but actually
member of struct grp is MBCS or CP932 on cygwin.

After patched following workaround, test passed. I don't know how to fix
this... Does python have system encoding or something?
(I experienced similar error on test_grp and test_pwd)

Index: Modules/grpmodule.c
===================================================================

--- Modules/grpmodule.c	(revision 66345)
+++ Modules/grpmodule.c	(working copy)
@@ -32,6 +32,8 @@
 static int initialized;
 static PyTypeObject StructGrpType;
 
+#define PyUnicode_FromString(s) PyUnicode_DecodeMBCS(s, strlen(s),
"strict")
+
 static PyObject *
 mkgrent(struct group *p)
 {
@@ -83,6 +85,8 @@
     return v;
 }
 
+#undef PyUnicode_FromString
+
 static PyObject *
 grp_getgrgid(PyObject *self, PyObject *pyo_id)
 {

@loewis
Copy link
Mannequin

loewis mannequin commented Sep 9, 2008

I think you should use the locale's encoding to process the data, ie.
either mbstowcs, then Unicode from wchar_t, or decode with the
nl_langinfo(CODESET) encoding. You might have to set the locale before
this can work (which isn't thread-safe), so it might be tricky to implement.

Python already does nl_langinfo at startup, but then restores the
locale. It should probably save the default locale's codeset somewhere,
as C code requires it in many places.

There is also a "system" encoding, but that is UTF-8 independent of the
system.

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 10, 2008

Sorry, probably I saw illusion... If uses cp932 codec, still
test_tarfile.py reports error. :-(

======================================================================
ERROR: test_tar_size (main.WriteTest)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "test_tarfile.py", line 570, in test_tar_size
    tar.add(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1953, in add
    self.addfile(tarinfo, f)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1976, in
addfile
    buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 987, in
tobuf
    return self.create_gnu_header(info, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1018, in
create_
gnu_header
    return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1107, in
_create
_header
    stn(info.get("gname", "root"), 32, encoding, errors),
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 177, in stn
    s = s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-3: ordin
al not in range(128)

@amauryfa
Copy link
Member

Is PyUnicode_DecodeMBCS available on cygwin?
I get compilation errors when I try your patch.

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 10, 2008

Yes, when I did it last night, I thought I could compile it and saw OK
on test_tarfile.py, but probably I dreamed. :-(

#define PyUnicode_FromString(s) PyUnicode_Decode(s, strlen(s), "cp932",
"strict")

or following patch should work.

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 22, 2008

Sorry, the patch didn't work... I didn't understand Martin's word. And
nl_langinfo(CODESET) is useless on cygwin because it's always US-ASCII.

@loewis
Copy link
Mannequin

loewis mannequin commented Sep 22, 2008

I didn't mean to suggest that a new codec is created; instead, mbstowcs
should be called directly in grpmodule.c.

By default, mbstowcs will use ASCII, so it is likely to fail - you would
need to call setlocale first.

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 22, 2008

I'm not cygwin user, but cygwin seems not to support multibyte function.
Following program outputs 5 on VC6 as expected, but 10 on cygwin.
Hmm...

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char* argv[])
{
	const char s[] = "あいうえお";
	size_t len;
	wchar_t *buf;
setlocale(LC_ALL, "");
	len = strlen(s); /* 10 */

	buf = (wchar_t*)malloc((len+1)*sizeof(wchar_t));

	len = mbstowcs(buf, s, len+1);
printf("--------\> %d\\n", len); /* should be 5 \*/
	return 0;
}

@loewis
Copy link
Mannequin

loewis mannequin commented Sep 22, 2008

In this case, I think there is nothing we can do. Perhaps it is useful
to put a comment into the test, pointing out that this is likely to
break on Cygwin, and refer to this issue.

I don't see that as a problem: it's just a test that fails, and only on
some systems (i.e. when you have non-ASCII characters in the group
file). People running into the problem should first resolve the
underlying problem in Cygwin, and, when Cygwin actually works correctly,
come back to fixing this issue.

@rpetrov
Copy link
Mannequin

rpetrov mannequin commented Sep 23, 2008

What is test result if the environment variable LANG is set to C ?

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 24, 2008

What is test result if the environment variable LANG is set to C ?

There is no change.

@amauryfa
Copy link
Member

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

On cygwin, "mkgroup -l" is often (exclusively?) used to generate this
/etc/group, extracting the user definitions from the Windows settings.
Its source code
http://www.google.com/codesearch?q=file:mkgroup.c+package:cygwin-1.5
shows that it uses MBCS to encode strings.

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

@ocean-city
Copy link
Mannequin Author

ocean-city mannequin commented Sep 24, 2008

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

Yes, /etc/group contains "なし" as gr_name in MBCS,("なし" means
"nothing")and I can print it with puts() in grpmodule.c, so it
shouldn't be translated.

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

Me neigher.

@loewis
Copy link
Mannequin

loewis mannequin commented Sep 24, 2008

Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

That certainly depends on the implementation of getgrgid. On some
systems, it uses NIS, LDAP, or a relational database in addition to
or instead of /etc/group.

I don't think POSIX specifies the charset of gr_name, except perhaps
implicitly by inheriting the notion of "execution character set" from
C, which proposes that all char* should have the same interpretation.
In POSIX, the character set comes from the locale.
The getgrid man page gives an example demonstrating that you are
supposed to be able to printf() gr_name using %s, which suggests that
it should have the locale's encoding.

In Cygwin, I have no doubt that the implementation literally copies
the encoding untranslated. I would claim this to be a bug in Cygwin.
A quality POSIX implementation would convert the /etc/group encoding
to the locale's encoding (just as it should when fetching the data
from LDAP).

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.

If it is desired that we support this specific implementation aspect, we
can certainly special-case Cygwin in getgrgid. However, I would prefer
if this issue was discussed with the Cygwin people first, suggesting
that it might be a Cygwin bug (one that it certainly shares with many
other POSIX implementations - people just don't use non-ASCII characters
in /etc/group).

If Cygwin ever changes its implementation in that respect, we would need
to have a way for telling which specific implementation is being used.

@amauryfa
Copy link
Member

grp.getgrgid() now calls .decode('utf8', errors="surrogateescape").
Even if cygwin does not correctly copy strings from the Windows registry, tarinfo.gname should now contain a string that will at least round trip and give the same value on disk.
Ocean-city, can you check whether this error still reproduces?

@ezio-melotti ezio-melotti added the type-bug An unexpected behavior, bug, or error label Jan 13, 2016
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants