test_tarfile fails on cygwin (unicode decode error) #48074
Comments
I noticed test_tarfile on py3k fails like this.
======================================================================
Traceback (most recent call last):
  File "test_tarfile.py", line 598, in test_directory_size
    tarinfo = tar.gettarinfo(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1869, in gettarinfo
    tarinfo.gname = grp.getgrgid(tarinfo.gid)[0]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: unexpected code byte
======================================
And I noticed PyUnicode_FromString assumes its input is UTF-8, but actually
After I applied the following workaround, the test passed. I don't know how to fix
Index: Modules/grpmodule.c
--- Modules/grpmodule.c (revision 66345)
+++ Modules/grpmodule.c (working copy)
@@ -32,6 +32,8 @@
static int initialized;
static PyTypeObject StructGrpType;
+#define PyUnicode_FromString(s) PyUnicode_DecodeMBCS(s, strlen(s), "strict")
+
static PyObject *
mkgrent(struct group *p)
{
@@ -83,6 +85,8 @@
return v;
}
+#undef PyUnicode_FromString
+
static PyObject *
grp_getgrgid(PyObject *self, PyObject *pyo_id)
{ |
I think you should use the locale's encoding to process the data, i.e.
Python already does nl_langinfo at startup, but then restores the
There is also a "system" encoding, but that is UTF-8 independent of the |
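To illustrate the suggestion of using the locale's encoding, here is a minimal Python sketch. It is not what grpmodule.c does; `decode_group_name` is a hypothetical helper, and the choice of `errors="replace"` is an assumption made only so the sketch never raises.

```python
import locale


def decode_group_name(raw: bytes) -> str:
    """Hypothetical helper: decode raw bytes with the locale's preferred encoding."""
    # Passing False asks for the encoding without calling setlocale() as a side effect.
    encoding = locale.getpreferredencoding(False)
    # Assumption: replace undecodable bytes rather than raising.
    return raw.decode(encoding, errors="replace")


print(decode_group_name(b"root"))
```

Whether this helps on Cygwin depends on the locale actually reporting the encoding that /etc/group was written in, which the thread suggests may not be the case.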
Sorry, I probably saw an illusion... If it uses the cp932 codec, still
======================================================================
Traceback (most recent call last):
  File "test_tarfile.py", line 570, in test_tar_size
    tar.add(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1953, in add
    self.addfile(tarinfo, f)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1976, in addfile
    buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 987, in tobuf
    return self.create_gnu_header(info, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1018, in create_gnu_header
    return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1107, in _create_header
    stn(info.get("gname", "root"), 32, encoding, errors),
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 177, in stn
    s = s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128) |
Is PyUnicode_DecodeMBCS available on cygwin? |
Yes, when I did it last night, I thought I could compile it and saw OK
#define PyUnicode_FromString(s) PyUnicode_Decode(s, strlen(s), "cp932", "strict")
or following patch should work. |
Sorry, the patch didn't work... I didn't understand Martin's word. And |
I didn't mean to suggest that a new codec is created; instead, mbstowcs
By default, mbstowcs will use ASCII, so it is likely to fail - you would |
I'm not a cygwin user, but cygwin seems not to support multibyte functions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strlen */
#include <locale.h>
int main(int argc, char* argv[])
{
    const char s[] = "あいうえお";
    size_t len;
    wchar_t *buf;
    len = strlen(s); /* 10 */
    buf = (wchar_t*)malloc((len+1)*sizeof(wchar_t));
    if (buf == NULL)
        return 1;
    len = mbstowcs(buf, s, len+1); /* (size_t)-1 on an invalid multibyte sequence */
    free(buf);
    return 0;
} |
In this case, I think there is nothing we can do. Perhaps it is useful
I don't see that as a problem: it's just a test that fails, and only on |
What is test result if the environment variable LANG is set to C ? |
There is no change. |
Doesn't getgrgid() return the untranslated content of /etc/group?
On cygwin, "mkgroup -l" is often (exclusively?) used to generate this
Maybe we should start considering cygwin as a posix platform with win32 |
Yes, /etc/group contains "なし" as gr_name in MBCS, ("なし" means
Me neither. |
That certainly depends on the implementation of getgrgid. On some
I don't think POSIX specifies the charset of gr_name, except perhaps
In Cygwin, I have no doubt that the implementation literally copies
If it is desired that we support this specific implementation aspect, we
If Cygwin ever changes its implementation in that respect, we would need |
grp.getgrgid() now calls .decode('utf8', errors="surrogateescape"). |
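As a rough illustration of what decoding with surrogateescape means for a name like the one reported above, here is a small Python sketch. The bytes are produced by encoding "なし" as cp932, which is an assumption about how the /etc/group entry was stored; the variable names are made up for the example.

```python
# Assumption for illustration: the group name as cp932 bytes, as /etc/group might store it.
raw = "なし".encode("cp932")

# Strict UTF-8 decoding rejects these bytes, as in the first traceback.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps undecodable bytes to lone surrogates instead of raising.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# Once the real encoding is known, the readable name can be recovered.
recovered = roundtrip.decode("cp932")
```

The decoded `name` is not human-readable, but no information is lost, so a caller that knows the real encoding can still get the original name back.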