bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
Modify locale.localeconv(), time.tzname, os.strerror() and other
functions to ignore the UTF-8 Mode: always use the current locale
encoding.

Changes:

* Add _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx(). On decoding or
  encoding error, they return the position of the error and an error
  message which are used to raise Unicode errors in
  PyUnicode_DecodeLocale() and PyUnicode_EncodeLocale().
* Replace _Py_DecodeCurrentLocale() with _Py_DecodeLocaleEx().
* PyUnicode_DecodeLocale() now uses _Py_DecodeLocaleEx() for all
  cases, especially for the strict error handler.
* Add _Py_DecodeUTF8Ex(): returns more information on decoding errors
  and supports the strict error handler.
* Rename _Py_EncodeUTF8_surrogateescape() to _Py_EncodeUTF8Ex().
* Replace _Py_EncodeCurrentLocale() with _Py_EncodeLocaleEx().
* Ignore the UTF-8 mode to encode/decode localeconv(), strerror()
  and time zone name.
* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
  and PyUnicode_EncodeLocale() now ignore the UTF-8 mode: always use
  the "current" locale.
* Remove _PyUnicode_DecodeCurrentLocale(),
  _PyUnicode_DecodeCurrentLocaleAndSize() and
  _PyUnicode_EncodeCurrentLocale().
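
The error-reporting contract described in the first bullet (a status code plus the position of the error and a reason string) can be illustrated with a minimal sketch in standard C. This is not the CPython implementation: the function name `decode_current_locale_strict`, its return codes and the reason text are hypothetical, and the sketch relies only on `mbrtowc()` from the C library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, NOT CPython code: decode `arg` using the current
   LC_CTYPE locale with strict error handling.
   Returns 0 on success (*wstr is a malloc'ed wide string),
   -1 on memory allocation failure,
   -2 on decoding error (*error_pos and *reason are then set). */
int
decode_current_locale_strict(const char *arg, wchar_t **wstr,
                             size_t *error_pos, const char **reason)
{
    size_t len = strlen(arg);
    if (len > SIZE_MAX / sizeof(wchar_t) - 1) {
        return -1;
    }
    /* One wide character per input byte is an upper bound. */
    wchar_t *res = malloc((len + 1) * sizeof(wchar_t));
    if (res == NULL) {
        return -1;
    }

    mbstate_t state;
    memset(&state, 0, sizeof(state));
    size_t in = 0, out = 0;
    while (in < len) {
        size_t n = mbrtowc(&res[out], arg + in, len - in, &state);
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: report where it starts. */
            free(res);
            *error_pos = in;
            *reason = "decoding error";  /* illustrative text only */
            return -2;
        }
        if (n == 0) {
            break;  /* decoded a null character: stop */
        }
        in += n;
        out++;
    }
    res[out] = L'\0';
    *wstr = res;
    return 0;
}
```

A caller would normally run `setlocale(LC_CTYPE, "")` first so that the user's locale is active; without that call the program decodes with the "C" locale. On a `-2` result, the position and reason are what a wrapper would use to build a Unicode decode error.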
vstinner committed Jan 15, 2018
1 parent ee3b835 commit 7ed7aea
Showing 12 changed files with 472 additions and 505 deletions.
22 changes: 22 additions & 0 deletions Doc/c-api/sys.rst
@@ -106,6 +106,16 @@ Operating System Utilities
    surrogate character, escape the bytes using the surrogateescape error
    handler instead of decoding them.

+   Encoding, highest priority to lowest priority:
+
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions use the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
+
    Return a pointer to a newly allocated wide character string, use
    :c:func:`PyMem_RawFree` to free the memory. If size is not ``NULL``, write
    the number of wide characters excluding the null character into ``*size``
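
The priority order documented in the hunk above can be read as a simple decision chain. The following is a sketch under stated assumptions, not CPython code: `struct locale_state`, its fields and `select_encoding` are hypothetical names, and the three boolean inputs stand in for checks that the real implementation performs itself.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical description of the platform and runtime state. */
struct locale_state {
    bool is_macos_or_android;      /* platform always uses UTF-8 */
    bool utf8_mode;                /* Python UTF-8 mode is enabled */
    bool c_locale_forces_ascii;    /* LC_CTYPE is "C", CODESET is ASCII and
                                      mbstowcs()/wcstombs() use ISO-8859-1 */
    const char *current_locale_encoding;
};

/* Return the encoding name, checking the highest priority rule first. */
const char *
select_encoding(const struct locale_state *st)
{
    if (st->is_macos_or_android) {
        return "UTF-8";
    }
    if (st->utf8_mode) {
        return "UTF-8";
    }
    if (st->c_locale_forces_ascii) {
        return "ASCII";
    }
    return st->current_locale_encoding;
}
```

The order of the checks matters: the UTF-8 mode wins over the ASCII special case, and the current locale encoding is only the fallback.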
@@ -137,6 +147,18 @@ Operating System Utilities
    :ref:`surrogateescape error handler <surrogateescape>`: surrogate characters
    in the range U+DC80..U+DCFF are converted to bytes 0x80..0xFF.

+   Encoding, highest priority to lowest priority:
+
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions use the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
+
+   The function uses the UTF-8 encoding in the Python UTF-8 mode.
+
    Return a pointer to a newly allocated byte string, use :c:func:`PyMem_Free`
    to free the memory. Return ``NULL`` on encoding error or memory allocation
    error
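
The surrogateescape mapping mentioned in the two hunks above (bytes 0x80..0xFF on one side, U+DC80..U+DCFF on the other) is a fixed offset of 0xDC00. A minimal sketch in standard C follows; `escape_byte` and `unescape_char` are hypothetical names, and the sketch assumes every non-ASCII byte is undecodable, as it would be for an ASCII codec.

```c
#include <wchar.h>

/* Decoding direction: an undecodable byte 0x80..0xFF becomes the lone
   surrogate U+DC80..U+DCFF. ASCII bytes map to themselves. */
wchar_t
escape_byte(unsigned char byte)
{
    if (byte < 0x80) {
        return (wchar_t)byte;
    }
    return (wchar_t)(0xDC00 + byte);
}

/* Encoding direction: store the original byte in *out and return 1 if
   `ch` is ASCII or an escaped surrogate. Return 0 for any other
   character, which a real codec would have to encode itself. */
int
unescape_char(wchar_t ch, unsigned char *out)
{
    unsigned long code = (unsigned long)ch;
    if (code < 0x80) {
        *out = (unsigned char)code;
        return 1;
    }
    if (code >= 0xDC80 && code <= 0xDCFF) {
        *out = (unsigned char)(code - 0xDC00);
        return 1;
    }
    return 0;
}
```

Because the two functions are inverses over the byte range, a byte string decoded this way can be encoded back without loss, which is the point of the error handler.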
16 changes: 16 additions & 0 deletions Doc/c-api/unicode.rst
@@ -770,12 +770,20 @@ system.
    :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
    Python startup).

+   This function ignores the Python UTF-8 mode.
+
    .. seealso::

       The :c:func:`Py_DecodeLocale` function.

    .. versionadded:: 3.3

+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_DecodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
+

 .. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
@@ -797,12 +805,20 @@ system.
    :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
    Python startup).

+   This function ignores the Python UTF-8 mode.
+
    .. seealso::

       The :c:func:`Py_EncodeLocale` function.

    .. versionadded:: 3.3

+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_EncodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
+

 File System Encoding
 """"""""""""""""""""
37 changes: 30 additions & 7 deletions Include/fileutils.h
@@ -20,18 +20,41 @@ PyAPI_FUNC(char*) _Py_EncodeLocaleRaw(
 #endif

 #ifdef Py_BUILD_CORE
+PyAPI_FUNC(int) _Py_DecodeUTF8Ex(
+    const char *arg,
+    Py_ssize_t arglen,
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int surrogateescape);
+
+PyAPI_FUNC(int) _Py_EncodeUTF8Ex(
+    const wchar_t *text,
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int raw_malloc,
+    int surrogateescape);
+
 PyAPI_FUNC(wchar_t*) _Py_DecodeUTF8_surrogateescape(
-    const char *s,
-    Py_ssize_t size,
-    size_t *p_wlen);
+    const char *arg,
+    Py_ssize_t arglen);

-PyAPI_FUNC(wchar_t *) _Py_DecodeCurrentLocale(
+PyAPI_FUNC(int) _Py_DecodeLocaleEx(
     const char *arg,
-    size_t *size);
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);

-PyAPI_FUNC(char*) _Py_EncodeCurrentLocale(
+PyAPI_FUNC(int) _Py_EncodeLocaleEx(
     const wchar_t *text,
-    size_t *error_pos);
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
 #endif

 #ifndef Py_LIMITED_API
14 changes: 0 additions & 14 deletions Include/unicodeobject.h
@@ -1810,20 +1810,6 @@ PyAPI_FUNC(PyObject*) PyUnicode_EncodeLocale(
     PyObject *unicode,
     const char *errors
     );
-
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocale(
-    const char *str,
-    const char *errors);
-
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocaleAndSize(
-    const char *str,
-    Py_ssize_t len,
-    const char *errors);
-
-PyAPI_FUNC(PyObject*) _PyUnicode_EncodeCurrentLocale(
-    PyObject *unicode,
-    const char *errors
-    );
 #endif

 /* --- File system encoding ---------------------------------------------- */
2 changes: 1 addition & 1 deletion Modules/_datetimemodule.c
@@ -696,7 +696,7 @@ static int parse_isoformat_date(const char *dtstr,
if (NULL == p) {
return -1;
}

if (*(p++) != '-') {
return -2;
}
3 changes: 2 additions & 1 deletion Modules/_localemodule.c
@@ -572,8 +572,9 @@ PyIntl_bind_textdomain_codeset(PyObject* self,PyObject*args)
     if (!PyArg_ParseTuple(args, "sz", &domain, &codeset))
         return NULL;
     codeset = bind_textdomain_codeset(domain, codeset);
-    if (codeset)
+    if (codeset) {
         return PyUnicode_DecodeLocale(codeset, NULL);
+    }
     Py_RETURN_NONE;
 }
 #endif
4 changes: 2 additions & 2 deletions Modules/getpath.c
@@ -449,8 +449,8 @@ search_for_exec_prefix(const _PyCoreConfig *core_config,
         n = fread(buf, 1, MAXPATHLEN, f);
         buf[n] = '\0';
         fclose(f);
-        rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n, NULL);
-        if (rel_builddir_path != NULL) {
+        rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n);
+        if (rel_builddir_path) {
             wcsncpy(exec_prefix, calculate->argv0_path, MAXPATHLEN);
             exec_prefix[MAXPATHLEN] = L'\0';
             joinpath(exec_prefix, rel_builddir_path);
4 changes: 2 additions & 2 deletions Modules/readline.c
@@ -132,13 +132,13 @@ static PyModuleDef readlinemodule;
 static PyObject *
 encode(PyObject *b)
 {
-    return _PyUnicode_EncodeCurrentLocale(b, "surrogateescape");
+    return PyUnicode_EncodeLocale(b, "surrogateescape");
 }

 static PyObject *
 decode(const char *s)
 {
-    return _PyUnicode_DecodeCurrentLocale(s, "surrogateescape");
+    return PyUnicode_DecodeLocale(s, "surrogateescape");
 }


11 changes: 5 additions & 6 deletions Modules/timemodule.c
@@ -418,11 +418,11 @@ tmtotuple(struct tm *p
     SET(8, p->tm_isdst);
 #ifdef HAVE_STRUCT_TM_TM_ZONE
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(p->tm_zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(p->tm_zone, "surrogateescape"));
     SET(10, p->tm_gmtoff);
 #else
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(zone, "surrogateescape"));
     PyStructSequence_SET_ITEM(v, 10, _PyLong_FromTime_t(gmtoff));
 #endif /* HAVE_STRUCT_TM_TM_ZONE */
 #undef SET
@@ -809,8 +809,7 @@ time_strftime(PyObject *self, PyObject *args)
 #ifdef HAVE_WCSFTIME
             ret = PyUnicode_FromWideChar(outbuf, buflen);
 #else
-            ret = _PyUnicode_DecodeCurrentLocaleAndSize(outbuf, buflen,
-                                                        "surrogateescape");
+            ret = PyUnicode_DecodeLocaleAndSize(outbuf, buflen, "surrogateescape");
 #endif
             PyMem_Free(outbuf);
             break;
@@ -1541,8 +1540,8 @@ PyInit_timezone(PyObject *m) {
     PyModule_AddIntConstant(m, "altzone", timezone-3600);
 #endif
     PyModule_AddIntConstant(m, "daylight", daylight);
-    otz0 = _PyUnicode_DecodeCurrentLocale(tzname[0], "surrogateescape");
-    otz1 = _PyUnicode_DecodeCurrentLocale(tzname[1], "surrogateescape");
+    otz0 = PyUnicode_DecodeLocale(tzname[0], "surrogateescape");
+    otz1 = PyUnicode_DecodeLocale(tzname[1], "surrogateescape");
     PyModule_AddObject(m, "tzname", Py_BuildValue("(NN)", otz0, otz1));
 #else /* !HAVE_TZNAME || __GLIBC__ || __CYGWIN__*/
     {