From b51a52c275fc8ade6433ed2c69e2c16fe6dcff68 Mon Sep 17 00:00:00 2001 From: Parham MohammadAlizadeh Date: Sat, 18 Oct 2025 19:47:04 +0100 Subject: [PATCH] gh-128571: Document UTF-16/32 native byte order (GH-139974) Closes GH-128571 (cherry picked from commit 920de7ccdcfa7284b6d23a124771b17c66dd3e4f) Co-authored-by: Parham MohammadAlizadeh Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com> --- Doc/library/codecs.rst | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index 5932012c535b56..24b5a9d64b2cd2 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -982,17 +982,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each code point as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their -disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you -will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this -problem: bytes will always be in natural endianness. When these bytes are read -by a CPU with a different endianness, then bytes have to be swapped though. To -be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence, -there's the so called BOM ("Byte Order Mark"). This is the Unicode character -``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32`` -byte sequence. The byte swapped version of this character (``0xFFFE``) is an -illegal character that may not appear in a Unicode text. So when the -first character in a ``UTF-16`` or ``UTF-32`` byte sequence -appears to be a ``U+FFFE`` the bytes have to be swapped on decoding. +disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian +machine you will always have to swap bytes on encoding and decoding. +Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the +platform's native byte order when no BOM is present. +Python follows prevailing platform +practice, so native-endian data round-trips without redundant byte swapping, +even though the Unicode Standard defaults to big-endian when the byte order is +unspecified. When these bytes are read by a CPU with a different endianness, +the bytes have to be swapped. To be able to detect the endianness of a +``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used. +This is the Unicode character ``U+FEFF``. This character can be prepended to every +``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character +(``0xFFFE``) is an illegal character that may not appear in a Unicode text. +When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is +``U+FFFE``, the bytes have to be swapped on decoding. + Unfortunately the character ``U+FEFF`` had a second purpose as a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow a word to be split. It can e.g. be used to give hints to a ligature algorithm.