Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 16 additions & 11 deletions Doc/library/codecs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -982,17 +982,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
code point, is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
machine you will always have to swap bytes on encoding and decoding.
Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
platform's native byte order when no BOM is present.
Python follows prevailing platform
practice, so native-endian data round-trips without redundant byte swapping,
even though the Unicode Standard defaults to big-endian when the byte order is
unspecified. When these bytes are read by a CPU with a different endianness,
the bytes have to be swapped. To be able to detect the endianness of a
``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
This is the Unicode character ``U+FEFF``. This character can be prepended to every
``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
``U+FFFE``, the bytes have to be swapped on decoding.

Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
Expand Down
Loading