# Text Versus Bytes

## Byte Essentials

The first thing to know is that there are two basic built-in types for binary sequences: the `immutable bytes type` introduced in Python 3 and the `mutable bytearray`, added in Python 2.6.

In [13]:
s = 'cafe'
b = s.encode(encoding='utf8')
b

b'cafe'

In [5]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In [6]:
cafe[0]

99

In [7]:
cafe[:1]

b'c'

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [9]:
cafe_arr[-1:] 

bytearray(b'\xa9')

1. bytes can be built from a str, given an encoding.
2. Each item is an integer in range(256).
3. Slices of bytes are also bytes—even slices of a single byte.
4. There is no literal syntax for bytearray: they are shown as bytearray() with a bytes literal as argument.
5. A slice of bytearray is also a bytearray.

Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few others that depend on Unicode data, including casefold, isdecimal, isidentifier, isnumeric, isprintable, and encode.
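A minimal sketch of that claim, using the same UTF-8 encoded 'café' as the earlier cells. The variable names `data` and `missing` are chosen here for illustration only.

```python
data = bytes('café', encoding='utf_8')

# Most str-like methods work on bytes and return bytes (or bool).
print(data.upper())                  # only ASCII letters are uppercased
print(data.startswith(b'caf'))
print(data.replace(b'caf', b'CAF'))

# Methods tied to formatting or to Unicode data are not defined on bytes.
candidates = ('format', 'format_map', 'casefold', 'isdecimal',
              'isidentifier', 'isnumeric', 'isprintable', 'encode')
missing = [name for name in candidates if not hasattr(bytes, name)]
print(missing)
```

Note that `upper()` leaves the two bytes of “é” untouched, because bytes methods only know about ASCII.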

Binary sequences have a class method that str doesn’t have, called fromhex, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces:

In [10]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

In [11]:
# Initializing bytes from the raw data of an array
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

1. Typecode 'h' creates an array of short integers (16 bits).
2. octets holds a copy of the bytes that make up numbers.
3. These are the 10 bytes that represent the five short integers.

Creating a bytes or bytearray object from any buffer-like source will always copy the bytes. In contrast, memoryview objects let you share memory between binary data structures. To extract structured information from binary sequences, the struct module is invaluable.
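The difference between copying and sharing can be seen in a short sketch. It reuses the array of short integers from the cell above; the names `copied` and `shared` are illustrative.

```python
import array

numbers = array.array('h', [-2, -1, 0, 1, 2])
copied = bytes(numbers)        # independent copy of the 10 bytes
shared = memoryview(numbers)   # view over the same memory, nothing copied

numbers[0] = 7                 # mutate the original array

# The copy still holds the bytes of -2; the view reflects the new value.
print(copied[:2] == array.array('h', [-2]).tobytes())
print(shared[0])
```

The comparison against `array.array('h', [-2]).tobytes()` avoids hard-coding a byte order, since typecode 'h' uses the native one.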

## Structs and Memory Views

The struct module provides functions to parse packed bytes into a tuple of fields of different types and to perform the opposite conversion, from a tuple into packed bytes. struct is used with bytes, bytearray, and memoryview objects.
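As a rough illustration of both directions, the sketch below packs a tuple into bytes and unpacks it again. The format string mimics a GIF-style header; the width and height values are made up for the example.

```python
import struct

# '<' little-endian; '3s3s' two 3-byte strings; 'HH' two unsigned 16-bit ints
fmt = '<3s3sHH'

# Hypothetical field values, chosen only for illustration.
packed = struct.pack(fmt, b'GIF', b'89a', 555, 230)
print(packed)
print(struct.calcsize(fmt))      # size in bytes implied by the format

fields = struct.unpack(fmt, packed)
print(fields)
```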

The memoryview class does not let you create or store byte sequences, but provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images, without copying the bytes.

In [None]:
# Using memoryview and struct to inspect a GIF image header
import struct
fmt = '<3s3sHH'
with open('filter', 'rb') as fp:
    img = memoryview(fp.read())
header = img[:10]
bytes(header)
struct.unpack(fmt, header)
del header
del img

1. struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.
2. Create a memoryview from the file contents in memory...
3. ...then another memoryview by slicing the first one; no bytes are copied here.
4. Convert to bytes for display only; 10 bytes are copied here.
5. Unpack the memoryview into a tuple of type, version, width, and height.
6. Delete references to release the memory associated with the memoryview instances.


Even less byte copying would happen if I used the mmap module to open the image as a memory-mapped file.

## Basic Encoders / Decoders

The Python distribution bundles more than 100 codecs (encoders/decoders) for text-to-byte conversion and vice versa. Each codec has a name, like `utf_8`, and often aliases, such as `utf8` and `utf-8`, which you can use as the encoding argument in functions like open(), str.encode(), and bytes.decode().

In [14]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Nino'.encode(encoding=codec), sep='\t')

latin_1	b'El Nino'
utf_8	b'El Nino'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00n\x00o\x00'
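One way to check that several aliases resolve to the same codec is `codecs.lookup()`, which returns a `CodecInfo` object with a canonical `name`. This is a small sketch; the list of aliases is just a sample.

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'UTF-8']
for alias in aliases:
    # All of these spellings should resolve to the same codec.
    print(alias, '->', codecs.lookup(alias).name)
```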


## Understanding Encode/Decode Problems

Although there is a generic UnicodeError exception, the error reported is almost always more specific: either a UnicodeEncodeError (when converting str to binary sequences) or a UnicodeDecodeError (when reading binary sequences into str). Loading Python modules may also generate a SyntaxError when the source encoding is unexpected.

### Coping with UnicodeEncodeError

Most non-UTF codecs handle only a small subset of the Unicode characters. When converting text to bytes, if a character is not defined in the target encoding, UnicodeEncodeError will be raised, unless special handling is provided by passing an errors argument to the encoding method or function.

In [15]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [16]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [17]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [18]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [19]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

1. The 'utf_?' encodings handle any str.
2. 'cp437' can’t encode the 'ã' (“a” with tilde). The default error handler, 'strict', raises UnicodeEncodeError.
3. The errors='ignore' handler silently skips characters that cannot be encoded; this is usually a very bad idea.
4. When encoding, errors='replace' substitutes unencodable characters with '?'; data is lost, but users will know something is amiss.
5. 'xmlcharrefreplace' replaces unencodable characters with an XML entity.
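When an encoding failure should be reported rather than silently papered over, the exception itself says where the problem is. The helper below is a hypothetical sketch (the name `encode_or_report` is not from any library): it tries a strict encode first and falls back to 'replace' after printing the offending characters.

```python
def encode_or_report(text, encoding):
    """Hypothetical helper: encode strictly, or report and fall back to '?'."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.start and exc.end delimit the slice that could not be encoded.
        bad = text[exc.start:exc.end]
        print(f'{encoding} cannot encode {bad!r} at position {exc.start}')
        return text.encode(encoding, errors='replace')


print(encode_or_report('São Paulo', 'utf_8'))
print(encode_or_report('São Paulo', 'cp437'))
```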

### Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.

On the other hand, many legacy 8-bit encodings like 'cp1252' are able to decode any stream of bytes, including random noise, without generating errors. Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.

In [20]:
octets = b'Montr\xe9al'
octets.decode('cp1252') 

'Montréal'

In [21]:
octets.decode('iso8859_7')

'Montrιal'

In [22]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [23]:
octets.decode('utf_8', errors='replace')

'Montr�al'

1. These bytes are the characters for “Montréal” encoded as latin1; '\xe9' is the byte for “é”.
2. Decoding with 'cp1252' (Windows 1252) works because it is a proper superset of latin1.
3. ISO-8859-7 is intended for Greek, so the '\xe9' byte is misinterpreted, and no error is issued.
4. The 'utf_8' codec detects that octets is not valid UTF-8, and raises UnicodeDecodeError.
5. Using 'replace' error handling, the \xe9 is replaced by “�” (code point U+FFFD), the official Unicode REPLACEMENT CHARACTER intended to represent unknown characters.
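The “silently decode garbage” risk can be sketched with a byte string containing every possible byte value. The example assumes 'latin_1' as a representative 8-bit codec, since it maps each byte to a code point; other legacy codecs may behave differently for a few byte values.

```python
every_byte = bytes(range(256))     # all 256 byte values, not meaningful text

# An 8-bit codec such as latin_1 accepts anything without complaint...
text = every_byte.decode('latin_1')
print(len(text))

# ...whereas UTF-8 notices that the data is not valid.
try:
    every_byte.decode('utf_8')
except UnicodeDecodeError as exc:
    print('utf_8 rejected the data at position', exc.start)
```

The absence of an exception therefore says nothing about whether the chosen 8-bit encoding was the right one.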

### SyntaxError When Loading Modules with Unexpected Encoding

UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2 (starting with 2.5). If you load a .py module containing non-UTF-8 data and no encoding declaration, you get a message like this:

    SyntaxError: Non-UTF-8 code starting with '\xe1' in file ola.py on line
    1, but no encoding declared; see http://python.org/dev/peps/pep-0263/
    for details

Because UTF-8 is widely deployed in GNU/Linux and OSX systems, a likely scenario is opening a .py file created on Windows with cp1252. Note that this error happens even in Python for Windows, because the default encoding for Python 3 is UTF-8 across all platforms.

### How to Discover the Encoding of a Byte Sequence

How do you find the encoding of a byte sequence? Short answer: you can’t. You must be told.

In practice, the `Chardet` package can guess the encoding of a byte sequence reasonably well for a number of popular encodings.

### BOM: A Useful Gremlin

In [24]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The first two bytes are b'\xff\xfe'. That is a BOM (byte-order mark) denoting the “little-endian” byte ordering of the Intel CPU where the encoding was performed. On a little-endian machine, for each code point the least significant byte comes first: the letter 'E', code point U+0045 (decimal 69), is encoded in byte offsets 2 and 3 as 69 and 0:

In [25]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

On a big-endian CPU, the encoding would be reversed; 'E' would be encoded as 0 and 69. To avoid confusion, the UTF-16 encoding prepends the text to be encoded with the special character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is invisible. On a little-endian system, that is encoded as b'\xff\xfe' (decimal 255, 254). Because, by design, there is no U+FFFE character, the byte sequence b'\xff\xfe' must mean the ZERO WIDTH NO-BREAK SPACE on a little-endian encoding, so the codec knows which byte ordering to use. There is a variant of UTF-16, UTF-16LE, that is explicitly little-endian, and another one explicitly big-endian, UTF-16BE. If you use them, a BOM is not generated.
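A short sketch of the explicit variants follows. It uses the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants from the standard library; the variable names are illustrative.

```python
import codecs

text = 'El Niño'
le = text.encode('utf_16le')
be = text.encode('utf_16be')

# No BOM is generated: the output starts directly with the letter 'E'.
print(le[:2], be[:2])
print(le.startswith(codecs.BOM_UTF16_LE), be.startswith(codecs.BOM_UTF16_BE))

# The generic utf_16 decoder reads a BOM to pick the byte order.
print((codecs.BOM_UTF16_BE + be).decode('utf_16'))
```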

## Handling Text Files

The best practice for handling text is the “Unicode sandwich” (Figure 4-2). This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The “meat” of the sandwich is the business logic of your program, where text handling is done exclusively on str objects. You should never be encoding or decoding in the middle of other processing. On output, str objects are encoded to bytes as late as possible. Most web frameworks work like that, and we rarely touch bytes when using them.


Also make sure to explicitly specify the encoding standards for a file when reading and writing to it.
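A minimal sketch of that advice: the file is written and read back with an explicit `encoding`, then inspected in binary mode. The temporary directory and the file name `cafe.txt` are assumptions made for the example.

```python
import os
import tempfile

# Throwaway location; the file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

with open(path, 'w', encoding='utf_8') as fp:
    fp.write('café')                 # str in, encoded on the way out

print(os.stat(path).st_size)         # bytes on disk: 'é' takes two in UTF-8

with open(path, encoding='utf_8') as fp:
    print(fp.read())                 # decoded back to str on the way in

with open(path, 'rb') as fp:
    raw = fp.read()                  # binary mode: no decoding at all
print(raw)
```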

### Encoding Defaults: A Madhouse

Several settings affect the encoding defaults for I/O in Python. The default_encodings.py script below displays them:


In [32]:
import sys
import locale

expressions = """
 locale.getpreferredencoding()
 type(my_file)
 my_file.encoding
 sys.stdout.isatty()
 sys.stdout.encoding
 sys.stdin.isatty()
 sys.stdin.encoding
 sys.stderr.isatty()
 sys.stderr.encoding
 sys.getdefaultencoding()
 sys.getfilesystemencoding()
 """
my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '->', repr(value))

 locale.getpreferredencoding() -> 'cp1252'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp1252'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


If you omit the encoding argument when opening a file, the default is given by
locale.getpreferredencoding() ('cp1252' in Example 4-12).

## Normalizing Unicode for Saner Comparisons

String comparisons are complicated by the fact that Unicode has combining characters: diacritics and other marks that attach to the preceding character, appearing as one when printed.
For example, the word “café” may be composed in two ways, using four or five code points, but the result looks exactly the same:

In [38]:
s1 = 'café'
s2 = 'cafe\u0301'
s1, s2

('café', 'café')

In [34]:
s1 == s2

False

The code point U+0301 is the COMBINING ACUTE ACCENT. Using it after “e” renders “é”. In the Unicode standard, sequences like 'é' and 'e\u0301' are called `“canonical equivalents,”` and applications are supposed to treat them as the same.
But Python sees two different sequences of code points, and considers them not equal. The solution is to use Unicode normalization, provided by the `unicodedata.normalize` function. The first argument to that function is one of four strings: `'NFC', 'NFD', 'NFKC', and 'NFKD'.` Let’s start with the first two.

Normalization Form C (NFC) composes the code points to produce the shortest equivalent string, while NFD decomposes, expanding composed characters into base characters and separate combining characters. Both of these normalizations make comparisons work as expected:

In [39]:
from unicodedata import normalize
len(s1), len(s2)

(4, 5)

In [41]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))


(4, 4)

In [42]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [43]:
normalize('NFD', s1) == normalize('NFD', s2)

True

### Case Folding

Case folding is essentially converting all text to lowercase, with some additional transformations. It is supported by the `str.casefold` method.
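Two well-known cases where `casefold()` and `lower()` disagree give a feel for those additional transformations. This is only a sketch; escapes are used so the code points are unambiguous.

```python
eszett_word = 'Stra\u00dfe'          # 'Straße', with LATIN SMALL LETTER SHARP S
print(eszett_word.lower())           # the sharp s is kept
print(eszett_word.casefold())        # the sharp s becomes 'ss'

micro = '\u00b5'                     # MICRO SIGN
greek_mu = '\u03bc'                  # GREEK SMALL LETTER MU
print(micro.lower() == greek_mu)     # lower() leaves the micro sign alone
print(micro.casefold() == greek_mu)  # casefold() maps it to Greek mu
```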

### Utility Functions for Normalized Text Matching

As we’ve seen, NFC and NFD are safe to use and allow sensible comparisons between Unicode strings. NFC is the best normalized form for most applications. str.casefold() is the way to go for case-insensitive comparisons.

If you work with text in many languages, a pair of functions like `nfc_equal` and `fold_equal` is a useful addition to your toolbox:

In [44]:
from unicodedata import normalize


def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)


def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())
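A usage sketch for the two functions, repeated here so the block runs on its own. The sample strings are illustrative.

```python
from unicodedata import normalize


def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)


def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())


s1 = 'caf\u00e9'       # precomposed 'é'
s2 = 'cafe\u0301'      # 'e' followed by COMBINING ACUTE ACCENT

print(s1 == s2)                            # plain comparison fails
print(nfc_equal(s1, s2))                   # normalized comparison succeeds
print(nfc_equal('A', 'a'))                 # still case-sensitive
print(fold_equal('Stra\u00dfe', 'STRASSE'))  # case-insensitive after folding
```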


### Extreme “Normalization”: Taking Out Diacritics

Removing diacritics is not a proper form of normalization because it often changes the meaning of words and may produce false positives when searching. But it helps to cope with some facts of life: people sometimes are lazy or ignorant about the correct use of diacritics, and spelling rules change over time, meaning that accents come and go in living languages.

In [46]:
import unicodedata


def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)


In [47]:
shave_marks('caffè')

'caffe'

## Sorting Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by one. For strings, this means comparing the code points. Unfortunately, this produces unacceptable results for anyone who uses non-ASCII characters.


The standard way to sort non-ASCII text in Python is to use the locale.strxfrm function which, according to the locale module docs, “transforms a string to one that can be used in locale-aware comparisons.”

In [48]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits) 

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

In [49]:
import locale
locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')

'pt_BR.UTF-8'

In [50]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
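One caveat of this approach is that the requested locale must be installed on the system, otherwise `setlocale` raises `locale.Error`. The sketch below wraps the idea in a hypothetical helper, `locale_sorted`, that falls back to plain code point ordering when the locale is unavailable and restores the previous collation setting afterwards.

```python
import locale


def locale_sorted(words, locale_name='pt_BR.UTF-8'):
    """Hypothetical helper: locale-aware sort, or code point sort as fallback."""
    previous = locale.setlocale(locale.LC_COLLATE)   # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)                         # locale not installed
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(locale_sorted(fruits))
```

The resulting order depends on the platform's locale support, so no particular output is shown. Because `setlocale` changes process-wide state, calling it inside library code is generally best avoided.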

### Sorting with the Unicode Collation Algorithm

```py
>>> import pyuca
>>> coll = pyuca.Collator()
>>> fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
>>> sorted_fruits = sorted(fruits, key=coll.sort_key)
>>> sorted_fruits
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
```

## The Unicode Database

The Unicode standard provides an entire database (in the form of numerous structured text files) that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the str methods isidentifier, isprintable, isdecimal, and isnumeric work. str.casefold also uses information from a Unicode table.

The unicodedata module has functions that return character metadata; for instance, its official name in the standard, whether it is a combining character (e.g., diacritic like a combining tilde), and the numeric value of the symbol for humans (not its code point).

In [51]:
import unicodedata
import re
re_digit = re.compile(r'\d')
sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'
for char in sample:
    print('U+%04x' % ord(char),
          char.center(6),
          're_dig' if re_digit.match(char) else '-',
          'isdig' if char.isdigit() else '-',
          'isnum' if char.isnumeric() else '-',
          format(unicodedata.numeric(char), '5.2f'),
          unicodedata.name(char),
          sep='\t')


U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


## Dual-Mode str and bytes APIs

The standard library has functions that accept str or bytes arguments and behave differently depending on the type. Some examples are in the re and os modules.

### str Versus bytes in Regular Expressions

If you build a regular expression with bytes, patterns such as \d and \w only match ASCII characters; in contrast, if these patterns are given as str, they match Unicode digits or letters beyond ASCII.

In [52]:
import re

re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 1³ + 12³ = 9³ + 10³.")
text_bytes = text_str.encode('utf_8')
print('Text', repr(text_str), sep='\n ')
print('Numbers')
print(' str :', re_numbers_str.findall(text_str))
print(' bytes:', re_numbers_bytes.findall(text_bytes))
print('Words')
print(' str :', re_words_str.findall(text_str))
print(' bytes:', re_words_bytes.findall(text_bytes))


Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
 str : ['௧௭௨௯', '1729', '1', '12', '9', '10']
 bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
 bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']


### str Versus bytes on os Functions

The GNU/Linux kernel is not Unicode savvy, so in the real world you may find filenames made of byte sequences that are not valid in any sensible encoding scheme, and cannot be decoded to str. File servers with clients using a variety of OSes are particularly prone to this problem.

In order to work around this issue, all os module functions that accept filenames or pathnames take arguments as str or bytes. If one such function is called with a str argument, the argument will be automatically converted using the codec named by sys.getfilesystemencoding(), and the OS response will be decoded with the same codec. This is almost always what you want, in keeping with the Unicode sandwich best practice.

But if you must deal with (and perhaps fix) filenames that cannot be handled in that way, you can pass bytes arguments to the os functions to get bytes return values. This feature lets you deal with any file or pathname, no matter how many gremlins you may find.

In [53]:
import os 

os.listdir('.')

['Ch2_AnArrayOfSequences.ipynb',
 'Ch3_DictionariesAndSets.ipynb',
 'Ch4_Text_Vs_Bytes.ipynb',
 'dummy',
 'floats.bin']

In [54]:
os.listdir(b'.')

[b'Ch2_AnArrayOfSequences.ipynb',
 b'Ch3_DictionariesAndSets.ipynb',
 b'Ch4_Text_Vs_Bytes.ipynb',
 b'dummy',
 b'floats.bin']

## Chapter Summary

`We started the chapter by dismissing the notion that 1 character == 1 byte.` As the world adopts Unicode (80% of websites already use UTF-8), we need to keep the concept of text strings separated from the binary sequences that represent them in files, and Python 3 enforces this separation.

After a brief overview of the binary sequence data types (`bytes, bytearray, and memoryview`), we jumped into encoding and decoding, with a sampling of important codecs, followed by approaches to prevent or deal with the infamous `UnicodeEncodeError, UnicodeDecodeError`, and the `SyntaxError` caused by wrong encoding in Python source files.

We then considered the theory and practice of encoding detection in the absence of metadata: in theory, it can’t be done, but in practice the `Chardet package` pulls it off pretty well for a number of popular encodings. `Byte order marks` were then presented as the only encoding hint commonly found in UTF-16 and UTF-32 files, and sometimes in UTF-8 files as well.

In the next section, we demonstrated opening text files, an easy task except for one pitfall: `the encoding= keyword argument is not mandatory when you open a text file, but it should be.` If you fail to specify the encoding, you end up with a program that manages to generate “plain text” that is incompatible across platforms, due to conflicting default encodings. We then exposed the different encoding settings that Python uses as defaults and how to detect them: locale.getpreferredencoding(), sys.getfilesystemencoding(), sys.getdefaultencoding(), and the encodings for the standard I/O files (e.g., sys.stdout.encoding). A sad realization for Windows users is that these settings often have distinct values within the same machine, and the values are mutually incompatible; GNU/Linux and OSX users, in contrast, live in a happier place where UTF-8 is the default pretty much everywhere.

`Text comparisons are surprisingly complicated because Unicode provides multiple ways of representing some characters, so normalizing is a prerequisite to text matching.` In addition to explaining normalization and case folding, we presented some utility functions that you may adapt to your needs, including drastic transformations like removing all accents. We then saw how to sort Unicode text correctly by leveraging the standard locale module—with some caveats—and an alternative that does not depend on tricky locale configurations: the external PyUCA package.

Finally, we glanced at the Unicode database (a source of metadata about every character), and wrapped up with a brief discussion of dual-mode APIs (e.g., the `re and os modules, where some functions can be called with str or bytes arguments, prompting different yet fitting results`).