# Text versus Bytes
Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes.

## Character Issues
The concept of 'string' is simple enough: a string is a sequence of characters. The problem lies in the definition of 'character.' In Python 3, the items of a str are Unicode characters, each identified by a code point. The actual bytes that represent a character depend on the encoding in use. Converting from code points to bytes is encoding; converting from bytes to code points is decoding.

In [1]:
s = 'café'
len(s)

4

In [2]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [3]:
len(b)

5

In [4]:
b.decode('utf8')

'café'

In [5]:
len(b.decode('utf8'))

4
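To make the difference between code points and bytes concrete, the sketch below lists each character of the same string next to its code point and its UTF-8 bytes. The variable names are illustrative only.

```python
s = 'café'

# For each character: its Unicode code point and the bytes UTF-8 uses for it.
rows = [(char, f'U+{ord(char):04X}', char.encode('utf8')) for char in s]
for char, code_point, encoded in rows:
    print(char, code_point, encoded)

# Only 'é' lies outside ASCII, so it alone needs two bytes in UTF-8.
byte_counts = [len(encoded) for _, _, encoded in rows]
```

This is why len(s) is 4 while len(b) is 5 in the cells above.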

## Byte Essentials
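Each item in bytes or bytearray is an integer from 0 to 255. A slice of a binary sequence, however, is always a binary sequence of the same type, even when the slice has length 1.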

In [6]:
cafe = bytes('café', encoding='utf_8')

In [7]:
cafe

b'caf\xc3\xa9'

In [8]:
cafe[0]

99

In [9]:
cafe[:1]

b'c'

In [10]:
cafe_arr = bytearray(cafe)

In [11]:
cafe_arr

bytearray(b'caf\xc3\xa9')

In [12]:
cafe_arr[-1:]

bytearray(b'\xa9')
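One more difference is worth seeing in code: bytes is immutable, while bytearray can be changed in place. The following is a minimal sketch that reuses the names from the cells above; the replacement values are arbitrary.

```python
cafe = bytes('café', encoding='utf_8')
cafe_arr = bytearray(cafe)

first_item = cafe[0]      # indexing yields an int in range(256)
first_slice = cafe[:1]    # slicing yields a bytes object of length 1

# bytearray is mutable: replace the two-byte 'é' with a plain ASCII 'e'.
cafe_arr[-2:] = b'e'

# bytes is immutable, so item assignment raises TypeError.
try:
    cafe[0] = 67
except TypeError as exc:
    error = type(exc).__name__
```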

## Structs and Memory Views
The memoryview class does not let you create or store byte sequences. Instead, it provides shared-memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images, without copying the bytes.

In [1]:
import struct

with open('google.png', 'rb') as fp:
    img = memoryview(fp.read())

img

<memory at 0x0000000004BF3C48>

In [3]:
fmt = '<3s3sHH'
header = img[:10]
bytes(header)

b'\x89PNG\r\n\x1a\n\x00\x00'

In [4]:
struct.unpack(fmt, header) 

(b'\x89PN', b'G\r\n', 2586, 0)

In [5]:
del header
del img
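The example above depends on a local google.png file. A self-contained sketch of the same idea follows, assuming a made-up 10-byte header built with struct.pack (the field values are invented for illustration). It also shows that a memoryview slice shares memory with the underlying buffer: writing through the slice changes the original bytearray.

```python
import struct

# Little-endian: two 3-byte strings followed by two unsigned 16-bit integers.
fmt = '<3s3sHH'

# A made-up header, stored in a mutable buffer.
raw = bytearray(struct.pack(fmt, b'GIF', b'89a', 555, 230))

view = memoryview(raw)
header = view[:struct.calcsize(fmt)]   # slicing a memoryview copies nothing

fields = struct.unpack(fmt, header)
print(fields)

# Writing through the slice modifies the underlying bytearray.
header[3:6] = b'87a'
shared = bytes(raw[3:6])

# Release the views explicitly so the buffer can be resized or freed.
header.release()
view.release()
```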

## Basic Encoders/Decoders
The Python distribution bundles more than 100 codecs (encoders/decoders) for converting text to bytes and vice versa.

In [10]:
# The alias table maps alternative names to codec modules, so this loop
# prints alias names only, not a complete list of codecs; see
# https://stackoverflow.com/questions/1728376/get-a-list-of-all-the-encodings-python-can-encode-to

from encodings.aliases import aliases
for index, value in enumerate(aliases):
    print(index, value)

0 646
1 ansi_x3.4_1968
2 ansi_x3_4_1968
3 ansi_x3.4_1986
4 cp367
5 csascii
6 ibm367
7 iso646_us
8 iso_646.irv_1991
9 iso_ir_6
10 us
11 us_ascii
12 base64
13 base_64
14 big5_tw
15 csbig5
16 big5_hkscs
17 hkscs
18 bz2
19 037
20 csibm037
21 ebcdic_cp_ca
22 ebcdic_cp_nl
23 ebcdic_cp_us
24 ebcdic_cp_wt
25 ibm037
26 ibm039
27 1026
28 csibm1026
29 ibm1026
30 1125
31 ibm1125
32 cp866u
33 ruscii
34 1140
35 ibm1140
36 1250
37 windows_1250
38 1251
39 windows_1251
40 1252
41 windows_1252
42 1253
43 windows_1253
44 1254
45 windows_1254
46 1255
47 windows_1255
48 1256
49 windows_1256
50 1257
51 windows_1257
52 1258
53 windows_1258
54 273
55 ibm273
56 csibm273
57 424
58 csibm424
59 ebcdic_cp_he
60 ibm424
61 437
62 cspc8codepage437
63 ibm437
64 500
65 csibm500
66 ebcdic_cp_be
67 ebcdic_cp_ch
68 ibm500
69 775
70 cspc775baltic
71 ibm775
72 850
73 cspc850multilingual
74 ibm850
75 852
76 cspcp852
77 ibm852
78 855
79 csibm855
80 ibm855
81 857
82 csibm857
83 ibm857
84 858
85 csibm858
86 ibm858
87 860
88 csi
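Because the alias table lists alternative names rather than codecs, a closer approximation is to list the modules of the encodings package, where each codec lives. This is only a sketch: the result can include modules that are not usable on every platform, and codecs registered by third-party packages will not appear.

```python
import codecs
import encodings
import pkgutil

# Each bundled codec is a module inside the `encodings` package;
# `aliases` is the alias table itself, not a codec.
module_names = sorted(
    module.name
    for module in pkgutil.iter_modules(encodings.__path__)
    if module.name != 'aliases'
)
print(len(module_names), module_names[:5])

# codecs.lookup resolves any accepted spelling to the codec's canonical name.
canonical = codecs.lookup('utf8').name
```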

In [11]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')


latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
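The output above shows that the same text takes a different number of bytes in each codec, and that utf_16 starts with two extra bytes. The sketch below measures this and checks that those two bytes are the byte-order mark (BOM). It compares against codecs.BOM_UTF16, the native-order BOM, so it does not assume a little-endian machine.

```python
import codecs

text = 'El Niño'
sizes = {codec: len(text.encode(codec)) for codec in ['latin_1', 'utf_8', 'utf_16']}
print(sizes)

# The generic 'utf_16' codec prepends a BOM in native byte order;
# the explicit 'utf_16_le' variant writes no BOM.
with_bom = text.encode('utf_16')
little_endian = text.encode('utf_16_le')
has_bom = with_bom.startswith(codecs.BOM_UTF16)
```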


## Understanding Encode/Decode Problems
### Coping with UnicodeEncodeError


In [13]:
city = 'São Paulo'
city

'São Paulo'

In [14]:
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [15]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [16]:
city.encode('iso8859_1') 

b'S\xe3o Paulo'

In [17]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [19]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [20]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [21]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'
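When none of the lossy handlers is acceptable, the exception itself says what went wrong. Below is a minimal sketch, assuming a hypothetical helper named encode_or_report that is not part of any library. It uses the start, end, and object attributes of UnicodeEncodeError to report the offending character, and it also shows the backslashreplace handler, which keeps the information as an escape sequence.

```python
def encode_or_report(text, codec):
    """Hypothetical helper: return (encoded_bytes, problem_description)."""
    try:
        return text.encode(codec), None
    except UnicodeEncodeError as exc:
        # exc.object is the input string; start and end delimit the bad slice.
        bad = exc.object[exc.start:exc.end]
        return None, f'cannot encode {bad!r} at position {exc.start} with {codec}'

city = 'São Paulo'
ok, ok_problem = encode_or_report(city, 'iso8859_1')
failed, problem = encode_or_report(city, 'cp437')
print(problem)

# backslashreplace substitutes an escape sequence instead of dropping data.
escaped = city.encode('cp437', errors='backslashreplace')
```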

### Coping with UnicodeDecodeError
Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.

In [22]:
octets = b'Montr\xe9al'

In [23]:
octets.decode('cp1252') 

'Montréal'

In [24]:
octets.decode('iso8859_7') 

'Montrιal'

In [25]:
octets.decode('koi8_r')

'MontrИal'

In [26]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
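One common way to cope is to try a strict codec first and fall back to a more permissive one. The sketch below assumes a hypothetical helper, decode_with_fallback, and an arbitrary candidate order. As the iso8859_7 and koi8_r cells show, a decode that raises no error is not proof that the codec was the right one, so treat the result as a guess.

```python
def decode_with_fallback(octets, codecs_to_try=('utf_8', 'cp1252')):
    """Hypothetical helper: return (text, codec) for the first clean decode."""
    for codec in codecs_to_try:
        try:
            return octets.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # latin_1 maps every byte to a code point, so this never raises,
    # but the result may be gibberish.
    return octets.decode('latin_1'), 'latin_1'

octets = b'Montr\xe9al'
text, codec = decode_with_fallback(octets)
print(text, codec)

# errors='replace' substitutes U+FFFD (the replacement character) for bad bytes.
lossy = octets.decode('utf_8', errors='replace')
```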

### SyntaxError When Loading Modules with Unexpected Encoding
UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2 (starting with 2.5). If you load a .py module containing non-UTF-8 data and no encoding declaration...

In [4]:
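def create_new_file(contents, file_name, encoding="cp1252"):
    # Mode "w" truncates an existing file, so no explicit removal is needed.
    # The encoding is explicit so the bytes on disk match the "# coding"
    # declaration instead of depending on the platform default.
    with open(file_name, "w", encoding=encoding) as new_file:
        new_file.write(contents)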

contents = '''
# coding: cp1252
print('Olá, Mundo!')
'''

create_new_file(contents, "working_file.py")

In [5]:
%%bash
cat working_file.py


# coding: cp1252
print('Olá, Mundo!')


In [6]:
! python working_file.py

Olá, Mundo!


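Now if you remove the encoding declaration, the interpreter falls back to its default source encoding and fails on the byte \xe1, which is the cp1252 encoding of 'á'. The 'Non-ASCII character' wording in the message below is what Python 2 prints; Python 3 complains about non-UTF-8 code instead.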

In [9]:
contents = contents.replace("# coding: cp1252", "")
create_new_file(contents, "working_file.py")

In [10]:
%%bash
cat working_file.py



print('Olá, Mundo!')


In [11]:
! python working_file.py

  File "working_file.py", line 3
SyntaxError: Non-ASCII character '\xe1' in file working_file.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
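The same behavior can be reproduced under Python 3 without touching the file system, because compile accepts source as bytes and applies the PEP 263 declaration rules. This is a sketch only: the file name passed to compile is just a label, and tokenize.detect_encoding is used to show which encoding the declaration selects.

```python
import io
import tokenize

source = "# coding: cp1252\nprint('Olá, Mundo!')\n"
raw = source.encode('cp1252')   # the bytes as they would sit on disk

# detect_encoding reads the declaration from the first two lines.
declared, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)

# With the declaration, the bytes compile without complaint.
compile(raw, 'working_file.py', 'exec')

# Without it, Python 3 assumes UTF-8 and rejects the lone byte 0xE1.
undeclared = "print('Olá, Mundo!')\n".encode('cp1252')
try:
    compile(undeclared, 'working_file.py', 'exec')
    outcome = 'compiled'
except SyntaxError:
    outcome = 'SyntaxError'
```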


## Handling Text Files
The best practice for handling text is the 'Unicode sandwich': decode bytes to str as early as possible on input, process the text exclusively as str, and encode str to bytes as late as possible on output.

In [12]:
fp = open('cafe.txt', 'w', encoding='utf_8')
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>

In [13]:
fp.write('café')

4

In [14]:
fp.close()

In [15]:
import os
os.stat('cafe.txt').st_size

5

In [16]:
fp2 = open('cafe.txt')
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>

In [17]:
fp2.encoding

'cp1252'

In [18]:
fp2.read()

'cafÃ©'

In [19]:
fp2.close()

In [20]:
fp3 = open('cafe.txt', encoding='utf_8')
fp3

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>

In [21]:
fp3.read()

'café'

In [23]:
fp3.close()
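The cells above can be condensed into one self-contained sketch of the sandwich. It assumes a temporary directory so that it leaves no files behind, and it uses pathlib's read_text and write_text with an explicit encoding in place of open.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cafe.txt'

    # Bottom slice: encode str to bytes only at the moment of writing.
    path.write_text('café', encoding='utf_8')
    size_on_disk = path.stat().st_size

    # Top slice: decode bytes to str as soon as they are read.
    text = path.read_text(encoding='utf_8')

    # Filling: the program logic works on str only.
    shouted = text.upper()

    # Reading with the wrong codec raises no error; it silently yields mojibake.
    wrong = path.read_text(encoding='cp1252')
```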

## Encoding Defaults: A Madhouse

In [24]:
import sys, locale
expressions = """
 locale.getpreferredencoding()
 type(my_file)
 my_file.encoding
 sys.stdout.isatty()
 sys.stdout.encoding
 sys.stdin.isatty()
 sys.stdin.encoding
 sys.stderr.isatty()
 sys.stderr.encoding
 sys.getdefaultencoding()
 sys.getfilesystemencoding()
 """

# No encoding argument on purpose: the point is to see the default.
with open('dummy', 'w') as my_file:
    for expression in expressions.split():
        value = eval(expression)
        print(expression.rjust(30), '->', repr(value))


 locale.getpreferredencoding() -> 'cp1252'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp1252'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'cp1252'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


Therefore, the best advice about encoding defaults is: do not rely on them. Follow the advice of the Unicode sandwich and be explicit about encodings.
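In practice, being explicit means passing encoding= every time a text file is opened. The sketch below assumes a hypothetical helper, describe_defaults, that gathers the machine-dependent values for logging, and then writes a file whose bytes do not depend on any of them.

```python
import locale
import sys
import tempfile
from pathlib import Path

def describe_defaults():
    """Hypothetical helper: collect defaults that vary between machines."""
    return {
        'locale.getpreferredencoding()': locale.getpreferredencoding(),
        'sys.getdefaultencoding()': sys.getdefaultencoding(),
        'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    }

for name, value in describe_defaults().items():
    print(name.rjust(30), '->', repr(value))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'dummy'
    # An explicit encoding makes the bytes on disk independent of the locale.
    with open(path, 'w', encoding='utf_8') as explicit_file:
        explicit_encoding = explicit_file.encoding
        explicit_file.write('café')
    on_disk = path.read_bytes()
```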

***