###### References: 
- https://docs.python.org/3/tutorial/datastructures.html   
- Fluent Python by Luciano Ramalho. Chapter 4: Text versus Bytes


# Character Issues
## Unicode 
The Unicode standard separates the identity of characters from specific byte representations.
* The identity of a character, its _code point_, is a number from 0 to 1,114,111 (0x10FFFF) in the Unicode standard.
* Code points are written as 4 to 6 hex digits with a `U+` prefix, e.g. U+0041 for `A`.
* The actual bytes depend on the _encoding_, e.g. `A` (U+0041) is encoded as a single byte in UTF-8, `\x41`, or as `\x41\x00` in UTF-16LE  encoding.
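
As a small sketch of the points above, the snippet below looks up the code point of `A`, formats it in `U+` notation, and encodes it with two different encodings. The variable names are illustrative only.

```python
# Code point versus encoded bytes for the character 'A' (U+0041)
char = 'A'
code_point = ord(char)                   # integer identity of the character
label = f'U+{code_point:04X}'            # conventional U+ notation

utf8_bytes = char.encode('utf_8')        # one byte in UTF-8
utf16le_bytes = char.encode('utf_16le')  # two bytes in UTF-16LE

print(label, utf8_bytes, utf16le_bytes)
print(ord('\U0010ffff'))                 # 1114111, the highest code point
```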

#### Encoding and decoding:

In [1]:
s = 'cafe'
len(s)

4

In [2]:
b = s.encode('utf8')
b

b'cafe'

In [3]:
len(b)

4

In [4]:
b.decode('utf8')

'cafe'

# Byte Essentials
## `bytes` or `bytearray`
Each item is an integer from 0 to 255, not a one-character string.

A `bytes` object is immutable; a `bytearray` object lets you modify its elements.
#### A four-byte sequence as `bytes` and `bytearray`

In [5]:
cafe = bytes('cafe', encoding='utf-8')
cafe

b'cafe'

In [6]:
cafe[0]

99

In [7]:
cafe[:1]

b'c'

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'cafe')

In [9]:
cafe_arr[1:]

bytearray(b'afe')
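
The difference in mutability between the two types can be sketched as follows. This is a minimal illustration with made-up variable names, not part of the original notebook run.

```python
frozen = bytes('cafe', encoding='utf-8')
mutable = bytearray(frozen)       # copies the four bytes

mutable[0] = ord('C')             # item assignment takes an int from 0 to 255
mutable.extend(b'!')              # a bytearray can also grow in place

try:
    frozen[0] = ord('C')          # bytes does not support item assignment
except TypeError as exc:
    error = str(exc)

print(mutable, frozen)
print(error)
```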

### Building bytes and  bytearray
#### `fromhex`

In [10]:
bytes.fromhex('31 48 CE A9')

b'1H\xce\xa9'

Other ways to build `bytes` or `bytearray` instances are to call their constructors with:
* A `str` and an `encoding` keyword argument
* An iterable providing items with values from 0 to 255
* An object that implements the buffer protocol (e.g. `bytes`, `bytearray`, `memoryview`, `array.array`); this copies the bytes from the source object to the newly created binary sequence
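
The constructor options listed above can be sketched in a few lines. The sample values are arbitrary and chosen only for illustration.

```python
import array

from_text = bytes('café', encoding='utf_8')            # str plus encoding
from_ints = bytes([99, 97, 102, 101])                  # iterable of ints in 0..255
from_buffer = bytearray(array.array('B', [1, 2, 3]))   # buffer-protocol object, copied

print(from_text, from_ints, from_buffer)
```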
####  Initializing bytes from the raw data of  an array

In [11]:
import array

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

Creating a `bytes` or `bytearray` from any buffer-like source always copies the bytes. In contrast, `memoryview` objects let you share memory between binary data structures.
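
A minimal sketch of the copy-versus-share distinction: the copy keeps the old contents after the source changes, while the `memoryview` reflects the change. Names are illustrative.

```python
source = bytearray(b'cafe')
copy = bytes(source)           # independent copy of the four bytes
view = memoryview(source)      # no copy: a window onto the buffer of source

source[0] = ord('C')           # mutate the original

shared = bytes(view)           # snapshot of what the view sees now
view.release()                 # release the buffer explicitly when done

print(copy, shared)
```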
# Structs and Memory Views
The `struct` module provides functions to parse packed bytes into a tuple of fields of different types, and to perform the opposite conversion.

The `memoryview` class does not let you create or store byte sequences; it provides shared-memory access to slices of data from other binary sequences and buffers, without copying the bytes.
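
To show both directions of the conversion without needing a file on disk, the sketch below packs a hypothetical 10-byte header with `struct.pack` and then unpacks it through a `memoryview` slice. The field values are assumptions chosen for illustration.

```python
import struct

fmt = '<3s3sHH'   # little-endian: two 3-byte fields, two unsigned 16-bit ints

# Hypothetical header values, packed here instead of being read from a file
packed = struct.pack(fmt, b'GIF', b'89a', 450, 215)

view = memoryview(packed)                  # no copy of the underlying bytes
header = view[:struct.calcsize(fmt)]       # slicing a memoryview does not copy either
fields = struct.unpack(fmt, header)

print(packed)
print(fields)
```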

####  Using  `memoryview` and `struct` to inspect a GIF image header

In [12]:
import struct
fmt = '<3s3sHH'  # struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers

with open('20220219/women_who_code.gif', 'rb') as fp:
    img = memoryview(fp.read())
    
header = img[:10]
bytes(header)

b'GIF89a\xc2\x01\xd7\x00'

In [13]:
struct.unpack(fmt, header) # type, version, width, height

(b'GIF', b'89a', 450, 215)

In [14]:
del header
del img

##### Ref: https://docs.python.org/3/library/mmap.html

# Basic Encoders / Decoders

The Python distribution bundles more than 100 _codecs_ for text to byte conversion and vice versa.

In [15]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, "El Niño".encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


## Understanding Encode/Decode Problems
###  Coping with UnicodeEncodeError

In [16]:
city = "São Paulo"
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [17]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [18]:
city.encode('iso8859_1')  # "Latin alphabet no. 1"

b'S\xe3o Paulo'

In [19]:
city.encode('cp437')  # character set of the original IBM PC 

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [20]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [21]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [22]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

###  Coping with UnicodeDecodeError

In [23]:
octets = b'Monter\xe9al'
octets.decode('cp1252')  # Windows-1252 is a single-byte character encoding of the Latin alphabet

'Monteréal'

In [24]:
octets.decode('iso8859_7')  # Part 7: Latin/Greek alphabet

'Monterιal'

In [25]:
octets.decode('koi8_r')

'MonterИal'

In [26]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 6: invalid continuation byte

In [27]:
octets.decode('utf_8', errors='replace')

'Monter�al'

### SyntaxError

In [28]:
%run 20220219/hello.py

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 20: character maps to <undefined>

## Discover Encoding
### Chardet
Chardet is a character encoding detector: it guesses the encoding of a byte sequence from heuristics and statistics.

https://pypi.python.org/pypi/chardet

### BOM
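A BOM (byte-order mark) tells the decoder the byte ordering of the encoded text.

The UTF-16 encoding prepends the text to be encoded with the invisible character `ZERO WIDTH NO-BREAK SPACE` (U+FEFF).

On a little-endian system, the BOM is encoded as `b'\xff\xfe'`.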

In [29]:
u16 = 'Olá Mundo'.encode('utf_16')
u16

b'\xff\xfeO\x00l\x00\xe1\x00 \x00M\x00u\x00n\x00d\x00o\x00'

In [30]:
u16le = 'Olá Mundo'.encode('utf_16le')
list(u16le)

[79, 0, 108, 0, 225, 0, 32, 0, 77, 0, 117, 0, 110, 0, 100, 0, 111, 0]

In [31]:
u16be = 'Olá Mundo'.encode('utf_16be')
list(u16be)

[0, 79, 0, 108, 0, 225, 0, 32, 0, 77, 0, 117, 0, 110, 0, 100, 0, 111]
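
As a hedged sketch of how the BOM behaves in practice, the snippet below checks for either UTF-16 BOM using the constants in the standard `codecs` module, and shows that the generic `utf_16` codec consumes the BOM on decoding while the explicit little-endian codec never writes one.

```python
import codecs

u16 = 'Olá Mundo'.encode('utf_16')             # BOM followed by native byte order
has_le_bom = u16.startswith(codecs.BOM_UTF16_LE)
has_be_bom = u16.startswith(codecs.BOM_UTF16_BE)

text = u16.decode('utf_16')                    # the codec consumes the BOM
explicit = 'Olá Mundo'.encode('utf_16le')      # explicit byte order: no BOM

print(has_le_bom or has_be_bom, repr(text), explicit[:2])
```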

# Handling Text Files
##  The Unicode Sandwich
* Decode bytes on input
* Process text only
* Encode text on output
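
The three layers of the sandwich can be sketched as one small round trip. The file name and the `upper()` processing step are placeholders; the point is that bytes appear only at the edges and an explicit encoding is passed every time.

```python
import os
import tempfile

raw = 'café'.encode('utf_8')               # stands in for bytes arriving from outside

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cafe.txt')   # hypothetical file name
    with open(path, 'wb') as fp:
        fp.write(raw)

    # Bottom slice: decode bytes on input, with an explicit encoding
    with open(path, encoding='utf_8') as fp:
        text = fp.read()

    # Filling: process str only
    shouted = text.upper()

    # Top slice: encode text on output, again with an explicit encoding
    with open(path, 'w', encoding='utf_8') as fp:
        fp.write(shouted)

    with open(path, 'rb') as fp:
        result = fp.read()

print(repr(shouted), result)
```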

In [32]:
open('cafe.txt', 'w', encoding='utf_8').write('café')

4

In [33]:
open('cafe.txt', encoding='cp1252').read()

'cafÃ©'

In [34]:
fp = open('cafe.txt')
fp

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='UTF-8'>

In [35]:
fp.close()

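Characters written versus encoded file size: `write` returned 4 (the number of characters), but the file holds 5 bytes because UTF-8 encodes `é` as two bytes.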

In [36]:
import os

os.stat('cafe.txt').st_size

5

In [37]:
fp2 = open('cafe.txt')
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='UTF-8'>

In [38]:
fp2.encoding

'UTF-8'

Reading in binary mode:

In [39]:
fp3 = open('cafe.txt', 'rb')
fp3

<_io.BufferedReader name='cafe.txt'>

In [40]:
fp3.read()

b'caf\xc3\xa9'

## Encoding defaults

In [41]:
import sys, locale

In [42]:
expressions  = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
        """

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '->', repr(value))
    

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'UTF-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


# Normalizing Unicode for saner comparison

In [43]:
s1 = 'café'
s2 = 'cafe\u0301'
s1, s2

('café', 'café')

In [44]:
len(s1), len(s2)

(4, 5)

In [45]:
from unicodedata import normalize

In [46]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [47]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [48]:
normalize('NFC', s1) == normalize('NFC', s2)

True

In [49]:
normalize('NFD', s1) == normalize('NFD', s2)

True

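Some single characters are normalized by NFC into a different single character that looks the same, e.g. the ohm sign becomes the Greek capital omega: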

In [50]:
from unicodedata import name

In [51]:
ohm = '\u2126'

In [52]:
name(ohm)

'OHM SIGN'

In [53]:
ohm_c  = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [54]:
ohm == ohm_c

False

In [55]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

### Compatibility in characters
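The compatibility forms NFKC and NFKD replace each compatibility character with its preferred equivalent, even if that changes formatting or loses information.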

In [56]:
half =  '½'
normalize('NFKC', half)

'1⁄2'

In [57]:
four_squared = '4²'
normalize('NFKC', four_squared)

'42'

In [58]:
micro = 'µ'
micro_kc = normalize('NFKC', micro)
micro, micro_kc

('µ', 'μ')

In [59]:
ord(micro), ord(micro_kc)

(181, 956)

In [60]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')

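Normalize with care: NFKC and NFKD can lose or distort information (`'4²'` becomes `'42'`), so they should not be used for permanent storage. They are useful for producing intermediate representations for indexing and searching.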
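
One way to follow that advice is to keep the original text untouched and derive a separate, lossy key for lookups only. The `index_key` helper below is a hypothetical name used for illustration.

```python
from unicodedata import normalize

def index_key(text):
    """Hypothetical helper: lossy key for indexing and search, not for storage."""
    return normalize('NFKC', text)

stored = '4\u00b2 = 16'          # the original text, kept as-is ('4² = 16')
key = index_key(stored)          # the superscript is flattened to a plain digit

print(repr(stored), repr(key))
print(index_key('\u00b5') == '\u03bc')   # MICRO SIGN maps to GREEK SMALL LETTER MU
```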

## Case folding
`s.casefold()` essentially converts all text to lowercase, with some additional transformations: for example, `µ` (MICRO SIGN) becomes `μ` (GREEK SMALL LETTER MU) and `ß` becomes `ss`.

In [61]:
name(micro)

'MICRO SIGN'

In [62]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

In [63]:
micro, micro_cf

('µ', 'μ')

In [64]:
eszett = 'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [65]:
eszett_cf = eszett.casefold()

In [66]:
eszett, eszett_cf

('ß', 'ss')

## Utility Functions for Normalized Text Matching
For most cases, `str.casefold()` is good for case-insensitive comparison.

If you work with text in many languages, a pair of functions like `nfc_equal` and `fold_equal` is useful.

In [67]:
def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())

In [68]:
s1, s2

('café', 'café')

In [69]:
s1 == s2

False

In [70]:
nfc_equal(s1,  s2)

True

In [71]:
nfc_equal('a', 'A')

False

In [72]:
fold_equal('a', 'A')

True

In [73]:
street1 = 'Straße'
street2 = 'strasse'
street1 == street2

False

In [74]:
nfc_equal(street1, street2)

False

In [75]:
fold_equal(street1, street2)

True

## Extreme "Normalization": Taking Out Diacritics
Diacritics are __marks placed above or below__ (or sometimes next to) a letter in a word to indicate a particular pronunciation (in regard to accent, tone, or stress) as well as meaning.

Google Search ignores diacritics (e.g. accents, cedillas) in certain contexts.

In [76]:
import unicodedata
import string

def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_text = unicodedata.normalize('NFD', txt)  # Decompose characters into base char and combining marks
    shaved = ''.join(c for c  in norm_text if not unicodedata.combining(c)) # Filter out combining marks
    return unicodedata.normalize('NFC', shaved) # Recompose all characters

In [77]:
order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
shave_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [78]:
greek = 'Ζέφυρος, Zéfiro'
shave_marks(greek)

'Ζεφυρος, Zefiro'

In [79]:
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)  
    latin_base = False
    keepers = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:   # Skip over combining marks when base char is Latin
            continue  # ignore diacritic on Latin base char
        keepers.append(c)                             # Otherwise keep current char
        # if it isn't combining char, it's a new base char
        if not unicodedata.combining(c):              # Detect new base char and determine if it's Latin
            latin_base = c in string.ascii_letters
    shaved = ''.join(keepers)
    return unicodedata.normalize('NFC', shaved)   # Recompose all characters

In [80]:
shave_marks_latin(greek)

'Ζέφυρος, Zefiro'

In [81]:
single_map = str.maketrans("""‚ƒ„†ˆ‹‘’“”•–—˜›""",  # Build mapping table for char-to-char replacement
                           """'f"*^<''""---~>""")

multi_map = str.maketrans({  # Build mapping table for char-to-string replacement
    '€': '<euro>',
    '…': '...',
    'Œ': 'OE',
    '™': '(TM)',
    'œ': 'oe',
    '‰': '<per mille>',
    '‡': '**',
})

multi_map.update(single_map)  # Merge mapping tables


def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences"""
    return txt.translate(multi_map)  # does not affect ASCII or latin1 text


def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt))     # Remove diacritical marks
    no_marks = no_marks.replace('ß', 'ss')          # We want to preserve the case
    return unicodedata.normalize('NFKC', no_marks)  # Recompose with compatibility codepoints

In [82]:
shave_marks_latin(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [83]:
dewinize(order)

'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'

In [84]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'

#  Sorting Unicode Text

In [85]:
fruits = ['Cupuaçu', 'Açaí', 'Acerola', 'Cajá', 'Caju']
sorted(fruits)

['Acerola', 'Açaí', 'Caju', 'Cajá', 'Cupuaçu']

In [86]:
import locale

In [87]:
locale.setlocale(locale.LC_COLLATE,  'pt_BR.UTF-8')

'pt_BR.UTF-8'

In [88]:
sorted(fruits, key=locale.strxfrm)

['Açaí', 'Acerola', 'Cajá', 'Caju', 'Cupuaçu']

## Sorting with Unicode Collation Algorithm

In [89]:
import pyuca

In [90]:
coll = pyuca.Collator()

In [91]:
sorted(fruits, key=coll.sort_key)

['Açaí', 'Acerola', 'Cajá', 'Caju', 'Cupuaçu']
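
When neither a suitable locale nor `pyuca` is available, a rough approximation is to sort on the text with combining marks removed. This is only a sketch under that assumption: it is not the Unicode Collation Algorithm, and `rough_sort_key` is a made-up helper name.

```python
import unicodedata

def rough_sort_key(text):
    """Approximate key: strip combining marks, case-fold, break ties on the original."""
    decomposed = unicodedata.normalize('NFD', text)
    base = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), text)

fruits = ['Cupuaçu', 'Açaí', 'Acerola', 'Cajá', 'Caju']
ordered = sorted(fruits, key=rough_sort_key)
print(ordered)
```

For this list the result agrees with the locale-based and `pyuca` orderings shown above, but that agreement is not guaranteed for other languages or inputs.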

# The Unicode Database

In [92]:
import unicodedata
import re

In [93]:
re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

In [94]:
for char in sample:
    print('U+%04x' % ord(char),                       # Code point in U+0000 format
          char.center(6),                             # Character centered in a str of length 6
          're_dig' if re_digit.match(char) else '-',  # Show re_dig if character matches the regex
          'isdig' if char.isdigit() else '-',         # Show isdig if char.isdigit()
          'isnum' if char.isnumeric() else '-',       # Show isnum if char.isnumeric()
          format(unicodedata.numeric(char), '5.2f'),  # Numeric value formatted with width 5 and 2 decimal places
          unicodedata.name(char),                     # Unicode character name
          sep='\t')

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


# Dual-Mode str and bytes APIs
## `str` Versus `bytes` in Regular Expressions
The Hardy–Ramanujan number, 1729, is the smallest number expressible as the sum of two cubes in two different ways. The example below searches for it in a sentence.

In [95]:
re_numbers_str = re.compile(r'\d+')     # String types
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')  # Byte types
re_words_bytes = re.compile(rb'\w+')

text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"  # Unicode text to search, containing Tamil digits 1729
            " as 1729 = 1³ + 12³ = 9³ + 10³.")        # String literal concatenation

text_bytes = text_str.encode('utf_8')  # bytes string is needed to search with the bytes regular expression

print('Text', repr(text_str), sep='\n  ')
print('Numbers')
print('  str  :', re_numbers_str.findall(text_str))      # The str pattern r'\d+' matches the Tamil and ASCII
print('  bytes:', re_numbers_bytes.findall(text_bytes))  # The bytes pattern rb'\d+' matches only ASCII digits
print('Words')
print('  str  :', re_words_str.findall(text_str))        # The str pattern r'\w+' matches letters, superscripts, Tamil, and ASCII digits.
print('  bytes:', re_words_bytes.findall(text_bytes))    # The bytes pattern rb'\w+' matches only ASCII bytes for letters and  digits.

Text
  'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
  str  : ['௧௭௨௯', '1729', '1', '12', '9', '10']
  bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
  str  : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
  bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']


Regular expressions can be used on `str` as well as `bytes`.

With `bytes` patterns, bytes outside the ASCII range are treated as nondigit and nonword characters.
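
A `str` pattern can be given the same ASCII-only behavior with the `re.ASCII` flag. The sketch below uses a shortened version of the sample sentence to compare the default and ASCII-only matching.

```python
import re

text_str = 'Ramanujan saw \u0be7\u0bed\u0be8\u0bef as 1729 = 1³ + 12³.'

unicode_digits = re.compile(r'\d+').findall(text_str)          # Tamil and ASCII digits
ascii_digits = re.compile(r'\d+', re.ASCII).findall(text_str)  # ASCII digits only
ascii_words = re.compile(r'\w+', re.ASCII).findall(text_str)   # ASCII letters and digits only

print(unicode_digits)
print(ascii_digits)
print(ascii_words)
```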

## `str` Versus `bytes` on `os` Functions

`listdir` with `str` and `bytes` arguments and results:

In [96]:
os.listdir('20220219/')

['hello.py', 'abc.txt', 'digits-of-π.txt', 'women_who_code.gif']

In [97]:
os.listdir(b'20220219/')

[b'hello.py', b'abc.txt', b'digits-of-\xcf\x80.txt', b'women_who_code.gif']
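
To convert between the two forms of a file name by hand, the `os` module provides `os.fsdecode` and `os.fsencode`, which use the filesystem encoding and error handler. A minimal sketch, reusing one of the byte names from the listing above:

```python
import os

raw_name = b'digits-of-\xcf\x80.txt'   # bytes, as returned by os.listdir with a bytes argument
name = os.fsdecode(raw_name)           # bytes -> str using the filesystem encoding
round_trip = os.fsencode(name)         # str -> bytes, reversing the step above

print(repr(name), round_trip == raw_name)
```

The decoded `str` depends on the filesystem encoding of the machine; on a UTF-8 system it is `'digits-of-π.txt'`.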