# Converting Unicode strings to UTF-8 byte sequences and vice versa
Learning goals:
  + Understand how decoding of UTF-8 `bytes` objects as Unicode `str` objects works.
  + Understand how encoding of `str` objects as UTF-8 byte objects works.
  + Understand how reading and writing files with explicit encodings works.
  + Understand how to use the `unicodedata` module to determine a character's class and official name.


Byte sequences store raw bytes and can therefore hold any data, including UTF-8 encoded text. However, a `bytes` literal may contain only ASCII characters, so non-ASCII bytes must be written as escapes such as `\xc3`.

In [None]:
byte_text = b'B\xc3\xa4h'
print(byte_text)

When a byte sequence is converted to a list, it splits into integers, each being the decimal value of one byte.

In [None]:
type(byte_text), list(byte_text), bytes([66, 195, 164, 104])
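The integer view of a byte sequence also shows up when indexing. The following minimal sketch (variable names are illustrative) contrasts indexing with slicing and compares the byte count with the character count:

```python
byte_text = b'B\xc3\xa4h'

# Indexing a bytes object yields an int, slicing yields bytes again.
first_byte = byte_text[0]
first_slice = byte_text[0:1]

# "ä" occupies two bytes in UTF-8, so there are four bytes
# although the decoded text has only three characters.
n_bytes = len(byte_text)
n_chars = len(byte_text.decode('utf-8'))

print(first_byte, first_slice, n_bytes, n_chars)
```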

Explicit decoding of UTF-8 byte sequences

In [None]:
print(repr(byte_text.decode('utf-8')))
byte_text.decode('utf-8')
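Decoding only works if the bytes really are valid in the chosen encoding. As a hedged illustration, the sketch below tries to decode a Latin-1 byte sequence (an assumed example input, not taken from the cells above) as UTF-8 and shows the effect of the `errors` argument of `bytes.decode()`:

```python
latin1_bytes = b'B\xe4h'  # "Bäh" encoded as ISO Latin-1, not UTF-8

# Strict decoding (the default) raises an exception on invalid input.
try:
    latin1_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print("Cannot decode:", err.reason)

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = latin1_bytes.decode('utf-8', errors='replace')

# Decoding with the matching codec recovers the text.
correct = latin1_bytes.decode('latin-1')
print(repr(replaced), repr(correct))
```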

From Unicode to UTF-8

In [None]:
unicode_text = 'Bäh'
unicode_text

Explicit encoding of Unicode characters as a UTF-8 byte sequence

In [None]:
print(unicode_text.encode('utf-8'))
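The result of encoding depends on the codec. This small sketch, with illustrative variable names, encodes the same string with three codecs and checks that a UTF-8 round trip returns the original text:

```python
unicode_text = 'Bäh'

utf8_bytes = unicode_text.encode('utf-8')      # "ä" becomes two bytes
latin1_bytes = unicode_text.encode('latin-1')  # "ä" becomes one byte

# ASCII has no code for "ä", so strict encoding fails.
try:
    unicode_text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the same codec restores the original string.
round_trip = utf8_bytes.decode('utf-8')
print(utf8_bytes, latin1_bytes, ascii_ok, round_trip)
```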

## Converting files

Create an ISO Latin-1 encoded file. We can use `print()` with its optional `file` argument.

In [None]:
with open("baeh-l1.txt", "w", encoding="l1") as f:
    print(unicode_text, file=f)

The Latin-1 encoded file content looks wrong when the terminal interprets its bytes as UTF-8.

In [None]:
! cat baeh-l1.txt
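To see exactly which bytes ended up on disk, independent of the terminal's encoding, a file can be opened in binary mode. The sketch below is self-contained and therefore writes its own file into a temporary directory (an assumption made here so that nothing in the working directory is touched):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "baeh-l1.txt")

    # Write the text as ISO Latin-1; newline="\n" keeps the line
    # ending identical on all platforms.
    with open(path, "w", encoding="latin-1", newline="\n") as f:
        print('Bäh', file=f)

    # Binary mode returns the raw bytes without any decoding.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # "ä" is stored as the single byte 0xe4
```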

Convert the ISO Latin-1 file to uppercased UTF-8 using the `write()` method of file objects (which accepts only `str` arguments, in contrast to `print()`).

In [None]:
# Decode from l1 encoded file into unicode strings
f = open("baeh-l1.txt", "r", encoding="l1")

# Encode unicode strings into UTF-8 encoded file
g = open("BAEH-l1-encoded-as-utf8.txt", "w", encoding="utf-8")

for line in f:
    g.write(line.upper())
    
    # some diagnostic output
    print("Type:", type(line))
    print("Canonical:", "==>",repr(line.upper()), "<==")

f.close()
g.close()



Now look at the UTF-8 encoded file

In [None]:
! cat BAEH-l1-encoded-as-utf8.txt
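The same conversion can be packaged as a function that uses `with` blocks, so both files are closed even if an error occurs. This is only a sketch: the helper name `convert_to_upper_utf8` and its parameters are hypothetical, and the demo works in a temporary directory.

```python
import os
import tempfile


def convert_to_upper_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Hypothetical helper: re-encode a text file as uppercased UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(line.upper())


with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "baeh-l1.txt")
    dst_path = os.path.join(tmp_dir, "baeh-utf8.txt")

    # Prepare a Latin-1 input file.
    with open(src_path, "w", encoding="latin-1", newline="\n") as f:
        f.write("Bäh\n")

    convert_to_upper_utf8(src_path, dst_path)

    # Inspect the raw bytes of the result.
    with open(dst_path, "rb") as f:
        converted = f.read()

print(converted)  # "Ä" is stored as the two bytes 0xc3 0x84
```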

## Unicode data in Python
The module `unicodedata` gives access to the Unicode Character Database, for example to a character's general category and official name.

In [None]:
import unicodedata
utfstr = '1a* äöü.'

for c in utfstr:
    print(c, "Cat:", unicodedata.category(c))
    print(c, "Name:", unicodedata.name(c))
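Not every character has a name: control characters such as the newline make `unicodedata.name()` raise a `ValueError` unless a default is passed as the second argument. The sketch below wraps both lookups in a hypothetical helper `describe` and uses an arbitrary placeholder string as the default:

```python
import unicodedata


def describe(char):
    """Hypothetical helper: return (category, name) for one character."""
    # The second argument is returned instead of raising ValueError
    # when the character has no name.
    return unicodedata.category(char), unicodedata.name(char, "<no name>")


for c in 'ä*\n':
    print(repr(c), describe(c))
```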


## Using Unicode Class Codes in Regular Expressions
The third-party module `regex` offers more powerful regular expression functionality while providing the same functions as the standard library module `re`.

In [None]:
! pip install regex  # if you run it locally

In [None]:
import regex

The notation `\p{UNICODECATEGORY}` matches characters by their Unicode general category:
 + `P` for any punctuation character from any script
 + `Po` for other punctuation. See https://www.compart.com/en/unicode/category/

In [None]:
regex.sub(r'\p{P}+',' ',"Oh... What?!?")
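For comparison, a similar effect can be approximated with the standard library alone. The sketch below does not use `regex` at all: it swaps in `unicodedata.category()` and `itertools.groupby()` to replace each run of punctuation characters with a single space. The function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_punctuation(char):
    # All punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) start with "P".
    return unicodedata.category(char).startswith('P')


def replace_punctuation(text, replacement=' '):
    """Hypothetical helper: replace each punctuation run with `replacement`."""
    parts = []
    for punct, group in groupby(text, key=is_punctuation):
        parts.append(replacement if punct else ''.join(group))
    return ''.join(parts)


print(repr(replace_punctuation("Oh... What?!?")))
```

This is slower than a compiled pattern and only covers the single case of category matching, so it is meant as an illustration of what `\p{P}+` selects rather than as a replacement for `regex`.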