# Lecture 8: Text encoding, text formats, command-line interfaces

**NOTE:** All of the content of this lecture is in the PDF slides. This notebook only contains the code snippets and exercises.

Some of the code in this notebook requires example files, which can be downloaded from OLAT.

## Text encoding

### What's in a string?

In [1]:
def analyze_string(string, encoding):
    """Print each character of `string` aligned with its code point and its encoded bytes."""
    chars_encoded = [(char, char.encode(encoding)) for char in string]
    chars = []
    byte_values = []  # not named `bytes`, to avoid shadowing the built-in type
    for char, encoded in chars_encoded:
        for byte in encoded:
            # Show the character only above its first byte; later bytes get a blank column
            chars.append(char)
            char = None
            byte_values.append(byte)
    blank = " " * 8
    print("         Characters:", " ".join(f"{char:<8}" if char else blank for char in chars))
    print("Unicode code points:", " ".join(f"{ord(char):<8}" if char else blank for char in chars))
    print("Encoded bytes (bin):", " ".join(f"{byte:08b}" for byte in byte_values))
    print("Encoded bytes (hex):", " ".join(f"{byte:02x}".ljust(8) for byte in byte_values))

In [2]:
analyze_string("Hello", "ASCII")

         Characters: H        e        l        l        o       
Unicode code points: 72       101      108      108      111     
Encoded bytes (bin): 01001000 01100101 01101100 01101100 01101111
Encoded bytes (hex): 48       65       6c       6c       6f      


### Comparing encodings

In [None]:
print("Latin-1:")
analyze_string("hä", "Latin-1")
print()
print("UTF-8:")
analyze_string("hä", "utf-8")
print()
print("UTF-16:")
analyze_string("hä", "utf-16be")
print()
print("UTF-32:")
analyze_string("hä", "utf-32be")
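
The comparison above can also be summarised as byte counts. The following is a minimal sketch; `encoded_lengths` is a hypothetical helper, not part of the lecture code, and the string is written with an escape so that it is certain to contain the precomposed `ä` (U+00E4).

```python
def encoded_lengths(text, encodings=("latin-1", "utf-8", "utf-16be", "utf-32be")):
    """Map each encoding name to the number of bytes it needs for `text`."""
    return {encoding: len(text.encode(encoding)) for encoding in encodings}


# "h\u00e4" is "hä" with a precomposed ä
lengths = encoded_lengths("h\u00e4")
for encoding, n_bytes in lengths.items():
    print(f"{encoding:<9} {n_bytes} bytes")
```

Latin-1 uses one byte per character, UTF-8 a variable number (one for `h`, two for `ä`), UTF-16 two bytes for both characters here, and UTF-32 always four.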

### Encoding/decoding strings

In [None]:
my_string = "Hi! 🤓"
my_bytes = my_string.encode("utf-8")
my_bytes

In [None]:
my_bytes.decode("utf-8")
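
Encoding can fail when the target encoding has no representation for a character. The sketch below shows the default strict behaviour and two of the standard `errors` handlers; the variable names are chosen for illustration only.

```python
text = "Hi! \U0001F913"  # the same string as above, with the emoji written as an escape
encoded = text.encode("utf-8")

# ASCII cannot represent the emoji, so the default (strict) handler raises
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("strict:", error.reason)

# Alternative handlers trade correctness for robustness
replaced = text.encode("ascii", errors="replace")          # emoji becomes "?"
escaped = text.encode("ascii", errors="backslashreplace")  # emoji becomes an escape sequence
print(replaced)
print(escaped)
```

Both alternative handlers lose or change information, so they are better suited to logging and debugging than to storing data.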

### Opening and reading files

#### Text mode

In [None]:
my_file = open("my_file.txt")  # example file from OLAT, placed next to this notebook
my_file

In [None]:
my_file.read()

In [None]:
# Wrong encoding! ASCII only covers bytes 0-127, so reading raises
# UnicodeDecodeError if the file contains any non-ASCII bytes.
my_file = open("my_file.txt", encoding="ascii")
my_file.read()

In [None]:
# Wrong encoding! Latin-1 can decode any byte sequence, so this never raises,
# but non-ASCII characters may come out garbled.
my_file = open("my_file.txt", encoding="latin-1")
my_file.read()
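
The two failure modes can be reproduced without a file. This sketch assumes UTF-8 encoded input (the notebook does not state the encoding of `my_file.txt`) and uses an in-memory byte string instead.

```python
original = "h\u00e4"                    # "hä"
utf8_bytes = original.encode("utf-8")   # b'h\xc3\xa4'

# Failure mode 1: ASCII refuses bytes above 127
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Failure mode 2: Latin-1 silently maps every byte to some character
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# As long as nothing else has changed, the damage is reversible
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The silent failure is the more dangerous one, because the program keeps running with wrong text.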

#### Byte mode

In [None]:
my_file = open("my_file.txt", "rb")
my_file

In [None]:
my_file.read()
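
To see the difference between the two modes side by side, the following sketch writes a small temporary file and reads it back in both. The file name `demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write("h\u00e4")
    # Text mode: bytes are decoded, the result is a str
    with open(path, encoding="utf-8") as infile:
        as_text = infile.read()
    # Byte mode: no decoding, the result is a bytes object
    with open(path, "rb") as infile:
        as_bytes = infile.read()

print(type(as_text).__name__, len(as_text), repr(as_text))
print(type(as_bytes).__name__, len(as_bytes), repr(as_bytes))
```

Using `with` also closes each file automatically, which the short cells above leave out for brevity.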

### Unicode normal forms

In [None]:
import unicodedata

word = "hétérogénéité"
nfd_normalized = unicodedata.normalize("NFD", word)
print("NFD:", nfd_normalized, len(nfd_normalized))
nfc_normalized = unicodedata.normalize("NFC", word)
print("NFC:", nfc_normalized, len(nfc_normalized))
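
One practical consequence is that two strings can look identical and still compare as different. A minimal sketch, with `same_text` as a hypothetical helper name:

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)  # the raw strings differ


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both into the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(composed, decomposed))
```

Normalizing both sides before comparing (or before building a vocabulary) avoids treating the two spellings as different words.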

### Fun with emojis

In [None]:
PERSON = "\U0001F9D1"
MAN = "\U0001F468"
WOMAN = "\U0001F469"
SKIN_TONE_1 = "\U0001F3FB"
SKIN_TONE_2 = "\U0001F3FC"
SKIN_TONE_3 = "\U0001F3FD"
SKIN_TONE_4 = "\U0001F3FE"
SKIN_TONE_5 = "\U0001F3FF"
ZERO_WIDTH_JOINER = "\U0000200D"
RIGHTWARDS_HAND = "\U0001FAF1"
LEFTWARDS_HAND = "\U0001FAF2"
HEART = "\U00002764"
KISS = "\U0001F48B"
RED_HAIR = "\U0001F9B0"
BLOND_HAIR = "\U0001F471"
BALD = "\U0001F9B2"
WHITE_HAIR = "\U0001F9B3"

print(MAN + ZERO_WIDTH_JOINER + RED_HAIR)
print(WOMAN + ZERO_WIDTH_JOINER + BALD)
print(RIGHTWARDS_HAND + SKIN_TONE_2 + ZERO_WIDTH_JOINER + LEFTWARDS_HAND + SKIN_TONE_3)
print(WOMAN + SKIN_TONE_5 + ZERO_WIDTH_JOINER + HEART + ZERO_WIDTH_JOINER + KISS + ZERO_WIDTH_JOINER + MAN + SKIN_TONE_1)
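
What is displayed as one emoji is still several code points for Python. The sketch below repeats three of the constants from above so that it runs on its own, and inspects one of the joined sequences.

```python
import unicodedata

WOMAN = "\U0001F469"
ZERO_WIDTH_JOINER = "\u200D"
BALD = "\U0001F9B2"

sequence = WOMAN + ZERO_WIDTH_JOINER + BALD

print("code points:", len(sequence))
print("UTF-8 bytes:", len(sequence.encode("utf-8")))
for char in sequence:
    # The fallback covers Python versions whose Unicode database lacks a name
    print(f"U+{ord(char):04X}", unicodedata.name(char, "<no name in this Unicode version>"))
```

`len()` counts code points, not what a reader perceives as characters, which matters for slicing and for length limits.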

## Text-based data formats

### CSV

In [None]:
import csv

#### Reading CSV files

In [None]:
with open("tweets.csv", newline="", encoding="utf-8") as infile:
    reader = csv.reader(infile)
    for row in reader:
        print(row)

In [None]:
with open("tweets.csv", newline="", encoding="utf-8") as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        print(row)

#### Writing CSV files

In [None]:
with open("new.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "age"])
    writer.writerow(["Martha", 36])
    writer.writerow(["Carl", 19])

In [None]:
with open("new.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, ["name", "age"])
    writer.writeheader()
    writer.writerow({"name": "Martha", "age": 36})
    writer.writerow({"name": "Carl", "age": 19})
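
The `csv` module takes care of quoting, which is the main reason not to build CSV lines with `",".join(...)`. The following sketch round-trips one row through an in-memory buffer; the `comment` column and its value are invented for the example.

```python
import csv
import io

# An in-memory text buffer stands in for a file opened with newline=""
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "comment"])
writer.writeheader()
writer.writerow({"name": "Martha", "age": 36, "comment": 'said "hi", then left'})

csv_text = buffer.getvalue()
print(csv_text)

# Reading back: quoting is undone, but every value is returned as a string
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(rows)
```

The field containing a comma and quotes is wrapped in quotes on output, and the number `36` comes back as the string `"36"`, so numeric columns need an explicit conversion after reading.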

### JSON

In [None]:
import json

#### Reading JSON files

In [None]:
with open("tweets.json", encoding="utf-8") as infile:
    data = json.load(infile)
data

#### Writing JSON

In [None]:
# Write to a new file so that the example data in tweets.json is not overwritten
with open("new.json", "w", encoding="utf-8") as outfile:
    data = {"example": [1, 2, 3]}
    json.dump(data, outfile)
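
By default, `json.dump` and `json.dumps` escape all non-ASCII characters. The sketch below contrasts this with `ensure_ascii=False`; the sample dictionary is made up for illustration.

```python
import json

data = {"text": "Gr\u00fcezi \U0001F913", "likes": 3}

escaped = json.dumps(data)                                   # ASCII only, uses \u escapes
readable = json.dumps(data, ensure_ascii=False, indent=2)    # keeps characters, pretty-printed

print(escaped)
print(readable)

# Both variants decode to the same Python object
print(json.loads(escaped) == json.loads(readable) == data)
```

With `ensure_ascii=False` the output file must be opened with an encoding that can represent the characters, for example `encoding="utf-8"` as in the cell above.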

### XML

In [None]:
import xml.etree.ElementTree as ET

#### Reading XML files

In [None]:
tree = ET.parse("tweets.xml")
tree

In [None]:
texts = tree.findall("./tweet/text")
texts

In [None]:
texts[-1].text
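
The same calls work on XML held in a string, which is handy for experimenting. The structure below is only an assumption derived from the path `./tweet/text`; the actual `tweets.xml` may contain more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a tweets file
xml_source = """<tweets>
  <tweet id="1"><text>First tweet</text></tweet>
  <tweet id="2"><text>Second &amp; last tweet</text></tweet>
</tweets>"""

root = ET.fromstring(xml_source)

# findall() returns Element objects; .text holds the (unescaped) content
texts = [element.text for element in root.findall("./tweet/text")]
# Attributes are read with .get()
ids = [tweet.get("id") for tweet in root.iter("tweet")]

print(texts)
print(ids)
```

Note that `ET.fromstring` returns the root element directly, whereas `ET.parse` returns an `ElementTree` object.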

#### Writing XML files

In [None]:
root = ET.Element("examples")
ET.SubElement(root, "example")
subelement = ET.SubElement(root, "example", id="123")
subelement.text = "Content! <:-)"
tree = ET.ElementTree(root)
with open("new.xml", "wb") as outfile:
    tree.write(outfile, xml_declaration=True, encoding="utf-8")
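
The text `Content! <:-)` contains a `<`, which has a special meaning in XML. A short sketch shows that ElementTree escapes it when serializing and restores it when parsing:

```python
import xml.etree.ElementTree as ET

root = ET.Element("examples")
example = ET.SubElement(root, "example", id="123")
example.text = "Content! <:-)"

# encoding="unicode" returns a str instead of bytes
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Parsing the serialized form gives the original text back
reparsed = ET.fromstring(serialized)
print(reparsed.find("example").text)
```

This is why XML should be written with a library rather than assembled with string formatting.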

## Exercise: Extracting information from XML

[*The ArchiMob Corpus*](https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html) is a collection of transcribed texts in Swiss German.

### Downloading the XML file

In [None]:
from urllib.request import urlretrieve

urlretrieve("https://drive.switch.ch/index.php/s/vYZv9sNKetuPYTn/download?path=%2F&files=1044.xml", "archimob_1044.xml")

### Parsing the XML file

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('archimob_1044.xml')

# Use TEI as the default namespace (an empty prefix requires Python 3.8 or newer)
ns = {"": "http://www.tei-c.org/ns/1.0"}

In [None]:
# Example: Find the <title> element, and get its text
tree.find("./teiHeader//title", ns).text
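
Why the namespace mapping is needed can be seen on a tiny inline document. The snippet is a made-up TEI fragment, not taken from the ArchiMob file, and it uses an explicit `tei` prefix, which also works on Python versions older than 3.8.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical minimal TEI document
tei_source = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc><titleStmt><title>Demo title</title></titleStmt></fileDesc></teiHeader>
</TEI>"""

root = ET.fromstring(tei_source)
print(root.tag)  # the namespace is part of every tag name

# Without a namespace, the path matches nothing
without_ns = root.find("./teiHeader//title")

# With a prefix mapping
ns = {"tei": TEI_NS}
with_prefix = root.find("./tei:teiHeader//tei:title", ns)

# Equivalent spelling with the namespace written out in braces
with_braces = root.find(f"./{{{TEI_NS}}}teiHeader//{{{TEI_NS}}}title")

print(without_ns, with_prefix.text, with_braces.text)
```

Forgetting the namespace is a common reason for `find` returning `None`, which then fails with an `AttributeError` on `.text`.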

### Finding the longest noun

In [None]:
# TODO: Your code here