# Lecture 8: Text encoding, text formats, command-line interfaces

**NOTE:** All of the content of this lecture is in the PDF slides. This notebook only contains the code snippets and exercises.

Some of the code in this notebook requires example files, which can be downloaded from OLAT.

## Text encoding

### What's in a string?

In [1]:
def analyze_string(string, encoding):
    """Print the characters of `string` aligned with their code points and encoded bytes."""
    chars = []  # one entry per byte: the character for its first byte, None for the rest
    encoded_bytes = []  # named to avoid shadowing the built-in `bytes` type
    for char in string:
        for index, byte in enumerate(char.encode(encoding)):
            chars.append(char if index == 0 else None)
            encoded_bytes.append(byte)
    blank = " " * 8
    print("         Characters:", " ".join(f"{char:<8}" if char else blank for char in chars))
    print("Unicode code points:", " ".join(f"{ord(char):<8}" if char else blank for char in chars))
    print("Encoded bytes (bin):", " ".join(f"{byte:08b}" for byte in encoded_bytes))
    print("Encoded bytes (hex):", " ".join(f"{byte:02x}".ljust(8) for byte in encoded_bytes))

In [2]:
analyze_string("Hello", "ASCII")

         Characters: H        e        l        l        o       
Unicode code points: 72       101      108      108      111     
Encoded bytes (bin): 01001000 01100101 01101100 01101100 01101111
Encoded bytes (hex): 48       65       6c       6c       6f      


### Comparing encodings

In [3]:
print("Latin-1:")
analyze_string("hä", "Latin-1")
print()
print("UTF-8:")
analyze_string("hä", "utf-8")
print()
print("UTF-16:")
analyze_string("hä", "utf-16be")
print()
print("UTF-32:")
analyze_string("hä", "utf-32be")

Latin-1:
         Characters: h        ä       
Unicode code points: 104      228     
Encoded bytes (bin): 01101000 11100100
Encoded bytes (hex): 68       e4      

UTF-8:
         Characters: h        ä                
Unicode code points: 104      228              
Encoded bytes (bin): 01101000 11000011 10100100
Encoded bytes (hex): 68       c3       a4      

UTF-16:
         Characters: h                 ä                
Unicode code points: 104               228              
Encoded bytes (bin): 00000000 01101000 00000000 11100100
Encoded bytes (hex): 00       68       00       e4      

UTF-32:
         Characters: h                                   ä                                  
Unicode code points: 104                                 228                                
Encoded bytes (bin): 00000000 00000000 00000000 01101000 00000000 00000000 00000000 11100100
Encoded bytes (hex): 00       00       00       68       00       00       00       e4      
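
The comparison above can also be summarized as byte counts. The following is a minimal sketch; the `sizes` and `with_bom` names are only illustrative. It also shows that the generic `"utf-16"` codec, unlike `"utf-16be"`, prepends a byte order mark (BOM).

```python
# Count how many bytes each encoding needs for the same two-character string.
text = "hä"
sizes = {enc: len(text.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16be", "utf-32be")}
print(sizes)

# Without an explicit byte order, Python prepends a two-byte BOM to UTF-16 output.
with_bom = text.encode("utf-16")
print(len(with_bom), with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

The `be` suffix used in the cell above fixes the byte order to big-endian, which is why no BOM appears in its output.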


### Encoding/decoding strings

In [39]:
my_string = "Hi! 🤓"
my_bytes = my_string.encode("utf-8")
print(my_bytes)

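# Indexing a bytes object returns an int: the value of the first byte
# (72, which for ASCII characters equals the code point of "H")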
print(my_bytes[0])

print(type(my_bytes))

b'Hi! \xf0\x9f\xa4\x93'
72
<class 'bytes'>


In [41]:
text = my_bytes.decode("utf-8")
print(text)

print(type(text))

Hi! 🤓
<class 'str'>
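
Decoding with a codec that cannot represent the bytes raises an error by default. As a small sketch, the `errors` argument of `bytes.decode` selects a different strategy; the variable names here are only illustrative.

```python
data = "Hi! 🤓".encode("utf-8")

# The default strategy ("strict") raises UnicodeDecodeError on invalid bytes.
try:
    data.decode("ascii")
except UnicodeDecodeError as error:
    print("strict:", error)

# "replace" substitutes U+FFFD for each undecodable byte; "ignore" drops them.
replaced = data.decode("ascii", errors="replace")
ignored = data.decode("ascii", errors="ignore")
print(repr(replaced))
print(repr(ignored))
```

Both strategies lose information, so they are best reserved for cases where the exact original text is not needed.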


### Opening and reading files

#### Text mode

In [6]:
my_file = open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt")
my_file

<_io.TextIOWrapper name='/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt' mode='r' encoding='utf-8'>

In [7]:
my_file.read()

'Hi! 🤓\n'

In [27]:
# Wrong encoding!
try:
    my_file = open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt", encoding="ascii")
    my_file.read()
except UnicodeDecodeError as e:
    print(e)

'ascii' codec can't decode byte 0xf0 in position 4: ordinal not in range(128)


In [43]:
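# Wrong encoding! Latin-1 maps every byte to some character,
# so no error is raised and the text is silently garbled.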
my_file = open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt", encoding="latin-1")
my_file.read()

'Hi! ð\x9f¤\x93\n'
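
Because Latin-1 maps each byte to exactly one character, this kind of garbling is reversible. A minimal sketch, assuming the text was UTF-8 that was wrongly decoded as Latin-1 (the variable names are illustrative):

```python
# Reproduce the garbled string: UTF-8 bytes wrongly decoded as Latin-1.
garbled = "Hi! 🤓\n".encode("utf-8").decode("latin-1")
print(repr(garbled))

# Undo the mistake: recover the original bytes, then decode them correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repr(repaired))
```

This only works when the wrong codec is lossless, as Latin-1 is; it would not recover text decoded with `errors="replace"` or `errors="ignore"`.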

#### Byte mode

In [10]:
my_file = open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt", "rb")
my_file

<_io.BufferedReader name='/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/my_file.txt'>

In [11]:
my_file.read()

b'Hi! \xf0\x9f\xa4\x93\n'
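
The cells above leave the file objects open. The sketch below reads the same content in both modes using `with`, which closes the file automatically. It writes a temporary stand-in file so that it runs without the OLAT example files; the path is hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_file.txt"  # temporary stand-in for the example file
    path.write_bytes("Hi! 🤓\n".encode("utf-8"))

    # Byte mode: no decoding, read() returns a bytes object.
    with open(path, "rb") as infile:
        raw = infile.read()

    # Text mode: read() decodes the bytes with the given encoding.
    with open(path, encoding="utf-8") as infile:
        text = infile.read()

print(raw)
print(raw.decode("utf-8") == text)
```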

### Unicode normal forms

In [12]:
import unicodedata

word = "hétérogénéité"
nfd_normalized = unicodedata.normalize("NFD", word) # use combining chars
print("NFD:", nfd_normalized, len(nfd_normalized))
nfc_normalized = unicodedata.normalize("NFC", word) # use single chars
print("NFC:", nfc_normalized, len(nfc_normalized))

NFD: hétérogénéité 18
NFC: hétérogénéité 13
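
One practical consequence: two strings that look identical can compare as unequal if they use different normal forms. A minimal sketch, where `same_text` is a hypothetical helper rather than a standard library function:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)  # the code point sequences differ


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))
```

Normalizing both sides to the same form before comparing, hashing, or counting avoids treating such variants as different words.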


### Fun with emojis

In [13]:
PERSON = "\U0001F9D1"
MAN = "\U0001F468"
WOMAN = "\U0001F469"
SKIN_TONE_1 = "\U0001F3FB"
SKIN_TONE_2 = "\U0001F3FC"
SKIN_TONE_3 = "\U0001F3FD"
SKIN_TONE_4 = "\U0001F3FE"
SKIN_TONE_5 = "\U0001F3FF"
ZERO_WIDTH_JOINER = "\U0000200D"
RIGHTWARDS_HAND = "\U0001FAF1"
LEFTWARDS_HAND = "\U0001FAF2"
HEART = "\U00002764"
KISS = "\U0001F48B"
RED_HAIR = "\U0001F9B0"
BLOND_HAIR = "\U0001F471"
BALD = "\U0001F9B2"
WHITE_HAIR = "\U0001F9B3"

print(MAN + ZERO_WIDTH_JOINER + RED_HAIR)
print(WOMAN + ZERO_WIDTH_JOINER + BALD)
print(RIGHTWARDS_HAND + SKIN_TONE_2 + ZERO_WIDTH_JOINER + LEFTWARDS_HAND + SKIN_TONE_3)
print(WOMAN + SKIN_TONE_5 + ZERO_WIDTH_JOINER + HEART + ZERO_WIDTH_JOINER + KISS + ZERO_WIDTH_JOINER + MAN + SKIN_TONE_1)

👨‍🦰
👩‍🦲
🫱🏼‍🫲🏽
👩🏿‍❤‍💋‍👨🏻
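
What is displayed as one emoji is still a sequence of several code points. As a small sketch, `unicodedata.name` can list them; the fallback string `"<unnamed>"` is an assumption for Python builds whose Unicode database lacks a newer character.

```python
import unicodedata

# Man + zero-width joiner + red hair component, as in the first example above.
sequence = "\U0001F468\u200D\U0001F9B0"
print(len(sequence))  # len() counts code points, not displayed symbols

for char in sequence:
    print(f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
```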


## Text-based data formats

### CSV

In [14]:
import csv

#### Reading CSV files

In [15]:
# newline="" lets the csv module handle line breaks inside quoted fields itself
with open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/tweets.csv", encoding="utf-8", newline="") as infile:
    reader = csv.reader(infile)
    for row in reader:
        print(row)

['date', 'author', 'text']
['2018-05-01', 'Philomena Cunk', 'What a wonderful day']
['', 'Guido van Rossum', 'Hello, world!']
['2006-08-04', 'Borat Sagdiyev', 'This suit is black...\n\nNOT!']
['1637-08-04', 'Pierre de Fermat', 'aⁿ + bⁿ = cⁿ for n > 2']


In [16]:
with open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/tweets.csv", encoding="utf-8", newline="") as infile:
    # DictReader uses the first row as keys for all following rows
    reader = csv.DictReader(infile)
    for row in reader:
        print(row)

{'date': '2018-05-01', 'author': 'Philomena Cunk', 'text': 'What a wonderful day'}
{'date': '', 'author': 'Guido van Rossum', 'text': 'Hello, world!'}
{'date': '2006-08-04', 'author': 'Borat Sagdiyev', 'text': 'This suit is black...\n\nNOT!'}
{'date': '1637-08-04', 'author': 'Pierre de Fermat', 'text': 'aⁿ + bⁿ = cⁿ for n > 2'}
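
Note that the missing date is read as an empty string, not as `None`. The following sketch converts empty fields explicitly. It uses an in-memory `io.StringIO` with inline sample rows (modeled on the output above) so that it runs without the example file.

```python
import csv
import io

raw = (
    "date,author,text\n"
    "2018-05-01,Philomena Cunk,What a wonderful day\n"
    ',Guido van Rossum,"Hello, world!"\n'
)

reader = csv.DictReader(io.StringIO(raw, newline=""))
# CSV has no notion of "missing": map empty strings to None ourselves.
rows = [{key: (value or None) for key, value in row.items()} for row in reader]

for row in rows:
    print(row)
```

The quoted field `"Hello, world!"` also shows how a comma inside a value survives parsing.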


#### Writing CSV files

In [20]:
# newline="" prevents extra blank lines between rows on Windows
with open("new.csv", "w", encoding="utf-8", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "age"])
    writer.writerow(["Martha", 36])
    writer.writerow(["Carl", 19])

In [21]:
with open("new.csv", "w", encoding="utf-8", newline="") as outfile:
    writer = csv.DictWriter(outfile, ["name", "age"])
    writer.writeheader()
    writer.writerow({"name": "Martha", "age": 36})
    writer.writerow({"name": "Carl", "age": 19})
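
The writer quotes fields automatically when they contain delimiters or line breaks. A minimal round-trip sketch, using an in-memory buffer instead of a file:

```python
import csv
import io

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["author", "text"])
writer.writerow(["Borat Sagdiyev", "This suit is black...\n\nNOT!"])

serialized = buffer.getvalue()
print(repr(serialized))  # the multi-line field is wrapped in double quotes

# Reading the result back yields the original values, line breaks included.
rows = list(csv.reader(io.StringIO(serialized, newline="")))
print(rows)
```

By default the writer terminates rows with `\r\n`, which is why files should be opened with `newline=""`.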

### JSON

In [17]:
import json

#### Reading JSON files

In [18]:
with open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/tweets.json", encoding="utf-8") as infile:
    data = json.load(infile)
data

{'tweets': [{'date': {'year': 2018, 'month': 5, 'day': 1},
   'author': 'Philomena Cunk',
   'text': 'What a wonderful day'},
  {'author': 'Borat Sagdiyev', 'text': 'This suit is black...\n\nNOT!'},
  {'date': None,
   'author': 'Pierre de Fermat',
   'text': 'aⁿ + bⁿ = cⁿ for n > 2'}]}
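
Unlike CSV, JSON records may omit keys or contain `null` (read as `None`), as the `date` field above shows. A small sketch of defensive access, using an inline string that mirrors the structure above instead of the example file:

```python
import json

raw = """
{"tweets": [
  {"date": {"year": 2018, "month": 5, "day": 1}, "author": "Philomena Cunk"},
  {"author": "Borat Sagdiyev"},
  {"date": null, "author": "Pierre de Fermat"}
]}
"""
data = json.loads(raw)

# .get() returns None for a missing key; "or {}" also covers an explicit null.
years = [(tweet.get("date") or {}).get("year") for tweet in data["tweets"]]
print(years)
```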

#### Writing JSON

In [19]:
# Caution: mode "w" truncates the file, so this replaces the example tweets.json read above.
with open("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/tweets.json", "w", encoding="utf-8") as outfile:
    data = {"example": [1, 2, 3]}
    json.dump(data, outfile)
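
By default, `json.dump` and `json.dumps` escape all non-ASCII characters. The sketch below compares this with `ensure_ascii=False`; both outputs decode to the same data.

```python
import json

data = {"text": "aⁿ + bⁿ = cⁿ"}

escaped = json.dumps(data)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)  # characters written as-is
print(escaped)
print(readable)

# indent=2 pretty-prints the structure over several lines.
print(json.dumps(data, ensure_ascii=False, indent=2))
```

When writing with `ensure_ascii=False`, the file should be opened with an encoding that can represent the characters, such as UTF-8.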

### XML

In [22]:
import xml.etree.ElementTree as ET

#### Reading XML files

In [23]:
tree = ET.parse("/Users/merterol/uzh/Computational Linguistics/Sem 2/PCL 2/Lecture/Lecture 8/tweets.xml")
tree

<xml.etree.ElementTree.ElementTree at 0x10d2d48f0>

In [24]:
texts = tree.findall("./tweet/text") # find the <text> tags nested in a <tweet> tag
texts

[<Element 'text' at 0x10d781120>,
 <Element 'text' at 0x10d783380>,
 <Element 'text' at 0x10d783b00>,
 <Element 'text' at 0x10d781bc0>]

In [25]:
texts[-1].text

'aⁿ + bⁿ = cⁿ for n > 2'
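
To turn elements into records, one can iterate over the parent elements and read children and attributes from each. This is a sketch on an inline document; its structure, including the `date` attribute, is an assumption for illustration and may differ from the actual `tweets.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure, not necessarily that of the example file.
xml_source = """<tweets>
  <tweet date="2018-05-01"><author>Philomena Cunk</author><text>What a wonderful day</text></tweet>
  <tweet><author>Guido van Rossum</author><text>Hello, world!</text></tweet>
</tweets>"""

root = ET.fromstring(xml_source)
records = [
    # .get() reads an attribute (None if absent); .findtext() reads a child's text.
    (tweet.get("date"), tweet.findtext("author"), tweet.findtext("text"))
    for tweet in root.iter("tweet")
]
print(records)
```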

#### Writing XML files

In [26]:
root = ET.Element("examples")
ET.SubElement(root, "example")
subelement = ET.SubElement(root, "example", id="123")
subelement.text = "Content! <:-)"
tree = ET.ElementTree(root)
with open("new.xml", "wb") as outfile:
    tree.write(outfile, xml_declaration=True, encoding="utf-8")
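
To inspect the result without opening the file, the same tree can be serialized to a string with `ET.tostring`. A minimal sketch showing that special characters such as `<` are escaped on output and restored on parsing:

```python
import xml.etree.ElementTree as ET

root = ET.Element("examples")
ET.SubElement(root, "example")
subelement = ET.SubElement(root, "example", id="123")
subelement.text = "Content! <:-)"

# encoding="unicode" returns a str instead of bytes.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Parsing the serialized form gives back the unescaped text.
print(ET.fromstring(serialized)[1].text)
```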

## Exercise: Extracting information from XML

[*The ArchiMob Corpus*](https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html) is a collection of transcriptions of spoken Swiss German.

### Downloading the XML file

In [29]:
from urllib.request import urlretrieve

urlretrieve("https://drive.switch.ch/index.php/s/vYZv9sNKetuPYTn/download?path=%2F&files=1044.xml", "archimob_1044.xml")

('archimob_1044.xml', <http.client.HTTPMessage at 0x10d2e2bd0>)

### Parsing the XML file

In [30]:
import xml.etree.ElementTree as ET

tree = ET.parse('archimob_1044.xml')

# Use TEI as default namespace (without prefix)
ns = {"": "http://www.tei-c.org/ns/1.0"}

In [31]:
# Example: Find the <title> element, and get its text
tree.find("./teiHeader//title", ns).text

'Transcription 1044'
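
The namespace mapping matters because ElementTree stores every tag together with its namespace URI. The sketch below uses a tiny inline TEI-like document (not the ArchiMob file) and an explicit `tei` prefix, an alternative to the empty-prefix mapping used above.

```python
import xml.etree.ElementTree as ET

tei_source = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Demo</title></titleStmt></fileDesc></teiHeader>"
    "</TEI>"
)
root = ET.fromstring(tei_source)

print(root.tag)               # the tag includes the namespace URI in braces
print(root.find(".//title"))  # None: a plain tag name does not match

# Map a prefix of our choice to the namespace URI and use it in the path.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
title = root.find("./tei:teiHeader//tei:title", ns)
print(title.text)
```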

### Finding the longest noun

In [53]:
# Paths must be relative: ".//w" searches all descendants
# (a leading "//" triggers a FutureWarning in ElementTree)
nouns = (element.text for element in tree.findall(".//w[@tag='NN']", ns))

max(nouns, key=len)


'kholleggtiivsaueggsischtänz'
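
The same search can be made more robust against `<w>` elements without text, for which `len` would fail. This is a sketch on invented inline data; the sample words and surrounding structure are made up, and only the `w` element and `tag` attribute follow the query above.

```python
import xml.etree.ElementTree as ET

# Invented sample data for illustration only.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><u>'
    '<w tag="NN">huus</w><w tag="VVFIN">gaat</w>'
    '<w tag="NN">chuchichäschtli</w><w tag="NN"/>'
    "</u></body></text></TEI>"
)
root = ET.fromstring(sample)
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Keep only nouns that actually have text; an empty <w/> has text None.
nouns = [w.text for w in root.iterfind(".//tei:w[@tag='NN']", ns) if w.text]

# default=None avoids a ValueError if no noun is found at all.
longest = max(nouns, key=len, default=None)
print(longest)
```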