# Textual data

## Reading text from files

We use Python 3.4 to read text from two files.


In [2]:
s1 = open('abipon-latin1.txt', 'rb').read()

In [3]:
s1

b'Abip\xf3n'

In [4]:
s2 = open('abipon-utf8.txt', 'rb').read()

In [5]:
s2

b'Abipo\xcc\x81n'

The binary content of the two files is obviously different.

In [6]:
s1 == s2

False

In [7]:
len(s1) == len(s2)

False

Let's try to decode the data:

In [8]:
s1.decode('utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

Decoding as UTF-8 did not work. Let's try a different common encoding:

In [9]:
s1.decode('latin1')

'Abipón'

Ok. That looks about right.

In [10]:
s2.decode('latin1')

'AbipoÌ\x81n'

In [11]:
s2.decode('utf8')

'Abipón'

Ah! So the content of the two files may be the same after all!?
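The trial-and-error above can be wrapped in a small helper. This is only a sketch: the name `try_decodings` and the choice of candidate encodings are assumptions made for illustration, and the byte strings are the ones read from the two files. `ascii()` is used for printing so that combining characters stay visible.

```python
def try_decodings(data, encodings=('utf8', 'latin1')):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# The byte strings read from the two files above.
latin1_bytes = b'Abip\xf3n'
utf8_bytes = b'Abipo\xcc\x81n'

for data in (latin1_bytes, utf8_bytes):
    for encoding, text in sorted(try_decodings(data).items()):
        # ascii() escapes non-ASCII characters, so nothing is hidden by rendering.
        print(data, encoding, ascii(text))
```

Note that a successful decode is no proof that the guess was right: the second byte string decodes without error as Latin 1 too, just not to the intended text.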

### Summary

- Textual data may be represented in files in different [character encodings](https://en.wikipedia.org/wiki/Character_encoding).
- Common encodings are:
  - [ASCII](https://en.wikipedia.org/wiki/ASCII)
  - Latin 1, aka [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) Western Europe
  - CP-1252, aka [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252)
  - macroman, aka [Mac OS Roman](https://en.wikipedia.org/wiki/Mac_OS_Roman)
  - [UTF-8](https://en.wikipedia.org/wiki/UTF-8)
- All the above encodings are compatible with ASCII, i.e. ASCII-only text will be encoded
  identically in all these encodings.
- But in general, comparing the raw bytes of text stored in different encodings does not make sense.
- UTF-8 is the only Unicode encoding of the ones listed above, i.e. it can encode all characters
  defined in the [Unicode standard](https://en.wikipedia.org/wiki/Unicode).
- It is impossible in general to automatically detect the encoding used in a file, since,
  for example, any sequence of bytes can be "decoded" as Latin 1.
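Two of the points above can be checked directly. The following is a minimal sketch using Python's codec names for the encodings listed (`mac_roman` is Python's name for Mac OS Roman); the sample strings are chosen for illustration.

```python
encodings = ['ascii', 'latin1', 'cp1252', 'mac_roman', 'utf8']

# ASCII-only text yields the same bytes in every encoding listed above,
# so the set of encoded results has a single member.
ascii_bytes = {'Abipon'.encode(encoding) for encoding in encodings}
print(ascii_bytes)

# Non-ASCII text does not: compare the bytes produced for the same string.
# '\u00f3' is LATIN SMALL LETTER O WITH ACUTE; plain ASCII cannot encode it.
for encoding in encodings[1:]:
    print(encoding, 'Abip\u00f3n'.encode(encoding))

# Latin 1 assigns a character to each of the 256 possible byte values,
# so decoding arbitrary bytes as Latin 1 never raises an error.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('latin1')
print(len(decoded))
```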

## Representation of text in software: Unicode

Software systems need a unified internal representation of textual data. This is typically
either a Unicode encoding like UTF-8 (R) or UTF-16 (Windows), or an implementation of
the Unicode standard in which strings are sequences of Unicode code points (Python).

The advantages of implementing the Unicode standard are:
- character properties:
  - category (letter [uppercase|lowercase|...], punctuation, symbol, ...)
  - script (Coptic, Cyrillic, ...)
- rules for normalization, decomposition, collation, bidirectional display
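Some of these properties can be queried with Python's `unicodedata` module. A short sketch follows; the sample characters are chosen for illustration. As far as we know, the standard library module does not expose the script property, so only category, combining class, decomposition and name are shown.

```python
import unicodedata

samples = ['A', 'a', '\u00f3', '\u0301', ',', '\u20ac']

for char in samples:
    print(
        'U+{:04X}'.format(ord(char)),
        unicodedata.category(char),         # general category, e.g. Lu, Ll, Mn
        unicodedata.combining(char),        # canonical combining class, 0 if none
        unicodedata.decomposition(char),    # empty string if not decomposable
        unicodedata.name(char),
    )
```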

So let's see how our data looks internally:

In [11]:
s1 = s1.decode('latin1')

In [12]:
s2 = s2.decode('utf8')

In [13]:
s1 == s2

False

Hm. So this is not WYSIWYG. Looking the same is not enough to pass as identity in Unicode. There are many cases where different sequences of Unicode code points are rendered with glyphs that look the same, because
- different scripts may have characters that look alike, e.g. the Cyrillic letter а and the Latin letter a
- there are different Unicode sequences representing the same glyphs, e.g. through the use of combining accents or diacritics.
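The first case, lookalike characters from different scripts, can be illustrated as follows. This is a minimal sketch; the variable names are ours.

```python
import unicodedata

latin_a = 'a'
cyrillic_a = '\u0430'

# The two characters are rendered alike but are different code points.
print(latin_a == cyrillic_a)
for char in (latin_a, cyrillic_a):
    print('U+{:04X}'.format(ord(char)), unicodedata.name(char))

# Normalization leaves the Cyrillic letter unchanged, so it cannot be used
# to make the two compare equal.
normalized = unicodedata.normalize('NFKC', cyrillic_a)
print(normalized == latin_a)
```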

In [14]:
len(s1), len(s2)

(6, 7)

Let's try to find out what's going on here with the help of the [unicodedata](https://docs.python.org/3/library/unicodedata.html) library:

In [15]:
import unicodedata

In [16]:
for c in s1:
    print(unicodedata.name(c))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER I
LATIN SMALL LETTER P
LATIN SMALL LETTER O WITH ACUTE
LATIN SMALL LETTER N


In [17]:
for c in s2:
    print(unicodedata.name(c))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER I
LATIN SMALL LETTER P
LATIN SMALL LETTER O
COMBINING ACUTE ACCENT
LATIN SMALL LETTER N


So the difference comes from the fact that in `s2`, the ó is composed of two Unicode code points. Fortunately, Unicode has a concept of [equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence) which solves this problem, and [normalization](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization) is the way to assess equivalence.

We can either force the canonical composition:

In [18]:
s1 == unicodedata.normalize('NFC', s2)

True

or canonical decomposition:

In [19]:
unicodedata.normalize('NFD', s1) == s2

True
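Rather than remembering which side needs which form, both strings can be normalized to the same form before comparing. A minimal sketch, where the helper name `equivalent` is an assumption made for illustration:

```python
import unicodedata


def equivalent(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = 'Abip\u00f3n'      # precomposed LATIN SMALL LETTER O WITH ACUTE
decomposed = 'Abipo\u0301n'   # o followed by COMBINING ACUTE ACCENT

print(composed == decomposed)
print(equivalent(composed, decomposed))
print(equivalent(composed, decomposed, form='NFD'))
```

Either form works for the comparison, as long as the same one is applied to both sides.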

### Summary

- Naive string comparisons do not work when text is encoded differently.
- It is also not that easy when text is encoded with the same encoding ...
- ... and it is not easy with Unicode either.
- But Unicode provides the tools (normalization!) to make it work.
- The Unicode support in Python does not mean normalization is applied implicitly before comparing strings!

## More pitfalls

### Line endings

Quoting [Wikipedia](https://en.wikipedia.org/wiki/Newline):

> Systems based on ASCII or a compatible character set use **either** LF (Line feed, '\n') **or** CR (Carriage return, '\r') individually, **or** CR followed by LF (CR+LF, '\r\n').

Or to quote [Jenny Bryan](https://gist.github.com/jennybc):

| English                             | OS connotation                     | character | "vibe"        |
|-------------------------------------|------------------------------------|-----------|---------------|
| carriage return, "CR"               | classic Mac, i.e. OS 9 and earlier | `\r`      | archaic       |
| line feed, "LF"                     | Unix, including Mac OS X           | `\n`      | The Very Best |
| carriage return + line feed, "CRLF" | Windows, going back to DOS         | `\r\n`    | Boo! Windows! |

CSV files exported from LibreOffice on Linux will have LF line endings. If exported from [Excel on Mac OS](https://gist.github.com/jennybc/0be7717c2b5b30088811), they will have CR; if exported from Excel on Windows, they will have CR+LF.

So it is quite likely that you will run into different line endings at some point.

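Python's [`open` builtin function](https://docs.python.org/3/library/functions.html#open) opens text files in [universal newlines](https://docs.python.org/3/glossary.html#term-universal-newlines) mode by default, i.e. with `newline=None`: all three line endings are recognized and translated to `\n` on reading. Passing `newline=''` still recognizes all three line endings but leaves them untranslated, and passing a specific terminator such as `'\r'` splits lines only at that terminator.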

In [12]:
with open('newlines.txt') as fp:
    lines = fp.readlines()

In [13]:
lines

['line\n', 'line\n', 'line\n']

In [14]:
with open('newlines.txt', newline='') as fp:
    lines = fp.readlines()

In [15]:
lines

['line\n', 'line\r', 'line\r\n']

In [16]:
with open('newlines.txt', newline='\r') as fp:
    lines = fp.readlines()

In [17]:
lines

['line\nline\r', 'line\r', '\n']
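To reproduce these results without the original file, a stand-in can be written to a temporary directory. This is a sketch under the assumption that `newlines.txt` contains the three lines shown above, one per line ending; the actual file is not part of this text.

```python
import os
import tempfile

# Assumed content of newlines.txt: three lines, each with a different ending.
content = b'line\nline\rline\r\n'

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'newlines.txt')
    # Write in binary mode so that no newline translation happens on output.
    with open(path, 'wb') as fp:
        fp.write(content)

    # Read the same file with the three newline settings used above.
    for newline in (None, '', '\r'):
        with open(path, newline=newline) as fp:
            results[newline] = fp.readlines()

for newline in (None, '', '\r'):
    print(repr(newline), results[newline])

# str.splitlines() also recognizes all three endings and drops them.
stripped = content.decode('ascii').splitlines()
print(stripped)
```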

[git can help](https://help.github.com/articles/dealing-with-line-endings/) to solve this problem in a collaborative setting, by [configuring what line endings to use for text files in a repository](https://help.github.com/articles/dealing-with-line-endings/#per-repository-settings).



### Control characters

Why are things like LF and CR in ASCII anyway, rather than just NEWLINE? I guess I'm too young to know.

But there are more weird things in ASCII:

In [27]:
s = '\x07'

In [28]:
print(s)




You can't see them. But they must have properties.

In [29]:
import unicodedata

In [30]:
unicodedata.name(s)

ValueError: no such name

Hm. That's disappointing. No name for these things? [It turns out](http://stackoverflow.com/a/24553272) the generic [name for these is just `<control>`](http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt), although this specific character is called *BELL*.

What about the category?

In [31]:
print(unicodedata.category(s))

Cc


Ok, looking this up in [the specification](http://www.unicode.org/reports/tr44/#GC_Values_Table) reveals this is in fact a control character.
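The category lookup also gives a quick way to list all control characters in the ASCII range. A minimal sketch:

```python
import unicodedata

# All ASCII code points whose general category is Cc (control).
controls = [chr(i) for i in range(128) if unicodedata.category(chr(i)) == 'Cc']
print(len(controls))
print([hex(ord(c)) for c in controls])

# Line feed, carriage return and tab are control characters too.
for char in ('\n', '\r', '\t', '\x07'):
    print(repr(char), unicodedata.category(char))
```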

But then, there's all kinds of funny stuff in Unicode, and an encoding like UTF-8 can handle all of it, so putting them in text files shouldn't be a problem.

In [32]:
from xml.dom import minidom

In [33]:
minidom.parseString('<e>' + s + '</e>')

ExpatError: not well-formed (invalid token): line 1, column 3

Ouch! It turns out that, while the default character encoding for XML files is UTF-8, [not every character that can be encoded in UTF-8 is allowed in an XML document](https://www.w3.org/TR/xml/#charsets).

So if you want to put UTF-8 encoded data into a [BEAST2](http://beast2.org/) XML configuration file, you may have to remove the control characters first.
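One possible way to do this is a regular expression built from the `Char` production of the XML 1.0 specification linked above. This is only a sketch: the names `INVALID_XML_CHARS` and `strip_invalid_xml_chars` and the sample string are assumptions made for illustration, and silently dropping characters may not be what you want for every dataset.

```python
import re
from xml.dom import minidom

# Anything outside the Char production of XML 1.0 is matched:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    '[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')


def strip_invalid_xml_chars(text):
    """Remove characters that are not allowed in an XML 1.0 document."""
    return INVALID_XML_CHARS.sub('', text)


# Hypothetical sample: a BELL character hidden in otherwise ordinary text.
dirty = 'Abip\u00f3n\x07 language'
clean = strip_invalid_xml_chars(dirty)
print(ascii(dirty))
print(ascii(clean))

# The cleaned text can now be parsed as XML content.
doc = minidom.parseString('<e>' + clean + '</e>')
parsed = doc.documentElement.firstChild.data
print(ascii(parsed))
```

Tab, line feed and carriage return are kept, since XML allows them.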

Encountering control characters in text files is not uncommon, in particular if the files are old and/or composed by copying and pasting text from different applications on different platforms.