## Reading text from files

We use Python 3.4 to read text from two files.


In [1]:
s1 = open('../data/abipon-latin1.txt', 'rb').read()

In [2]:
s1

b'Abip\xf3n'

In [3]:
s2 = open('../data/abipon-utf8.txt', 'rb').read()

In [4]:
s2

b'Abipo\xcc\x81n'

The binary content of the two files is obviously different.

In [5]:
s1 == s2

False

In [6]:
len(s1) == len(s2)

False

Let's try to decode the data.

In [7]:
s1.decode('utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 4: invalid continuation byte

In [8]:
s1.decode('latin1')

'Abipón'

Ok. That looks about right.

In [9]:
s2.decode('latin1')

'AbipoÌ\x81n'

In [10]:
s2.decode('utf8')

'Abipón'

Ah! So the content of the two files may be the same after all?
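The decoding step can also be handed to `open` itself by passing an `encoding` argument in text mode. The following is a minimal sketch: instead of relying on the `../data` files, it writes two temporary files with the same bytes as shown above, and the file names `latin1.txt` and `utf8.txt` are made up for the example.

```python
import os
import tempfile

# Recreate the byte content of the two example files in a temporary directory.
# The file names are illustrative, not the ones used above.
tmpdir = tempfile.mkdtemp()
latin1_path = os.path.join(tmpdir, 'latin1.txt')
utf8_path = os.path.join(tmpdir, 'utf8.txt')

with open(latin1_path, 'wb') as f:
    f.write(b'Abip\xf3n')
with open(utf8_path, 'wb') as f:
    f.write(b'Abipo\xcc\x81n')

# Opening in text mode with an explicit encoding returns str, not bytes.
with open(latin1_path, encoding='latin1') as f:
    t1 = f.read()
with open(utf8_path, encoding='utf8') as f:
    t2 = f.read()

print(type(t1).__name__, type(t2).__name__)
```

If `encoding` is omitted, Python 3.4 falls back to a platform-dependent default, so naming the encoding explicitly is the safer habit.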

### Summary

- Textual data may be represented in files in different [character encodings](https://en.wikipedia.org/wiki/Character_encoding).
- Common encodings include:
  - [ASCII](https://en.wikipedia.org/wiki/ASCII)
  - Latin 1, aka [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) (Western Europe)
  - CP-1252, aka [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252)
  - macroman, aka [Mac OS Roman](https://en.wikipedia.org/wiki/Mac_OS_Roman)
  - [UTF-8](https://en.wikipedia.org/wiki/UTF-8)
- All the above encodings are compatible with ASCII, i.e. ASCII-only text will be encoded
  identically in all these encodings.
- In general, though, comparing the raw bytes of text stored in different encodings does not make sense.
- UTF-8 is the only Unicode encoding of the ones listed above, i.e. it can encode all characters
  defined in the [Unicode standard](https://en.wikipedia.org/wiki/Unicode).
- In general it is impossible to detect the encoding of a file automatically and reliably, since,
  for example, any sequence of bytes can be "decoded" as Latin 1.
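Two of the points above can be checked directly. The sketch below uses Python's codec names for the listed encodings (`ascii`, `latin1`, `cp1252`, `mac_roman`, `utf8`); the sample word without the accent is just an illustration.

```python
# ASCII-only text yields the same bytes in every encoding listed above.
encodings = ['ascii', 'latin1', 'cp1252', 'mac_roman', 'utf8']
ascii_bytes = {'Abipon'.encode(enc) for enc in encodings}
print(len(ascii_bytes))  # number of distinct byte strings

# A non-ASCII character (here U+00F3, o with acute) is encoded differently.
for enc in ['latin1', 'mac_roman', 'utf8']:
    print(enc, '\u00f3'.encode(enc))

# Latin 1 assigns a character to every possible byte value, so decoding
# never fails. A successful decode therefore says nothing about whether
# Latin 1 was the encoding actually used.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('latin1')
print(len(decoded))
```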

## Representation of text in software: Unicode

Software systems need a unified internal representation of textual data. This is typically
either a Unicode encoding like UTF-8 (R) or UTF-16 (Windows), or a string type that implements
the Unicode standard directly as a sequence of code points (Python).

The advantages of implementing the Unicode standard are:
- character properties:
  - category (letter [uppercase|lowercase|...], punctuation, symbol, ...)
  - script (Coptic, Cyrillic, ...)
- rules for normalization, decomposition, collation, bidirectional display

So let's see how our data looks internally:

In [11]:
s1 = s1.decode('latin1')

In [12]:
s2 = s2.decode('utf8')

In [13]:
s1 == s2

False

Hm. So this is not WYSIWYG. Looking the same is not enough to count as identical in Unicode. Different sequences of Unicode code points are often rendered with the same glyphs, because
- different scripts may have characters that look alike, e.g. the Cyrillic letter а and the Latin letter a
- different Unicode sequences may represent the same glyph, e.g. through the use of combining accents or diacritics.
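The first case can be made visible with a short sketch. The variable names `latin_a` and `cyrillic_a` are chosen for this example only.

```python
import unicodedata

latin_a = 'a'
cyrillic_a = '\u0430'  # renders like a Latin "a" in most fonts

# The two characters look alike but are different code points.
print(latin_a == cyrillic_a)
for ch in (latin_a, cyrillic_a):
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```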

In [19]:
len(s1), len(s2)

(6, 7)

Let's try to find out what's going on here with the help of the [unicodedata](https://docs.python.org/3/library/unicodedata.html) module:

In [14]:
import unicodedata

In [15]:
for c in s1:
    print(unicodedata.name(c))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER I
LATIN SMALL LETTER P
LATIN SMALL LETTER O WITH ACUTE
LATIN SMALL LETTER N


In [16]:
for c in s2:
    print(unicodedata.name(c))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER I
LATIN SMALL LETTER P
LATIN SMALL LETTER O
COMBINING ACUTE ACCENT
LATIN SMALL LETTER N


So the difference comes from the fact that in `s2`, the ó has been composed from two Unicode code points. Fortunately, Unicode has a concept of [equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence) which solves this problem, and [normalization](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization) is the way to assess equivalence.

We can either force canonical composition:

In [17]:
s1 == unicodedata.normalize('NFC', s2)

True

or canonical decomposition:

In [18]:
unicodedata.normalize('NFD', s1) == s2

True
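Both comparisons can be wrapped in a small helper that normalizes both sides before comparing. This is only a sketch: `equivalent` is a hypothetical name, not a function from `unicodedata`, and the two strings are rebuilt from escapes so the block runs on its own.

```python
import unicodedata

def equivalent(a, b, form='NFC'):
    """Return True if a and b are equal after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = 'Abip\u00f3n'     # precomposed o with acute, like s1
decomposed = 'Abipo\u0301n'  # o followed by a combining acute accent, like s2

print(composed == decomposed)
print(equivalent(composed, decomposed))
print(equivalent(composed, decomposed, form='NFD'))
```

Normalizing both arguments, rather than only one, means the caller does not need to know which form either string is already in.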

### Summary

- Naive string comparisons do not work when text is encoded differently.
- It is also not that easy when text is encoded with the same encoding ...
- ... and it is not easy with Unicode either.
- But Unicode provides the tools (normalization!) to make it work.
- The Unicode support in Python does not mean normalization is applied implicitly before comparing strings!