# Encoding Schemes

This is not a computer science class, but in your practical work, you will frequently have to deal with text that is encoded in a variety of styles. Understanding the difference between them is key.

### Implications for Data Science

Reasons I've encountered in my own NLP work for why it helps to understand text encoding schemes:
* If you are not using the right encoding, you cannot perform adequate feature engineering.
* Some data scientists have simply "thrown away" samples of tweets and social media comments that seem malformed but are actually just using a different encoding scheme.

## World Languages, in Context

<figure>
  <img src="images/most_popular_languages.png" alt="my alt text"/>
</figure>

## Bits and Bytes

- Computers, at their lowest level, store everything in the form of bits (each either a 0 or a 1). The number of distinct values that can be represented is determined by the number of bits.

For instance, using only 4 bits, you can store **$2^4$ = 16** different values.
<figure>
  <img src="images/binary.png" alt="my alt text"/>
    <figcaption><i>How <b>$101010$</b> is converted to decimal (human-readable numbers): each of the green numbers is summed up to equal 42.</i></figcaption>
</figure>
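
The conversion in the figure can be sketched in a few lines of Python. This is a minimal illustration rather than part of the course code; the variable names are placeholders chosen for this example.

```python
# Each additional bit doubles the number of distinct values that can be represented.
n_bits = 4
n_values = 2 ** n_bits

# Convert the binary string "101010" to decimal by summing powers of two,
# starting from the rightmost (least significant) bit.
bits = "101010"
total = 0
for position, bit in enumerate(reversed(bits)):
    total += int(bit) * 2 ** position

print(n_values)      # 16
print(total)         # 42
print(int(bits, 2))  # 42, using Python's built-in base conversion
```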

It is not efficient for a computer to read one bit at a time, so data is typically stored in **8-bit** groups called **bytes**.


### Exercises
1. How many bits (0s and 1s) does it take to represent **4 different characters**?

\begin{equation}
2^{n} = 4
\end{equation}

\begin{equation}
log(2^{n}) = log(4)
\end{equation}

\begin{equation}
n \cdot log(2) = log(4)
\end{equation}

\begin{equation}
n = \frac{log(4)}{log(2)}
\end{equation}

\begin{equation}
n = 2
\end{equation}

It takes **2** bits to represent **4** distinct characters.

2. How many bits (0s and 1s) does it take to represent **128 different characters**?

\begin{equation}
2^{n} = 128
\end{equation}

\begin{equation}
log(2^{n}) = log(128)
\end{equation}

\begin{equation}
n \cdot log(2) = log(128)
\end{equation}

\begin{equation}
n = \frac{log(128)}{log(2)}
\end{equation}

\begin{equation}
n = 7
\end{equation}

It takes **7** bits to represent **128** distinct characters.
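
The same derivation can be checked numerically. The sketch below assumes a hypothetical helper named `bits_needed` (not part of the course code) and rounds up, because a character count that is not a power of two still needs a whole number of bits.

```python
import math


def bits_needed(n_characters: int) -> int:
    """Smallest whole number of bits n such that 2**n >= n_characters."""
    return math.ceil(math.log2(n_characters))


print(bits_needed(4))    # 2
print(bits_needed(128))  # 7
print(bits_needed(100))  # 7, because 6 bits only cover 64 values
```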


## ASCII

The oldest, yet still relevant, encoding style to be aware of is **ASCII**, in which computers represent text (**every character on a standard US keyboard**, plus a set of control characters) as a number between 0 and 127 (question: how many bits does it take to do this?).

<figure>
  <img src="images/ascii.svg" alt="my alt text"/>
    <figcaption><i>ASCII table converting numbers to characters.<b>(Wikipedia)</b></i></figcaption>
</figure>

*If the smallest amount of data a computer can realistically read in is a byte (**8 bits**), why is ASCII only **7 bits**?* The eighth bit was often used as a **parity bit** for **error checking**: it helps detect whether the data was corrupted or unintentionally altered.
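
To make the parity idea concrete, here is a minimal sketch. It assumes *even* parity with the parity bit placed in front of the 7 data bits; real systems differed on both choices, and `with_even_parity` is a hypothetical helper written for this example.

```python
def with_even_parity(char: str) -> str:
    """Return 8 bits: an even-parity bit followed by the 7-bit ASCII code."""
    code = ord(char)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    seven_bits = format(code, "07b")
    # The parity bit is 1 only when the data bits contain an odd number of 1s,
    # so the total number of 1s in the byte is always even.
    parity_bit = seven_bits.count("1") % 2
    return f"{parity_bit}{seven_bits}"


print(with_even_parity("D"))  # 01000100 (two 1s in the data, parity bit 0)
print(with_even_parity("a"))  # 11100001 (three 1s in the data, parity bit 1)
```

If a single bit flips in transit, the count of 1s becomes odd and the receiver knows the byte is suspect, although it cannot tell which bit changed.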

### Encoding/Decoding Words

How would you write the word `Data` using ASCII encoding?

#### Steps:

1. Look up the "codepoint" for the first character (`D`).

When you look up the character map value for `D`, its corresponding codepoint is `68`. Note that this is a different codepoint than lowercase `d`.

2. Write out that number in binary.

<figure>
  <img src="images/empty_binary_workbook.png" alt="my alt text"/>
</figure>

<figure>
  <img src="images/full_binary_workbook.png" alt="my alt text"/>
</figure>

The ASCII binary encoding for `D` is `1000100`.

3. Repeat for `a`, `t`, and `a`.

Use [this website to check your answer](https://www.rapidtables.com/convert/number/ascii-to-binary.html).
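
You can also check the steps above directly in Python. This is a small sketch using only built-ins: `ord` looks up the code point, and `format(..., "07b")` writes it as a 7-bit binary string.

```python
word = "Data"

encoded = []
for char in word:
    code_point = ord(char)              # step 1: look up the code point
    binary = format(code_point, "07b")  # step 2: write it as 7 binary digits
    encoded.append(binary)
    print(char, code_point, binary)

# D 68 1000100
# a 97 1100001
# t 116 1110100
# a 97 1100001
```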

### Python Code

Do not worry about understanding what is happening inside the `get_binary_for_char` and `get_binary` functions. Just know that they take in a string and produce the 0s and 1s that the string is encoded in:

In [None]:
# to find out your computer's default encoding system
import sys
sys.getdefaultencoding()

In [2]:
from typing import List

def get_binary_for_char(char: str, encoding="utf-8") -> str:
    """
    Encodes a character using the desired encoding, then converts each encoded byte
    into binary, formatted with tab spaces between byte marks.
    """

    encoded: bytes = char.encode(encoding)
    hex_code = encoded.hex()
    code_point = hex(ord(char))[2:].upper()

    # Format byte by byte so that leading zero bytes (e.g. in UTF-16 or UTF-32) are not dropped.
    byte_list: List[str] = [f"{byte:08b}" for byte in encoded]
    formatted_binary: str = "\t".join(byte_list)  # for variable length encoding, tab space between byte marks.
    print(f"{char} (U+{code_point.zfill(4)}, hex:{hex_code}) - {encoding}: {formatted_binary}")
    return formatted_binary


def get_binary(text: str, encoding="utf-8"):
    return " ".join([get_binary_for_char(char, encoding) for char in text])

get_binary("Data")

D (U+0044, hex:44) - utf-8: 01000100
a (U+0061, hex:61) - utf-8: 01100001
t (U+0074, hex:74) - utf-8: 01110100
a (U+0061, hex:61) - utf-8: 01100001


'01000100 01100001 01110100 01100001'

In [None]:
get_binary("Data", encoding="ascii")

### Extended ASCII

The dominant language in earlier eras of computing was English. People began to realize that ASCII was relatively limited, and even other European languages could not be properly supported. At the same time, transmission technology evolved to a standard of reliability such that the parity bit used for checking for errors was no longer needed. 

As a result, people began using the last (eighth) bit to extend the number of characters represented by ASCII from 128 characters to 256 characters.

#### Latin-1

Character map [available here](https://www.htmlhelp.com/reference/charset/latin1.gif).

A character such as `Ç` (pronounced like `ch` in Turkish, for instance) is represented by the number `199`. The Spanish word `año` (year) includes the character `ñ`, which is represented by the code point `241`.
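
A quick way to confirm these values is to encode each character with Python's built-in `latin1` codec and inspect the resulting byte. This is only a sketch for checking the character map, not part of the course helpers.

```python
latin1_values = {}
for char in ["Ç", "ñ"]:
    encoded = char.encode("latin1")  # exactly one byte per character in Latin-1
    latin1_values[char] = encoded[0]
    print(char, len(encoded), encoded[0], format(encoded[0], "08b"))

# Ç 1 199 11000111
# ñ 1 241 11110001
```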

#### Excel on Macs

Macs commonly use [Mac OS Roman encoding](https://en.wikipedia.org/wiki/Mac_OS_Roman). 

In [None]:
get_binary("cat", encoding="latin1")

In [None]:
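# ASCII has no code points for "¿" or "á", so this first call raises a UnicodeEncodeError.
# Comment it out to run the latin1 version below.
get_binary("¿Cuántas?", encoding="ascii")
get_binary("¿Cuántas?", encoding="latin1")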

### Exercises

#### Encode the word **`más`** in ASCII.

Find the codepoints for each of the characters:

* `m` $\rightarrow$ 109
* `á` $\rightarrow$ **None**
* `s` $\rightarrow$ 115

It is impossible to correctly encode `más` in ASCII because there is no corresponding code point for `á`.
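
In Python, this shows up as a `UnicodeEncodeError`. The sketch below catches the error to show which character is the problem, and also shows the lossy `errors="replace"` option, which substitutes `?` for anything ASCII cannot represent.

```python
word = "más"

try:
    word.encode("ascii")
    failed_char = None
except UnicodeEncodeError as error:
    # error.object is the original string; error.start is the index of the bad character
    failed_char = error.object[error.start]
    print(f"Cannot encode {failed_char!r} in ASCII")

# A lossy alternative: replace unencodable characters with "?"
lossy = word.encode("ascii", errors="replace")
print(lossy)  # b'm?s'
```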

#### Encode the word **`más`** in **`latin1`**.

Find the codepoints for each of the characters:

* `m` $\rightarrow$ 109
* `á` $\rightarrow$ 225
* `s` $\rightarrow$ 115

Next, encode each of the codepoints into binary:

* `m` $\rightarrow$ 109: `01101101`
* `á` $\rightarrow$ 225: `11100001`
* `s` $\rightarrow$ 115: `01110011`

The `latin1` encoding for `más` is `01101101 11100001 01110011`.

In [3]:
# using Python to check
get_binary("más", encoding="latin1")

m (U+006D, hex:6d) - latin1: 01101101
á (U+00E1, hex:e1) - latin1: 11100001
s (U+0073, hex:73) - latin1: 01110011


'01101101 11100001 01110011'

#### Decode the binary stream **`01100011 01100001 01110100`**. Assume that it is using **`latin1`** encoding.

* `01100011` $\rightarrow$ 99 $\rightarrow$ `c`
* `01100001` $\rightarrow$ 97 $\rightarrow$ `a`
* `01110100` $\rightarrow$ 116 $\rightarrow$ `t`

The binary stream `01100011 01100001 01110100`, decoded using **`latin1`**, represents the string `cat`.
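
The decoding steps can be reproduced in Python as a check. This is a minimal sketch: each 8-bit group is parsed as a base-2 integer, the integers are packed into a `bytes` object, and the bytes are decoded as `latin1`.

```python
stream = "01100011 01100001 01110100"

# Turn each 8-bit group into an integer between 0 and 255
raw_bytes = bytes(int(group, 2) for group in stream.split())
decoded = raw_bytes.decode("latin1")

print(list(raw_bytes))  # [99, 97, 116]
print(decoded)          # cat
```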

## Unicode

Even 256 characters are not enough to represent the characters of other languages and scripts, like **Greek, Turkish, Cyrillic**, etc., or newer social media phenomena like **emojis**. Unicode assigns every character a numeric code point, and its encodings (UTF-8, UTF-16, UTF-32) store those code points using units of 8, 16, or 32 bits (1, 2, or 4 bytes). This means significantly more characters can be encoded: the Unicode code space spans 1,114,112 code points (`U+0000` through `U+10FFFF`).

As a point of reference, there are roughly **50,000** characters in written Chinese (but only around **15,000-20,000** are commonly used).

If you don't specify the right encoding when reading in text, you'll end up with something like this:
<figure>
  <img src="images/mojibake.png" alt="my alt text"/>
    <figcaption><i>Malformed characters because of incorrect encoding.<b>(Wikipedia)</b></i></figcaption>
</figure>
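
You can reproduce this kind of garbling yourself. The sketch below writes a string as UTF-8 bytes and then deliberately reads those bytes back as `latin1`; it also shows that, in this particular case, the damage can be reversed by undoing the wrong step.

```python
original = "más"

utf8_bytes = original.encode("utf-8")  # "á" becomes two bytes: 0xC3 0xA1
garbled = utf8_bytes.decode("latin1")  # each byte is misread as its own character
print(garbled)                         # mÃ¡s

# Undo the mistake: re-encode with the wrong codec, then decode with the right one
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)                        # más
```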

### UTF-8

The default encoding scheme of the internet today is `UTF-8`.

<figure>
  <img src="images/encoding_shares.svg" alt="my alt text"/>
    <figcaption><i>Share of web pages with different encodings.<b>(Wikipedia)</b></i></figcaption>
</figure>

There is another encoding scheme closely related to `UTF-8` called `UTF-16`. You'll typically find it being used on Windows systems and within Java applications.

### Variable Length Encoding / Digitalization and Internationalization

Whenever you save files to disk, or read files in, your first choice should be to try UTF-8. UTF-8 is an example of **variable-length encoding**. This means that, depending on the character, it takes 8, 16, 24, or 32 bits (1 to 4 bytes) to encode (represent) it.

On the other hand, `UTF-32` is a fixed-length encoding: every character always takes **32 bits**.
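
The difference is easy to see by counting bytes per character. This sketch uses Python's built-in codecs; `utf-32-be` is chosen instead of plain `utf-32` so that no byte-order mark is added to the count.

```python
byte_counts = {}
for char in ["a", "á", "€", "😍"]:
    utf8_len = len(char.encode("utf-8"))
    utf32_len = len(char.encode("utf-32-be"))
    byte_counts[char] = (utf8_len, utf32_len)
    print(char, utf8_len, utf32_len)

# a 1 4
# á 2 4
# € 3 4
# 😍 4 4
```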

#### Data Science Implications
UTF-8 should be your default encoding of choice when working with Big Data. Because the number of bits it takes to encode a character varies, it can be more storage-efficient on disk and more memory-efficient when the text is held in memory as encoded bytes.
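
As a rough illustration of the storage difference, the sketch below encodes the same English text three ways and compares the sizes. The sample sentence is made up for this example, and the result depends on the text: for content dominated by scripts such as Chinese, UTF-8 needs 3 bytes for most characters, so it is not always the smallest option.

```python
# A made-up, ASCII-only sample repeated to simulate a larger corpus
text = "Understanding text encodings is key. " * 1000

# The "-le" variants are used so that no byte-order mark is included in the size
sizes = {enc: len(text.encode(enc)) for enc in ["utf-8", "utf-16-le", "utf-32-le"]}

for enc, n_bytes in sizes.items():
    ratio = n_bytes / sizes["utf-8"]
    print(f"{enc}: {n_bytes} bytes ({ratio:.0f}x the UTF-8 size)")
```

For this ASCII-only sample, UTF-16 takes twice the space of UTF-8 and UTF-32 takes four times the space.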

Many machine-learning algorithms (like **batch and mini-batch gradient descent**) perform updates using batches of samples. If you choose a wasteful encoding, you may not be able to fit as many samples into each training batch as you'd like, which can mean your model requires significantly more training time and may perform worse.

In [None]:
#get_binary("I 😍 DSO 560", encoding="ascii")
#get_binary("I 😍 DSO 560", encoding="latin1")
get_binary("I 😍 DSO 560", encoding="utf8")