
# Unicode vs UTF

## Unicode is a character set that associates numbers with characters.

* Unicode is a standard that associates numbers with characters. There are 1,114,112 possible code points (U+0000 to U+10FFFF, so the largest code point is 1,114,111). At the moment only a little over 10% of them are actually assigned to characters.
* The Unicode standard explicitly separates the identity of a character (its *code point*, i.e. the number used to represent it) from its encoding (its *byte representation*, i.e. how that number happens to be stored in any given system).
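The size of the code space and the character/number mapping can be checked directly in Python. This is a minimal sketch; the variable names `max_code_point` and `total_code_points` are illustrative only.

```python
import sys

# The largest valid code point is U+10FFFF, so counting from 0 there are
# 0x110000 code points in total.
max_code_point = sys.maxunicode
total_code_points = max_code_point + 1
print(max_code_point)      # 1114111
print(total_code_points)   # 1114112

# chr() maps a code point to its character; ord() maps it back.
assert chr(3621) == 'ล'
assert ord('ล') == 3621

# Anything above U+10FFFF is not a code point and is rejected.
try:
    chr(0x110000)
except ValueError as err:
    print(f'chr(0x110000) failed: {err}')
```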

## UTF-8 is one way of expressing those numbers in bytes.

* UTF-8 and UTF-16 are different *encodings*, not *character sets*. They are each responsible for converting code points to bytes and they do it in different ways. 
* UTF-8 is by far the most common encoding for text on the web.
* The Unicode standard includes UTF-8 and UTF-16 as two of the different ways of encoding characters.
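To see that the encoding is separate from the code point, the same character can be encoded both ways and the bytes compared. A small sketch: `hex_bytes` is a helper written for this example, and `utf-16-be` is chosen so that no byte order mark is added.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return ' '.join(f'{b:02x}' for b in data)

for ch in 'Aล':
    utf8 = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-be')  # big-endian, without a byte order mark
    print(f'{ch}: U+{ord(ch):04X}  UTF-8 = {hex_bytes(utf8)}  UTF-16 = {hex_bytes(utf16)}')

# A: U+0041  UTF-8 = 41  UTF-16 = 00 41
# ล: U+0E25  UTF-8 = e0 b8 a5  UTF-16 = 0e 25
```

The code point is the same in both cases; only the bytes used to store it differ.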

### UTF-8 Format

| No. bytes | Bits available for code point | Byte 1| Byte 2| Byte 3| Byte 4|Used for|
|------|------|---|---|---|---|---|
|   1  | 7 |0xxxxxxx||||ASCII|
|   2  | 11|110xxxxx|10xxxxxx|||Most Latin-script alphabets plus Greek, Cyrillic, Hebrew, Arabic, etc.|
|   3  | 16|1110xxxx|10xxxxxx|10xxxxxx||Thai, Chinese, Japanese, Korean, most symbols, etc.|
|   4  | 21|11110xxx|10xxxxxx|10xxxxxx|10xxxxxx|Emojis, rarer Chinese characters, historic scripts, etc.|
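The table can be turned into a hand-written encoder. The sketch below is for illustration only (in practice `str.encode('utf-8')` does this); `utf8_encode` is a hypothetical name, and unlike Python's built-in encoder it does not reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if code_point < 0x80:        # fits in 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:   # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError('code point out of range')

# Compare against Python's own encoder for one character per row of the table.
for ch in 'AЖล😱':
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
    print(ch, len(utf8_encode(ord(ch))), 'byte(s)')
```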

* UTF-8 is a variable-length encoding. ASCII characters require only one byte, a deliberate choice for backwards compatibility with ASCII.

* A decoder "knows" how many bytes are being used for the current character from the leading bits of the first byte. For example, anything that starts with 0 is a 1-byte ASCII character, and anything that starts with 11110 is the first byte of a 4-byte character.

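* All UTF-8 bytes begin with 0, 10 or 11. Any that begin with 0 or 11 are the first byte of a sequence. Any that begin with 10 are continuation bytes, never the first byte of a sequence. Being able to tell these apart helps to increase the reliability of data transmission and sequencing, because a decoder that starts in the middle of a stream can always find the start of the next character.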
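The leading-bit rules above can be sketched as a small classifier. The function name `byte_kind` and its return labels are made up for this example.

```python
def byte_kind(b: int) -> str:
    """Classify a single UTF-8 byte by its leading bits."""
    if b >> 7 == 0b0:
        return 'single'        # 0xxxxxxx: 1-byte ASCII character
    if b >> 6 == 0b10:
        return 'continuation'  # 10xxxxxx: never the first byte
    if b >> 5 == 0b110:
        return 'lead-2'        # 110xxxxx: first of 2 bytes
    if b >> 4 == 0b1110:
        return 'lead-3'        # 1110xxxx: first of 3 bytes
    if b >> 3 == 0b11110:
        return 'lead-4'        # 11110xxx: first of 4 bytes
    return 'invalid'           # 11111xxx never appears in UTF-8

data = 'aЖล😱'.encode('utf-8')
for b in data:
    print(f'{b:08b}  {byte_kind(b)}')

# Counting the bytes that are NOT continuation bytes counts the characters.
char_count = sum(1 for b in data if byte_kind(b) != 'continuation')
print(char_count)  # 4
```

Because every character starts with exactly one non-continuation byte, a decoder dropped into the middle of a stream only has to skip forward past any `10xxxxxx` bytes to resynchronise.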

### Examples

A Thai character: ล (LO LING, U+0E25)


Code point in decimal:

In [0]:
ord('ล') 

3621

Code point in hexadecimal:

In [0]:
hex(ord('ล'))

'0xe25'



Code point in binary:

In [0]:
f'{ord("ล"):b}'

'111000100101'

Notice how the ord() function only gives the code point, not the full byte encoding. Let's look at the bytes themselves:

In [0]:
f'{int("ล".encode("utf-8").hex(), 16):b}'

'111000001011100010100101'

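Split into bytes, with the UTF-8 prefix bits in plain text and the code point bits in bold blue:

|Byte 1| Byte 2| Byte 3|
|---|---|---|
|1110<font color='blue'><b>0000</b></font>|10<font color='blue'><b>111000</b></font>|10<font color='blue'><b>100101</b></font>|

Read together, the blue bits are 0000111000100101: the 12-bit code point 111000100101, padded with leading zeros to fill the 16 bits that a 3-byte sequence provides.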
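Going the other way, the code point can be recovered from the actual bytes by masking off the prefix bits and joining what is left. A minimal sketch, with variable names chosen for this example:

```python
b1, b2, b3 = 'ล'.encode('utf-8')   # the three bytes 0xE0, 0xB8, 0xA5

# Mask off the prefix bits (1110, 10, 10) to keep only the code point bits.
payload1 = b1 & 0b00001111   # low 4 bits of byte 1
payload2 = b2 & 0b00111111   # low 6 bits of byte 2
payload3 = b3 & 0b00111111   # low 6 bits of byte 3
print(f'{payload1:04b} {payload2:06b} {payload3:06b}')   # 0000 111000 100101

# Shift each group into place and combine them: 4 + 6 + 6 = 16 bits.
code_point = (payload1 << 12) | (payload2 << 6) | payload3
print(code_point, hex(code_point))   # 3621 0xe25
assert chr(code_point) == 'ล'
```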

Here you can enter any character and see its code point and its byte encoding in UTF-8. Some characters to try:

* Ж - Cyrillic letter Zhe (2 bytes)
* ℕ - Natural numbers set symbol (3 bytes)
* 😱 - Scream emoji (4 bytes)

In [0]:
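c = input('Enter a single character: ')
print()
print(f'Character: {c}')
print()
# ord() returns the Unicode code point of the character
print('Code point')
print(f'Hex: {hex(ord(c))}')
print(f'Dec: {ord(c)}')
print(f'Bin: {ord(c):b}')
print()
# encode() produces the actual UTF-8 bytes for that code point
print('UTF-8 encoding')
bytesobj = c.encode("utf-8")
# Show all the bytes as one binary string; 08b keeps the leading 0 of a 1-byte ASCII character
print(f'{int(bytesobj.hex(), 16):08b}')
print(f'Number of bytes: {len(bytesobj)}')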

Enter a single character: ล

Character: ล

Code point
Hex: 0xe25
Dec: 3621
Bin: 111000100101

UTF-8 encoding
111000001011100010100101
Number of bytes: 3
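For a quick comparison without typing each character in, the same information can be printed in a loop. This is a sketch only; `describe` is a hypothetical helper, and it prints the bytes separated by spaces so that the prefix bits of each byte are easier to spot.

```python
def describe(ch):
    """Return the code point label, the UTF-8 bytes in binary, and the byte count."""
    encoded = ch.encode('utf-8')
    bits = ' '.join(f'{b:08b}' for b in encoded)
    return f'U+{ord(ch):04X}', bits, len(encoded)

for ch in 'AЖℕ😱':
    label, bits, n = describe(ch)
    print(f'{ch}  {label:<8} {n} byte(s)  {bits}')
```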
