<a href="https://colab.research.google.com/github/yihaozhong/479_data_management/blob/main/%E2%80%9Cunicode_bytes_strings_slides_ipynb%E2%80%9D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unicode, Encoding, Bytes and Strings



## How Does a Computer Store Data?

Bits! ... 0's and 1's.

How do we encode numbers into 0's and 1's... what about characters?

## Straightforward for Numbers: Binary!

* use some set number of bits to represent:
    * whole numbers... or floating point numbers
* maybe some bits can be reserved for:
    * where to place decimal point
    * positive or negative
* need to represent more values? add more bits!

## A Refresher

__Do you remember how binary numbers work?__ &rarr;

* a __binary number__ is composed of bits
* a __bit__ can contain a 0 or 1
* a __byte__ is 8 bits
* in a byte (assuming the usual bit ordering), the least significant bit (the last, rightmost bit) is the 1's place (2^0), followed by the 2's, 4's, 8's, 16's, etc. as you move left
* the most significant bit in a byte is the 128's place (2^7)
* to calculate the value in base 10, sum the products of the bits and their place values

## An Example

__What is `10000011` in decimal?__ &rarr;


```
1   | 0   | 0   | 0   | 0   | 0   | 1   | 1
----+-----+-----+-----+-----+-----+-----+-----
2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0
128 | 64  | 32  | 16  | 8   | 4   | 2   | 1
```

```
(1 * 128) + (1 * 2) + (1 * 1)
```

```
10000011 = 131
```
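The place-value sum above can be sketched in a few lines of Python. The function name `binary_to_decimal` is made up for illustration; in practice `int(bits, 2)` does the same job.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum each bit times its place value (a power of 2)."""
    total = 0
    # reversed() lets us start at the least significant (rightmost) bit
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total

print(binary_to_decimal('10000011'))  # 131
```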

## Some Useful Tools

You can use `int` and `format` to convert binary to decimal and decimal to binary:

In [None]:
int('10000011',2)

131

In [None]:
int('100', 2) 

4

In [None]:
format(4, 'b')

'100'

## Converting to Binary, Octal, and Hexadecimal

In [None]:
bin(131)

'0b10000011'

In [None]:
int('10000011',2)

131

In [None]:
oct(131)

'0o203'

In [None]:
int('203',8)

131

In [None]:
hex(131)

'0x83'

In [None]:
int('83',16)

131

Poll Everywhere Question

1001 Decimal is what in Hex?

* 4a2
* 3e7
* 3e9
* b88

## For Text: Map Numbers to Characters

Map numbers representing ASCII codes to corresponding characters...

* ASCII: American Standard Code for Information Interchange
* for example: 65 -> A
* you may have a table of mappings from code points to characters (something like [http://www.asciitable.com](http://www.asciitable.com))
* those mappings have to be encoded into (some number of) bits

## To convert back and forth, use ord and chr

In [None]:
# ord and chr will convert to and from a character and code
# point
print(ord('A'))
print(chr(65))

65
A


## ASCII

* ASCII is encoded using 7 bits (or 8 for extended!)
    * 7 bits gives 2^7 = 128 different values (8 bits gives 256)
    * good for Western languages without diacritical marks: a-z, A-Z, 0-9 etc., e.g. English
    * not so good even for French
    * not so good for languages that use different character sets
    * some languages (e.g. Chinese) contain thousands of different characters
    * not so good if you want to send 🤢 or other emoji
* ASCII is both the name of the mapping and name of the encoding
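To see those limits in practice, here is a small sketch that tries to encode a few strings as ASCII. The helper name `fits_in_ascii` is an assumption for this example, not a built-in.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character has an ASCII code (0 to 127)."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        # raised as soon as a character falls outside the 128 ASCII values
        return False

for text in ['English', 'français', '中文', '🤢']:
    print(text, fits_in_ascii(text))
```

Only `'English'` should come back as `True`; a single `ç` is enough to make the French example fail.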

## _Other Encodings_

* Because ASCII was limited, [many other encodings](https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings) were developed. 
* These encodings weren't guaranteed to have common mappings, even if they were meant to represent the same character set! 
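As a rough illustration of that problem, the sketch below decodes one and the same byte with two legacy encodings. The byte value `0xE9` and the two encodings (`latin-1`, `cp1251`) are just convenient picks for the example.

```python
raw = bytes([0xE9])  # a single byte with the value 233

# same byte, two different mappings
for encoding in ['latin-1', 'cp1251']:
    print(encoding, '->', raw.decode(encoding))
```

The byte comes out as `é` under latin-1 and as the Cyrillic `й` under cp1251, so the bytes alone don't tell you which character was meant.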

__What to do?__

## Code Points

A code point is a numerical value that a character set assigns to a character. For instance, 7-bit ASCII has 128 code points, from 0 to 7F (hex). 8-bit (extended) ASCII has 256, from 0 to FF (hex).

## Unicode

Unicode is the name of a mapping of _code points_ only (it does not specify encoding!). It can represent over 1 million characters! Everything from Cyrillic to all of your favorite emoji.
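A quick way to poke at code points from Python is sketched below. The sample characters are arbitrary picks; `sys.maxunicode` is the largest code point a Python `str` can hold.

```python
import sys

for ch in ['A', 'é', 'Я', '中', '😂']:
    # U+XXXX is the conventional way to write a code point in hex
    print(ch, ord(ch), f'U+{ord(ch):04X}')

# the highest code point available
print(sys.maxunicode, hex(sys.maxunicode))  # 1114111 0x10ffff
```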



## More Unicode
The links below show some tables. Code points may be represented in binary, decimal, and hexadecimal. Many tables use hexadecimal... but the resulting code point is still the same value.

* unicode.org has all the charts: [https://unicode.org/charts/](https://unicode.org/charts/)
* the first 128 characters are backward compatible with ASCII: [https://unicode.org/charts/PDF/U0000.pdf](https://unicode.org/charts/PDF/U0000.pdf)
* here are some emoji mappings if you want 'em 🙏: [https://unicode.org/emoji/charts/full-emoji-list.html](https://unicode.org/emoji/charts/full-emoji-list.html)



## Encodings for Unicode

Again, unicode is just the name of the mapping from code points to characters. Want to actually _encode_ a character? You have some choices:

* `utf-8`: variable length (1 to 4 bytes)
* `utf-16`: (2 bytes or 4 bytes)
* `utf-32`: (4 bytes)
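To compare the three side by side, here is a minimal sketch that counts the bytes each encoding needs for one character. The helper `encoded_sizes` is hypothetical, and the `-le` variants are used only so that no byte order mark gets counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed for one character in each Unicode encoding."""
    # the -le variants skip the byte order mark, so only the character is counted
    return {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}

for ch in ['A', 'é', '€', '😂']:
    print(ch, encoded_sizes(ch))
```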

## utf-8

Can store characters in 1 byte or as many as 4 bytes (variable length encoding)

* even though there are only 8 bits in 1 byte, can represent other unicode characters by adding additional bytes 
* the highest (leftmost) bits of the first byte specify whether or not other bytes should be combined with it
* for example, if left-most bit is 0, then character can be represented by a single byte
* if first bit is 1, then multiple bytes needed to represent character!
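You can check the leading bit yourself by printing the bytes of an encoded character in binary. This is only a sketch; `utf8_bits` is a made-up helper name.

```python
def utf8_bits(ch: str) -> list:
    """Each byte of the utf-8 encoding, shown as 8 binary digits."""
    return [format(byte, '08b') for byte in ch.encode('utf-8')]

print(utf8_bits('A'))   # ['01000001']              starts with 0: one byte
print(utf8_bits('é'))   # ['11000011', '10101001']  starts with 1: more bytes follow
```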

## utf-8, Multiple Bytes

* starting with 110xxxxx means two bytes needed
* starting with 1110xxxx means three bytes needed
* __see a pattern?__ &rarr;

## utf-8, Multiple Bytes

* the number of leading 1's in the first byte specifies the number of bytes used to represent the character
* additional / continuation bytes are prefixed with 10
* so, take the binary representation of a code point and fill in the x's
* for something that needs 4 bytes to represent:
    * `11110xxx` `10xxxxxx` `10xxxxxx` `10xxxxxx`
    * the first four 1's followed by a 0 mean 4 bytes
    * remaining 3 bytes are prefixed with 10
    * fill in x's with bits from binary representation of code point
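The rules above can be sketched as a tiny hand-rolled encoder. This is an illustration under simplifying assumptions, not how Python implements it: `utf8_encode` is a hypothetical name, and the sketch does not reject surrogate code points the way the real codec does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling its bits into the utf-8 patterns."""
    if code_point < 0x80:             # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:            # 110xxxxx 10xxxxxx
        lead, n_cont = 0b11000000, 1
    elif code_point < 0x10000:        # 1110xxxx 10xxxxxx 10xxxxxx
        lead, n_cont = 0b11100000, 2
    elif code_point <= 0x10FFFF:      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        lead, n_cont = 0b11110000, 3
    else:
        raise ValueError('not a valid Unicode code point')

    continuation = []
    for _ in range(n_cont):
        # take the lowest 6 bits and prefix them with 10
        continuation.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # whatever bits are left go into the x's of the first byte
    first = lead | code_point
    return bytes([first] + continuation[::-1])

for ch in 'Aé€😂':
    print(ch, utf8_encode(ord(ch)), ch.encode('utf-8'))
```

Each line should show the same bytes twice: once from the sketch, once from Python's built-in encoder.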

## Before We Go On... Strings, chr, ord, Bytes, and Strings vs Bytes

## Strings

From the docs: "The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal"... see below ...

In [None]:
s = "this is clearly a string"
s2 = "also a string ☃"
print(s2)
print(type(s2))

also a string ☃
<class 'str'>


In [None]:
s3 = "not sure 🙃"
print(s3)
print(type(s3))

not sure 🙃
<class 'str'>


In [None]:
for c in s3:
    print(c,ord(c))
chr(128579)

n 110
o 111
t 116
  32
s 115
u 117
r 114
e 101
  32
🙃 128579


'🙃'

## chr and ord

__What do these two built-in functions do again__?

They map unicode characters to and from unicode code points (numbers):
    
* `ord` - returns the unicode code point of the character passed in
* `chr` - returns the character that the code point passed in maps to

```
ord('A') # 65
chr(65) # A
```

These two functions handle mapping a number to a character.

Poll Everywhere Question

utf-8 uses 

* 2 bytes per character
* 3 bytes per character
* A variable number of bytes per character
* 8 bytes per character


## bytes Objects

__If we want to deal with how character data is actually stored in bits/bytes, we can work with a `bytes` object:__

From Python docs: "Bytes objects are immutable sequences of single bytes." ...

__one way to do this is with a sequence (for example, a list) of ints `0` - `255`__

In [None]:
b = bytes([67, 65, 66])
b

b'CAB'

In [None]:
for c in "CAB":
    print(ord(c))

67
65
66
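Two parts of that description are easy to test: "immutable" and "single bytes" (`0` - `255`). The sketch below is one way to see both; `bytearray` is the built-in mutable counterpart, shown only for contrast.

```python
b = bytes([67, 65, 66])

try:
    b[0] = 68            # bytes are immutable, so item assignment fails
except TypeError as e:
    print('TypeError:', e)

try:
    bytes([256])         # 256 does not fit in a single byte
except ValueError as e:
    print('ValueError:', e)

# bytearray is the mutable version of bytes
ba = bytearray(b)
ba[0] = 68
print(ba)  # bytearray(b'DAB')
```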


## bytes Can Also Simply be Created with Bytes Literals (Strings Prefixed with b)

In [None]:
for c in b"CAB":
    print(c)

67
65
66


In [None]:
b = b"hello"

In [None]:
b[0]

104

In [None]:
ord('h')

104

## Wait, Isn't That Just a String?

In [None]:
try:
    b + '!!!!!'
except TypeError as e:
    print(type(e), e)

<class 'TypeError'> can't concat str to bytes


In [None]:
type(b)

bytes
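To get the concatenation to work, both sides have to be the same type. A minimal sketch of the two options, assuming the bytes hold utf-8 text:

```python
b = b'hello'

# option 1: stay in bytes
print(b + b'!!!!!')                  # b'hello!!!!!'

# option 2: convert the bytes to a str first
print(b.decode('utf-8') + '!!!!!')   # hello!!!!!
```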

## Use `decode` Method to Convert to a String

__Interpret a series of bytes as utf-8 using a `bytes` object's `decode` method__ &rarr;

* it takes a single argument: the encoding to use when decoding the bytes

In [None]:
b = b'hello!'
print(b)
print(type(b))
s=b.decode('utf-8')
print(s)
print(type(s))
# ... works as you expect!

b'hello!'
<class 'bytes'>
hello!
<class 'str'>


## Now Let's Try utf-16

__What do you think will happen if we use `utf-16` as the encoding?__

```
b.decode('utf-16')
```

In [None]:
b.decode('utf-16')
# ... how about same bytes as utf-16

'敨汬Ⅿ'
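Why those characters? A sketch of what likely happened, assuming the bytes were read as little-endian utf-16 (the usual case when there is no byte order mark on an Intel/ARM machine): every two bytes get glued together into one 16-bit number, and that number is looked up as a code point.

```python
raw = b'hello!'

# utf-16-le reads two bytes at a time, lowest byte first
pairs = [raw[i:i + 2] for i in range(0, len(raw), 2)]
for pair in pairs:
    code_unit = int.from_bytes(pair, 'little')
    print(pair, hex(code_unit), chr(code_unit))

decoded = raw.decode('utf-16-le')
print(decoded)
```

So `b'he'` becomes the single number `0x6568`, which happens to be a CJK character.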


## What? Here's an actual example of utf-8: 

First, you can find a nice [explanation on Stack Overflow](https://stackoverflow.com/a/44568131).

__But for now, let's check out [https://www.fileformat.info/info/unicode/char/1f602/index.htm](https://www.fileformat.info/info/unicode/char/1f602/index.htm)__ &rarr;


## ([Tears of Joy](https://www.fileformat.info/info/unicode/char/1f602/index.htm))

    
* Its unicode code point, in decimal is: `128514`
* In binary, `128514` is `000011111011000000010` (21 bits including some 0 padding)
* This can't be represented in a single byte or even 3 bytes in utf-8... but we can do it with 4 bytes
* Here's the pattern for 4 bytes: `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`
* Breaking up our binary representation of the code point to fit in the x's above, we have: `000 011111 011000 000010`
* And, finally: `11110000 10011111 10011000 10000010`
    

## Let's prove that this works...

The following shows that we can encode and decode using utf-8 using the rules above.

* we can do it with a little bit of Python.
* (it's a bit much; just knowing it's _possible_ is fine)

👇

In [None]:
# here is tears of joy...
ch = '😂'
print(f'Tears of joy: {ch}\n============')

Tears of joy: 😂
============


In [None]:
# let's see the code point using ord
print(f'Let\'s see the utf-8 encoding of {ch}!\n----')
      
code_point = ord(ch)
print(f'code point for {ch} is: {code_point}')

Let's see the utf-8 encoding of 😂!
----
code point for 😂 is: 128514


In [None]:
# let's see the binary version of the code point

# this format specifier, 021b, means:
# * pad with 0's
# * there should be 21 total characters (# of x's based on pattern)
# * format as binary number
format_as_padded_bin = '021b'


# use the format specifier as nested variable in 
# format string after colon
padded_bin = f'{code_point:{format_as_padded_bin}}'
print(f'{code_point} in binary is: {padded_bin}')

128514 in binary is: 000011111011000000010


[Python f-strings](https://realpython.com/python-f-strings/)

In [None]:
# distribute into 4 bytes: UTF-8 FTW!
# fill bits into the x's in the pattern below:
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
a, b, c, d = padded_bin[:3], padded_bin[3:9], padded_bin[9:15], padded_bin[15:]
encoded = f'11110{a}10{b}10{c}10{d}'
print(f'encoded as utf-8, in binary: {encoded}')

encoded as utf-8, in binary: 11110000100111111001100010000010


In [None]:
# ok, let's test if this is encoded correctly by decoding it!
print(f'\nLet\'s go from utf-8 back to the actual character (decode), {ch}\n----')

# let's turn this into a sequence of bytes, with each byte shown in binary (as a string)
bytes_as_bin = [encoded[i: i + 8] for i in range(0, len(encoded), 8)]
print(f'these are our bytes as strings in a list: {bytes_as_bin}')



Let's go from utf-8 back to the actual character (decode), 😂
----
these are our bytes as strings in a list: ['11110000', '10011111', '10011000', '10000010']


In [None]:
# convert each string into an int, and use that to create a bytes object
# call decode on bytes object to get back character
# decode will decode a series of bytes using utf-8 
# (though you can specify an encoding as a keyword arg)
b = bytes([int(i, 2) for i in bytes_as_bin])
print('decode those bytes to get the original character:', b.decode())

decode those bytes to get the original character: 😂
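As a cross-check, Python's built-in `encode` should produce the same four bytes we assembled by hand. A short sketch:

```python
ch = '😂'
encoded_bytes = ch.encode('utf-8')
print(encoded_bytes)  # b'\xf0\x9f\x98\x82'

# same bytes, shown in binary to compare with the hand-built pattern
bits = [format(byte, '08b') for byte in encoded_bytes]
print(bits)
```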


🛑

## Addendum

* Most of the characters in unicode that are in common use (character sets from natural languages) are in the first 65,536 code points (called the _Basic Multilingual Plane_). 
* Most emoji exist above that, and require 4 bytes to represent in utf-8.
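A small sketch to check which side of that boundary a character falls on. The helper `in_bmp` is a made-up name, and the sample characters are arbitrary.

```python
def in_bmp(ch: str) -> bool:
    # the Basic Multilingual Plane covers U+0000 through U+FFFF
    return ord(ch) <= 0xFFFF

for ch in ['A', 'я', '中', '😂']:
    print(ch, in_bmp(ch), len(ch.encode('utf-8')))
```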

## Other Encoding Schemes: utf-16, utf-32

* If using mostly ASCII characters, then utf-8 is a great choice. 
* However, if using many characters that can only be encoded in more than one byte, utf-32 might be a better option. 
* utf-32 is a fixed width encoding that takes up four bytes per character


Poll Everywhere Question

What encoding does Python often default to?

* UTF-16
* UTF-32
* UTF-8
* ASCII

## Why use utf-8 (usually)?

* if most of your characters can be encoded in 1 byte; use it! 
* it saves space... (why use 4 bytes to represent `A` when you can use 1?)

## Why use utf-32 instead?

* if using lots of code points that require multiple bytes, it's a bit more complex decoding utf-8, since the number of bytes used per character has to be determined
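Here is a rough sketch of that trade-off: grabbing the character at index `n` straight from the encoded bytes. The sample text and variable names are made up for the example.

```python
text = 'naïve 😂 café'
n = 6  # we want the character at index 6

# utf-32: every character is 4 bytes, so jump straight to byte 4 * n
data32 = text.encode('utf-32-le')
from_utf32 = data32[4 * n: 4 * n + 4].decode('utf-32-le')

# utf-8: widths vary, so first find where each character starts
# (a byte that looks like 10xxxxxx is a continuation byte, not a start)
data8 = text.encode('utf-8')
starts = [i for i, byte in enumerate(data8) if (byte & 0b11000000) != 0b10000000]
start = starts[n]
end = starts[n + 1] if n + 1 < len(starts) else len(data8)
from_utf8 = data8[start:end].decode('utf-8')

print(from_utf32, from_utf8, len(data32), len(data8))
```

Both approaches find the same character, but utf-8 had to scan the bytes to do it, while utf-32 paid for the shortcut with a larger encoding.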

## Why Does This Matter?

__Why might knowing about encodings be useful?__ &rarr; 


* ...Sometimes you source a file, but you don’t know what encoding it is!
* (but how do you know what encoding it is?)

## Decoding a File

__How might you determine the encoding of a file?__ &rarr;

* If you have a series of bytes, you can decode with a scheme of your choice (utf-8, latin-1, etc.?)
* Automatic detection of encoding is tricky! (there's no standard for embedding the encoding in a file, and usually the encoding isn't specified!)
* Editors/viewers will use different strategies, but no guarantee guess will be right! 😮
* btw, some tools: `file` and `enca` to guess at encoding... sublime, atom, etc. to load in different encoding
* and, of course, Python can read files with different encodings (the default depends on your platform and locale; it is utf-8 on most Linux and macOS systems)
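One low-tech strategy is simply trying a few candidate encodings and looking at the results. The sketch below assumes a short list of candidates; `try_decodings` is a hypothetical helper, and the sample bytes are made up.

```python
def try_decodings(data: bytes, encodings=('utf-8', 'cp1251', 'latin-1')) -> dict:
    """Decode the same bytes with each candidate; None means it failed."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

sample = 'Той'.encode('cp1251')   # pretend we don't know this was cp1251
for enc, text in try_decodings(sample).items():
    print(enc, '->', text)
```

Note that `latin-1` never raises an error (every byte maps to something), so "it decoded" is not the same as "it decoded correctly"; a human still has to look at the output.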

## Example / Mystery!

Download this file in the same directory as your notebook:

[https://www.gutenberg.org/files/4909/old/olavg10.txt](https://www.gutenberg.org/files/4909/old/olavg10.txt) 

Try to figure out how to _read_ it correctly. 🕵

* open it in a text editor, what do you see?
* reopen, but change encoding in your text editor of your choice; does that fix things?
* note that most text editors, like sublime and atom, can be set to use a specific encoding
* choose CP1251 or Windows-1251

## Examining Our Text with Python
If you're unable to change the encoding, we can look at it with Python.

1. first as utf-8 (which causes an exception)
2. then as cp1251 (which shows us cyrillic)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# we try to read the file with open
# we explicitly ask for utf-8 (it's also the default on Colab)
# there are some invalid continuation bytes
# ... so we'll get an exception
fn='/content/drive/MyDrive/Colab Notebooks/olavg10.txt'
try:
    with open(fn,'r', encoding='utf-8') as f:
        print(f.read())
except FileNotFoundError as e:
    print('ERROR! please download https://www.gutenberg.org/files/4909/old/olavg10.txt into same directory as this notebook first. k thx bai.')
except UnicodeDecodeError as e:
    print('Cannot decode this file... we are trying utf-8, but it is not that.')
    print(e)

Cannot decode this file... we are trying utf-8, but it is not that.
'utf-8' codec can't decode byte 0xce in position 1494: invalid continuation byte


In [None]:
# now let's pass an encoding to open so we can read the file with a specific encoding
try:
    with open(fn, encoding='cp1251') as f:
        lines = f.readlines()
        print('a line in our text:\n', lines[51][:100])
except FileNotFoundError as e:
    print('ERROR!!!!!! plz download https://www.gutenberg.org/files/4909/old/olavg10.txt into same directory as this notebook first. k thx bai.')

a line in our text:
 Той и сам не знае кога е роден, но като го запитат, казва, че сега е на 36 години. Родното му градче
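Once you know the real encoding, a common next step is to re-save the text as utf-8 so you don't have to guess again. This is a minimal sketch: `convert_to_utf8` and the file names are hypothetical, and a small stand-in file is used instead of the real download.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding):
    """Read a file in its original encoding and write it back out as utf-8."""
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)

# demo with a small stand-in file (not the actual Gutenberg text)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'sample_cp1251.txt')
dst_path = os.path.join(tmp_dir, 'sample_utf8.txt')
with open(src_path, 'wb') as f:
    f.write('Той и сам не знае'.encode('cp1251'))

convert_to_utf8(src_path, dst_path, 'cp1251')
with open(dst_path, encoding='utf-8') as f:
    print(f.read())
```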
