# Character representation schemes

# ASCII
- each character is a 7-bit code (0-127), normally stored in one byte
- covers English letters, digits, punctuation, and control characters (NL, CR, TAB...)
- to see the ASCII table on Mac or Linux, run this in a shell window:
  - man ascii

```
       0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
       8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
      16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
      24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
      32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
      40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
      48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
      56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
      64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
      72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
      80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
      88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
      96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
     104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
     112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
     120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del
```
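The table above can be checked from Python itself. A minimal sketch using the built-in `ord` and `chr`; the variable names are only illustrative:

```python
# Map between ASCII characters and their integer codes.
codes = [ord(c) for c in "AZaz09"]               # [65, 90, 97, 122, 48, 57]
chars = "".join(chr(n) for n in (72, 105, 33))   # 'Hi!'

# Upper and lower case letters differ by exactly 32 (a single bit).
case_gap = ord("a") - ord("A")                   # 32

# Encoding pure-ASCII text yields one byte per character.
encoded = "Hi!\n".encode("ascii")

print(codes, chars, case_gap, list(encoded))
```

The final list ends in 10, the `nl` entry in the table.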

# type 'bytes'
- stores a sequence of 8-bit bytes
- when printed, printable ASCII values are shown as characters and everything else as \x escapes
- immutable


In [1]:
# the b prefix on the literal means bytes
# last two bytes are written as hex escapes (\x41 is 'A', \x5a is 'Z')
# indexing returns ints, not chars

b = b'foobar\x41\x5a'

[b, len(b), b[3], b[-1], type(b), type(b[0])]

[b'foobarAZ', 8, 98, 90, bytes, int]

In [2]:
# like 'str', 'bytes' is immutable

b[3] = 33

TypeError: 'bytes' object does not support item assignment

In [3]:
# similar functionality to the 'str' type we have been using

[a for a in dir(bytes) if not a.startswith('__')]

['capitalize',
 'center',
 'count',
 'decode',
 'endswith',
 'expandtabs',
 'find',
 'fromhex',
 'hex',
 'index',
 'isalnum',
 'isalpha',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
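A few of these methods in use, as a rough sketch. The sample data is made up for illustration; the point is that arguments must be 'bytes', not 'str'.

```python
data = b"key=value;mode=fast"

# split and partition take bytes arguments, not str
pairs = [item.partition(b"=") for item in data.split(b";")]
# [(b'key', b'=', b'value'), (b'mode', b'=', b'fast')]

# upper/lower only change ASCII letters
shout = data.upper()               # b'KEY=VALUE;MODE=FAST'

# hex and fromhex convert between bytes and hex text
as_hex = b"AZ".hex()               # '415a'
from_hex = bytes.fromhex("41 5a")  # b'AZ'
```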

# type 'bytearray'
- mutable version of 'bytes'

In [88]:
ba = bytearray(b)
[ba, len(ba), ba[-1], type(ba), type(ba[0])]

[bytearray(b'foobarAZ'), 8, 90, bytearray, int]

In [89]:
# mutable

ba[0] = ord('F')
ba

bytearray(b'FoobarAZ')

In [90]:
# stores ints, NOT characters

[ba[0], type(ba[0])]

[70, int]

In [91]:
ba[0] = 255

In [92]:
# in fact, only stores a subset of ints, 0-255

ba[0] = 256

ValueError: byte must be in range(0, 256)

In [86]:
[a for a in dir(bytearray) if not a.startswith('__')]

['append',
 'capitalize',
 'center',
 'clear',
 'copy',
 'count',
 'decode',
 'endswith',
 'expandtabs',
 'extend',
 'find',
 'fromhex',
 'hex',
 'index',
 'insert',
 'isalnum',
 'isalpha',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'pop',
 'remove',
 'replace',
 'reverse',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
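The extra methods compared with 'bytes' are the list-like mutating ones. A small sketch of a few of them, with made-up sample values:

```python
buf = bytearray(b"foo")

buf.append(ord("!"))   # append takes a single int in 0..255
buf.extend(b"bar")     # extend takes an iterable of such ints
last = buf.pop()       # removes and returns the last int (114, 'r')
buf.reverse()          # reverses in place

# convert back to an immutable 'bytes' when done editing
frozen = bytes(buf)    # b'ab!oof'
```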

# Limitations of ASCII
- ASCII defines only 128 codes, so 1 bit of each byte was not used; that left room for 128 more characters in extensions, but one byte is still not enough to represent all the characters in the world
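The limit is easy to hit in practice. A minimal sketch, where `try_ascii` is a hypothetical helper written just for this illustration:

```python
def try_ascii(text):
    """Return the ASCII bytes for text, or a short message if it doesn't fit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character outside 0..127
        return f"cannot encode {text[exc.start]!r}"

print(try_ascii("cafe"))
print(try_ascii("café"))
```

The first call returns bytes; the second reports the accented character, which has no ASCII code.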



# Unicode
- "universal character set"
- has room for over a million code points (1,114,112); well over 100,000 characters are assigned so far
- aims to cover the writing systems of every language on earth
    - somebody tried to add Klingon, but it was rejected
- each character represented by a unique integer
- [code charts](http://www.unicode.org/charts/)
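To make "a unique integer per character" concrete, the standard `unicodedata` module can report the official name for a code point. A small sketch; the three sample characters are arbitrary:

```python
import sys
import unicodedata

# (code point in U+XXXX form, official Unicode name) for a few characters
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in "Aø☂"]
for code_point, name in names:
    print(code_point, name)

# the largest code point Python's 'str' can hold
print(hex(sys.maxunicode))
```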

# Python 'str' type 
- stores Unicode characters, not ascii

# encodings
- 'encoding' is converting a unicode string into a byte array or stream (in some encoding)
- 'decoding' is converting a byte stream (in some encoding) into a unicode string
- there are several different encoding/decoding schemes
- Java strings use utf-16 internally
- web pages most often use utf-8
- the utf-8 encoding has the special property that if the unicode string is just ascii characters, the utf-8 encoding is the same as the ascii encoding
- when you move a unicode string OUT of python, you must ENCODE it into a sequence of bytes
- when you move a unicode string INTO python, you must DECODE it from a sequence of bytes
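The ASCII-compatibility property and the encode/decode round trip can be checked directly. A minimal sketch; the sample strings are arbitrary:

```python
text = "plain ascii"

# utf-8 and ascii produce identical bytes for pure-ASCII text
same_as_ascii = text.encode("utf-8") == text.encode("ascii")

# utf-16 does not share that property: every character takes 2 bytes
utf16_len = len(text.encode("utf-16-le"))

# round trip: encode on the way out, decode on the way in
wire = "naïve".encode("utf-8")   # 6 bytes, because 'ï' needs 2
back = wire.decode("utf-8")

print(same_as_ascii, utf16_len, len(wire), back)
```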


In [62]:
# 'Python' spelled in characters from different unicode blocks
# len is 6, which is the number of characters, not the number of bytes it takes to represent them

uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'
[type(uni), uni, len(uni)]

[str, 'ℙƴ☂ℌøἤ', 6]

# 'ord' maps a char into its unicode integer
# 'chr' maps a unicode integer into a char

In [76]:
# 3rd char is from 'dingbats'

[ ord('A'), chr(65), chr(0x2702)]

[65, 'A', '✂']

In [4]:
# three different encodings of uni 

utf8, utf16, utf32 = [uni.encode(et) for et in ['utf-8', 'utf-16', 'utf-32']]

In [5]:
# length of byte encoding varies with different encodings

[[len(u), type(u)] for u in [utf8, utf16, utf32]]

[[16, bytes], [14, bytes], [28, bytes]]
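Where those totals come from: a sketch that measures each character separately. It uses the explicit little-endian codecs so no byte order mark is added per character, then adds the mark once by hand.

```python
import codecs

uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'

# utf-8 is variable width: 2 or 3 bytes for these characters
utf8_sizes = [len(ch.encode('utf-8')) for ch in uni]       # [3, 2, 3, 3, 2, 3]

# utf-16 uses 2 bytes for each of these, utf-32 always uses 4
utf16_sizes = [len(ch.encode('utf-16-le')) for ch in uni]
utf32_sizes = [len(ch.encode('utf-32-le')) for ch in uni]

# the plain 'utf-16' and 'utf-32' codecs also prepend a byte order mark
totals = (
    sum(utf8_sizes),
    sum(utf16_sizes) + len(codecs.BOM_UTF16),
    sum(utf32_sizes) + len(codecs.BOM_UTF32),
)
print(totals)
```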

In [6]:
# utf8 is type 'bytes', not a str. 
# note b' prefix

[type(uni), type(utf8), utf8, utf16, utf32]

[str,
 bytes,
 b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4',
 b'\xff\xfe\x19!\xb4\x01\x02&\x0c!\xf8\x00$\x1f',
 b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00']
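The leading `\xff\xfe` in the utf-16 and utf-32 output is a byte order mark (BOM), which tells the decoder whether the bytes are little or big endian. In the output above the mark is the little-endian one, so `\x19!` is U+2119 with its low byte first. A short sketch using the BOM constants from the standard `codecs` module:

```python
import codecs

# explicit byte orders add no BOM
little = 'A'.encode('utf-16-le')   # b'A\x00'
big = 'A'.encode('utf-16-be')      # b'\x00A'

# the generic 'utf-16' decoder reads the BOM to pick the byte order
from_le = (codecs.BOM_UTF16_LE + little).decode('utf-16')
from_be = (codecs.BOM_UTF16_BE + big).decode('utf-16')

print(little, big, from_le, from_be)
```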

In [7]:
# decode converts bytes into unicode string

utf32.decode('utf-32')

'ℙƴ☂ℌøἤ'

In [8]:
utf8.decode('utf-8')

'ℙƴ☂ℌøἤ'

In [9]:
# to decode, you must know which encoding produced the bytes
# selecting the wrong decoder doesn't always raise an error
# sometimes you will just get a bogus string; here utf-8 rejects the bytes

utf32.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
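The silent failure case looks like this. A minimal sketch: latin-1 assigns a character to every byte value, so decoding with it never raises, and the wrong byte order for utf-16 also decodes without complaint.

```python
raw = 'ø'.encode('utf-8')          # b'\xc3\xb8'

# wrong decoder, no error, wrong text (one char became two)
wrong = raw.decode('latin-1')      # 'Ã¸'

# wrong byte order, no error, a completely different character
swapped = 'A'.encode('utf-16-le').decode('utf-16-be')   # '\u4100'

print(repr(wrong), repr(swapped))
```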

# Can do I/O in unicode or binary
- 'open' defaults to 'str' (unicode)
- pass 'b' flag to 'open' for 'bytes'(binary)


In [64]:
# won't work - a text-mode file expects a 'str', but utf8 is type 'bytes'
import os
import tempfile

# mkstemp creates the file and returns an open descriptor plus its path
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "w") as f:
    f.write(utf8)

TypeError: write() argument must be str, not bytes

In [65]:
# make a binary stream by adding 'b' flag to 'open'

with open(path, 'bw') as f:
    f.write(utf32)

In [66]:
# reading in 'str' mode uses the platform default encoding (utf-8 here),
# but the file we wrote is utf-32, so this read fails

# but, sometimes if you give open the wrong encoding, it will read
# w/o error and give you garbage!

with open(path, "r") as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

In [68]:
# tell 'open' the right unicode encoding

with open(path, "r", encoding='utf-32') as f:
    print(f.read())

ℙƴ☂ℌøἤ
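Text mode can also do the encoding on the way out. A sketch that passes `encoding` on both write and read, so nothing depends on the platform default; the directory and file name are made up for the example:

```python
import os
import tempfile

uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'uni.txt')   # hypothetical file name

    # text mode: pass a 'str', open() encodes it for us
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(uni)

    # binary mode shows the bytes that actually reached the disk
    with open(demo_path, 'rb') as f:
        raw = f.read()

    # text mode with the same encoding gives the original string back
    with open(demo_path, 'r', encoding='utf-8') as f:
        text = f.read()

print(raw == uni.encode('utf-8'), text == uni)
```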


In [69]:
# can read file bytes

with open(path, "rb") as f:
    b = f.read()
b

b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00'

In [70]:
utf32

b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00'

# ascii vs unicode
- ascii is easy, because storage media and networks handle bytes, and ascii is just bytes
- no byte order issues (big/little endian)
- unicode is harder, because
    - when writing to the network or storage, the unicode string must be ENCODED into a byte stream, in some format like utf-8, utf-16, etc
    - when reading from the network or storage, the byte stream must be DECODED into a unicode string, and the reader must somehow know which encoding was used
- because Python's 'str' is unicode, you are always
    - encoding as strings leave your program
    - decoding as strings enter your program
- if all you are using are ascii characters, and the encoding is ascii-compatible (like utf-8), then everything just works, without any special effort
- [standard text encoders](https://docs.python.org/3/library/codecs.html#standard-encodings)
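The encode-on-exit, decode-on-entry rule, sketched over a local socket pair standing in for a network connection. It assumes both ends have agreed on utf-8 in advance; in a real protocol that agreement has to come from somewhere (a header, a spec, a convention).

```python
import socket

message = '\u2119\u01b4\u2602\u210c\xf8\u1f24'

# two connected sockets in one process, standing in for a real network link
sender, receiver = socket.socketpair()
try:
    # leaving the program: str -> bytes
    sender.sendall(message.encode('utf-8'))
    sender.close()   # signals end of stream to the receiver

    chunks = []
    while True:
        chunk = receiver.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
finally:
    sender.close()
    receiver.close()

# entering the program: bytes -> str
# decode only after joining, since a chunk may end mid-character
received = b''.join(chunks).decode('utf-8')
print(received == message)
```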