Item 3: Know the Differences Between bytes and str

Things to Remember
- bytes contains sequences of 8-bit values, and str contains sequences of Unicode code points.
- Use helper functions to ensure that the inputs you operate on are the type of character
  sequence that you expect (8-bit values, UTF-8-encoded strings, Unicode code points, etc.).
- bytes and str instances can't be used together with operators (like >, ==, +, and %).
- If you want to read or write binary data to/from a file, always open the file using a
  binary mode (like 'rb' or 'wb').
- If you want to read or write Unicode data to/from a file, be careful about your system's 
  default text encoding. Explicitly pass the encoding parameter to open if you want to avoid
  surprises. 

In [None]:
a = b'h\x65llo' 
print(list(a))
print(a)


The \x escape sequence means the next two characters are interpreted as hexadecimal digits giving the byte's value, so \x65 is the byte 101, which is the ASCII code for 'e'.

In [None]:
a = 'a\u0300 propos'
print(list(a))
print(a)
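
To make the difference between code points and bytes concrete, the sketch below encodes a short string as UTF-8 and compares the lengths. The variable names are illustrative only, and UTF-8 is an assumed encoding chosen for the example.

```python
# 'a' followed by U+0300 (combining grave accent) renders as a single glyph,
# but the str holds two separate code points.
text = 'a\u0300'
encoded = text.encode('utf-8')  # assumption: UTF-8 is the target encoding

print(len(text))      # number of Unicode code points in the str
print(len(encoded))   # number of bytes after UTF-8 encoding
print(list(encoded))  # integer value of each byte
```

The str has two code points while the encoded bytes have three elements, because U+0300 takes two bytes in UTF-8.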

Unicode sandwich approach
- Do encoding and decoding of Unicode data at the furthest boundary of your interfaces
- The core of your program should use the str type containing Unicode data and
  should not assume anything about character encodings.
- Use helper functions like the following to convert between str and bytes and
  to ensure that the type of input values matches your code's expectations


In [None]:
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value # instance of str

print(repr(to_str(b'foo')))
print(repr(to_str('bar')))


- repr returns a printable representation of the given object.
- In the cases above it prints 'foo' and 'bar'. Both values are str instances,
  so each is enclosed in a pair of single quotes with no b prefix.
    

In [None]:
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value # instance of bytes
print(repr(to_bytes(b'foo')))
print(repr(to_bytes('bar')))
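
The Unicode sandwich described earlier could look roughly like the sketch below: decode once on the way in, work only with str in the middle, and encode once on the way out. The function name shout and the choice of UTF-8 are assumptions made for this example, not part of the original notes.

```python
def shout(raw_input):
    """Hypothetical handler: bytes in, bytes out, str in the middle."""
    text = raw_input.decode('utf-8')   # boundary in: bytes -> str (UTF-8 assumed)
    result = text.upper()              # core logic only ever sees str
    return result.encode('utf-8')      # boundary out: str -> bytes


wire_data = 'caf\u00e9'.encode('utf-8')  # pretend this arrived from a socket or file
response = shout(wire_data)
print(response)
```

Keeping the decode and encode calls at the edges means the core logic never has to know which encoding the outside world uses.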

Two big gotchas when dealing with raw 8-bit values and Unicode strings in Python
- bytes and str seem to work the same way, but their instances are not compatible
  with each other
- operations involving file handles (returned by the open built-in function) default
  to requiring Unicode strings instead of raw bytes

In [None]:
# + operator 
# this is fine
print(b'one' + b'two')
print('one' + 'two')

In [None]:
# you can't add str instances to bytes instances
b'one' + 'two' # raises TypeError

In [None]:
# comparison operators
# this is fine
assert b'red' > b'blue'
assert 'red' > 'blue'

In [None]:
# you can't compare a str instance to a bytes instance
assert 'red' > b'blue' # raises TypeError

In [None]:
# comparing bytes and str instances for equality will always evaluate to False
print(b'foo' == 'foo')
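
One way to get a meaningful equality check is to convert one side first so that both operands have the same type. The sketch below assumes the bytes hold UTF-8 text; the variable names are illustrative.

```python
raw = b'foo'
text = 'foo'

mixed = raw == text                          # bytes vs str: never equal
as_str = raw.decode('utf-8') == text         # both sides are str
as_bytes = raw == text.encode('utf-8')       # both sides are bytes

print(mixed, as_str, as_bytes)
```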

In [None]:
# the % operator works with format strings for each type
print(b'red %s' % b'blue')
print('red %s' % 'blue')

In [None]:
# Python doesn't know what text encoding to use to turn the str into bytes
print(b'red %s' % 'blue') # raises TypeError

In [None]:
# this won't cause an error, but the result is not what you would expect:
# internally the __repr__ method is called on the bytes instance,
# and its result (including the b prefix and quotes) is substituted for %s
print('red %s' % b'blue')
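
To get the intended text, decode the bytes before formatting. A minimal sketch, assuming the bytes contain UTF-8 data (the variable names are made up for the example):

```python
name = b'blue'

surprising = 'red %s' % name                 # repr of the bytes object is substituted
intended = 'red %s' % name.decode('utf-8')   # decode first (UTF-8 assumed)

print(surprising)
print(intended)
```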

In [None]:
# you can't write binary data to a file opened in text mode
with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5') # raises TypeError

In [1]:
# changing the open mode to 'wb' fixes the problem
with open('data.bin', 'wb') as f:  
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

In [None]:
import locale
print(locale.getpreferredencoding()) # my default encoding is cp1252

In [None]:
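# in text mode the binary data is decoded using the default encoding ('cp1252' here),
# so the read succeeds but returns a str of cp1252 characters, not the original bytes
# (on a system whose default encoding is UTF-8, this read raises UnicodeDecodeError instead)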

with open('data.bin', 'r') as f:
    data = f.read()
print(data)

In [None]:
# change the mode to 'rb', and it works as expected
with open('data.bin', 'rb') as f:
    data = f.read()
print(data) # b'\xf1\xf2\xf3\xf4\xf5'
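
When the file really does contain text, passing the encoding parameter to open removes the dependence on the system default. The sketch below is an illustration only: it writes the same five bytes to a temporary file (the temporary directory and variable names are assumptions for the example) and reads them back with two explicitly named encodings.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')
    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2\xf3\xf4\xf5')

    # Naming the encoding gives the same result on every machine.
    with open(path, 'r', encoding='cp1252') as f:
        as_cp1252 = f.read()

    # The same bytes are not valid UTF-8, so decoding them fails.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

print(ascii(as_cp1252))  # ascii() keeps the output printable on any console
print(utf8_ok)
```

Whether a text-mode read succeeds, and what it returns, depends entirely on the encoding in use, which is why spelling it out avoids surprises.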