# Understanding encoding when opening files

I found this quite helpful before getting stuck into thinking about what is going on here...
https://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3

Text documents can be saved with different encodings.
By default, Python 3 will attempt to decode a file as it is read, while Python 2 simply returns the raw bytes as a `str`:

In [2]:
# python3
with open('example_encoding_files/ascii.txt', 'r') as fh:
    print(fh.read())

"New England", "Massachusetts", "Boston", "SuperMart",
"Feb" , 2000000"New England", "Massachusetts", "Springfield", "SuperMart",
"Feb" , 1400000"New England", "Massachusetts", "Worcester", "SuperMart",
"Feb" , 2200000



In [4]:
%%python2
# python2
with open('example_encoding_files/ascii.txt', 'r') as fh:
    print(fh.read())

"New England", "Massachusetts", "Boston", "SuperMart",
"Feb" , 2000000"New England", "Massachusetts", "Springfield", "SuperMart",
"Feb" , 1400000"New England", "Massachusetts", "Worcester", "SuperMart",
"Feb" , 2200000



This is fine if the file you want to open is:
1. Text
2. Encoded in the same way as your system

You can find out what your system encoding is by:

In [5]:
%%bash
locale charmap

ANSI_X3.4-1968


or, if you are using Python 3:

In [6]:
import locale
locale.getpreferredencoding(False)

'ANSI_X3.4-1968'

(ANSI_X3.4-1968 being a particular version of ASCII)
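
To see which encoding `open` actually picked for a particular file, you can also ask the file object itself. The following is only a minimal sketch: it writes a throwaway temporary file (so the path is not one of the example files used here) and reads the `.encoding` attribute of the text-mode file object.

```python
import locale
import os
import tempfile

# Create a throwaway file so the sketch does not depend on any existing path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello')
    path = tmp.name

# Open in text mode without an explicit encoding and ask which one was chosen.
with open(path, 'r') as fh:
    chosen = fh.encoding

os.remove(path)

print('open() picked:     ', chosen)
print('locale preference: ', locale.getpreferredencoding(False))
```

The two names may differ in spelling or case (for example `UTF-8` and `utf-8`) while still referring to the same codec.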

**_If_** you are using Python 3 and you want to open a file with a different encoding, you have to specify the keyword argument `encoding="name of encoding"`. Most commonly you will want `'UTF-8'`. When `encoding` is omitted, Python 3 falls back to `locale.getpreferredencoding(False)`. Sadly, the built-in `open` in Python 2 has no `encoding` argument; there you have to use `codecs.open` or `io.open` instead.

In [7]:
with open('example_encoding_files/unicode.text', 'r', encoding='UTF-8') as fh:
    print(fh.read())

Braille:

  ⡌⠁⠧⠑ ⠼⠁⠒  ⡍⠜⠇⠑⠹⠰⠎ ⡣⠕⠌

  ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠙⠑⠁⠙⠒ ⠞⠕ ⠃⠑⠛⠔ ⠺⠊⠹⠲ ⡹⠻⠑ ⠊⠎ ⠝⠕ ⠙⠳⠃⠞
  ⠱⠁⠞⠑⠧⠻ ⠁⠃⠳⠞ ⠹⠁⠞⠲ ⡹⠑ ⠗⠑⠛⠊⠌⠻ ⠕⠋ ⠙⠊⠎ ⠃⠥⠗⠊⠁⠇ ⠺⠁⠎
  ⠎⠊⠛⠝⠫ ⠃⠹ ⠹⠑ ⠊⠇⠻⠛⠹⠍⠁⠝⠂ ⠹⠑ ⠊⠇⠻⠅⠂ ⠹⠑ ⠥⠝⠙⠻⠞⠁⠅⠻⠂
  ⠁⠝⠙ ⠹⠑ ⠡⠊⠑⠋ ⠍⠳⠗⠝⠻⠲ ⡎⠊⠗⠕⠕⠛⠑ ⠎⠊⠛⠝⠫ ⠊⠞⠲ ⡁⠝⠙
  ⡎⠊⠗⠕⠕⠛⠑⠰⠎ ⠝⠁⠍⠑ ⠺⠁⠎ ⠛⠕⠕⠙ ⠥⠏⠕⠝ ⠰⡡⠁⠝⠛⠑⠂ ⠋⠕⠗ ⠁⠝⠹⠹⠔⠛ ⠙⠑
  ⠡⠕⠎⠑ ⠞⠕ ⠏⠥⠞ ⠙⠊⠎ ⠙⠁⠝⠙ ⠞⠕⠲

  (The first paragraph of "A Christmas Carol" by Dickens)
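
For code that also has to run on Python 2, where the built-in `open` takes no `encoding` argument, one possible route is `io.open` (available from Python 2.6, and the same function as the built-in `open` on Python 3). This is a sketch under assumptions: the sample string and the temporary file are made up for illustration and are not the `unicode.text` file above.

```python
import io
import os
import tempfile

# A few Braille pattern code points, written as escapes so the source stays ASCII.
sample = u'Braille: \u280c\u2801\u2827\u2811'

# Make a throwaway file and write the sample to it as UTF-8.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='UTF-8') as fh:
    fh.write(sample)

# io.open decodes while reading, just like open(..., encoding=...) on Python 3.
with io.open(path, 'r', encoding='UTF-8') as fh:
    text = fh.read()

os.remove(path)
print(text == sample)
```

`codecs.open(path, 'r', encoding='UTF-8')` is the older way of doing the same thing on Python 2.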



If you don't know the encoding, or if the file is a binary, reading the file this way fails:

In [8]:
with open('/bin/true', 'r') as fh:
    print(fh.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 96: ordinal not in range(128)

But it's still possible to open the file in binary `'b'` mode in both Python 2 and 3.

In Python 2 you will get a (very ugly) string representing the file contents:

In [2]:
%%python2
with open('/bin/true', 'rb') as fh:
    data = fh.read()
    # print(data[:1000])  # uncomment to dump the raw bytes to the terminal

print(type(data))

<type 'str'>


And unless you know what that data is, you aren't likely to find it very useful.

In Python 3, however, reading in `'b'` mode returns a `bytes` object instead of a `str`:

In [10]:
with open('/bin/true', 'rb') as fh:
    data = fh.read()
    print(data[:100])

b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00>\x00\x01\x00\x00\x00p\x0f@\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00 Z\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x008\x00\x08\x00@\x00 \x00\x1f\x00\x06\x00\x00\x00\x05\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00@\x00@\x00\x00\x00\x00\x00@\x00@\x00\x00\x00\x00\x00\xc0\x01\x00\x00'
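
A `bytes` object behaves differently from a `str`, which is part of what makes it easier to work with. As a small sketch, the first few bytes of the output above are copied into a literal here (so it runs without `/bin/true` being present):

```python
# First eight bytes of the output above, copied in as a literal.
header = b'\x7fELF\x02\x01\x01\x00'

print(header[0])                      # indexing bytes gives an int: 127
print(header[1:4])                    # slicing gives bytes: b'ELF'
print(header[1:4].decode('ascii'))    # only decoding gives a str: ELF
print(header.startswith(b'\x7fELF'))  # compare with the ELF magic number: True
```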


If you know the encoding, you can decode the bytes yourself (functionally equivalent to the last time I showed you Braille):

In [12]:
with open('example_encoding_files/unicode.text', 'rb') as fh:
    data = fh.read()
    print(f"The type of the data before decoding is: {type(data)}")
    print(f"The type of the data after decoding is: {type(data.decode('UTF-8'))}")
    print(data.decode('UTF-8'))

The type of the data before decoding is: <class 'bytes'>
The type of the data after decoding is: <class 'str'>
Braille:

  ⡌⠁⠧⠑ ⠼⠁⠒  ⡍⠜⠇⠑⠹⠰⠎ ⡣⠕⠌

  ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠙⠑⠁⠙⠒ ⠞⠕ ⠃⠑⠛⠔ ⠺⠊⠹⠲ ⡹⠻⠑ ⠊⠎ ⠝⠕ ⠙⠳⠃⠞
  ⠱⠁⠞⠑⠧⠻ ⠁⠃⠳⠞ ⠹⠁⠞⠲ ⡹⠑ ⠗⠑⠛⠊⠌⠻ ⠕⠋ ⠙⠊⠎ ⠃⠥⠗⠊⠁⠇ ⠺⠁⠎
  ⠎⠊⠛⠝⠫ ⠃⠹ ⠹⠑ ⠊⠇⠻⠛⠹⠍⠁⠝⠂ ⠹⠑ ⠊⠇⠻⠅⠂ ⠹⠑ ⠥⠝⠙⠻⠞⠁⠅⠻⠂
  ⠁⠝⠙ ⠹⠑ ⠡⠊⠑⠋ ⠍⠳⠗⠝⠻⠲ ⡎⠊⠗⠕⠕⠛⠑ ⠎⠊⠛⠝⠫ ⠊⠞⠲ ⡁⠝⠙
  ⡎⠊⠗⠕⠕⠛⠑⠰⠎ ⠝⠁⠍⠑ ⠺⠁⠎ ⠛⠕⠕⠙ ⠥⠏⠕⠝ ⠰⡡⠁⠝⠛⠑⠂ ⠋⠕⠗ ⠁⠝⠹⠹⠔⠛ ⠙⠑
  ⠡⠕⠎⠑ ⠞⠕ ⠏⠥⠞ ⠙⠊⠎ ⠙⠁⠝⠙ ⠞⠕⠲

  (The first paragraph of "A Christmas Carol" by Dickens)
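
When the encoding is not known, reading the bytes first leaves room to try more than one. The sketch below is a hypothetical helper, not part of any library: `decode_with_fallback` and its list of candidate encodings are assumptions chosen for illustration. Bear in mind that a decode that succeeds does not prove the guess was right, only that the bytes happened to be valid in that encoding.

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1252')):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, but undecodable bytes become U+FFFD.
    return data.decode('utf-8', errors='replace'), 'utf-8 (lossy)'


# The same word encoded two different ways.
utf8_bytes = u'caf\u00e9'.encode('utf-8')
cp1252_bytes = u'caf\u00e9'.encode('cp1252')

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(cp1252_bytes))
```

Order matters: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any input.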



## Conclusions

<table>
<tr>
    <td> - </td>
    <td>Python 2.7 </td>
    <td>Python 3</td>
</tr>
<tr>
    <td> Open a file containing text encoded in the same way as your system </td>
    <td> OK </td>
    <td> OK </td>
</tr> 
<tr>
    <td> Open a file containing text encoded in another standard way </td>
    <td> Tricky </td>
    <td> Easy </td>
</tr>
<tr>
    <td> Open any file, including binaries </td>
    <td> Can be done, but you need to know how to decode them </td>
    <td> Opens as a byte-string which is slightly easier to use </td>
</tr>
    

</table>