Figure out encoding for a text file using chardet

https://chardet.readthedocs.io/en/latest/usage.html

In [1]:
import chardet

# Opening files

In text mode ('r'), Python decodes the file using the text encoding you give it. If you don't give one, Python uses a platform-dependent default.<BR>
read() then returns a str.

In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.

Also, in Python 3, universal newlines mode (translating between '\n' and platform-specific newline conventions so you don't have to care about them) is enabled by default for text-mode files on every platform, not just Windows.
    
<a href="https://stackoverflow.com/questions/9644110/difference-between-parsing-a-text-file-in-r-and-rb-mode">source</a>; see also <a href="https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files">official docs</a>.
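
The difference between the two modes can be sketched with a throwaway file. This is a minimal illustration, not part of the original notebook: the sample content and the names `sample_bytes`, `sample_path`, `raw`, and `text` are made up, and UTF-8 is assumed as the encoding.

```python
import os
import tempfile

# Write a small sample file to a temporary location (hypothetical content).
sample_bytes = "café,naïve\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

# Binary mode: read() returns the raw bytes exactly as stored on disk.
with open(sample_path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: read() returns a decoded str.
with open(sample_path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(sample_path)

print(type(raw).__name__, len(raw))    # bytes object, counted in bytes
print(type(text).__name__, len(text))  # str object, counted in characters
```

The two lengths differ because each accented character occupies two bytes in UTF-8 but counts as one character once decoded.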

# Open a CSV containing text and read the contents

In [3]:
with open('csv_of_text.csv','rb') as fraw: # "rb" = bytes mode
    file_content = fraw.read()

Take a look at what the variable is storing:

In [4]:
file_content

b'"what","won","needs","chicken","whispered","picture"\n"hungry","moon","ride","sleep","give","activity"\n"worry","attempt","poor","during","mistake","possibly"\n"sugar","edge","furniture","basic","plural","son"\n"balance","copper","broke","police","slave","discover"\n"compass","particularly","mice","floating","wrong","living"\n"pick","white","earn","shaking","store","something"\n'

Use `chardet` to determine the encoding of the file:

In [5]:
chardet.detect(file_content)

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
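
Once an encoding has been detected, the bytes can be decoded with it. The following is a rough sketch only: `decode_with_detection` is a hypothetical helper, the UTF-8 fallback is an assumption, and the result dict is copied by hand from the output above so the block runs without calling chardet.

```python
def decode_with_detection(raw, detection, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dict (hypothetical helper)."""
    # chardet reports an encoding of None when it cannot make a guess,
    # so fall back to an assumed default in that case.
    encoding = detection.get("encoding") or fallback
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(encoding, errors="replace")


# Result dict copied from the chardet.detect() output above.
detection = {"encoding": "ascii", "confidence": 1.0, "language": ""}
raw = b'"what","won","needs"\n'

text = decode_with_detection(raw, detection)
print(text)
```

Checking the `confidence` value before trusting the guess may also be worthwhile; the threshold to use depends on the data.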


Opening the file with 'r' instead leads chardet to complain:

In [6]:
with open('csv_of_text.csv','r') as fraw: # 'r' = text mode rather than bytes
    file_content = fraw.read()

In [7]:
file_content

'"what","won","needs","chicken","whispered","picture"\n"hungry","moon","ride","sleep","give","activity"\n"worry","attempt","poor","during","mistake","possibly"\n"sugar","edge","furniture","basic","plural","son"\n"balance","copper","broke","police","slave","discover"\n"compass","particularly","mice","floating","wrong","living"\n"pick","white","earn","shaking","store","something"\n'

Note the missing "b" prefix at the start of the output: this is a str, not a bytes object.

In [8]:
# the following causes a TypeError because a string is passed rather than a bytes object

chardet.detect(file_content)

TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
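
One way to avoid the error is to make sure raw bytes, not a decoded str, reach `chardet.detect`. A minimal sketch using `pathlib` from the standard library: `read_for_detection` is a hypothetical helper, the demo file stands in for `csv_of_text.csv`, and the chardet call is left as a comment so the block runs without the package.

```python
import tempfile
from pathlib import Path


def read_for_detection(path):
    """Return the file's raw bytes, the type chardet.detect() expects (hypothetical helper)."""
    return Path(path).read_bytes()


# Demo with a throwaway file standing in for csv_of_text.csv.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = Path(tmpdir) / "demo.csv"
    demo_path.write_bytes(b'"what","won"\n')
    data = read_for_detection(demo_path)

print(isinstance(data, (bytes, bytearray)))
# chardet.detect(data) can now be called without raising the TypeError.
```

Re-encoding an already decoded str to get bytes back is generally not a substitute, since the result reflects the encoding chosen for re-encoding rather than the one used on disk.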