# Encodings

Encodings are sets of rules that map the characters in a string to their binary representations. Python supports dozens of different encodings, listed in [the standard encodings table](https://docs.python.org/3/library/codecs.html#standard-encodings). Because early computing was largely English-centric, the first encodings, such as ASCII, mapped binary codes to the English alphabet.

The English alphabet has only 26 letters, but other languages have many more characters, including letters with accents, tildes and umlauts. As time went on, more encodings were invented to handle languages other than English. The Unicode standard, most commonly stored as utf-8, tries to provide a single scheme that can encompass all text.
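To make the idea concrete, here is a minimal sketch of how one string turns into different bytes under different encodings. The sample word and variable names are made up for illustration; only Python's built-in `str.encode` and `bytes.decode` are used.

```python
text = "niño"  # made-up sample containing one non-English letter

# The same characters become different bytes under different encodings
utf8_bytes = text.encode("utf-8")       # ñ takes two bytes in utf-8
latin1_bytes = text.encode("latin-1")   # ñ takes one byte in latin-1

# ASCII only covers unaccented English letters, so encoding fails
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the wrong rules does not always fail; it can silently
# produce garbled text instead
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, ascii_ok, garbled)
```

The last step is the important one for this exercise: reading bytes with the wrong encoding may raise an error, or it may quietly return the wrong characters.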

The problem is that it's difficult to know what encoding rules were used to make a file unless somebody tells you. The most common encoding by far is utf-8. Pandas will assume that files are utf-8 when you read them in or write them out.

Run the code cell below to read in the population data set.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/population_data.csv', skiprows=4)

Pandas should have been able to read in this data set without any issues. Next, run the code cell below to read in the 'mystery.csv' file.

In [21]:
df = pd.read_csv('../data/mystery.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

You should have gotten an error: **UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte**. This means pandas assumed the file was utf-8 encoded, but the very first byte (0xff) is not valid utf-8, so the read failed.
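The same error can be reproduced without pandas. The sketch below builds a few bytes of UTF-16 text that start with a byte order mark (`0xff 0xfe` for little-endian UTF-16) and tries to decode them as utf-8. The sample text is made up, and whether mystery.csv actually starts with such a mark is an assumption based only on the `0xff` in the error message.

```python
import codecs

sample_text = "id,name\n1,Ana\n"  # made-up CSV content

# Little-endian UTF-16 bytes, prefixed with the byte order mark (BOM)
utf16_bytes = codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le")
first_two = utf16_bytes[:2]

# Decoding as utf-8 fails on the very first byte
try:
    utf16_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

# Decoding with the matching codec recovers the text (the BOM is consumed)
decoded = utf16_bytes.decode("utf-16")

print(first_two)
print(error_message)
print(decoded == sample_text)
```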

Your job in the next cell is to figure out the encoding for the mystery.csv file.

In [4]:
# TODO: Figure out what the encoding of the mystery.csv file is
# HINT: pd.read_csv('mystery.csv', encoding=?) where ? is the string for an encoding like 'ascii'
# HINT: This link has a list of encodings that Python recognizes https://docs.python.org/3/library/codecs.html#standard-encodings

# Python has a file containing a dictionary of encoding names and associated aliases
# This line imports the dictionary and then creates a set of all available encodings
# You can use this set of encodings to search for the correct encoding
# If you'd like to see what this file looks like, execute the following Python code to see where the file is located
#    from encodings import aliases
#    aliases.__file__

from encodings.aliases import aliases

alias_values = set(aliases.values())

# TODO: iterate through the alias_values set, trying out the different encodings to see which one or ones work
# HINT: Use a try - except statement. Otherwise your code will produce an error when reading in the csv file
#       with the wrong encoding.
# HINT: In the try statement, print out the encoding name so that you know which one(s) worked.

In [22]:
df = pd.read_csv('../data/mystery.csv', encoding='cp1125' )

ParserError: Error tokenizing data. C error: Expected 64 fields in line 23, saw 65


In [23]:
works = []
for code in alias_values:
    try:
        df = pd.read_csv('../data/mystery.csv', encoding=code)
        print(f'encoding worked: {code}')
        works.append(code)
    except Exception:  # e.g. UnicodeDecodeError or ParserError
        print(f'try again, {code} did not work')

try again, cp1254 did not work
try again, euc_jp did not work
try again, hz did not work
try again, cp950 did not work
try again, cp1125 did not work
try again, cp1251 did not work
try again, base64_codec did not work
try again, bz2_codec did not work
encoding worked: cp1140
try again, cp932 did not work
try again, cp1258 did not work
try again, hp_roman8 did not work
try again, iso8859_6 did not work
try again, iso2022_jp_1 did not work
try again, cp424 did not work
try again, cp861 did not work
try again, iso2022_jp_2004 did not work
try again, zlib_codec did not work
try again, euc_jisx0213 did not work
try again, gb2312 did not work
try again, iso8859_15 did not work
encoding worked: utf_16
try again, hex_codec did not work
try again, iso2022_jp_ext did not work
try again, cp866 did not work
encoding worked: cp1026
try again, mac_greek did not work
try again, cp860 did not work
try again, iso8859_14 did not work
try again, iso8859_13 did not work
try again, koi8_r did not work
try 

In [24]:
works

['cp1140',
 'utf_16',
 'cp1026',
 'utf_16_be',
 'cp273',
 'cp037',
 'utf_16_le',
 'cp500']
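Eight encodings loaded without raising an error, but that does not mean all eight decode the text correctly. One way to narrow the list, sketched here, is to preview the first few characters under each candidate and see which one is readable. `preview_decodings` is a hypothetical helper, not part of pandas or the exercise, and the sample file is made up so the example runs on its own.

```python
import codecs
import os
import tempfile


def preview_decodings(path, encodings, n_chars=40):
    """Map each encoding to the first n_chars it decodes, or None on failure."""
    previews = {}
    for enc in encodings:
        try:
            # newline="" leaves line endings untouched so the preview is literal
            with open(path, encoding=enc, newline="") as f:
                previews[enc] = f.read(n_chars)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are invalid for this codec
            # LookupError: not a text encoding (e.g. base64_codec)
            previews[enc] = None
    return previews


# Made-up stand-in for the mystery file: a tiny CSV stored as UTF-16
sample = "id,name\n1,Ana\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_utf16.csv")
    with open(sample_path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
    previews = preview_decodings(sample_path, ["utf_16", "cp037", "utf_8"])

for enc, text in previews.items():
    print(f"{enc:>8}: {text!r}")
```

In the exercise, calling `preview_decodings('../data/mystery.csv', works)` should show which candidates yield readable text. Code pages such as `cp037` and `cp500` are EBCDIC encodings that assign a character to every possible byte, so they rarely raise a decoding error even when they are the wrong choice.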

## Conclusion

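There are dozens of encodings that Python can handle; however, Pandas assumes a utf-8 encoding. This makes sense since utf-8 is very common. However, you will sometimes come across files with other encodings. If you don't know the encoding, you have to search for it. Keep in mind that an encoding that reads without raising an error is not necessarily the correct one, so inspect the resulting data before trusting it.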
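Once the encoding is known, it can be passed explicitly when reading or writing. The following is a minimal sketch using the `encoding` parameter of `DataFrame.to_csv` and `pd.read_csv`; the data frame contents and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up data with non-English characters
df_out = pd.DataFrame({"name": ["José", "Zoë"], "score": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "names_utf16.csv")

    # Write the file as UTF-16 instead of the default utf-8
    df_out.to_csv(path, index=False, encoding="utf-16")

    # Reading with the default (utf-8) fails on the byte order mark
    try:
        pd.read_csv(path)
        default_read_ok = True
    except UnicodeDecodeError:
        default_read_ok = False

    # Reading with the matching encoding recovers the data
    df_in = pd.read_csv(path, encoding="utf-16")

print(default_read_ok)
print(df_in)
```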

Note, as always, there is a solution file for this exercise. Go to File->Open.

There is a Python library that can be of some help when you don't know an encoding: chardet. Run the code cells below to see how it works.