# Cleaning data with Python - Challenge Day 4 - Character Encodings

Time for [day 4 of Rachael's 5-day challenge][1]. This time, as you may have guessed from the title, it's character encodings. Quite boring, but very important...

[1]: https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings

In [2]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

### Your turn! Try encoding and decoding different symbols to ASCII and see what happens. I'd recommend $, \#, 你好 and नमस्ते but feel free to try other characters. What happens? When would this cause problems?

In [3]:
# start with a string
before = "This is the dollar symbol: $"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the ASCII bytes back into a Python string
print(after.decode("ascii"))

In [4]:
# start with a string
before = "This is the hash symbol: #"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the ASCII bytes back into a Python string
print(after.decode("ascii"))

In [5]:
# start with a string
before = "This is the copyright symbol: ©"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the ASCII bytes back into a Python string
print(after.decode("ascii"))

In [6]:
# start with a string
before = "This is the micro sign: µ"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the ASCII bytes back into a Python string
print(after.decode("ascii"))

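We're fine with the dollar and hash (pound sign in the US) characters because they're both in the [ASCII set][2]. Once we step outside of that, the problems begin: with `errors="replace"`, the copyright and micro signs each come back as `?`, and the original character is lost.

[2]: https://www.asciitable.com/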
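
To make the difference concrete, here is a minimal sketch comparing the error modes of `str.encode` on one of the strings above. The variable names are only for illustration; nothing here is specific to the challenge notebook.

```python
# Compare how str.encode handles a non-ASCII character under different error modes.
text = "This is the micro sign: µ"

# "replace" substitutes "?" for anything ASCII cannot represent
replaced = text.encode("ascii", errors="replace")
print(replaced.decode("ascii"))  # This is the micro sign: ?

# the default mode, "strict", raises instead of silently losing data
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as err:
    strict_failed = True
    print("strict mode raised:", err.reason)

# UTF-8 can represent the character, so the round trip is lossless
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)  # True
```

The silent `?` is the dangerous case: the code runs without complaint, and there is no way to recover the original character afterwards.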

### Your Turn! Trying to read in this file gives you an error. Figure out what the correct encoding should be and read in the file. :) `police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv")`

In [14]:
police_killings = pd.read_csv("../input/PoliceKillingsUS.csv")

Aye, that's thrown an error. pandas assumes UTF-8 unless told otherwise, and this file evidently isn't valid UTF-8. Let's get *chardet*ing and see if we can work out what encoding we need to use...

In [18]:
# look at the first ten thousand bytes to guess the character encoding
with open("../input/PoliceKillingsUS.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

In [19]:
# read in the file with the encoding detected by chardet
policeKillings = pd.read_csv("../input/PoliceKillingsUS.csv", encoding='ascii')

# look at the first few lines
policeKillings.head()

Ah, well, that wasn't too successful. What if we increase the number of bytes that we're checking? Maybe we weren't giving chardet enough to work with for this particular dataset: if the first 10,000 bytes all happen to fall within the ASCII set, ASCII is probably what it's going to predict.
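
That hunch can be sketched without chardet at all. The snippet below builds made-up file contents (not the real dataset) where the only non-ASCII byte sits beyond the first 10,000 bytes, then checks what each sample size can actually see. It only illustrates why a short sample can mislead a detector; it makes no claim about how chardet works internally.

```python
# Hypothetical data: 5,000 pure-ASCII rows, then one row whose "é" is
# stored as the single Windows-1252 byte 0xE9.
ascii_rows = b"id,name\n" + b"1,John Smith\n" * 5_000
late_row = "2,José\n".encode("cp1252")
raw = ascii_rows + late_row

small_sample = raw[:10_000]    # same size as the first attempt
large_sample = raw[:100_000]   # same size as the second attempt

# the small sample contains nothing that rules out ASCII
print(small_sample.isascii())  # True
print(large_sample.isascii())  # False

# decoding the whole file as ASCII fails even though the small sample looked fine
try:
    raw.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
print(ascii_decode_failed)  # True
```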

In [22]:
# try with 100,000 bytes to see if we can correctly predict encoding
with open("../input/PoliceKillingsUS.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# check what the character encoding might be
print(result)

There we go: 73% confidence that we're dealing with Windows-1252 encoding. Let's give that a try:

In [17]:
# read in the file with the encoding detected by chardet
policeKillings = pd.read_csv("../input/PoliceKillingsUS.csv", encoding='Windows-1252')

# look at the first few lines
policeKillings.head()

So, something went a bit wrong with our chardet process at first, but we gave it a bit more to chew on and we got our answer.
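
Since a detector only gives a guess, one cheap sanity check is to see whether the bytes decode at all under that guess. This is a minimal sketch, not part of the original challenge: `decodes_cleanly` is a hypothetical helper, and the sample bytes are invented.

```python
def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if every byte in raw is valid under the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# invented sample: two accented names stored as Windows-1252 bytes
sample = "José Peña".encode("cp1252")

results = {name: decodes_cleanly(sample, name) for name in ("ascii", "utf-8", "cp1252")}
print(results)  # {'ascii': False, 'utf-8': False, 'cp1252': True}
```

A clean decode is necessary but not sufficient: Latin-1, for instance, accepts any byte sequence, so it's still worth eyeballing the result with `head()` as above.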

### Your turn! Save out a version of the police_killings dataset with UTF-8 encoding 

In [21]:
# save our file (will be saved as UTF-8 by default!)
policeKillings.to_csv("policeKillingsCb-utf8.csv")
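
The conversion itself can be sketched with the standard library alone, which makes it easy to confirm that the saved bytes really are UTF-8. The file names and sample text below are hypothetical stand-ins, written to a temporary directory rather than the real dataset.

```python
import os
import tempfile

text = "name,city\nJosé,Peña Blanca\n"

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy-cp1252.csv")  # hypothetical file names
    utf8_path = os.path.join(tmp, "converted-utf8.csv")

    # write the sample as Windows-1252, standing in for the original download
    with open(legacy_path, "w", encoding="cp1252", newline="") as f:
        f.write(text)

    # reading it back as UTF-8 fails on the accented characters
    try:
        with open(legacy_path, encoding="utf-8") as f:
            f.read()
        utf8_read_failed = False
    except UnicodeDecodeError:
        utf8_read_failed = True

    # read with the right encoding, then save as UTF-8
    with open(legacy_path, encoding="cp1252", newline="") as f:
        recovered = f.read()
    with open(utf8_path, "w", encoding="utf-8", newline="") as f:
        f.write(recovered)

    # the converted file now decodes cleanly as UTF-8
    with open(utf8_path, "rb") as f:
        converted = f.read().decode("utf-8")

print(utf8_read_failed, converted == text)  # True True
```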

And that wraps that up for day 4. Not an interesting lot of code, but potentially hugely useful.