**This notebook is derived from the tutorial on [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) by [Rachael Tatman](https://www.kaggle.com/rtatman) at [Kaggle](https://www.kaggle.com/)**

# Character Encoding

## Importing Libraries

In [5]:
import pandas as pd
import numpy as np
import chardet
import warnings
warnings.filterwarnings('ignore')

## What are encodings?
A **character encoding** is a specific set of rules for mapping raw binary byte strings (like: 01110011001100) to characters that humans can easily read (e.g. "Hello"). There are many different encodings, and if we try to read text with an encoding other than the one it was originally written in, we see scrambled text called "mojibake", e.g.:
> æ–‡å—åŒ–ã?? 

We might also end up with "unknown" characters. These are printed when there is no mapping between a particular byte and a character in the encoding that we're using to read the data, e.g.:
> ����������

Character encoding mismatches are less common than they used to be, but they are still a problem. There are many character encodings, but the most important one is UTF-8.
> UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default, and ideally all data should be in UTF-8 as well. The problem starts when data is not encoded in UTF-8.
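
Both failure modes described above, mojibake and unknown characters, can be reproduced in a few lines. The following is a minimal sketch; the sample text and the variable names are illustrative assumptions, not part of the original tutorial.

```python
# Sample text chosen for illustration (an assumption, not from the tutorial)
text = "文字化け"

# Encode the text as UTF-8: each of these four characters takes three bytes
raw = text.encode("utf-8")

# Mojibake: Latin-1 maps every byte to some character, so decoding "succeeds"
# but produces scrambled text
mojibake = raw.decode("latin-1")
print(repr(mojibake))

# Unknown characters: ASCII has no mapping for these bytes, so each one is
# replaced with U+FFFD, the Unicode replacement character
unknown = raw.decode("ascii", errors="replace")
print(unknown)

# Mojibake is reversible as long as no bytes were lost along the way
print(mojibake.encode("latin-1").decode("utf-8"))
```

The last line recovers the original text only because the Latin-1 round trip preserved every byte; once bytes have been replaced, the original cannot be recovered.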

- There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what text is by default.

In [8]:
before = "This is the rupees  symbol : ₹"

type(before)

str

- The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which encoding to use:

In [10]:
after = before.encode("utf-8", errors='replace')

type(after)

bytes

In [11]:
print(after)

b'This is the rupees  symbol : \xe2\x82\xb9'


If you look at a bytes object, you'll notice that it has a b in front of it, followed by what looks like text. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) <br>
The rupee symbol cannot be shown that way, so its three UTF-8 bytes are printed as the escape sequences "\xe2\x82\xb9" instead.
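
To see the "sequence of integers" behind the \xe2\x82\xb9 output, we can index into the bytes object. This is a small sketch that reuses the string from the cells above; the variable name last_three is an illustrative assumption.

```python
before = "This is the rupees  symbol : ₹"
after = before.encode("utf-8")

# The string has 30 characters, but the rupee symbol needs 3 bytes in UTF-8
print(len(before), len(after))        # 30 32

# Slicing bytes and converting to a list yields integers between 0 and 255
last_three = list(after[-3:])
print(last_three)                     # [226, 130, 185]

# The same three values in hexadecimal match the \xe2\x82\xb9 escapes
print([hex(b) for b in last_three])   # ['0xe2', '0x82', '0xb9']
```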

- We can convert "after" back into a string by decoding it with the correct encoding:

In [12]:
print(after.decode("utf-8"))

This is the rupees  symbol : ₹


However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. We need to tell Python the encoding that the byte string is actually supposed to be in.
>You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.

- To check this, try to decode our bytes as ASCII:

In [13]:
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 29: ordinal not in range(128)
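
Because this failure is an ordinary Python exception, it can be caught and inspected. The sketch below uses the standard attributes of UnicodeDecodeError (encoding, object, start, reason) to locate the offending byte; the variable names position and bad_byte are illustrative.

```python
after = "This is the rupees  symbol : ₹".encode("utf-8")

try:
    after.decode("ascii")
except UnicodeDecodeError as err:
    position = err.start              # index of the first undecodable byte
    bad_byte = err.object[position]   # the byte itself, as an integer
    print(f"{err.encoding} failed at byte {position}: {hex(bad_byte)} ({err.reason})")
```

The position reported here is a byte offset, which is useful when hunting for the problematic part of a larger file.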

We can also run into trouble if we use the wrong encoding to map from a string to bytes. Strings in Python 3 are Unicode text, and encode() uses UTF-8 by default, so if we try to encode them with a more limited encoding we'll run into problems.

For example, when we convert a string to bytes with encode(), we can ask for the bytes the text would have if it were in ASCII. Since our text isn't all ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, every character not in ASCII is replaced with a placeholder (a question mark). Then, when we convert the bytes back to a string, the placeholder is all that remains. The dangerous part is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

In [14]:
# start with a string
before = "This is the rupees  symbol : ₹"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the ASCII bytes back into a string
print(after.decode("ascii"))

This is the rupees  symbol : ?


We have lost the original underlying byte string: it has been replaced with the byte for the placeholder character.
This is not good for our data. The best practice is to convert all our text to UTF-8 as soon as we can and keep it in that encoding.
> **The best time to convert non-UTF-8 input to UTF-8 is when reading in files.**
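
The errors argument of encode() accepts handlers other than "replace". As a rough sketch, the default "strict" handler refuses to lose data, and "backslashreplace" keeps the code point visible as an escape sequence. Which handler is appropriate depends on your data; this only illustrates the difference, and the variable names are illustrative.

```python
before = "This is the rupees  symbol : ₹"

# "replace" silently swaps the rupee symbol for "?", which cannot be undone
lossy = before.encode("ascii", errors="replace")
print(lossy.decode("ascii") == before)   # False

# "strict" (the default) raises instead of discarding information
try:
    before.encode("ascii")
except UnicodeEncodeError as err:
    print("strict refused:", err.reason)

# "backslashreplace" keeps the code point (U+20B9) as a visible escape
escaped = before.encode("ascii", errors="backslashreplace")
print(escaped)
```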

## Reading in files with encoding problems 
[Dataset](https://bit.ly/2TK5Xn5)<br>
Most files are encoded in UTF-8, and this is what Python expects by default. But sometimes we may get an error like this:

In [15]:
dataset = pd.read_csv('globalterrorismdb_0718dist.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7127: invalid continuation byte

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. 
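
The trial-and-error approach might look like the sketch below. The helper name first_working_encoding, the candidate list, and the sample bytes are all hypothetical, chosen only for illustration. Note that a decode that succeeds is not proof that the encoding is right: Latin-1 (ISO-8859-1), for example, accepts any byte sequence.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return name
    return None


# Illustrative sample: "café" stored as Latin-1 is not valid UTF-8
sample = "café".encode("latin-1")
print(first_working_encoding(sample))

# The same text stored as UTF-8 is accepted by the first candidate
print(first_working_encoding("café".encode("utf-8")))
```

The order of the candidates matters: stricter encodings such as UTF-8 should come first, and permissive ones such as Latin-1 last.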

A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.

In [19]:
with open('globalterrorismdb_0718dist.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}


- So chardet is 73% confident that the data is encoded in 'ISO-8859-1'. Let's check whether that is correct:

In [21]:
dataset = pd.read_csv('globalterrorismdb_0718dist.csv', encoding="ISO-8859-1")
dataset.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


- The guess made by chardet appears to be right: our data loaded without errors.
> **What if the encoding chardet guesses isn't right?** Since chardet is basically just a guesser, sometimes it will guess the wrong encoding. One thing you can try is looking at more or less of the file and seeing if you get a different result and then try that.
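
One way to "look at more or less of the file" is to run chardet.detect on samples of different sizes and compare the guesses. The helper below is a sketch: the function name guess_encoding and the sample sizes are assumptions, and the demonstration writes a small temporary file with made-up rows so that it does not depend on the dataset.

```python
import os
import tempfile

import chardet


def guess_encoding(path, n_bytes):
    """Return chardet's guess based on the first n_bytes of the file."""
    with open(path, "rb") as rawdata:
        return chardet.detect(rawdata.read(n_bytes))


# Write a small Latin-1 encoded file to experiment with (illustrative data)
rows = "city,country\n" + "Bogotá,Colombia\nSão Paulo,Brazil\n" * 200
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(rows.encode("latin-1"))
    path = tmp.name

# Compare the guesses for samples of increasing size
results = {n: guess_encoding(path, n) for n in (100, 1000, 10000)}
for n, result in results.items():
    print(n, result)

os.remove(path)
```

If the guesses disagree, the candidate with the higher confidence from the larger sample is usually a reasonable one to try first, though it is still only a guess.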

## Saving file with UTF-8 encoding
The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:

In [22]:
dataset.to_csv("GTD-utf8.csv")

- Now you can use the UTF-8 encoded file directly.
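
To be explicit rather than rely on the default, the encoding can be passed to to_csv, and the saved file can be read back without an encoding argument to confirm that it is valid UTF-8. The sketch below uses a tiny stand-in DataFrame and a temporary directory rather than the real dataset; the file name is hypothetical, and index=False is an optional choice that avoids writing the row index as an extra column.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the real dataset (illustrative values)
small = pd.DataFrame(
    {"city": ["Bogotá", "São Paulo"], "country": ["Colombia", "Brazil"]}
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "small-utf8.csv")

    # Save with an explicit encoding; index=False skips the row-index column
    small.to_csv(path, encoding="utf-8", index=False)

    # Reading back with the default settings succeeds because the file is UTF-8
    reloaded = pd.read_csv(path)

print(reloaded)
```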