The first thing we'll need to do is load in the libraries we'll be using. Not our datasets, though: we'll get to those later!

In [None]:
# modules we'll use
import numpy as np
import pandas as pd

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)



Now we're ready to work with some character encodings! (If you like, you can add a code cell here and take this opportunity to take a look at some of the data.)


Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string in, and they look like this:

����������

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

    UTF-8 is the standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.
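Both kinds of scrambled text described above are easy to reproduce. Here's a minimal sketch, assuming a made-up sample string ("café", which isn't from any of the datasets we'll use) and borrowing the encode() and decode() methods that are explained a little further down:

```python
# a made-up sample string with one non-ASCII character
text = "café"

# mojibake: write the text as UTF-8, then read it back as if it were Latin-1
mojibake = text.encode("utf-8").decode("latin-1")
print(mojibake)

# unknown characters: write the text as Latin-1, then read it back as UTF-8,
# replacing any bytes that UTF-8 can't interpret
unknown = text.encode("latin-1").decode("utf-8", errors="replace")
print(unknown)
```

In the first case every byte still maps to some character, just the wrong one. In the second case there's no valid mapping at all, so Python substitutes the unknown character.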

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3 it's a lot simpler. (Kaggle Kernels only use Python 3.) There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what text is by default.

In [None]:
# start with a string
before = "This is the euro symbol: € "

# check to see what datatype it is
type(before)



The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which encoding to use:


In [None]:
# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)



If you look at a bytes object, you'll see that it has a b in front of it, and then maybe some text after. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) Here you can see that our euro symbol shows up as something that looks like mojibake, "\xe2\x82\xac". Those are the three bytes that encode the euro symbol in UTF-8, printed as escape codes because they can't be shown as ASCII characters.

In [None]:
# take a look at what the bytes look like
after
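To make the "sequence of integers" idea concrete, here's a small sketch that encodes a few made-up sample characters (my own picks, not from the datasets) and lists the integer value of each byte:

```python
# a few made-up characters that take up different numbers of bytes in UTF-8
for character in ["a", "é", "€"]:
    encoded = character.encode("utf-8")
    # list() shows the integer value of each byte
    print(character, len(encoded), list(encoded))

# the euro symbol on its own, for comparison with the output above
euro_bytes = "€".encode("utf-8")
```

One character in a string can turn into one, two or more bytes, which is why the length of a string and the length of its bytes often differ.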

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great! :)

In [None]:
# decode the bytes back into a string, using utf-8
print(after.decode("utf-8"))



However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

    You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.



In [None]:
# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))
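If you'd like to see exactly where decoding goes wrong instead of just getting a traceback, you can catch the error and look inside it. This is a small sketch that rebuilds the same bytes as above; the variable names are my own:

```python
# the same text as above, encoded as UTF-8
raw = "This is the euro symbol: € ".encode("utf-8")

try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    # the exception records which codec failed and where in the bytes
    failed_encoding = error.encoding
    first_bad_position = error.start
    bad_byte = raw[error.start:error.end]
    print(failed_encoding, first_bad_position, bad_byte)
```

The position it reports is a byte offset, counting from zero, which is handy later on when an error message tells you where in a file the first problem shows up.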



We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. Strings in Python 3 are Unicode text, and Python uses UTF-8 by default when encoding them, so if we try to treat them like they were in another encoding we'll create problems.

For example, if we try to convert a string to bytes for ASCII using encode(), we can ask for the bytes to be what they would be if the text was in ASCII. Since our text isn't in ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, any characters not in ASCII will just be replaced with the unknown character. Then, when we convert the bytes back to a string, the unknown character is all that's left in their place. The dangerous part about this is that there's no way to tell which character it should have been. That means we may have just made our data unusable!


In [None]:
# start with a string
before = "This is the euro symbol: €"


# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the bytes back into a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been
# replaced with the underlying byte string for the unknown character :(

This is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.

First, however, try converting between bytes and strings with different encodings and see what happens. Notice what this does to your text. Would you want this to happen to data you were trying to analyze?
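As a starting point for that experiment, here's one possible sketch. The list of encodings is just my own selection, and you can swap in any codec name Python knows about:

```python
before = "This is the euro symbol: €"

# encodings to experiment with (an arbitrary selection)
encodings = ["utf-8", "utf-16", "cp1252", "latin-1", "ascii"]

survived = {}
for encoding in encodings:
    # encode, replacing anything the encoding can't represent, then decode again
    round_trip = before.encode(encoding, errors="replace").decode(encoding)
    # record whether the text came back unchanged
    survived[encoding] = (round_trip == before)
    print(f"{encoding:>8}: {round_trip!r}")
```

Encodings that have a slot for the euro symbol give the text back unchanged. The ones that don't quietly swap it for a placeholder, and nothing in the result tells you what used to be there.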

# Reading in files with encoding problems


Most files you'll encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time you won't run into problems. However, sometimes you'll get an error like this:

In [None]:
# try to read in a file that isn't in UTF-8
kickstarter = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")



Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.

I'm going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. (Especially with a large file this can be very slow.) Another reason to just look at the first part of the file is that we can see by looking at the error message that the first problem is the 11th character. So we probably only need to look at the first little bit of the file to figure out what's going on.


In [None]:
# look at the first ten thousand bytes to guess the character encoding
with open('../input/kickstarter-projects/ks-projects-201612.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)



So chardet is 73% confident that the right encoding is "Windows-1252". Let's see if that's correct:


In [None]:
# read in the file with the encoding detected by chardet
kickstarter = pd.read_csv('../input/kickstarter-projects/ks-projects-201612.csv', encoding='Windows-1252')

# look at the first few lines
kickstarter.head()



Yep, looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes) and when we look at the first few rows it seems to be fine.

    What if the encoding chardet guesses isn't right? Since chardet is basically just a fancy guesser, sometimes it will guess the wrong encoding. One thing you can try is looking at more or less of the file and seeing if you get a different result and then try that.
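If chardet's guess doesn't pan out, another fallback is the trial-and-error approach: try a short list of likely encodings and keep the first one that decodes without an error. Here's a minimal sketch using only built-in Python. The function name, the candidate list and the sample bytes are all made up for illustration:

```python
def first_working_encoding(raw_bytes, candidates):
    """Return the first candidate that decodes raw_bytes without error, or None."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            # this encoding can't read the bytes, so move on to the next one
            continue
        return encoding
    return None

# made-up sample bytes standing in for the start of a file
sample = "café costs 3 €".encode("cp1252")
guess = first_working_encoding(sample, ["utf-8", "cp1252"])
print(guess)
```

Keep in mind that decoding without an error doesn't prove the encoding is right. Latin-1, for example, will accept any bytes at all, so it's worth putting stricter encodings like UTF-8 first and looking at the decoded text yourself.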



In [None]:
# Your Turn! Trying to read in this file gives you an error. Figure out
# what the correct encoding should be and read in the file. :)
police_killings = pd.read_csv('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv')

# Saving your files with UTF-8 encoding


Finally, once you've gone through all the trouble of getting your file into UTF-8, you'll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:

In [None]:
# save our file (will be saved as UTF-8 by default!)
kickstarter.to_csv('ks-projects-201612-utf8.csv')



Pretty easy, huh? :)

    If you haven't saved a file in a kernel before, you need to hit the commit & run button and wait for your notebook to finish running first before you can see or access the file you've saved out. If you don't see it at first, wait a couple minutes and it should show up. The files you save will be in the directory "../output/", and you can download them from your notebook.
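The same read-then-save idea works outside of pandas too. Here's a minimal sketch using plain open(), assuming a small made-up Windows-1252 file written to a temporary folder so that nothing real gets overwritten:

```python
import os
import tempfile

# made-up file names in a temporary folder
folder = tempfile.mkdtemp()
source_path = os.path.join(folder, "example-cp1252.csv")
target_path = os.path.join(folder, "example-utf8.csv")

# create a small Windows-1252 file to stand in for the original data
with open(source_path, "w", encoding="cp1252", newline="") as f:
    f.write("name,price\ncafé,3 €\n")

# read with the source encoding, then write back out as UTF-8
with open(source_path, "r", encoding="cp1252", newline="") as source:
    text = source.read()
with open(target_path, "w", encoding="utf-8", newline="") as target:
    target.write(text)

# the UTF-8 copy now reads back cleanly
with open(target_path, "r", encoding="utf-8", newline="") as f:
    converted = f.read()
print(converted)
```

Spelling out the encoding on every open() call means the result doesn't depend on whatever default the operating system happens to use.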



# Inconsistent Data Entry

In [None]:
# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process

When I tried to read in the PakistanSuicideAttacks Ver 11 (30-November-2017).csv file the first time, I got a character encoding error, so I'm going to quickly check out what the encoding should be...

In [None]:
# look at the first hundred thousand bytes to guess the character encoding
with open('../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# check what the character encoding might be
print(result)

And then read it in with the correct encoding.

In [None]:
suicide_attack = pd.read_csv('../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv', encoding='Windows-1252')


Now we're ready to get started! You can, as always, take a moment here to look at the data and get familiar with it. :)

# Do some preliminary text pre-processing


For this exercise, I'm interested in cleaning up the "City" column to make sure there are no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There's a more efficient way to do this though!

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attack['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities



Just looking at this, I can see some problems due to inconsistent data entry: 'Lahore' and 'Lahore ', for example, or 'Lakki Marwat' and 'Lakki marwat'.

The first thing I'm going to do is make everything lower case (I can change it back at the end if I like) and remove any whitespace at the beginning and end of cells. Inconsistencies in capitalization and trailing whitespace are very common in text data and you can fix a good 80% of your text data entry inconsistencies by doing this.


In [None]:
# convert to lower case
suicide_attack['City'] = suicide_attack['City'].str.lower()

# remove leading and trailing whitespace
suicide_attack['City'] = suicide_attack['City'].str.strip()
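To see how much these two small steps can do, here's a tiny sketch in plain Python. The list is a made-up sample based on the kinds of problems spotted above, not the full column:

```python
# made-up sample values with inconsistent capitalization and spacing
raw_cities = ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat"]

# lower-case each value and strip whitespace from both ends
clean_cities = [city.lower().strip() for city in raw_cities]

# compare the unique values before and after cleaning
print(sorted(set(raw_cities)))
print(sorted(set(clean_cities)))
```

Four apparently different values collapse into two once the capitalization and stray spaces are out of the way.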

Next we're going to tackle more difficult inconsistencies.


# Use fuzzy matching to correct inconsistent data entry


Alright, let's take another look at the city column and see if there's any more data cleaning we need to do.

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attack['City'].unique()


# sort them alphabetically and then take a closer look
cities.sort()
cities



It does look like there are some remaining inconsistencies: 'd. i khan' and 'd.i khan' should probably be the same. (I looked it up and 'd.g khan' is a separate city, so I shouldn't combine those.)

I'm going to use the fuzzywuzzy package to help identify which strings are closest to each other. This dataset is small enough that we could probably correct errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it’s fun! :) )

    Fuzzy matching: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.
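To make "number of changes" concrete, here's a minimal sketch of a textbook edit distance (often called Levenshtein distance). The function name is my own, and this is only meant to illustrate the idea; the scores fuzzywuzzy gives you are calculated differently.

```python
def edit_distance(source, target):
    """Count the single-character insertions, deletions and substitutions
    needed to turn source into target."""
    # previous_row[j] holds the distance between the part of source processed
    # so far and the first j characters of target
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution = previous_row[j - 1] + (source_char != target_char)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]

print(edit_distance("apple", "snapple"))
print(edit_distance("in", "on"))
```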

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that have the closest distance to "d.i khan".


In [None]:
# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)


# take a look at them
matches



We can see that two of the items in the cities are very close to "d.i khan": "d. i khan" and "d.i khan". We can also see that "d.g khan", which is a separate city, has a ratio of 88. Since we don't want to replace "d.g khan" with "d.i khan", let's replace all rows in our City column that have a ratio of 90 or higher with "d.i khan".

To do this, I'm going to write a function. (It's a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)


In [None]:
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_columns(df, column, string_to_match, min_ratio=90):

    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # find the rows in our dataframe that hold any of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # replace the values in those rows with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

Now that we have a function, we can put it to the test!

In [None]:
# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_columns(df=suicide_attack, column='City', string_to_match='d.i khan')

And now let's check the unique values in our City column again and make sure we've tidied up d.i khan correctly.

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attack['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Excellent! Now we only have "d.i khan" in our dataframe and we didn't have to change anything by hand.
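If you don't have fuzzywuzzy installed, you can try out the same threshold idea with difflib from Python's standard library. This is only a rough sketch: the helper names and the short city list are made up, and difflib's scores won't necessarily match the ones from fuzzywuzzy's token_sort_ratio.

```python
import difflib

def similarity(a, b):
    """Return a 0-100 similarity score based on difflib.SequenceMatcher."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def close_matches(target, candidates, min_ratio=90):
    """Return the candidates whose score against target is at least min_ratio."""
    return [c for c in candidates if similarity(target, c) >= min_ratio]

# made-up subset of city names, already lower-cased and stripped
sample_cities = ["d. i khan", "d.i khan", "d.g khan", "lahore"]
matched = close_matches("d.i khan", sample_cities)
print(matched)
```

With a cutoff of 90, "d.g khan" stays out of the matches here too, which is exactly the separation we wanted.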