This material comes from [this Kaggle data cleaning challenge](https://www.kaggle.com/rtatman/data-cleaning-challenge-inconsistent-data-entry/?utm_medium=email&utm_source=mailchimp&utm_campaign=5DDC-data-cleaning). All of Days 1 to 5 are covered. We start by loading the modules, reading the data, and setting a seed to make things reproducible.

In [19]:
import pandas as pd
import numpy as np
from scipy import stats
from mlxtend.preprocessing import minmax_scaling
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import chardet
import fuzzywuzzy
from fuzzywuzzy import process
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
earthquakes = pd.read_csv("../input/earthquake-database/database.csv")
landslides = pd.read_csv("../input/landslide-events/catalog.csv")
volcanos = pd.read_csv("../input/volcanic-eruptions/database.csv")
np.random.seed(0) 

Looking at a random sample of rows is a quick way to spot missing data.

In [20]:
nfl_data.sample(5)

Now we will quantify how much data is missing. `isnull()` returns `True` for each missing cell, and `sum()` counts those `True` values per column, because booleans sum as 0 and 1. We only show the counts for the first ten columns.

In [21]:
missing_values_count = nfl_data.isnull().sum()
missing_values_count[0:10]

Let us check what proportion of our data is missing. `np.prod` multiplies the elements of `nfl_data.shape` (rows times columns) to give the total number of cells, and calling `sum()` on the per-column counts gives the total number of missing cells. The result is expressed as a percentage.

In [22]:
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()
(total_missing/total_cells) * 100
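
The same counting can be tried on a small frame where the answer is easy to check by eye. This is only a sketch: `toy` and its columns are made up for illustration and stand in for `nfl_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for nfl_data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": ["x", "y", "z", "w"],
})

missing_per_column = toy.isnull().sum()   # one count per column
total_cells = toy.size                    # rows * columns
total_missing = missing_per_column.sum()  # missing cells in the whole frame
percent_missing = total_missing / total_cells * 100

print(missing_per_column)
print(f"{percent_missing:.1f}% of cells are missing")
```

Here `toy.size` is used as a shorthand for the product of the two entries of `toy.shape`.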

Let's try removing every row that contains a missing value.

In [23]:
nfl_data.dropna()

Because every row has at least one missing value, every row was removed. Instead, let us remove the columns that contain missing data.

In [24]:
columns_with_na_dropped = nfl_data.dropna(axis=1)
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])
columns_with_na_dropped.head()
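
The difference between dropping rows and dropping columns is easier to see on a tiny example. The sketch below uses a made-up frame (`toy`, with hypothetical column names) in which every row has a gap but one column is complete.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row has at least one NaN, but "full" has none.
toy = pd.DataFrame({
    "full": [1, 2, 3],
    "gappy": [np.nan, 5.0, np.nan],
    "sparse": [7.0, np.nan, 9.0],
})

rows_kept = toy.dropna()          # default axis=0: drop rows containing any NaN
cols_kept = toy.dropna(axis=1)    # axis=1: drop columns containing any NaN

print(rows_kept.shape)
print(cols_kept.columns.tolist())
```

Dropping rows leaves nothing, as with `nfl_data`, while dropping columns keeps the one complete column.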

We will take a subset of the data to see how to impute missing values. One option is to replace them with 0.

In [26]:
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

In [27]:
subset_nfl_data.fillna(0)

You can also fill missing values from adjacent values. Back-filling (`bfill`) propagates the next non-null value backwards to fill in each missing value, column by column. Of course, if the last element of a column is missing there is nothing to propagate back, so it stays missing; chaining `fillna(0)` afterwards sets those leftovers to 0.

In [28]:
subset_nfl_data.bfill(axis=0).fillna(0)
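
A short sketch of what back-filling does, on a made-up two-column frame (`toy` is hypothetical). Note how the trailing gaps survive the back-fill and are only removed by the final `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in the middle and at the end of each column.
toy = pd.DataFrame({
    "x": [np.nan, 2.0, np.nan, np.nan],
    "y": [1.0, np.nan, 3.0, np.nan],
})

back_filled = toy.bfill()        # pull the next valid value upwards, per column
filled = back_filled.fillna(0)   # anything still missing becomes 0

print(back_filled)
print(filled)
```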

Now we will look at scaling the data. Min-max scaling maps the values onto the range 0 to 1 without changing the shape of the distribution, which helps when you want to compare variables on features other than their scale.

In [29]:
original_data = np.random.exponential(size = 1000)
scaled_data = minmax_scaling(original_data, columns = [0])
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
plt.show()
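
For intuition, min-max scaling can be written out by hand with NumPy. This is a minimal sketch under the assumption that the target range is 0 to 1; `min_max_scale` is a hypothetical helper, not the `mlxtend` implementation.

```python
import numpy as np

def min_max_scale(values):
    """Map values linearly so the smallest becomes 0 and the largest 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)
scaled = min_max_scale(sample)

print(scaled.min(), scaled.max())
print(min_max_scale([2, 4, 6]))
```

Because the mapping is linear, the histogram keeps its shape and only the axis changes.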

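Now we will try to transform a non-normal distribution toward a normal distribution with a Box-Cox transformation. Box-Cox raises each observation to a power, chosen so that the result looks as normal as possible; it requires strictly positive data. `stats.boxcox` returns both the transformed values and the fitted power, which is why the plot uses `normalized_data[0]`.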

In [30]:
normalized_data = stats.boxcox(original_data)
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")
plt.show()
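
The transformation itself is short enough to write down. The sketch below applies the Box-Cox formula for a power `lam` that is supplied by hand; `box_cox` is a hypothetical helper, and unlike `stats.boxcox` it does not estimate the power from the data.

```python
import numpy as np

def box_cox(values, lam):
    """Box-Cox transform for a given power lam (no fitting is done here)."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("Box-Cox needs strictly positive data")
    if lam == 0:
        # The limit of (x**lam - 1) / lam as lam goes to 0 is log(x).
        return np.log(values)
    return (values ** lam - 1.0) / lam

x = np.array([1.0, 2.0, 4.0])
print(box_cox(x, 0))
print(box_cox(x, 0.5))
print(box_cox(x, 1))
```

With `lam = 1` the data is only shifted by one, so powers further from 1 reshape the distribution more strongly.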

Now we will look at how to work with dates. The values look like dates, but the column's data type is `object` (shown as `dtype('O')`), which tells us that pandas is treating them as plain strings and not as dates.

In [31]:
print(landslides['date'].head())
landslides['date'].dtype

We can convert the column to datetimes with the `to_datetime()` function. The `format` argument tells pandas where the month, day, and year appear: here `%m/%d/%y` means month, day, and two-digit year separated by slashes. Passing `infer_datetime_format=True` instead asks pandas to guess the format, which can help with less regular data, but it does not always work and it makes the parsing slower.

In [32]:
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format = "%m/%d/%y")
landslides['date_parsed'].head()

Now with the parsed data, you can retrieve the day information for example.

In [33]:
day_of_month_landslides = landslides['date_parsed'].dt.day
day_of_month_landslides.head()
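
A small sketch of the same two steps on made-up date strings (`raw` is hypothetical and only mimics the month/day/two-digit-year layout used above):

```python
import pandas as pd

# Hypothetical strings in month/day/two-digit-year layout.
raw = pd.Series(["3/2/07", "3/22/07", "12/9/15"])

parsed = pd.to_datetime(raw, format="%m/%d/%y")

print(raw.dtype)      # plain strings before parsing
print(parsed.dtype)   # a datetime dtype after parsing
print(parsed.dt.day.tolist())
```

Once the column has a datetime dtype, the `.dt` accessor exposes parts such as `day`, `month`, and `year`.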

Let's see on which days of the month landslides happen most often.

In [34]:
day_of_month_landslides = day_of_month_landslides.dropna()
sns.distplot(day_of_month_landslides, kde=False, bins=31)
plt.show()

Now we will talk about encodings. As a rule, if you are using UTF-8 you will be fine. We will first look at a string (`str`).

In [35]:
before = "This is the euro symbol: €"
type(before)

You can also turn the string into a sequence of bytes (`bytes`) by encoding it. `errors="replace"` substitutes a replacement character for any character that cannot be represented in the chosen encoding.

In [41]:
after = before.encode("utf-8", errors = "replace")
type(after)

When you display it, Python shows the bytes as if they were ASCII text. But because the euro symbol has no ASCII equivalent, its UTF-8 bytes appear as escape sequences instead.

In [38]:
after

If you decode the byte data using utf-8, it will work fine. If you decode with ASCII, you will get an error.

In [40]:
print(after.decode("utf-8"))
#print(after.decode("ascii"))

ASCII cannot encode €, so encoding with ASCII and `errors="replace"` swaps it for a `?`. When we decode the bytes back with ASCII, the original € is gone and only the replacement remains.

In [42]:
before = "This is the euro symbol: €"
after = before.encode("ascii", errors = "replace")
print(after.decode("ascii"))
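
The three cases above (a clean round trip, a failed decode, and a lossy encode) can be put side by side using only the standard library. The variable names below are illustrative.

```python
text = "This is the euro symbol: €"

# Clean round trip: encode and decode with the same encoding.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes[-3:])   # the three bytes for the euro sign
round_trip = utf8_bytes.decode("utf-8")

# Failed decode: the euro bytes are not valid ASCII.
try:
    utf8_bytes.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True

# Lossy encode: the euro sign is replaced before it ever reaches the bytes.
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)
```

The lossy case is the dangerous one, since it raises no error and the original character cannot be recovered afterwards.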

Below, simply reading `ks-projects-201612.csv` gives an error, because `read_csv` decodes files as UTF-8 by default while this particular file was encoded with some other encoding. To find out which one, we can use `chardet.detect`. The `rb` mode opens the file for reading as binary, so `rawdata` yields raw bytes, and we pass the first 10000 bytes to the detector. It reports Windows-1252 with 73% confidence.

In [44]:
#kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

In [4]:
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')
kickstarter_2016.head()
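
To see why the default read fails, the situation can be reproduced with a tiny file written in Windows-1252. This is a sketch using only the standard library: the file contents are made up, and the encoding is tried by hand instead of being detected with `chardet`.

```python
import os
import tempfile

# Hypothetical CSV content with non-ASCII characters, saved as Windows-1252.
sample_text = "name,city\ncafé,Zürich\n"
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(sample_text.encode("cp1252"))
    path = handle.name

with open(path, "rb") as rawdata:
    raw = rawdata.read()

try:
    decoded = raw.decode("utf-8")
    used = "utf-8"
except UnicodeDecodeError:
    # The bytes are not valid UTF-8, so fall back to the encoding we suspect.
    decoded = raw.decode("cp1252")
    used = "cp1252"

os.remove(path)
print(used)
print(decoded)
```

A fallback like this only works when you already have a candidate encoding in mind, which is the gap `chardet` fills.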

You can save the data with the `to_csv()` function, which writes UTF-8 by default. You can download the file from the Output tab (the one you see before hitting the "Fork Notebook").

In [45]:
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")

Now let us take a look at the Pakistan suicide attacks data. Again we try to find its encoding.

In [46]:
with open("../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)

Using the found encoding we read in the data. Let us see what kind of data is in the City column. There are some that look really similar, for example ATTOCK and Attock. 

In [48]:
suicide_attacks = pd.read_csv("../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", 
                              encoding='Windows-1252')
cities = suicide_attacks['City'].unique()
cities.sort()
cities

We wish to unify the different notations that refer to the same city. As a start, we make all characters lower case and strip the whitespace from the beginning and end of each string. Now ATTOCK and Attock are no longer split; both became attock. Other splits have also been resolved. But some notations still seem to refer to the same city, for example d. i khan and d.i khan.

In [50]:
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
cities = suicide_attacks['City'].unique()
cities.sort()
cities
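
The effect of these two steps is easy to verify on a handful of made-up values (the `cities` series below is hypothetical and separate from the real column):

```python
import pandas as pd

# Hypothetical city names with case and whitespace inconsistencies.
cities = pd.Series(["ATTOCK", "Attock", " Lahore", "Lahore ", "D.I Khan"])

cleaned = cities.str.lower().str.strip()

print(cities.nunique(), "->", cleaned.nunique())
print(sorted(cleaned.unique()))
```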

To remove this kind of variation, we can use the `fuzzywuzzy` module. It quantifies how similar two strings are to each other: the more replacements or removals needed to turn one string into the other, the more dissimilar they are considered. Scores range from 0 to 100, where 100 means the strings match exactly and 0 means they have nothing in common. We find the 10 strings in `cities` most similar to d.i khan.

In [51]:
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches
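
To get a feel for such scores without installing anything, the standard library's `difflib` can serve as a stand-in. This is not the `token_sort_ratio` scorer used above, so the exact numbers may differ; `similarity` and `candidates` are hypothetical names.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

candidates = ["d. i khan", "d.g khan", "lahore", "d.i khan"]

# Rank the candidates from most to least similar to the target string.
ranked = sorted(candidates, key=lambda c: similarity("d.i khan", c), reverse=True)
for name in ranked:
    print(name, similarity("d.i khan", name))
```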

Since d.g khan turns out to be a different city, we only replace cities whose similarity score is at least 90 with d.i khan. We make a function for this. `df` is the dataframe that holds the `column` in which we replace close matches with `string_to_match`. `close_matches` is built by going through all elements of `matches` and keeping the strings whose similarity score is at least `min_ratio`. Remember that each element of `matches` is a pair: `match[0]` is the string and `match[1]` is its similarity score. Finally we replace every entry of the column that is a close match with `string_to_match`.

In [53]:
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # candidate strings: the distinct values in the column
    strings = df[column].unique()
    # the 10 candidates most similar to the target, as (string, score) pairs
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # keep only the strings that score at least min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    # overwrite the matching rows in place
    rows_with_matches = df[column].isin(close_matches)
    df.loc[rows_with_matches, column] = string_to_match
    print("All done!")
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
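
The role of the threshold can be illustrated with a self-contained variant. This sketch swaps in `difflib` from the standard library for the `fuzzywuzzy` scorer, so its scores are not the ones the notebook computes; `similarity` and `replace_close_matches` are hypothetical helpers.

```python
from difflib import SequenceMatcher

import pandas as pd

def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def replace_close_matches(series, target, min_ratio=90):
    """Return a copy of series with close matches of target replaced by it."""
    close = [s for s in series.unique() if similarity(target, s) >= min_ratio]
    # Keep values that are not close matches; put target everywhere else.
    return series.where(~series.isin(close), target)

cities = pd.Series(["d. i khan", "d.i khan", "d.g khan", "lahore"])
cleaned = replace_close_matches(cities, "d.i khan")

print(cleaned.tolist())
```

Unlike the function above, this variant returns a new series and leaves the input untouched, which makes it easier to compare the result with the original before committing to the change.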

Now if we check the City column we do not have separate notations for d.i khan.

In [54]:
cities = suicide_attacks['City'].unique()
cities.sort()
cities