This notebook follows the Kaggle Data Cleaning Challenges; the code and tasks are adapted from, and inspired by, the corresponding daily challenges.

## Day 1 : Handling Missing Values

See challenge [here](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values?utm_medium=email&utm_source=mailchimp&utm_campaign=5DDC-data-cleaning). I will make use of the San Francisco Building Permits dataset, which can be found [here](https://www.kaggle.com/aparnashastry/building-permit-applications-data/data).  This dataset contains various information about structural building permits in SF from 1 January 2013 until 25 February 2018. 

In [1]:
import numpy as np
import pandas as pd
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv", low_memory=False)
np.random.seed(0)

In [None]:
sf_permits.shape

I will now look at the number of missing values. Since the dataset contains quite a large number of columns, I will only look at the counts for the first 10 columns:

In [None]:
sf_permits.isnull().sum()[:10]

We can also compute the overall percentage of missing values. I will do this by dividing the total number of missing values by the total number of values in the dataset:

In [None]:
# np.prod replaces np.product, which is deprecated and removed in NumPy 2.0
sf_permits_prop_missing = (sf_permits.isnull().sum().sum()/np.prod(sf_permits.shape))*100
print('Percentage of missing values in sf_permits: {}'.format(sf_permits_prop_missing))
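
As a sanity check of the calculation above, the same formula can be applied to a toy DataFrame whose missing values are easy to count by hand. This is only a sketch: the frame `toy` and its contents are made up for illustration and are not part of the permits data. It also shows the per-column share, using the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows x 2 columns = 6 cells, 2 of them missing
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Overall percentage: total missing cells divided by total cells
total_missing = toy.isnull().sum().sum()
pct_missing = total_missing / toy.size * 100

# Per-column percentage: mean of the boolean mask, times 100
pct_missing_by_column = toy.isnull().mean() * 100

print(pct_missing)
print(pct_missing_by_column)
```

Here two of the six cells are missing, so the overall figure is one third, and each column is also missing one value out of three.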

As noted in the referenced notebook, some values are missing because they don't exist, while others are missing because they were not recorded. For example, the $\texttt{Street Number Suffix}$ column (e.g. the "A" in 1A Smith St.) has over 98% missing values, most likely because most addresses don't have a suffix. By contrast, every address has a zipcode, so the missing values in the $\texttt{Zipcode}$ column are most likely values that were not recorded.

In [None]:
print('Number of columns with no missing values: {}'.format(sf_permits.dropna(axis=1, inplace=False).shape[1]))

Consider a sample of seven rows. I will fill in each missing value with whatever value comes next in the same column:

In [None]:
sf_permits_sample = sf_permits.sample(n=7)
# .bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
sf_permits_sample.bfill(axis=0).fillna(0)  # fill remaining missing values with 0
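
To make the effect of backfilling easier to see, here is a minimal sketch on a single made-up column (the values are hypothetical, not taken from the permits data). Each gap takes the next valid value below it; the last row has nothing below it, which is why a second fill with 0 is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps at the start, in the middle, and at the end
toy = pd.DataFrame({"value": [np.nan, 2.0, np.nan, 4.0, np.nan]})

# Backfill: each gap takes the next valid value below it in the same column
backfilled = toy.bfill(axis=0)

# The last row has no value below it, so it is still missing; fill it with 0
filled = backfilled.fillna(0)

print(filled["value"].tolist())
```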

## Day 2 : Scaling and Normalization

See challenge [here](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data). In this section I will be looking at the Kickstarter dataset:

In [None]:
ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv')
ks.head()

I will now scale the $\texttt{goal}$ column to the range $[0, 1]$ using min-max scaling:

$$ x \mapsto \frac{x-x_{min}}{x_{max}-x_{min}}$$

In [None]:
import matplotlib.pyplot as plt

goal_original = ks['goal']
goal_scaled = (goal_original - goal_original.min())/(goal_original.max() - goal_original.min())
fig, ax = plt.subplots(1,2, figsize=(12,5))
ax[0].hist(goal_original, ec='black')
ax[0].set_title('Original Data')
ax[1].hist(goal_scaled, ec='black')
ax[1].set_title('Scaled Data')
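
The scaling formula can also be wrapped in a small helper and checked on a handful of numbers. This is a sketch only: `min_max_scale` is a hypothetical helper, the goal amounts are made up, and returning zeros for a constant input is my own assumption for avoiding a division by zero.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # All values identical: the formula would divide by zero (assumed fallback)
        return np.zeros_like(values)
    return (values - values.min()) / value_range

goals = [1000, 5000, 20000, 100000]  # made-up goal amounts
scaled = min_max_scale(goals)
print(scaled)
```

Scaling changes the range of the data but not the shape of its distribution, which is why the two histograms look alike apart from the axis.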

I will now normalize the $\texttt{pledged}$ column using the Box-Cox power transformation, defined below. For the parameter $\lambda$ I will use the value that maximizes the log-likelihood function:

$$ y_{i}^{(\lambda)} =
\begin{cases}
\dfrac{y_{i}^{\lambda}-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\
\ln(y_{i}), & \text{if}\ \lambda = 0
\end{cases}$$

where $y_{i} > 0$ for all $i$.

In [None]:
from scipy.stats import boxcox
import seaborn as sns

msk = ks.pledged>0
positive_pledges = ks[msk].pledged
normalized_pledges = boxcox(x=positive_pledges)[0]
fig, ax = plt.subplots(1,2, figsize=(12,5))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(positive_pledges, kde=True, stat='density', ax=ax[0])
ax[0].set_title('Original Positive Pledges')
sns.histplot(normalized_pledges, kde=True, stat='density', ax=ax[1])
ax[1].set_title('Normalized Positive Pledges')
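
To connect the Box-Cox formula to code, here is a minimal sketch that applies the transformation for a fixed $\lambda$ using NumPy only. Unlike `scipy.stats.boxcox`, it does not estimate $\lambda$; `box_cox_transform` is a hypothetical helper and the pledge values are made up.

```python
import numpy as np

def box_cox_transform(y, lam):
    """Apply the Box-Cox transform for a given lambda; y must be strictly positive."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

pledges = np.array([1.0, 10.0, 100.0, 1000.0])  # made-up positive pledges

log_case = box_cox_transform(pledges, 0)      # lambda = 0: natural log
shift_case = box_cox_transform(pledges, 1)    # lambda = 1: y - 1
near_zero = box_cox_transform(pledges, 1e-6)  # small lambda: close to ln(y)
```

The last line illustrates why the $\lambda = 0$ case is defined as the logarithm: it is the limit of the first branch as $\lambda$ approaches 0.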

## Day 3 : Parsing Dates

See challenge [here](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates/notebook). In this section I will be looking at an earthquake dataset:

In [None]:
quakes = pd.read_csv('../input/earthquake-database/database.csv')

# Check type of date column
quakes['Date'].dtype

The dtype of the $\texttt{Date}$ column is "object", which means pandas is storing the entries as plain strings and does not know that they represent dates. I will therefore convert the entries to $\texttt{datetime64}$.

In [None]:
quakes.Date.head()

I will use $\texttt{infer\_datetime\_format=True}$ since the date values are not all in the same format, for example:

In [None]:
quakes.loc[3378,'Date']

In [None]:
# Note: infer_datetime_format is deprecated as of pandas 2.0, where passing it has no effect
quakes['date_parsed'] = pd.to_datetime(quakes.Date, infer_datetime_format=True)
quakes.date_parsed.head()
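
If format inference is not available or not reliable in a given pandas version, one alternative is to parse in two explicit passes. The sketch below is an assumption-laden illustration, not the method used in the challenge: the sample strings are made up, and both layouts (month/day/year and an ISO 8601 style timestamp) are assumed for the purpose of the example. I pass `utc=True` in both passes so that the two results share the same timezone-aware dtype and can be combined.

```python
import pandas as pd

# Made-up sample mixing two assumed layouts
dates = pd.Series(["01/02/1965", "01/04/1965", "1975-02-23T02:58:41.000Z"])

# First pass: month/day/year layout; anything that does not match becomes NaT
primary = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce", utc=True)

# Second pass: retry only the failures with the ISO-like layout
failed = primary.isna()
fallback = pd.to_datetime(
    dates[failed], format="%Y-%m-%dT%H:%M:%S.%fZ", errors="coerce", utc=True
)

# Combine: fillna aligns on the index, so only the failed rows are filled
parsed = primary.fillna(fallback)
print(parsed)
```

Any value that matches neither layout stays `NaT`, which makes leftover problem rows easy to find with `parsed.isna()`.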

In [None]:
day_of_month = quakes['date_parsed'].dt.day

# Plot the days
plt.hist(day_of_month, bins=31, ec='black')

## Day 4 : Character Encodings

See challenge [here](https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings/). As described there, data can be lost if the wrong encoding is used to convert a string to bytes. As an example, consider a Python string encoded as UTF-8:

In [None]:
before = "This is an interesting text: 你好"

after = before.encode(encoding='UTF-8', errors='replace')
after.decode('UTF-8') # No issue here

In [None]:
after = before.encode(encoding='ASCII', errors='replace')
after.decode('ASCII') # Information is lost: each non-ASCII character becomes '?'
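
The `errors` argument decides what happens to characters the target encoding cannot represent. As a small sketch using only the standard library, the snippet below contrasts the default `errors='strict'`, which raises an exception, with `errors='replace'`, which silently substitutes a `?` for each unsupported character. The variable names are my own.

```python
text = "This is an interesting text: 你好"

# With the default errors="strict", an unsupported character raises an error
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace", each unsupported character silently becomes "?"
replaced = text.encode("ascii", errors="replace").decode("ascii")

# Count the characters that no longer match the original
lost = sum(1 for before, after in zip(text, replaced) if before != after)

print(strict_failed, replaced, lost)
```

The strict behaviour is often the safer default, since the replacement cannot be undone once the bytes are written.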

I will now use the $\texttt{chardet}$ module to detect the encoding of the Police Killings dataset, as it is not UTF-8 encoded:

In [2]:
import chardet

with open('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))  # Read first 10000 bytes.
result

In [None]:
# See if this is the right encoding
police_killings = pd.read_csv('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv', encoding='ascii')

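This does not work: the first 10,000 bytes were not enough for $\texttt{chardet}$ to detect the right encoding. I will read the first 100,000 bytes instead: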

In [None]:
with open('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

In [None]:
police_killings = pd.read_csv('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv', encoding='Windows-1252')

Now it works!
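
A guess from $\texttt{chardet}$ depends on the bytes it is shown, so it can help to cross-check it. The sketch below is a different and much cruder technique than detection: it simply tries a list of candidate encodings in order and returns the first one that decodes without error. `first_working_encoding` is a hypothetical helper, the candidate list is my own choice, and `cp1252` is Python's codec name for Windows-1252.

```python
def first_working_encoding(raw_bytes, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None

# Made-up sample: "café" stored as Windows-1252 bytes
sample = "café".encode("cp1252")
guess = first_working_encoding(sample)
print(guess)
```

A successful decode does not prove the encoding is correct, since single-byte encodings accept almost any byte sequence, so the decoded text should still be inspected.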

## Day 5 : Inconsistent Data Entries

See challenge [here](https://www.kaggle.com/rtatman/data-cleaning-challenge-inconsistent-data-entry/). In this part I'm going to do some text pre-processing on the $\texttt{PakistanSuicideAttacks Ver 11 (30-November-2017).csv}$ dataset (more information [here](https://www.kaggle.com/zusmani/pakistansuicideattacks)). As noted in the day 5 notebook, the dataset is $\texttt{Windows-1252}$ encoded. I will start by looking at inconsistencies in the City and Province columns:

In [10]:
suicide_attacks = pd.read_csv('../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv', encoding='Windows-1252')
cities = suicide_attacks['City'].unique()
cities.sort()
cities

We can see inconsistencies in the city names, e.g. 'karachi' & 'karachi ', 'South Waziristan' & 'South waziristan'. As suggested in the notebook, I will convert all names to lower case and strip leading and trailing whitespace:

In [13]:
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

I will do the same thing for the Province column:

In [15]:
provinces = suicide_attacks['Province'].unique()
provinces.sort()
provinces

In [16]:
suicide_attacks['Province'] = suicide_attacks['Province'].str.lower()
suicide_attacks['Province'] = suicide_attacks['Province'].str.strip()
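
To see how much these two steps reduce the number of distinct values, here is a minimal sketch on a made-up Series of names (not the actual column contents):

```python
import pandas as pd

# Made-up names with case and whitespace variants
names = pd.Series(
    ["Karachi", "karachi ", " KARACHI", "South Waziristan", "South waziristan"]
)

# Lower-case first, then strip leading and trailing whitespace
cleaned = names.str.lower().str.strip()

print(names.nunique(), "->", cleaned.nunique())
print(sorted(cleaned.unique()))
```

Note that `str.strip()` only removes whitespace at the two ends of a string, so differences in inner spacing or spelling are left untouched.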

There are still text inconsistencies in the City column, e.g. 'd.i khan' & 'd. i khan', 'kuram agency' & 'kurram agency'. I will use the [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) module to search for similar strings. As described in the notebook:

>$\textbf{Fuzzy matching}$: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.

>Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that have the closest distance to "d.i khan".
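
The notion of "changes away" in the quote corresponds to the edit (Levenshtein) distance. The sketch below is a plain-Python implementation for illustration only; `edit_distance` is a hypothetical helper, and fuzzywuzzy's scorers return a ratio between 0 and 100 rather than this raw count.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn source into target."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(edit_distance("apple", "snapple"))  # 2
print(edit_distance("in", "on"))          # 1
```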

Below is a function defined in the day 5 notebook. It takes an input string and replaces every entry in the given column whose fuzzy matching ratio with that string is at least $\texttt{min\_ratio}$ (90 by default) with the input string.

In [18]:
# import fuzz explicitly rather than relying on it being loaded as a side effect
from fuzzywuzzy import fuzz, process

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = process.extract(string_to_match, strings,
                              limit=10, scorer=fuzz.token_sort_ratio)

    # keep only the matched strings whose ratio is at least min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

I will apply this function to the City column to replace entries similar to 'd.i khan' and, separately, entries similar to 'kuram agency'.

In [19]:
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")

In [21]:
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="kuram agency")

In [22]:
cities = suicide_attacks['City'].unique()
cities.sort()
cities
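
To illustrate the threshold-and-replace idea without the dataset or the fuzzywuzzy dependency, here is a pure-Python sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer. Its scores are not the same as those of fuzzywuzzy's `token_sort_ratio`, and `similarity` and `replace_close_matches` are hypothetical helpers; the list of names is made up from the examples mentioned earlier.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score from 0 to 100 (stand-in scorer, not fuzzywuzzy's)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def replace_close_matches(values, target, min_ratio=90):
    """Replace every value whose similarity to target is at least min_ratio."""
    return [target if similarity(value, target) >= min_ratio else value
            for value in values]

names = ["d. i khan", "d.i khan", "karachi", "kuram agency", "kurram agency"]
cleaned = replace_close_matches(names, "d.i khan")
cleaned = replace_close_matches(cleaned, "kuram agency")
print(sorted(set(cleaned)))
```

A high threshold keeps clearly different names such as 'karachi' untouched, at the cost of missing variants that differ by more than a character or two.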