# Data cleaning: inconsistent data

This notebook is adapted from Kaggle's 5-Day Challenge.

The **goal** of this exercise is to clean inconsistent text entries. 

The **evaluation** of the assignment will be based on:

* Design process and thinking as a data engineer.
* Validation of knowledge on the different tools and steps throughout the process.
* Storytelling and visualisation of the insights.

Exercise **workflow**:

* Import dependencies & download dataset from [here](https://www.kaggle.com/zusmani/pakistansuicideattacks/download).
* Preliminary text pre-processing
* Matching of inconsistent data entries
    
Notes:

* Write your code into the `TODO` cells
* Feel free to choose how to present the results throughout the exercise, including which libraries (e.g., seaborn, bokeh) and/or tools (e.g., Power BI or Tableau) to use.

## Preamble
________

In [None]:
import pandas as pd
import numpy as np

import fuzzywuzzy
from fuzzywuzzy import process
import chardet

np.random.seed(0)

## Data
________


**TODO**

* Download dataset from [here](https://www.kaggle.com/zusmani/pakistansuicideattacks/download).
* Identify the encoding of the data in `filename`
* Read the CSV into the `suicide_attacks` variable using the correct encoding (the `chardet` module might come in handy).
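
As a hint for the encoding step, here is a minimal sketch of how the guess could be made. The helper name `detect_encoding` and the fallback list of encodings are assumptions made for illustration, not part of the original exercise. It relies on `chardet.detect`, which takes raw bytes and returns a dictionary with an `encoding` and a `confidence` entry, and falls back to trial decoding when `chardet` is not installed.

```python
def detect_encoding(raw: bytes) -> str:
    """Return a best-guess encoding name for a chunk of raw bytes."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(raw)  # dict with "encoding" and "confidence" keys
        if guess["encoding"]:
            return guess["encoding"]

    # Fallback (an assumption of this sketch): try a short list of common encodings.
    for candidate in ("utf-8", "windows-1252"):
        try:
            raw.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    return "latin-1"  # accepts any byte sequence, possibly with wrong characters


# Demo on in-memory bytes; for the exercise, read the bytes from the file instead:
#     with open(filename, "rb") as raw_file:
#         encoding = detect_encoding(raw_file.read(100_000))
sample = "Lahore, Karachi".encode("ascii")
print(detect_encoding(sample))
```

The guess is only a guess: it is worth checking the reported confidence and inspecting a few decoded rows before trusting it.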

In [None]:
filename = "?"

encoding = "foo"

suicide_attacks = pd.read_csv(filename, encoding=encoding)
suicide_attacks.info()

## Preliminary text pre-processing
___

**TODO**

* Clean the `City` column of inconsistencies.
* Normalize the `City` column: letter case, leading and trailing whitespace, etc.
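
The normalization described above can be sketched on plain strings as follows. The helper `normalize_city` and the sample values are hypothetical and only illustrate the idea; on the real column, the pandas string methods `.str.lower()` and `.str.strip()` do the equivalent work, or the helper can be applied with `Series.map`.

```python
import re


def normalize_city(name: str) -> str:
    """Lowercase a city name and tidy up its whitespace."""
    collapsed = re.sub(r"\s+", " ", name)  # collapse runs of whitespace into one space
    return collapsed.strip().lower()       # drop leading/trailing spaces, then lowercase


# Made-up entries that differ only in case and spacing.
raw_cities = ["Lahore", "lahore ", " LAHORE", "D.I  Khan"]
print(sorted({normalize_city(city) for city in raw_cities}))
```

After this step, entries that differed only in case or spacing collapse into a single value, which leaves the genuinely different spellings for the matching step.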

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

## Matching of inconsistent data entries
___

**TODO** 

* Verify there are no more inconsistencies in the `City` column.
* Feel free to use the [`fuzzywuzzy`](https://github.com/seatgeek/fuzzywuzzy) package to match and remove possible issues.

> **Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.
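
To make the notion of "changes away" concrete, here is a small pure-Python sketch of edit (Levenshtein) distance. It is an illustration of the idea only, not the implementation used by `fuzzywuzzy`, which reports similarity as a score rather than a raw count of edits. The names `edit_distance` and `rank_by_distance` and the sample city list are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def rank_by_distance(target, candidates):
    """Return (candidate, distance) pairs, closest first."""
    pairs = [(candidate, edit_distance(target, candidate)) for candidate in candidates]
    return sorted(pairs, key=lambda pair: pair[1])


print(edit_distance("apple", "snapple"))  # 2: add "s" and "n"
print(edit_distance("in", "on"))          # 1: replace "i" with "o"

# Made-up city spellings: the near-duplicate ranks first.
print(rank_by_distance("d.i khan", ["karachi", "d. i khan", "lahore"])[0])
```

Ranking the unique city names against a suspicious entry in this way surfaces likely duplicates, which can then be reviewed by hand before replacing them.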

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
# cast every entry to str so that join() accepts them
cities2 = cities.astype(str)

print("\n".join(cities2))