# Data cleaning: inconsistent data

This notebook is adapted from Kaggle's 5-Day Challenge.

The **goal** of this exercise is to clean inconsistent text entries. 

The **evaluation** of the assignment will consider:

* Design process and thinking as a data engineer.
* Validation of knowledge on the different tools and steps throughout the process.
* Storytelling and visualisation of the insights.

Exercise **workflow**:

* Import dependencies & download dataset from [here](https://www.kaggle.com/zusmani/pakistansuicideattacks/download).
* Preliminary text pre-processing
* Matching of inconsistent data entries
    
Notes:

* Write your code into the `TODO` cells
* Feel free to choose how to present the results throughout the exercise, including which libraries (e.g., seaborn, bokeh, etc.) and/or tools (e.g., PowerBI or Tableau) to use.

## Preamble
________

In [3]:
#!pip install bs4
#!pip3 install fuzzywuzzy

import pandas as pd
import numpy as np
import warnings
from bs4 import UnicodeDammit
import string
import fuzzywuzzy
from fuzzywuzzy import fuzz, process  # fuzz provides the scorers used below
import chardet

warnings.filterwarnings("ignore") 
np.random.seed(0)

## Data
________


**TODO**

* Download dataset from [here](https://www.kaggle.com/zusmani/pakistansuicideattacks/download).
* Identify the encoding of the data in `filename`
* Read the CSV into the `suicide_attacks` variable using the correct encoding (the `chardet` module might come in handy).

In [6]:
filename = "PakistanSuicideAttacks Ver 11 (30-November-2017).csv"

## Find file encoding
with open(filename, 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding



'windows-1252'
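
As a rough cross-check on the detector's suggestion, one can try a short list of candidate encodings in order and keep the first that decodes without errors. The sketch below is only an illustration: `first_clean_encoding` and its candidate list are hypothetical names, not part of the notebook, and a clean decode does not prove that the encoding is the right one.

```python
def first_clean_encoding(raw, candidates=("utf-8", "windows-1252")):
    """Return the first candidate encoding that decodes `raw` without errors."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next one
        return name
    return None  # none of the candidates fit


# 0xE9 is "é" in windows-1252 but an incomplete sequence in UTF-8.
sample = "café".encode("windows-1252")
print(first_clean_encoding(sample))
```

On the real file, the same function could be called on the `content` bytes read above. The order of the candidates matters, since permissive single-byte encodings accept almost any input.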

In [7]:
encoding = "windows-1252"

suicide_attacks = pd.read_csv(filename, encoding=encoding)
suicide_attacks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   S#                       496 non-null    int64  
 1   Date                     496 non-null    object 
 2   Islamic Date             342 non-null    object 
 3   Blast Day Type           486 non-null    object 
 4   Holiday Type             72 non-null     object 
 5   Time                     285 non-null    object 
 6   City                     496 non-null    object 
 7   Latitude                 493 non-null    float64
 8   Longitude                493 non-null    object 
 9   Province                 496 non-null    object 
 10  Location                 493 non-null    object 
 11  Location Category        461 non-null    object 
 12  Location Sensitivity     460 non-null    object 
 13  Open/Closed Space        461 non-null    object 
 14  Influencing Event/Event  1

## Preliminary text pre-processing
___

**TODO**

* Clean the `City` column for inconsistencies
* Normalize the `City` column for upper or lowercase, spaces, etc.

In [8]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

array(['ATTOCK', 'Attock ', 'Bajaur Agency', 'Bannu', 'Bhakkar ', 'Buner',
       'Chakwal ', 'Chaman', 'Charsadda', 'Charsadda ', 'D. I Khan',
       'D.G Khan', 'D.G Khan ', 'D.I Khan', 'D.I Khan ', 'Dara Adam Khel',
       'Dara Adam khel', 'Fateh Jang', 'Ghallanai, Mohmand Agency ',
       'Gujrat', 'Hangu', 'Haripur', 'Hayatabad', 'Islamabad',
       'Islamabad ', 'Jacobabad', 'KURRAM AGENCY', 'Karachi', 'Karachi ',
       'Karak', 'Khanewal', 'Khuzdar', 'Khyber Agency', 'Khyber Agency ',
       'Kohat', 'Kohat ', 'Kuram Agency ', 'Lahore', 'Lahore ',
       'Lakki Marwat', 'Lakki marwat', 'Lasbela', 'Lower Dir', 'MULTAN',
       'Malakand ', 'Mansehra', 'Mardan', 'Mohmand Agency',
       'Mohmand Agency ', 'Mohmand agency', 'Mosal Kor, Mohmand Agency',
       'Multan', 'Muzaffarabad', 'North Waziristan', 'North waziristan',
       'Nowshehra', 'Orakzai Agency', 'Peshawar', 'Peshawar ', 'Pishin',
       'Poonch', 'Quetta', 'Quetta ', 'Rawalpindi', 'Sargodha',
       'Sehwan town',

In [73]:
## Clean city name

suicide_attacks['City'] = suicide_attacks['City'].str.lower().str.strip()
print(suicide_attacks['City'].unique())

['islamabad' 'karachi' 'quetta' 'rawalpindi' 'north waziristan' 'kohat'
 'attock' 'sialkot' 'lahore' 'swat' 'hangu' 'bannu' 'lasbela' 'malakand'
 'peshawar' 'di khan' 'lakki marwat' 'tank' 'gujrat' 'charsadda'
 'kuram agency' 'shangla' 'bajaur agency' 'south waziristan' 'haripur'
 'sargodha' 'nowshehra' 'mohmand agency' 'dara adam khel' 'khyber agency'
 'mardan' 'bhakkar' 'orakzai agency' 'buner' 'dg khan' 'pishin' 'chakwal'
 'upper dir' 'muzaffarabad' 'totalai' 'multan' 'lower dir' 'sudhanoti'
 'poonch' 'mansehra' 'karak' 'swabi' 'shikarpur' 'sukkur' 'chaman'
 'd i khan' 'khanewal' 'fateh jang' 'taftan' 'tirah valley' 'wagah' 'zhob'
 'kurram agency' 'taunsa' 'jacobabad' 'shabqadar-charsadda' 'khuzdar'
 'ghallanai, mohmand agency' 'hayatabad' 'mosal kor, mohmand agency'
 'sehwan town' 'tangi, charsadda district']
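
The output above contains no periods ("D.I Khan" appears as "di khan"), although the cell as shown only lowercases and strips. Presumably the periods were removed in a step that is not shown. The following is a minimal pure-Python sketch of such a normalization, assuming that only periods are dropped; `normalize_city` is a hypothetical helper, not the notebook's code.

```python
def normalize_city(name):
    """Lowercase, drop periods, and collapse runs of whitespace."""
    without_periods = name.replace(".", "")
    # split() with no arguments trims the ends and collapses inner whitespace
    return " ".join(without_periods.lower().split())


raw_names = ["D. I Khan", "D.I Khan ", "D.G Khan", "ATTOCK", "Attock "]
print([normalize_city(name) for name in raw_names])
```

Note that "D. I Khan" and "D.I Khan" still end up as two different strings ("d i khan" and "di khan"), which is why a further matching step is needed.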


A few inconsistencies remain. For instance, "di khan" / "d i khan" and "kuram agency" / "kurram agency" still appear as separate entries, so we will use fuzzy matching in the next section to merge these duplicates.

## Matching of inconsistent data entries
___

**TODO** 

* Verify there are no more inconsistencies in the `City` column.
* Feel free to use the [`fuzzywuzzy`](https://github.com/seatgeek/fuzzywuzzy) package to match and remove possible issues.

> **Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n"), while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.
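
The "number of changes" described above is usually called the edit (Levenshtein) distance. The sketch below is a plain dynamic-programming version written for illustration. `edit_distance` is a hypothetical helper, and `fuzzywuzzy` itself reports a 0-100 similarity score rather than this raw count.

```python
def edit_distance(source, target):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("apple", "snapple"))  # 2
print(edit_distance("in", "on"))          # 1
```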

In [74]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities2=cities.astype(str)
print("\n".join(cities))

attock
bajaur agency
bannu
bhakkar
buner
chakwal
chaman
charsadda
d i khan
dara adam khel
dg khan
di khan
fateh jang
ghallanai, mohmand agency
gujrat
hangu
haripur
hayatabad
islamabad
jacobabad
karachi
karak
khanewal
khuzdar
khyber agency
kohat
kuram agency
kurram agency
lahore
lakki marwat
lasbela
lower dir
malakand
mansehra
mardan
mohmand agency
mosal kor, mohmand agency
multan
muzaffarabad
north waziristan
nowshehra
orakzai agency
peshawar
pishin
poonch
quetta
rawalpindi
sargodha
sehwan town
shabqadar-charsadda
shangla
shikarpur
sialkot
south waziristan
sudhanoti
sukkur
swabi
swat
taftan
tangi, charsadda district
tank
taunsa
tirah valley
totalai
upper dir
wagah
zhob


In [76]:
## Apply fuzzy wuzzy
# let's check top closest matches to "kuram agency"
kuram_match = fuzzywuzzy.process.extract("kuram agency", cities, limit=5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
kuram_match 

[('kuram agency', 100),
 ('kurram agency', 96),
 ('bajaur agency', 72),
 ('khyber agency', 72),
 ('orakzai agency', 69)]

The top two values score 96 or higher. "bajaur agency" also appears with a score of 72, but we will not consider it a match for "kuram agency".
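
To build intuition for where such scores come from, here is a rough standard-library approximation of a token-sort score: sort the words of each string, then compare the results with `difflib.SequenceMatcher`. This is a sketch under assumptions, not `fuzzywuzzy`'s implementation. `token_sort_score` is a hypothetical name, the library applies extra preprocessing (for example to punctuation), and its scores may differ depending on the backend installed.

```python
from difflib import SequenceMatcher


def token_sort_score(first, second):
    """Similarity on a 0-100 scale after sorting each string's words."""
    def sorted_tokens(text):
        return " ".join(sorted(text.lower().split()))

    ratio = SequenceMatcher(None, sorted_tokens(first), sorted_tokens(second)).ratio()
    return int(round(100 * ratio))


for candidate in ["kurram agency", "bajaur agency"]:
    print(candidate, token_sort_score("kuram agency", candidate))
```

Sorting the tokens first makes the score insensitive to word order, so "agency kuram" and "kuram agency" would be treated as identical.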

In [77]:
### Checking for di khan
di_match = fuzzywuzzy.process.extract("di khan", cities, limit=5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
di_match 

[('di khan', 100),
 ('d i khan', 93),
 ('dg khan', 86),
 ('khanewal', 53),
 ('hangu', 50)]

Based on the two results above, we can treat a score of 90 or higher as a match.

In [100]:
## Creating a function to handle the replacements
def rep_fuzzy_matches(df, column, string_to_match, score):
    # get the unique strings in the column
    strings = df[column].unique()
    # get the 5 closest matches to the input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # keep only the matches at or above the specified score
    close_matches = [match[0] for match in matches if match[1] >= score]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    # note: the count includes string_to_match itself when it is already in the column
    print("All string replaced with the matching string")
    print("Replacement count : ",len(close_matches))


In [101]:
rep_fuzzy_matches(suicide_attacks,"City","kuram agency",90)

All string replaced with the matching string
Replacement count :  1


In [102]:
rep_fuzzy_matches(suicide_attacks,"City","di khan",90)

All string replaced with the matching string
Replacement count :  2


In [104]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
print("\n".join(cities))

attock
bajaur agency
bannu
bhakkar
buner
chakwal
chaman
charsadda
dara adam khel
dg khan
di khan
fateh jang
ghallanai, mohmand agency
gujrat
hangu
haripur
hayatabad
islamabad
jacobabad
karachi
karak
khanewal
khuzdar
khyber agency
kohat
kuram agency
lahore
lakki marwat
lasbela
lower dir
malakand
mansehra
mardan
mohmand agency
mosal kor, mohmand agency
multan
muzaffarabad
north waziristan
nowshehra
orakzai agency
peshawar
pishin
poonch
quetta
rawalpindi
sargodha
sehwan town
shabqadar-charsadda
shangla
shikarpur
sialkot
south waziristan
sudhanoti
sukkur
swabi
swat
taftan
tangi, charsadda district
tank
taunsa
tirah valley
totalai
upper dir
wagah
zhob


The duplicate spellings found above have been merged, so the `City` column is clean now 😀
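
One way to verify this without eyeballing the list is to score every pair of remaining values and flag any pair at or above the chosen threshold. The sketch below assumes the 90 cutoff used above and uses `difflib` from the standard library instead of `fuzzywuzzy`, so its scores may not match the library's exactly. `near_duplicates` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=90):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    flagged = []
    for name_a, name_b in combinations(sorted(set(names)), 2):
        score = int(round(100 * SequenceMatcher(None, name_a, name_b).ratio()))
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return flagged


# Small sample mixing a true duplicate pair with two distinct cities.
sample = ["di khan", "dg khan", "kuram agency", "kurram agency"]
print(near_duplicates(sample))
```

Running it on the unique `City` values should return an empty list if no close pairs remain. Any pair that is flagged still needs a human decision, since distinct places can have very similar names.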