# Data Cleaning Example

This notebook uses a synthetic dataset to demonstrate simple data cleaning with pandas.

#### Dataset: data/patient_heart_rate.csv
This dataset has a lot of inconsistencies, so it must be cleaned up before it can be analyzed reliably and its true value realized.

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('data/patient_heart_rate.csv')

In [3]:
data.head()

Unnamed: 0,1,Mickéy Mousé,56,70kgs,72,69,71,-,-.1,-.2
0,2.0,Donald Duck,34.0,154.89lbs,-,-,-,85,84,76
1,3.0,Mini Mouse,16.0,,-,-,-,65,69,72
2,4.0,Scrooge McDuck,,78kgs,78,79,72,-,-,-
3,5.0,Pink Panther,54.0,198.658lbs,-,-,-,69,,75
4,6.0,Huey McDuck,52.0,189lbs,-,-,-,68,75,72


Looking at the dataset, the issues to address are:
    1. The CSV file has no header row
    2. The row index should point to the 'Id' column
    3. The Name column combines first and last name
    4. Missing values
    5. Duplicate records
    6. Inconsistent units within a column
    7. Columns that encode several variables and need to be split (melt, concat, string extraction)

Let's address each one of them separately.

### 1. Adding Headers Manually

The above dataset doesn't have any headers. Let's add them manually.

In [4]:
col_names = ['Id', 'Name', 'Age', 'Weight', 'm0006', 'm0612', 'm1218', 'f0006', 'f0612', 'f1218']
data = pd.read_csv('data/patient_heart_rate.csv', names = col_names)
data 

Unnamed: 0,Id,Name,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218
0,1.0,Mickéy Mousé,56.0,70kgs,72,69,71,-,-,-
1,2.0,Donald Duck,34.0,154.89lbs,-,-,-,85,84,76
2,3.0,Mini Mouse,16.0,,-,-,-,65,69,72
3,4.0,Scrooge McDuck,,78kgs,78,79,72,-,-,-
4,5.0,Pink Panther,54.0,198.658lbs,-,-,-,69,,75
5,6.0,Huey McDuck,52.0,189lbs,-,-,-,68,75,72
6,7.0,Dewey McDuck,19.0,56kgs,-,-,-,71,78,75
7,8.0,Scööpy Doo,32.0,78kgs,78,76,75,-,-,-
8,,,,,,,,,,
9,9.0,Huey McDuck,52.0,189lbs,-,-,-,68,75,72


### 2. Change the row index

Let's make the "Id" column the row index of this dataset.

In [5]:
data = data.set_index('Id')
data.head()

Id,Name,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218
1.0,Mickéy Mousé,56.0,70kgs,72,69,71,-,-,-
2.0,Donald Duck,34.0,154.89lbs,-,-,-,85,84,76
3.0,Mini Mouse,16.0,,-,-,-,65,69,72
4.0,Scrooge McDuck,,78kgs,78,79,72,-,-,-
5.0,Pink Panther,54.0,198.658lbs,-,-,-,69,,75


### 3. Split the Name column into First and Last Name

In [6]:
data[['First Name', 'Last Name']] = data['Name'].str.split(expand = True)
data.drop('Name', axis = 1, inplace = True)
data.head()

Id,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218,First Name,Last Name
1.0,56.0,70kgs,72,69,71,-,-,-,Mickéy,Mousé
2.0,34.0,154.89lbs,-,-,-,85,84,76,Donald,Duck
3.0,16.0,,-,-,-,65,69,72,Mini,Mouse
4.0,,78kgs,78,79,72,-,-,-,Scrooge,McDuck
5.0,54.0,198.658lbs,-,-,-,69,,75,Pink,Panther


### 4. Handling Missing Values

There are a few missing values in the Age, Weight and heart rate columns of the dataset. This typically means that a piece of information was simply not collected. Typical ways of handling missing values are:  
1. Deletion: Remove records with missing values.  
2. Dummy substitution: Replace missing values with a dummy but valid value, e.g. 0 for numerical values.  
3. Mean substitution: Replace the missing values with the mean.  
4. Frequent substitution: Replace the missing values with the most frequent item.  
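
As a rough illustration of the four strategies listed above, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its `Age` and `City` columns are assumptions for demonstration only; they are not part of the patient dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the patient dataset
toy = pd.DataFrame({
    "Age": [56.0, 34.0, np.nan, 34.0],
    "City": ["Paris", None, "Paris", "Rome"],
})

# 1. Deletion: drop every row that has at least one missing value
deleted = toy.dropna()

# 2. Dummy substitution: replace missing ages with 0
dummy = toy["Age"].fillna(0)

# 3. Mean substitution: replace missing ages with the column mean
mean_filled = toy["Age"].fillna(toy["Age"].mean())

# 4. Frequent substitution: replace missing cities with the most common one
most_frequent = toy["City"].mode()[0]
frequent_filled = toy["City"].fillna(most_frequent)
```

Which strategy is appropriate depends on the column and on the later analysis; a dummy 0 for Age, for instance, would distort any average computed afterwards.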

In [7]:
data.isna().any()

Age           True
Weight        True
m0006         True
m0612         True
m1218         True
f0006         True
f0612         True
f1218         True
First Name    True
Last Name     True
dtype: bool

Every column has at least one null value. Let's check the total number of null values in each column.

In [8]:
data.isna().sum()

Age           2
Weight        2
m0006         1
m0612         1
m1218         1
f0006         1
f0612         2
f1218         1
First Name    1
Last Name     1
dtype: int64

Drop the row only if all its columns are null.

In [9]:
data.dropna(how = 'all', inplace = True)
data

Id,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218,First Name,Last Name
1.0,56.0,70kgs,72,69,71,-,-,-,Mickéy,Mousé
2.0,34.0,154.89lbs,-,-,-,85,84,76,Donald,Duck
3.0,16.0,,-,-,-,65,69,72,Mini,Mouse
4.0,,78kgs,78,79,72,-,-,-,Scrooge,McDuck
5.0,54.0,198.658lbs,-,-,-,69,,75,Pink,Panther
6.0,52.0,189lbs,-,-,-,68,75,72,Huey,McDuck
7.0,19.0,56kgs,-,-,-,71,78,75,Dewey,McDuck
8.0,32.0,78kgs,78,76,75,-,-,-,Scööpy,Doo
9.0,52.0,189lbs,-,-,-,68,75,72,Huey,McDuck
10.0,12.0,45kgs,-,-,-,92,95,87,Louie,McDuck


### 5. Duplicate records in the data

First, check whether the data has duplicate records. If it does, you can use the pandas method drop_duplicates() to remove them.
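
One possible way to do that check programmatically, rather than by eye, is `DataFrame.duplicated()`. The sketch below uses a hypothetical three-row `toy` frame that only mirrors the name columns of this notebook.

```python
import pandas as pd

# Hypothetical toy frame mirroring the name columns used in this notebook
toy = pd.DataFrame({
    "First Name": ["Huey", "Dewey", "Huey"],
    "Last Name": ["McDuck", "McDuck", "McDuck"],
    "Age": [52.0, 19.0, 52.0],
})

# Boolean mask: True for every row that repeats an earlier row
dup_mask = toy.duplicated(subset=["First Name", "Last Name"])
n_duplicates = int(dup_mask.sum())

# keep=False marks all members of a duplicate group, which helps inspection
dup_rows = toy[toy.duplicated(subset=["First Name", "Last Name"], keep=False)]
```

Inspecting `dup_rows` before dropping anything makes it easier to confirm that the repeated names really are the same record and not two different people.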

In [10]:
data

Id,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218,First Name,Last Name
1.0,56.0,70kgs,72,69,71,-,-,-,Mickéy,Mousé
2.0,34.0,154.89lbs,-,-,-,85,84,76,Donald,Duck
3.0,16.0,,-,-,-,65,69,72,Mini,Mouse
4.0,,78kgs,78,79,72,-,-,-,Scrooge,McDuck
5.0,54.0,198.658lbs,-,-,-,69,,75,Pink,Panther
6.0,52.0,189lbs,-,-,-,68,75,72,Huey,McDuck
7.0,19.0,56kgs,-,-,-,71,78,75,Dewey,McDuck
8.0,32.0,78kgs,78,76,75,-,-,-,Scööpy,Doo
9.0,52.0,189lbs,-,-,-,68,75,72,Huey,McDuck
10.0,12.0,45kgs,-,-,-,92,95,87,Louie,McDuck


In [11]:
data = data.drop_duplicates(subset = ['First Name', 'Last Name'])
data

Id,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218,First Name,Last Name
1.0,56.0,70kgs,72,69,71,-,-,-,Mickéy,Mousé
2.0,34.0,154.89lbs,-,-,-,85,84,76,Donald,Duck
3.0,16.0,,-,-,-,65,69,72,Mini,Mouse
4.0,,78kgs,78,79,72,-,-,-,Scrooge,McDuck
5.0,54.0,198.658lbs,-,-,-,69,,75,Pink,Panther
6.0,52.0,189lbs,-,-,-,68,75,72,Huey,McDuck
7.0,19.0,56kgs,-,-,-,71,78,75,Dewey,McDuck
8.0,32.0,78kgs,78,76,75,-,-,-,Scööpy,Doo
10.0,12.0,45kgs,-,-,-,92,95,87,Louie,McDuck


### 6. Column data contains inconsistent unit values

The Weight column mixes two units, kgs and lbs. Let's convert everything to 'kgs'.

In [12]:
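df = data.reset_index()

LBS_PER_KG = 2.20462  # 1 kg is approximately 2.20462 lbs

# Convert weights recorded in lbs to kgs; other values are left untouched
for i, value in df['Weight'].items():
    x = str(value)

    if x.endswith('lbs'):
        float_lbs = float(x[:-3])
        int_kgs = int(float_lbs / LBS_PER_KG)  # truncate to whole kilograms
        # Assign through df.loc so the DataFrame itself is updated
        df.loc[i, 'Weight'] = str(int_kgs) + 'kgs'
df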

Unnamed: 0,Id,Age,Weight,m0006,m0612,m1218,f0006,f0612,f1218,First Name,Last Name
0,1.0,56.0,70kgs,72,69,71,-,-,-,Mickéy,Mousé
1,2.0,34.0,70kgs,-,-,-,85,84,76,Donald,Duck
2,3.0,16.0,,-,-,-,65,69,72,Mini,Mouse
3,4.0,,78kgs,78,79,72,-,-,-,Scrooge,McDuck
4,5.0,54.0,90kgs,-,-,-,69,,75,Pink,Panther
5,6.0,52.0,85kgs,-,-,-,68,75,72,Huey,McDuck
6,7.0,19.0,56kgs,-,-,-,71,78,75,Dewey,McDuck
7,8.0,32.0,78kgs,78,76,75,-,-,-,Scööpy,Doo
8,10.0,12.0,45kgs,-,-,-,92,95,87,Louie,McDuck

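
The same conversion could also be written without an explicit loop. The following is a sketch, assuming every non-missing entry is a number followed by either `kgs` or `lbs`; the sample `weight` Series is hypothetical, and unlike the cell above it produces a numeric column in kilograms instead of strings.

```python
import pandas as pd

LBS_PER_KG = 2.20462  # approximate conversion factor

# Hypothetical sample of the Weight column, including a missing value
weight = pd.Series(["70kgs", "154.89lbs", None, "189lbs"])

# Split each entry into a numeric part and a unit part
parts = weight.str.extract(r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>kgs|lbs)$")
value = parts["value"].astype(float)

# Convert only the lbs entries; kgs entries and missing values pass through
is_lbs = parts["unit"] == "lbs"
weight_kg = value.where(~is_lbs, value / LBS_PER_KG)
```

Keeping the result as floats avoids the truncation to whole kilograms and makes the column directly usable in numeric analysis.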

### 7. Segregating Column values

Consider the column names of the dataset ['m0006', 'm0612', 'm1218', 'f0006', 'f0612', 'f1218']  
Here, the m stands for 'Male' and f stands for 'Female'  
Also the numbers are the hour ranges, i.e.,  
0006 - 00 : 06 (Hours)  
0612 - 06 : 12 (Hours)  
1218 - 12 : 18 (Hours)  

Let's separate these columns into Sex and Time (hour range) columns.
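
To see how a column name decomposes, the sketch below applies a regular expression to the six column names alone. The intermediate names `cols` and `parts` are chosen here for illustration.

```python
import pandas as pd

# The six measurement column names from the dataset
cols = pd.Series(["m0006", "m0612", "m1218", "f0006", "f0612", "f1218"])

# One non-digit (sex), then the lower-bound hour, then a two-digit upper-bound hour
parts = cols.str.extract(r"(\D)(\d+)(\d{2})", expand=True)
parts.columns = ["Sex", "hours_lower", "hours_upper"]

# Combine the two hour bounds into a single label such as "00-06"
parts["Time"] = parts["hours_lower"] + "-" + parts["hours_upper"]
```

Because `\d+` is greedy but must leave two digits for the final group, a name like `m0612` splits into `m`, `06` and `12`.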

In [13]:
# Melt the sex + time range columns into a single column
id_cols = ['Id', 'Age', 'Weight', 'First Name', 'Last Name']
df = pd.melt(df, id_vars=id_cols, value_name='PulseRate',
             var_name='sex_and_time').sort_values(id_cols)

# Extract the sex, the lower hour bound and the upper hour bound
# (raw string so the regex backslashes are not treated as string escapes)
tmp_df = df['sex_and_time'].str.extract(r"(\D)(\d+)(\d{2})", expand=True)

# Name columns
tmp_df.columns = ['Sex', 'hours_lower', 'hours_upper']

# Create Time column based on "hours_lower" and "hours_upper" columns
tmp_df['Time'] = tmp_df['hours_lower'] + '-' + tmp_df['hours_upper']

# Combine the extracted columns with the melted data
df = pd.concat([df, tmp_df], axis=1)

# Drop the helper columns and any rows that still contain null values
df = df.drop(['sex_and_time', 'hours_lower', 'hours_upper'], axis=1)
df = df.dropna()
df

Unnamed: 0,Id,Age,Weight,First Name,Last Name,PulseRate,Sex,Time
0,1.0,56.0,70kgs,Mickéy,Mousé,72,m,00-06
9,1.0,56.0,70kgs,Mickéy,Mousé,69,m,06-12
18,1.0,56.0,70kgs,Mickéy,Mousé,71,m,12-18
27,1.0,56.0,70kgs,Mickéy,Mousé,-,f,00-06
36,1.0,56.0,70kgs,Mickéy,Mousé,-,f,06-12
45,1.0,56.0,70kgs,Mickéy,Mousé,-,f,12-18
1,2.0,34.0,70kgs,Donald,Duck,-,m,00-06
10,2.0,34.0,70kgs,Donald,Duck,-,m,06-12
19,2.0,34.0,70kgs,Donald,Duck,-,m,12-18
28,2.0,34.0,70kgs,Donald,Duck,85,f,00-06


### Wrapping Up

Now that the data is cleaned, let's store it in a separate CSV file to use for analysis.

In [14]:
df.to_csv('data/cleaned_data.csv',index = False)