<a href="https://colab.research.google.com/github/victorezealuma/Data-Cleaning-of-Cardiovascular-Health-Data-in-Pandas/blob/main/Healthcare_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning of Cardiovascular Health Data in Pandas

This project focuses on cleaning and preparing a cardiovascular health dataset for analysis.
It addresses challenges that are typical of medical data, such as missing values, inconsistent date formats, and numbers written as words. The notebook is organized as follows:

1. Introduction
2. Import required Python libraries
3. The source dataset
4. Data type conversion
5. Exploratory Data Analysis








## 1. Introduction
Data cleaning is a critical preprocessing step that involves identifying, correcting, and standardizing data to ensure data quality and integrity. This process is particularly crucial in the healthcare domain where errors can have significant implications. By meticulously cleaning cardiovascular health data, researchers can extract meaningful insights and draw accurate conclusions.

## 2. Import the required Python libraries

Before preprocessing the data, we import the libraries used throughout the notebook under their conventional aliases:

In [227]:
# import the Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 3. The source dataset
The messy dataset that we will be cleaning is loaded from a CSV file as follows:

In [228]:
df_healthcare_data = pd.read_csv('/content/healthcare_messy_data.csv')

df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number
0,david lee,25,Other,Heart Disease,METFORMIN,01/15/2020,140/90,200.0,name@hospital.org,555-555-5555
1,emily davis,,Male,Diabetes,NONE,"April 5, 2018",120/80,200.0,,
2,laura martinez,35,Other,Asthma,METFORMIN,2019.12.01,110/70,160.0,contact@domain.com,
3,michael wilson,,Male,Diabetes,ALBUTEROL,01/15/2020,110/70,,name@hospital.org,555-555-5555
4,david lee,,Female,Asthma,NONE,2020/02/20,110/70,180.0,,
...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,03-25-2019,110/70,,name@hospital.org,
996,mary clark,forty,Other,,LISINOPRIL,01/15/2020,,160.0,,123-456-7890
997,laura martinez,forty,Other,,ALBUTEROL,2020/02/20,110/70,,name@hospital.org,
998,jane smith,25,Male,,ALBUTEROL,"April 5, 2018",110/70,200.0,,


## 4. Exploratory Data Analysis
To understand the data and spot discrepancies, we perform exploratory data analysis using the steps below:

**df.shape attribute**

We can check the dimensions of the data with the **df.shape** attribute. Our data has 1000 rows and 10 columns.

In [229]:
df_healthcare_data.shape

(1000, 10)

**df.info() method**

The **df.info()** method prints a concise summary of the dataset.
It lists every column with its non-null count and data type, which shows which columns contain missing values and which are stored with an unsuitable type.

In [230]:
df_healthcare_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Patient Name    1000 non-null   object 
 1   Age             841 non-null    object 
 2   Gender          1000 non-null   object 
 3   Condition       794 non-null    object 
 4   Medication      1000 non-null   object 
 5   Visit Date      1000 non-null   object 
 6   Blood Pressure  834 non-null    object 
 7   Cholesterol     769 non-null    float64
 8   Email           616 non-null    object 
 9   Phone Number    821 non-null    object 
dtypes: float64(1), object(9)
memory usage: 78.2+ KB
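
The summary above reports Age as an object column, which suggests that it holds entries that are not plain numbers. One way to list such entries is sketched below on a small stand-in series. The sample values and the names `age`, `as_number` and `invalid_values` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Age column; these values are illustrative only.
age = pd.Series(["25", None, "35", "forty", "70"], name="Age")

# With errors="coerce", anything that cannot be parsed as a number becomes NaN.
as_number = pd.to_numeric(age, errors="coerce")

# Keep entries that were present originally but failed to convert.
invalid_mask = age.notna() & as_number.isna()
invalid_values = age[invalid_mask].unique().tolist()

print(invalid_values)  # ['forty']
```

Running a check like this before converting a column makes it easier to decide which text values deserve an explicit replacement rather than being turned into NaN.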


## Handling invalid values

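Similarly, several columns contain invalid or missing values. The Age column stores some numbers as text (for example "forty"), and the Age, Condition, Blood Pressure, Cholesterol, Email and Phone Number columns contain missing values (NaN). We replace the text values, convert each column to an appropriate data type (using **errors='coerce'** so that unparseable values become NaN), and fill the missing text fields with placeholders as follows: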

In [231]:
# Replace ages written as words with the corresponding number

df_healthcare_data['Age'] = df_healthcare_data['Age'].replace('forty', 40)
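
If more than one number word turned up, the same **replace()** call could take a dictionary. The following is a minimal sketch, assuming a hypothetical lookup table `WORD_TO_NUMBER` (only "forty" is known to occur in this dataset) and a toy series in place of the real column.

```python
import pandas as pd

# Hypothetical lookup table; only "forty" is known to appear in the dataset.
WORD_TO_NUMBER = {"forty": 40, "fifty": 50}

# Toy stand-in for the Age column, stored as object like the original.
age = pd.Series(["25", "forty", None, "fifty"], dtype=object, name="Age")

# Swap each known word for its number, then convert the whole column.
age = age.replace(WORD_TO_NUMBER)
age = pd.to_numeric(age, errors="coerce")

print(age.tolist())  # [25.0, 40.0, nan, 50.0]
```

Any word that is missing from the table would still be coerced to NaN by the second step, so the table only needs to cover the words that should be preserved.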

In [232]:
# Convert each column to an appropriate data type

# format='mixed' parses each date individually (requires pandas 2.0 or later)
df_healthcare_data["Visit Date"] = pd.to_datetime(df_healthcare_data["Visit Date"], format='mixed')

# errors='coerce' turns values that cannot be parsed into NaN
df_healthcare_data["Age"] = pd.to_numeric(df_healthcare_data["Age"], errors='coerce')

df_healthcare_data["Cholesterol"] = pd.to_numeric(df_healthcare_data["Cholesterol"], errors='coerce')

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_healthcare_data["Condition"] = df_healthcare_data["Condition"].fillna("Others")

df_healthcare_data["Blood Pressure"] = df_healthcare_data["Blood Pressure"].astype(object)

# Fill missing contact details with placeholder values
df_healthcare_data['Email'] = df_healthcare_data['Email'].fillna('patientname@hospital.org').astype(str)

df_healthcare_data['Phone Number'] = df_healthcare_data['Phone Number'].fillna('123-456-789').astype(str)


In [233]:
df_healthcare_data.isnull().sum()


Unnamed: 0,0
Patient Name,0
Age,159
Gender,0
Condition,0
Medication,0
Visit Date,0
Blood Pressure,166
Cholesterol,231
Email,0
Phone Number,0


## Mean Imputation
Mean imputation is a statistical technique for replacing missing values in a dataset. It entails calculating the mean of a column and using it to replace the missing values (NaN) in that column. We apply it to three columns of our dataset: Age, Blood Pressure and Cholesterol.
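
As a small worked example of the idea, the sketch below imputes a toy series. The values and the names `cholesterol`, `mean_value` and `imputed` are assumptions chosen for illustration, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with two missing readings.
cholesterol = pd.Series([200.0, np.nan, 160.0, 180.0, np.nan])

# mean() skips NaN by default: (200 + 160 + 180) / 3 = 180.0
mean_value = cholesterol.mean()

# Replace every missing reading with that mean.
imputed = cholesterol.fillna(mean_value)

print(imputed.tolist())                  # [200.0, 180.0, 160.0, 180.0, 180.0]
print(imputed.mean() == mean_value)      # True
print(imputed.std() < cholesterol.std()) # True
```

The last two lines illustrate a known property of this technique: the column mean is preserved, but the spread shrinks because every filled value sits exactly at the mean.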

In [234]:
# Mean imputation for the Age and Cholesterol columns

mean_age = df_healthcare_data["Age"].mean()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_healthcare_data["Age"] = df_healthcare_data["Age"].fillna(mean_age)

# astype(int) truncates the imputed mean to a whole number of years
df_healthcare_data['Age'] = df_healthcare_data['Age'].astype(int)


mean_cholesterol = df_healthcare_data['Cholesterol'].mean()

df_healthcare_data['Cholesterol'] = df_healthcare_data['Cholesterol'].fillna(mean_cholesterol)

In [235]:
# Fill in missing blood pressure readings with the mean systolic and diastolic values.


# Split the original blood pressure column (which contains NaN values) into systolic and diastolic columns
df_healthcare_data[['systolic', 'diastolic']] = df_healthcare_data['Blood Pressure'].str.split("/", expand=True)

# Convert both columns to a numeric data type
df_healthcare_data['systolic'] = pd.to_numeric(df_healthcare_data['systolic'], errors='coerce')
df_healthcare_data['diastolic'] = pd.to_numeric(df_healthcare_data['diastolic'], errors='coerce')


# Calculate the mean systolic and diastolic pressures
mean_systolic = df_healthcare_data['systolic'].mean()
mean_diastolic = df_healthcare_data['diastolic'].mean()


# Fill the missing readings with the means, then convert to whole numbers
df_healthcare_data['systolic'] = df_healthcare_data['systolic'].fillna(mean_systolic)
df_healthcare_data['diastolic'] = df_healthcare_data['diastolic'].fillna(mean_diastolic)
df_healthcare_data['systolic'] = df_healthcare_data['systolic'].astype(int)
df_healthcare_data['diastolic'] = df_healthcare_data['diastolic'].astype(int)


# Concatenate the systolic and diastolic values into a new Blood_Pressure column with complete data
df_healthcare_data['Blood_Pressure'] = df_healthcare_data['systolic'].astype(str) + '/' + df_healthcare_data['diastolic'].astype(str)

df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number,systolic,diastolic,Blood_Pressure
0,david lee,25,Other,Heart Disease,METFORMIN,2020-01-15,140/90,200.00000,name@hospital.org,555-555-5555,140,90,140/90
1,emily davis,45,Male,Diabetes,NONE,2018-04-05,120/80,200.00000,patientname@hospital.org,123-456-789,120,80,120/80
2,laura martinez,35,Other,Asthma,METFORMIN,2019-12-01,110/70,160.00000,contact@domain.com,123-456-789,110,70,110/70
3,michael wilson,45,Male,Diabetes,ALBUTEROL,2020-01-15,110/70,189.23277,name@hospital.org,555-555-5555,110,70,110/70
4,david lee,45,Female,Asthma,NONE,2020-02-20,110/70,180.00000,patientname@hospital.org,,110,70,110/70
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,2019-03-25,110/70,189.23277,name@hospital.org,123-456-789,110,70,110/70
996,mary clark,40,Other,Others,LISINOPRIL,2020-01-15,,160.00000,patientname@hospital.org,123-456-7890,125,81,125/81
997,laura martinez,40,Other,Others,ALBUTEROL,2020-02-20,110/70,189.23277,name@hospital.org,123-456-789,110,70,110/70
998,jane smith,25,Male,Others,ALBUTEROL,2018-04-05,110/70,200.00000,patientname@hospital.org,,110,70,110/70
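
The split, impute and rebuild steps for blood pressure can be followed more easily on a three-row example. This is a sketch under stated assumptions: the readings are invented, the names `bp`, `parts` and `rebuilt` are illustrative, and the means are rounded before the integer conversion, whereas a plain **astype(int)** truncates.

```python
import numpy as np
import pandas as pd

# Toy readings; the middle one is missing.
bp = pd.Series(["140/90", np.nan, "110/70"], name="Blood Pressure")

# Split on "/" into two columns and convert both to numbers.
parts = bp.str.split("/", expand=True)
parts.columns = ["systolic", "diastolic"]
parts = parts.apply(pd.to_numeric, errors="coerce")

# Fill each column with its own mean, then round to whole numbers.
parts = parts.fillna(parts.mean()).round().astype(int)

# Join the two columns back into a single "systolic/diastolic" string.
rebuilt = parts["systolic"].astype(str) + "/" + parts["diastolic"].astype(str)

print(rebuilt.tolist())  # ['140/90', '125/80', '110/70']
```

Imputing the two components separately keeps each filled reading in the same "systolic/diastolic" shape as the observed ones.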


In [236]:
df_healthcare_data.isnull().sum()


Unnamed: 0,0
Patient Name,0
Age,0
Gender,0
Condition,0
Medication,0
Visit Date,0
Blood Pressure,166
Cholesterol,0
Email,0
Phone Number,0


## df.describe() method
By utilizing the **df.describe()** method, we can gain valuable insights into the numerical data within our DataFrame. This summary provides essential statistics such as the mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), which can aid in identifying potential outliers that may require further analysis.

In [237]:
df_healthcare_data.describe()

Unnamed: 0,Age,Visit Date,Cholesterol,systolic,diastolic
count,1000.0,1000,1000.0,1000.0,1000.0
mean,45.645,2019-06-29 06:23:02.399999744,189.23277,125.31,81.351
min,25.0,2018-04-05 00:00:00,160.0,110.0,70.0
25%,35.0,2019-03-25 00:00:00,180.0,120.0,80.0
50%,40.0,2019-12-01 00:00:00,189.23277,125.0,81.0
75%,60.0,2020-01-15 00:00:00,200.0,130.0,85.0
max,70.0,2020-02-20 00:00:00,220.0,140.0,90.0
std,15.092606,,19.535326,10.472037,6.900247
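
The quartiles in such a summary can be turned into an explicit outlier check. The sketch below uses the common 1.5 × IQR rule, which is an assumption of this example rather than a step in the notebook. The sample values are invented, with 400.0 added deliberately as an extreme reading.

```python
import pandas as pd

# Invented readings; 400.0 is a deliberately extreme value.
values = pd.Series([160.0, 180.0, 189.0, 200.0, 220.0, 400.0], name="Cholesterol")

# First and third quartiles, as reported in the 25% and 75% rows of describe().
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())  # [400.0]
```

A column whose minimum and maximum both fall inside these bounds has no outliers by this rule.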


In [238]:
# Drops the old blood pressure column with missing values.

df_healthcare_data = df_healthcare_data.drop('Blood Pressure', axis=1)

df_healthcare_data


Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Cholesterol,Email,Phone Number,systolic,diastolic,Blood_Pressure
0,david lee,25,Other,Heart Disease,METFORMIN,2020-01-15,200.00000,name@hospital.org,555-555-5555,140,90,140/90
1,emily davis,45,Male,Diabetes,NONE,2018-04-05,200.00000,patientname@hospital.org,123-456-789,120,80,120/80
2,laura martinez,35,Other,Asthma,METFORMIN,2019-12-01,160.00000,contact@domain.com,123-456-789,110,70,110/70
3,michael wilson,45,Male,Diabetes,ALBUTEROL,2020-01-15,189.23277,name@hospital.org,555-555-5555,110,70,110/70
4,david lee,45,Female,Asthma,NONE,2020-02-20,180.00000,patientname@hospital.org,,110,70,110/70
...,...,...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,2019-03-25,189.23277,name@hospital.org,123-456-789,110,70,110/70
996,mary clark,40,Other,Others,LISINOPRIL,2020-01-15,160.00000,patientname@hospital.org,123-456-7890,125,81,125/81
997,laura martinez,40,Other,Others,ALBUTEROL,2020-02-20,189.23277,name@hospital.org,123-456-789,110,70,110/70
998,jane smith,25,Male,Others,ALBUTEROL,2018-04-05,200.00000,patientname@hospital.org,,110,70,110/70


In [239]:
assert pd.notnull(df_healthcare_data).all().all()

The assert statement raises no AssertionError, so we can conclude that there are no missing values in the dataset. We can confirm this by looking at the dataframe.
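
To see what the check looks like when it fails, the sketch below runs it against one clean and one incomplete toy frame. The helper `check_no_missing` and both frames are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

clean = pd.DataFrame({"Age": [25, 40], "Cholesterol": [200.0, 160.0]})
dirty = pd.DataFrame({"Age": [25, np.nan], "Cholesterol": [200.0, 160.0]})


def check_no_missing(frame):
    """Raise AssertionError naming the columns that still contain NaN."""
    missing = frame.isna().sum()
    assert missing.sum() == 0, f"missing values found: {missing[missing > 0].to_dict()}"


# Passes silently because every cell is filled.
check_no_missing(clean)

# Fails because one Age value is missing.
try:
    check_no_missing(dirty)
    raised = False
except AssertionError as exc:
    raised = True
    message = str(exc)

print(raised)  # True
```

Attaching a message to the assertion points directly at the offending columns, which is convenient in a notebook with many cleaning steps.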

## Reordering the Columns by Name
Here we use the **reindex()** method to arrange the columns in a chosen order by listing their names.

In [240]:
# Reordering the Columns by Name

new_df_healthcare_data = df_healthcare_data.reindex(columns=['Patient Name','Age','Gender','Condition','Medication','Cholesterol','Blood_Pressure', 'systolic', 'diastolic','Email','Phone Number','Visit Date'])  # Reorder columns

new_df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Cholesterol,Blood_Pressure,systolic,diastolic,Email,Phone Number,Visit Date
0,david lee,25,Other,Heart Disease,METFORMIN,200.00000,140/90,140,90,name@hospital.org,555-555-5555,2020-01-15
1,emily davis,45,Male,Diabetes,NONE,200.00000,120/80,120,80,patientname@hospital.org,123-456-789,2018-04-05
2,laura martinez,35,Other,Asthma,METFORMIN,160.00000,110/70,110,70,contact@domain.com,123-456-789,2019-12-01
3,michael wilson,45,Male,Diabetes,ALBUTEROL,189.23277,110/70,110,70,name@hospital.org,555-555-5555,2020-01-15
4,david lee,45,Female,Asthma,NONE,180.00000,110/70,110,70,patientname@hospital.org,,2020-02-20
...,...,...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,189.23277,110/70,110,70,name@hospital.org,123-456-789,2019-03-25
996,mary clark,40,Other,Others,LISINOPRIL,160.00000,125/81,125,81,patientname@hospital.org,123-456-7890,2020-01-15
997,laura martinez,40,Other,Others,ALBUTEROL,189.23277,110/70,110,70,name@hospital.org,123-456-789,2020-02-20
998,jane smith,25,Male,Others,ALBUTEROL,200.00000,110/70,110,70,patientname@hospital.org,,2018-04-05
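
One property of **reindex()** is worth keeping in mind when listing column names by hand: a name that does not exist is not an error, it produces a new column filled with NaN. The sketch below shows this on a toy frame and contrasts it with selecting by a list of names, which raises a KeyError instead. The frame and the misspelled name "Agee" are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 40],
    "Gender": ["Other", "Male"],
    "Email": ["a@x.org", "b@x.org"],
})

# Correct names: the columns are simply reordered.
reordered = df.reindex(columns=["Gender", "Age", "Email"])

# Misspelled name: reindex creates an all-NaN column called "Agee".
typo = df.reindex(columns=["Gender", "Agee", "Email"])

# Selecting with a list of names fails loudly on the same typo.
try:
    df[["Gender", "Agee", "Email"]]
    strict_failed = False
except KeyError:
    strict_failed = True

print(list(reordered.columns))   # ['Gender', 'Age', 'Email']
print(typo["Agee"].isna().all()) # True
print(strict_failed)             # True
```

Checking for missing values again after reordering would catch a misspelled name introduced this way.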


Upon inspection, our dataset appears to be clean and well-structured. There are no missing values, and the summary statistics show no obvious outliers. Furthermore, the data is organized in a tidy format, making it suitable for analysis. This concludes the data cleaning process for the dataset.