<a href="https://colab.research.google.com/github/victorezealuma/Data-Cleaning-of-Cardiovascular-Health-Data-in-Pandas/blob/main/Healthcare_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data-Cleaning-of-Cardiovascular-Health-Data-in-Pandas

This project cleans and prepares a cardiovascular health dataset for analysis.
It addresses challenges that are typical of medical data and covers the following topics:

- Introduction
- Handling missing values
- Data type conversion
- Error correction
- Feature engineering
- Data consistency
- Outlier detection and treatment

Accurate and reliable data is essential for effective research and analysis in the field of cardiovascular health. However, real-world datasets often contain inconsistencies, errors, and missing values.

# 1. Introduction
Data cleaning is a critical preprocessing step that involves identifying, correcting, and standardizing data to ensure data quality and integrity. This process is particularly crucial in the healthcare domain where errors can have significant implications. By meticulously cleaning cardiovascular health data, researchers can extract meaningful insights and draw accurate conclusions.


# 2. Import the required Python libraries

The libraries needed to preprocess and clean the dataset are imported below under their conventional aliases:

In [None]:
# import the Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# 3. The source dataset
The dataset will be imported as follows:

In [None]:
df_healthcare_data = pd.read_csv('/content/healthcare_messy_data.csv')
df_healthcare_data.head()

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number
0,david lee,25.0,Other,Heart Disease,METFORMIN,01/15/2020,140/90,200.0,name@hospital.org,555-555-5555
1,emily davis,,Male,Diabetes,NONE,"April 5, 2018",120/80,200.0,,
2,laura martinez,35.0,Other,Asthma,METFORMIN,2019.12.01,110/70,160.0,contact@domain.com,
3,michael wilson,,Male,Diabetes,ALBUTEROL,01/15/2020,110/70,,name@hospital.org,555-555-5555
4,david lee,,Female,Asthma,NONE,2020/02/20,110/70,180.0,,


In [None]:
df_healthcare_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Patient Name    1000 non-null   object 
 1   Age             841 non-null    object 
 2   Gender          1000 non-null   object 
 3   Condition       794 non-null    object 
 4   Medication      1000 non-null   object 
 5   Visit Date      1000 non-null   object 
 6   Blood Pressure  834 non-null    object 
 7   Cholesterol     769 non-null    float64
 8   Email           616 non-null    object 
 9   Phone Number    821 non-null    object 
dtypes: float64(1), object(9)
memory usage: 78.2+ KB


In [None]:
df_healthcare_data.dtypes

Unnamed: 0,0
Patient Name,object
Age,object
Gender,object
Condition,object
Medication,object
Visit Date,object
Blood Pressure,object
Cholesterol,float64
Email,object
Phone Number,object


In [None]:
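# Replace the spelled-out age 'forty' with its numeric value
df_healthcare_data['Age'] = df_healthcare_data['Age'].replace('forty', 40)

In [None]:
# Convert Age to a numeric dtype; values that cannot be parsed become NaN
df_healthcare_data["Age"] = pd.to_numeric(df_healthcare_data["Age"], errors='coerce')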
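The two cells above can be tried on a small sample to see how `errors='coerce'` behaves. This is a minimal sketch: the sample values (including the `"n/a"` entry) are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking an object-typed Age column
ages = pd.Series(["25", "forty", None, "35.0", "n/a"], dtype=object)

# Map the known spelled-out value to a number first
ages = ages.replace("forty", 40)

# Anything that still cannot be parsed becomes NaN instead of raising an error
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric)
print("Unparseable or missing:", ages_numeric.isna().sum())
```

Because coercion silently turns unknown text into NaN, it can help to compare the missing-value count before and after the conversion to see how many entries were lost.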

In [None]:
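# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df_healthcare_data["Condition"] = df_healthcare_data["Condition"].fillna("Unknown")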

In [None]:
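# Cholesterol is already float64, so this conversion only guards against stray text values
df_healthcare_data["Cholesterol"] = pd.to_numeric(df_healthcare_data["Cholesterol"], errors='coerce')

In [None]:
# Blood Pressure, Email and Phone Number are already object columns; the casts below make that explicit
df_healthcare_data["Blood Pressure"] = df_healthcare_data["Blood Pressure"].astype(object)

In [None]:
df_healthcare_data["Email"] = df_healthcare_data["Email"].astype(object)

In [None]:
df_healthcare_data["Phone Number"] = df_healthcare_data["Phone Number"].astype(object)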

In [None]:
# format='mixed' (pandas 2.0 or later) infers the date format of each value individually
df_healthcare_data["Visit Date"] = pd.to_datetime(df_healthcare_data["Visit Date"], format='mixed')
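To illustrate the date conversion above, the sketch below parses the four date styles that appear in the first rows of the raw data. It assumes pandas 2.0 or later, since `format="mixed"` is not available in earlier versions.

```python
import pandas as pd

# The four date styles seen in the first rows of the raw Visit Date column
raw_dates = pd.Series(["01/15/2020", "April 5, 2018", "2019.12.01", "2020/02/20"])

# Each value is parsed on its own, so the formats do not need to agree
parsed = pd.to_datetime(raw_dates, format="mixed")

# Render the result in a single ISO style for inspection
iso_dates = parsed.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```

Per-value inference is convenient but can be risky for ambiguous strings: a value such as `01/02/2020` is read month-first by default, so it is worth confirming how the source system wrote its dates.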

In [None]:
df_healthcare_data.dtypes

Unnamed: 0,0
Patient Name,object
Age,float64
Gender,object
Condition,object
Medication,object
Visit Date,datetime64[ns]
Blood Pressure,object
Cholesterol,float64
Email,object
Phone Number,object


In [None]:
df_healthcare_data.isnull().sum()


Unnamed: 0,0
Patient Name,0
Age,159
Gender,0
Condition,0
Medication,0
Visit Date,0
Blood Pressure,166
Cholesterol,231
Email,384
Phone Number,179
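The raw counts above are easier to compare as a share of all rows. The following sketch shows one way to build such a summary; the `sample` frame and the `summary` column names are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for df_healthcare_data
sample = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, np.nan],
    "Cholesterol": [200.0, 200.0, 160.0, np.nan],
    "Gender": ["Other", "Male", "Other", "Male"],
})

# isnull() yields booleans, so sum() counts missing cells and mean() gives their share
missing_count = sample.isnull().sum()
missing_percent = sample.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary.sort_values("percent", ascending=False))
```

Applied to the real data, the same two lines would show, for example, that Email has the largest share of missing values (384 of 1000 rows).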


In [None]:
df_healthcare_data


Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number
0,david lee,25.0,Other,Heart Disease,METFORMIN,2020-01-15,140/90,200.0,name@hospital.org,555-555-5555
1,emily davis,,Male,Diabetes,NONE,2018-04-05,120/80,200.0,,
2,laura martinez,35.0,Other,Asthma,METFORMIN,2019-12-01,110/70,160.0,contact@domain.com,
3,michael wilson,,Male,Diabetes,ALBUTEROL,2020-01-15,110/70,,name@hospital.org,555-555-5555
4,david lee,,Female,Asthma,NONE,2020-02-20,110/70,180.0,,
...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70.0,Other,Asthma,ALBUTEROL,2019-03-25,110/70,,name@hospital.org,
996,mary clark,40.0,Other,Unknown,LISINOPRIL,2020-01-15,,160.0,,123-456-7890
997,laura martinez,40.0,Other,Unknown,ALBUTEROL,2020-02-20,110/70,,name@hospital.org,
998,jane smith,25.0,Male,Unknown,ALBUTEROL,2018-04-05,110/70,200.0,,


In [None]:
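# Fill missing ages with the mean age, then store Age as whole numbers
# (astype(int) truncates the fractional part of the mean)
mean_age = df_healthcare_data["Age"].mean()
df_healthcare_data["Age"] = df_healthcare_data["Age"].fillna(mean_age)
df_healthcare_data["Age"] = df_healthcare_data["Age"].astype(int)

# Fill missing cholesterol readings with the mean cholesterol
mean_cholesterol = df_healthcare_data["Cholesterol"].mean()
df_healthcare_data["Cholesterol"] = df_healthcare_data["Cholesterol"].fillna(mean_cholesterol)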
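The mean imputation above can be checked on a few values. The sketch below uses made-up ages and also shows median imputation for comparison; the median is an alternative that this notebook does not use, included only to show how the two differ when the data are skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([25.0, np.nan, 35.0, 70.0])

# mean() skips NaN, so this is (25 + 35 + 70) / 3
mean_age = ages.mean()

# Assigning the result of fillna back avoids the deprecated chained inplace call
filled = ages.fillna(mean_age)

# astype(int) truncates toward zero rather than rounding
as_int = filled.astype(int)

# Alternative (not used in the notebook): the median is less sensitive to extreme values
median_filled = ages.fillna(ages.median())

print(as_int.tolist(), median_filled.tolist())
```

If rounding to the nearest year is preferred over truncation, calling `round()` before `astype(int)` would be one option.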


In [None]:
df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number
0,david lee,25,Other,Heart Disease,METFORMIN,2020-01-15,140/90,200.00000,name@hospital.org,555-555-5555
1,emily davis,45,Male,Diabetes,NONE,2018-04-05,120/80,200.00000,,
2,laura martinez,35,Other,Asthma,METFORMIN,2019-12-01,110/70,160.00000,contact@domain.com,
3,michael wilson,45,Male,Diabetes,ALBUTEROL,2020-01-15,110/70,189.23277,name@hospital.org,555-555-5555
4,david lee,45,Female,Asthma,NONE,2020-02-20,110/70,180.00000,,
...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,2019-03-25,110/70,189.23277,name@hospital.org,
996,mary clark,40,Other,Unknown,LISINOPRIL,2020-01-15,,160.00000,,123-456-7890
997,laura martinez,40,Other,Unknown,ALBUTEROL,2020-02-20,110/70,189.23277,name@hospital.org,
998,jane smith,25,Male,Unknown,ALBUTEROL,2018-04-05,110/70,200.00000,,


In [None]:

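# Quote the placeholder (unquoted, 123-456-789 is subtraction); replace() matches only the literal text 'NAN', not missing values
df_healthcare_data['Phone Number'] = df_healthcare_data['Phone Number'].replace('NAN', '123-456-789')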

df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number
0,david lee,25,Other,Heart Disease,METFORMIN,2020-01-15,140/90,200.00000,name@hospital.org,555-555-5555
1,emily davis,45,Male,Diabetes,NONE,2018-04-05,120/80,200.00000,,
2,laura martinez,35,Other,Asthma,METFORMIN,2019-12-01,110/70,160.00000,contact@domain.com,
3,michael wilson,45,Male,Diabetes,ALBUTEROL,2020-01-15,110/70,189.23277,name@hospital.org,555-555-5555
4,david lee,45,Female,Asthma,NONE,2020-02-20,110/70,180.00000,,
...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,2019-03-25,110/70,189.23277,name@hospital.org,
996,mary clark,40,Other,Unknown,LISINOPRIL,2020-01-15,,160.00000,,123-456-7890
997,laura martinez,40,Other,Unknown,ALBUTEROL,2020-02-20,110/70,189.23277,name@hospital.org,
998,jane smith,25,Male,Unknown,ALBUTEROL,2018-04-05,110/70,200.00000,,
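The output above still shows empty phone numbers because `replace('NAN', ...)` looks for the literal text `NAN`, while missing values are stored as NaN. The sketch below contrasts the two cases on made-up values; the `"Unknown"` placeholder is an assumption chosen for illustration, not a value from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical phone column: one real number, one missing value, one literal 'NAN' string
phones = pd.Series(["555-555-5555", np.nan, "NAN"], dtype=object)

# Without quotes, Python evaluates the placeholder as arithmetic
unquoted = 123 - 456 - 789

# replace() swaps only the literal string and leaves the real NaN in place
replaced = phones.replace("NAN", "123-456-789")

# fillna() is what targets genuinely missing values
filled = phones.fillna("Unknown")

print(unquoted)
print(replaced.tolist())
print(filled.tolist())
```

For contact details, leaving the value missing or using a clearly artificial marker may be safer than inserting something that looks like a real phone number.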


In [None]:
# Split the column into systolic and diastolic blood pressure
df_healthcare_data[['systolic', 'diastolic']] = df_healthcare_data['Blood Pressure'].str.split("/", expand = True)

# Convert to numeric
df_healthcare_data['systolic'] = pd.to_numeric(df_healthcare_data['systolic'], errors='coerce')
df_healthcare_data['diastolic'] = pd.to_numeric(df_healthcare_data['diastolic'], errors='coerce')


# Calculate mean systolic and diastolic pressures
mean_systolic = df_healthcare_data['systolic'].mean()
mean_diastolic = df_healthcare_data['diastolic'].mean()

# Fill missing readings with the column means, then store them as whole numbers
df_healthcare_data['systolic'] = df_healthcare_data['systolic'].fillna(mean_systolic)
df_healthcare_data['diastolic'] = df_healthcare_data['diastolic'].fillna(mean_diastolic)
df_healthcare_data['systolic'] = df_healthcare_data['systolic'].astype(int)
df_healthcare_data['diastolic'] = df_healthcare_data['diastolic'].astype(int)


df_healthcare_data

Unnamed: 0,Patient Name,Age,Gender,Condition,Medication,Visit Date,Blood Pressure,Cholesterol,Email,Phone Number,systolic,diastolic
0,david lee,25,Other,Heart Disease,METFORMIN,2020-01-15,140/90,200.00000,name@hospital.org,555-555-5555,140,90
1,emily davis,45,Male,Diabetes,NONE,2018-04-05,120/80,200.00000,,,120,80
2,laura martinez,35,Other,Asthma,METFORMIN,2019-12-01,110/70,160.00000,contact@domain.com,,110,70
3,michael wilson,45,Male,Diabetes,ALBUTEROL,2020-01-15,110/70,189.23277,name@hospital.org,555-555-5555,110,70
4,david lee,45,Female,Asthma,NONE,2020-02-20,110/70,180.00000,,,110,70
...,...,...,...,...,...,...,...,...,...,...,...,...
995,mary clark,70,Other,Asthma,ALBUTEROL,2019-03-25,110/70,189.23277,name@hospital.org,,110,70
996,mary clark,40,Other,Unknown,LISINOPRIL,2020-01-15,,160.00000,,123-456-7890,125,81
997,laura martinez,40,Other,Unknown,ALBUTEROL,2020-02-20,110/70,189.23277,name@hospital.org,,110,70
998,jane smith,25,Male,Unknown,ALBUTEROL,2018-04-05,110/70,200.00000,,,110,70
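The blood pressure feature engineering above can be condensed into a few steps on a small sample. This is a sketch with made-up readings; it follows the same split, convert, impute and cast sequence as the cell, and the variable names `bp` and `parts` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings in "systolic/diastolic" form, with one missing value
bp = pd.Series(["140/90", np.nan, "110/70"], dtype=object)

# Split into two columns; a missing reading yields missing values in both
parts = bp.str.split("/", expand=True)
parts.columns = ["systolic", "diastolic"]

# Convert each column to numbers, turning anything unparseable into NaN
parts = parts.apply(pd.to_numeric, errors="coerce")

# Fill each column with its own mean, then store as integers
parts = parts.fillna(parts.mean()).astype(int)

print(parts)
```

Here the missing row receives the column means, 125 and 80, which mirrors how row 996 in the output above was filled with 125 and 81 from the full dataset.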
