### Merge All Datasets — Patient + Biometric + Medical

Alright, this part is about getting everything into one place.

We’re starting with three separate files:
- Patient demographics (pretty clean and stable)
- Biometric indicators (some patients won’t have this—fair enough)
- Medical records (this one will be the messiest, I expect)

The goal is to create one flat table that has everything we need: who the patient is, what their body metrics are, and what kind of care they’ve received. But I’m cautious here: a join multiplies rows whenever a patient appears more than once on either side, which will happen for patients with multiple diagnoses or visit types.
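To make the row-multiplication concern concrete, here is a minimal sketch on toy data. The frames `medical_demo` and `biometric_demo` and their values are made up for illustration and are not taken from the real files; the point is only that a patient with two rows on each side comes out of the join with four rows.

```python
import pandas as pd

# Hypothetical toy data, NOT the real datasets
medical_demo = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "diagnosis": ["Hypertension", "Depression", "Common Cold"],
})
biometric_demo = pd.DataFrame({
    "patient_id": ["P1", "P1", "P3"],  # P1 was screened twice
    "bmi": [24.0, 25.0, 30.0],
})

# Cheap pre-check: does either side repeat a patient_id?
medical_has_repeats = medical_demo["patient_id"].duplicated().any()
biometric_has_repeats = biometric_demo["patient_id"].duplicated().any()

# P1 has 2 medical rows and 2 biometric rows, so the join yields 2 x 2 = 4 rows for P1
joined = pd.merge(medical_demo, biometric_demo, on="patient_id", how="outer")
rows_per_patient = joined.groupby("patient_id").size()
print(rows_per_patient)
```

If only the medical side repeats a patient (one row per diagnosis), the extra rows are expected. Repeats on both sides are the case to watch, because each diagnosis then gets paired with every screening.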

In [1]:
# Step 0: Load required libraries
# We'll be working mainly with pandas for loading, merging, and previewing data
import pandas as pd

In [2]:
# Step 1: Load each dataset
# A quick read, just pulling in the raw data first
biometric = pd.read_csv("Biometric Data.csv")
medical = pd.read_csv("Medical Records.csv")
patient = pd.read_csv("Patient Data.csv")

In [3]:
# Step 2: Keep only the columns relevant to our business problem statement.
patient = patient[['patient_id', 'age', 'gender', 'ethnicity', 'zip_code']]
biometric = biometric[['patient_id', 'bmi', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'cholesterol_total']]
medical = medical[['patient_id', 'diagnosis', 'diagnosis_date', 'visit_type', 'cost']]

In [4]:
# Step 3: Merge biometric and medical data using OUTER JOIN
# This is a key move: we want patients who had biometric screening or medical visits, or both.
# This way, we capture not just the diagnosed patients, but also the "at-risk" ones who may be flying under the radar.
core = pd.merge(medical, biometric, on='patient_id', how='outer')
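One way to see which patients the outer join picked up from each side is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The sketch below uses made-up toy frames (`medical_demo`, `biometric_demo`) rather than the real data, so treat it as an illustration of the idea, not as output from this notebook.

```python
import pandas as pd

# Hypothetical toy data, NOT the real datasets
medical_demo = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "diagnosis": ["Hypertension", "Common Cold"],
})
biometric_demo = pd.DataFrame({
    "patient_id": ["P2", "P3"],
    "bmi": [22.5, 27.4],
})

# indicator=True adds a "_merge" column with values
# "left_only" (medical only), "right_only" (biometric only), or "both"
core_demo = pd.merge(
    medical_demo, biometric_demo,
    on="patient_id", how="outer", indicator=True,
)
source_counts = core_demo["_merge"].value_counts()
print(source_counts)
```

In this toy case the `right_only` rows are the screened-but-never-diagnosed patients, which is the "at-risk" group the outer join is meant to keep.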

### Use a LEFT JOIN when adding demographics to the core dataset (patients with a biometric or medical interaction)

Why not INNER? Because missing demographic data doesn’t mean the patient is irrelevant — it may mean they’re new or partially recorded. These are often exactly the kinds of patients Loblaw should monitor.

I’ll handle missing age/gender/ethnicity in the cleaning step, rather than excluding them prematurely.
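A small sketch of the difference, on made-up toy frames (`core_demo` and `patient_demo` are hypothetical stand-ins, not the real data). It also passes `validate="many_to_one"`, an optional `pd.merge` argument that raises an error if the demographics table ever repeats a `patient_id`, so the left join cannot silently add rows.

```python
import pandas as pd

# Hypothetical toy data, NOT the real datasets
core_demo = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "bmi": [21.3, 27.5, 25.9],
})
patient_demo = pd.DataFrame({
    "patient_id": ["P1", "P2"],  # P3 has no demographic record
    "age": [33, 54],
})

# LEFT JOIN keeps every core row; P3 gets NaN for age
left_demo = pd.merge(
    core_demo, patient_demo,
    on="patient_id", how="left",
    validate="many_to_one",  # fails loudly if patient_demo repeats a patient_id
)

# INNER JOIN would silently drop P3
inner_demo = pd.merge(core_demo, patient_demo, on="patient_id", how="inner")

# How many rows will need attention in the cleaning step?
missing_age = left_demo["age"].isna().sum()
print(len(left_demo), len(inner_demo), missing_age)
```

The count of missing values gives a quick sense of how much work the cleaning step has to do before deciding whether to impute or flag those patients.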

In [5]:
# Step 4: Merge in demographic info (LEFT JOIN)
# Now that we have a core patient list, we’ll bring in demographic details.
# Using a LEFT JOIN here — if some patients don’t have age/gender info, we’ll still keep them (and handle it in cleaning later).
merged_df = pd.merge(core, patient, on='patient_id', how='left')

In [6]:
# Step 5: Peek at the top of the dataset to confirm it looks okay.
merged_df.head()

| | patient_id | diagnosis | diagnosis_date | visit_type | cost | bmi | blood_pressure_systolic | blood_pressure_diastolic | cholesterol_total | age | gender | ethnicity | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAT00001 | Common Cold | 4/11/2024 | Emergency | 60.32 | 21.283985 | 131.0 | 78.0 | 191.0 | 33.0 | Male | African American | 92106 |
| 1 | PAT00002 | Depression | 3/15/2024 | Outpatient | 587.01 | 21.267961 | 103.0 | 65.0 | 211.0 | 54.0 | Female | African American | 24249 |
| 2 | PAT00002 | Hypertension | 8/1/2020 | Outpatient | 439.93 | 21.267961 | 103.0 | 65.0 | 211.0 | 54.0 | Female | African American | 24249 |
| 3 | PAT00003 | NaN | NaN | NaN | NaN | 27.460336 | 106.0 | 74.0 | 190.0 | 65.0 | Female | Caucasian | 81306 |
| 4 | PAT00004 | Common Cold | 10/3/2021 | Telehealth | 148.96 | 25.898735 | 112.0 | 73.0 | 233.0 | 27.0 | Female | Asian | 81707 |


In [7]:
# Save the merged dataset
# Just to be safe, I’m saving this merged dataset as a CSV. It’s not cleaned yet, but I don’t want to redo all the joins if something crashes or I need to restart the notebook later.
# index=False keeps the DataFrame's row index out of the file.
merged_df.to_csv("Merged_Healthcare_Data.csv", index=False)

In [8]:
# Print a confirmation message (this line is only reached if to_csv above did not raise an error)
print("Merged dataset saved as 'Merged_Healthcare_Data.csv'")

Merged dataset saved as 'Merged_Healthcare_Data.csv'
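The print above only says the save call did not fail. A slightly stronger check is to read the file back and compare it with what was written. The sketch below does this on a made-up two-row frame in a temporary directory (`demo_df` and `merged_demo.csv` are hypothetical names, not the real dataset or file), so it is safe to run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, NOT the real merged dataset
demo_df = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "cost": [60.32, 587.01],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "merged_demo.csv")
    demo_df.to_csv(out_path, index=False)

    # Check 1: the file is actually on disk
    file_exists = os.path.exists(out_path)

    # Check 2: reading it back gives the same number of rows and columns
    reloaded = pd.read_csv(out_path)

shapes_match = reloaded.shape == demo_df.shape
print(file_exists, shapes_match)
```

The same two checks could be applied to `merged_df` and `Merged_Healthcare_Data.csv`. Note that dtypes can change on a round trip through CSV, so comparing shapes and column names is a more forgiving test than strict equality.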
