<a href="https://colab.research.google.com/github/washmore1/PopulationHealthcareAnalytics/blob/main/HealthcareDatabaseGeneration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Generating Fake, Raw Data to Simulate Healthcare Data Sources**

This notebook builds a large-scale, synthetic raw healthcare dataset focused on patients with Type 2 diabetes. It produces six tables:

**1.   Patient Demographics Table**

  *   Patient ID
  *   Age
  *   Sex
  *   County
  *   Insurance Type

**2.   Clinical Diagnoses Table**

  *   Patient ID
  *   Diagnosis code
  *   Diagnosis date

**3.   Lab Results Table**

  *   Patient ID
  *   Lab type
  *   Lab result value
  *   Date

**4.   Medication Table**

  *   Patient ID
  *   Medication name
  *   Prescription start date
  *   Prescription end date

**5.   Visits Table**

  *   Patient ID
  *   Visit date
  *   Visit type
  *   Outcome

**6.   Care Management Table**

  *   Patient ID
  *   Care management type
  *   Start date
  *   Completion status

In terms of volume, the generated database contains 10,000 unique patients spread across 50 made-up counties. Each patient has one or two diagnosis rows, 2 to 6 lab results, 1 to 3 prescriptions, and 3 to 10 visits; about 40% of patients also have one care management row.


In [17]:
# Install and import libraries (scikit-learn is not used in this notebook, so it is not installed here)
!pip install pandas faker numpy

import numpy as np
import pandas as pd
from faker import Faker
import random
from datetime import timedelta, datetime



In [18]:
# Seed every random source for reproducibility
# (Faker keeps its own generator, so seeding `random` and NumPy alone is not enough)
Faker.seed(42)
fake = Faker()
np.random.seed(42)
random.seed(42)
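
Seeding the `random` module and NumPy only controls those two generators. A library that holds its own generator object is unaffected, which is why Faker documents a separate `Faker.seed()` call. The sketch below illustrates the general point with the standard library alone; `private_rng` is a hypothetical stand-in for a generator kept inside a library, not Faker's actual internals.

```python
import random

# Hypothetical stand-in for a generator that a library keeps internally.
private_rng = random.Random()

random.seed(42)
first_global = random.random()
first_private = private_rng.random()

random.seed(42)
second_global = random.random()
second_private = private_rng.random()

# The module-level stream repeats after re-seeding.
print(first_global == second_global)
# The private stream was never re-seeded, so it almost surely differs.
print(first_private == second_private)
```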

In [19]:
# Constants and Setup
NUM_PATIENTS = 10000  # I want 10,000 patients in my generated database
NUM_COUNTIES = 50 # I want 50 counties in my generated database
START_DATE = datetime(2022, 1, 1)
END_DATE = datetime(2025, 1, 1)

# Generate 50 unique fake county names (Faker city names stand in for counties)
COUNTIES = set()
while len(COUNTIES) < NUM_COUNTIES:
    COUNTIES.add(fake.city())
COUNTIES = sorted(COUNTIES)  # sort so the list order is stable across runs

INSURANCE_TYPES = ['Medicare', 'Medicaid', 'Private', 'Uninsured']
SEXES = ['Male', 'Female']

# Generate Patient ID Primary Key
patient_ids = [f"P{str(i).zfill(6)}" for i in range(1, NUM_PATIENTS + 1)]
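
Two details of this setup cell are easy to check in isolation: the zero-padded ID format, and the order in which a set of names is turned into a list. Python randomizes string hashes per process by default, so iterating a set of strings can yield a different order from one run to the next; sorting first gives a stable order before any seeded sampling. A minimal sketch, using a handful of illustrative names and a small ID range rather than the notebook's full data:

```python
# Zero-padded IDs: the format spec 06d is equivalent to str(i).zfill(6) here.
ids = [f"P{i:06d}" for i in range(1, 6)]
print(ids)

# IDs act as a primary key, so they must be unique.
ids_are_unique = len(set(ids)) == len(ids)
print(ids_are_unique)

# Illustrative names only; sorting makes the list order independent of set iteration order.
names = {"West Jamesview", "New Jason", "South Janet"}
stable_names = sorted(names)
print(stable_names)
```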

**Patient Demographics Table (df_demo)**

In [20]:
demo_data = {
    'Patient_ID': patient_ids,
    'Age': np.random.randint(18, 90, NUM_PATIENTS),  # Ages 18 to 89
    'Sex': np.random.choice(SEXES, NUM_PATIENTS),
    'County': np.random.choice(COUNTIES, NUM_PATIENTS),
    'Insurance_Type': np.random.choice(INSURANCE_TYPES, NUM_PATIENTS, p=[0.3, 0.2, 0.4, 0.1])
}
df_demo = pd.DataFrame(demo_data)

**Clinical Diagnoses Table (df_diag)**

In [21]:
# Focus on Type 2 diabetes (ICD-10-CM category E11)
# E11.2 = with kidney complications; E11.9 = without complications
diagnosis_codes = ['E11', 'E11.2', 'E11.9']
diag_records = []

for pid in patient_ids:
    # Each patient has a primary diabetes diagnosis
    diag_date = fake.date_between(start_date=START_DATE, end_date=END_DATE)
    diag_records.append((pid, diagnosis_codes[0], diag_date))

    # 30% of patients also receive a more specific follow-up code,
    # dated 30-365 days later (this can fall after END_DATE)
    if random.random() < 0.3:
        comp_date = diag_date + timedelta(days=random.randint(30, 365))
        diag_records.append((pid, random.choice(diagnosis_codes[1:]), comp_date))

df_diag = pd.DataFrame(diag_records, columns=['Patient_ID', 'Diagnosis_Code', 'Diagnosis_Date'])
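
Because the second diagnosis is dated 30 to 365 days after the first, it can land after `END_DATE` when the first diagnosis is late in the window. If that matters for downstream analysis, one option is to bound the offset by the room left in the window. The sketch below is one possible approach, not the notebook's method; `follow_up_date` and `WINDOW_END` are hypothetical names, and it uses a standard-library `random.Random` instance instead of the module-level generator.

```python
import random
from datetime import date, timedelta

WINDOW_END = date(2025, 1, 1)  # mirrors END_DATE in the notebook


def follow_up_date(primary, rng, min_days=30, max_days=365, window_end=WINDOW_END):
    """Return a date min_days..max_days after primary that stays inside the window.

    Returns None when fewer than min_days remain before window_end.
    """
    room = (window_end - primary).days
    if room < min_days:
        return None
    return primary + timedelta(days=rng.randint(min_days, min(max_days, room)))


rng = random.Random(0)
early = follow_up_date(date(2022, 1, 1), rng)    # plenty of room in the window
mid = follow_up_date(date(2024, 6, 1), rng)      # offset is capped by the window
late = follow_up_date(date(2024, 12, 20), rng)   # no room left, so None
print(early, mid, late)
```

Bounding the offset shortens follow-up intervals for late diagnoses; dropping out-of-window rows afterwards is an alternative with a different bias.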

**Lab Results Table (df_labs)**

In [22]:
lab_types = ['HbA1c', 'LDL Cholesterol', 'Creatinine']
lab_records = []

for pid in patient_ids:
    num_labs = random.randint(2, 6)  # Each patient gets 2-6 lab tests
    for _ in range(num_labs):
        lab_date = fake.date_between(start_date=START_DATE, end_date=END_DATE)
        lab_type = random.choice(lab_types)

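        # Draw from a normal distribution, then clip to a plausible range
        # (assumed units: HbA1c in %, LDL cholesterol and creatinine in mg/dL)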
        if lab_type == 'HbA1c':
            lab_val = round(np.random.normal(7.0, 1.5), 2)
            lab_val = max(4.5, min(lab_val, 14.0))
        elif lab_type == 'LDL Cholesterol':
            lab_val = round(np.random.normal(110, 40), 1)
            lab_val = max(30, min(lab_val, 250))
        else:  # Creatinine
            lab_val = round(np.random.normal(1.1, 0.4), 2)
            lab_val = max(0.3, min(lab_val, 3.5))

        lab_records.append((pid, lab_type, lab_val, lab_date))

df_labs = pd.DataFrame(lab_records, columns=['Patient_ID', 'Lab_Type', 'Lab_Value', 'Lab_Date'])
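
The three branches above follow the same pattern: draw from a normal distribution, round, then clip to a range. That pattern can be expressed as one table-driven helper. The sketch below reuses the means, standard deviations, and bounds from the cell above, but swaps `np.random.normal` for the standard library's `random.gauss` so it runs without NumPy; `LAB_PARAMS` and `draw_lab` are hypothetical names.

```python
import random

# (mean, standard deviation, lower bound, upper bound, decimal places),
# copied from the lab-generation cell above.
LAB_PARAMS = {
    'HbA1c': (7.0, 1.5, 4.5, 14.0, 2),
    'LDL Cholesterol': (110.0, 40.0, 30.0, 250.0, 1),
    'Creatinine': (1.1, 0.4, 0.3, 3.5, 2),
}


def draw_lab(lab_type, rng):
    """Draw one rounded, clipped lab value for the given lab type."""
    mean, sd, low, high, ndigits = LAB_PARAMS[lab_type]
    value = round(rng.gauss(mean, sd), ndigits)
    return max(low, min(value, high))


rng = random.Random(42)
draws = {lab: [draw_lab(lab, rng) for _ in range(1000)] for lab in LAB_PARAMS}
for lab, values in draws.items():
    print(lab, min(values), max(values))
```

Clipping moves every out-of-range draw onto the boundary, so the minimum and maximum values are slightly over-represented compared with a truncated distribution.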

**Medications Table (df_meds)**

In [23]:
medications = ['Metformin', 'Insulin', 'Glipizide', 'Sitagliptin']
med_records = []

for pid in patient_ids:
    num_meds = random.randint(1, 3)  # Each patient has 1–3 prescriptions
    start_base = fake.date_between(start_date=START_DATE, end_date=END_DATE - timedelta(days=365))

    for _ in range(num_meds):
        med = random.choice(medications)
        start_offset = timedelta(days=random.randint(0, 300))
        end_offset = start_offset + timedelta(days=random.randint(30, 365))

        start_date = start_base + start_offset
        end_date = start_base + end_offset

        med_records.append((pid, med, start_date, end_date))

df_meds = pd.DataFrame(med_records, columns=['Patient_ID', 'Medication', 'Start_Date', 'End_Date'])
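
The offsets in this cell determine how far prescription dates can extend past the study window. The worked arithmetic below traces the latest possible base, start, and end dates from the constants used above; the variable names are illustrative.

```python
from datetime import date, timedelta

END = date(2025, 1, 1)  # mirrors END_DATE in the notebook

# start_base is drawn no later than END_DATE minus 365 days.
latest_base = END - timedelta(days=365)
# The start offset is at most 300 days.
latest_start = latest_base + timedelta(days=300)
# The end date is at most 365 days after the start date.
latest_end = latest_start + timedelta(days=365)

print(latest_base, latest_start, latest_end)
print("end dates can exceed the window:", latest_end > END)
```

Start dates always stay inside the window, while end dates can fall after `END_DATE`, which reads naturally as a prescription that is still active when the window closes.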

**Visits Table (df_visits)**

In [24]:
visit_types = ['Primary Care', 'Endocrinology', 'Emergency', 'Nutrition Counseling']
visit_records = []

for pid in patient_ids:
    num_visits = random.randint(3, 10)  # Each patient has 3–10 visits
    for _ in range(num_visits):
        visit_date = fake.date_between(start_date=START_DATE, end_date=END_DATE)
        visit_type = random.choice(visit_types)
        outcome = random.choice(['Hospitalized', 'Follow-up Scheduled', 'No Issues'])

        visit_records.append((pid, visit_date, visit_type, outcome))

df_visits = pd.DataFrame(visit_records, columns=['Patient_ID', 'Visit_Date', 'Visit_Type', 'Outcome'])
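
In the cell above the outcome is drawn independently of the visit type, so a nutrition counseling visit is as likely to end in hospitalization as an emergency visit. If more structure is wanted, one option is to weight outcomes by visit type. The sketch below shows the mechanics with `random.choices`; the weights are made-up illustrative numbers, not clinical estimates, and `OUTCOME_WEIGHTS` and `draw_outcome` are hypothetical names.

```python
import random

OUTCOMES = ['Hospitalized', 'Follow-up Scheduled', 'No Issues']

# Hypothetical weights per visit type, in the same order as OUTCOMES.
OUTCOME_WEIGHTS = {
    'Primary Care': [0.02, 0.48, 0.50],
    'Endocrinology': [0.03, 0.62, 0.35],
    'Emergency': [0.35, 0.45, 0.20],
    'Nutrition Counseling': [0.00, 0.40, 0.60],
}


def draw_outcome(visit_type, rng):
    """Draw one outcome using the weights for the given visit type."""
    return rng.choices(OUTCOMES, weights=OUTCOME_WEIGHTS[visit_type], k=1)[0]


rng = random.Random(42)
sample = [draw_outcome('Nutrition Counseling', rng) for _ in range(1000)]
print({outcome: sample.count(outcome) for outcome in OUTCOMES})
```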

**Care Management Table (df_care)**

In [25]:
care_types = ['Diabetes Education', 'Nutrition Plan', 'Medication Adherence', 'Case Management']
care_records = []

for pid in patient_ids:
    # Only 40% of patients receive care management
    if random.random() < 0.4:
        care_type = random.choice(care_types)
        start_date = fake.date_between(start_date=START_DATE, end_date=END_DATE)
        completion = random.choice(['Completed', 'In Progress', 'Not Started'])

        care_records.append((pid, care_type, start_date, completion))

df_care = pd.DataFrame(care_records, columns=['Patient_ID', 'Care_Type', 'Start_Date', 'Completion_Status'])
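
Since `Patient_ID` links every table back to the demographics table, a quick referential check is useful before analysis: every ID in a child table should exist in the parent, and tables meant to hold one row per patient should have no repeats. The sketch below uses plain Python collections and toy IDs; `orphan_ids` and `duplicate_ids` are hypothetical helper names.

```python
def orphan_ids(child_ids, parent_ids):
    """Return the sorted child IDs that have no match among parent_ids."""
    return sorted(set(child_ids) - set(parent_ids))


def duplicate_ids(ids):
    """Return the sorted IDs that appear more than once."""
    seen, dupes = set(), set()
    for pid in ids:
        if pid in seen:
            dupes.add(pid)
        else:
            seen.add(pid)
    return sorted(dupes)


# Toy data: one ID missing from the parent table, one repeated ID.
demo_ids = ['P000001', 'P000002', 'P000003']
care_ids = ['P000002', 'P000003', 'P000003', 'P000009']

print(orphan_ids(care_ids, demo_ids))  # ['P000009']
print(duplicate_ids(care_ids))         # ['P000003']
```

Both helpers accept any iterable of IDs, so in the notebook they could be called on columns such as `df_care['Patient_ID']` and `df_demo['Patient_ID']`; both results should be empty for the tables generated above.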

**View Shapes to Ensure Success**

In [26]:
print(f"Demo shape: {df_demo.shape}")
print(f"Diagnosis shape: {df_diag.shape}")
print(f"Labs shape: {df_labs.shape}")
print(f"Medications shape: {df_meds.shape}")
print(f"Visits shape: {df_visits.shape}")
print(f"Care Management shape: {df_care.shape}")

# Preview a table
df_demo.head()

Demo shape: (10000, 5)
Diagnosis shape: (12988, 3)
Labs shape: (40347, 4)
Medications shape: (19867, 4)
Visits shape: (64406, 4)
Care Management shape: (4025, 4)


|   | Patient_ID | Age | Sex | County | Insurance_Type |
|---|---|---|---|---|---|
| 0 | P000001 | 69 | Male | West Jamesview | Medicare |
| 1 | P000002 | 32 | Male | New Jason | Medicare |
| 2 | P000003 | 89 | Female | South Janet | Private |
| 3 | P000004 | 78 | Male | Sophiahaven | Medicare |
| 4 | P000005 | 38 | Female | New Nicholastown | Medicare |
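
The printed shapes can be checked against the row counts implied by the generation rules. The sketch below works out the expected rows per patient for each table and compares them with the observed counts copied from the output above; the dictionary names are illustrative.

```python
NUM_PATIENTS = 10_000

# Expected rows per patient, derived from the generation rules in the cells above.
expected_per_patient = {
    'Demo': 1.0,
    'Diagnosis': 1 + 0.3,        # one primary code plus a 30% chance of a second
    'Labs': (2 + 6) / 2,         # uniform integer on 2..6
    'Medications': (1 + 3) / 2,  # uniform integer on 1..3
    'Visits': (3 + 10) / 2,      # uniform integer on 3..10
    'Care Management': 0.4,      # 40% of patients get one row
}

# Row counts copied from the printed shapes above.
observed = {
    'Demo': 10000, 'Diagnosis': 12988, 'Labs': 40347,
    'Medications': 19867, 'Visits': 64406, 'Care Management': 4025,
}

rel_diffs = {}
for table, per_patient in expected_per_patient.items():
    expected = per_patient * NUM_PATIENTS
    rel_diffs[table] = (observed[table] - expected) / expected
    print(f"{table:16s} expected {expected:7.0f}  observed {observed[table]:6d}  "
          f"diff {rel_diffs[table]:+.2%}")
```

A large gap between expected and observed counts would point to a changed parameter or a loop that did not run for every patient.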


**Save to CSVs to begin analysis and manipulation**

In [27]:
df_demo.to_csv('patient_demographics.csv', index=False)
df_diag.to_csv('clinical_diagnoses.csv', index=False)
df_labs.to_csv('lab_results.csv', index=False)
df_meds.to_csv('medications.csv', index=False)
df_visits.to_csv('visits.csv', index=False)
df_care.to_csv('care_management.csv', index=False)
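
CSV files carry no type information, so date columns written by `to_csv` come back as plain strings unless they are parsed on load. The sketch below shows the difference using the `parse_dates` parameter of `pd.read_csv`; it reads from an in-memory string with two made-up rows rather than from the files written above.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as lab_results.csv.
csv_text = (
    "Patient_ID,Lab_Type,Lab_Value,Lab_Date\n"
    "P000001,HbA1c,7.25,2023-05-17\n"
    "P000002,Creatinine,1.04,2024-11-02\n"
)

# Without parse_dates, Lab_Date is loaded as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, Lab_Date becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Lab_Date'])

print(raw.dtypes)
print(parsed.dtypes)
```

The same argument applies to the other date columns in these tables, such as `Diagnosis_Date`, `Start_Date`, `End_Date`, and `Visit_Date`.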