# <p style="color:hotpink"> **Maternal Telehealth Synthetic Data Project**

<p style="color:pink">This notebook generates a realistic dataset of maternal telehealth visits, including patient demographics, insurance, visit types, and communication patterns.

<p style="color:pink">It simulates operational outcomes (Show, No-Show, Cancelled, Scheduled) plus patient messages with labeled intent and stress to reflect real-world telehealth operations.

---

## <p style="color:hotpink"> **Data Simulation**

<p style="color:pink">First, simulate 10,000 maternal telehealth appointments with realistic mixes of:

- Insurance types (Employer Insured, Major Insurance, Medicaid, Self Pay)
- Referral sources (Provider, DME, Employer)
- Visit types (New Patient Prenatal, New Patient Postpartum, Regular Care, Follow-up)
- Past visit patterns and no-shows

<p style="color:pink">This builds our structured dataset foundation.
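<p style="color:pink">As a minimal sketch of how a category mix turns into sampled values, the snippet below draws from the notebook's insurance weights with a locally seeded generator and checks the resulting shares. The helper name 'draw_category' and the local 'rng' are assumptions made for illustration; the notebook itself calls random.choices inline.

```python
import random
from collections import Counter

# Weights copied from the notebook's insurance_mix.
insurance_mix = {
    "Employer Insured": 0.02,
    "Major Insurance": 0.85,
    "Medicaid": 0.11,
    "Self Pay": 0.02,
}

def draw_category(mix, rng):
    """Hypothetical helper: draw one key with probability proportional to its weight."""
    return rng.choices(list(mix.keys()), weights=list(mix.values()))[0]

rng = random.Random(42)  # local generator so the draws are repeatable
draws = [draw_category(insurance_mix, rng) for _ in range(10_000)]

# Empirical share of each category; these should sit close to the weights.
shares = {name: count / len(draws) for name, count in Counter(draws).items()}
print(shares)
```

<p style="color:pink">With 10,000 draws the empirical shares should land within a percentage point or so of the configured weights.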

In [1]:
# Imports and configuration for the simulation
import pandas as pd
import numpy as np
import random
from faker import Faker

fake = Faker()
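Faker.seed(42)  # Seeds Faker only; the random and np.random draws below are unseeded
N = 10000  # Number of simulated appointments (one synthetic patient per row)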

# Distributions
insurance_mix = {
    "Employer Insured": 0.02,
    "Major Insurance": 0.85,
    "Medicaid": 0.11,
    "Self Pay": 0.02
}

referral_mix = {
    "Provider Referred": 0.50,
    "DME Referred": 0.48,
    "Employer Referred": 0.02,
}

visit_type_mix = {
    "New Patient Postpartum": 0.25,
    "New Patient Prenatal": 0.25,
    "Regular Care": 0.30,
    "Follow-up": 0.20,
}

def simulate_communication(booking_days_prior):
    confirmed = np.random.rand() < 0.75 if booking_days_prior >= 2 else False
    ignored = not confirmed
    return {
        "confirmed_2_days_prior": confirmed,
        "ignored_all_outreach": ignored
    }

def generate_zip():
    return fake.zipcode_in_state(state_abbr=random.choice([
        'CA', 'TX', 'NY', 'FL', 'PA', 'IL', 'OH', 'GA', 'NC', 'MI', 'AZ', 'IN',
        'MO', 'WI', 'CO', 'MN', 'AL', 'SC', 'VA', 'TN', 'MA', 'OK', 'KY', 'LA'
    ]))

# Generate data
data = []
for _ in range(N):
    insurance = random.choices(list(insurance_mix.keys()), weights=insurance_mix.values())[0]
    referral = random.choices(list(referral_mix.keys()), weights=referral_mix.values())[0]
    visit_type = random.choices(list(visit_type_mix.keys()), weights=visit_type_mix.values())[0]
    booking_days_prior = random.randint(0, 30)
    comms = simulate_communication(booking_days_prior)
    past_visits = np.random.randint(0, 6)
    past_no_shows = np.random.choice([0, 1], p=[0.9, 0.1])
    credit_card_on_file = np.random.choice([True, False], p=[0.30, 0.70])
    patient_age = np.random.randint(18, 45)
    baby_age_months = np.random.randint(0, 18) if "Postpartum" in visit_type or "Regular" in visit_type else None

    row = {
        "patient_id": fake.uuid4(),
        "patient_name": fake.name_female(),
        "patient_age": patient_age,
        "zip_code": generate_zip(),
        "insurance": insurance,
        "referral_source": referral,
        "visit_type": visit_type,
        "booking_days_prior": booking_days_prior,
        "confirmed_2_days_prior": comms["confirmed_2_days_prior"],
        "ignored_all_outreach": comms["ignored_all_outreach"],
        "past_visits": past_visits,
        "past_no_shows": past_no_shows,
        "credit_card_on_file": credit_card_on_file,
        "baby_age_months": baby_age_months
    }
    data.append(row)

df = pd.DataFrame(data)

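# Assign an outcome to each historical appointment.
# Rules are checked in order; the first matching rule decides the status.
def assign_status(row):
    # Card on file or confirmed 2 days prior: almost always a Show (1% cancel).
    if row['credit_card_on_file'] or row['confirmed_2_days_prior']:
        return "Show" if np.random.rand() > 0.01 else "Cancelled"
    # Unconfirmed with fewer than 3 past visits: 25% No-Show, 75% Cancelled.
    # (ignored_all_outreach is always True here because it is `not confirmed`.)
    if row['ignored_all_outreach'] and row['past_visits'] < 3:
        return "No-Show" if np.random.rand() < 0.25 else "Cancelled"
    # Self-pay patients who booked more than 5 days out: 30% cancel.
    if row['insurance'] == "Self Pay" and row['booking_days_prior'] > 5:
        return "Cancelled" if np.random.rand() < 0.3 else "Show"
    # A prior no-show on record: 15% No-Show.
    if row['past_no_shows'] >= 1:
        return "No-Show" if np.random.rand() < 0.15 else "Show"
    # Everyone else: baseline outcome mix.
    return np.random.choice(["Show", "No-Show", "Cancelled"], p=[0.88, 0.07, 0.05])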

df["status"] = df.apply(assign_status, axis=1)

df.head(10)


,patient_id,patient_name,patient_age,zip_code,insurance,referral_source,visit_type,booking_days_prior,confirmed_2_days_prior,ignored_all_outreach,past_visits,past_no_shows,credit_card_on_file,baby_age_months,status
0,bdd640fb-0667-4ad1-9c80-317fa3b1799d,Courtney Doyle,32,95544,Major Insurance,Provider Referred,New Patient Postpartum,1,False,True,3,0,False,10.0,Show
1,16419f82-8b9d-4434-a465-e150bd9c66b3,Amanda Davis,21,55477,Major Insurance,Provider Referred,New Patient Prenatal,11,False,True,5,0,False,,Show
2,8fadc1a6-06cb-4fb3-9a1d-e644815ef6d1,Marie Gardner,30,32906,Major Insurance,Provider Referred,New Patient Prenatal,17,False,True,2,0,True,,Show
3,cf36d58b-4737-4190-96da-1dac72ff5d2a,Olivia Moore,23,85866,Major Insurance,Provider Referred,New Patient Prenatal,3,True,False,0,0,True,,Show
4,371ecd7b-27cd-4130-8722-9389571aa876,Gabrielle Davis,32,70199,Major Insurance,DME Referred,Regular Care,0,False,True,2,0,False,8.0,Cancelled
5,9a8dca03-580d-4b71-98f5-64135be6128e,Amanda Stevens,19,63512,Major Insurance,DME Referred,New Patient Postpartum,18,True,False,1,0,True,9.0,Show
6,142c3fe8-60e7-4113-ac1b-8ca1f91e1d4c,Sandra Montgomery,25,36768,Major Insurance,DME Referred,Regular Care,15,True,False,2,0,False,11.0,Show
7,b45ed1f0-3139-432c-93cd-59bf5c941cf0,Mary Mejia,33,35167,Major Insurance,Provider Referred,Regular Care,5,False,True,2,0,True,17.0,Show
8,19db3ad0-ddd1-4fb2-bb98-2ef8daf61a26,Jody Flowers,19,46334,Major Insurance,DME Referred,Regular Care,12,True,False,5,0,True,1.0,Show
9,ab9099a4-35a2-40ae-9af3-05535ec42e08,Taylor Wong,20,86248,Major Insurance,Provider Referred,Regular Care,30,True,False,2,0,False,3.0,Show


In [2]:
df['status'].value_counts()

status
Show         8779
Cancelled     897
No-Show       324
Name: count, dtype: int64
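<p style="color:pink">To read these counts as rates, the short sketch below converts the three counts printed above into shares. It uses plain Python on the reported numbers; in pandas the same shares should come from value_counts(normalize=True).

```python
# Counts copied from the value_counts() output above.
counts = {"Show": 8779, "Cancelled": 897, "No-Show": 324}

total = sum(counts.values())

# Share of each outcome among all historical appointments.
shares = {status: n / total for status, n in counts.items()}

for status, share in shares.items():
    print(f"{status:<10} {share:.2%}")
```

<p style="color:pink">No-Show makes up only about 3% of historical rows, so any downstream model that predicts no-shows is working with a heavily imbalanced target.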

## <p style="color:hotpink"> **Generating Upcoming Appointments**

<p style="color:pink">Next, create 150 upcoming patient appointments that are still in the future, so they have no outcomes yet (status = "Scheduled").

<p style="color:pink">This reflects a realistic operational pipeline, where some appointments have already occurred and have known outcomes (Show, No-Show, Cancelled), while others are on the books without final attendance data.

<p style="color:pink">These upcoming appointments will later be included in the dataset to simulate an actual telehealth environment with both historical and future visits.
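<p style="color:pink">Because upcoming rows are built the same way as historical ones and differ only in their status, the row construction could be shared. The sketch below is one possible refactor, not the notebook's code: 'make_appointment' is a hypothetical helper, it builds only a subset of the columns, and it uses the standard library in place of Faker.

```python
import random
import uuid

# Visit-type weights copied from the notebook's visit_type_mix.
visit_type_mix = {
    "New Patient Postpartum": 0.25,
    "New Patient Prenatal": 0.25,
    "Regular Care": 0.30,
    "Follow-up": 0.20,
}

def make_appointment(rng, status=None):
    """Hypothetical helper: build one appointment row (subset of columns)."""
    visit_type = rng.choices(
        list(visit_type_mix.keys()), weights=list(visit_type_mix.values())
    )[0]
    row = {
        # Random version-4 UUID derived from the seeded generator.
        "patient_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "visit_type": visit_type,
        "booking_days_prior": rng.randint(0, 30),  # inclusive on both ends
    }
    if status is not None:
        row["status"] = status  # e.g. "Scheduled" for future visits
    return row

rng = random.Random(0)
historical_rows = [make_appointment(rng) for _ in range(5)]
upcoming_rows = [make_appointment(rng, status="Scheduled") for _ in range(150)]
```

<p style="color:pink">Historical rows leave status unset so an outcome rule can fill it in later, while upcoming rows are created with status already fixed to "Scheduled".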

In [3]:
# Generate 150 upcoming appointments (no outcome yet)
N_upcoming = 150
upcoming_data = []

for _ in range(N_upcoming):
    insurance = random.choices(list(insurance_mix.keys()), weights=insurance_mix.values())[0]
    referral = random.choices(list(referral_mix.keys()), weights=referral_mix.values())[0]
    visit_type = random.choices(list(visit_type_mix.keys()), weights=visit_type_mix.values())[0]
    booking_days_prior = random.randint(0, 30)
    comms = simulate_communication(booking_days_prior)
    past_visits = np.random.randint(0, 6)
    past_no_shows = np.random.choice([0, 1], p=[0.9, 0.1])
    credit_card_on_file = np.random.choice([True, False], p=[0.30, 0.70])
    patient_age = np.random.randint(18, 45)
    baby_age_months = np.random.randint(0, 18) if "Postpartum" in visit_type or "Regular Care" in visit_type else None

    row = {
        "patient_id": fake.uuid4(),
        "patient_name": fake.name_female(),
        "patient_age": patient_age,
        "zip_code": generate_zip(),
        "insurance": insurance,
        "referral_source": referral,
        "visit_type": visit_type,
        "booking_days_prior": booking_days_prior,
        "confirmed_2_days_prior": comms["confirmed_2_days_prior"],
        "ignored_all_outreach": comms["ignored_all_outreach"],
        "past_visits": past_visits,
        "past_no_shows": past_no_shows,
        "credit_card_on_file": credit_card_on_file,
        "baby_age_months": baby_age_months,
        "status": "Scheduled" # Outcome not yet known
    }
    upcoming_data.append(row)

df_upcoming = pd.DataFrame(upcoming_data)

In [4]:
# Append the upcoming appointments to the historical data
df_full = pd.concat([df, df_upcoming], ignore_index=True)

In [5]:
# Confirm the combined data now includes all four appointment statuses
df_full['status'].value_counts()

status
Show         8779
Cancelled     897
No-Show       324
Scheduled     150
Name: count, dtype: int64
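<p style="color:pink">The sketch below illustrates the same combine step on a toy example, then splits the result back into historical and upcoming rows by status, which is likely what a downstream model would need. The frames 'past' and 'future' are small hypothetical stand-ins for 'df' and 'df_upcoming'.

```python
import pandas as pd

# Hypothetical stand-ins for the historical and upcoming frames.
past = pd.DataFrame(
    {"patient_id": ["a", "b", "c"], "status": ["Show", "No-Show", "Cancelled"]}
)
future = pd.DataFrame(
    {"patient_id": ["d", "e"], "status": ["Scheduled", "Scheduled"]}
)

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index.
combined = pd.concat([past, future], ignore_index=True)

# Rows with a known outcome versus rows still waiting for one.
is_upcoming = combined["status"] == "Scheduled"
historical = combined[~is_upcoming]
upcoming = combined[is_upcoming]

print(len(historical), len(upcoming))
```

<p style="color:pink">Without ignore_index=True, both frames would keep their own 0-based index labels and the combined frame would contain duplicate labels.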

## <p style="color:hotpink"> **Adding Patient Communication Data**

<p style="color:pink">We sample 500 appointments from the combined dataset and attach a realistic patient message to each. The pool a message is drawn from depends on the appointment outcome ('status'), to reflect real-world patterns.

<p style="color:pink">This creates a combined structured + unstructured dataset.
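<p style="color:pink">Each generated message is a (text, intent, stress) tuple, and the notebook splits a list of such tuples into three columns with zip(*...). The sketch below shows that unpacking on three tuples taken from the message pools; the variable names are illustrative.

```python
# Three (message, intent, stress) tuples taken from the notebook's message pools.
rows = [
    ("Cancel.", "cancel", "medium"),
    ("RS", "reschedule_request", "low"),
    ("Thank you for checking in!", "gratitude", "low"),
]

# zip(*rows) transposes the list: one tuple per field instead of one per row.
messages, intents, stress_levels = zip(*rows)

print(messages)
print(intents)
print(stress_levels)
```

<p style="color:pink">Each of the three resulting tuples has one entry per sampled row, in the same order, so it can be assigned directly as a DataFrame column.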

In [14]:
# Generate Fake Patient Messages based on appointment status

def generate_message_by_status(status):
    cancel_msgs = [
        ("I can't make it.", "cancel", "medium"),
        ("Cancel.", "cancel", "medium"),
        ("Why was I charged?", "financial_question", "high"),
        ("Is this covered by insurance?", "financial_question", "high"),
    ]
    late_msgs = [
        ("I'm going to be a few minutes late.", "late_notice", "low"),
        ("I'm having trouble joining the link", "late_notice", "high"),
    ]
    reschedule_msgs = [
        ("Can I move my appointment?", "reschedule_request", "medium"),
        ("How do I reschedule?", "reschedule_request", "medium"),
        ("I'm sick, I need to reschedule today.", "reschedule_request", "high"),
        ("RS", "reschedule_request", "low"),
        ("Reschedule", "reschedule_request", "low"),
    ]
    support_msgs = [
        ("How do I complete my forms?", "support_question", "low"),
        ("Thank you for checking in!", "gratitude", "low"),
        ("What time is my appointment?", "confirmation_question", "low"),
        ("Will I be charged if I reschedule my appointment for today?", "support_question", "high"),
        ("Schedule", "schedule", "high"),
    ]
    anxiety_msgs = [
        ("I'm having pain, should I come in sooner?", "concern", "high"),
        ("I'm feeling very stressed about this visit.", "anxiety", "high"),
    ]

    # Choose based on current appointment outcome
    if status == "Cancelled":
        return random.choice(cancel_msgs + reschedule_msgs + anxiety_msgs)
    elif status == "Show":
        return random.choice(late_msgs + support_msgs + anxiety_msgs)
    elif status == "No-Show":
        return random.choice(cancel_msgs + reschedule_msgs + anxiety_msgs)
    else: # Scheduled / Future
        return random.choice(support_msgs + reschedule_msgs)

df_with_messages = df_full.sample(500).copy()

df_with_messages["patient_message"], df_with_messages["message_intent"], df_with_messages["stress_level"] = zip(
    *[generate_message_by_status(status) for status in df_with_messages["status"]]
)

In [15]:
df_with_messages.head(10)

,patient_id,patient_name,patient_age,zip_code,insurance,referral_source,visit_type,booking_days_prior,confirmed_2_days_prior,ignored_all_outreach,past_visits,past_no_shows,credit_card_on_file,baby_age_months,status,patient_message,message_intent,stress_level
5498,a109dbab-67ba-4a4d-bdf0-b75faca8e5a4,Nicole Thomas,24,36486,Major Insurance,DME Referred,New Patient Prenatal,20,True,False,2,0,False,,Show,Thank you for checking in!,gratitude,low
6677,0b5a9b5a-ad28-470e-83d3-508ff20eda5a,Debbie Mccann,35,30572,Major Insurance,Provider Referred,New Patient Prenatal,19,True,False,4,0,False,,Show,I'm feeling very stressed about this visit.,anxiety,high
2753,8eed8d34-f384-4346-945f-4281236603b4,Kelly Anderson,29,80058,Major Insurance,Provider Referred,New Patient Prenatal,17,False,True,0,1,False,,No-Show,Why was I charged?,financial_question,high
2414,a747520f-e315-4b34-a2f5-63b1e8cdea8a,Angela Carter,32,45100,Medicaid,DME Referred,Regular Care,30,False,True,3,0,False,7.0,Show,Schedule,schedule,high
8296,7ed7402f-8d9e-49b2-b864-28de4fbc3631,Karen Ellis,32,43281,Medicaid,Provider Referred,Regular Care,6,True,False,5,0,True,3.0,Show,Thank you for checking in!,gratitude,low
8121,17a86a12-14d2-405b-acf9-b230676f7928,Jocelyn Schultz,27,64709,Major Insurance,Employer Referred,Regular Care,18,False,True,1,0,False,14.0,No-Show,I can't make it.,cancel,medium
9700,82f9e879-3798-4443-a178-2c6cc5e18259,Danielle Flores,29,32581,Major Insurance,Provider Referred,Regular Care,27,True,False,2,1,True,12.0,Show,How do I complete my forms?,support_question,low
5833,86743542-6e7e-4484-a32b-df0d48b16c2a,Sara Washington,44,71209,Major Insurance,DME Referred,Regular Care,6,True,False,4,0,False,12.0,Show,I'm feeling very stressed about this visit.,anxiety,high
10104,7d6be102-88fb-4e9c-9813-7db6fbf2df25,Kayla Brown,36,31587,Major Insurance,Provider Referred,New Patient Prenatal,24,True,False,3,0,False,,Scheduled,RS,reschedule_request,low
2205,057eccfa-15f3-4747-90bb-5ed3c9765bdf,Shannon Brown,18,49747,Major Insurance,Provider Referred,Regular Care,16,True,False,5,0,False,0.0,Show,Schedule,schedule,high


## <p style="color:hotpink"> **Merging Data**

<p style="color:pink">We merge the communication sample back into the full dataset with a left join on 'patient_id', so only about 5% of patients (500 of 10,150) have message data. This simulates a realistic early-stage operational environment.
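<p style="color:pink">The sketch below shows the effect of a left join on a toy example: every patient row is kept, and patients without a message get missing values in the message columns. The frames are hypothetical, and validate="one_to_one" is an optional safeguard added here, not something the notebook uses.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: four patients.
patients = pd.DataFrame(
    {
        "patient_id": ["p1", "p2", "p3", "p4"],
        "status": ["Show", "Cancelled", "Show", "Scheduled"],
    }
)

# Hypothetical stand-in for the message sample: only one patient has a message.
messages = pd.DataFrame(
    {
        "patient_id": ["p2"],
        "patient_message": ["How do I reschedule?"],
        "message_intent": ["reschedule_request"],
        "stress_level": ["medium"],
    }
)

# how="left" keeps all patients; validate raises if patient_id repeats on either side.
merged = patients.merge(messages, on="patient_id", how="left", validate="one_to_one")

# Share of patients that ended up with a message attached.
coverage = merged["patient_message"].notna().mean()
print(f"{coverage:.0%} of patients have a message")
```

<p style="color:pink">The same notna().mean() check on the merged notebook data is a quick way to confirm that message coverage is close to the expected 5%.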

In [16]:
# Join back on patient_id so message data merges in
df_merged = df_full.merge(
    df_with_messages[["patient_id", "patient_message", "message_intent", "stress_level"]],
    on="patient_id",
    how="left" # keep all patients from df_full
)

In [17]:
# Check our new DF to ensure it is merged properly
df_merged.head(10)

,patient_id,patient_name,patient_age,zip_code,insurance,referral_source,visit_type,booking_days_prior,confirmed_2_days_prior,ignored_all_outreach,past_visits,past_no_shows,credit_card_on_file,baby_age_months,status,patient_message,message_intent,stress_level
0,bdd640fb-0667-4ad1-9c80-317fa3b1799d,Courtney Doyle,32,95544,Major Insurance,Provider Referred,New Patient Postpartum,1,False,True,3,0,False,10.0,Show,,,
1,16419f82-8b9d-4434-a465-e150bd9c66b3,Amanda Davis,21,55477,Major Insurance,Provider Referred,New Patient Prenatal,11,False,True,5,0,False,,Show,,,
2,8fadc1a6-06cb-4fb3-9a1d-e644815ef6d1,Marie Gardner,30,32906,Major Insurance,Provider Referred,New Patient Prenatal,17,False,True,2,0,True,,Show,,,
3,cf36d58b-4737-4190-96da-1dac72ff5d2a,Olivia Moore,23,85866,Major Insurance,Provider Referred,New Patient Prenatal,3,True,False,0,0,True,,Show,,,
4,371ecd7b-27cd-4130-8722-9389571aa876,Gabrielle Davis,32,70199,Major Insurance,DME Referred,Regular Care,0,False,True,2,0,False,8.0,Cancelled,,,
5,9a8dca03-580d-4b71-98f5-64135be6128e,Amanda Stevens,19,63512,Major Insurance,DME Referred,New Patient Postpartum,18,True,False,1,0,True,9.0,Show,,,
6,142c3fe8-60e7-4113-ac1b-8ca1f91e1d4c,Sandra Montgomery,25,36768,Major Insurance,DME Referred,Regular Care,15,True,False,2,0,False,11.0,Show,,,
7,b45ed1f0-3139-432c-93cd-59bf5c941cf0,Mary Mejia,33,35167,Major Insurance,Provider Referred,Regular Care,5,False,True,2,0,True,17.0,Show,,,
8,19db3ad0-ddd1-4fb2-bb98-2ef8daf61a26,Jody Flowers,19,46334,Major Insurance,DME Referred,Regular Care,12,True,False,5,0,True,1.0,Show,,,
9,ab9099a4-35a2-40ae-9af3-05535ec42e08,Taylor Wong,20,86248,Major Insurance,Provider Referred,Regular Care,30,True,False,2,0,False,3.0,Show,,,


## <p style="color:hotpink"> **Saving Dataset**

<p style="color:pink">Finally, save the final dataset as 'maternal_telehealth_full_dataset_final.csv' for downstream analysis and predictive modeling.

In [None]:
# Save dataset as a CSV
df_merged.to_csv("maternal_telehealth_full_dataset_final.csv", index=False)
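<p style="color:pink">One thing to watch when reloading the CSV: ZIP codes are generated as strings, and states such as MA have codes that begin with 0. The sketch below, which uses a hypothetical two-row frame and an in-memory buffer instead of a file, shows how a default read turns them into integers and how passing dtype for 'zip_code' keeps them intact.

```python
import io

import pandas as pd

# Hypothetical two-row frame; the first ZIP code has a leading zero.
toy = pd.DataFrame({"patient_id": ["p1", "p2"], "zip_code": ["02139", "95544"]})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing infers an integer column and drops the leading zero.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Forcing a string dtype preserves the original five-character codes.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"zip_code": str})

print(naive["zip_code"].tolist())
print(typed["zip_code"].tolist())
```

<p style="color:pink">The same dtype argument should apply when reading the saved dataset file back in for analysis.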