# Synthetic Wellness Dataset Generator

This notebook generates a **synthetic wellness dataset** that includes both **normal daily patterns** and **edge cases** for testing the robustness of wellness-related models.

## Purpose:
- Simulate realistic health metrics: `Sleep`, `Mood`, and `Steps`
- Create extreme and contradictory scenarios (edge cases) to help test ML models, scoring systems, or rule-based wellness assessments
- Save the final dataset as a CSV for further use in model training or evaluation

The final dataset includes:
- 1800 normal entries (balanced lifestyle)
- 470 edge cases (low sleep, zero steps, contradictory combinations, etc.)


### Step 1: Import Required Libraries

We'll start by importing the necessary libraries: `pandas` and `numpy`.


In [1]:
import pandas as pd
import numpy as np


### Step 2: Set Random Seed

Set the random seed so that repeated runs produce the same dataset. Run this cell before the cells that generate data.


In [3]:
# Set seed for reproducibility
np.random.seed(42)
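
As a minimal sketch of what the seed does (the variable names `first` and `second` are illustrative and not part of the notebook), re-seeding NumPy's global generator with the same value restarts the same random stream:

```python
import numpy as np

# Re-seeding the global generator restarts the same random stream.
np.random.seed(42)
first = np.random.normal(loc=7.5, scale=1.0, size=3)

np.random.seed(42)
second = np.random.normal(loc=7.5, scale=1.0, size=3)

# Same seed, same call sequence, same draws
print(np.array_equal(first, second))
```

Because the notebook relies on this shared global state, the generated data also depends on the order in which the later cells are executed after the seed is set.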


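### Step 3: Define Function to Generate Normal Wellness Data

This function simulates normal cases with:
- Sleep: Normally distributed around 7.5 hours (standard deviation 1.0), clipped to the range 3 to 10 hours
- Mood: Random integers from 4 to 8, inclusive
- Steps: Normally distributed around 8000 (standard deviation 2000), clipped to the range 1000 to 15000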


In [2]:
def generate_normal_data(n):
    sleep = np.random.normal(loc=7.5, scale=1.0, size=n).clip(3, 10)
    mood = np.random.randint(4, 9, size=n)
    steps = np.random.normal(loc=8000, scale=2000, size=n).clip(1000, 15000)
    df = pd.DataFrame({'Sleep': sleep, 'Mood': mood, 'Steps': steps})
    return df.round(3)
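
The two details that are easy to misread in this function are the clipping and the exclusive upper bound of `randint`. The sketch below illustrates both in isolation; `rng_demo` and the other names are hypothetical and used only for this example, with a local generator so the notebook's global seed is left untouched:

```python
import numpy as np

# Hypothetical local generator, used only for this illustration
rng_demo = np.random.RandomState(0)

# Draws outside [3, 10] are pulled back to the nearest bound by clip()
raw_sleep = rng_demo.normal(loc=7.5, scale=1.0, size=10_000)
clipped_sleep = raw_sleep.clip(3, 10)

# randint excludes the upper bound, so (4, 9) yields only 4, 5, 6, 7, 8
mood = rng_demo.randint(4, 9, size=10_000)

print(clipped_sleep.min() >= 3, clipped_sleep.max() <= 10)
print(mood.min(), mood.max())
```

Note that clipping piles any out-of-range draws onto the boundary values rather than redrawing them, which is usually acceptable for synthetic test data.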


### Step 4: Define Function to Generate Edge Case Data

This function creates edge scenarios to test the robustness of models:
- Extremely low or high sleep
- Zero sleep or steps
- Contradictory or unusual combinations
- Ideal and worst-case scenarios


In [3]:

def generate_edge_cases():
    """
    Build a DataFrame of known edge cases to test the robustness of a scoring system or model.
    """
    edge_data = []

    # Very low sleep (< 2.5 hrs)
    for _ in range(50):
        edge_data.append([np.random.uniform(0.5, 2.5), np.random.randint(1, 6), np.random.randint(1000, 8000)])

    # Oversleeping (10.5 to 15 hrs)
    for _ in range(50):
        edge_data.append([np.random.uniform(10.5, 15), np.random.randint(3, 8), np.random.randint(1000, 10000)])

    # Zero sleep
    for _ in range(25):
        edge_data.append([0, np.random.randint(1, 6), np.random.randint(100, 5000)])

    # High mood (9-10) with poor sleep and low steps
    for _ in range(25):
        edge_data.append([np.random.uniform(2, 4), np.random.randint(9, 11), np.random.randint(200, 2000)])

    # Sedentary lifestyle (0 steps)
    for _ in range(50):
        edge_data.append([np.random.uniform(6, 8), np.random.randint(2, 7), 0])

    # Very active (steps >= 16000 and below 25000)
    for _ in range(50):
        edge_data.append([np.random.uniform(6, 9), np.random.randint(6, 11), np.random.randint(16000, 25000)])

    # Perfect scenario (mood 10)
    for _ in range(25):
        edge_data.append([8, 10, 10000])

    # All bad values
    for _ in range(25):
        edge_data.append([0, 1, 0])

    # High sleep + low steps + high mood (mood 10)
    for _ in range(25):
        edge_data.append([np.random.uniform(9, 12), 10, np.random.randint(100, 1000)])

    # High steps + low mood + low sleep
    for _ in range(25):
        edge_data.append([np.random.uniform(2, 4), np.random.randint(1, 3), np.random.randint(12000, 25000)])

    # Extreme moods: each row gets mood 1 or 10, chosen at random
    for _ in range(20):
        mood = np.random.choice([1, 10])  # randomly picks 1 or 10 for this row
        edge_data.append([np.random.uniform(6, 8), mood, np.random.randint(4000, 10000)])

    # High mood (10) despite poor physical metrics
    for _ in range(20):
        edge_data.append([np.random.uniform(2, 4), 10, np.random.randint(500, 2000)])

    # Low mood despite excellent physical metrics
    for _ in range(20):
        edge_data.append([np.random.uniform(7, 9), 1, np.random.randint(9000, 15000)])

    # Neutral mood with varied physical health
    for _ in range(20):
        edge_data.append([np.random.uniform(5, 9), 5, np.random.randint(2000, 15000)])

    # Max mood (10) with extreme physical values
    for _ in range(20):
        edge_data.append([np.random.uniform(12, 15), 10, np.random.randint(20000, 25000)])

    # Additional extreme cases

    # Extreme sleep cases
    for _ in range(10):
        edge_data.append([np.random.uniform(13, 15), np.random.randint(1, 11), np.random.randint(1000, 10000)])

    # Extreme step cases
    for _ in range(10):
        edge_data.append([np.random.uniform(6, 9), np.random.randint(1, 11), np.random.randint(22000, 25000)])

    # Create DataFrame and round to 3 decimals
    edge_df = pd.DataFrame(edge_data, columns=["Sleep", "Mood", "Steps"])
    return edge_df.round(3)
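
The row-building pattern used throughout `generate_edge_cases()` can be tried on a small scale. This is a sketch only: the seed value, the row counts, and the names `rows` and `demo_df` are assumptions chosen for illustration, and only two of the categories are reproduced:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, for this illustration only

rows = []
# Three "zero sleep" rows, built the same way as in generate_edge_cases()
for _ in range(3):
    rows.append([0, np.random.randint(1, 6), np.random.randint(100, 5000)])
# Two "high mood, poor sleep and steps" rows
for _ in range(2):
    rows.append([np.random.uniform(2, 4), np.random.randint(9, 11), np.random.randint(200, 2000)])

# Each inner list becomes one row; column order follows the list order
demo_df = pd.DataFrame(rows, columns=["Sleep", "Mood", "Steps"]).round(3)

print(demo_df.shape)
print(demo_df["Mood"].between(1, 10).all())
```

A range check like the last line is a cheap way to confirm that every category keeps `Mood` on the intended 1 to 10 scale.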

### Step 5: Generate Normal and Edge Case Datasets

We now generate:
- 1800 rows of normal data
- 470 rows of edge case data


In [6]:
normal_df = generate_normal_data(1800)
edge_df = generate_edge_cases()
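
The figure of 470 edge-case rows follows from the loop sizes in `generate_edge_cases()`. As a quick arithmetic check (the list below is copied by hand from the `range(...)` calls, in order of appearance, and would need updating if the function changes):

```python
# Loop sizes copied from generate_edge_cases(), in order of appearance
group_sizes = [50, 50, 25, 25, 50, 50, 25, 25, 25, 25, 20, 20, 20, 20, 20, 10, 10]

edge_total = sum(group_sizes)
normal_total = 1800

print(edge_total)                 # 470
print(normal_total + edge_total)  # 2270
```

In the notebook itself, `len(edge_df)` and `len(normal_df)` give the same numbers directly.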


### Step 6: Combine and Shuffle Data

We combine the normal and edge case data into one dataset and shuffle it.


In [7]:
full_df = pd.concat([normal_df, edge_df], ignore_index=True)
full_df = full_df.sample(frac=1, random_state=42).reset_index(drop=True)
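
The shuffle idiom can be checked on a tiny frame. In this sketch, `demo` and its values are made up for illustration; the point is that `sample(frac=1)` returns every row exactly once in a new order, and `reset_index(drop=True)` replaces the scrambled index with a clean 0-based one:

```python
import pandas as pd

# Made-up rows, for illustration only
demo = pd.DataFrame({"Sleep": [7.0, 0.0, 8.0, 2.5], "Mood": [6, 1, 10, 9]})

shuffled = demo.sample(frac=1, random_state=42).reset_index(drop=True)

# Sorting both frames the same way shows that only the order changed
same_rows = (
    shuffled.sort_values(["Sleep", "Mood"]).reset_index(drop=True)
    .equals(demo.sort_values(["Sleep", "Mood"]).reset_index(drop=True))
)
print(same_rows)
print(list(shuffled.index))
```

Passing `random_state=42` makes the shuffle itself repeatable, independently of the global NumPy seed set earlier.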


### Step 7: Save the Final Dataset

Save the combined dataset to a CSV file for use in model training or analysis.


In [10]:
full_df.to_csv("synthetic_wellness_dataset.csv", index=False)
print("Dataset generated and saved as 'synthetic_wellness_dataset.csv'")

Dataset generated and saved as 'synthetic_wellness_dataset.csv'
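
To confirm that a saved file reads back as expected, a round trip can be sketched as follows. The frame contents, the temporary directory, and the file name `demo_wellness.csv` are hypothetical and chosen so the example does not touch the real output file:

```python
import os
import tempfile

import pandas as pd

# Made-up rows, for illustration only
demo = pd.DataFrame({"Sleep": [7.123, 0.0], "Mood": [6, 1], "Steps": [8000.5, 0.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_wellness.csv")  # hypothetical file name
    demo.to_csv(path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape == demo.shape)
```

The same check applied to `synthetic_wellness_dataset.csv` would be expected to show the three columns `Sleep`, `Mood`, and `Steps`.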
