# Synthetic data generation

This notebook creates a synthetic dataset for use in the discrete-event simulation templates.

The structure of the dataset is based on the data dictionary from the [Emergency Care Data Set (ECDS)], so the fields and formats loosely mirror real NHS data. The content itself is entirely artificial and describes a simplified, non-specific scenario that we will call nurse consultations.

The values are reverse-engineered to suit an M/M/s queueing model (i.e., Poisson arrivals, exponential service times, and s servers).

**Why did we make this data?** It allows us to illustrate a typical workflow: starting from raw data, processing it, and extracting parameters for use in the simulation model.

In [1]:
import numpy as np
import pandas as pd

from sim_tools.distributions import Poisson, Exponential

Parameters.

In [2]:
# Number of hours to generate (one year)
hours = 365*24

# Mean arrival rate (patients per hour)
arrival_rate = 15

# Mean wait time (in minutes)
wait_time_mean = 5

# Mean service time (in minutes)
service_time_mean = 10

# Random seeds
seeds = [101, 202, 303]
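
As a rough sanity check on these parameters, the offered load of the implied M/M/s queue can be worked out directly. The sketch below repeats the two relevant values so that it runs on its own; `service_rate`, `offered_load` and `min_servers` are illustrative names that are not part of the notebook.

```python
import math

arrival_rate = 15        # patients per hour (same value as above)
service_time_mean = 10   # minutes (same value as above)

# Service rate of a single server, in patients per hour
service_rate = 60 / service_time_mean

# Offered load in Erlangs: the average number of servers kept busy
offered_load = arrival_rate / service_rate

# A stable queue needs strictly more servers than the offered load
min_servers = math.floor(offered_load) + 1

print(f"Offered load: {offered_load} Erlangs, minimum servers: {min_servers}")
```

With these values the offered load is 2.5, so a simulation with fewer than three servers would see its queue grow without bound.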

Generate arrival times.

In [3]:
# Sample patients per hour
arrival_dist = Poisson(rate=arrival_rate, random_seed=seeds[0])
hourly_counts = arrival_dist.sample(size=hours)

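# Generate precise arrival timestamps (in seconds since the start of the year).
# The offsets come from a seeded generator so that the dataset is reproducible;
# seeding it with the full seeds list gives a stream distinct from the three
# distributions, which each use a single seed.
rng = np.random.default_rng(seeds)
arrival_timestamps = []
for hour, count in enumerate(hourly_counts):
    if count > 0:
        # Spread arrivals uniformly at random within the hour, in time order
        offsets = np.sort(rng.uniform(0, 3600, count))  # seconds
        arrival_timestamps.extend(hour * 3600 + offsets)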

# Convert to datetime format (starting from 2025-01-01)
base_date = np.datetime64('2025-01-01')
arrival_dates = base_date + np.array(arrival_timestamps).astype('timedelta64[s]')
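
Drawing a Poisson count for each hour and then placing those arrivals uniformly within the hour is one standard way to simulate a Poisson process, so the gaps between consecutive arrivals should look exponential with a mean of 3600 / 15 = 240 seconds. The following is a minimal check of that property, using NumPy's own generator as a stand-in for `sim_tools`. The seed, the shorter horizon and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # illustrative seed, not one of the notebook's
rate_per_hour = 15
n_hours = 2000                    # shorter horizon than the notebook, for speed

# Poisson count per hour, then sorted uniform offsets within each hour
counts = rng.poisson(rate_per_hour, size=n_hours)
timestamps = np.concatenate([
    hour * 3600 + np.sort(rng.uniform(0, 3600, count))
    for hour, count in enumerate(counts)
])

# Gaps between consecutive arrivals, in seconds
gaps = np.diff(timestamps)
mean_gap = gaps.mean()
cv = gaps.std() / mean_gap        # exponential gaps have a CV of 1

print(f"Mean gap: {mean_gap:.1f} s, coefficient of variation: {cv:.2f}")
```

A mean gap near 240 seconds and a coefficient of variation near 1 are both consistent with exponential inter-arrival times.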

In [4]:
# Find total patients
total_patients = hourly_counts.sum()
print(f"Total patients: {total_patients}")

Total patients: 131916
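
This total is close to what the parameters predict. A sum of independent Poisson counts is itself Poisson, so over 8,760 hours at 15 patients per hour the expected total is 131,400. A quick check of how far the observed total sits from that expectation, with the observed value copied from the printed output (variable names are illustrative):

```python
import math

hours = 365 * 24
arrival_rate = 15
observed_total = 131916   # copied from the printed output

# Mean of the summed Poisson counts
expected_total = hours * arrival_rate

# A Poisson variable's variance equals its mean
std_total = math.sqrt(expected_total)

# Distance of the observed total from expectation, in standard deviations
z_score = (observed_total - expected_total) / std_total

print(f"Expected: {expected_total}, z-score of observed total: {z_score:.2f}")
```

A z-score of roughly 1.4 is well within the range that random variation alone would produce.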


Generate wait times and service durations.

In [5]:
# Generate wait times
wait_dist = Exponential(mean=wait_time_mean, random_seed=seeds[1])
wait_times = np.round(wait_dist.sample(size=total_patients), 1)

# Generate service durations
server_dist = Exponential(mean=service_time_mean, random_seed=seeds[2])
service_durations = np.round(server_dist.sample(size=total_patients), 1)

Calculate the time each patient is first seen and the time they depart.

In [6]:
time_first_seen = [arr + pd.Timedelta(minutes=wt) 
                   for arr, wt in zip(arrival_dates, wait_times)]

departure_times = [first_seen + pd.Timedelta(minutes=svc)
                   for first_seen, svc in zip(time_first_seen, service_durations)]
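
The list comprehensions above handle one patient at a time, which can be slow for more than 130,000 rows. A vectorised alternative is sketched below on two made-up patients, using `pd.to_timedelta` to convert a whole array of minutes at once. The sample values and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Two made-up patients: arrival timestamps, waits and service times (minutes)
arrivals = pd.to_datetime(["2025-01-01 00:01:00", "2025-01-01 00:02:30"])
wait_minutes = np.array([5.5, 1.5])
service_minutes = np.array([10.0, 3.5])

# Add whole arrays of durations in a single step
first_seen = arrivals + pd.to_timedelta(wait_minutes, unit="min")
departures = first_seen + pd.to_timedelta(service_minutes, unit="min")

print(first_seen)
print(departures)
```

Both results are `DatetimeIndex` objects, so they can be formatted with `.strftime` in the same way as the arrival times.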

Create the dataframe and save it to CSV.

In [7]:
# Convert each to a pandas DatetimeIndex (which supports .strftime directly)
arrival_series = pd.to_datetime(arrival_dates)
service_series = pd.to_datetime(time_first_seen)
departure_series = pd.to_datetime(departure_times)

# Create dataframe
df = pd.DataFrame({
    "ARRIVAL_DATE": arrival_series.strftime("%Y-%m-%d"),
    "ARRIVAL_TIME": arrival_series.strftime("%H%M"),
    "SERVICE_DATE": service_series.strftime("%Y-%m-%d"),
    "SERVICE_TIME": service_series.strftime("%H%M"),
    "DEPARTURE_DATE": departure_series.strftime("%Y-%m-%d"),
    "DEPARTURE_TIME": departure_series.strftime("%H%M"),
})
df.head()

| | ARRIVAL_DATE | ARRIVAL_TIME | SERVICE_DATE | SERVICE_TIME | DEPARTURE_DATE | DEPARTURE_TIME |
|---|---|---|---|---|---|---|
| 0 | 2025-01-01 | 0001 | 2025-01-01 | 0007 | 2025-01-01 | 0012 |
| 1 | 2025-01-01 | 0002 | 2025-01-01 | 0004 | 2025-01-01 | 0007 |
| 2 | 2025-01-01 | 0003 | 2025-01-01 | 0010 | 2025-01-01 | 0030 |
| 3 | 2025-01-01 | 0007 | 2025-01-01 | 0014 | 2025-01-01 | 0022 |
| 4 | 2025-01-01 | 0010 | 2025-01-01 | 0012 | 2025-01-01 | 0031 |


In [8]:
df.to_csv("../NHS_synthetic.csv", index=False)
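
One thing to watch when loading this file later: the `%H%M` format writes times as zero-padded strings such as `0007`, and `pd.read_csv` will parse those as integers (dropping the leading zeros) unless told otherwise. The sketch below reads a small in-memory sample in the same layout with `dtype=str`, then recombines the date and time columns to recover waits and service durations in minutes. The two sample rows and the `combine` helper are hypothetical, and because times are stored to the minute, recovered durations are only accurate to the minute.

```python
import io
import pandas as pd

# Two sample rows in the same layout as the saved file; a real run would
# pass the path of the CSV to pd.read_csv instead of this in-memory text
csv_text = """ARRIVAL_DATE,ARRIVAL_TIME,SERVICE_DATE,SERVICE_TIME,DEPARTURE_DATE,DEPARTURE_TIME
2025-01-01,0001,2025-01-01,0007,2025-01-01,0012
2025-01-01,0002,2025-01-01,0004,2025-01-01,0007
"""

# dtype=str keeps the leading zeros in the HHMM columns
data = pd.read_csv(io.StringIO(csv_text), dtype=str)


def combine(date_col, time_col):
    """Join a date column and an HHMM column into full timestamps."""
    return pd.to_datetime(
        data[date_col] + " " + data[time_col], format="%Y-%m-%d %H%M"
    )


arrival = combine("ARRIVAL_DATE", "ARRIVAL_TIME")
service = combine("SERVICE_DATE", "SERVICE_TIME")
departure = combine("DEPARTURE_DATE", "DEPARTURE_TIME")

# Durations in minutes
wait_minutes = (service - arrival).dt.total_seconds() / 60
service_minutes = (departure - service).dt.total_seconds() / 60

print(wait_minutes.tolist(), service_minutes.tolist())
```

Averaging `wait_minutes` and `service_minutes` over the full file is one way to estimate the parameters that a simulation model would then use.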