# Input modelling

This notebook shows a basic workflow for choosing probability distributions.

Here, we already know which distributions to use (as we sampled from them to create our synthetic data), but the steps illustrate how you might select distributions in practice with real data.

There are two possible workflows, depending on whether you want to fit distributions manually or automatically using available packages.

> **Manual workflow**.
> 
> 1. **Identify distributions to test** based on knowledge of the processes being modelled, and inspection of the data (time series and histogram).
> 2. **Determine parameters** for those distributions.
> 3. **Test goodness-of-fit** for each distribution.
> 
> **Automated workflow**.
> 
> 1. **Identify relevant distributions** using the same approach as step 1 of the manual workflow.
> 2. **Use tool** to fit a range of distributions.

It's still important to identify relevant distributions in the automated workflow - even if you'll be testing a whole suite - as these tools:

* Won't notify you of **temporal patterns** (e.g. spikes in service length every Friday).
* May suggest distributions that fit mathematically but are **contextually inappropriate** (e.g. a normal distribution for service times, which can't be negative).
* May overfit, suggesting a complex distribution even if a **simpler one is sufficient**.
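
The second point can be illustrated with a minimal sketch using only the standard library. The sample here is hypothetical (drawn from an exponential distribution with a mean of 10 minutes, which is an assumption for illustration and not the notebook's data). Fitting a normal distribution to it places noticeable probability on negative service times, even though none were observed.

```python
import random
from statistics import NormalDist

# Hypothetical service times in minutes (illustrative only)
rng = random.Random(42)
service_times = [rng.expovariate(1 / 10) for _ in range(1000)]

# Fit a normal distribution by matching the sample mean and standard deviation
fitted = NormalDist.from_samples(service_times)

# Probability that the fitted normal generates an impossible negative time
p_negative = fitted.cdf(0)
print(f"P(service time < 0) under fitted normal: {p_negative:.3f}")
```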

## Set-up

In [1]:
# Import required packages
import pandas as pd
import plotly.express as px

In [2]:
# Import data
data = pd.read_csv("../inputs/NHS_synthetic.csv", dtype={
    "ARRIVAL_TIME": str,
    "SERVICE_TIME": str,
    "DEPARTURE_TIME": str
})

# Preview data
data.head()

| Unnamed: 0 | ARRIVAL_DATE | ARRIVAL_TIME | SERVICE_DATE | SERVICE_TIME | DEPARTURE_DATE | DEPARTURE_TIME |
| - | - | - | - | - | - | - |
| 0 | 2025-01-01 | 1 | 2025-01-01 | 7 | 2025-01-01 | 12 |
| 1 | 2025-01-01 | 2 | 2025-01-01 | 4 | 2025-01-01 | 7 |
| 2 | 2025-01-01 | 3 | 2025-01-01 | 10 | 2025-01-01 | 30 |
| 3 | 2025-01-01 | 7 | 2025-01-01 | 14 | 2025-01-01 | 22 |
| 4 | 2025-01-01 | 10 | 2025-01-01 | 12 | 2025-01-01 | 31 |


Calculate inter-arrival times.

In [3]:
# Combine date/time and convert to datetime
data["arrival_datetime"] = pd.to_datetime(
    data["ARRIVAL_DATE"] + " " + data["ARRIVAL_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)

# Sort by arrival time and calculate inter-arrival times
data = data.sort_values("arrival_datetime")
data["iat_mins"] = (
    data["arrival_datetime"].diff().dt.total_seconds() / 60
)
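
As a small check of what this produces, the sketch below applies the same steps to the five arrival times shown in the preview (1, 2, 3, 7 and 10, which are read as minutes past midnight once zero-padded). The `toy` dataframe is a hypothetical stand-in for `data`. The first row has no previous arrival, so its inter-arrival time is `NaN`.

```python
import pandas as pd

# First five arrivals from the preview, stored as strings like the real data
toy = pd.DataFrame({
    "ARRIVAL_DATE": ["2025-01-01"] * 5,
    "ARRIVAL_TIME": ["1", "2", "3", "7", "10"],
})

# zfill(4) pads "1" to "0001", which "%H%M" reads as 00:01
toy["arrival_datetime"] = pd.to_datetime(
    toy["ARRIVAL_DATE"] + " " + toy["ARRIVAL_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)

# diff() compares each row with the previous one, so row 0 is NaN
toy["iat_mins"] = toy["arrival_datetime"].diff().dt.total_seconds() / 60
print(toy["iat_mins"].tolist())
```

When fitting distributions, the leading `NaN` would need to be excluded, for example with `dropna()`.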

Calculate service times.

In [4]:
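# Combine dates with times and convert to datetime, using an explicit
# format (as for arrivals) so that strings like "0007" parse as 00:07
data["service_datetime"] = pd.to_datetime(
    data["SERVICE_DATE"] + " " + data["SERVICE_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)
data["departure_datetime"] = pd.to_datetime(
    data["DEPARTURE_DATE"] + " " + data["DEPARTURE_TIME"].str.zfill(4),
    format="%Y-%m-%d %H%M"
)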

# Calculate time difference in minutes
time_delta = data["departure_datetime"] - data["service_datetime"]
data["service_mins"] = time_delta / pd.Timedelta(minutes=1)
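
Keeping the date alongside the time matters for consultations that run past midnight. The sketch below uses one hypothetical record (not taken from the dataset) that starts at 23:50 and ends at 00:10 the next day. Because each timestamp carries its own date, the duration comes out as 20 minutes rather than a negative value.

```python
import pandas as pd

# Hypothetical record spanning midnight (illustrative only)
toy = pd.DataFrame({
    "SERVICE_DATE": ["2025-01-01"],
    "SERVICE_TIME": ["2350"],
    "DEPARTURE_DATE": ["2025-01-02"],
    "DEPARTURE_TIME": ["10"],
})

fmt = "%Y-%m-%d %H%M"
start = pd.to_datetime(
    toy["SERVICE_DATE"] + " " + toy["SERVICE_TIME"].str.zfill(4), format=fmt
)
end = pd.to_datetime(
    toy["DEPARTURE_DATE"] + " " + toy["DEPARTURE_TIME"].str.zfill(4), format=fmt
)

# Dividing a timedelta by one minute gives the duration as a float
toy["service_mins"] = (end - start) / pd.Timedelta(minutes=1)
print(toy["service_mins"].iloc[0])
```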

## Both workflows: Identify relevant distributions

First, we consider our **knowledge about the process being modelled**. In this case, we have random arrivals and service times in a queueing model. Inter-arrival times and service times of this kind are often modelled using exponential distributions.

Then, we **inspect the data** in two different ways:

| Plot type | What does it show? | Why do we create this plot? |
| - | - | - |
| **Time series** | Trends, seasonality, and outliers (e.g., spikes or dips over time). | To check for **stationarity** (i.e. no trends or sudden changes). Stationarity is assumed when fitting a single distribution to the whole dataset, so if trends or anomalies exist, we may need to exclude certain periods or model them separately. The time series can also be useful for spotting outliers and data gaps. |
| **Histogram** | The shape of the data's distribution. | Helps **identify which distributions might fit** the data. |
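
The plot remains the main tool for judging stationarity, but a crude numeric check can complement it. The sketch below is only an illustration under stated assumptions: the daily counts are hypothetical, and comparing the means of the two halves of the series is a rough heuristic that would not detect seasonality and is not a formal test.

```python
import random
from statistics import mean

# Hypothetical daily arrival counts for 60 days (illustrative only)
rng = random.Random(1)
daily_counts = [rng.randint(90, 110) for _ in range(60)]

# Split the series in two and compare the mean of each half
half = len(daily_counts) // 2
first_mean = mean(daily_counts[:half])
second_mean = mean(daily_counts[half:])

# Relative change between halves; a large value hints at a trend
relative_change = (second_mean - first_mean) / first_mean
print(f"First half: {first_mean:.1f}, second half: {second_mean:.1f}")
print(f"Relative change: {relative_change:+.1%}")
```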

We create both plots for arrivals and for service times, so we have written two functions to avoid duplicating code.

In [5]:
def inspect_time_series(series, y_lab):
    """
    Plot time-series.

    Parameters
    ----------
    series : pd.Series
        Series containing the time series data (where index is the date).
    y_lab : str
        Y axis label.
    """
    # Label as "Date" and provided y_lab, and convert to dataframe
    df = series.rename_axis("Date").reset_index(name=y_lab)

    # Create plot
    fig = px.line(df, x="Date", y=y_lab)
    fig.update_layout(showlegend=False, width=700, height=400)
    fig.show()


def inspect_histogram(series, x_lab):
    """
    Plot histogram.

    Parameters
    ----------
    series : pd.Series
        Series containing the values to plot as a histogram.
    x_lab : str
        X axis label.
    """
    fig = px.histogram(series)
    fig.update_layout(
        xaxis_title=x_lab, showlegend=False, width=700, height=400
    )
    fig.update_traces(
        hovertemplate=x_lab + ": %{x}<br>Count: %{y}", name=""
    )
    fig.show()

### Arrivals

Daily arrivals - no trends/seasonality/outliers.

In [6]:
# Count arrivals per day and plot time series
inspect_time_series(series=data.groupby(by=["ARRIVAL_DATE"]).size(),
                    y_lab="Number of arrivals")

Distribution of inter-arrival times. Based on this, we would try exponential, gamma and Weibull distributions.

In [7]:
# Plot histogram of inter-arrival times
inspect_histogram(series=data["iat_mins"],
                  x_lab="Inter-arrival time (min)")
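
One simple summary that can help narrow the candidates is the coefficient of variation (standard deviation divided by mean), which equals 1 for an exponential distribution. Gamma and Weibull distributions can give values above or below 1 depending on their shape parameter. The sketch below uses a hypothetical exponential sample with a mean of 4 minutes, not the notebook's data.

```python
import random
from statistics import mean, stdev

# Hypothetical inter-arrival times in minutes (illustrative only)
rng = random.Random(7)
iat = [rng.expovariate(1 / 4) for _ in range(2000)]

# Coefficient of variation: close to 1 is consistent with an exponential
cv = stdev(iat) / mean(iat)
print(f"Coefficient of variation: {cv:.2f}")
```

With the notebook's data, the same calculation could be applied to `data["iat_mins"].dropna()`.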

### Service times

Daily mean service time - no trends/seasonality/outliers.

In [8]:
# Calculate mean service length per day, dropping last day (incomplete)
daily_service = data.groupby("SERVICE_DATE")["service_mins"].mean()
daily_service = daily_service.iloc[:-1]

# Plot time series
inspect_time_series(series=daily_service,
                    y_lab="Mean consultation length (min)")

Distribution of service times. Based on this, we would try exponential, gamma and Weibull distributions.

In [9]:
# Plot histogram of service times
inspect_histogram(series=data["service_mins"],
                  x_lab="Consultation length (min)")

## Manual: Determine parameters
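
This section is not yet filled in. As a minimal sketch of what determining parameters can involve, and not necessarily the method intended here, the example below estimates parameters for two of the candidate distributions using closed-form formulas. For the exponential distribution, the maximum likelihood estimate of the mean is the sample mean. For the gamma distribution, the method of moments (a simpler alternative to maximum likelihood) matches the sample mean and variance. The Weibull distribution has no closed-form estimate for its shape, so it would need a numerical fit. The sample is hypothetical.

```python
import random
from statistics import mean, variance

# Hypothetical service times in minutes (illustrative only)
rng = random.Random(3)
sample = [rng.expovariate(1 / 10) for _ in range(5000)]

m = mean(sample)
v = variance(sample)

# Exponential: the maximum likelihood estimate of the mean is the sample mean
exp_mean = m
exp_rate = 1 / m

# Gamma: method of moments, matching the sample mean and variance
gamma_shape = m ** 2 / v
gamma_scale = v / m

print(f"Exponential: mean={exp_mean:.2f}, rate={exp_rate:.3f}")
print(f"Gamma: shape={gamma_shape:.2f}, scale={gamma_scale:.2f}")
```

A gamma shape close to 1 indicates that the fitted gamma is nearly an exponential, since the exponential is the special case of the gamma with shape 1.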

## Manual: Fit distributions
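
This section is not yet filled in. Step 3 of the manual workflow tests goodness-of-fit for each candidate, and one common option is the Kolmogorov-Smirnov test. The sketch below is an illustration under assumptions rather than the notebook's method: it uses `scipy.stats` (not imported elsewhere in this notebook) and a hypothetical exponential sample, and compares a fitted exponential with a fitted normal distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical service times in minutes (illustrative only)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=10, size=1000)

# Fit each candidate by maximum likelihood
# floc=0 fixes the exponential's location at zero, as times can't be negative
expon_params = stats.expon.fit(sample, floc=0)
norm_params = stats.norm.fit(sample)

# Compare the sample against each fitted distribution
expon_ks = stats.kstest(sample, "expon", args=expon_params)
norm_ks = stats.kstest(sample, "norm", args=norm_params)

print(f"Exponential: D={expon_ks.statistic:.3f}, p={expon_ks.pvalue:.3f}")
print(f"Normal:      D={norm_ks.statistic:.3f}, p={norm_ks.pvalue:.3f}")
```

A smaller statistic `D` indicates a closer fit. Because the parameters are estimated from the same sample being tested, the p-values tend to be optimistic and are best treated as a rough guide.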

## Automated: Use tool to fit distributions
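
This section is not yet filled in, and no specific fitting package is named. As a stand-in that only illustrates the idea, the sketch below loops over the candidate distributions in `scipy.stats`, fits each by maximum likelihood, and ranks them by AIC, which penalises extra parameters and so gives some protection against choosing a needlessly complex distribution. The sample and the choice of AIC as the ranking criterion are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical inter-arrival times in minutes (illustrative only)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=4, size=1000)

# Candidate distributions identified in the earlier inspection step
candidates = {
    "expon": stats.expon,
    "gamma": stats.gamma,
    "weibull_min": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    # Fix the location at zero so only shape/scale parameters are estimated
    params = dist.fit(sample, floc=0)
    log_lik = np.sum(dist.logpdf(sample, *params))
    n_free = len(params) - 1  # location was fixed, so it is not counted
    results[name] = 2 * n_free - 2 * log_lik  # AIC: lower is better

# Rank candidates from lowest to highest AIC
for name, aic in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: AIC={aic:.1f}")
```

Small differences in AIC between candidates are not strong evidence for one over another, so the ranking is best read alongside the histogram and knowledge of the process.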