# **Analysis and Forecasting of ADHD Assessment Demand and Service Strain in England (2019–2024)**

## Project Narrative

This project analyses historical ADHD referral data from NHS England to evaluate statistically significant growth trends, assess indicators of service strain, and develop an interpretable predictive model to forecast future demand for ADHD assessments.

Recent NHS management information shows substantial increases in ADHD referrals and in the number of individuals awaiting assessment, highlighting growing pressure on mental health services (NHS Digital, 2025). National clinical guidance emphasises the importance of timely diagnosis and access to evidence-based treatment for individuals with ADHD, underscoring the need for effective service planning and capacity management (NICE, 2018).

By applying statistical analysis and supervised machine learning techniques to official NHS data, this project aims to generate evidence-based insights that may support healthcare commissioners and policy stakeholders in understanding demand trajectories and anticipating future service needs.


## Project Motivation

Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental condition characterised by persistent patterns of inattention, hyperactivity, and impulsivity, which can significantly affect educational, occupational, and social functioning (NICE, 2018). Prevalence estimates suggest that approximately 3–5% of the population may meet diagnostic criteria for ADHD, indicating substantial potential demand for assessment services (UK Parliament Commons Library, 2025).

Recent NHS publications report increasing numbers of individuals awaiting ADHD assessments and rising referral volumes year-on-year, raising concerns regarding service capacity and long waiting times (NHS Digital, 2025). The Independent ADHD Taskforce has further highlighted systemic pressures within diagnostic pathways and the need for improved planning and resource allocation (NHS England, 2024).

Given these developments, quantitative analysis of referral trends and waiting indicators is essential to determine whether observed increases are statistically significant and whether current trajectories suggest sustained or escalating service strain. This project seeks to contribute to that understanding through structured statistical analysis and predictive modelling.


## Project Objectives

**Primary Objective**

To analyse historical ADHD referral demand in England, evaluate service strain indicators, and develop a predictive model to forecast future assessment demand.

**Secondary Objectives**

- Identify whether ADHD referral volumes have increased significantly over time.
- Compare referral levels and growth trends across age groups.
- Analyse long-wait referral categories as indicators of system strain.
- Develop and evaluate a supervised machine learning model to forecast referral demand.
- Design an interactive dashboard to communicate insights clearly to healthcare stakeholders.


## Research Questions

**Research question 1:** Has ADHD referral demand increased significantly over time in England between April 2019 and February 2024?

**Research question 2:** Do mean open ADHD referral counts differ significantly across age groups in England?

**Research question 3:** Has the proportion of ADHD referrals waiting more than 52 weeks increased significantly over time?

**Research question 4:** Can future ADHD referral demand be predicted with acceptable forecasting accuracy using supervised machine learning?


## Hypothesis Testing Framework

Significance level for all statistical tests:  
α = 0.05

### Hypothesis 1 – Growth Trend in Referrals  
H₁: There is a statistically significant positive relationship between reporting month and open ADHD referral counts.  
H₀: There is no statistically significant relationship between reporting month and open ADHD referral counts.

### Hypothesis 2 – Age Group Differences  
H₁: There is a statistically significant difference in mean open ADHD referral counts across age groups.  
H₀: There is no statistically significant difference in mean open ADHD referral counts across age groups.

### Hypothesis 3 – Long Wait Service Strain  
H₁: There is a statistically significant positive relationship between reporting month and the proportion of ADHD referrals waiting more than 52 weeks.  
H₀: There is no statistically significant relationship between reporting month and the proportion of ADHD referrals waiting more than 52 weeks.
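
The three hypotheses above could be tested along the following lines. This is a minimal sketch on synthetic numbers, not the project's results: the toy series, the variable names, and the choice of `scipy.stats.linregress` (H1 and H3) and `scipy.stats.f_oneway` (H2) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated above

# Synthetic monthly counts with an upward trend (illustrative only)
rng = np.random.default_rng(0)
months = np.arange(24)  # month index 0..23 used as the time predictor
counts = 1000 + 50 * months + rng.normal(0, 20, size=months.size)

# H1 / H3: simple linear regression of the outcome on the month index.
# linregress reports a two-sided p-value, so the slope sign is checked separately.
trend = stats.linregress(months, counts)
reject_h0_trend = bool(trend.pvalue < ALPHA and trend.slope > 0)

# H2: one-way ANOVA comparing mean counts across three synthetic groups
group_a = rng.normal(100, 10, size=24)
group_b = rng.normal(150, 10, size=24)
group_c = rng.normal(300, 10, size=24)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
reject_h0_groups = bool(p_anova < ALPHA)
```

For H3 the outcome would be the 52+ week proportion rather than a count. Monthly observations are usually autocorrelated, so p-values from an ordinary regression or ANOVA on such data may be optimistic and are best read with that in mind.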


## Analytical Workflow

- Load the raw NHS ADHD referral dataset
- Clean and standardise variables
- Handle suppressed values appropriately
- Convert date variables to datetime format
- Construct structured time-series datasets
- Perform descriptive statistical analysis
- Conduct hypothesis testing (Regression & ANOVA)


## Inputs

- Raw dataset: data/raw/MHSDS_historic.csv
- Python libraries: pandas, numpy, matplotlib, seaborn, scipy, statsmodels
- NHS ADHD indicator definitions (data dictionary)


## Outputs

- Cleaned and standardised dataframe
- Structured monthly time-series datasets:
    - total_ts (total referrals)
    - age_ts (age-group monthly referrals)
    - strain_ts (52+ week proportion)
- Descriptive statistics and visualisations
- Statistical test results:
    - Linear regression outputs (H1 & H3)
    - ANOVA results (H2)


## Additional Comments

This notebook establishes the analytical foundation for the project by preparing structured datasets and conducting statistical testing prior to predictive modelling. 



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed before running the notebook in the editor.

The working directory needs to move from its current folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Shazia Mujahid\\Documents\\adhd-nhs-demand\\adhd-demand-forecast-england\\jupyter_notebooks'

Make the parent of the current directory the new current directory.
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Shazia Mujahid\\Documents\\adhd-nhs-demand\\adhd-demand-forecast-england'
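
Re-running the `os.chdir` cell moves the working directory up one more level each time. A guarded variant, sketched below, only moves up when the current folder is the notebooks folder. The folder name `jupyter_notebooks` is taken from the path shown above; the helper name `resolve_project_root` is hypothetical.

```python
from pathlib import Path

NOTEBOOK_DIR_NAME = "jupyter_notebooks"  # folder name taken from the output above


def resolve_project_root(cwd: Path) -> Path:
    """Return the parent if cwd is the notebooks folder, otherwise cwd unchanged."""
    return cwd.parent if cwd.name == NOTEBOOK_DIR_NAME else cwd


# Usage inside the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(resolve_project_root(Path.cwd()))
```

With this guard the cell can be executed repeatedly without leaving the project folder.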

# Section 1. Data Loading & Initial Inspection

In this section, the raw NHS ADHD referral dataset is loaded and inspected to understand its structure, column names, data types, and overall completeness before performing any cleaning operations.

In [4]:
import pandas as pd

# Load dataset
df = pd.read_csv("data/raw/MHSDS_historic.csv")

# Preview first 5 rows
df.head()

Unnamed: 0,REPORTING_PERIOD_START_DATE,REPORTING_PERIOD_END_DATE,INDICATOR_ID,AGE_GROUP,VALUE
0,01/02/2024,29/02/2024,ADHD003,0 to 4,1695
1,01/02/2024,29/02/2024,ADHD003,18 to 24,57760
2,01/02/2024,29/02/2024,ADHD003,25+,129075
3,01/02/2024,29/02/2024,ADHD003,5 to 17,105355
4,01/02/2024,29/02/2024,ADHD003,Unknown,15


In [5]:
# Inspect structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609 entries, 0 to 5608
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   REPORTING_PERIOD_START_DATE  5609 non-null   object
 1   REPORTING_PERIOD_END_DATE    5609 non-null   object
 2   INDICATOR_ID                 5609 non-null   object
 3   AGE_GROUP                    5609 non-null   object
 4   VALUE                        5609 non-null   object
dtypes: object(5)
memory usage: 219.2+ KB


In [6]:
# Summary statistics
df.describe()

Unnamed: 0,REPORTING_PERIOD_START_DATE,REPORTING_PERIOD_END_DATE,INDICATOR_ID,AGE_GROUP,VALUE
count,5609,5609,5609,5609,5609
unique,59,59,21,5,2137
top,01/09/2021,30/09/2021,ADHD003,18 to 24,*
freq,102,102,295,1217,610


### Initial Observations

- The dataset contains 5,609 rows and 5 columns.
- No null values are present at load time.
- All variables are stored as object (string) types.
- The VALUE column contains 610 suppressed entries marked with "*" (the most frequent value in the summary above), which must be handled before numeric analysis.
- Date columns require conversion to datetime format.
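
The dates in the preview are written day-first (for example `29/02/2024`), so the conversion benefits from an explicit format. A small sketch of the difference, using one string copied from the preview above:

```python
import pandas as pd

start_raw = "01/02/2024"  # 1 February 2024 in the source file

# Without a format, pandas reads an ambiguous string month-first
month_first = pd.to_datetime(start_raw)

# With an explicit day-first format the intended date is recovered
day_first = pd.to_datetime(start_raw, format="%d/%m/%Y")

print(month_first.date(), day_first.date())
```

Because every start date falls on the first of a month, a month-first reading would place all start dates in January without raising an error, which makes the problem easy to miss.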

---

# Section 2. Data Cleaning & Type Conversion

In this section, data types are corrected, suppressed values are handled, and the dataset is prepared for time-series modelling.

In [7]:
# Convert date columns to datetime (source dates are day-first, e.g. 29/02/2024)
df["REPORTING_PERIOD_START_DATE"] = pd.to_datetime(df["REPORTING_PERIOD_START_DATE"], format="%d/%m/%Y")
df["REPORTING_PERIOD_END_DATE"] = pd.to_datetime(df["REPORTING_PERIOD_END_DATE"], format="%d/%m/%Y")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609 entries, 0 to 5608
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   REPORTING_PERIOD_START_DATE  5609 non-null   datetime64[ns]
 1   REPORTING_PERIOD_END_DATE    5609 non-null   datetime64[ns]
 2   INDICATOR_ID                 5609 non-null   object        
 3   AGE_GROUP                    5609 non-null   object        
 4   VALUE                        5609 non-null   object        
dtypes: datetime64[ns](2), object(3)
memory usage: 219.2+ KB




In [8]:
# Count suppressed values
(df["VALUE"] == "*").sum()

610

In [9]:
import numpy as np

# Replace suppressed values with NaN
df["VALUE"] = df["VALUE"].replace("*", np.nan)

# Convert VALUE to numeric
df["VALUE"] = pd.to_numeric(df["VALUE"])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609 entries, 0 to 5608
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   REPORTING_PERIOD_START_DATE  5609 non-null   datetime64[ns]
 1   REPORTING_PERIOD_END_DATE    5609 non-null   datetime64[ns]
 2   INDICATOR_ID                 5609 non-null   object        
 3   AGE_GROUP                    5609 non-null   object        
 4   VALUE                        4999 non-null   float64       
dtypes: datetime64[ns](2), float64(1), object(2)
memory usage: 219.2+ KB


### Handling Suppressed Values

The VALUE column contained 610 suppressed entries marked with "*", representing small counts withheld for confidentiality reasons in NHS reporting.

These entries were replaced with NaN and excluded from numeric modelling to ensure analytical validity and avoid introducing bias through incorrect imputation.
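
One caveat worth checking: `groupby(...).sum()` skips NaN by default, so a group whose values are all suppressed sums to 0 rather than NaN. The sketch below uses a made-up frame (the column names and numbers are not from the NHS data) to show the default behaviour next to the `min_count=1` option, which keeps such groups as NaN.

```python
import numpy as np
import pandas as pd

# Toy data: the second month contains only a suppressed (NaN) value
toy = pd.DataFrame({
    "month": ["2019-04", "2019-04", "2019-05"],
    "count": [10.0, np.nan, np.nan],
})

# Default: NaN is skipped, so an all-NaN group sums to 0.0
default_sum = toy.groupby("month")["count"].sum()

# min_count=1: a group needs at least one non-NaN value, otherwise the sum is NaN
strict_sum = toy.groupby("month")["count"].sum(min_count=1)
```

Whether a silent 0 or an explicit NaN is preferable depends on how suppressed cells should be interpreted, which is a modelling choice rather than something this sketch settles.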

In [10]:
# Identify all indicator codes present to ensure the dataset contains ADHD003 (referral measure of interest)
df["INDICATOR_ID"].unique()

array(['ADHD003', 'ADHD003a', 'ADHD003b', 'ADHD003c', 'ADHD003d',
       'ADHD004', 'ADHD004a', 'ADHD004b', 'ADHD004c', 'ADHD004d',
       'ADHD005', 'ADHD005a', 'ADHD005b', 'ADHD005c', 'ADHD005d',
       'ADHD006', 'ADHD006a', 'ADHD006b', 'ADHD006c', 'ADHD006d',
       'ADHD007'], dtype=object)

In [11]:
# Filter dataset to include only ADHD003 (total referral measure)
df = df[df["INDICATOR_ID"] == "ADHD003"].copy()

# Confirm filtering worked
df["INDICATOR_ID"].unique()

array(['ADHD003'], dtype=object)

In [12]:
# Rename columns for clarity and modelling
df = df.rename(columns={
    "REPORTING_PERIOD_START_DATE": "period_start",
    "REPORTING_PERIOD_END_DATE": "period_end",
    "AGE_GROUP": "age_group",
    "VALUE": "value"
})

In [None]:
# Confirm structure after filtering
df.head()

Unnamed: 0,period_start,period_end,INDICATOR_ID,age_group,value
0,2024-02-01,2024-02-29,ADHD003,0 to 4,1695.0
1,2024-02-01,2024-02-29,ADHD003,18 to 24,57760.0
2,2024-02-01,2024-02-29,ADHD003,25+,129075.0
3,2024-02-01,2024-02-29,ADHD003,5 to 17,105355.0
4,2024-02-01,2024-02-29,ADHD003,Unknown,15.0


---

# Section 3. Feature Engineering & Time Series Structuring

In this section, the cleaned ADHD003 dataset is transformed into structured time-series datasets suitable for statistical testing and forecasting.

In [14]:
# Create total monthly referrals (all age groups combined)
total_ts = (
    df.groupby("period_end", as_index=False)["value"]
      .sum()
      .sort_values("period_end")
)

total_ts.head()

Unnamed: 0,period_end,value
0,2019-04-30,30730.0
1,2019-05-31,33245.0
2,2019-06-30,36060.0
3,2019-07-31,39690.0
4,2019-08-31,39815.0
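
Before treating `total_ts` as a regular monthly series, it may be worth confirming that no reporting month is missing. The sketch below uses a small stand-in frame with invented values and the same `period_end` column; the helper name `missing_months` is hypothetical.

```python
import pandas as pd


def missing_months(period_end: pd.Series) -> pd.PeriodIndex:
    """Return calendar months between the first and last date that have no row."""
    observed = pd.PeriodIndex(period_end.dt.to_period("M").unique())
    expected = pd.period_range(observed.min(), observed.max(), freq="M")
    return expected.difference(observed)


# Stand-in data with June 2019 deliberately left out
demo = pd.DataFrame({
    "period_end": pd.to_datetime(["2019-04-30", "2019-05-31", "2019-07-31"]),
    "value": [1.0, 2.0, 3.0],  # made-up numbers
})
gaps = missing_months(demo["period_end"])
```

In the notebook the call would be `missing_months(total_ts["period_end"])`; an empty result indicates a gap-free monthly series. April 2019 to February 2024 spans 59 calendar months, which is consistent with the 59 unique reporting periods in the summary statistics.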


In [15]:
# Create age-group monthly dataset
age_ts = (
    df.groupby(["period_end", "age_group"], as_index=False)["value"]
      .sum()
      .sort_values(["period_end", "age_group"])
)

age_ts.head()

Unnamed: 0,period_end,age_group,value
0,2019-04-30,0 to 4,595.0
1,2019-04-30,18 to 24,5065.0
2,2019-04-30,25+,9435.0
3,2019-04-30,5 to 17,15615.0
4,2019-04-30,Unknown,20.0
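
For the age-group comparison (H2), the long-format `age_ts` can be reshaped so that each age group becomes a column. The sketch below works on a stand-in frame with the same column names; the values are invented.

```python
import pandas as pd

# Stand-in for age_ts: two months, two age groups, made-up counts
demo_age = pd.DataFrame({
    "period_end": pd.to_datetime(["2019-04-30"] * 2 + ["2019-05-31"] * 2),
    "age_group": ["18 to 24", "25+", "18 to 24", "25+"],
    "value": [5.0, 9.0, 6.0, 11.0],
})

# One row per month, one column per age group
wide = demo_age.pivot(index="period_end", columns="age_group", values="value")

# One array of monthly counts per age group, with suppressed (NaN) months dropped
samples = [wide[col].dropna().to_numpy() for col in wide.columns]
```

The resulting list can be passed to a one-way ANOVA, for example `scipy.stats.f_oneway(*samples)`.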


In [18]:
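# Filter dataset to include only referrals open for 52+ weeks
# ADHD003c = 52–104 weeks
# ADHD003d = 104+ weeks
# These combined represent long-wait referrals (>52 weeks)
# Note: df now holds only ADHD003 rows, so the sub-indicators are re-read from the raw file
raw = pd.read_csv("data/raw/MHSDS_historic.csv")
raw["period_end"] = pd.to_datetime(raw["REPORTING_PERIOD_END_DATE"], format="%d/%m/%Y")
raw["value"] = pd.to_numeric(raw["VALUE"].replace("*", np.nan))
long_wait_df = raw[raw["INDICATOR_ID"].isin(["ADHD003c", "ADHD003d"])].copy()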

In [19]:
# Aggregate long-wait referrals to monthly national totals
# This creates a structured time-series dataset for RQ3 analysis
long_wait_ts = (
    long_wait_df.groupby("period_end", as_index=False)["value"]
    .sum()
    .sort_values("period_end")
)

In [20]:
# Merge total referrals with long-wait totals
# This allows calculation of the proportion of referrals waiting >52 weeks
strain_ts = total_ts.merge(
    long_wait_ts,
    on="period_end",
    how="left",
    suffixes=("_total", "_52plus")
)

In [21]:
# Treat months with no long-wait rows after the merge as 0 (assumes absence means no long waits, not unreported data)
strain_ts["value_52plus"] = strain_ts["value_52plus"].fillna(0)

# Calculate proportion of referrals waiting >52 weeks
# This standardises long waits relative to total referral demand
strain_ts["prop_52plus"] = (
    strain_ts["value_52plus"] / strain_ts["value_total"]
)
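
A quick check on the result may be useful: a proportion should lie between 0 and 1, and a column of all zeros would suggest that the long-wait rows never reached the merge. The sketch below uses made-up totals; `check_proportion` is a hypothetical helper.

```python
import pandas as pd


def check_proportion(prop: pd.Series) -> dict:
    """Summarise basic validity checks for a proportion column."""
    valid = prop.dropna()
    return {
        "in_range": bool(valid.between(0, 1).all()),  # every value within [0, 1]
        "all_zero": bool((valid == 0).all()),         # True hints at an empty long-wait input
        "n_missing": int(prop.isna().sum()),          # months with no computable proportion
    }


# Stand-in for strain_ts with invented counts
demo_strain = pd.DataFrame({
    "value_total": [100.0, 200.0, 400.0],
    "value_52plus": [10.0, 30.0, 100.0],
})
demo_strain["prop_52plus"] = demo_strain["value_52plus"] / demo_strain["value_total"]
report = check_proportion(demo_strain["prop_52plus"])
```

In the notebook the call would be `check_proportion(strain_ts["prop_52plus"])`, run before the column is used for the H3 regression.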

---