# NHS A&E Data Validation & Summary Analysis

This notebook runs a quick data validation and exploratory analysis of the NHS A&E dataset. I will:
- Check whether records are evenly distributed across years and months.
- Identify any outliers or unexpected trends.
- Summarize key statistics for numerical and categorical variables.


In [2]:
import pandas as pd

# Load the dataset
file_path = "nhs_ae_merged_with_synthetic_data.csv"
nhs_data = pd.read_csv(file_path)

# Display the first few rows
nhs_data.head()


Unnamed: 0,period,org_code,parent_org,org_name,a&e_attendances_type_1,a&e_attendances_type_2,a&e_attendances_other_a&e_department,a&e_attendances_booked_appointments_type_1,a&e_attendances_booked_appointments_type_2,a&e_attendances_booked_appointments_other_department,...,attendances_over_4hrs_booked_appointments_other_department,patients_who_have_waited_4-12_hs_from_dta_to_admission,patients_who_have_waited_12+_hrs_from_dta_to_admission,emergency_admissions_via_a&e_-_type_1,emergency_admissions_via_a&e_-_type_2,emergency_admissions_via_a&e_-_other_a&e_department,other_emergency_admissions,month,year,percentage_seen_within_4_hours
0,MSitAE-APRIL-2024,AAH,NHS ENGLAND SOUTH WEST,TETBURY HOSPITAL TRUST LTD,0.0,0.0,546.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,April,2024,
1,MSitAE-APRIL-2024,RAN,NHS ENGLAND LONDON,ROYAL NATIONAL ORTHOPAEDIC HOSPITAL NHS TRUST,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,41.0,April,2024,
2,MSitAE-APRIL-2024,8J094,NHS ENGLAND MIDLANDS,BADGER LTD,0.0,0.0,0.0,0.0,0.0,2078.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,April,2024,
3,MSitAE-APRIL-2024,AD913,NHS ENGLAND LONDON,BECKENHAM BEACON UCC,0.0,0.0,3694.0,0.0,0.0,104.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,April,2024,
4,MSitAE-APRIL-2024,AQN04,NHS ENGLAND SOUTH EAST,PHL LYMINGTON UTC,0.0,0.0,2897.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,April,2024,
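
The first column header in the preview is `Unnamed: 0`, the name pandas gives to a column whose header cell is empty. This usually means a row index was written out when the file was saved, although that is an assumption about how this CSV was produced. The sketch below uses a small in-memory CSV (built from two rows of the preview, not the real file) to show how `index_col=0` reads that column as the index instead.

```python
import io

import pandas as pd

# Small in-memory stand-in for the real file; the empty first header cell
# mimics a CSV that was saved together with its row index.
csv_text = (
    ",period,org_code\n"
    "0,MSitAE-APRIL-2024,AAH\n"
    "1,MSitAE-APRIL-2024,RAN\n"
)

# Default load: the header-less column is named "Unnamed: 0".
default_load = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead of as data.
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_load.columns))
print(list(indexed_load.columns))
```

If the column really is a leftover index, the same `index_col=0` argument could be passed when loading `file_path`.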


## Checking Data Balance

To check whether the dataset is balanced, I will:
- Count the number of records for each year.
- Count the number of records for each month.


In [3]:
# Count records per year
print("Records per year:")
print(nhs_data["year"].value_counts())

# Count records per month
print("\nRecords per month:")
print(nhs_data["month"].value_counts())


Records per year:
year
2021    2906
2022    2466
2023    2450
2024    2383
2018     214
Name: count, dtype: int64

Records per month:
month
July         931
January      901
February     898
March        894
May          822
June         821
April        819
August       815
December     815
November     815
October      815
September    814
0            259
Name: count, dtype: int64
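
Two things stand out in these counts: 2018 has only 214 records, and 259 records have a month value of `0` rather than a month name. The following is a minimal sketch of how those rows could be separated out and the remaining records tabulated by year and month in calendar order. The helper `summarize_balance` and the `demo` frame are hypothetical and for illustration only; the demo values are made up, not taken from the dataset.

```python
import pandas as pd

# Calendar month names, used both for validation and for column ordering.
MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def summarize_balance(df, year_col="year", month_col="month"):
    """Return (year-by-month count table, rows whose month is not a month name)."""
    # Cast to str so that both the integer 0 and the string "0" are caught.
    months = df[month_col].astype(str)
    valid = months.isin(MONTH_ORDER)
    invalid_rows = df[~valid]
    # Cross-tabulate only the rows with a recognizable month name.
    counts = pd.crosstab(df.loc[valid, year_col], months[valid])
    # Force calendar order and show missing months as zero counts.
    counts = counts.reindex(columns=MONTH_ORDER, fill_value=0)
    return counts, invalid_rows


# Tiny illustrative frame (made-up values, not from the NHS dataset).
demo = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024, 2018],
    "month": ["April", "May", "April", "0", "0"],
})
counts, invalid_rows = summarize_balance(demo)
print(counts)
print(invalid_rows)
```

On the real data this would be called as `summarize_balance(nhs_data)`. Inspecting `invalid_rows["year"].value_counts()` would then show whether the `0` months coincide with the 2018 records, which the counts above suggest but do not confirm.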


## Identifying Outliers

Next, I will look at some key numerical fields to identify any outliers or unexpected values. This will help ensure that our analysis is based on clean, reliable data.
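
One common way to flag outliers in a numerical field is the interquartile range (IQR) rule. The sketch below is one possible approach, not necessarily the method used later in this notebook. The helper `iqr_outliers`, the multiplier `k=1.5`, and the demo values are assumptions for illustration; only the column name comes from the dataset.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Coerce to numeric so that stray non-numeric entries become NaN.
    values = pd.to_numeric(series, errors="coerce")
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Made-up values for illustration; the column name matches the dataset.
demo = pd.DataFrame(
    {"a&e_attendances_type_1": [100.0, 120.0, 110.0, 130.0, 115.0, 5000.0, None]}
)
mask = iqr_outliers(demo["a&e_attendances_type_1"])
print(demo[mask])
```

Many columns in the preview are mostly zeros. For such a column the IQR can be zero, in which case this rule flags every nonzero value, so the results need to be read with that in mind.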
