# **01 – Data Cleaning**

## Objectives

- Load the raw NHS ADHD referral dataset
- Inspect structure and data types
- Identify missing values and duplicates
- Convert date fields to datetime
- Prepare the dataset for exploratory analysis and modelling

## Inputs

- Raw dataset: data/raw/MHSDS_historic.csv
- Python libraries: pandas, numpy

## Outputs

- Cleaned dataframe
- Corrected data types
- Sorted time-indexed dataset ready for analysis


---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Shazia Mujahid\\Documents\\adhd-nhs-demand\\adhd-demand-forecast-england\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Shazia Mujahid\\Documents\\adhd-nhs-demand\\adhd-demand-forecast-england'

# Section 1. Data Loading & Initial Inspection

In this section, the raw NHS ADHD referral dataset is loaded and inspected to understand its structure, column names, data types, and overall completeness before performing any cleaning operations.

In [4]:
import pandas as pd

# Load dataset
df = pd.read_csv("data/raw/MHSDS_historic.csv")

# Preview first 5 rows
df.head()

| | REPORTING_PERIOD_START_DATE | REPORTING_PERIOD_END_DATE | INDICATOR_ID | AGE_GROUP | VALUE |
|---|---|---|---|---|---|
| 0 | 01/02/2024 | 29/02/2024 | ADHD003 | 0 to 4 | 1695 |
| 1 | 01/02/2024 | 29/02/2024 | ADHD003 | 18 to 24 | 57760 |
| 2 | 01/02/2024 | 29/02/2024 | ADHD003 | 25+ | 129075 |
| 3 | 01/02/2024 | 29/02/2024 | ADHD003 | 5 to 17 | 105355 |
| 4 | 01/02/2024 | 29/02/2024 | ADHD003 | Unknown | 15 |


In [5]:
# Inspect structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609 entries, 0 to 5608
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   REPORTING_PERIOD_START_DATE  5609 non-null   object
 1   REPORTING_PERIOD_END_DATE    5609 non-null   object
 2   INDICATOR_ID                 5609 non-null   object
 3   AGE_GROUP                    5609 non-null   object
 4   VALUE                        5609 non-null   object
dtypes: object(5)
memory usage: 219.2+ KB


In [6]:
# Summary statistics
df.describe()

| | REPORTING_PERIOD_START_DATE | REPORTING_PERIOD_END_DATE | INDICATOR_ID | AGE_GROUP | VALUE |
|---|---|---|---|---|---|
| count | 5609 | 5609 | 5609 | 5609 | 5609 |
| unique | 59 | 59 | 21 | 5 | 2137 |
| top | 01/09/2021 | 30/09/2021 | ADHD003 | 18 to 24 | * |
| freq | 102 | 102 | 295 | 1217 | 610 |


### Initial Observations

- The dataset contains 5,609 rows and 5 columns.
- No missing (null) values are present: every column has 5,609 non-null entries.
- All variables are currently stored as object (string) types, including VALUE.
- The VALUE column contains 610 suppressed entries marked with "*", which must be handled before numeric analysis.
- Date columns require conversion to datetime format.
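
These observations can be checked programmatically rather than read off the printed output. The sketch below is illustrative only: the helper name summarise_quality and the sample rows are assumptions and do not come from the original notebook. It counts nulls, suppressed "*" markers, and fully duplicated rows.

```python
import pandas as pd


def summarise_quality(df: pd.DataFrame, value_col: str = "VALUE", marker: str = "*") -> dict:
    """Count rows, nulls, suppressed markers and fully duplicated rows."""
    return {
        "rows": len(df),
        "nulls": int(df.isna().sum().sum()),
        "suppressed": int((df[value_col] == marker).sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Tiny illustrative frame (invented rows, not taken from the real file)
sample = pd.DataFrame(
    {
        "AGE_GROUP": ["0 to 4", "Unknown", "Unknown"],
        "VALUE": ["1695", "*", "*"],
    }
)

report = summarise_quality(sample)
print(report)
```

On the real data the same call would be summarise_quality(df). Note that df.duplicated() only flags rows that match an earlier row in every column.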

---

# Section 2. Data Cleaning & Type Conversion

In this section, data types are corrected, suppressed values are handled, and the dataset is prepared for time-series modelling.
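
One possible way to carry out these steps is sketched below. It assumes the dates use the day/month/year format seen in the preview, that "*" is the only non-numeric marker in VALUE, and that suppressed values should become NaN rather than be dropped or imputed. The function name clean_referrals and the sample rows are hypothetical.

```python
import pandas as pd

DATE_COLS = ["REPORTING_PERIOD_START_DATE", "REPORTING_PERIOD_END_DATE"]


def clean_referrals(df: pd.DataFrame, marker: str = "*") -> pd.DataFrame:
    """Convert dates, turn suppressed values into NaN, and sort by period start."""
    out = df.copy()

    # Parse day/month/year strings into datetime64 columns
    for col in DATE_COLS:
        out[col] = pd.to_datetime(out[col], format="%d/%m/%Y")

    # Replace the suppression marker with NaN, then convert to numbers.
    # Any other non-numeric string raises, so unexpected values are not hidden.
    out["VALUE"] = pd.to_numeric(out["VALUE"].mask(out["VALUE"] == marker))

    # Sort chronologically and use the period start date as the index
    return out.sort_values(DATE_COLS[0]).set_index(DATE_COLS[0])


# Illustrative rows only (invented, not taken from the real file)
sample = pd.DataFrame(
    {
        "REPORTING_PERIOD_START_DATE": ["01/03/2024", "01/02/2024", "01/02/2024"],
        "REPORTING_PERIOD_END_DATE": ["31/03/2024", "29/02/2024", "29/02/2024"],
        "INDICATOR_ID": ["ADHD003", "ADHD003", "ADHD003"],
        "AGE_GROUP": ["0 to 4", "0 to 4", "Unknown"],
        "VALUE": ["1700", "1695", "*"],
    }
)

cleaned = clean_referrals(sample)
print(cleaned.dtypes)
```

Keeping suppressed values as NaN preserves the row for each period and age group, which leaves the choice between dropping and imputing to the later analysis.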

---


# Push files to Repo

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
