## Cleaning NHS Dataset - Attempt 2

Written by Autumn

This is my 2nd attempt at cleaning the NHS data ready for analysis.

I obtained the data set used in this program from the following link: https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics/final-may-provisional-june-2018

There is a publication for each month, and as my initial method did not provide the correct results, I looked for other files which could provide the data we need.

For this example I downloaded the below file:

<img src="data/images/nhs_file_may_2018.png">

I noticed that the first row of the data in this file had the figures we needed, so I created a function which would take in a list of files and then produce the cleaned versions ready to be joined together.

Below are the steps the function carries out:

First I imported the pandas library and the data, and made a copy of the data so the raw data remains unchanged:

In [5]:
import pandas as pd

data = pd.read_csv("data/NHS_raw/NHS_May_18.csv")

data_copy = data.copy()

data_copy.dtypes

REPORTING_PERIOD                                                                                                  object
STATUS                                                                                                            object
BREAKDOWN                                                                                                         object
PRIMARY_LEVEL                                                                                                     object
PRIMARY_LEVEL_DESCRIPTION                                                                                         object
                                                                                                                   ...  
CCR71a - New Urgent Referrals to Crisis Care teams in RP, Aged 18 and over                                        object
CCR72 - New Emergency Referrals to Crisis Care teams in RP with first face to face contact                        object
CCR72a - New Emergency Referrals

All the columns have been read in as 'object' data types and there are a lot of columns. We will be using the REPORTING_PERIOD and MHS32 - Referrals starting in RP columns to analyse the data, so I will also remove all the columns that we won't be using.

In [6]:
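# .copy() makes the subset an independent frame, so later in-place changes do not trigger a SettingWithCopyWarning
data_copy = data_copy[['REPORTING_PERIOD', 'MHS32 - Referrals starting in RP']].copy()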

data_copy

| | REPORTING_PERIOD | MHS32 - Referrals starting in RP |
|---|---|---|
| 0 | May-18 | 308921 |
| 1 | May-18 | 795 |
| 2 | May-18 | 670 |
| 3 | May-18 | 955 |
| 4 | May-18 | 1410 |
| ... | ... | ... |
| 2702 | May-18 | - |
| 2703 | May-18 | - |
| 2704 | May-18 | - |
| 2705 | May-18 | - |
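
As an alternative to dropping columns after loading, the two columns could be selected while the file is read. This is only a minimal sketch: the inline CSV text is made-up sample data standing in for the real file, and the `STATUS` value in it is a placeholder.

```python
import io

import pandas as pd

# Made-up sample standing in for data/NHS_raw/NHS_May_18.csv (assumption).
sample_csv = io.StringIO(
    "REPORTING_PERIOD,STATUS,MHS32 - Referrals starting in RP\n"
    "May-18,Final,308921\n"
    "May-18,Final,-\n"
)

wanted = ["REPORTING_PERIOD", "MHS32 - Referrals starting in RP"]

# usecols keeps only the listed columns; dtype="string" reads them as strings
# instead of the generic object dtype.
subset = pd.read_csv(sample_csv, usecols=wanted, dtype="string")
print(subset)
```

For a file with this many columns, reading only what is needed also keeps the frame small from the start.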


In [7]:
data_copy.rename(columns = {'REPORTING_PERIOD': 'month_start_date', 'MHS32 - Referrals starting in RP': 'new_referrals'}
                           , inplace=True)
data_copy

| | month_start_date | new_referrals |
|---|---|---|
| 0 | May-18 | 308921 |
| 1 | May-18 | 795 |
| 2 | May-18 | 670 |
| 3 | May-18 | 955 |
| 4 | May-18 | 1410 |
| ... | ... | ... |
| 2702 | May-18 | - |
| 2703 | May-18 | - |
| 2704 | May-18 | - |
| 2705 | May-18 | - |


I converted the remaining columns to string data types so they can be queried and manipulated.

In [8]:
data_copy[['month_start_date', 'new_referrals']] = data_copy[
    ['month_start_date', 'new_referrals']].astype("string")

data_copy.dtypes

month_start_date    string
new_referrals       string
dtype: object
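
Some rows in this column hold '-' rather than a number. If the counts are needed as numeric values later on, one option is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into a missing value. A minimal sketch on a few values copied from the rows shown earlier (the column name `referrals_numeric` is hypothetical):

```python
import pandas as pd

# Small frame shaped like data_copy, using values from the displayed rows.
df = pd.DataFrame({"new_referrals": ["308921", "795", "-"]}).astype("string")

# errors="coerce" converts unparseable entries such as "-" to a missing value
# instead of raising an error.
df["referrals_numeric"] = pd.to_numeric(df["new_referrals"], errors="coerce")
print(df)
```

Keeping the original string column alongside the numeric one makes it easy to check which rows were coerced.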

Now I can convert the month_start_date column to a datetime data type so we can filter by date range.

In [10]:
# Parse values such as "May-18" (abbreviated month name, two-digit year)
data_copy['month_start_date'] = pd.to_datetime(data_copy['month_start_date'], format='%b-%y')
data_copy.dtypes

month_start_date    datetime64[ns]
new_referrals               string
dtype: object
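
With a datetime column, filtering by date range becomes a simple comparison. The sketch below shows one way to do it; the rows and counts are made-up placeholders, not NHS figures, and the frame name `monthly` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; the counts are placeholders, not NHS figures.
monthly = pd.DataFrame(
    {
        "month_start_date": ["Apr-18", "May-18", "Jun-18", "Jul-18"],
        "new_referrals": ["100", "200", "300", "400"],
    }
)
monthly["month_start_date"] = pd.to_datetime(
    monthly["month_start_date"], format="%b-%y"
)

# Keep the rows whose month falls inside the range (both ends inclusive).
start, end = pd.Timestamp("2018-05-01"), pd.Timestamp("2018-06-30")
in_range = monthly[monthly["month_start_date"].between(start, end)]
print(in_range)
```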

Now we can take just the first row, as this is the sum of all the new referrals in the reporting period.

In [11]:
cleaned_df = data_copy.head(1)

cleaned_df

| | month_start_date | new_referrals |
|---|---|---|
| 0 | 2018-05-01 | 308921 |


And the file can be written to a csv.

In [None]:
cleaned_df.to_csv('data/NHS_cleaned/NHS_May_18_cleaned.csv', encoding='utf-8')
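
The steps above can be collected into a single function that takes a list of files, as described at the start. This is a minimal sketch and not necessarily the exact function: the names `clean_nhs_file` and `clean_nhs_files`, and the `_cleaned.csv` naming rule, are assumptions based on the paths shown above.

```python
from pathlib import Path

import pandas as pd

REFERRALS_COL = "MHS32 - Referrals starting in RP"


def clean_nhs_file(raw_path):
    """Return a one-row frame with the month and its total new referrals."""
    data = pd.read_csv(raw_path, dtype="string")
    cleaned = data[["REPORTING_PERIOD", REFERRALS_COL]].rename(
        columns={
            "REPORTING_PERIOD": "month_start_date",
            REFERRALS_COL: "new_referrals",
        }
    )
    cleaned["month_start_date"] = pd.to_datetime(
        cleaned["month_start_date"], format="%b-%y"
    )
    # The first row holds the total for the reporting period.
    return cleaned.head(1)


def clean_nhs_files(raw_paths, out_dir):
    """Clean each file, write it to out_dir, and return all rows joined."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    frames = []
    for raw_path in map(Path, raw_paths):
        cleaned = clean_nhs_file(raw_path)
        # Hypothetical naming rule: NHS_May_18.csv -> NHS_May_18_cleaned.csv
        cleaned.to_csv(out_dir / f"{raw_path.stem}_cleaned.csv", encoding="utf-8")
        frames.append(cleaned)
    return pd.concat(frames, ignore_index=True)
```

A call such as `clean_nhs_files(["data/NHS_raw/NHS_May_18.csv"], "data/NHS_cleaned")` would write one cleaned file per input and return the combined rows as one frame.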

The plan was to join all these monthly files together to form the full data set. Unfortunately, this function only works for files between May 2018 and April 2019. After this, the file name stayed the same on each publication, but the content of the file changed and it no longer provided the figures we needed. The following year, the file was no longer published at all. Also, before May 2018 the date format was different, so the function couldn't be used on those files without amending it slightly. However, even if those files had used the same date format as the files I could use, we still wouldn't have had enough data to cover the time period we need, due to the missing information.
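
One way the function could have been amended for the earlier files is to try more than one date format. The sketch below rests on assumptions: `parse_reporting_period` is a hypothetical helper, and only `%b-%y` is confirmed by the May 2018 file. The other formats in the list are guesses at what the older files might use.

```python
import pandas as pd

# Only "%b-%y" is known from the May 2018 file; the rest are guesses (assumption).
CANDIDATE_FORMATS = ["%b-%y", "%d/%m/%Y", "%Y-%m-%d"]


def parse_reporting_period(values, formats=CANDIDATE_FORMATS):
    """Parse a column of dates with the first format that fits every value."""
    for fmt in formats:
        try:
            return pd.to_datetime(values, format=fmt)
        except ValueError:
            continue  # This format did not match; try the next one.
    raise ValueError(f"None of the formats {formats} matched the data")


print(parse_reporting_period(pd.Series(["May-18", "May-18"])))
print(parse_reporting_period(pd.Series(["01/04/2018"])))
```

This would not solve the bigger problem of the missing figures in later publications; it only removes the date format as a blocker.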