<a href="https://colab.research.google.com/github/regulate-tech/nhstech/blob/main/subject-paper/nhstech_project_paper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Obtain ERIC data on cost of storing paper medical records

This notebook creates a dataset that we can use to analyse spending on storing paper medical records in the NHS. 

What this notebook does:
- Collects the ERIC releases that cover paper records costs (2017/18 to 2023/24).
- Tidies the data and joins it into a single file.

Context: 
- The NHS has committed several times to replacing paper medical records with digital records, but this has not happened.
- One way to understand the gap between commitment and delivery is to look at spending on storing and using paper records.
- All NHS Trusts must complete an annual return on their estate costs, which includes the cost of paper records. This is the [Estates Returns Information Collection (ERIC)](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection).

This notebook is based on an earlier notebook written by Richard, with some additions.

In [3]:
import os
import re
from urllib.parse import unquote, urlparse

import chardet
import numpy as np
import pandas as pd
import requests

First, fetch all the CSV files from the [published ERIC collection](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection) and store them in a local directory. As there are relatively few files, we simply list their links by hand.

In [4]:
DATA_URLS = [
    # NB: The 2023/24 data here is provisional - update the top 3 links here when the final data is published.
    'https://files.digital.nhs.uk/8B/75875E/ERIC%20-%202023_24%20-%20Trust%20data%20-%20Provisional.csv',
    'https://files.digital.nhs.uk/AF/AC55EB/ERIC%20-%202023_24%20-%20Site%20data%20-%20Provisional.csv',
    'https://files.digital.nhs.uk/5D/147420/ERIC%20-%202023_24%20-%20PFI%20data%20-%20Provisional.csv',
    'https://files.digital.nhs.uk/FB/BE3AC8/ERIC%20-%20202223%20-%20Trust%20data.csv',
    'https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv',
    'https://files.digital.nhs.uk/42/D5A005/ERIC%20-%202022_23%20-%20PFI%20data.csv',
    'https://files.digital.nhs.uk/08/84C46C/ERIC%20-%20202122%20-%20Trust%20data.csv',
    'https://files.digital.nhs.uk/EE/7E330D/ERIC%20-%20202122%20-%20Site%20Data%20v3.csv',
    'https://files.digital.nhs.uk/D3/D0DFD3/ERIC%20-%20202122%20-%20PFI%20data%20-%20v2.csv',
    'https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv',
    'https://files.digital.nhs.uk/0F/46F719/ERIC%20-%20202021%20-%20Site%20data%20v2.csv',
    'https://files.digital.nhs.uk/5F/4B00BC/ERIC%20-%20202021%20-%20PFI%20data.csv',
    'https://files.digital.nhs.uk/84/07227E/ERIC%20-%20201920%20-%20TrustData.csv',
    'https://files.digital.nhs.uk/11/BC1043/ERIC%20-%20201920%20-%20SiteData%20-%20v2.csv',
    'https://files.digital.nhs.uk/51/8C7C23/ERIC%20-%20201920%20-%20PFIData.csv',
    'https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv',
    'https://files.digital.nhs.uk/63/ADBFFF/ERIC%20-%20201819%20-%20SiteData%20v4.csv',
    'https://files.digital.nhs.uk/F6/791B8F/ERIC%20-%20201819%20-%20PFIData%20v3.csv',
    'https://files.digital.nhs.uk/1B/7C75CF/ERIC-201718-TrustData.csv',
    'https://files.digital.nhs.uk/A8/188D99/ERIC-201718-SiteData.csv',
    'https://files.digital.nhs.uk/09/928620/ERIC-201718-PFIData.csv'
]

# Fetch raw data 

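Download each file into `csv_files/`, normalising the filenames as we go: spaces become hyphens, and a year written as `2023_24` becomes `202324`, so every filename contains a six-digit year code.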

In [5]:
os.makedirs("csv_files", exist_ok=True)
for url in DATA_URLS:
    try:
        # A timeout stops one stalled download from hanging the whole notebook.
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        # Normalise filenames: spaces become hyphens, and "2023_24" becomes
        # "202324" so that every filename carries a six-digit year code.
        filename = unquote(os.path.basename(urlparse(url).path))
        filename = filename.replace(" ", "-").replace("_", "")
        with open(os.path.join("csv_files", filename), 'wb') as f:
            f.write(response.content)
    except Exception as e:
        print(f"Failed to download {url}: {e}")
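
The renaming step is easy to check without downloading anything. The sketch below pulls the same logic into a helper, here called `normalise_filename` (a hypothetical name, not part of the notebook), and applies it to two of the URLs listed above.

```python
import os
from urllib.parse import unquote, urlparse


def normalise_filename(url: str) -> str:
    """Return the local filename for an ERIC download URL.

    Hypothetical helper for illustration: decodes %20, turns spaces into
    hyphens and removes the underscore in years written as 2023_24.
    """
    filename = unquote(os.path.basename(urlparse(url).path))
    return filename.replace(" ", "-").replace("_", "")


provisional = normalise_filename(
    "https://files.digital.nhs.uk/8B/75875E/ERIC%20-%202023_24%20-%20Trust%20data%20-%20Provisional.csv"
)
older = normalise_filename("https://files.digital.nhs.uk/1B/7C75CF/ERIC-201718-TrustData.csv")

print(provisional)  # ERIC---202324---Trust-data---Provisional.csv
print(older)        # ERIC-201718-TrustData.csv
```

The first result is the name that the extraction step later looks for when it decides whether to skip the extra header rows.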

Convert every file to UTF-8, so that pandas can read them all with its default encoding.

In [6]:
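csv_files = [f for f in os.listdir('csv_files') if f.endswith('.csv')]
for filename in csv_files:
  filepath = os.path.join('csv_files', filename)
  with open(filepath, 'rb') as f:
    rawdata = f.read()
  # chardet guesses the encoding from the raw bytes; the guess can be None.
  encoding = chardet.detect(rawdata)["encoding"]
  # ASCII is a subset of UTF-8, so only other encodings need rewriting.
  if encoding and encoding.lower() not in ('utf-8', 'ascii'):
    try:
      # Decode before opening the file for writing, so a failed decode
      # cannot leave a truncated file behind.
      data = rawdata.decode(encoding)
      with open(filepath, 'w', encoding='utf-8', newline='') as f:
        f.write(data)
    except Exception as e:
      print(f"Error converting {filename}: {e}")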
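
To see why the conversion matters here: the ERIC column headers contain a pound sign, which is the single byte `0xA3` in Windows-1252 but two bytes in UTF-8. A minimal sketch using only the standard library follows. The header text is copied from the columns used later; the choice of `cp1252` as the source encoding is an assumption for illustration, since the notebook relies on whatever `chardet` detects.

```python
header = "Medical Records cost - Onsite (£)"

# The same header as it would be stored in a Windows-1252 file.
legacy_bytes = header.encode("cp1252")

# Reading those bytes as UTF-8 fails, because 0xA3 on its own is not valid UTF-8.
try:
    legacy_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the right source encoding and re-encoding gives clean UTF-8.
utf8_bytes = legacy_bytes.decode("cp1252").encode("utf-8")

print(decoded_as_utf8)                       # False
print(utf8_bytes.decode("utf-8") == header)  # True
```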

# Extract MRC data points

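ERIC covers many aspects of the estate, but for this exercise we only need the columns about medical records, in particular the medical records cost (MRC) figures. The cell below pulls every column whose name contains 'Medical Records' from each Trust-level file and tags it with the year code taken from the filename.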

In [7]:
def process_csv_files():
  """
  Opens CSV files that have 'Trust' in their filename,
  creates a dictionary with unique 'Trust Code' and 'Trust Name' pairings,
  extracts data from columns containing 'Medical Records', and stores it with the
  column name plus year code.
  """
  trust_data = {}

  # Get a list of all CSV files in the 'csv_files' directory that contain 'Trust'
  csv_files = [f for f in os.listdir('csv_files') if f.endswith('.csv') and "Trust" in f]
  for filename in csv_files:
    filepath = os.path.join('csv_files', filename)

    # Extract year from filename
    year_code = re.search(r'(\d{6})', filename)
    if year_code:
      year_code = year_code.group(1)
    else:
      year_code = "Unknown" 

    try:
      # The provisional 2023/24 file has two superfluous header rows: skip them.
      if "202324---Trust-data---Provisional" in filepath:
        df = pd.read_csv(filepath, skiprows=2)
      else:
        df = pd.read_csv(filepath)

      if "Trust Code" in df.columns and "Trust Name" in df.columns:
        for index, row in df.iterrows():
          trust_code = row["Trust Code"]
          trust_name = row["Trust Name"]
          if (trust_code, trust_name) not in trust_data:
            trust_data[(trust_code, trust_name)] = {}
              
          for col in df.columns:
            if "Medical Records" in col:
              # Store the data with the column name plus year code
              trust_data[(trust_code, trust_name)][col + "_" + year_code] = row[col]

    except Exception as e:
      print(f"Error processing file {filename}: {e}")

  return trust_data

trust_data = process_csv_files()

# Convert to a DataFrame, reshape and simplify the column names.
trust_df = pd.DataFrame.from_dict(trust_data, orient='index')
trust_df = trust_df.reset_index()
trust_df = trust_df.rename(columns={'level_0': 'trust_code', 'level_1': 'trust_name'})
replacements = {
    r'Medical Records cost - Onsite \(£\)_(\d{6})': r'mrc_on_\1',
    r'Medical Records cost - Offsite \(£\)_(\d{6})': r'mrc_off_\1',
    r'Medical Records cost - Total \(£\)_(\d{6})': r'mrc_tot_\1',
    r'Medical Records volume - Onsite \(records\)_(\d{6})': r'mrv_on_\1',
    r'Medical Records volume - Offsite \(records\)_(\d{6})': r'mrv_off_\1',
    r'Medical Records volume - Total \(records\)_(\d{6})': r'mrv_tot_\1'
}
for old, new in replacements.items():
    trust_df.columns = trust_df.columns.str.replace(old, new, regex=True)
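
The two regular-expression steps above (pulling the six-digit year code out of a filename, then shortening the column names) can be tried on plain strings. This sketch uses only `re`; `short_name` is a hypothetical helper, and only two of the six patterns are repeated to keep it short.

```python
import re

# Year code: the first run of six digits in the normalised filename.
match = re.search(r"(\d{6})", "ERIC---202324---Trust-data---Provisional.csv")
year_code = match.group(1) if match else "Unknown"

# Two of the rename patterns from the cell above.
replacements = {
    r"Medical Records cost - Onsite \(£\)_(\d{6})": r"mrc_on_\1",
    r"Medical Records cost - Offsite \(£\)_(\d{6})": r"mrc_off_\1",
}


def short_name(column: str) -> str:
    """Apply each pattern in turn; names that match nothing are unchanged."""
    for old, new in replacements.items():
        column = re.sub(old, new, column)
    return column


print(year_code)  # 202324
print(short_name("Medical Records cost - Onsite (£)_" + year_code))  # mrc_on_202324
print(short_name("Type of Medical Records (Select)_201819"))  # unchanged
```

Columns that match none of the patterns, such as the 'Type of Medical Records' ones, keep their long names.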

In [10]:
trust_df.dtypes

trust_code                                           object
trust_name                                           object
mrc_on_201819                                        object
mrc_off_201819                                       object
Type of Medical Records (Select)_201819              object
Medical Records service provision (Select)_201819    object
mrc_on_202324                                        object
mrc_off_202324                                       object
Type of Medical Records (Select)_202324              object
Medical Records service provision (Select)_202324    object
mrc_on_201920                                        object
mrc_off_201920                                       object
Type of Medical Records (Select)_201920              object
Medical Records service provision (Select)_201920    object
mrc_on_202021                                        object
mrc_off_202021                                       object
Type of Medical Records (Select)_202021 

In [9]:
trust_df.head()

| | trust_code | trust_name | mrc_on_201819 | mrc_off_201819 | Type of Medical Records (Select)_201819 | Medical Records service provision (Select)_201819 | mrc_on_202324 | mrc_off_202324 | Type of Medical Records (Select)_202324 | Medical Records service provision (Select)_202324 | ... | Type of Medical Records (Select)_202122 | Medical Records service provision (Select)_202122 | mrc_on_202223 | mrc_off_202223 | Type of Medical Records (Select)_202223 | Medical Records service provision (Select)_202223 | mrc_on_201718 | mrc_off_201718 | Type of Medical Records (Select)_201718 | Medical Records service provision (Select)_201718 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | R0A | MANCHESTER UNIVERSITY NHS FOUNDATION TRUST | 2871339 | 145631 | 3. Mixed | Internal | 5557232.0 | 524104.0 | 3. Mixed | Internal | ... | 3. Mixed | Internal | 5629069.0 | 296103.0 | 3. Mixed | Internal | 3612846 | 203096 | Mixed | Internal |
| 1 | R1A | WORCESTERSHIRE HEALTH AND CARE NHS TRUST | 169347 | 118345 | 3. Mixed | Hybrid | | | | | ... | | | | | | | 155193 | 107682 | Mixed | Hybrid |
| 2 | R1C | SOLENT NHS TRUST | 39584 | 83204 | 3. Mixed | Hybrid | 31174.0 | 62480.0 | 3. Mixed | Hybrid | ... | 3. Mixed | Hybrid | 15703.0 | 66994.0 | 3. Mixed | Hybrid | 53002 | 71981 | Mixed | Hybrid |
| 3 | R1D | SHROPSHIRE COMMUNITY HEALTH NHS TRUST | 60990 | 65359 | 3. Mixed | Hybrid | 22629.0 | 70135.0 | 3. Mixed | Hybrid | ... | 3. Mixed | Internal | 44528.0 | 67100.0 | 3. Mixed | Internal | 54090 | 55177 | Mixed | Hybrid |
| 4 | R1F | ISLE OF WIGHT NHS TRUST | 864334 | 20000 | 3. Mixed | Internal | 123683.0 | 1398640.0 | 3. Mixed | Internal | ... | 3. Mixed | Internal | 123683.0 | 227514.0 | 3. Mixed | Internal | 832889 | 0 | Mixed | Internal |


## Create a summary table

Now create a summary table containing `trust_code`, `trust_name` and every column whose name starts with `mrc`, with the columns in ascending year order.

Save this to a new CSV file: this will be our reference data going forward.

In [8]:
# Select the desired columns, and create a new dataframe, sorted by year.
selected_columns = ['trust_code', 'trust_name'] + [col for col in trust_df.columns if col.startswith('mrc')]
filtered_df = trust_df[selected_columns]
filtered_df = filtered_df.reindex(sorted(filtered_df.columns, key=lambda x: x.split('_')[-1] if '_' in x else x), axis=1)

# Remove thousands separators, convert to numbers, then replace nulls with 0.
# Casting to str first means values that pandas already parsed as numbers are
# handled the same way as values read as text.
number_columns = [col for col in filtered_df.columns if col not in ['trust_code', 'trust_name']]
filtered_df[number_columns] = filtered_df[number_columns].apply(lambda x: x.astype(str).str.replace(',', ''))
filtered_df[number_columns] = filtered_df[number_columns].apply(pd.to_numeric, errors='coerce')
filtered_df = filtered_df.fillna(0)

filtered_df.to_csv('trust_mrc_sorted_formatted.csv', index=False)
filtered_df.head(2)

| | mrc_on_201718 | mrc_off_201718 | mrc_on_201819 | mrc_off_201819 | mrc_on_201920 | mrc_off_201920 | mrc_on_202021 | mrc_off_202021 | mrc_on_202122 | mrc_off_202122 | mrc_on_202223 | mrc_off_202223 | mrc_on_202324 | mrc_off_202324 | trust_code | trust_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3612846.0 | 203096.0 | 2871339.0 | 145631.0 | 3285737.0 | 277200.0 | 3285787.0 | 458075.0 | 4875116.0 | 438847.0 | 5629069.0 | 296103.0 | 5557232.0 | 524104.0 | R0A | MANCHESTER UNIVERSITY NHS FOUNDATION TRUST |
| 1 | 155193.0 | 107682.0 | 169347.0 | 118345.0 | 184588.0 | 128996.0 | 201201.0 | 140606.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | R1A | WORCESTERSHIRE HEALTH AND CARE NHS TRUST |
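
The identifier columns land at the end of the table because of how the sort key works: it compares the text after the last underscore, and digits sort before lowercase letters. A small sketch on a shortened, made-up column list shows the effect. The second ordering is one possible alternative, not what the notebook does.

```python
columns = ["trust_code", "trust_name", "mrc_on_201819", "mrc_off_201819",
           "mrc_on_201718", "mrc_off_201718"]


# Same key as the notebook: the text after the last underscore.
def year_key(name: str) -> str:
    return name.split("_")[-1] if "_" in name else name


# 'trust_code' and 'trust_name' get the keys 'code' and 'name', which sort
# after every six-digit year code.
ordered = sorted(columns, key=year_key)
print(ordered)

# Alternative (assumption, not the notebook's behaviour): keep identifiers first.
id_columns = ["trust_code", "trust_name"]
ids_first = id_columns + sorted((c for c in columns if c not in id_columns), key=year_key)
print(ids_first)
```

Because Python's sort is stable, the onsite column stays ahead of the offsite column within each year.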


## TODO: Exclude inactive trusts

Some of the trusts in this file have since become inactive. We want to flag these, so we can exclude them from our analyses.
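
One possible starting point, sketched under a loud assumption: treat a trust as inactive if it reports no medical records cost in the most recent year. This is only a proxy, since a trust that is still active but reported zero cost, or did not submit a return, would also be flagged. An authoritative list of current trusts would be a better source. The rows below are taken from the summary output above, cut down to two years, and the `inactive` column name is hypothetical.

```python
import pandas as pd

# Two rows from the summary table, restricted to the last two years.
summary = pd.DataFrame({
    "trust_code": ["R0A", "R1A"],
    "mrc_on_202223": [5629069.0, 0.0],
    "mrc_off_202223": [296103.0, 0.0],
    "mrc_on_202324": [5557232.0, 0.0],
    "mrc_off_202324": [524104.0, 0.0],
})

# Assumption: the largest six-digit suffix is the most recent year.
mrc_columns = [c for c in summary.columns if c.startswith("mrc")]
latest_year = max(c.split("_")[-1] for c in mrc_columns)
latest_columns = [c for c in mrc_columns if c.endswith(latest_year)]

# Flag trusts with no reported cost in the latest year (missing values were
# already replaced with 0 when the summary table was built).
summary["inactive"] = summary[latest_columns].sum(axis=1) == 0

print(summary[["trust_code", "inactive"]])
```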