Objective:
 
We initially believed there was likely to be an association between deprivation levels and rates of injury. To test this, we compare one month of the NHS maternity dataset (reported trauma) against the deprivation deciles at booking and check whether there is a positive correlation.

After exploratory data analysis we found a weak negative correlation between deprivation levels and injury rates (r ≈ -0.15, p ≈ 0.12): trusts serving richer areas reported slightly higher rates of injury during birth.

Scope creep: as a result the objective has changed. We will analyse almost every element of the NHS maternity dataset to find which factors are most strongly associated with injury rates.


Accounting for misreporting:

From previous analysis we are aware that trauma data is misreported, with some trusts reporting no trauma in their maternity wards.
There are other missing values and misreported figures elsewhere in the dataset, but these cannot be filtered out without analysing the data on a more granular scale. We will avoid this scope creep: otherwise the analysis would never finish, and the remaining data would likely be too incomplete to draw anything from.

Due to the official nature of the reporting, it is highly unlikely that trusts will over-report trauma.

Update - we will only remove trusts that have trauma rates under 10% as there are only 3 clear cases of misreporting.

These trusts are:
IMPERIAL COLLEGE HEALTHCARE NHS TRUST,
ROYAL FREE LONDON NHS FOUNDATION TRUST,
THE SHREWSBURY AND TELFORD HOSPITAL NHS TRUST


Results:
Visualisations can be found at the bottom
There is a strong correlation between % of Black mothers and emergency C-sections
There is a strong correlation between deprivation and % of mothers who smoke
There is a weak correlation between deprivation and emergency C-sections




In [151]:
import pandas as pd
import glob
import numpy as np
from numpy.polynomial.polynomial import Polynomial
import os
from datetime import datetime
import bokeh
import pandas_bokeh
from bokeh.plotting import figure, output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, HoverTool, LabelSet
from bokeh.transform import cumsum
from bokeh.palettes import Category20c
from math import pi
from IPython.display import FileLink
from scipy.stats import t
from bokeh.transform import dodge

In [66]:
# Initialises our maternity dataset
pd.set_option('plotting.backend', 'pandas_bokeh')
output_notebook()
file_path = 'C:/NHS maternity data/2023/msds-apr2023-exp-data-final.csv'

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
0,01/04/2023,30/04/2023,AgeAtBookingMotherAvg,National,ALL,ALL SUBMITTERS,,Average over women,31
1,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,National,ALL,ALL SUBMITTERS,20 to 24,Women,6525
2,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,National,ALL,ALL SUBMITTERS,25 to 29,Women,13710
3,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,National,ALL,ALL SUBMITTERS,30 to 34,Women,17640
4,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,National,ALL,ALL SUBMITTERS,35 to 39,Women,9895


In [67]:
def pearson_correlation_debug(x, y):
    """Pearson's r for two equal-length pandas Series (NaN if either has zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((x.iloc[i] - mean_x) * (y.iloc[i] - mean_y) for i in range(n))
    std_x = (sum((x.iloc[i] - mean_x) ** 2 for i in range(n)) ** 0.5)
    std_y = (sum((y.iloc[i] - mean_y) ** 2 for i in range(n)) ** 0.5)
    if std_x == 0 or std_y == 0:
        return float('nan')
    r = covariance / (std_x * std_y)
    return r


In [116]:
def remove_outliers(df):
    """Drop rows where any numeric column lies outside 1.5 * IQR of its quartiles."""
    df_clean = df.copy()

    for col in df.columns:
        if df[col].dtype.kind in 'bifc':  
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]
    
    return df_clean

In [68]:
def pearson_correlation_with_pvalue_debug(x, y):
    n = len(x)
    r = pearson_correlation_debug(x, y)
    if pd.isna(r):  # Check if r is NaN
        print("Pearson correlation coefficient is NaN.")
        return r, float('nan')
    
    # Handle the case where r is exactly 1 or -1
    if r == 1 or r == -1:
        print("Pearson correlation coefficient is exactly 1 or -1.")
        return r, 0.0  # P-value is 0 because the correlation is perfect
    
    # t-statistic
    t_stat = r * ((n - 2) / (1 - r ** 2)) ** 0.5
    p_value = 2 * t.sf(abs(t_stat), df=n - 2)
    if p_value < 1e-10:
        # Report very small p-values as a display string rather than a float
        p_value = "< 1e-10"
    
    return r, p_value

In [69]:

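# Keep rows whose organisation name contains 'trust'; .copy() avoids SettingWithCopyWarning on later assignments
dftrusts = df[df['Org_Name'].str.contains('trust', case=False, na=False)].copy()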

In [70]:

unwanted_measures = [
    "Missing Value / Value outside reporting parameters",
    "Pseudo postcode recorded (includes no fixed abode or resident overseas)",
    "Resident Elsewhere in UK, Channel Islands or Isle of Man"
]
dftrusts.head(100)

Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33452,01/04/2023,30/04/2023,AgeAtBookingMotherAvg,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Average over women,31
33453,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,20 to 24,Women,180
33454,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,25 to 29,Women,400
33455,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,30 to 34,Women,525
33456,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,35 to 39,Women,285
...,...,...,...,...,...,...,...,...,...
33547,01/04/2023,30/04/2023,SmokingStatusGroupBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Non-Smoker / Ex-Smoker,Women,1095
33548,01/04/2023,30/04/2023,SmokingStatusGroupBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Smoker,Women,95
33549,01/04/2023,30/04/2023,TotalBabies,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Babies,1210
33550,01/04/2023,30/04/2023,TotalBookings,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Women,1525


In [71]:
dftrusts = dftrusts[~dftrusts['Measure'].isin(unwanted_measures)]
dftrusts.head(100)

Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33452,01/04/2023,30/04/2023,AgeAtBookingMotherAvg,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Average over women,31
33453,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,20 to 24,Women,180
33454,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,25 to 29,Women,400
33455,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,30 to 34,Women,525
33456,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,35 to 39,Women,285
...,...,...,...,...,...,...,...,...,...
33562,01/04/2023,30/04/2023,ApgarScore5TermGroup7,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,7 to 10,Babies,240
33564,01/04/2023,30/04/2023,BabyFirstFeedBreastMilkStatus,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,Maternal or Donor Breast Milk,Babies,135
33565,01/04/2023,30/04/2023,BabyFirstFeedBreastMilkStatus,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,Not Breast Milk,Babies,135
33566,01/04/2023,30/04/2023,BirthweightTermGroup,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,2000g to 2499g,Babies,10


In [72]:
dftrusts.loc[dftrusts['Dimension'] == 'DeprivationDecileAtBooking', 'Measure'] = (
    dftrusts.loc[dftrusts['Dimension'] == 'DeprivationDecileAtBooking', 'Measure']
    .apply(lambda x: x[:2] if len(x) > 2 else x) 
)
dftrusts.loc[(dftrusts['Dimension'] == 'DeprivationDecileAtBooking') & (dftrusts['Measure'] == '01'), 'Measure'] = '1'

dftrusts.head(100)


Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33452,01/04/2023,30/04/2023,AgeAtBookingMotherAvg,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Average over women,31
33453,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,20 to 24,Women,180
33454,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,25 to 29,Women,400
33455,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,30 to 34,Women,525
33456,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,35 to 39,Women,285
...,...,...,...,...,...,...,...,...,...
33562,01/04/2023,30/04/2023,ApgarScore5TermGroup7,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,7 to 10,Babies,240
33564,01/04/2023,30/04/2023,BabyFirstFeedBreastMilkStatus,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,Maternal or Donor Breast Milk,Babies,135
33565,01/04/2023,30/04/2023,BabyFirstFeedBreastMilkStatus,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,Not Breast Milk,Babies,135
33566,01/04/2023,30/04/2023,BirthweightTermGroup,Provider,R0B,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,2000g to 2499g,Babies,10


In [73]:
dftrusts.loc[dftrusts['Dimension'] == 'DeprivationDecileAtBooking', 'Measure'] = (
    dftrusts.loc[dftrusts['Dimension'] == 'DeprivationDecileAtBooking', 'Measure']
    .astype(int)  
)
dftrusts.head(400)

Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33452,01/04/2023,30/04/2023,AgeAtBookingMotherAvg,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,Average over women,31
33453,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,20 to 24,Women,180
33454,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,25 to 29,Women,400
33455,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,30 to 34,Women,525
33456,01/04/2023,30/04/2023,AgeAtBookingMotherGroup,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,35 to 39,Women,285
...,...,...,...,...,...,...,...,...,...
33893,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R1H,BARTS HEALTH NHS TRUST,2,Women,370
33894,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R1H,BARTS HEALTH NHS TRUST,3,Women,370
33895,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R1H,BARTS HEALTH NHS TRUST,4,Women,185
33896,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R1H,BARTS HEALTH NHS TRUST,5,Women,75


In [74]:

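# .copy() so that adding the 'Total Deprivation' column later does not modify a slice of dftrusts
dfdeprivation = dftrusts[dftrusts['Dimension'] == 'DeprivationDecileAtBooking'].copy()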

dfdeprivation.head(100)


Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33485,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,2,Women,260
33486,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,3,Women,180
33487,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,4,Women,150
33488,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,5,Women,110
33489,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,6,Women,90
...,...,...,...,...,...,...,...,...,...
34428,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,1,Women,260
34429,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,10,Women,5
34532,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAJ,MID AND SOUTH ESSEX NHS FOUNDATION TRUST,2,Women,80
34533,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAJ,MID AND SOUTH ESSEX NHS FOUNDATION TRUST,3,Women,120


In [75]:

dfdeprivation = dfdeprivation.sort_values(by=['Org_Code', 'Measure'])

dfdeprivation.head(100)


Unnamed: 0,ReportingPeriodStartDate,ReportingPeriodEndDate,Dimension,Org_Level,Org_Code,Org_Name,Measure,Count_Of,Final_value
33493,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,1,Women,445
33485,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,2,Women,260
33486,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,3,Women,180
33487,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,4,Women,150
33488,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,5,Women,110
...,...,...,...,...,...,...,...,...,...
34427,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,9,Women,5
34429,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,10,Women,5
34540,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAJ,MID AND SOUTH ESSEX NHS FOUNDATION TRUST,1,Women,65
34532,01/04/2023,30/04/2023,DeprivationDecileAtBooking,Provider,RAJ,MID AND SOUTH ESSEX NHS FOUNDATION TRUST,2,Women,80


To calculate our stats, we multiply each decile (Measure) by its count (Final_value), sum these per trust, and divide by the trust's total count. This gives a count-weighted 'average deprivation' decile for each trust.
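
As a minimal sketch of this weighted average, the example below uses made-up counts for two hypothetical trusts (the names and numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: counts of women per deprivation decile for two made-up trusts.
toy = pd.DataFrame({
    "Org_Name": ["TRUST A", "TRUST A", "TRUST A", "TRUST B", "TRUST B"],
    "Measure": [1, 2, 10, 5, 6],           # deprivation decile
    "Final_value": [100, 50, 50, 20, 20],  # women booked in that decile
})

# Weighted mean decile = sum(decile * count) / sum(count), per trust.
toy["weighted"] = toy["Measure"] * toy["Final_value"]
grouped = toy.groupby("Org_Name")[["weighted", "Final_value"]].sum()
avg_decile = grouped["weighted"] / grouped["Final_value"]
print(avg_decile)
```

A lower value means the trust's bookings skew towards the more deprived deciles.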

In [76]:

dfdeprivation['Total Deprivation'] = dfdeprivation['Measure'] * dfdeprivation['Final_value']
total_deprivation_by_org = dfdeprivation.groupby('Org_Name')['Total Deprivation'].sum()

total_value_by_org = dfdeprivation.groupby('Org_Name')['Final_value'].sum()

total_deprivation_by_org


Org_Name
AIREDALE NHS FOUNDATION TRUST                                      950
AIREDALE NHS TRUST                                                 730
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST             1985
BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST    2790
BARNSLEY HOSPITAL NHS FOUNDATION TRUST                             875
                                                                  ... 
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST          1000
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST                          2585
WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST                 790
WYE VALLEY NHS TRUST                                               830
YORK AND SCARBOROUGH TEACHING HOSPITALS NHS FOUNDATION TRUST       225
Name: Total Deprivation, Length: 124, dtype: object

In [77]:
actual_deprivation = total_deprivation_by_org / total_value_by_org
actual_deprivation_df = actual_deprivation.reset_index()
actual_deprivation_df.columns = ['Org_Name', 'Deprivation']
actual_deprivation_df.head(100)

Unnamed: 0,Org_Name,Deprivation
0,AIREDALE NHS FOUNDATION TRUST,5.277778
1,AIREDALE NHS TRUST,5.214286
2,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,7.218182
3,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",4.292308
4,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,3.723404
...,...,...
95,THE PRINCESS ALEXANDRA HOSPITAL NHS TRUST,6.424242
96,"THE QUEEN ELIZABETH HOSPITAL, KING'S LYNN, NHS...",4.103448
97,THE ROTHERHAM NHS FOUNDATION TRUST,3.744186
98,THE ROYAL WOLVERHAMPTON NHS TRUST,3.613208


We now have our deprivation stats. Next we calculate the rate of trauma as a percentage for each trust, then test for an association between trauma rates and deprivation. Based on this we may test for associations between injury rates and other factors.

In [78]:
dftrauma = dftrusts[dftrusts['Dimension'] == 'GenitalTractTraumaticLesionGroup']
dftrauma = dftrauma[['Dimension', 'Org_Name', 'Measure', 'Final_value']]
dftrauma.head(100)

Unnamed: 0,Dimension,Org_Name,Measure,Final_value
33511,GenitalTractTraumaticLesionGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,At least one traumatic lesion,255
33512,GenitalTractTraumaticLesionGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,No traumatic lesion reported,390
33617,GenitalTractTraumaticLesionGroup,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,At least one traumatic lesion,170
33618,GenitalTractTraumaticLesionGroup,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,No traumatic lesion reported,30
33724,GenitalTractTraumaticLesionGroup,UNIVERSITY HOSPITALS DORSET NHS FOUNDATION TRUST,At least one traumatic lesion,110
...,...,...,...,...
38471,GenitalTractTraumaticLesionGroup,ST GEORGE'S UNIVERSITY HOSPITALS NHS FOUNDATIO...,At least one traumatic lesion,110
38472,GenitalTractTraumaticLesionGroup,ST GEORGE'S UNIVERSITY HOSPITALS NHS FOUNDATIO...,No traumatic lesion reported,90
38582,GenitalTractTraumaticLesionGroup,SOUTH WARWICKSHIRE UNIVERSITY NHS FOUNDATION T...,At least one traumatic lesion,115
38583,GenitalTractTraumaticLesionGroup,SOUTH WARWICKSHIRE UNIVERSITY NHS FOUNDATION T...,No traumatic lesion reported,30


In [79]:

dftrauma2 = dftrauma.groupby(['Org_Name', 'Measure'])['Final_value'].sum().unstack()

# Calculate total Final_value for each Org_Name
dftrauma2['Total'] = dftrauma2.sum(axis=1)
# Calculate the trauma percentage
dftrauma2['Trauma Percentage'] = (dftrauma2['At least one traumatic lesion'] / dftrauma2['Total']) * 100
columns = dftrauma2.columns.tolist()
dftrauma2.columns = columns

dftrauma2.head(200)


Unnamed: 0_level_0,At least one traumatic lesion,No traumatic lesion reported,Total,Trauma Percentage
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AIREDALE NHS FOUNDATION TRUST,20.0,60.0,80.0,25.000000
AIREDALE NHS TRUST,20.0,55.0,75.0,26.666667
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,80.0,35.0,115.0,69.565217
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",170.0,140.0,310.0,54.838710
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,85.0,50.0,135.0,62.962963
...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,75.0,60.0,135.0,55.555556
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,155.0,70.0,225.0,68.888889
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",30.0,25.0,55.0,54.545455
WYE VALLEY NHS TRUST,45.0,15.0,60.0,75.000000


Here we find an example of misreporting. Sorting the dataframe above by trauma percentage in ascending order shows that Imperial College Healthcare NHS Trust, Royal Free London NHS Foundation Trust and The Shrewsbury and Telford Hospital NHS Trust report trauma at implausibly low rates (under 10%). This data is almost certainly incorrect. For the other low-rate trusts, the low injury rates may simply reflect the small numbers in their maternity units.
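
The sort described above could be sketched as follows. This is a toy stand-in for dftrauma2 with made-up trust names and counts; showing Total next to the percentage makes it easier to tell small units apart from likely misreporting:

```python
import pandas as pd

# Toy stand-in for dftrauma2; trust names and counts are made up.
toy = pd.DataFrame({
    "Org_Name": ["TRUST A", "TRUST B", "TRUST C", "TRUST D"],
    "At least one traumatic lesion": [5, 170, 20, 110],
    "No traumatic lesion reported": [210, 140, 60, 90],
})
toy["Total"] = toy["At least one traumatic lesion"] + toy["No traumatic lesion reported"]
toy["Trauma Percentage"] = toy["At least one traumatic lesion"] / toy["Total"] * 100

# Lowest reported rates first, with the denominator shown alongside.
lowest = toy.sort_values("Trauma Percentage").head(3)
print(lowest[["Org_Name", "Total", "Trauma Percentage"]])
```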

In [80]:
print("Index:", dftrauma2.index.names)
print("Columns:", dftrauma2.columns)

Index: ['Org_Name']
Columns: Index(['At least one traumatic lesion', 'No traumatic lesion reported',
       'Total', 'Trauma Percentage'],
      dtype='object')


In [81]:

dftrauma2.reset_index(inplace=True)
print(dftrauma2.columns)

final_df = actual_deprivation_df.merge(dftrauma2[['Org_Name', 'Trauma Percentage']], on='Org_Name', how='left')
print(final_df.head())

Index(['Org_Name', 'At least one traumatic lesion',
       'No traumatic lesion reported', 'Total', 'Trauma Percentage'],
      dtype='object')
                                            Org_Name Deprivation  \
0                      AIREDALE NHS FOUNDATION TRUST    5.277778   
1                                 AIREDALE NHS TRUST    5.214286   
2  ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...    7.218182   
3  BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...    4.292308   
4             BARNSLEY HOSPITAL NHS FOUNDATION TRUST    3.723404   

   Trauma Percentage  
0          25.000000  
1          26.666667  
2          69.565217  
3          54.838710  
4          62.962963  


In [82]:
final_df.head(100)

Unnamed: 0,Org_Name,Deprivation,Trauma Percentage
0,AIREDALE NHS FOUNDATION TRUST,5.277778,25.000000
1,AIREDALE NHS TRUST,5.214286,26.666667
2,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,7.218182,69.565217
3,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",4.292308,54.838710
4,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,3.723404,62.962963
...,...,...,...
95,THE PRINCESS ALEXANDRA HOSPITAL NHS TRUST,6.424242,61.538462
96,"THE QUEEN ELIZABETH HOSPITAL, KING'S LYNN, NHS...",4.103448,75.000000
97,THE ROTHERHAM NHS FOUNDATION TRUST,3.744186,75.000000
98,THE ROYAL WOLVERHAMPTON NHS TRUST,3.613208,69.767442


In [83]:
final_df = final_df[final_df['Trauma Percentage'] >= 27]

The code above drops the data for:
IMPERIAL COLLEGE HEALTHCARE NHS TRUST: 2.3%
ROYAL FREE LONDON NHS FOUNDATION TRUST: 3.07%
THE SHREWSBURY AND TELFORD HOSPITAL NHS TRUST: 8.52%
SHEFFIELD TEACHING HOSPITALS NHS FOUNDATION TRUST: 24% (Very round number?)
AIREDALE NHS FOUNDATION TRUST: 25%
AIREDALE NHS TRUST: 26.66%
This is due to their exceedingly low rates of reported trauma.


In [84]:
#This ruins the correlation code for some reason
# Calculate deciles for trauma percentage
#final_df['Decile'] = pd.qcut(final_df['Trauma Percentage'], 10, labels=False) + 1  # +1 to make deciles start from 1



#final_df['Deprivation'] = pd.to_numeric(final_df['Deprivation'], errors='coerce')
#final_df['Trauma Percentage'] = pd.to_numeric(final_df['Trauma Percentage'], errors='coerce')
#print(final_df['Deprivation'].dtype)
#print(final_df['Trauma Percentage'].dtype)
#final_df.head(100)


In [85]:
final_df = final_df.dropna(subset=['Trauma Percentage'])
x = final_df['Deprivation']
y = final_df['Trauma Percentage']

r, p_value = pearson_correlation_with_pvalue_debug(x, y)
print("Pearson correlation coefficient:", r)
print("P-value:", p_value)

Pearson correlation coefficient: 0.14507588492112683
P-value: 0.12187142462818243


In [86]:
def calculate_percentage(part, total):
    if total == 0:
        return None
    percentage = (part / total) * 100
    return percentage

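As a higher deprivation decile means an area is less deprived, this is interesting: the positive coefficient suggests that less deprived areas report more trauma. Let's invert our deprivation numbers so that higher values mean more deprived.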

In [87]:
final_df['Inverted Deprivation'] = final_df['Deprivation'].max() - final_df['Deprivation']

# Calculate the Pearson correlation with the inverted values
x = final_df['Inverted Deprivation']
y = final_df['Trauma Percentage']

r, p_value = pearson_correlation_with_pvalue_debug(x, y)
print("Pearson correlation coefficient (with inverted deprivation):", r)
print("P-value:", p_value)


Pearson correlation coefficient (with inverted deprivation): -0.14507588492112675
P-value: 0.12187142462818258
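
The two results mirror each other because subtracting every value from the maximum is a linear transformation with slope -1: Pearson's r keeps its magnitude and changes sign, and the p-value is unchanged. A small sketch with made-up numbers (not from the dataset) illustrates this:

```python
import numpy as np

# Made-up values standing in for average decile and trauma percentage.
decile = np.array([3.2, 4.3, 5.4, 6.4, 7.2])
trauma = np.array([36.0, 55.0, 78.0, 62.0, 70.0])

r_original = np.corrcoef(decile, trauma)[0, 1]
# max - x has slope -1, so r keeps its size and flips sign.
r_inverted = np.corrcoef(decile.max() - decile, trauma)[0, 1]

print(r_original, r_inverted)
```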


The association is weak (r ≈ 0.15) and not statistically significant at the 5% level (p ≈ 0.12), but its direction is the opposite of what we expected, so we must find out why. We hypothesise that birth weights and age of mother at birth could have a stronger correlation with trauma, and that in richer areas mothers give birth later (potentially to larger babies?).

In [88]:
df_age_at_booking = dftrusts[dftrusts['Dimension'] == 'AgeAtBookingMotherAvg']
df_age_at_booking = df_age_at_booking[['Org_Name', 'Dimension', 'Final_value']]
print(df_age_at_booking.head())


                                                Org_Name  \
33452         MANCHESTER UNIVERSITY NHS FOUNDATION TRUST   
33552  SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...   
33659   UNIVERSITY HOSPITALS DORSET NHS FOUNDATION TRUST   
33768                            ISLE OF WIGHT NHS TRUST   
33860                             BARTS HEALTH NHS TRUST   

                   Dimension  Final_value  
33452  AgeAtBookingMotherAvg           31  
33552  AgeAtBookingMotherAvg           29  
33659  AgeAtBookingMotherAvg           31  
33768  AgeAtBookingMotherAvg           29  
33860  AgeAtBookingMotherAvg           30  


In [89]:
final_df_with_age = final_df.merge(df_age_at_booking[['Org_Name', 'Final_value']], on='Org_Name', how='left')


final_df_with_age.rename(columns={'Final_value': 'Avg age'}, inplace=True)

final_df_with_age.head()

Unnamed: 0,Org_Name,Deprivation,Trauma Percentage,Inverted Deprivation,Avg age
0,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,7.218182,69.565217,0.830999,31
1,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",4.292308,54.83871,3.756873,30
2,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,3.723404,62.962963,4.325776,29
3,BARTS HEALTH NHS TRUST,3.22619,36.434109,4.82299,30
4,BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,5.4,77.586207,2.64918,31


In [90]:
x = final_df_with_age['Avg age']
y = final_df_with_age['Trauma Percentage']
r, p_value = pearson_correlation_with_pvalue_debug(x, y)
print("Pearson correlation coefficient:", r)
print("P-value:", p_value)


Pearson correlation coefficient: -0.01382782345201866
P-value: 0.8828759849606236


There is almost no relationship between average age and trauma rates (r ≈ -0.01), and it is nowhere near statistically significant (p ≈ 0.88). The negative sign would suggest that injury rates go down as age goes up, but at this magnitude the sign is not meaningful. There could also be a miscalculation in my Pearson coefficient.
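
One way to rule out a miscalculation is to compare the hand-written formula against scipy.stats.pearsonr. The sketch below uses made-up values and a hypothetical helper name (manual_pearson) that applies the same formula as pearson_correlation_debug in vectorised form; it is a cross-check, not part of the original analysis:

```python
import pandas as pd
from scipy.stats import pearsonr

def manual_pearson(x, y):
    """Same formula as pearson_correlation_debug, vectorised (hypothetical helper)."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5)

# Made-up values standing in for 'Avg age' and 'Trauma Percentage'.
age = pd.Series([29, 30, 31, 32, 30, 29])
trauma = pd.Series([62.0, 55.0, 70.0, 56.0, 36.0, 55.0])

r_manual = manual_pearson(age, trauma)
r_scipy, p_scipy = pearsonr(age, trauma)
print(r_manual, r_scipy, p_scipy)
```

If the two coefficients agree on the real columns as well, the near-zero result reflects the data rather than the implementation.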

In [91]:
x = final_df_with_age['Avg age']
y = final_df_with_age['Deprivation']
r, p_value = pearson_correlation_with_pvalue_debug(x, y)
print("Pearson correlation coefficient:", r)
print("P-value:", p_value)


Pearson correlation coefficient: 0.5633596324930059
P-value: < 1e-10


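The p-value is extremely small (the function above reports it as < 1e-10), so there is very likely a real correlation between deprivation decile and average age: trusts serving less deprived areas tend to have older mothers at booking.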

In [92]:
df_smoking_status = dftrusts[dftrusts['Dimension'] == 'SmokingStatusGroupBooking']
df_smoking_status = df_smoking_status[['Org_Name', 'Dimension', 'Measure', 'Final_value']]
df_smoking_status.head()

Unnamed: 0,Org_Name,Dimension,Measure,Final_value
33547,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,SmokingStatusGroupBooking,Non-Smoker / Ex-Smoker,1095
33548,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,SmokingStatusGroupBooking,Smoker,95
33654,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,SmokingStatusGroupBooking,Non-Smoker / Ex-Smoker,385
33655,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,SmokingStatusGroupBooking,Smoker,65
33763,UNIVERSITY HOSPITALS DORSET NHS FOUNDATION TRUST,SmokingStatusGroupBooking,Non-Smoker / Ex-Smoker,330


In [93]:
df_smokers = df_smoking_status[df_smoking_status['Measure'] == 'Smoker']
total_final_values = df_smoking_status.groupby('Org_Name')['Final_value'].sum().reset_index()
df_smokers = df_smokers.merge(total_final_values, on='Org_Name', suffixes=('', '_Total'))
df_smokers['Smoker_Percentage'] = (df_smokers['Final_value'] / df_smokers['Final_value_Total']) * 100
df_smokers = df_smokers[['Org_Name', 'Smoker_Percentage']]
df_smokers.head(200)

Unnamed: 0,Org_Name,Smoker_Percentage
0,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,7.983193
1,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,14.444444
2,UNIVERSITY HOSPITALS DORSET NHS FOUNDATION TRUST,9.589041
3,ISLE OF WIGHT NHS TRUST,18.750000
4,BARTS HEALTH NHS TRUST,3.930131
...,...,...
116,IMPERIAL COLLEGE HEALTHCARE NHS TRUST,4.081633
117,UNIVERSITY HOSPITALS SUSSEX NHS FOUNDATION TRUST,6.569343
118,AIREDALE NHS TRUST,11.764706
119,NOTTINGHAM UNIVERSITY HOSPITALS NHS TRUST - CI...,14.285714


Let's check smoking against deprivation. I'm not sure what's going on with Chelsea and Westminster if you look at the above: 88% smokers in the maternity unit? Nuts. I've also realised I should have made a percentage function earlier. I don't want to rewrite my code, so I will use the calculate_percentage function (cell In [86]) from here on and leave the previous percentage calculations as they are.

In [94]:
final_df_with_smokers = final_df_with_age.merge(df_smokers, on='Org_Name', how='left')
final_df_with_smokers.head(200)

Unnamed: 0,Org_Name,Deprivation,Trauma Percentage,Inverted Deprivation,Avg age,Smoker_Percentage
0,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,7.218182,69.565217,0.830999,31,9.433962
1,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",4.292308,54.838710,3.756873,30,5.384615
2,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,3.723404,62.962963,4.325776,29,14.893617
3,BARTS HEALTH NHS TRUST,3.22619,36.434109,4.82299,30,3.930131
4,BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,5.4,77.586207,2.64918,31,7.368421
...,...,...,...,...,...,...
111,WHITTINGTON HEALTH NHS TRUST,4.114754,55.555556,3.934426,32,6.451613
112,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,4.166667,55.555556,3.882514,29,10.000000
113,WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,5.744444,68.888889,2.304736,30,13.483146
114,"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION T...",4.514286,54.545455,3.534895,29,8.571429


As we have NaN values, we will drop these rows and print which trusts have been dropped.
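
A minimal sketch of that step, on a toy frame standing in for final_df_with_smokers (the trust names and values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for final_df_with_smokers; names and values are made up.
toy = pd.DataFrame({
    "Org_Name": ["TRUST A", "TRUST B", "TRUST C"],
    "Trauma Percentage": [55.0, 62.0, 70.0],
    "Smoker_Percentage": [9.4, np.nan, 13.5],
})

# Rows with a NaN in any column are the ones dropna() will remove.
has_nan = toy.isna().any(axis=1)
dropped_trusts = toy.loc[has_nan, "Org_Name"].tolist()
print("Dropped trusts:", dropped_trusts)

toy_clean = toy.dropna()
```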

In [99]:
df_csf = dftrusts[dftrusts['Dimension'] == 'ComplexSocialFactorsInd']
df_csf_grouped = df_csf.groupby(['Org_Name', 'Measure'])['Final_value'].sum().unstack()
# Total is NaN for trusts with no 'Missing Value' row, so Missing_Percentage is NaN for them too
df_csf_grouped['Total'] = df_csf_grouped['N'] + df_csf_grouped['Y'] + df_csf_grouped['Missing Value']
df_csf_grouped['Missing_Percentage'] = df_csf_grouped.apply(lambda row: calculate_percentage(row['Missing Value'], row['Total']), axis=1)
# Note: csfpercent is Y expressed as a percentage of N, not of all women
df_csf_grouped['csfpercent'] = df_csf_grouped.apply(lambda row: calculate_percentage(row['Y'], row['N']), axis=1)
df_csf_grouped.reset_index(inplace=True)
df_csf_grouped = df_csf_grouped[['Org_Name', 'csfpercent', 'Missing_Percentage']]
df_csf_grouped.head(110)

Measure,Org_Name,csfpercent,Missing_Percentage
0,AIREDALE NHS FOUNDATION TRUST,5.714286,
1,AIREDALE NHS TRUST,3.846154,
2,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,14.583333,
3,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",20.370370,
4,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,11.627907,
...,...,...,...
105,UNIVERSITY HOSPITALS BRISTOL AND WESTON NHS FO...,9.677419,
106,UNIVERSITY HOSPITALS COVENTRY AND WARWICKSHIRE...,32.558140,0.869565
107,UNIVERSITY HOSPITALS DORSET NHS FOUNDATION TRUST,10.447761,
108,UNIVERSITY HOSPITALS OF DERBY AND BURTON NHS F...,1.685393,


We're going to prepare a few more variables (ethnicity, delivery method), and from there we can run a multivariate analysis to find some strong associations.

In [100]:
final_df_csfpercent = final_df_with_smokers.merge(df_csf_grouped, on='Org_Name', how='left')

final_df_csfpercent.head()


Unnamed: 0,Org_Name,Deprivation,Trauma Percentage,Inverted Deprivation,Avg age,Smoker_Percentage,csfpercent,Missing_Percentage
0,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,7.218182,69.565217,0.830999,31,9.433962,14.583333,
1,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",4.292308,54.83871,3.756873,30,5.384615,20.37037,
2,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,3.723404,62.962963,4.325776,29,14.893617,11.627907,
3,BARTS HEALTH NHS TRUST,3.22619,36.434109,4.82299,30,3.930131,49.704142,0.393701
4,BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,5.4,77.586207,2.64918,31,7.368421,46.969697,


In [101]:
df_ethnic_category = dftrusts[dftrusts['Dimension'] == 'EthnicCategoryMotherGroup']

df_ethnic_category = df_ethnic_category[['Dimension', 'Org_Name', 'Measure', 'Final_value']]
for column in df_ethnic_category.columns:
    if df_ethnic_category[column].dtype == 'bool':
        df_ethnic_category[column] = df_ethnic_category[column].astype(int)

df_ethnic_category.head(100)


Unnamed: 0,Dimension,Org_Name,Measure,Final_value
33497,EthnicCategoryMotherGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Any other ethnic group,65
33498,EthnicCategoryMotherGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Asian or Asian British,290
33499,EthnicCategoryMotherGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Black or Black British,165
33501,EthnicCategoryMotherGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Mixed,60
33502,EthnicCategoryMotherGroup,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,Not Stated,45
...,...,...,...,...
35270,EthnicCategoryMotherGroup,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,Black or Black British,5
35271,EthnicCategoryMotherGroup,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,Mixed,5
35272,EthnicCategoryMotherGroup,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,Not Stated,5
35273,EthnicCategoryMotherGroup,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,White,225


We one-hot encode the ethnicity categories so that we can add them to a multivariable analysis, as they are non-numeric variables. We will do the same for method of birth. Once we've done this we'll combine all our data into a single dataframe.

The methods of transformation below are very roundabout. This is because I was initially going to turn all the values into booleans, and then decided to opt for the approach below instead.

Again, as this is a technical project I feel committed to showing technical capability, so doing things in unnecessarily advanced ways to achieve simple objectives is, if anything, a positive.

In [102]:
df_encoded = pd.get_dummies(df_ethnic_category, columns=['Measure'])
for column in df_encoded.columns:
    if df_encoded[column].dtype == 'bool':
        df_encoded[column] = df_encoded[column].astype(int)
df_encoded = df_encoded.drop(columns=['Measure_Not known', 'Measure_Not Stated'], errors='ignore')
for index, row in df_encoded.iterrows():
    for col in df_encoded.columns:
        if row[col] == 1:
            df_encoded.at[index, col] = row['Final_value']
df_encoded = df_encoded.groupby('Org_Name').max()
df_encoded.head(50)

Unnamed: 0_level_0,Dimension,Final_value,Measure_Any other ethnic group,Measure_Asian or Asian British,Measure_Black or Black British,Measure_Mixed,Measure_White
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AIREDALE NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,135,0,45,5,5,135
AIREDALE NHS TRUST,EthnicCategoryMotherGroup,95,0,35,5,5,95
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,185,10,50,5,5,185
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",EthnicCategoryMotherGroup,280,20,235,65,25,280
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,210,5,5,5,5,210
BARTS HEALTH NHS TRUST,EthnicCategoryMotherGroup,645,65,645,140,25,320
BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,275,15,130,40,10,275
BIRMINGHAM WOMEN'S AND CHILDREN'S NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,280,40,255,75,30,280
BLACKPOOL TEACHING HOSPITALS NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,215,10,10,5,5,215
BOLTON NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,205,15,65,15,5,205
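
For comparison, the encode-then-fill steps above can be sketched more directly as a pivot. This is a minimal sketch on toy rows, not the notebook's data: `long_df`, `TRUST A` and `TRUST B` are hypothetical, and it assumes each trust has one row per category, as in the `df_ethnic_category` output shown earlier. Under that assumption it should give the same per-category counts as the loop.

```python
import pandas as pd

# Toy long-format rows mirroring the columns used above (names are hypothetical)
long_df = pd.DataFrame({
    'Org_Name': ['TRUST A', 'TRUST A', 'TRUST A', 'TRUST B', 'TRUST B'],
    'Measure': ['White', 'Mixed', 'Not Stated', 'White', 'Asian or Asian British'],
    'Final_value': [135, 5, 45, 95, 35],
})

# Drop the categories that carry no ethnicity information
known = long_df[~long_df['Measure'].isin(['Not known', 'Not Stated'])]

# One row per trust, one column per category, counts as the cell values;
# categories a trust did not report are filled with 0
wide_counts = known.pivot_table(index='Org_Name', columns='Measure',
                                values='Final_value', aggfunc='max',
                                fill_value=0).add_prefix('Measure_')
print(wide_counts)
```

Because the pivot never creates a `Final_value` column, there is nothing extra to drop before converting the counts into percentages.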


In [103]:
df_delivery_method = dftrusts[dftrusts['Dimension'] == 'DeliveryMethodBabyGroup']
df_delivery_method = df_delivery_method[['Dimension', 'Org_Name', 'Measure', 'Final_value']]
df_delivery_method_encoded = pd.get_dummies(df_delivery_method, columns=['Measure'])
for column in df_delivery_method_encoded.columns:
    if df_delivery_method_encoded[column].dtype == 'bool':
        df_delivery_method_encoded[column] = df_delivery_method_encoded[column].astype(int)
df_delivery_method_encoded = df_delivery_method_encoded.drop(columns=['Measure_Other'], errors='ignore')
for index, row in df_delivery_method_encoded.iterrows():
    final_value = row['Final_value']  # Capture the final_value for this row
    for col in df_delivery_method_encoded.columns:
        if row[col] == 1:
            df_delivery_method_encoded.at[index, col] = final_value
df_delivery_method_encoded = df_delivery_method_encoded.drop(columns='Final_value')
df_delivery_method_encoded = df_delivery_method_encoded.groupby('Org_Name').max()
df_delivery_method_encoded1 = df_delivery_method_encoded
# Check for duplicate 'Org_Name' in the index before resetting
if 'Org_Name' in df_delivery_method_encoded.index.names:
    duplicate_check = df_delivery_method_encoded.index.get_level_values('Org_Name').duplicated().any()
    print("Duplicates in Org_Name index:", duplicate_check)

df_delivery_method_encoded1
#we will translate these into percentages

Duplicates in Org_Name index: False


Unnamed: 0_level_0,Dimension,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AIREDALE NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,30,25,15,80
AIREDALE NHS TRUST,DeliveryMethodBabyGroup,30,25,15,75
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,65,65,20,90
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",DeliveryMethodBabyGroup,85,140,40,280
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,5,35,10,130
...,...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,45,50,25,110
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,DeliveryMethodBabyGroup,65,60,35,190
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",DeliveryMethodBabyGroup,35,35,5,80
WYE VALLEY NHS TRUST,DeliveryMethodBabyGroup,25,30,10,45


In [104]:
if 'Org_Name' in df_delivery_method_encoded.index.names:
    df_delivery_method_encoded.fillna(value=0, inplace=True)
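# Convert the measure columns to numeric; a bare df[2:5] slices rows, not columns
measure_cols = [c for c in df_delivery_method_encoded.columns if c.startswith('Measure_')]
df_delivery_method_encoded[measure_cols] = df_delivery_method_encoded[measure_cols].apply(pd.to_numeric, errors='coerce')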
numeric_cols = df_delivery_method_encoded.select_dtypes(include=[np.number]).columns.tolist()
df_delivery_method_encoded['Row_Sum'] = df_delivery_method_encoded[numeric_cols].sum(axis=1)

for col in numeric_cols:
    df_delivery_method_encoded[col] = df_delivery_method_encoded[col] / df_delivery_method_encoded['Row_Sum'] * 100

df_delivery_method_encoded.drop('Row_Sum', axis=1, inplace=True)
df_delivery_method_encoded

Unnamed: 0_level_0,Dimension,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AIREDALE NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,20.000000,16.666667,10.000000,53.333333
AIREDALE NHS TRUST,DeliveryMethodBabyGroup,20.689655,17.241379,10.344828,51.724138
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,,27.083333,27.083333,8.333333,37.500000
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",,15.596330,25.688073,7.339450,51.376147
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,,2.777778,19.444444,5.555556,72.222222
...,...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,19.565217,21.739130,10.869565,47.826087
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,DeliveryMethodBabyGroup,18.571429,17.142857,10.000000,54.285714
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",DeliveryMethodBabyGroup,22.580645,22.580645,3.225806,51.612903
WYE VALLEY NHS TRUST,DeliveryMethodBabyGroup,22.727273,27.272727,9.090909,40.909091


In [105]:
if 'Org_Name' in df_delivery_method_encoded.index.names:
    df_delivery_method_encoded.fillna(value=0, inplace=True)

if 'Org_Name' in df_delivery_method_encoded1.index.names:
    df_delivery_method_encoded1.fillna(value=0, inplace=True)

print("Columns in df_delivery_method_encoded:", df_delivery_method_encoded.columns)
print("Columns in df_delivery_method_encoded1:", df_delivery_method_encoded1.columns)

# Org_Name is the index after the groupby, not a column, so read it from the index
org_names_in_encoded = set(df_delivery_method_encoded.index)
org_names_in_encoded1 = set(df_delivery_method_encoded1.index)

missing_org_names = org_names_in_encoded1.difference(org_names_in_encoded)

print("Org_Names missing from df_delivery_method_encoded:")
print(missing_org_names)


Columns in df_delivery_method_encoded: Index(['Dimension', 'Measure_Elective caesarean section',
       'Measure_Emergency caesarean section', 'Measure_Instrumental',
       'Measure_Spontaneous'],
      dtype='object')
Columns in df_delivery_method_encoded1: Index(['Dimension', 'Measure_Elective caesarean section',
       'Measure_Emergency caesarean section', 'Measure_Instrumental',
       'Measure_Spontaneous'],
      dtype='object')
Org_Names missing from df_delivery_method_encoded:
set()

In [55]:
df_delivery_method_encoded

Unnamed: 0_level_0,Dimension,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AIREDALE NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,20.000000,16.666667,10.000000,53.333333
AIREDALE NHS TRUST,DeliveryMethodBabyGroup,20.689655,17.241379,10.344828,51.724138
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,0,27.083333,27.083333,8.333333,37.500000
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",0,15.596330,25.688073,7.339450,51.376147
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,0,2.777778,19.444444,5.555556,72.222222
...,...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,19.565217,21.739130,10.869565,47.826087
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,DeliveryMethodBabyGroup,18.571429,17.142857,10.000000,54.285714
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",DeliveryMethodBabyGroup,22.580645,22.580645,3.225806,51.612903
WYE VALLEY NHS TRUST,DeliveryMethodBabyGroup,22.727273,27.272727,9.090909,40.909091


In [56]:
df_delivery_method_encoded1

Unnamed: 0_level_0,Dimension,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous
Org_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AIREDALE NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,20.000000,16.666667,10.000000,53.333333
AIREDALE NHS TRUST,DeliveryMethodBabyGroup,20.689655,17.241379,10.344828,51.724138
ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATION TRUST,0,27.083333,27.083333,8.333333,37.500000
"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOSPITALS NHS TRUST",0,15.596330,25.688073,7.339450,51.376147
BARNSLEY HOSPITAL NHS FOUNDATION TRUST,0,2.777778,19.444444,5.555556,72.222222
...,...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,DeliveryMethodBabyGroup,19.565217,21.739130,10.869565,47.826087
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,DeliveryMethodBabyGroup,18.571429,17.142857,10.000000,54.285714
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",DeliveryMethodBabyGroup,22.580645,22.580645,3.225806,51.612903
WYE VALLEY NHS TRUST,DeliveryMethodBabyGroup,22.727273,27.272727,9.090909,40.909091


In [106]:
for column in df_encoded.columns:
    if df_encoded[column].dtype == 'bool':
        df_encoded[column] = df_encoded[column].astype(int)
numeric_cols = df_encoded.select_dtypes(include=[np.number]).columns.tolist()
df_encoded['Row_Sum'] = df_encoded[numeric_cols].sum(axis=1)
for col in numeric_cols:
    df_encoded[col] = df_encoded[col] / df_encoded['Row_Sum'] * 100
df_encoded.drop('Row_Sum', axis=1, inplace=True)
df_encoded = df_encoded.groupby('Org_Name').max()
if 'Org_Name' in df_encoded.index.names:
    df_encoded.reset_index(inplace=True)
if 'Final_value' in df_encoded.columns:
    df_encoded = df_encoded.drop(columns='Final_value')
df_encoded.head(100)

Unnamed: 0,Org_Name,Dimension,Measure_Any other ethnic group,Measure_Asian or Asian British,Measure_Black or Black British,Measure_Mixed,Measure_White
0,AIREDALE NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,0.000000,13.846154,1.538462,1.538462,41.538462
1,AIREDALE NHS TRUST,EthnicCategoryMotherGroup,0.000000,14.893617,2.127660,2.127660,40.425532
2,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,EthnicCategoryMotherGroup,2.272727,11.363636,1.136364,1.136364,42.045455
3,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",EthnicCategoryMotherGroup,2.209945,25.966851,7.182320,2.762431,30.939227
4,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,1.136364,1.136364,1.136364,1.136364,47.727273
...,...,...,...,...,...,...,...
95,THE PRINCESS ALEXANDRA HOSPITAL NHS TRUST,EthnicCategoryMotherGroup,0.826446,2.479339,3.305785,0.826446,46.280992
96,"THE QUEEN ELIZABETH HOSPITAL, KING'S LYNN, NHS...",EthnicCategoryMotherGroup,1.724138,1.724138,1.724138,1.724138,46.551724
97,THE ROTHERHAM NHS FOUNDATION TRUST,EthnicCategoryMotherGroup,3.225806,6.451613,1.612903,1.612903,43.548387
98,THE ROYAL WOLVERHAMPTON NHS TRUST,EthnicCategoryMotherGroup,2.453988,11.656442,7.361963,2.453988,38.036810
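
One thing to note about the cell above: `Final_value` is still a numeric column when `Row_Sum` is computed, so it is part of the denominator (for Airedale, 135 / 325 gives the 41.54 shown for White). If the intention is each category's share of mothers with a known ethnicity, one option is to restrict the denominator to the `Measure_` columns. The sketch below assumes that intention and reuses the Airedale and Barnsley counts from the earlier output; `counts`, `measure_cols` and `percentages` are illustrative names.

```python
import pandas as pd

# Per-trust counts copied from the earlier df_encoded output
counts = pd.DataFrame({
    'Org_Name': ['AIREDALE NHS FOUNDATION TRUST',
                 'BARNSLEY HOSPITAL NHS FOUNDATION TRUST'],
    'Final_value': [135, 210],
    'Measure_Any other ethnic group': [0, 5],
    'Measure_Asian or Asian British': [45, 5],
    'Measure_Black or Black British': [5, 5],
    'Measure_Mixed': [5, 5],
    'Measure_White': [135, 210],
}).set_index('Org_Name')

# Only the category columns take part in the normalisation
measure_cols = [c for c in counts.columns if c.startswith('Measure_')]

# Divide each row by the sum of its own category counts
row_totals = counts[measure_cols].sum(axis=1)
percentages = counts[measure_cols].div(row_totals, axis=0) * 100
print(percentages.round(2))
```

With this denominator each row sums to 100, which makes the columns directly comparable between trusts.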


In [107]:
df_delivery_method_encoded = df_delivery_method_encoded.drop(columns=['Dimension'], errors='ignore')
df_encoded = df_encoded.drop(columns=['Dimension'], errors='ignore')

In [108]:
merged_df = pd.merge(df_encoded, df_delivery_method_encoded, on='Org_Name', how='outer')
centraldf = pd.merge(merged_df, final_df_csfpercent, on='Org_Name', how='outer')
# Drop the Airedale rows as they have so many NaN values
centraldf = centraldf[~centraldf['Org_Name'].str.startswith('AIREDALE')]
centraldf = centraldf.drop(columns=['Missing_Percentage'], errors='ignore')
centraldf = centraldf.drop(columns=['Final_value'], errors='ignore')
centraldf

Unnamed: 0,Org_Name,Measure_Any other ethnic group,Measure_Asian or Asian British,Measure_Black or Black British,Measure_Mixed,Measure_White,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous,Deprivation,Trauma Percentage,Inverted Deprivation,Avg age,Smoker_Percentage,csfpercent
2,ASHFORD AND ST PETER'S HOSPITALS NHS FOUNDATIO...,2.272727,11.363636,1.136364,1.136364,42.045455,27.083333,27.083333,8.333333,37.500000,7.218182,69.565217,0.830999,31.0,9.433962,14.583333
3,"BARKING, HAVERING AND REDBRIDGE UNIVERSITY HOS...",2.209945,25.966851,7.182320,2.762431,30.939227,15.596330,25.688073,7.339450,51.376147,4.292308,54.838710,3.756873,30.0,5.384615,20.370370
4,BARNSLEY HOSPITAL NHS FOUNDATION TRUST,1.136364,1.136364,1.136364,1.136364,47.727273,2.777778,19.444444,5.555556,72.222222,3.723404,62.962963,4.325776,29.0,14.893617,11.627907
5,BARTS HEALTH NHS TRUST,3.532609,35.054348,7.608696,1.358696,17.391304,11.734694,21.938776,11.224490,55.102041,3.22619,36.434109,4.82299,30.0,3.930131,49.704142
6,BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,2.013423,17.449664,5.369128,1.342282,36.912752,18.181818,22.222222,8.080808,51.515152,5.4,77.586207,2.64918,31.0,7.368421,46.969697
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,1.063830,1.063830,1.063830,1.063830,47.872340,19.565217,21.739130,10.869565,47.826087,4.166667,55.555556,3.882514,29.0,10.000000,6.382979
121,WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,0.645161,3.870968,0.645161,0.645161,47.096774,18.571429,17.142857,10.000000,54.285714,5.744444,68.888889,2.304736,30.0,13.483146,9.876543
122,"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION T...",1.492537,1.492537,2.985075,1.492537,46.268657,22.580645,22.580645,3.225806,51.612903,4.514286,54.545455,3.534895,29.0,8.571429,12.903226
123,WYE VALLEY NHS TRUST,0.000000,1.886792,1.886792,1.886792,47.169811,22.727273,27.272727,9.090909,40.909091,5.354839,75.000000,2.694342,30.0,10.344828,11.538462


In [150]:
print(model.summary())

output_notebook()

source = ColumnDataSource(data={
    'x': centraldf['Smoker_Percentage'],
    'y': centraldf['csfpercent'],
    'Org_Name': centraldf['Org_Name']
})

p = figure(title="Smoker Percentage vs. CSF Percentage",
           x_axis_label='Smoker Percentage',
           y_axis_label='CSF Percentage',
           tools="pan,wheel_zoom,box_zoom,reset")

p.circle('x', 'y', size=10, source=source, color="navy", alpha=0.5)

hover = HoverTool()
hover.tooltips = [("Org_Name", "@Org_Name"),
                  ("Smoker_Percentage", "@x"),
                  ("CSF_Percentage", "@y")]

p.add_tools(hover)

show(p)

                            OLS Regression Results                            
Dep. Variable:      Trauma Percentage   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     2.330
Date:                Thu, 29 Aug 2024   Prob (F-statistic):             0.0368
Time:                        16:32:02   Log-Likelihood:                -460.21
No. Observations:                 123   AIC:                             934.4
Df Residuals:                     116   BIC:                             954.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
co

We will now find correlations between all of our data

In [110]:
correlation = centraldf.copy()
correlation = correlation.drop(columns=['Org_Name'])

columns_of_interest = [
    'Measure_Elective caesarean section',
    'Measure_Emergency caesarean section',
    'Inverted Deprivation',
    'Trauma Percentage',
    'Smoker_Percentage',
    'csfpercent'
]

results = []

for col1 in columns_of_interest:
    for col2 in correlation.columns:
        if col1 != col2 and col2 not in columns_of_interest:
            x = correlation[col1].dropna()
            y = correlation[col2].dropna()
            
            common_index = x.index.intersection(y.index)
            x = x.loc[common_index]
            y = y.loc[common_index]
            
            if len(x) > 0 and len(y) > 0 and len(x) == len(y):
                r, p_value = pearson_correlation_with_pvalue_debug(x, y)
                results.append({
                    'Measure': col1,
                    'ComparedMeasure': col2,
                    'Pearson Correlation': r,
                    'P-value': p_value
                })

correlationfin = pd.DataFrame(results)
pd.options.display.float_format = '{:.2f}'.format

correlationfin



Pearson correlation coefficient is exactly 1 or -1.


Unnamed: 0,Measure,ComparedMeasure,Pearson Correlation,P-value
0,Measure_Elective caesarean section,Measure_Any other ethnic group,0.11,0.23
1,Measure_Elective caesarean section,Measure_Asian or Asian British,-0.01,0.94
2,Measure_Elective caesarean section,Measure_Black or Black British,0.06,0.51
3,Measure_Elective caesarean section,Measure_Mixed,-0.03,0.71
4,Measure_Elective caesarean section,Measure_White,-0.01,0.92
5,Measure_Elective caesarean section,Measure_Instrumental,0.03,0.74
6,Measure_Elective caesarean section,Measure_Spontaneous,-0.59,< 1e-10
7,Measure_Elective caesarean section,Deprivation,0.05,0.57
8,Measure_Elective caesarean section,Avg age,0.11,0.22
9,Measure_Emergency caesarean section,Measure_Any other ethnic group,0.14,0.12
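
The helper `pearson_correlation_with_pvalue_debug` is not shown in this part of the notebook. A minimal stand-in, assuming SciPy is available, could look like the sketch below. The function name `pearson_with_pvalue` and the toy frame are hypothetical; `scipy.stats.pearsonr` is the real library call. Dropping NaNs on the two columns together does the same job as the index intersection in the loop above.

```python
import pandas as pd
from scipy import stats

def pearson_with_pvalue(df, col1, col2):
    """Return (r, p) for two columns, using only rows where both are present."""
    pair = df[[col1, col2]].dropna()
    # With fewer than 3 paired points the correlation is not informative
    if len(pair) < 3:
        return float('nan'), float('nan')
    r, p_value = stats.pearsonr(pair[col1], pair[col2])
    return r, p_value

# Toy data: the last row has a missing value and is excluded from the pair
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, None],
    'b': [2.1, 3.9, 6.2, 7.8, 10.0],
})
r, p_value = pearson_with_pvalue(toy, 'a', 'b')
print(f"r = {r:.3f}, p = {p_value:.4f}")
```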


In [111]:
import statsmodels.api as sm
centraldf_filled = centraldf
centraldf_filled = centraldf_filled.drop(columns=['Org_Name'], errors='ignore')

centraldf_filled = centraldf_filled.fillna(centraldf_filled.mean())

Y = centraldf_filled['Trauma Percentage']

X = centraldf_filled[['Smoker_Percentage', 'csfpercent', 'Measure_Elective caesarean section', 
                      'Measure_Emergency caesarean section', 'Inverted Deprivation', 'Avg age']]

X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()

print(model.summary())


                            OLS Regression Results                            
Dep. Variable:      Trauma Percentage   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     2.330
Date:                Thu, 29 Aug 2024   Prob (F-statistic):             0.0368
Time:                        15:01:15   Log-Likelihood:                -460.21
No. Observations:                 123   AIC:                             934.4
Df Residuals:                     116   BIC:                             954.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
co
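
As a sanity check on the summary header, the adjusted R-squared and the F-statistic can be recomputed from R-squared, the number of observations and the number of predictors. The sketch below uses the standard OLS formulas with the figures printed above; because the printed R-squared is rounded to three decimals, the results should agree with the reported 0.061 and 2.330 only up to that rounding.

```python
# Figures read from the summary header above
r_squared = 0.108
n_obs = 123
n_predictors = 6

# Residual degrees of freedom: observations minus predictors minus the constant
df_resid = n_obs - n_predictors - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic compares explained variance per predictor with residual variance
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_statistic:.2f}")
```

The gap between 0.108 and the adjusted value shows how much of the small fit is attributable to simply having six predictors.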

Moving on to visualisation here. I could spend forever refitting these calculations to optimise them, and realistically they could still result in nothing due to contaminated data (poor reporting sources). Again, not all trusts are poor at reporting, but enough are to make these results hard to interpret in a valuable way.

In [115]:
output_notebook()

hist, edges = np.histogram(centraldf['Smoker_Percentage'].dropna(), bins=20)

source = ColumnDataSource(data={
    'top': hist,
    'left': edges[:-1],
    'right': edges[1:]
})

p = figure(title="Histogram of Smoker Percentage",
           x_axis_label='Smoker Percentage',
           y_axis_label='Frequency',
           tools="pan,wheel_zoom,box_zoom,reset")

p.quad(top='top', bottom=0, left='left', right='right', source=source, 
       fill_color="navy", line_color="white", alpha=0.7)

show(p)


In [124]:
centraldf_no_outliers = remove_outliers(centraldf)
centraldf_no_outliers

Unnamed: 0,Org_Name,Measure_Any other ethnic group,Measure_Asian or Asian British,Measure_Black or Black British,Measure_Mixed,Measure_White,Measure_Elective caesarean section,Measure_Emergency caesarean section,Measure_Instrumental,Measure_Spontaneous,Deprivation,Trauma Percentage,Inverted Deprivation,Avg age,Smoker_Percentage,csfpercent
8,BLACKPOOL TEACHING HOSPITALS NHS FOUNDATION TRUST,2.17,2.17,1.09,1.09,46.74,17.02,25.53,10.64,46.81,3.72,73.91,4.33,29.00,19.05,16.28
9,BOLTON NHS FOUNDATION TRUST,2.94,12.75,2.94,0.98,40.20,13.16,21.05,11.84,53.95,3.92,43.75,4.13,30.00,10.00,13.64
11,BUCKINGHAMSHIRE HEALTHCARE NHS TRUST,0.99,11.88,1.98,1.98,41.58,14.00,20.00,6.00,60.00,7.58,71.88,0.46,32.00,5.06,9.86
12,CALDERDALE AND HUDDERSFIELD NHS FOUNDATION TRUST,1.52,13.64,3.79,2.27,39.39,14.29,23.21,8.93,53.57,4.23,60.61,3.81,30.00,11.39,23.81
15,CHESTERFIELD ROYAL HOSPITAL NHS FOUNDATION TRUST,0.93,1.87,1.87,1.87,46.73,13.95,25.58,13.95,46.51,4.90,72.00,3.15,30.00,9.26,16.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118,WEST SUFFOLK NHS FOUNDATION TRUST,1.25,2.50,1.25,2.50,46.25,15.15,15.15,9.09,60.61,5.93,60.87,2.12,30.00,9.09,13.16
120,WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDA...,1.06,1.06,1.06,1.06,47.87,19.57,21.74,10.87,47.83,4.17,55.56,3.88,29.00,10.00,6.38
121,WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,0.65,3.87,0.65,0.65,47.10,18.57,17.14,10.00,54.29,5.74,68.89,2.30,30.00,13.48,9.88
122,"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION T...",1.49,1.49,2.99,1.49,46.27,22.58,22.58,3.23,51.61,4.51,54.55,3.53,29.00,8.57,12.90
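
The definition of `remove_outliers` is not shown in this part of the notebook, so the rule it applies is unknown here. Purely as an illustration, the sketch below assumes a common choice, the 1.5 x IQR rule applied to every numeric column. The name `remove_outliers_iqr`, the parameter `k` and the toy data are hypothetical, and the notebook's own function may behave differently.

```python
import pandas as pd

def remove_outliers_iqr(df, k=1.5):
    """Keep rows whose numeric values all lie within k * IQR of the quartiles."""
    numeric = df.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    within = (numeric >= q1 - k * iqr) & (numeric <= q3 + k * iqr)
    # Treat missing values as "not an outlier" so NaN alone never drops a row
    keep = (within | numeric.isna()).all(axis=1)
    return df[keep]

# Toy data: trust E has a smoker percentage far outside the others
toy = pd.DataFrame({
    'Org_Name': ['A', 'B', 'C', 'D', 'E'],
    'Smoker_Percentage': [9.0, 10.0, 11.0, 10.5, 60.0],
})
filtered = remove_outliers_iqr(toy)
print(filtered)
```

A rule like this drops a whole trust if any one of its numeric columns is extreme, which is worth bearing in mind when the number of trusts is already small.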


In [126]:
output_notebook()

source = ColumnDataSource(data={
    'x': centraldf_no_outliers['Smoker_Percentage'],
    'y': centraldf_no_outliers['Inverted Deprivation'],
    'Org_Name': centraldf_no_outliers['Org_Name']
})

p = figure(title="Smoker Percentage vs. Deprivation",
           x_axis_label='Smoker Percentage',
           y_axis_label='Deprivation',
           tools="pan,wheel_zoom,box_zoom,reset")

p.circle('x', 'y', size=10, source=source, color="navy", alpha=0.5)

hover = HoverTool()
hover.tooltips = [("Org_Name", "@Org_Name"),
                  ("Smoker_Percentage", "@x"),
                  ("Deprivation", "@y")]

p.add_tools(hover)

show(p)


The average ratio of Deprivation to Smoker Percentage is: 0.29


In [135]:
output_notebook()

centraldf_no_outliers['Smoker_Percentage'] = pd.to_numeric(centraldf_no_outliers['Smoker_Percentage'], errors='coerce')
centraldf_no_outliers['Inverted Deprivation'] = pd.to_numeric(centraldf_no_outliers['Inverted Deprivation'], errors='coerce')

centraldf_no_outliers = centraldf_no_outliers.dropna(subset=['Smoker_Percentage', 'Inverted Deprivation'])

source = ColumnDataSource(data={
    'x': centraldf_no_outliers['Smoker_Percentage'],
    'y': centraldf_no_outliers['Inverted Deprivation'],
    'Org_Name': centraldf_no_outliers['Org_Name']
})

p = figure(title="Smoker Percentage vs. Deprivation with Trend Line",
           x_axis_label='Smoker Percentage',
           y_axis_label='Deprivation',
           tools="pan,wheel_zoom,box_zoom,reset")

p.circle('x', 'y', size=10, source=source, color="navy", alpha=0.5)

hover = HoverTool()
hover.tooltips = [("Org_Name", "@Org_Name"),
                  ("Smoker_Percentage", "@x"),
                  ("Deprivation (Higher = more deprived)", "@y")]

p.add_tools(hover)

#(linear regression)
x = centraldf_no_outliers['Smoker_Percentage']
y = centraldf_no_outliers['Inverted Deprivation']

# Fit a straight line (y = mx + c) by least squares
p_coeff = np.polyfit(x, y, 1)
y_fit = np.polyval(p_coeff, x)

# The red line is a fitted trend rather than a ratio, so label it as one
p.line(x, y_fit, color="red", line_width=2, legend_label="Trend Line")

show(p)
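
To make the trend-line step concrete: `np.polyfit(x, y, 1)` returns the coefficients with the highest power first, so for a degree-1 fit the result unpacks as slope then intercept. The sketch below uses toy points that lie exactly on y = 2x + 1 (an assumption for illustration only) and evaluates the line at just the two ends of the x range, which is enough to draw a straight line.

```python
import numpy as np

# Toy points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Degree-1 least-squares fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Two endpoints are sufficient to draw the fitted line
x_line = np.array([x.min(), x.max()])
y_line = slope * x_line + intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Printing the slope alongside the plot also gives a number to quote for how strongly the two measures move together.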

In [137]:
output_notebook()

centraldf_no_outliers['Measure_Black or Black British'] = pd.to_numeric(centraldf_no_outliers['Measure_Black or Black British'], errors='coerce')
centraldf_no_outliers['Measure_Emergency caesarean section'] = pd.to_numeric(centraldf_no_outliers['Measure_Emergency caesarean section'], errors='coerce')

centraldf_no_outliers = centraldf_no_outliers.dropna(subset=['Measure_Black or Black British', 'Measure_Emergency caesarean section'])

source = ColumnDataSource(data={
    'x': centraldf_no_outliers['Measure_Black or Black British'],
    'y': centraldf_no_outliers['Measure_Emergency caesarean section'],
    'Org_Name': centraldf_no_outliers['Org_Name']
})

p = figure(title="Black or Black British vs. Emergency Caesarean Section",
           x_axis_label='Black or Black British',
           y_axis_label='Emergency Caesarean Section',
           tools="pan,wheel_zoom,box_zoom,reset")

p.circle('x', 'y', size=10, source=source, color="navy", alpha=0.5)

hover = HoverTool()
hover.tooltips = [("Org_Name", "@Org_Name"),
                  ("Black or Black British", "@x"),
                  ("Emergency Caesarean Section", "@y")]

p.add_tools(hover)

x = centraldf_no_outliers['Measure_Black or Black British']
y = centraldf_no_outliers['Measure_Emergency caesarean section']

# (y = mx + c)
p_coeff = np.polyfit(x, y, 1)
y_fit = np.polyval(p_coeff, x)

p.line(x, y_fit, color="red", line_width=2, legend_label="Trend Line")

show(p)

In [144]:
top_10_emergency_c_section = centraldf_no_outliers.sort_values(by='Measure_Emergency caesarean section', ascending=False).head(10)

top_10_emergency_c_section = top_10_emergency_c_section[['Org_Name', 'Measure_Emergency caesarean section']]
# The column is 'Org_Name' (capital N); rename both columns in one call
top_10_emergency_c_section = top_10_emergency_c_section.rename(columns={
    'Org_Name': 'NHS Trust',
    'Measure_Emergency caesarean section': '% of Births - Emergency C section',
})

top_10_emergency_c_section

Unnamed: 0,NHS Trust,% of Births - Emergency C section
62,NORTH WEST ANGLIA NHS FOUNDATION TRUST,33.33
52,MEDWAY NHS FOUNDATION TRUST,32.86
20,DONCASTER AND BASSETLAW TEACHING HOSPITALS NHS...,31.15
33,GREAT WESTERN HOSPITALS NHS FOUNDATION TRUST,29.63
63,NORTHAMPTON GENERAL HOSPITAL NHS TRUST,28.12
66,NOTTINGHAM UNIVERSITY HOSPITALS NHS TRUST,27.78
112,UNIVERSITY HOSPITALS OF NORTH MIDLANDS NHS TRUST,27.63
116,WARRINGTON AND HALTON TEACHING HOSPITALS NHS F...,27.5
123,WYE VALLEY NHS TRUST,27.27
89,STOCKPORT NHS FOUNDATION TRUST,26.67


In [147]:
bottom_10_emergency_c_section = centraldf_no_outliers.sort_values(by='Measure_Emergency caesarean section', ascending=True).head(10)

bottom_10_emergency_c_section = bottom_10_emergency_c_section[['Org_Name', 'Measure_Emergency caesarean section']]
bottom_10_emergency_c_section

Unnamed: 0,Org_Name,Measure_Emergency caesarean section
83,SOUTH TYNESIDE AND SUNDERLAND NHS FOUNDATION T...,12.73
36,HARROGATE AND DISTRICT NHS FOUNDATION TRUST,14.29
80,SHERWOOD FOREST HOSPITALS NHS FOUNDATION TRUST,14.81
54,MID CHESHIRE HOSPITALS NHS FOUNDATION TRUST,14.89
35,HAMPSHIRE HOSPITALS NHS FOUNDATION TRUST,15.07
118,WEST SUFFOLK NHS FOUNDATION TRUST,15.15
84,SOUTH WARWICKSHIRE UNIVERSITY NHS FOUNDATION T...,15.69
61,NORTH TEES AND HARTLEPOOL NHS FOUNDATION TRUST,16.67
65,NORTHUMBRIA HEALTHCARE NHS FOUNDATION TRUST,16.98
121,WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,17.14


In [149]:
output_notebook()

centraldf_no_outliers['Inverted Deprivation'] = pd.to_numeric(centraldf_no_outliers['Inverted Deprivation'], errors='coerce')
centraldf_no_outliers['% of Births - Emergency C section'] = pd.to_numeric(centraldf_no_outliers['Measure_Emergency caesarean section'], errors='coerce')

centraldf_no_outliers = centraldf_no_outliers.dropna(subset=['Inverted Deprivation', '% of Births - Emergency C section'])

source = ColumnDataSource(data={
    'x': centraldf_no_outliers['Inverted Deprivation'],
    'y': centraldf_no_outliers['% of Births - Emergency C section'],
    'Org_Name': centraldf_no_outliers['Org_Name']
})

p = figure(title="Inverted Deprivation vs. % of Births - Emergency C section",
           x_axis_label='Inverted Deprivation',
           y_axis_label='% of Births - Emergency C section',
           tools="pan,wheel_zoom,box_zoom,reset")

# Add a scatter renderer with circle markers
p.circle('x', 'y', size=10, source=source, color="navy", alpha=0.5)

# Add tooltips
hover = HoverTool()
hover.tooltips = [("Org_Name", "@Org_Name"),
                  ("Inverted Deprivation", "@x"),
                  ("% of Births - Emergency C section", "@y")]

p.add_tools(hover)

x = centraldf_no_outliers['Inverted Deprivation']
y = centraldf_no_outliers['% of Births - Emergency C section']

p_coeff = np.polyfit(x, y, 1)
y_fit = np.polyval(p_coeff, x)

p.line(x, y_fit, color="red", line_width=2, legend_label="Trend Line")
show(p)
