# Extract hospital performance for pathway model

## Aims

* Extract and save hospital performance for pathway simulation model
* Create breakdowns by weekend/weekday/day/night

## Import libraries

In [1]:
import numpy as np
import pandas as pd

## Load data

* Load data
* Restrict data to fields necessary for pathway extraction
* Remove in-hospital admissions

In [2]:
# Load data
data_loaded = pd.read_csv(
    './data/SAMueL ssnap extract v2.csv', low_memory=False)

# Number of years data covers
# January 2016 to December 2021
data_years = 6.0

In [3]:
# Restrict fields
used_fields = [
    'TeamName',
    # 'S1Gender',
    'OnsettoArrivalMinutes',
    'S1OnsetTimeType',
    'ArrivaltoBrainImagingMinutes',
    'S2StrokeType',
    'S2Thrombolysis',                   # Yes/No thrombolysis
    'ArrivaltoThrombolysisMinutes',     # thrombolysis time
    'ArrivaltoArterialPunctureMinutes', # thrombectomy time
    'S2NihssArrival'                    # stroke severity
]

data_loaded = data_loaded[used_fields]
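
As a possible alternative, the same restriction could be applied while the file is read, so that unused columns are never loaded. The sketch below uses the `usecols` argument of `pd.read_csv` on a small in-memory CSV. The toy rows, the `Unused` column, and the names `toy_csv`, `toy_fields` and `toy_data` are made up for illustration and do not come from the SSNAP extract.

```python
import io

import pandas as pd

# Hypothetical three-column CSV standing in for the real extract.
toy_csv = io.StringIO(
    "TeamName,S2StrokeType,Unused\n"
    "Hospital A,I,1\n"
    "Hospital B,PIH,2\n"
)

# Only the listed columns are parsed; 'Unused' is skipped.
toy_fields = ['TeamName', 'S2StrokeType']
toy_data = pd.read_csv(toy_csv, usecols=toy_fields)

print(list(toy_data.columns))
```

Selecting the columns after loading, as in the cell above, gives the same result; filtering at read time mainly saves memory on wide files.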

## Extract hospital performance

In [4]:
def extract_hospital_performance(stroke_team: str, stroke_type: str, group_df: pd.DataFrame):
    """ 
    Measure metrics of the hospital's performance.
    
    The metrics are measured on various subgroups of patients that
    are recorded using masks. Each patient has a value of True when
    the mask conditions are met and False when they are not.
    The time distribution metrics are calculated from the log-normal
    distribution of times.
    
    The measured values are:
    + admissions per year
    + proportion of all patients given thrombolysis
    + proportion of all patients given thrombectomy
    + proportion of patients given thrombectomy 
      who were also given thrombolysis
    For thrombolysis and for thrombectomy:
    + proportion of all patients with known onset time
    + proportion of mask1 with onset to arrival on time 
    + proportion of mask2 with arrival to scan on time 
    + proportion of mask3 with onset to scan on time 
    + proportion of mask4 with enough time to treat 
    + proportion of mask5 with treated 
    For subgroups of patients meeting certain conditions
    and for thrombolysis and for thrombectomy:
    + mean (mu) of lognormed onset to arrival times 
    + standard deviation (sigma) of lognormed onset to arrival times
    + mean (mu) of lognormed arrival to scan times
    + standard deviation (sigma) of lognormed arrival to scan times
    + mean (mu) of lognormed scan to treatment times
    + standard deviation (sigma) of lognormed scan to treatment times
    
    ----- Method -----
    The masks are created in the following way. With each step, whittle
    down the full group of patients. In the example, the sizes of 
    blocks are arbitrary.     
    Key:
    ░ - patients still in the subgroup
    ▒ - patients rejected from the subgroup at this step
    █ - patients rejected from the subgroup in previous steps

    ▏Start: Full group                                                ▕
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
    ▏-------------------------All patients----------------------------▕
    ▏                                                                 ▕
    ▏Mask 1: Is onset time known?                                     ▕
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
    ▏--------------------Yes----------------------▏---------No--------▕
    ▏                                             ▏                   ▕
    ▏Mask 2: Is onset to arrival within the time limit?               ▕
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒▒▒▒▒█████████████████████
    ▏---------------Yes----------------▏----No----▏------Rejected-----▕
    ▏                                  ▏          ▏                   ▕
    ▏Mask 3: Is arrival to scan within the time limit?                ▕
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒████████████████████████████████
    ▏------------Yes------------▏--No--▏-----------Rejected-----------▕
    ▏                           ▏      ▏                              ▕
    ▏Mask 4: Is onset to scan within the time limit?                  ▕
    ░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒███████████████████████████████████████
    ▏----------Yes---------▏-No-▏---------------Rejected--------------▕
    ▏                      ▏    ▏                                     ▕
    ▏Mask 5: Is there enough time left for thrombolysis/thrombectomy? ▕
    ░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒████████████████████████████████████████████
    ▏------Yes------▏--No--▏------------------Rejected----------------▕
    ▏               ▏      ▏                                          ▕
    ▏Mask 6: Did the patient receive thrombolysis/thrombectomy?       ▕
    ░░░░░░░░░░░▒▒▒▒▒███████████████████████████████████████████████████
    ▏----Yes---▏-No-▏---------------------Rejected--------------------▕


    Patient proportions measured:    
    +--------------------------------+--------------------------------+
    | Proportion                     | Measure                        |
    +--------------------------------+--------------------------------+
    | Thrombolysis rate or           | Total number treated           |
    | thrombectomy rate              | divided by all patients.       |
    | Onset known                    | "Yes" to Mask 1 divided by     |
    |                                | all patients.                  |
    | Onset to arrival within limit  | "Yes" to Mask 2 divided by     |
    |                                | "Yes" to Mask 1.               |
    | Arrival to scan within limit   | "Yes" to Mask 3 divided by     |
    |                                | "Yes" to Mask 2.               |
    | Onset to scan within limit     | "Yes" to Mask 4 divided by     |
    |                                | "Yes" to Mask 3.               |
    | Enough time left for treatment | "Yes" to Mask 5 divided by     |
    |                                | "Yes" to Mask 4.               |
    | "Chosen" for thrombolysis      | "Yes" to Mask 6 divided by     |
    | or for thrombectomy            | "Yes" to Mask 5.               |
    +--------------------------------+--------------------------------+
    The "proportion chosen for thrombolysis" is a different measure
    from the "thrombolysis rate", which is the proportion of all of the
    patients at the start who were given thrombolysis. It is possible
    some patients received thrombolysis in real life but that by this
    process they were rejected before Mask 6.

    The log-normal mean and standard deviation (mu and sigma) are taken
    for the groups of patients who answer "Yes" to everything up to and
    including particular steps.
    +---------------------------------+-------------------------------+
    | Subgroup who answer "yes" at... | Log-normal distribution       |
    +---------------------------------+-------------------------------+
    |                          Mask 2 | Onset to arrival time         |
    |                          Mask 3 | Arrival to scan time          |
    |                          Mask 6 | Scan to needle or             |
    |                                 | scan to puncture time         |
    +---------------------------------+-------------------------------+
    
    Thrombolysis and thrombectomy can be given different time limits
    for the creation of these masks.
    
    Inputs:
    -------
    stroke_team - str. Name of the hospital for labelling.
    stroke_type - str. Label for the stroke type of the patients in
                  group_df: 'nlvo' (non-Large Vessel Occlusion),
                  'lvo' (Large Vessel Occlusion), or 'other'.
    group_df    - pandas DataFrame. Contains all of the hospital data.
    
    Returns:
    --------
    performance_dict - dictionary. Contains various metrics of hospital
                       performance. 
    """
    def calculate_more_times_for_dataframe(group_df):
        """
        Combine the existing time data into more time measures.
        
        Creates:
        + scan to needle time
        + scan to puncture time
        + time left for needle
        + time left for puncture
        
        Inputs:
        -------
        group_df - pandas DataFrame. Stores the original time data
                   that will be combined into new measures.
        
        Returns:
        --------
        group_df - pandas DataFrame. Same as the input dataframe but 
                   with the new time arrays added.                     
        """
        # Scan to treatment
        scan_to_needle_mins = (
            group_df['ArrivaltoThrombolysisMinutes'] -
            group_df['ArrivaltoBrainImagingMinutes']
        )
        scan_to_puncture_mins = (
            group_df['ArrivaltoArterialPunctureMinutes'] -
            group_df['ArrivaltoBrainImagingMinutes']
        )
        # Replace any zero scan to treatment times with 1 minute so that
        # the log taken later is finite (log(0) is -inf), then store
        # the times in the dataframe.
        scan_to_needle_mins[scan_to_needle_mins == 0] = 1
        group_df['scan_to_needle_mins'] = scan_to_needle_mins
        scan_to_puncture_mins[scan_to_puncture_mins == 0] = 1
        group_df['scan_to_puncture_mins'] = scan_to_puncture_mins
            
        
        # Time left after scan for thrombolysis...
        group_df['time_left_for_ivt_after_scan_mins'] = np.maximum((
            allowed_onset_to_needle_time_mins -
            (group_df['OnsettoArrivalMinutes'] + 
              group_df['ArrivaltoBrainImagingMinutes'])
            ), -0.0)
        # ... and thrombectomy:
        group_df['time_left_for_mt_after_scan_mins'] = np.maximum((
            allowed_onset_to_puncture_time_mins -
            (group_df['OnsettoArrivalMinutes'] + 
              group_df['ArrivaltoBrainImagingMinutes'])
            ), -0.0)
        # If the time is negative, set it to -0.0.
        # The minus distinguishes the clipped values from the ones
        # that genuinely have exactly 0 minutes left.
        return group_df
    
    
    def create_masks(
            time_left_after_scan_mins: pd.Series,
            time_to_treatment_mins: pd.Series,
            limit_mins: float=np.inf,
            minutes_left: float=15.0
            ):
        """
        Make masks of whether patients meet various conditions.
        
        The masks match the diagram in the 
        extract_hospital_performance() docstring.
        
        Inputs:
        -------
        time_left_after_scan_mins - array. One value per patient for
                                    time left after the scan for 
                                    treatment.
        time_to_treatment_mins    - array. One value per patient for
                                    time from arrival to treatment.
        limit_mins                - float. The time limit that the 
                                    times in each step are compared 
                                    with in creating the mask.
        minutes_left              - float. How much time there must be
                                    left after scan for treatment to be
                                    considered.
        
        Returns:
        --------
        masks_dict - dictionary. Contains the masks.
        """
        # Create masks.
        # 1: Onset time known
        mask1 = group_df['S1OnsetTimeType'].isin(['P', 'BE'])
        # 2: Mask 1 and onset to arrival on time
        mask2 = ((mask1 == True) & 
                 (group_df['OnsettoArrivalMinutes'] <= limit_mins))
        # 3: Mask 2 and arrival to scan on time
        mask3 = (
            (mask2 == True) & 
            (group_df['ArrivaltoBrainImagingMinutes'] <= limit_mins)
            )
        # 4: Mask 3 and onset to scan on time
        mask4 = (
            (mask3 == True) &
            ((group_df['OnsettoArrivalMinutes'] + 
              group_df['ArrivaltoBrainImagingMinutes']) <= limit_mins)
            )
        # 5: Mask 4 and enough time to treat
        mask5 = ((mask4 == True) &
                 (time_left_after_scan_mins >= minutes_left))
        # 6: Mask 5 and treated
        mask6 = ((mask5 == True) & 
                 (time_to_treatment_mins >= 0))
        
        masks_dict = dict(
            mask1_all_onset_known=mask1,
            mask2_mask1_and_onset_to_arrival_on_time=mask2,
            mask3_mask2_and_arrival_to_scan_on_time=mask3, 
            mask4_mask3_and_onset_to_scan_on_time=mask4, 
            mask5_mask4_and_enough_time_to_treat=mask5, 
            mask6_mask5_and_treated=mask6
            )
        return masks_dict

    
    def calculate_proportions(masks_dict: dict):
        """
        Find proportions of patients who answer True to each mask.
        
        The proportion is out of those who answered True to the 
        previous mask, not out of the whole cohort.
        
        Inputs:
        -------
        masks_dict - a dictionary containing the masks described in
                     the docstring of extract_hospital_performance().
        
        Returns:
        --------
        proportions_dict - a dictionary containing the proportions
                           of patients answering True to each mask.
                           The keys are named similarly to the input
                           mask dictionary.
        """
        # Store results in here:
        proportions_dict = dict()
        # Get the mask names assuming that the masks are stored in
        # the dictionary in order.
        mask_names = list(masks_dict.keys())
        for j, mask_name_now in enumerate(mask_names):
            mask_now = masks_dict[mask_name_now]
            if j > 0:
                # If there's a previous mask, find it from the dict:
                mask_before = masks_dict[mask_names[j-1]]
                mask_name_before = f'mask{j}'
            else:
                # All patients answered True in the previous step.
                mask_before = np.full(len(mask_now), 1)
                mask_name_before = 'all'
            
            # Proportion is Yes to Mask now / Yes to Mask before.
            proportion = (np.sum(mask_now) / np.sum(mask_before)
                          if np.sum(mask_before) > 0 else np.nan)
            # Create a name string for this proportion.
            # Replace the initial "maskX_" with "proportionX_":
            proportion_name = f'proportion{j+1}_of_{mask_name_before}_with'
            # Remove previous mask name and "_and" from the string:
            p = '_'.join(mask_name_now[6:].split(mask_name_before)[-1:])
            p = '_and'.join(p.split('_and')[-1:])
            proportion_name += p
        
            # Store result with a similar name to the original mask.
            proportions_dict[proportion_name] = proportion
        return proportions_dict

    
    def calculate_lognorm_parameters(group_df, input_dicts):
        """ 
        Calculate parameters of lognorm time distributions.
        
        Inputs:
        -------
        group_df    - pandas DataFrame. Stores the original time data
                      that will be lognorm-ed and analysed.
        input_dicts - list of dicts. Stores the instructions for which
                      times to pull out of the dataframe and what to 
                      name the resulting data. Each dict must contain:
                      + label: str. name for the resulting data.
                      + mask: array. Mask of True/False for the time
                        array so that only certain times are used in
                        the calculations.
                      + column: str. The name of the column of times
                        in the dataframe.
        
        Returns:
        --------
        results_dict - dict. Contains a mu (mean) and sigma (standard
                       deviation) for the lognorm distributions of the
                       times selected by each of the input dicts.
        """
        # Place all of the results in here:
        results_dict = dict()
        for d in input_dicts:
            # Pick out the times from the chosen column and use the
            # chosen mask to only select a subset from the column.
            times = np.log(group_df.loc[d['mask'], d['column']])
            # Calculate the lognorm mu and sigma and store them
            # in the results dictionary.
            results_dict['lognorm_mu_' + d['label']] = times.mean()
            results_dict['lognorm_sigma_' + d['label']] = times.std()
        return results_dict
    

    # Set up allowed time and over-run for thrombolysis...
    allowed_onset_to_needle_time_mins = 270  # 4h 30m
    allowed_overrun_for_slow_scan_to_needle_mins = 15
    # ... and for thrombectomy
    allowed_onset_to_puncture_time_mins = 8*60  # TODO: check this is a reasonable value
    allowed_overrun_for_slow_scan_to_puncture_mins = 15
    minutes_left = 15.0
    
    # Limit for comparing conditions (e.g. is onset to arrival within
    # 4hrs?). Separate limits for IVT and MT:
    limit_ivt_mins = 4*60
    limit_mt_mins = 8*60  # TODO: look up a sensible value


    # Record admission numbers
    admissions = group_df.shape[0]

    # Calculate more times from the existing data:
    group_df = calculate_more_times_for_dataframe(group_df)

    # Find the proportion of the whole cohort that receives
    # treatment.
    proportion_all_ivt = (group_df['S2Thrombolysis'] == 'Y').mean()
    proportion_all_mt = (
        group_df['ArrivaltoArterialPunctureMinutes'] >= 0).mean()
    proportion_mt_also_receiving_ivt = (
        np.sum((group_df['S2Thrombolysis'] == 'Y') & 
               (group_df['ArrivaltoArterialPunctureMinutes'] >= 0)) /
        np.sum(group_df['ArrivaltoArterialPunctureMinutes'] >= 0)
        if np.sum(group_df['ArrivaltoArterialPunctureMinutes'] >= 0) > 0 
        else np.nan  # Prevent division by zero if no patients have MT.
        )

    # ----- Thrombolysis -----
    # Masks of patients who answer True to each step:
    masks_dict_ivt = create_masks(
        group_df['time_left_for_ivt_after_scan_mins'],
        group_df['ArrivaltoThrombolysisMinutes'],
        limit_ivt_mins,
        minutes_left
        )
    # Proportion of patients in each mask:
    proportions_dict_ivt = calculate_proportions(masks_dict_ivt)
    # Record the mu and sigma for certain times and subgroups.
    # Set up with these dictionaries:
    dicts_ivt = [
        dict(label = 'onset_arrival_mins',
             mask = masks_dict_ivt[
                'mask2_mask1_and_onset_to_arrival_on_time'],
             column = 'OnsettoArrivalMinutes'
             ),
        dict(label = 'arrival_scan_arrival_mins',
             mask = masks_dict_ivt[
                'mask3_mask2_and_arrival_to_scan_on_time'],
             column = 'ArrivaltoBrainImagingMinutes'
             ),
        dict(label = 'scan_needle_mins',
             mask = masks_dict_ivt['mask6_mask5_and_treated'],
             column = 'scan_to_needle_mins'
             )
        ]
    lognorm_dict_ivt = calculate_lognorm_parameters(group_df, dicts_ivt)


    # ----- Thrombectomy -----
    # Masks of patients who answer True to each step:
    masks_dict_mt = create_masks(
        group_df['time_left_for_mt_after_scan_mins'],
        group_df['ArrivaltoArterialPunctureMinutes'],
        limit_mt_mins,
        minutes_left
        )
    # Proportion of patients in each mask:
    proportions_dict_mt = calculate_proportions(masks_dict_mt)
    # Record the mu and sigma for certain times and subgroups.
    # Set up with these dictionaries:
    dicts_mt = [
        dict(label = 'onset_arrival_mins',
             mask = masks_dict_mt[
                'mask2_mask1_and_onset_to_arrival_on_time'],
             column = 'OnsettoArrivalMinutes'
             ),
        dict(label = 'arrival_scan_arrival_mins',
             mask = masks_dict_mt[
                'mask3_mask2_and_arrival_to_scan_on_time'],
             column = 'ArrivaltoBrainImagingMinutes'
             ),
        dict(label = 'scan_puncture_mins',
             mask = masks_dict_mt['mask6_mask5_and_treated'],
             column = 'scan_to_puncture_mins'
             )
        ]
    lognorm_dict_mt = calculate_lognorm_parameters(group_df, dicts_mt)

    # ----- Combine results -----
    performance_dict = dict()
    performance_dict['stroke_team'] = stroke_team
    performance_dict['stroke_type'] = stroke_type
    performance_dict['admissions'] = admissions
    performance_dict['proportion_of_all_with_ivt'] = proportion_all_ivt
    performance_dict['proportion_of_all_with_mt'] = proportion_all_mt
    performance_dict['proportion_of_mt_with_ivt'] = \
        proportion_mt_also_receiving_ivt

    # Take these dictionaries from earlier...
    dicts_to_combine = [proportions_dict_ivt, lognorm_dict_ivt,
                        proportions_dict_mt, lognorm_dict_mt]
    # ... and merge them into the new results dictionary.
    for i, d in enumerate(dicts_to_combine):
        # Add extra label to prevent repeat keys in the combined dict.
        extra_label = '_ivt' if i < 2 else '_mt'
        keys = list(d.keys())
        for key in keys:
            performance_dict[key + extra_label] = d[key]

    return performance_dict
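
The nested masks and the "proportion of the previous mask" idea described in the docstring can be sketched on a handful of made-up patients. This is a simplified illustration covering masks 1 to 4 only, not the function itself; the column names follow the notebook, while the values, the `toy` frame and the `proportion` helper are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up patients; column names follow the notebook, values do not.
toy = pd.DataFrame({
    'S1OnsetTimeType': ['P', 'BE', 'NK', 'P', 'P'],
    'OnsettoArrivalMinutes': [60, 300, 90, 120, 30],
    'ArrivaltoBrainImagingMinutes': [20, 10, 15, 200, 25],
})
limit_mins = 240

# Each mask keeps only patients who also passed the previous mask.
mask1 = toy['S1OnsetTimeType'].isin(['P', 'BE'])
mask2 = mask1 & (toy['OnsettoArrivalMinutes'] <= limit_mins)
mask3 = mask2 & (toy['ArrivaltoBrainImagingMinutes'] <= limit_mins)
mask4 = mask3 & (
    (toy['OnsettoArrivalMinutes'] +
     toy['ArrivaltoBrainImagingMinutes']) <= limit_mins)


def proportion(mask_now, mask_before):
    """Share of the previous subgroup that is still present."""
    n_before = np.sum(mask_before)
    return np.sum(mask_now) / n_before if n_before > 0 else np.nan


toy_proportions = {
    'onset_known': mask1.mean(),
    'onset_to_arrival_on_time': proportion(mask2, mask1),
    'arrival_to_scan_on_time': proportion(mask3, mask2),
    'onset_to_scan_on_time': proportion(mask4, mask3),
}
print(toy_proportions)
```

In this toy cohort four of five patients have a known onset time, three of those four arrive within the limit, all three are scanned within the limit, and two of the three have onset to scan within the limit.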

In [5]:
# Copy data
data = data_loaded.copy()
# Split data by stroke team
groups = data.groupby('TeamName') # creates a new object of groups of data

# Store each stroke team's results in this list:
list_of_series = []
for stroke_team, group_df in groups: # each group has an index + dataframe of data
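    # Patients with NIHSS of exactly 10, or with no NIHSS recorded,
    # match neither the 'lvo' nor the 'nlvo' condition below.
    stroke_type_mask_dict = {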
        'lvo': (group_df['S2StrokeType']=='I') & (group_df['S2NihssArrival']>=11),
        'nlvo': (group_df['S2StrokeType']=='I') & (group_df['S2NihssArrival']<10),
        'other': (group_df['S2StrokeType']=='PIH')  # excludes no type given
    }
    # Split by stroke type:
    for stroke_type in list(stroke_type_mask_dict.keys()):
        group_df_here = group_df[stroke_type_mask_dict[stroke_type]].copy()
        stroke_team_here = stroke_team #+ ': ' + stroke_type
        stroke_type_here = stroke_type
        # Main results function:
        group_dict = extract_hospital_performance(stroke_team_here, stroke_type_here, group_df_here)
        # Update admissions, average over the full number of years:
        group_dict['admissions'] = group_dict['admissions'] / data_years
        # Convert output dict into a pandas Series:
        group_series = pd.Series(data=group_dict.values(),
                                 index=group_dict.keys())
        list_of_series.append(group_series)
    
# Combine all results into one dataframe:
df_all = pd.concat(list_of_series, axis=1)
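
The loop above collects one Series per team and stroke type and joins them as columns; the result is transposed before saving. A sketch of an equivalent route, assuming the per-group results are plain dictionaries with the same keys, is shown below. The values and the names `toy_results`, `df_via_concat` and `df_direct` are invented for the example.

```python
import pandas as pd

toy_results = [
    {'stroke_team': 'Hospital A', 'stroke_type': 'lvo', 'admissions': 10.0},
    {'stroke_team': 'Hospital A', 'stroke_type': 'nlvo', 'admissions': 40.0},
]

# Route used in the notebook: one Series per group, joined as columns,
# then transposed so that each row is one team and stroke type.
toy_series = [pd.Series(d) for d in toy_results]
df_via_concat = pd.concat(toy_series, axis=1).T

# Alternative: build the row-per-group frame directly from the dicts.
df_direct = pd.DataFrame(toy_results)

print(df_direct)
```

Both frames have the same columns and rows. The direct route lets pandas infer a numeric dtype for `admissions`, which may be convenient if the frame is used for further calculation rather than only saved to CSV.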

In [6]:
# # Limit to hosp with > 100 admissions/year 
# # and >10 thrombolysis in total 
# # and either >10 thrombectomy in total or no thrombectomy.
# admissions = df_all.loc['admissions']
# thrombolysed = admissions * df_all.loc['proportion_of_all_with_ivt']
# thrombectomy = admissions * df_all.loc['proportion_of_all_with_mt']
# mask = list(np.where((
#         (admissions >= (100 / data_years)) &
#         (thrombolysed >= (10.0 / data_years)) &
#         ((thrombectomy >= (10.0 / data_years)) | (thrombectomy == 0.0))
#        ).values == True)[0])

# df = df_all[mask].T

df = df_all.T

# Save
df.to_csv('data/hospital_performance_thrombectomy.csv', index=False)

# Show the first five rows (one per stroke team and stroke type)
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| stroke_team | Addenbrooke's Hospital | Addenbrooke's Hospital | Addenbrooke's Hospital | Basildon University Hospital | Basildon University Hospital |
| stroke_type | lvo | nlvo | other | lvo | nlvo |
| admissions | 109.0 | 424.5 | 54.833333 | 73.0 | 361.5 |
| proportion_of_all_with_ivt | 0.308869 | 0.12446 | 0.0 | 0.328767 | 0.10604 |
| proportion_of_all_with_mt | 0.090214 | 0.012564 | 0.0 | 0.054795 | 0.005071 |
| proportion_of_mt_with_ivt | 0.542373 | 0.4375 | NaN | 0.625 | 0.545455 |
| proportion1_of_all_with_onset_known_ivt | 0.688073 | 0.561445 | 0.613982 | 0.648402 | 0.651452 |
| proportion2_of_mask1_with_onset_to_arrival_on_time_ivt | 0.835556 | 0.611189 | 0.787129 | 0.785211 | 0.511677 |
| proportion3_of_mask2_with_arrival_to_scan_on_time_ivt | 0.986702 | 0.932494 | 0.974843 | 0.986547 | 0.95574 |
| proportion4_of_mask3_with_onset_to_scan_on_time_ivt | 0.940701 | 0.795092 | 0.877419 | 0.977273 | 0.855282 |
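
The saved `lognorm_mu_*` and `lognorm_sigma_*` columns hold the mean and standard deviation of the natural log of the selected times. The sketch below shows how such a pair could be estimated and then used to draw new times. It uses synthetic data and an arbitrary seed; how the pathway model actually samples times is not shown in this notebook, so the sampling step is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic arrival-to-scan times in minutes (not real data).
true_mu, true_sigma = 3.0, 0.5
times_mins = rng.lognormal(mean=true_mu, sigma=true_sigma, size=5000)

# Same estimate as in calculate_lognorm_parameters():
# mean and standard deviation of the log-transformed times.
log_times = np.log(times_mins)
mu = log_times.mean()
sigma = log_times.std(ddof=1)  # pandas .std() also uses ddof=1

# Draw new times from the fitted distribution.
sampled_mins = rng.lognormal(mean=mu, sigma=sigma, size=5000)

print(f'mu={mu:.2f}, sigma={sigma:.2f}, '
      f'median={np.median(sampled_mins):.1f} mins')
```

Because the parameters describe the log of the times, the median of the sampled times is close to `exp(mu)` rather than to `mu` itself.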


In [7]:
df.columns

Index(['stroke_team', 'stroke_type', 'admissions',
       'proportion_of_all_with_ivt', 'proportion_of_all_with_mt',
       'proportion_of_mt_with_ivt', 'proportion1_of_all_with_onset_known_ivt',
       'proportion2_of_mask1_with_onset_to_arrival_on_time_ivt',
       'proportion3_of_mask2_with_arrival_to_scan_on_time_ivt',
       'proportion4_of_mask3_with_onset_to_scan_on_time_ivt',
       'proportion5_of_mask4_with_enough_time_to_treat_ivt',
       'proportion6_of_mask5_with_treated_ivt',
       'lognorm_mu_onset_arrival_mins_ivt',
       'lognorm_sigma_onset_arrival_mins_ivt',
       'lognorm_mu_arrival_scan_arrival_mins_ivt',
       'lognorm_sigma_arrival_scan_arrival_mins_ivt',
       'lognorm_mu_scan_needle_mins_ivt', 'lognorm_sigma_scan_needle_mins_ivt',
       'proportion1_of_all_with_onset_known_mt',
       'proportion2_of_mask1_with_onset_to_arrival_on_time_mt',
       'proportion3_of_mask2_with_arrival_to_scan_on_time_mt',
       'proportion4_of_mask3_with_onset_to_scan_on_t