# 2b: Calculate the data needed for scenario trials

## Plain English summary
This notebook calculates the base pathway statistics for each stroke team. The patients who attended each stroke team are first split into groups by stroke type, and a series of tests is then applied to each group. The proportion of patients passing each test is recorded. For certain subgroups of patients passing particular tests, the distribution of times taken at that point in the hospital pathway is also measured.

The tests are:
1. Is onset time known?
2. Is onset to arrival within the time limit?
3. Is arrival to scan within the time limit?
4. Is onset to scan within the time limit?
5. Is there enough time left for thrombolysis?
6. Did the patient receive thrombolysis?

And the time distributions measured are:
1. Onset to arrival time for those passing Test 2.
2. Arrival to scan time for those passing Test 3.
3. Scan to treatment time for those passing Test 6.

The values measured are saved for future use.
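
To make the cascade of tests concrete, here is a minimal sketch of tests 1 to 4 on toy data. It is not the code used by `scenario.extract_hospital_performance`. The patients, the 240-minute limit, and the way the masks are chained are assumptions for illustration. Each proportion is taken relative to the patients who passed the previous test, in line with output column names such as `proportion2_of_mask1_...`.

```python
import numpy as np
import pandas as pd

# Toy data for five hypothetical patients (all times in minutes).
toy = pd.DataFrame({
    'onset_known':           [1, 1, 1, 0, 1],
    'onset_to_arrival_time': [60.0, 300.0, 90.0, np.nan, 120.0],
    'arrival_to_scan_time':  [20.0, 30.0, 400.0, 25.0, 45.0],
})
limit_mins = 240  # assumed time limit for tests 2 to 4

# Each mask also requires that all earlier tests were passed.
mask1 = toy['onset_known'] == 1
mask2 = mask1 & (toy['onset_to_arrival_time'] <= limit_mins)
mask3 = mask2 & (toy['arrival_to_scan_time'] <= limit_mins)
onset_to_scan = toy['onset_to_arrival_time'] + toy['arrival_to_scan_time']
mask4 = mask3 & (onset_to_scan <= limit_mins)

# Proportion passing each test, out of those who passed the previous one.
proportion1 = mask1.sum() / len(toy)
proportion2 = mask2.sum() / mask1.sum()
proportion3 = mask3.sum() / mask2.sum()
proportion4 = mask4.sum() / mask3.sum()

# Time distribution 1: onset to arrival time for those passing test 2.
onset_to_arrival_passed = toy.loc[mask2, 'onset_to_arrival_time']
print(proportion1, proportion2, proportion3, proportion4)
print(onset_to_arrival_passed.mean())
```

Comparisons with a missing time evaluate to `False`, so in this sketch a patient with an unknown onset-to-arrival time cannot pass test 2.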

## Notebook setup:

In [1]:
import pandas as pd
import numpy as np
import pickle
from dataclasses import dataclass
import yaml

import matplotlib.pyplot as plt

import stroke_utilities.scenario as scenario

## Set up paths and filenames

In [2]:
@dataclass(frozen=True)
class Paths:
    '''Immutable container for the paths to input data and output files.'''

    data_read_path: str = './stroke_utilities/data/'
    # The type annotation is needed for this to be a dataclass field:
    output_folder: str = './stroke_utilities/output/'

paths = Paths()

## Import patient data

All patient data:

In [3]:
filename = paths.data_read_path + 'clean_samuel_ssnap_extract_v2.csv'
data_loaded = pd.read_csv(filename)

In [4]:
data_loaded.columns

Index(['id', 'stroke_team', 'age', 'male', 'infarction',
       'onset_to_arrival_time', 'onset_known', 'precise_onset_known',
       'onset_during_sleep', 'arrive_by_ambulance',
       'call_to_ambulance_arrival_time', 'ambulance_on_scene_time',
       'ambulance_travel_to_hospital_time', 'ambulance_wait_time_at_hospital',
       'month', 'year', 'weekday', 'arrival_time_3_hour_period',
       'arrival_to_scan_time', 'thrombolysis', 'scan_to_thrombolysis_time',
       'thrombectomy', 'arrival_to_thrombectomy_time',
       'congestive_heart_failure', 'hypertension', 'atrial_fibrillation',
       'diabetes', 'prior_stroke_tia', 'afib_antiplatelet',
       'afib_anticoagulant', 'afib_vit_k_anticoagulant',
       'afib_doac_anticoagulant', 'afib_heparin_anticoagulant',
       'new_afib_diagnosis', 'prior_disability', 'stroke_severity',
       'nihss_complete', 'nihss_arrival_loc', 'nihss_arrival_loc_questions',
       'nihss_arrival_loc_commands', 'nihss_arrival_best_gaze',
       'nihss_

In [5]:
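# Drop patients with no stroke team ID. Taking a copy guards against
# pandas' SettingWithCopyWarning in the assignment below.
data_loaded = data_loaded[data_loaded['stroke_team_id'].notna()].copy()

# Store the team IDs as integers:
data_loaded['stroke_team_id'] = data_loaded['stroke_team_id'].astype(int)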

## Process the data


In [6]:
with open('./stroke_utilities/fixed_params.yml') as f:
    fixed_params = yaml.safe_load(f)

In [7]:
fixed_params

{'allowed_onset_to_needle_time_mins': 270,
 'allowed_overrun_for_slow_scan_to_needle_mins': 15,
 'allowed_onset_to_puncture_time_mins': 480,
 'allowed_overrun_for_slow_scan_to_puncture_mins': 15,
 'minutes_left': 15.0,
 'limit_ivt_mins': 240,
 'limit_mt_mins': 360}

In [8]:
# Set up allowed time and over-run for thrombolysis...
allowed_onset_to_needle_time_mins = fixed_params['allowed_onset_to_needle_time_mins']
allowed_overrun_for_slow_scan_to_needle_mins = fixed_params['allowed_overrun_for_slow_scan_to_needle_mins']
# ... and for thrombectomy
allowed_onset_to_puncture_time_mins = fixed_params['allowed_onset_to_puncture_time_mins']
allowed_overrun_for_slow_scan_to_puncture_mins = fixed_params['allowed_overrun_for_slow_scan_to_puncture_mins']
minutes_left = fixed_params['minutes_left']
# Limit for comparing conditions (e.g. is onset to arrival within
# 4hrs?). Separate limits for IVT and MT:
limit_ivt_mins = fixed_params['limit_ivt_mins']
limit_mt_mins = fixed_params['limit_mt_mins']

Combine existing time data to create some new measures:

In [9]:
# Arrival to thrombolysis = arrival to scan + scan to thrombolysis:
data_loaded['arrival_to_thrombolysis_time'] = (
    data_loaded['arrival_to_scan_time'] +
    data_loaded['scan_to_thrombolysis_time']
    )

# Scan to thrombectomy = arrival to thrombectomy - arrival to scan:
data_loaded['scan_to_thrombectomy_time'] = (
    data_loaded['arrival_to_thrombectomy_time'] -
    data_loaded['arrival_to_scan_time']
    )

# Time left after scan for thrombolysis, with a minimum of zero for
# patients scanned after the allowed time has already passed...
data_loaded['time_left_for_ivt_after_scan_mins'] = np.maximum((
    allowed_onset_to_needle_time_mins -
    (data_loaded['onset_to_arrival_time'] + 
      data_loaded['arrival_to_scan_time'])
    ), 0.0)
# ... and thrombectomy:
data_loaded['time_left_for_mt_after_scan_mins'] = np.maximum((
    allowed_onset_to_puncture_time_mins -
    (data_loaded['onset_to_arrival_time'] + 
      data_loaded['arrival_to_scan_time'])
    ), 0.0)
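
As a worked example of the time-left calculation, the sketch below uses three hypothetical patients and the 270-minute onset-to-needle allowance from `fixed_params`. The patient times are invented for illustration only.

```python
import numpy as np
import pandas as pd

allowed_onset_to_needle_time_mins = 270  # value shown in fixed_params

# Three hypothetical patients (times in minutes).
toy = pd.DataFrame({
    'onset_to_arrival_time': [60.0, 200.0, 250.0],
    'arrival_to_scan_time':  [30.0, 70.0, 45.0],
})

# Onset to scan is 90, 270 and 295 minutes respectively.
onset_to_scan = toy['onset_to_arrival_time'] + toy['arrival_to_scan_time']

# Time remaining after the scan, with negative values set to zero.
time_left = np.maximum(allowed_onset_to_needle_time_mins - onset_to_scan, 0.0)
print(time_left.tolist())
```

The first patient has 180 minutes left, the second is scanned exactly at the limit, and the third is scanned 25 minutes after it, so both of the latter have zero minutes left.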

Measure how many years this data spans. This is used later to calculate the number of admissions per year.

In [10]:
data_years = (data_loaded['year'].max() - data_loaded['year'].min()) + 1

data_years

6
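
The year count is inclusive of both the first and last year, hence the `+ 1`. The following sketch shows the same arithmetic applied to admissions per year, using made-up team IDs and years. It treats every year in the span as a full year of data, which is an assumption about the extract.

```python
import pandas as pd

# Hypothetical admissions: one row per patient.
toy = pd.DataFrame({
    'stroke_team_id': [1, 1, 1, 2, 2, 2, 2, 2, 2],
    'year': [2016, 2018, 2021, 2016, 2017, 2018, 2019, 2020, 2021],
})

# 2016 to 2021 inclusive is six years, not five.
data_years = (toy['year'].max() - toy['year'].min()) + 1

# Total admissions per team, averaged over the number of years.
admissions_per_year = toy.groupby('stroke_team_id').size() / data_years
print(data_years)
print(admissions_per_year.to_dict())
```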

## Calculate proportions

The following cell loops over all stroke teams. For each team it finds which patients have each stroke type, passes each stroke-type subset through the masks (the tests listed in the summary), and stores the resulting performance parameters in a pandas Series.

In [11]:
# Copy data
data = data_loaded.copy()
# Split data by stroke team
groups = data.groupby('stroke_team_id') # creates a new object of groups of data

# Store each stroke team's results in this list:
list_of_series = []
for stroke_team, group_df in groups: # each group has an index + dataframe of data
    stroke_type_mask_dict = {
        'lvo': ((group_df['infarction']==1) & 
                (group_df['stroke_severity']>=11)),
        'nlvo': ((group_df['infarction']==1) & 
                 (group_df['stroke_severity']<11)),
        'other': (group_df['infarction']!=1),  # excludes no type given
        'mixed': [True] * len(group_df)  # all patients in this team
    }
    
    # Split by stroke type:
    for stroke_type, stroke_type_mask in stroke_type_mask_dict.items():
        group_df_here = group_df[stroke_type_mask].copy()
        stroke_team_here = stroke_team
        stroke_type_here = stroke_type
        
        # Main results function:
        group_df_here, group_dict, masks_dict_ivt, masks_dict_mt = (
            scenario.extract_hospital_performance(
                stroke_team_here,
                stroke_type_here,
                group_df_here,
                limit_ivt_mins,
                limit_mt_mins,
                minutes_left
            ))
        # Update admissions, average over the full number of years:
        group_dict['admissions'] = group_dict['admissions'] / data_years
        # Convert output dict into a pandas Series:
        group_series = pd.Series(data=group_dict.values(),
                                 index=group_dict.keys())
        list_of_series.append(group_series)

# Combine all results into one dataframe:
df_all = pd.concat(list_of_series, axis=1)
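
The sketch below illustrates the shape of the combined result, using two small hypothetical result dictionaries in place of the output of `scenario.extract_hospital_performance`. Concatenating the Series with `axis=1` gives one column per team and stroke type, with the measured quantities as the row index. Transposing gives one row per team and stroke type.

```python
import pandas as pd

# Hypothetical stand-ins for the group_dict of two groups.
group_dicts = [
    {'stroke_team_id': 1, 'stroke_type': 'lvo', 'admissions': 10.0},
    {'stroke_team_id': 1, 'stroke_type': 'nlvo', 'admissions': 40.0},
]
list_of_series = [pd.Series(d) for d in group_dicts]

# One column per group; rows are the measured quantities.
df_wide = pd.concat(list_of_series, axis=1)

# One row per group; columns are the measured quantities.
df_long = df_wide.T

# Mixing strings and numbers leaves every column with object dtype,
# so numeric columns need an explicit cast before rounding.
admissions = df_long['admissions'].astype(float).round(6)
print(df_wide.shape, df_long.shape)
print(admissions.tolist())
```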

## Results

In [12]:
df = df_all.T

# Round the values to fewer decimal places:
for column in df.columns:
    if column not in ['stroke_team', 'stroke_type', 'stroke_team_id']:
        df[column] = df[column].astype(float).round(6)

# Save
# output_folder already ends with a slash.
df.to_csv(f'{paths.output_folder}hospital_performance.csv', index=False)

In [13]:
# Show data for first hospital
df_all.T.head(4).T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| stroke_team_id | 1 | 1 | 1 | 1 |
| stroke_type | lvo | nlvo | other | mixed |
| admissions | 154.833333 | 676.666667 | 101.166667 | 932.666667 |
| proportion_of_all_with_ivt | 0.341227 | 0.09064 | 0.0 | 0.122409 |
| proportion_of_all_with_mt | 0.176534 | 0.014778 | 0.0 | 0.040029 |
| proportion_of_mt_with_ivt | 0.77439 | 0.466667 | NaN | 0.691964 |
| proportion1_of_all_with_onset_known_ivt | 0.481163 | 0.337931 | 0.37397 | 0.365618 |
| proportion2_of_mask1_with_onset_to_arrival_on_time_ivt | 0.897092 | 0.653061 | 0.762115 | 0.718475 |
| proportion3_of_mask2_with_arrival_to_scan_on_time_ivt | 1.0 | 0.944196 | 0.942197 | 0.959184 |
| proportion4_of_mask3_with_onset_to_scan_on_time_ivt | 0.972569 | 0.885343 | 0.91411 | 0.913475 |
