# Descriptive analysis of SSNAP Extract Version 2

## Plain English Summary

tbc

## Aims

* Restrict to records from 2017 to 2019 (inclusive) and stroke teams with an average of at least 100 stroke admissions and 3 thrombolysis patients per year.

## Observations

tbc

## Set up and import data

In [1]:
# Linting
%load_ext pycodestyle_magic
%pycodestyle_on

In [2]:
# Import packages and functions
import numpy as np
import os
import pandas as pd
from dataclasses import dataclass

In [44]:
# Set the maximum number of columns and rows to 100
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [3]:
# Set paths and filenames
@dataclass(frozen=True)
class Paths:
    '''Read-only container for paths to data and files.'''

    data_path: str = './../output/'
    data_filename: str = 'reformatted_data.csv'
    notebook: str = '01'


paths = Paths()

In [4]:
# Load data
raw_data = pd.read_csv(os.path.join(paths.data_path, paths.data_filename))

## Restrict data

Restrict to:
* Records from 2017, 2018 and 2019.
* Stroke teams with an average of at least 100 stroke admissions and 3 thrombolysis patients per year. Over the three years, this means removing teams with fewer than 300 admissions or fewer than 9 thrombolysis patients.

In [30]:
# Restricting to records from 2017 to 2019
raw_data_restrict = raw_data[raw_data['year'].isin([2017, 2018, 2019])]

# Printing change in number of records due to restricting years
print(f'''Number of records per year:
{raw_data.year.value_counts().sort_index().to_string()}

Total records (all years): {len(raw_data.index)}
Total records (2017-19): {len(raw_data_restrict.index)}
''')

Number of records per year:
2016    56510
2017    58983
2018    58549
2019    60413
2020    59301
2021    66625

Total records (all years): 360381
Total records (2017-19): 177945



In [33]:
# Restrict to stroke teams with at least 300 admissions and at least 9
# thrombolysis patients (2017-19 combined)

# Empty objects to store stroke teams and count those discarded
keep = []
discard = 0

# Group dataframe by stroke team
groups = raw_data_restrict.groupby('stroke_team')

# Loop through name (each stroke team) and group_df (relevant rows from data)
for name, group_df in groups:
    # Skip if admissions less than 300 or thrombolysis patients less than 9
    raw_admissions = len(group_df.index)
    raw_thrombolysis_received = group_df['thrombolysis'] == 1
    if (raw_admissions < 300) or (raw_thrombolysis_received.sum() < 9):
        discard += 1
    else:
        keep.append(group_df)

# Concatenate output
data = pd.concat(keep)

# Number of stroke teams kept vs. removed
print('Number of stroke teams remaining in dataset: {0}'.format(len(keep)))
print('Number of stroke teams removed from dataset: {0}'.format(discard))

Number of stroke teams remaining in dataset: 114
Number of stroke teams removed from dataset: 4
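
The same restriction can also be written without an explicit loop. The sketch below is one possible alternative, not the method used in this notebook: it runs on a small made-up dataframe (`toy`) with lowered thresholds so the result is easy to check by hand. The names `toy`, `min_admissions` and `min_thrombolysis` are hypothetical.

```python
import pandas as pd

# Toy data standing in for raw_data_restrict (values are made up)
toy = pd.DataFrame({
    'stroke_team': ['A'] * 4 + ['B'] * 2 + ['C'] * 4,
    'thrombolysis': [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]})

# Thresholds lowered for the toy example (assumption); the notebook uses
# 300 admissions and 9 thrombolysis patients over the three years
min_admissions = 3
min_thrombolysis = 2

# Per-team totals, broadcast back to one value per record
team = toy['stroke_team']
admissions = toy.groupby('stroke_team')['thrombolysis'].transform('size')
thrombolysed = ((toy['thrombolysis'] == 1).astype(int)
                .groupby(team).transform('sum'))

# Keep records from teams that meet both thresholds
keep_mask = ((admissions >= min_admissions)
             & (thrombolysed >= min_thrombolysis))
toy_kept = toy[keep_mask]

# Number of stroke teams kept vs. removed
n_kept = toy_kept['stroke_team'].nunique()
n_removed = toy['stroke_team'].nunique() - n_kept
print(f'Teams kept: {n_kept}, teams removed: {n_removed}')
```

With the real data, replacing `toy` with `raw_data_restrict` and the thresholds with 300 and 9 should reproduce the counts from the loop, although that has not been run here.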


## Other possible restrictions

Would it be useful to filter the dataset further for the descriptive analysis? For example:
* Only to patients who arrived by ambulance

In [9]:
# Arrival by ambulance
data['arrive_by_ambulance'].value_counts(normalize=True, dropna=False)

1    0.787233
0    0.212767
Name: arrive_by_ambulance, dtype: float64

## Contents of dataset

List all columns, their data types, and proportion complete

In [168]:
# List all columns and show their data types and proportion of completed data
data_type_complete = pd.DataFrame(
    {'Data type': data.dtypes,
     'Proportion complete': data.count() / data.shape[0]})

# Show all columns
data_type_complete

| Column | Data type | Proportion complete |
|---|---|---|
| id | int64 | 1.0 |
| stroke_team | object | 1.0 |
| age | float64 | 1.0 |
| male | int64 | 1.0 |
| infarction | float64 | 0.995958 |
| onset_to_arrival_time | int64 | 1.0 |
| onset_known | int64 | 1.0 |
| precise_onset_known | int64 | 1.0 |
| onset_during_sleep | int64 | 1.0 |
| arrive_by_ambulance | int64 | 1.0 |


## Exploring NaN

In [61]:
# Just list columns with incomplete data
data_type_complete[data_type_complete['Proportion complete'] < 1]

| Column | Data type | Proportion complete |
|---|---|---|
| infarction | float64 | 0.995958 |
| call_to_ambulance_arrival_time | float64 | 0.159651 |
| ambulance_on_scene_time | float64 | 0.159555 |
| ambulance_travel_to_hospital_time | float64 | 0.130912 |
| ambulance_wait_time_at_hospital | float64 | 0.130963 |
| arrival_to_scan_time | float64 | 0.995958 |
| scan_to_thrombolysis_time | float64 | 0.117069 |
| arrival_to_thrombectomy_time | float64 | 0.010083 |
| discharge_disability | float64 | 0.992935 |
| disability_6_month | float64 | 0.298653 |


#### Infarction and arrival to scan time

Infarction and arrival to scan time both have 718 NaN. **Should they be dropped within reformat_data?** They were created as follows...

```
cleaned_data['infarction'] = raw_data['S2StrokeType'].map(infarction)
cleaned_data['arrival_to_scan_time'] = raw_data['ArrivaltoBrainImagingMinutes']
```

The data dictionary says S2StrokeType (i.e. infarction) is NA when a scan was not performed, so these patients could have had an infarction; we just don't know. For the same reason, those same patients are missing "arrival_to_scan_time", as they did not receive a scan.

In [65]:
(data[['infarction', 'arrival_to_scan_time']]
 .isnull().value_counts().reset_index(name='count'))

|   | infarction | arrival_to_scan_time | count |
|---|---|---|---|
| 0 | False | False | 176913 |
| 1 | True | True | 718 |


#### Discharge disability

There are 1255 NaN. These were present in the raw data and have no definition in the data dictionary, so they are presumably just missing. **Should they be dropped within reformat_data?** This was created as follows:

`cleaned_data['discharge_disability'] = raw_data['S7RankinDischarge']`

In [71]:
data['discharge_disability'].isnull().value_counts()

False    176376
True       1255
Name: discharge_disability, dtype: int64

#### Disability at 6 months

The majority of patients are missing this variable. It can be entered if patients attended a 6-month follow-up assessment. **Should we describe this in the descriptive analysis, or is it likely to be biased due to missing data and better not to include in reformat_data?** It was created as follows:

`cleaned_data['disability_6_month'] = raw_data['S8Rankin6Month']`

In [72]:
data['disability_6_month'].isnull().value_counts()

True     124581
False     53050
Name: disability_6_month, dtype: int64
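
One way to gauge whether the 6-month scores are likely to be biased is to compare a variable that is nearly complete, such as `discharge_disability`, between patients with and without a 6-month score. The sketch below shows the idea on made-up values; `toy`, `has_6_month` and `follow_up_summary` are hypothetical names, and a difference between the groups would only suggest, not prove, that follow-up is selective.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `data` (values are made up)
toy = pd.DataFrame({
    'discharge_disability': [0, 1, 2, 3, 4, 5, 6, 6],
    'disability_6_month': [0, 1, 2, np.nan, 4, np.nan, np.nan, np.nan]})

# Flag records that have a 6-month disability score
toy['has_6_month'] = toy['disability_6_month'].notnull()

# Compare discharge disability between those with and without follow-up
follow_up_summary = (toy.groupby('has_6_month')['discharge_disability']
                     .agg(['count', 'mean', 'median']))
print(follow_up_summary)
```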

#### Scan to thrombolysis time

Data are only missing when thrombolysis was not performed (i.e. when thrombolysis = 0, scan_to_thrombolysis_time is always missing, and when thrombolysis = 1 it is always present), so **this is fine and as expected.**

In [97]:
(data['scan_to_thrombolysis_time']
 .isnull().groupby(data['thrombolysis'])
 .value_counts().reset_index(name='count'))

|   | thrombolysis | scan_to_thrombolysis_time | count |
|---|---|---|---|
| 0 | 0 | True | 156836 |
| 1 | 1 | False | 20795 |


#### Arrival to thrombectomy time

Data are only missing when thrombectomy was not performed (i.e. when thrombectomy = 0, arrival_to_thrombectomy_time is always missing, and when thrombectomy = 1 it is always present), so **this is fine and as expected.**

In [99]:
(data['arrival_to_thrombectomy_time']
 .isnull().groupby(data['thrombectomy'])
 .value_counts().reset_index(name='count'))

|   | thrombectomy | arrival_to_thrombectomy_time | count |
|---|---|---|---|
| 0 | 0 | True | 175840 |
| 1 | 1 | False | 1791 |


#### Ambulance timings

*Note: numbers here are different to those from unit testing as this calculation is performed after restriction of the dataset.*

In [167]:
(data[['call_to_ambulance_arrival_time', 'ambulance_on_scene_time',
       'ambulance_travel_to_hospital_time', 'ambulance_wait_time_at_hospital']]
 .isnull()
 .groupby(data['arrive_by_ambulance'])
 .value_counts()
 .reset_index(name='count'))

|   | arrive_by_ambulance | call_to_ambulance_arrival_time | ambulance_on_scene_time | ambulance_travel_to_hospital_time | ambulance_wait_time_at_hospital | count |
|---|---|---|---|---|---|---|
| 0 | 0 | True | True | True | True | 37793 |
| 1 | 0 | False | False | False | False | 1 |
| 2 | 1 | True | True | True | True | 111466 |
| 3 | 1 | False | False | False | False | 23251 |
| 4 | 1 | False | False | True | True | 5085 |
| 5 | 1 | False | True | True | True | 21 |
| 6 | 1 | True | True | True | False | 8 |
| 7 | 1 | True | False | True | True | 3 |
| 8 | 1 | True | False | False | False | 2 |
| 9 | 1 | False | True | True | False | 1 |


One individual is recorded as not arriving by ambulance (according to 'S1ArriveByAmbulance') but has all four ambulance times - **should we:**
1. Drop the patient
2. Modify arrive_by_ambulance to 1, as the record gives ambulance times (which all seem plausible)

In [142]:
# Not displaying row of data when submit to GitHub, but this would show it
# data[(data['call_to_ambulance_arrival_time'].notnull())
#      & (data['arrive_by_ambulance'] == 0)]
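
The two options above could look roughly like the sketch below. It uses made-up values in a small dataframe (`toy`), and `inconsistent`, `option_1` and `option_2` are hypothetical names; which option is appropriate is still the open question.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `data` (values are made up)
toy = pd.DataFrame({
    'arrive_by_ambulance': [0, 0, 1, 1],
    'call_to_ambulance_arrival_time': [np.nan, 12.0, 8.0, np.nan],
    'ambulance_on_scene_time': [np.nan, 20.0, 25.0, np.nan],
    'ambulance_travel_to_hospital_time': [np.nan, 15.0, 10.0, np.nan],
    'ambulance_wait_time_at_hospital': [np.nan, 5.0, 7.0, np.nan]})

ambulance_cols = ['call_to_ambulance_arrival_time',
                  'ambulance_on_scene_time',
                  'ambulance_travel_to_hospital_time',
                  'ambulance_wait_time_at_hospital']

# Records marked as not arriving by ambulance but with all four times
all_times_present = toy[ambulance_cols].notnull().all(axis=1)
inconsistent = (toy['arrive_by_ambulance'] == 0) & all_times_present

# Option 1: drop the inconsistent records
option_1 = toy[~inconsistent]

# Option 2: recode arrive_by_ambulance to 1 for the inconsistent records
option_2 = toy.copy()
option_2.loc[inconsistent, 'arrive_by_ambulance'] = 1

print(f'Inconsistent records: {inconsistent.sum()}')
```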

There are several issues with the time variables:
* Times less than 0 - should we change to NA?
* Times that are NA
* Times that are very large

In [178]:
time_cols = ['onset_to_arrival_time',
             'call_to_ambulance_arrival_time',
             'ambulance_on_scene_time',
             'ambulance_travel_to_hospital_time',
             'ambulance_wait_time_at_hospital',
             'arrival_time_3_hour_period',
             'arrival_to_scan_time',
             'scan_to_thrombolysis_time',
             'arrival_to_thrombectomy_time']

time_counts = pd.DataFrame({'Value': [
    'Time < 0', 'Time == 0', 'Time > 0', 'Time NA',
    'Min Time', 'Max Time']})
for col in time_cols:
    time_counts[col] = [(data[col] < 0).sum(),
                        (data[col] == 0).sum(),
                        (data[col] > 0).sum(),
                        data[col].isnull().sum(),
                        data[col].min(),
                        data[col].max()]
time_counts.set_index('Value').T

| Value | Time < 0 | Time == 0 | Time > 0 | Time NA | Min Time | Max Time |
|---|---|---|---|---|---|---|
| onset_to_arrival_time | 1.0 | 22.0 | 177608.0 | 0.0 | -717.0 | 62064145.0 |
| call_to_ambulance_arrival_time | 61.0 | 7.0 | 28291.0 | 149272.0 | -1051185.0 | 5871.0 |
| ambulance_on_scene_time | 26.0 | 6.0 | 28310.0 | 149289.0 | -63103546.0 | 1051214.0 |
| ambulance_travel_to_hospital_time | 51.0 | 46.0 | 23157.0 | 154377.0 | -63106438.0 | 62930084.0 |
| ambulance_wait_time_at_hospital | 1338.0 | 2798.0 | 19127.0 | 154368.0 | -89280.0 | 63106456.0 |
| arrival_time_3_hour_period | 0.0 | 7634.0 | 169997.0 | 0.0 | 0.0 | 24.0 |
| arrival_to_scan_time | 0.0 | 0.0 | 176913.0 | 718.0 | 1.0 | 526050.0 |
| scan_to_thrombolysis_time | 0.0 | 161.0 | 20634.0 | 156836.0 | 0.0 | 656.0 |
| arrival_to_thrombectomy_time | 0.0 | 0.0 | 1791.0 | 175840.0 | 2.0 | 1780.0 |
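
If the decision is to treat negative or implausibly large times as NA, one possible approach is sketched below on made-up values. The upper limit `max_minutes` is purely an assumption for illustration (it does not come from SSNAP or the data dictionary), and `toy`, `times`, `cleaned` and `n_replaced` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `data` (values are made up)
toy = pd.DataFrame({
    'onset_to_arrival_time': [-717, 0, 95, 62064145],
    'ambulance_wait_time_at_hospital': [-89280.0, 0.0, 12.0, np.nan]})
time_cols = ['onset_to_arrival_time', 'ambulance_wait_time_at_hospital']

# Hypothetical upper limit of 7 days in minutes (an assumption for
# illustration, not a threshold taken from the data dictionary)
max_minutes = 7 * 24 * 60

# Cast to float so that integer columns can hold NaN
times = toy[time_cols].astype(float)

# True where a time is negative or above the upper limit; existing NaN
# compare as False, so they are not counted here
out_of_range = (times < 0) | (times > max_minutes)

# Replace out-of-range values with NaN and count how many were replaced
cleaned = times.mask(out_of_range)
n_replaced = out_of_range.sum()
print(n_replaced.to_string())
```

Counting the replaced values per column keeps a record of how much data each rule removes, which may help when deciding on a threshold.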


In [242]:
# Convert onset to arrival time from minutes to years (525600 minutes per
# 365-day year) and show the top end of the distribution (over 2 years)
time_years = data['onset_to_arrival_time'] / 525600
time_years[time_years > 2].sort_values()

33232       2.002871
49026       3.003210
116980      3.003683
110037      3.004075
38583       3.007730
46621       3.016048
84970       7.010459
79135       8.005757
137459     10.008858
112877     15.011221
71144      15.012003
189213     16.011475
87581      16.016324
118245     17.011421
253581     17.017783
121658     18.011570
142827    100.125778
61653     117.137833
123960    118.082468
Name: onset_to_arrival_time, dtype: float64
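
Most of the values above sit just above a whole number of years, which might point to a wrong year in one of the recorded dates; this is a guess and has not been checked against the raw dates. The sketch below shows one way to flag such values, using four of the values printed above. The 7-day tolerance and the names `days_from_whole_year` and `near_whole_year` are assumptions for illustration.

```python
import pandas as pd

# A few of the values printed above, in years
time_years = pd.Series([2.002871, 3.003210, 7.010459, 100.125778])

# Difference from the nearest whole year, expressed in days
days_from_whole_year = (time_years - time_years.round()) * 365

# Hypothetical tolerance: within a week of a whole number of years
near_whole_year = days_from_whole_year.abs() <= 7

print(pd.DataFrame({'years': time_years,
                    'days_from_whole_year': days_from_whole_year.round(1),
                    'near_whole_year': near_whole_year}))
```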

## Exploratory (to tidy)

In [179]:
# Admissions per stroke team

In [7]:
# Admissions
print('Total admissions: {0}'.format(len(data.index)))
print('Average yearly admissions: {0}'.format(round(len(data.index)/3)))
admissions = data.groupby('stroke_team').size()
admissions.describe()

Total admissions: 177631
Average yearly admissions: 59210


count     114.000000
mean     1558.166667
std       593.363294
min       452.000000
25%      1137.500000
50%      1456.500000
75%      1919.750000
max      3568.000000
dtype: float64

In [32]:
# Stroke types
data['infarction'].map({1: 'Infarction',
                        0: 'Primary Intracerebral Haemorrhage'}) \
                  .value_counts(normalize=True, dropna=False)

Infarction                           0.874662
Primary Intracerebral Haemorrhage    0.121296
NaN                                  0.004042
Name: infarction, dtype: float64

In [10]:
# Thrombolysis use rates for in-hospital and out-of-hospital onset
# Can't do as S1OnsetInHospital not in cleaned dataset
# Also therefore can't restrict to out-of-hospital only

In [11]:
# Analyse by team - group by team, record:
# Team, admission numbers, thrombolysis rate, Rankin before stroke, NIHSS on
# arrival, proportion with known onset time (remove rest), proportion with
# onset <4 (remove rest), Rankin again, proportion 80+, onset to arrival,
# scan within 4 hours, arrival to scan, thrombolysis given, scan to needle,
# arrival to needle, onset to needle, proportion thrombolysis after 180 or 270


In [12]:
# Based on analysis by team, summarise for whole population (average of each
# hospital)
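
A minimal sketch of how the planned team-level summary, and the whole-population summary taken as the average across teams, might be built with named aggregation. It uses made-up values and only a handful of the planned measures; `toy`, `age_80_plus`, `team_summary` and `population_summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `data` (values are made up)
toy = pd.DataFrame({
    'stroke_team': ['A', 'A', 'A', 'B', 'B'],
    'thrombolysis': [1, 0, 0, 1, 1],
    'age': [82.5, 67.5, 77.5, 87.5, 52.5],
    'onset_known': [1, 1, 0, 1, 1],
    'arrival_to_scan_time': [30.0, 60.0, 90.0, 20.0, 40.0],
    'scan_to_thrombolysis_time': [25.0, np.nan, np.nan, 30.0, 50.0]})

# Indicator for patients aged 80 or over
toy['age_80_plus'] = (toy['age'] >= 80).astype(int)

# One row per stroke team; means of 0/1 columns give proportions, and
# means of time columns ignore NaN
team_summary = toy.groupby('stroke_team').agg(
    admissions=('thrombolysis', 'size'),
    thrombolysis_rate=('thrombolysis', 'mean'),
    prop_onset_known=('onset_known', 'mean'),
    prop_80_plus=('age_80_plus', 'mean'),
    mean_arrival_to_scan=('arrival_to_scan_time', 'mean'),
    mean_scan_to_needle=('scan_to_thrombolysis_time', 'mean'))

# Whole-population summary as the average of the team-level values
population_summary = team_summary.mean()
print(team_summary)
```

Averaging team-level values weights every team equally, so the result can differ from a summary calculated over all patients pooled together.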


In [13]:
# Those average summary results for under 80 vs. over 80

In [14]:
# Figure with thrombolysis use (all and < 4 hours onset)
# Figure proportion with known onset
# Mean arrival to scan time for patients
# Mean scan to needle time
# Mean arrival to needle time

In [15]:
# Stroke severity distribution

In [16]:
# Onset to arrival, proportion known onset, severity

In [17]:
# Restrict to patients who received thrombolysis
thrombolysed = data[data['thrombolysis'] == 1].copy()

# Proportion where onset is known
# throm_arrival['onset known'] = (
#     thrombolysed['onset_known'].value_counts(normalize=True)[1])

# Arrival within 4 or 6 hours
thrombolysed['arrive_within_4'] = np.where(
    thrombolysed['onset_to_arrival_time'] <= 240, 1, 0)
thrombolysed['arrive_within_6'] = np.where(
    thrombolysed['onset_to_arrival_time'] <= 360, 1, 0)

# NIHSS 6+ or 11+
thrombolysed['nihss_6_plus'] = np.where(
    thrombolysed['stroke_severity'] >= 6, 1, 0)
thrombolysed['nihss_11_plus'] = np.where(
    thrombolysed['stroke_severity'] >= 11, 1, 0)

# Find results overall, by arrival time group, and by NIHSS group
# thrombolysed.groupby('arrive_within_4').mean()
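
The overall and grouped results mentioned in the last comment of the cell above could be produced along these lines. This is a sketch on made-up values rather than the final analysis; `toy`, `summary_cols`, `overall`, `by_arrival` and `by_nihss` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `data` (values are made up)
toy = pd.DataFrame({
    'thrombolysis': [1, 1, 1, 1, 0],
    'onset_to_arrival_time': [60, 200, 250, 400, 90],
    'stroke_severity': [4, 12, 7, 15, 3],
    'scan_to_thrombolysis_time': [20.0, 35.0, 40.0, 55.0, np.nan]})

# Restrict to patients who received thrombolysis and add group indicators
thrombolysed = toy[toy['thrombolysis'] == 1].copy()
thrombolysed['arrive_within_4'] = np.where(
    thrombolysed['onset_to_arrival_time'] <= 240, 1, 0)
thrombolysed['nihss_11_plus'] = np.where(
    thrombolysed['stroke_severity'] >= 11, 1, 0)

# Columns to summarise (chosen for illustration)
summary_cols = ['stroke_severity', 'scan_to_thrombolysis_time']

# Results overall, by arrival time group, and by NIHSS group
overall = thrombolysed[summary_cols].mean()
by_arrival = thrombolysed.groupby('arrive_within_4')[summary_cols].mean()
by_nihss = thrombolysed.groupby('nihss_11_plus')[summary_cols].mean()
print(by_arrival)
```

Selecting the numeric columns before calling `.mean()` avoids problems with text columns such as `stroke_team`, which recent pandas versions refuse to average.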
