# Clean raw data

## Summary

This notebook cleans the SAMueL-2 SAMueL SSNAP extract v2 (not included in the GitHub repository). The data contain only patients with **stroke onset out of hospital** (all in-hospital strokes were removed before the data were received). For an explanation of the decisions made during data cleaning (for example, why certain variables were changed, or how it was decided to drop unusual results), please see "02_reasoning_behind_clean_raw_data.ipynb".

## Aims

(1) Extract and clean relevant variables from the raw SSNAP data - for example:
* Converting from Y/N/NB to 0/1.
* Converting categories to numbers (e.g. age band 40-44 to 42.5).
* Creating indicators (e.g. whether stroke onset time was known, by grouping "precise" with "best estimate", and comparing that to "not known").
* Calculating times (e.g. comparing time arrived at location to time call connected, to determine call to ambulance arrival time).

(2) Deal with unusual results (in ways that apply to any project using this dataset) - for example:
* Setting values to missing when they are considered to be invalid/implausible/errors.
* Removing patients from the dataset.

Individual projects (i.e. different models or analyses) may impose additional restrictions on the dataset (e.g. dropping patients missing vital data, restricting date range).

## Set up

In [1]:
# Import packages
from dataclasses import dataclass
import numpy as np
import os
import pandas as pd

# Linting
%load_ext pycodestyle_magic
%pycodestyle_on

# Set the maximum number of displayed columns and rows to 100
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [2]:
# Set paths and filenames
@dataclass(frozen=True)
class Paths:
    '''Singleton object for storing paths to data and database.'''

    data_path: str = './data'
    data_filename: str = 'SAMueL ssnap extract v2.csv'
    data_save_path: str = './output'
    data_save_filename: str = 'clean_samuel_ssnap_extract_v2.csv'
    database_filename: str = 'samuel.db'


paths = Paths()

In [3]:
# Load data
raw_data: pd.DataFrame = pd.read_csv(
    os.path.join(paths.data_path, paths.data_filename), low_memory=False)

# Set up DataFrame for cleaned data
cleaned_data: pd.DataFrame = pd.DataFrame()

## Create clean version of data

### ID

In [4]:
cleaned_data['id'] = raw_data['PatientId']

### Stroke team

In [5]:
cleaned_data['stroke_team'] = raw_data['TeamName']

### Age and gender

In [6]:
# Dictionary defining numeric age as middle of age band
ages: dict = {'AgeUnder40': 37.5,
              'Age40to44': 42.5, 'Age45to49': 47.5,
              'Age50to54': 52.5, 'Age55to59': 57.5,
              'Age60to64': 62.5, 'Age65to69': 67.5,
              'Age70to74': 72.5, 'Age75to79': 77.5,
              'Age80to84': 82.5, 'Age85to89': 87.5,
              'AgeOver90': 92.5}

# Extract age band columns, and find the age band that is flagged
# for each patient (the column holding the maximum value in that row)
col_extract: pd.DataFrame = raw_data[list(ages.keys())]
age_band: pd.Series = col_extract.idxmax(axis=1)

# Use that ageband to find appropriate numeric age from ages dictionary
cleaned_data['age'] = age_band.map(ages)
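
As a small illustration of the lookup above, the sketch below applies the same `idxmax` and `map` steps to a made-up table with three of the age-band columns. The toy values and the `toy_` names are hypothetical, and it assumes each band column holds a 0/1 indicator. One behaviour worth knowing: `idxmax` returns the first column when values tie, so a row with no band flagged would be assigned the first band.

```python
import pandas as pd

# Hypothetical toy data: one 0/1 indicator column per age band
toy_bands = pd.DataFrame({
    'AgeUnder40': [0, 0, 1, 0],
    'Age40to44': [1, 0, 0, 0],
    'AgeOver90': [0, 1, 0, 0]})
toy_ages = {'AgeUnder40': 37.5, 'Age40to44': 42.5, 'AgeOver90': 92.5}

# Column label holding the row maximum (first column if there is a tie)
toy_band = toy_bands[list(toy_ages)].idxmax(axis=1)

# Midpoint of that band; the final all-zero row falls back to the first band
toy_age = toy_band.map(toy_ages)
print(toy_age.tolist())
```

If rows with no band flagged can occur in the real extract, they may need checking separately (for example with `(toy_bands.sum(axis=1) == 0)`).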

In [7]:
gender: dict = {'M': 1, 'F': 0}
cleaned_data['male'] = raw_data['S1Gender'].map(gender)

### Stroke type

Abbreviations:
* Infarction (I)
* Primary intracerebral haemorrhage (PIH)
* Unknown if not imaged (NaN)

In [8]:
infarction: dict = {'I': 1, 'PIH': 0}
cleaned_data['infarction'] = raw_data['S2StrokeType'].map(infarction)

### Onset time

Abbreviations:
* Precise (P)
* Best estimate (BE)
* Not known (NK)
* During sleep (DS)

In [9]:
# Onset to arrival time in minutes
cleaned_data['onset_to_arrival_time'] = raw_data['OnsettoArrivalMinutes']

# Whether onset time is known
onset_known: dict = {'NK': 0, 'P': 1, 'BE': 1}
cleaned_data['onset_known'] = raw_data['S1OnsetTimeType'].map(onset_known)

# Whether onset time is precise - if not, then best estimate or not known
precise_onset_known: dict = {'P': 1, 'BE': 0, 'NK': 0}
cleaned_data['precise_onset_known'] = (
    raw_data['S1OnsetTimeType'].map(precise_onset_known))

# Stroke during sleep
sleep: dict = {'DS': 1, 'P': 0, 'BE': 0}
cleaned_data['onset_during_sleep'] = raw_data['S1OnsetDateType'].map(sleep)
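
The mappings above rely on `Series.map` returning NaN for any value that is not a key in the dictionary. The sketch below shows this on a hypothetical toy series (the code 'XX' is invented purely to stand in for an unexpected value).

```python
import pandas as pd

# Hypothetical onset time types, including a missing and an unexpected value
toy_time_type = pd.Series(['P', 'BE', 'NK', None, 'XX'])

# Values without a dictionary entry (None and 'XX') become NaN
toy_onset_known = toy_time_type.map({'NK': 0, 'P': 1, 'BE': 1})
print(toy_onset_known.tolist())
```

This is why codes that are left out of a dictionary end up as missing in the cleaned column rather than raising an error.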

### Ambulance timings

In [10]:
# Arrive by ambulance
by_ambulance: dict = {'Y': 1, 'N': 0}
cleaned_data['arrive_by_ambulance'] = (
    raw_data['S1ArriveByAmbulance'].map(by_ambulance))

# Time from call connected to ambulance arrival at patient location
cleaned_data['call_to_ambulance_arrival_time'] = (
    raw_data['ArrivalPatientLocationtoArrivalMinutes'] -
    raw_data['CallConnectedtoArrivalMinutes'])

# Time that ambulance was on scene at patient location
cleaned_data['ambulance_on_scene_time'] = (
    raw_data['DeparturePatientLocationtoArrivalMinutes'] -
    raw_data['ArrivalPatientLocationtoArrivalMinutes'])

# Ambulance travel time from patient location to hospital
cleaned_data['ambulance_travel_to_hospital_time'] = (
    raw_data['WheelsStoptoArrivalMinutes'] -
    raw_data['DeparturePatientLocationtoArrivalMinutes'])

# Ambulance wait time at hospital
cleaned_data['ambulance_wait_time_at_hospital'] = (
    0 - raw_data['WheelsStoptoArrivalMinutes'])
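
A worked example of the interval arithmetic above, for one hypothetical patient. It assumes the raw `...toArrivalMinutes` columns store each event as minutes relative to hospital arrival, with events before arrival recorded as negative numbers. That sign convention is inferred from the `0 - ...` calculation and is not confirmed here.

```python
# Hypothetical event times in minutes relative to hospital arrival (0)
call_connected = -80
arrival_at_patient = -60
departure_from_patient = -35
wheels_stop = -5

# Each interval is the later event minus the earlier event
call_to_ambulance_arrival = arrival_at_patient - call_connected
on_scene = departure_from_patient - arrival_at_patient
travel_to_hospital = wheels_stop - departure_from_patient
wait_at_hospital = 0 - wheels_stop

intervals = [call_to_ambulance_arrival, on_scene,
             travel_to_hospital, wait_at_hospital]
print(intervals, sum(intervals))
```

Under that assumption the four intervals add up to the total time from call connected to hospital arrival.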

### Day, month, year and time of arrival

In [11]:
# Month, year and weekday
cleaned_data['month'] = pd.to_datetime(
    raw_data['FirstArrivalMonthYear'], format='%b-%y').dt.month
cleaned_data['year'] = pd.to_datetime(
    raw_data['FirstArrivalMonthYear'], format='%b-%y').dt.year
cleaned_data['weekday'] = raw_data['FirstArrivalWeekday']
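
A minimal sketch of the month-year parsing, assuming the raw strings look like 'Jan-17' (abbreviated English month name and two-digit year, as implied by `format='%b-%y'`). The example values are hypothetical.

```python
import pandas as pd

# Hypothetical month-year strings in the assumed '%b-%y' format
toy_month_year = pd.Series(['Jan-17', 'Dec-19'])

# Parse once, then take the month and year from the same result
toy_parsed = pd.to_datetime(toy_month_year, format='%b-%y')
toy_month = toy_parsed.dt.month.tolist()
toy_year = toy_parsed.dt.year.tolist()
print(toy_month, toy_year)
```

Note that `%b` depends on the locale, so this sketch assumes English month abbreviations.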

In [12]:
# Get arrival period (3 hour period during day)
arrival_time_dict: dict = {
    '0000to3000': 0,
    '0300to0600': 3,
    '0600to0900': 6,
    '0900to1200': 9,
    '1200to1500': 12,
    '1500to1800': 15,
    '1800to2100': 18,
    '2100to2400': 21
}
cleaned_data['arrival_time_3_hour_period'] = (
    raw_data['FirstArrivalTime'].map(arrival_time_dict))

### Scan, thrombolysis and thrombectomy

In [13]:
# Get arrival to scan time
cleaned_data['arrival_to_scan_time'] = raw_data['ArrivaltoBrainImagingMinutes']

# Get use of thrombolysis
# NB is the answer automatically selected if type of stroke is PIH
thrombolysis: dict = {'Y': 1, 'N': 0, 'NB': 0}
cleaned_data['thrombolysis'] = raw_data['S2Thrombolysis'].map(thrombolysis)

# Get time from scan to thrombolysis
cleaned_data['scan_to_thrombolysis_time'] = (
    raw_data['ArrivaltoThrombolysisMinutes'] -
    raw_data['ArrivaltoBrainImagingMinutes'])

# Get use of thrombectomy (0 if x is NaN, 1 if x is a number)
cleaned_data['thrombectomy'] = (
    raw_data['ArrivaltoArterialPunctureMinutes'].apply(
        lambda x: 0 if np.isnan(x) else 1))

# Get time from arrival to thrombectomy
cleaned_data['arrival_to_thrombectomy_time'] = (
    raw_data['ArrivaltoArterialPunctureMinutes'])

### Comorbidities

These are comorbidities that were present prior to this admission, and medication that the patient was on prior to this admission. The one exception is S2NewAFDiagnosis, which records whether a new diagnosis of atrial fibrillation was made on admission.

In [14]:
comorbidities: dict = {
    'S2CoMCongestiveHeartFailure': 'congestive_heart_failure',
    'S2CoMHypertension': 'hypertension',
    'S2CoMAtrialFibrillation': 'atrial_fibrillation',
    'S2CoMDiabetes': 'diabetes',
    'S2CoMStrokeTIA': 'prior_stroke_tia',
    'S2CoMAFAntiplatelet': 'afib_antiplatelet',
    'S2CoMAFAnticoagulent': 'afib_anticoagulant'}

# Add comorbidity columns with new names and change Y/N/NB to 1/0
cleaned_data[list(comorbidities.values())] = raw_data[comorbidities.keys()]
comorbid_marker = {'Y': 1, 'N': 0, 'NB': 0, np.nan: np.nan}
for col in comorbidities.values():
    cleaned_data[col] = cleaned_data[col].map(comorbid_marker)

# You cannot be marked as receiving antiplatelets unless you have an atrial
# fibrillation diagnosis, so change those from missing to 0
cleaned_data.loc[(cleaned_data['atrial_fibrillation'] == 0) &
                 (cleaned_data['afib_antiplatelet'].isnull()),
                 'afib_antiplatelet'] = 0
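
The conditional fill above can be checked on a small hypothetical table: only the row with no atrial fibrillation and a missing antiplatelet value is changed.

```python
import numpy as np
import pandas as pd

# Hypothetical patients: AF status and antiplatelet marker
toy_af = pd.DataFrame({
    'atrial_fibrillation': [0, 1, 0],
    'afib_antiplatelet': [np.nan, np.nan, 1.0]})

# Fill missing antiplatelet with 0 only where AF is recorded as absent
toy_af.loc[(toy_af['atrial_fibrillation'] == 0) &
           (toy_af['afib_antiplatelet'].isnull()),
           'afib_antiplatelet'] = 0
print(toy_af)
```

The second patient (AF present, antiplatelet missing) stays missing, and the third keeps its recorded value.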

In [15]:
anticoag_type: dict = {
    'S2CoMAFAnticoagulentVitK': 'afib_vit_k_anticoagulant',
    'S2CoMAFAnticoagulentDOAC': 'afib_doac_anticoagulant',
    'S2CoMAFAnticoagulentHeparin': 'afib_heparin_anticoagulant'
}

# Add to clean data and map (separately from other comorbidities, as the
# data dictionary states that leaving empty (NaN) means False for these 3)
cleaned_data[list(anticoag_type.values())] = raw_data[anticoag_type.keys()]
anticoag_type_map = {1: 1, 0: 0, np.nan: 0}
for col in anticoag_type.values():
    cleaned_data[col] = cleaned_data[col].map(anticoag_type_map)

In [16]:
new_afib = {'Y': 1, 'N': 0, np.nan: np.nan}
cleaned_data['new_afib_diagnosis'] = raw_data['S2NewAFDiagnosis'].map(new_afib)

### Prior disability

In [17]:
cleaned_data['prior_disability'] = raw_data['S2RankinBeforeStroke']

### NIHSS data

In [18]:
def camel_to_snake(name):
    '''Converts CamelCase string to snake_case (e.g. 'AbCd' to 'ab_cd').'''
    # Parameter is called name to avoid shadowing the built-in str
    snake = ''.join(['_' + i.lower() if i.isupper()
                     else i for i in name]).lstrip('_')
    return snake
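
A short usage sketch for the helper, repeated here so the example is self-contained. It shows the output for one NIHSS column name and how the 's2_' prefix can then be removed. It also shows that runs of capitals are split letter by letter, which is worth bearing in mind before applying it to names such as 'S2CoMAFAnticoagulent'.

```python
def camel_to_snake(name):
    '''Converts CamelCase string to snake_case.'''
    return ''.join(['_' + i.lower() if i.isupper()
                    else i for i in name]).lstrip('_')


snake_nihss = camel_to_snake('S2NihssArrivalLoc')
without_prefix = snake_nihss.split('s2_')[1]
snake_capitals = camel_to_snake('S2CoMAFAnticoagulent')

print(snake_nihss)      # s2_nihss_arrival_loc
print(without_prefix)   # nihss_arrival_loc
print(snake_capitals)   # s2_co_m_a_f_anticoagulent
```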

In [19]:
# Stroke severity is NIHSS score on arrival
cleaned_data['stroke_severity'] = raw_data['S2NihssArrival']

# List of NIHSS arrival measures
nihss: list = ['S2NihssArrivalLoc', 'S2NihssArrivalLocQuestions',
               'S2NihssArrivalLocCommands', 'S2NihssArrivalBestGaze',
               'S2NihssArrivalVisual', 'S2NihssArrivalFacialPalsy',
               'S2NihssArrivalMotorArmLeft', 'S2NihssArrivalMotorArmRight',
               'S2NihssArrivalMotorLegLeft', 'S2NihssArrivalMotorLegRight',
               'S2NihssArrivalLimbAtaxia', 'S2NihssArrivalSensory',
               'S2NihssArrivalBestLanguage', 'S2NihssArrivalDysarthria',
               'S2NihssArrivalExtinctionInattention']

# Finds the minimum value across these columns, and uses that to create
# marker of whether any of them contain a missing value (indicated by -1)
cleaned_data['nihss_complete'] = raw_data[nihss].min(axis=1).apply(
    lambda x: 0 if x == -1 else 1)

# Add columns (exactly as are in raw data)
cleaned_data[nihss] = raw_data[nihss]

# Rename - convert to snake case and remove 's2_'
rename_dict: dict = {}
for col in nihss:
    rename_dict[col] = camel_to_snake(col).split('s2_')[1]
cleaned_data.rename(rename_dict, axis=1, inplace=True)
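
The completeness marker can be illustrated on a hypothetical table of three items, where -1 marks a missing item as in the raw data. The sketch assumes -1 is the lowest possible value, so the row minimum is -1 exactly when at least one item is missing. An equivalent formulation with `any` is included for comparison.

```python
import pandas as pd

# Hypothetical NIHSS items; -1 indicates a missing item
toy_nihss = pd.DataFrame({
    'item_a': [0, 1, 2],
    'item_b': [2, -1, 0],
    'item_c': [1, 3, -1]})

# Row minimum is -1 whenever any item is missing
toy_complete = toy_nihss.min(axis=1).apply(lambda x: 0 if x == -1 else 1)

# Equivalent check that looks for -1 directly
toy_complete_alt = 1 - (toy_nihss == -1).any(axis=1).astype(int)
print(toy_complete.tolist())
```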

### Death and outcome data

In [20]:
# Discharge destination
discharge: dict = {
    'CH': 'care_home',
    'D': 'died',
    'H': 'home',
    'SE': 'somewhere_else',
    'TC': 'community_team_or_esd',
    'TCN': 'community_team_or_esd',
    'TN': 'non_ssnap_hospital_team',
    'T': 'ssnap_hospital_team',
    np.nan: 'missing'}
cleaned_data['discharge_destination'] = (
    raw_data['S7DischargeType'].map(discharge))

# Death - if NaN then 0, if 0+ days (so if died) then 1
cleaned_data['death'] = (raw_data['ArrivalToDeathDays'] >= 0) * 1

# Outcome
cleaned_data['discharge_disability'] = raw_data['S7RankinDischarge']
cleaned_data['disability_6_month'] = raw_data['S8Rankin6Month']
# S8Rankin6MonthNK not included, as it is implicit that the result is
# not known if S8Rankin6Month is NaN

### Reasons for no thrombolysis

"No" reasons

Abbreviations:
* TNA - Thrombolysis not available at hospital at all
* OTSH - Outside thrombolysis service hours
* USQE - Unable to scan quickly enough
* N - None

In [21]:
def make_no_thrombolysis(category):
    '''
    Function used to create thrombolysis no variables.
    Created using map rather than ifelse (which would require less code)
    as map is explicitly stating what each variable is mapped to
    (e.g. being aware that NaN is also mapping to 0)
    '''
    # Create dictionary mapping every category to 0
    thrombolysis_no = {'TNA': 0, 'OTSH': 0, 'USQE': 0, 'N': 0, np.nan: 0}

    # Change specific category to map to 1
    thrombolysis_no[category] = 1

    # Return thrombolysis no column mapped to this dictionary
    return (raw_data['S2ThrombolysisNoReason'].map(thrombolysis_no))


# Create columns with reason for no thrombolysis - if 0 for all then was NaN
cleaned_data['thrombolysis_no_not_available'] = make_no_thrombolysis('TNA')
cleaned_data['thrombolysis_no_out_of_hours'] = make_no_thrombolysis('OTSH')
cleaned_data['thrombolysis_no_scan_not_quick_enough'] = (
    make_no_thrombolysis('USQE'))
cleaned_data['thrombolysis_no_no_reason'] = make_no_thrombolysis('N')

"No but" reasons

*Note: NoButTimeWindow refers to time since stroke onset, and not time window that thrombolysis services were available (that is covered in the "no" reasons)*

In [22]:
no_thrombolysis: list = [
       'S2ThrombolysisNoButHaemorrhagic', 'S2ThrombolysisNoButTimeWindow',
       'S2ThrombolysisNoButComorbidity', 'S2ThrombolysisNoButMedication',
       'S2ThrombolysisNoButRefusal', 'S2ThrombolysisNoButAge',
       'S2ThrombolysisNoButImproving', 'S2ThrombolysisNoButTooMildSevere',
       'S2ThrombolysisNoButTimeUnknownWakeUp',
       'S2ThrombolysisNoButOtherMedical']

# Add columns
cleaned_data[no_thrombolysis] = raw_data[no_thrombolysis]

# Rename - convert to snake case and remove 's2_'
rename_dict: dict = {}
for col in no_thrombolysis:
    rename_dict[col] = camel_to_snake(col).split('s2_')[1]
cleaned_data.rename(rename_dict, axis=1, inplace=True)

### Deal with unusual results

**Unusual ambulance times**

In [23]:
amb_col = ['call_to_ambulance_arrival_time',
           'ambulance_on_scene_time',
           'ambulance_travel_to_hospital_time',
           'ambulance_wait_time_at_hospital']

amb_except_wait = ['call_to_ambulance_arrival_time',
                   'ambulance_on_scene_time',
                   'ambulance_travel_to_hospital_time']

# If any times are missing, set all to NaN
cleaned_data.loc[cleaned_data[amb_col].isnull().any(axis=1), amb_col] = np.nan

# If any times are negative, set all to NaN
cleaned_data.loc[(cleaned_data[amb_col] < 0).any(axis=1), amb_col] = np.nan

# If any ambulance column except wait time is 0, set all to NaN
cleaned_data.loc[
    (cleaned_data[amb_except_wait] == 0).any(axis=1), amb_col] = np.nan

# Find individuals with ambulance times yet arrive_by_ambulance = 0
# For those individuals, set ambulance times + arrive by ambulance to NaN
times_but_no_amb = (
    cleaned_data['arrive_by_ambulance'] == 0) & (
        (cleaned_data['call_to_ambulance_arrival_time'].notnull()) |
        (cleaned_data['ambulance_on_scene_time'].notnull()) |
        (cleaned_data['ambulance_travel_to_hospital_time'].notnull()) |
        (cleaned_data['ambulance_wait_time_at_hospital'].notnull()))
cleaned_data.loc[times_but_no_amb,
                 amb_col + ['arrive_by_ambulance']] = np.nan

# If call to ambulance arrival is greater than 24h (1440m), set all times to NaN
cleaned_data.loc[cleaned_data['call_to_ambulance_arrival_time'] > 1440,
                 amb_col] = np.nan

# If ambulance on scene time is greater than 12h (720m), set all times to NaN
cleaned_data.loc[cleaned_data['ambulance_on_scene_time'] > 720,
                 amb_col] = np.nan

# If travel to hospital time is greater than 6h (360m), set all times to NaN
cleaned_data.loc[cleaned_data['ambulance_travel_to_hospital_time'] > 360,
                 amb_col] = np.nan

# If wait time at hospital is greater than 12h (720m), set all times to NaN
cleaned_data.loc[cleaned_data['ambulance_wait_time_at_hospital'] > 720,
                 amb_col] = np.nan
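
The rules above share one pattern: if any single ambulance time fails a check, every ambulance time for that patient is set to missing. A minimal sketch of that pattern on hypothetical data with two of the columns:

```python
import numpy as np
import pandas as pd

# Hypothetical ambulance times in minutes
toy_amb = pd.DataFrame({
    'ambulance_on_scene_time': [25.0, -3.0, 40.0],
    'ambulance_travel_to_hospital_time': [30.0, 12.0, 500.0]})
toy_cols = list(toy_amb.columns)

# Any negative time in a row: blank the whole row of times
toy_amb.loc[(toy_amb[toy_cols] < 0).any(axis=1), toy_cols] = np.nan

# Travel time over 6h (360m): blank the whole row of times
toy_amb.loc[toy_amb['ambulance_travel_to_hospital_time'] > 360,
            toy_cols] = np.nan
print(toy_amb)
```

Only the first patient keeps their times. The second has a negative value and the third exceeds the travel time limit, so both rows become NaN across all columns.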

**Unusual onset to arrival times**

In [24]:
# If onset to arrival = 0 or is negative, mark as missing
cleaned_data.loc[(cleaned_data['onset_to_arrival_time'] <= 0),
                 'onset_to_arrival_time'] = np.nan

# If onset time is not known, set onset to arrival time as missing
cleaned_data.loc[(cleaned_data['onset_known'] == 0),
                 'onset_to_arrival_time'] = np.nan

**Inconsistent anticoagulant data**

In [25]:
# If there is a discrepancy, where overall use of anticoagulants is
# negative but an individual anticoagulant is marked as being used,
# mark all four anticoagulant columns as missing
cleaned_data.loc[
    ((cleaned_data['afib_anticoagulant'] == 0) &
     ((cleaned_data['afib_vit_k_anticoagulant'] == 1) |
      (cleaned_data['afib_doac_anticoagulant'] == 1) |
      (cleaned_data['afib_heparin_anticoagulant'] == 1))),
    ['afib_anticoagulant', 'afib_vit_k_anticoagulant',
     'afib_doac_anticoagulant', 'afib_heparin_anticoagulant']] = np.nan

**Inconsistent death markers**

In [26]:
# Identify patients who have any marker indicating death
any_death = ((cleaned_data['death'] == 1) |
             (cleaned_data['discharge_disability'] == 6) |
             (cleaned_data['discharge_destination'] == 'died'))

# Identify patients who have all markers indicating death
# (or, if not, missing a result)
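all_or_missing_death = (
    ((cleaned_data['death'] == 1) |
     cleaned_data['death'].isnull()) &
    ((cleaned_data['discharge_disability'] == 6) |
     (cleaned_data['discharge_disability'].isnull())) &
    # Missing discharge destinations were mapped to 'missing' above,
    # so check for that label as well as for NaN
    ((cleaned_data['discharge_destination'] == 'died') |
     (cleaned_data['discharge_destination'] == 'missing') |
     (cleaned_data['discharge_destination'].isnull())))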

# Select patients who have at least one marker indicating death,
# but whose other markers do not (not selected if it's just because the
# others are missing) - and set all their death columns to NaN
cleaned_data.loc[
    any_death & ~all_or_missing_death,
    ['death', 'discharge_disability', 'discharge_destination']] = np.nan
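
The consistency check can be traced on four hypothetical patients. In this sketch a missing marker is represented by NaN, and the `toy_` names are illustrative only. A patient is treated as inconsistent when at least one marker indicates death and another marker is present but disagrees.

```python
import numpy as np
import pandas as pd

# Hypothetical patients: consistent death, conflicting disability,
# death with a missing disability, and a survivor
toy_death = pd.DataFrame({
    'death': [1, 1, 1, 0],
    'discharge_disability': [6, 3, np.nan, 2],
    'discharge_destination': ['died', 'died', 'died', 'home']})

toy_any = ((toy_death['death'] == 1) |
           (toy_death['discharge_disability'] == 6) |
           (toy_death['discharge_destination'] == 'died'))

toy_all_or_missing = (
    ((toy_death['death'] == 1) |
     toy_death['death'].isnull()) &
    ((toy_death['discharge_disability'] == 6) |
     toy_death['discharge_disability'].isnull()) &
    ((toy_death['discharge_destination'] == 'died') |
     toy_death['discharge_destination'].isnull()))

toy_inconsistent = toy_any & ~toy_all_or_missing
print(toy_inconsistent.tolist())
```

Only the second patient is flagged: they are recorded as dead, yet have a discharge disability score of 3.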

### Remove patients

This is performed at the end of the notebook, as the data cleaning steps assume that the rows in the clean data match up to the rows in the raw data.

In [27]:
# Remove patients missing stroke type
new_cleaned_data = cleaned_data[cleaned_data['infarction'].notna()]

## Save cleaned data

In [28]:
filename = os.path.join(paths.data_save_path, paths.data_save_filename)
new_cleaned_data.to_csv(filename, index=False)