### Impute data from SSNAP by replacing missing (NaN) values and encoding categorical variables 

Also remove any patients who were not scanned, and remove columns that are highly correlated with another column (see "1a_understand_dataset.ipynb" for the work that informed this).

In [1]:
import os
import json
import pandas as pd
import numpy as np

In [2]:
# location of the raw data and additional files. Will need to be changed by user.

data_loc = './'

### Import Data 

In [3]:
data = pd.read_csv(data_loc + '2019-11-04-HQIP303-Exeter_MA.csv',
                   low_memory=False)

In [4]:
n_patients_total = data.shape[0] 

In [5]:
len(data.columns)

62

In [6]:
data.head()

| Unnamed: 0 | StrokeTeam | PatientUID | Pathway | S1AgeOnArrival | MoreEqual80y | S1Gender | S1Ethnicity | S1OnsetInHospital | S1OnsetToArrival_min | S1OnsetDateType | ... | Comorbidity | Medication | Refusal | Age | Improving | TooMildSevere | TimeUnknownWakeUp | OtherMedical | S2ThrombolysisTime_min | S2TIAInLastMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GWOXR9160G | PSFLJ5008B | 1 | [85,90) | Yes | Female | White | No | 235.0 | Stroke during sleep | ... | No | No | No | No | No | No | Yes | No | NaN | NaN |
| 1 | WTBXP2683L | HJCXV6545Z | 1 | [85,90) | Yes | Male | White | No | 70.0 | Precise | ... | Yes | No | No | No | Yes | No | No | No | NaN | NaN |
| 2 | WTBXP2683L | DGCGW7328T | 1 | [85,90) | Yes | Female | White | Yes | NaN | Precise | ... | Yes | Yes | No | No | No | No | No | No | NaN | NaN |
| 3 | GRCGI3873D | YSWGZ4558X | 1 | [65,70) | No | Female | White | No | NaN | Best estimate | ... | No | No | No | No | No | No | Yes | No | NaN | NaN |
| 4 | ZRRCV7012C | VBUYH5070E | 1 | [75,80) | No | Male | White | No | NaN | Stroke during sleep | ... | No | No | No | No | No | No | Yes | No | NaN | NaN |


### Import variable2type.json

This dictionary maps each variable used in the prediction to its data type. Variables that do not appear in the dictionary are not used for the prediction and are dropped, with the exception of the target 'S2Thrombolysis', which is kept.

In [7]:
with open(data_loc + 'variable2type.json', 'r') as fp:
    variable2type = json.load(fp)

In [8]:
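for feature in data.columns:

    # Keep the target even though it is not in variable2type
    if feature == 'S2Thrombolysis':
        continue

    # Drop any column that is not used for the prediction
    if feature not in variable2type:
        data = data.drop(feature, axis=1)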
    

In [9]:
data.columns

Index(['StrokeTeam', 'Pathway', 'S1AgeOnArrival', 'MoreEqual80y', 'S1Gender',
       'S1Ethnicity', 'S1OnsetInHospital', 'S1OnsetToArrival_min',
       'S1OnsetDateType', 'S1OnsetTimeType', 'S1ArriveByAmbulance',
       'S1AdmissionHour', 'S1AdmissionDay', 'S1AdmissionQuarter',
       'S1AdmissionYear', 'CongestiveHeartFailure', 'Hypertension',
       'AtrialFibrillation', 'Diabetes', 'StrokeTIA', 'AFAntiplatelet',
       'AFAnticoagulent', 'AFAnticoagulentVitK', 'AFAnticoagulentDOAC',
       'AFAnticoagulentHeparin', 'S2NewAFDiagnosis', 'S2RankinBeforeStroke',
       'Loc', 'LocQuestions', 'LocCommands', 'BestGaze', 'Visual',
       'FacialPalsy', 'MotorArmLeft', 'MotorArmRight', 'MotorLegLeft',
       'MotorLegRight', 'LimbAtaxia', 'Sensory', 'BestLanguage', 'Dysarthria',
       'ExtinctionInattention', 'S2NihssArrival', 'S2BrainImagingTime_min',
       'S2StrokeType', 'S2Thrombolysis', 'S2TIAInLastMonth'],
      dtype='object')

In [10]:
len(data.columns)

47
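The same column selection can also be written without a loop. The sketch below is illustrative only: the toy frame `toy` and the dictionary `toy_variable2type` are hypothetical stand-ins, not the SSNAP data or the real `variable2type.json`.

```python
import pandas as pd

# Hypothetical toy data; names mirror the notebook but the values are made up
toy = pd.DataFrame({
    'PatientUID': ['a', 'b'],
    'S1Gender': ['Female', 'Male'],
    'S2Thrombolysis': ['Yes', 'No'],
})
toy_variable2type = {'S1Gender': 'Binary'}

# Keep a column if it is a prediction variable or the target
keep_cols = [col for col in toy.columns
             if col in toy_variable2type or col == 'S2Thrombolysis']
toy = toy[keep_cols]
```

Building the list of columns first makes it easy to inspect what will be kept before anything is dropped.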

### Import variable2method.json 

In [11]:
with open(data_loc + 'variable2method.json', 'r') as fp:
    variable2method = json.load(fp)

In [12]:
variable2method

{'S1OnsetToArrival_min': '9999',
 'S1ArriveByAmbulance': 'missing',
 'AFAntiplatelet': 'missing',
 'AFAnticoagulent': 'missing',
 'AFAnticoagulentVitK': 'missing',
 'AFAnticoagulentDOAC': 'missing',
 'AFAnticoagulentHeparin': 'missing',
 'S2NewAFDiagnosis': 'missing',
 'LocQuestions': 'zero',
 'LocCommands': 'zero',
 'BestGaze': 'zero',
 'Visual': 'zero',
 'FacialPalsy': 'zero',
 'MotorArmLeft': 'zero',
 'MotorArmRight': 'zero',
 'MotorLegLeft': 'zero',
 'MotorLegRight': 'zero',
 'LimbAtaxia': 'zero',
 'Sensory': 'zero',
 'BestLanguage': 'zero',
 'Dysarthria': 'zero',
 'ExtinctionInattention': 'zero',
 'S2NihssArrival': 'zero',
 'S2BrainImagingTime_min': '9999',
 'S2StrokeType': 'missing',
 'S2TIAInLastMonth': 'missing'}

In [13]:
variable2method['S2NihssArrival'] = 'sum'

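Values in this dictionary correspond to the following methods:

- 9999: replace missing values with 9999
- zero: replace missing values with zero
- missing: replace missing values with a text label 'missing'
- sum: replace missing values with the row-wise sum of the NIHSS columns (set for 'S2NihssArrival' in the cell above, overriding the 'zero' loaded from the file)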
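As a rough illustration of the constant-fill methods listed above, the sketch below applies them to a toy frame with `DataFrame.fillna`. The frame, the mapping `fill_values` and the category 'Infarction' are assumptions made for the example; the notebook itself uses an explicit loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'S1OnsetToArrival_min': [70.0, np.nan],
    'BestGaze': [np.nan, 2.0],
    'S2StrokeType': ['Infarction', np.nan],
})
toy_methods = {
    'S1OnsetToArrival_min': '9999',
    'BestGaze': 'zero',
    'S2StrokeType': 'missing',
}

# Translate each method name into the constant it stands for
fill_values = {'9999': 9999, 'zero': 0, 'missing': 'missing'}
per_column = {col: fill_values[m] for col, m in toy_methods.items()}

# fillna accepts a dict of column -> replacement value
toy_imputed = toy.fillna(per_column)
```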

## Imputation 

#### Step 1: Replace all NaN values according to the process in variable2method 

In [14]:
data.columns[27:43]

Index(['Loc', 'LocQuestions', 'LocCommands', 'BestGaze', 'Visual',
       'FacialPalsy', 'MotorArmLeft', 'MotorArmRight', 'MotorLegLeft',
       'MotorLegRight', 'LimbAtaxia', 'Sensory', 'BestLanguage', 'Dysarthria',
       'ExtinctionInattention', 'S2NihssArrival'],
      dtype='object')

In [15]:
imputed = data.copy()

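for variable, method in variable2method.items():

    series = imputed[variable].copy()
    missing = series.isna()

    if method == 'missing':
        # Text label for categorical variables
        series[missing] = 'missing'

    elif method == 'zero':
        series[missing] = 0

    elif method == '9999':
        # Sentinel value; the restrictions further down rely on it
        series[missing] = 9999

    elif method == 'sum':
        # Row-wise sum of the NIHSS columns shown above. NaN is skipped, so
        # a missing total is rebuilt from the (already zero-filled) items.
        series[missing] = imputed[data.columns[27:43]].sum(axis=1)

    else:
        raise ValueError('{0} not a valid method'.format(method))

    imputed[variable] = series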
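The 'sum' branch can be sketched on its own as follows. This is a minimal example under stated assumptions: the toy frame uses only two NIHSS items, and the list `item_cols` (hypothetical) names the item columns explicitly rather than slicing by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two NIHSS items and a partly missing total
toy = pd.DataFrame({
    'BestGaze': [1, 0, 2],
    'FacialPalsy': [2, 1, 0],
    'S2NihssArrival': [3.0, np.nan, np.nan],
})
item_cols = ['BestGaze', 'FacialPalsy']  # component items only

# Sum the items per patient, then use that sum only where the total is missing
item_sum = toy[item_cols].sum(axis=1)
toy['S2NihssArrival'] = toy['S2NihssArrival'].fillna(item_sum)
```

Naming the item columns avoids depending on column positions, which would shift if columns were added or removed upstream.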

#### Step 2: Use one-hot-encoding to encode all categorical and binary (text) variables 

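#### Lines 9 and 10 of the cell below skip 'StrokeTeam' so that it is not one-hot encoded here; they must be uncommented (as they are now) when preparing data for a neural network or for train/test splitting by hospital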

In [16]:
encoded = imputed.copy()

variable2family={}

for variable, type_ in variable2type.items():
    
    if type_ in ['Categorical', 'Binary']:
        
        if variable == 'StrokeTeam':
            continue

        to_code = encoded[variable]
        
        # Binary and categorical variables are encoded in the same way
        coded = pd.get_dummies(to_code, prefix=variable)
        
        encoded = pd.concat([encoded, coded], axis=1)
        encoded.drop([variable], axis=1, inplace=True)
        
        variable2family[variable] = coded.columns.values.tolist()

In [17]:
with open(data_loc + 'variable2family.json', 'w') as f: 
    json.dump(variable2family, f) 
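To show what the encoding and the `variable2family` bookkeeping produce, here is a small sketch on made-up data. The frame `toy` and the dictionary `toy_family` are hypothetical; only `pd.get_dummies` and `pd.concat` are used, as in the cell above.

```python
import pandas as pd

# Hypothetical toy data with two categorical variables
toy = pd.DataFrame({
    'S1Gender': ['Female', 'Male', 'Female'],
    'S1OnsetTimeType': ['Precise', 'Best estimate', 'Not known'],
})

toy_family = {}
for variable in ['S1Gender', 'S1OnsetTimeType']:
    # One new column per category, named '<variable>_<category>'
    coded = pd.get_dummies(toy[variable], prefix=variable)
    # Swap the original column for its one-hot columns
    toy = pd.concat([toy.drop(columns=variable), coded], axis=1)
    # Remember which new columns came from which variable
    toy_family[variable] = coded.columns.tolist()
```

Keeping the variable-to-columns mapping makes it possible to group the one-hot columns back under their parent variable later, for example when summarising feature importance.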

#### Step 3: Encode 'S2Thrombolysis' to target 

In [18]:
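target = []
for outcome in encoded.S2Thrombolysis.values:

    if outcome in ['No', 'No but']:
        target.append(0)

    elif outcome == 'Yes':
        target.append(1)

    else:
        # Fail loudly rather than leave target shorter than the data
        raise ValueError('Unexpected S2Thrombolysis value: {0}'.format(outcome))

encoded['S2Thrombolysis'] = target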
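An alternative way to build the target, sketched here on a toy series, is `Series.map` with an explicit dictionary. The names `outcomes` and `outcome_to_target` are hypothetical; the three labels are the ones handled in the cell above.

```python
import pandas as pd

# Hypothetical toy outcomes using the labels seen in the notebook
outcomes = pd.Series(['Yes', 'No', 'No but', 'Yes'])
outcome_to_target = {'No': 0, 'No but': 0, 'Yes': 1}

target = outcomes.map(outcome_to_target)

# Labels missing from the dictionary become NaN, so check before casting
if target.isna().any():
    raise ValueError('Unexpected S2Thrombolysis label')
target = target.astype(int)
```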

#### Step 4: Change 'S1AgeOnArrival' to midpoint

In [19]:
ages = []
for group in encoded.S1AgeOnArrival.values:
    # Age bands look like '[85,90)': split the bounds, then strip the brackets
    minage, maxage = group.split(',')
    minage = int(minage[1:])
    maxage = int(maxage[:-1])

    # The median of the two bounds is the midpoint of the band
    ages.append(np.median([minage, maxage]))

encoded['S1AgeOnArrival'] = ages
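The band-to-midpoint conversion can be checked in isolation with a small helper. `band_midpoint` is a hypothetical function written for this illustration and assumes every band has the form '[lower,upper)'.

```python
def band_midpoint(band):
    """Return the midpoint of an age band written like '[85,90)'."""
    # Remove the surrounding brackets, then split into the two bounds
    lower, upper = band.strip('[]()').split(',')
    return (int(lower) + int(upper)) / 2


midpoints = [band_midpoint(b) for b in ['[85,90)', '[65,70)', '[75,80)']]
```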

### Restrict data 

Restriction 1. Keep patients who attend a hospital with at least 300 admissions and at least 10 thrombolysed patients

In [20]:
# Set up list for dataframe groups
keep = []

groups = encoded.groupby('StrokeTeam') # creates a new object of groups of data

for index, group_df in groups: # each group has an index and a dataframe of data
    
    # Skip if total admissions < 300 or total thrombolysis < 10
    if (group_df.shape[0] < 300) or (group_df['S2Thrombolysis'].sum() < 10):
        continue
    
    else: 
        keep.append(group_df)

# Concatenate the retained groups into a single dataframe
filtered_data = pd.concat(keep)

n_patients_after_admission_restrictions = filtered_data.shape[0] 
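The hospital-level restriction can also be expressed as a boolean mask built with `groupby(...).transform`. The sketch below uses a toy frame and deliberately small thresholds (`min_admissions`, `min_thrombolysis` are hypothetical names; the notebook uses 300 and 10).

```python
import pandas as pd

# Hypothetical toy data: team A has 3 admissions, team B has 2
toy = pd.DataFrame({
    'StrokeTeam': ['A', 'A', 'A', 'B', 'B'],
    'S2Thrombolysis': [1, 1, 0, 0, 0],
})
min_admissions, min_thrombolysis = 3, 2  # toy thresholds

by_team = toy.groupby('StrokeTeam')['S2Thrombolysis']

# transform broadcasts each team's total back onto its patients
enough = ((by_team.transform('size') >= min_admissions) &
          (by_team.transform('sum') >= min_thrombolysis))
toy_filtered = toy[enough]
```

Unlike concatenating the groups, this keeps the rows in their original order, so the two approaches can differ in row ordering even though they keep the same patients.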

Restriction 2. Patients who have their stroke onset out of hospital with onset to arrival of 4 hours or less

In [21]:
filtered_data = filtered_data[(filtered_data['S1OnsetInHospital_Yes']==0) & 
                              (filtered_data['S1OnsetToArrival_min']<= 240)]

n_patients_after_arrival_restrictions = filtered_data.shape[0] 

Restriction 3. Patients that have a scan

In [22]:
mask = filtered_data['S2BrainImagingTime_min']!=9999
filtered_data = filtered_data[mask]

n_patients_after_need_a_scan_restrictions = filtered_data.shape[0] 
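One consequence of the 9999 imputation is worth spelling out: a patient with an unknown onset-to-arrival time carries the value 9999 and therefore fails the 240-minute test in Restriction 2, just as an unscanned patient fails the test in Restriction 3. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; 9999 marks an imputed (originally missing) time
toy = pd.DataFrame({
    'S1OnsetInHospital_Yes': [0, 0, 1, 0],
    'S1OnsetToArrival_min': [120, 9999, 60, 240],
    'S2BrainImagingTime_min': [30, 45, 20, 9999],
})

out_of_hospital = toy['S1OnsetInHospital_Yes'] == 0
arrived_in_time = toy['S1OnsetToArrival_min'] <= 240  # 9999 fails this test
scanned = toy['S2BrainImagingTime_min'] != 9999

after_arrival = toy[out_of_hospital & arrived_in_time]
after_scan = toy[out_of_hospital & arrived_in_time & scanned]
```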

### Summary of the number of patients remaining after each filtering step

In [23]:
print(f"There are {n_patients_total} patients in total")
print(f"There are {n_patients_after_admission_restrictions} patients after "
      f"hospital admission restrictions")
print(f"There are {n_patients_after_arrival_restrictions} patients after "
      f"arrival time <=4 hours and onset out of hospital restriction")
print(f"There are {n_patients_after_need_a_scan_restrictions} patients after "
      f"need to have a scan restrictions")

There are 246676 patients in total
There are 239505 patients after hospital admission restrictions
There are 88928 patients after arrival time <=4 hours and onset out of hospital restriction
There are 88792 patients after need to have a scan restrictions


## Drop columns
### Motivation to learn which columns to drop

In SAMueL2 we will focus more on the explainability of the models. We will do this by looking at the importance of each feature in the model, and also at each feature's SHAP value (the contribution of a feature to the target value). It is therefore more useful to remove any feature that has a near-perfect correlation with another feature. Say we have two binary variables, one recording whether the patient is female and another recording whether the patient is male. Both provide the model with the same information, so one model may give "female" a score of 10 and "male" a score of 0, while another may give both a score of 5. When looking at the features in terms of ranked importance, having the data represented in this way complicates the interpretation.
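The female/male example can be made concrete with a small sketch. The series `gender` is made-up data; the point is only that the two one-hot columns are perfectly (negatively) correlated, so one of them can be dropped without losing information.

```python
import pandas as pd

# Hypothetical toy data
gender = pd.Series(['Female', 'Male', 'Female', 'Male', 'Female'])
dummies = pd.get_dummies(gender, prefix='S1Gender').astype(int)

# Each column is exactly 1 minus the other, so the correlation is -1
corr = dummies['S1Gender_Female'].corr(dummies['S1Gender_Male'])

# Dropping one column keeps all the information in a single feature
reduced = dummies.drop(columns=['S1Gender_Male'])
```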

In [24]:
columns_to_drop = ['MoreEqual80y_No','S1Gender_Male','S1OnsetInHospital_No',
                   'CongestiveHeartFailure_No','Hypertension_No',
                   'AtrialFibrillation_No', 'Diabetes_No', 'StrokeTIA_No',
                   'AFAntiplatelet_missing',
                   'S1ArriveByAmbulance_missing',
                   'S2StrokeType_missing',
                   'S2StrokeType_Primary Intracerebral Haemorrhage',
                   'S1OnsetTimeType_Best estimate', 'S1OnsetTimeType_Not known']

filtered_data.drop(columns=columns_to_drop, inplace=True)

### Save imputed data to csv 

In [25]:
filtered_data.to_csv(
    data_loc + '220401_national_data_imputed_filtered_no_unit_encoding.csv', index=False)

### One hot encode units and save to csv

In [26]:
units = filtered_data['StrokeTeam']
filtered_data.drop(['StrokeTeam'],inplace=True, axis=1)
one_hot_coded = pd.get_dummies(units, prefix='StrokeTeam')
filtered_data = pd.concat([filtered_data, one_hot_coded], axis=1)

In [27]:
filtered_data.to_csv(
    data_loc + '220401_national_data_imputed_filtered_with_unit_encoding.csv', 
    index=False)