# Reformat data

This notebook reformats the cleaned up SSNAP data for use with machine learning.

Uses as input the output file from 01_clean_raw_data.ipynb ('clean_samuel_ssnap_extract_v2.csv'). That notebook is in the GitHub repository: https://github.com/samuel-book/samuel_2_data_prep/blob/main/01_clean_raw_data.ipynb

(This notebook was renamed from "02_reformat_data_ml_230612.ipynb" on branch kerry_01)

Option to remove thrombectomy patients (set "remove_thrombectomy_patients" to True; the default of False keeps them in)

## Import packages

In [1]:
# Import packages
import numpy as np
import os
import pandas as pd
import random
from os.path import exists
import json

from dataclasses import dataclass

# Set the maximum number of columns to 100
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

The user determines whether to remove thrombectomy patients from the dataset (False keeps them in)

In [2]:
remove_thrombectomy_patients = False

## Set up paths and filenames

Filenames are created with os.path.join(), so define folders without a trailing forward slash and give file names in full (including the extension).

In [3]:
@dataclass(frozen=True)
class Paths:
    '''Singleton object for storing paths to data and database.'''

    data_read_path: str = './output'
    data_read_filename: str = 'clean_samuel_ssnap_extract_v2.csv'
    data_save_path: str = './output'
    data_save_filename: str = 'reformatted_data_ml.csv'
    teamcode_save_filename: str = 'team_code.csv'
    notebook: str = '02_'

paths = Paths()
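
As a minimal sketch of how these fields combine into a filename, the example below joins a folder, the notebook prefix and a file name. The variable names `folder`, `prefix` and `name` are placeholders for illustration; the values mirror the defaults in `Paths`.

```python
import os

# Placeholder variables mirroring the Paths defaults above
folder = './output'                  # no trailing slash
prefix = '02_'                       # notebook prefix
name = 'reformatted_data_ml.csv'     # full file name, including extension

# os.path.join inserts the path separator for the current platform
filename = os.path.join(folder, prefix + name)
print(filename)
```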

## Define thresholds

In [4]:
min_hospital_thrombolysis_threshold = 10
min_hospital_admission_threshold = 250

## Define default values

In [5]:
set_duration_not_get_thrombolysis = -100

## Load data

In [6]:
filename = os.path.join(paths.data_read_path, paths.data_read_filename)
all_data = pd.read_csv(filename)
all_data.shape

(358993, 71)

## Filter patients 
### Filter patients on patient characteristic

Filter based on category, or threshold.

In [7]:
# Limit to years 2016+
mask = (all_data['year'] >= 2016)
data = all_data[mask]

# Limit to infarction stroke
mask = (data['infarction'] == 1)
data = data[mask]

# Limit to arrivals by ambulance
mask = (data['arrive_by_ambulance'] == 1)
data = data[mask]

if remove_thrombectomy_patients:
    # Remove patients who have received thrombectomy
    mask = (data['thrombectomy'] == 0)
    data = data[mask]

# Remove patients with no recorded prior disability
mask = data['prior_disability'] >= 0
data = data[mask]

# Remove records with no recorded discharge_disability
mask = data['discharge_disability'] >= 0
data = data[mask]

# Remove records with zero or negative onset_to_arrival_time
# (the mask is negated, so records with a missing value are kept)
mask = ~(data['onset_to_arrival_time'] <= 0)
data = data[mask]

# Remove patients with 'onset_known' = 0
# SSNAP data assumes patients with S1OnsetTimeType="NK" had their stroke onset 
# at midnight (so their OnsettoArrivalMinutes are calculated from midnight). 
# Remove these patients. This information is in feature 'onset_known' as 0 (a 
# value of 1 represents precise and best estimate) (see 01_clean_raw_data.ipynb)
mask = data['onset_known'] == 1
data = data[mask]
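
The onset-to-arrival filter above keeps every row for which the `<= 0` comparison is not true. One consequence, sketched below on a toy dataframe (the values are made up for illustration), is that rows with a missing duration are kept, whereas a direct `> 0` test would drop them.

```python
import numpy as np
import pandas as pd

# Toy data: a negative, a zero, a positive and a missing duration
toy = pd.DataFrame({'onset_to_arrival_time': [-5.0, 0.0, 30.0, np.nan]})

# Negating "<= 0" keeps NaN, because NaN <= 0 evaluates to False
keep_negated = ~(toy['onset_to_arrival_time'] <= 0)

# A direct "> 0" test drops NaN, because NaN > 0 is also False
keep_positive = toy['onset_to_arrival_time'] > 0

print(keep_negated.tolist())   # [False, False, True, True]
print(keep_positive.tolist())  # [False, False, True, False]
```

Which behaviour is wanted depends on whether missing onset-to-arrival times can occur in the cleaned data; that is not stated here.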

Remove leading and trailing whitespace in team names

In [8]:
# Strip whitespace on the filtered data, which is the dataframe used from here on
data = data.assign(stroke_team=data["stroke_team"].str.strip())

### Filter patients on attended hospital characteristic

Include patients who attended a hospital that had at least 250 admissions and gave thrombolysis at least 10 times over the 6 years covered by the dataset (2016 to 2021 inclusive)

Define function to remove patients based on hospital values

In [9]:
def filter_stroke_team(data, min_threshold, stroke_team_values):
    """
    Returns the dataframe with only the patients that attend a stroke team that 
    meets a minimum threshold.
    Currently used to limit patients to those that attend a hospital that has
    at least 250 admissions, and gives thrombolysis at least 10 times.

    data [dataframe]: The full dataset
    min_threshold [float]: value a stroke team must equal or exceed to stay 
                    in data
    stroke_team_values [series]: contains value per stroke team, to be compared 
                    against the minimum threshold (index: stroke_team)
    """

    mask = stroke_team_values >= min_threshold
    stroke_team_keep = list(stroke_team_values[mask].index)
    data = data[data['stroke_team'].isin(stroke_team_keep)]

    return data
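
To illustrate how the function behaves, here is a small sketch on a toy dataframe. The team names, the threshold of 2 and the compact copy of the function are for illustration only; the notebook itself uses the thresholds defined earlier.

```python
import pandas as pd

def filter_stroke_team(data, min_threshold, stroke_team_values):
    # Compact copy of the function above, so this example runs on its own
    mask = stroke_team_values >= min_threshold
    stroke_team_keep = list(stroke_team_values[mask].index)
    return data[data['stroke_team'].isin(stroke_team_keep)]

# Toy data: team A has 3 admissions, team B has 1, team C has 2
toy = pd.DataFrame({
    'stroke_team': ['A', 'A', 'A', 'B', 'C', 'C'],
    'thrombolysis': [1, 0, 1, 0, 0, 1],
})

# Admissions per team (series indexed by stroke_team)
admissions = toy.groupby('stroke_team').size()

# Keep teams with at least 2 admissions; the threshold is inclusive
filtered = filter_stroke_team(toy, 2, admissions)
print(sorted(filtered['stroke_team'].unique()))  # ['A', 'C']
```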

1. Include patients who attended a hospital with at least 250 admissions in the 6 years covered by the dataset (2016 to 2021 inclusive)

In [10]:
stroke_team_admissions = data.groupby(['stroke_team'])['stroke_team'].count()
data = filter_stroke_team(data, min_hospital_admission_threshold, 
                          stroke_team_admissions)
data.shape

(172001, 71)

2. Include patients who attended a hospital that gave thrombolysis at least 10 times in the 6 years covered by the dataset (2016 to 2021 inclusive)

In [11]:
stroke_team_thrombolysis = data.groupby(['stroke_team'])['thrombolysis'].sum()
data = filter_stroke_team(data, min_hospital_thrombolysis_threshold, 
                          stroke_team_thrombolysis)
data.shape

(172001, 71)

## Edit feature values

1. Set scan-to-thrombolysis time to -100 for patients who do not receive thrombolysis. This lets us remove thrombolysis as a feature, because the information is captured in the duration feature (keeping both in the model would introduce feature dependency)

In [12]:
# Set -100 for patients who do not receive thrombolysis
mask = data['thrombolysis'] == 0
data.loc[mask, 'scan_to_thrombolysis_time'] = set_duration_not_get_thrombolysis
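
The sketch below, on made-up toy data, shows why the thrombolysis flag becomes redundant once the sentinel is set: the flag can be recovered exactly from the duration column. `NO_THROMBOLYSIS` is a name introduced here for illustration; it holds the same value as `set_duration_not_get_thrombolysis`.

```python
import pandas as pd

NO_THROMBOLYSIS = -100  # illustrative name; same value as the notebook default

# Toy data: the second patient did not receive thrombolysis
toy = pd.DataFrame({
    'thrombolysis': [1, 0, 1],
    'scan_to_thrombolysis_time': [25.0, float('nan'), 40.0],
})

# Apply the sentinel to patients without thrombolysis
no_treatment = toy['thrombolysis'] == 0
toy.loc[no_treatment, 'scan_to_thrombolysis_time'] = NO_THROMBOLYSIS

# The flag is now fully determined by the duration column
recovered = (toy['scan_to_thrombolysis_time'] != NO_THROMBOLYSIS).astype(int)
print((recovered == toy['thrombolysis']).all())  # True
```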

## Add new features

1. New feature "onset_to_thrombolysis_time"

Create new feature "onset_to_thrombolysis_time", the sum of the three separate duration features (onset to arrival, arrival to scan, scan to thrombolysis).

Set as -100 for the patients that do not receive thrombolysis

In [13]:
def calculate_onset_to_thrombolysis(row):
    # Default: sentinel value meaning no thrombolysis was given
    onset_to_thrombolysis = set_duration_not_get_thrombolysis
    # Sum the three pathway durations if thrombolysis was given
    if row['scan_to_thrombolysis_time'] != set_duration_not_get_thrombolysis:
        onset_to_thrombolysis = (row['onset_to_arrival_time'] +
                                 row['arrival_to_scan_time'] +
                                 row['scan_to_thrombolysis_time'])
    return onset_to_thrombolysis

In [14]:
# Calculate onset to thrombolysis (but set to -100 if no thrombolysis given)
data['onset_to_thrombolysis_time'] = (
                        data.apply(calculate_onset_to_thrombolysis, axis=1))
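
A row-wise `apply` can be slow on a dataset of this size. As a possible alternative (not the method used in this notebook), the same rule can be sketched in vectorised form with `np.where`. The toy values and the name `NO_THROMBOLYSIS` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

NO_THROMBOLYSIS = -100  # illustrative name for the sentinel value

# Toy data: the second patient did not receive thrombolysis
toy = pd.DataFrame({
    'onset_to_arrival_time': [60.0, 90.0],
    'arrival_to_scan_time': [20.0, 15.0],
    'scan_to_thrombolysis_time': [30.0, NO_THROMBOLYSIS],
})

# Sum of the three pathway durations for every row
total = (toy['onset_to_arrival_time'] +
         toy['arrival_to_scan_time'] +
         toy['scan_to_thrombolysis_time'])

# Keep the sum where thrombolysis was given, otherwise use the sentinel
treated = toy['scan_to_thrombolysis_time'] != NO_THROMBOLYSIS
toy['onset_to_thrombolysis_time'] = np.where(treated, total, NO_THROMBOLYSIS)

print(toy['onset_to_thrombolysis_time'].tolist())  # [110.0, -100.0]
```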

2. New feature: 'team_code'

Create a new feature containing an anonymised stroke team code (and save the team names with their codes as a separate csv file). Replace the stroke team name with this code for use in subsequent notebooks.

Randomise the stroke teams, and create an anonymised code.

Once the csv file has been created, do not recreate it: the code list has been shared with others so that they can identify themselves in the web app.

Web app: https://stroke-predictions.streamlit.app/

In [15]:
filename = os.path.join(paths.data_save_path, 
                        (paths.notebook + paths.teamcode_save_filename))

# Check if exists
file_exists = exists(filename)

# Only create team codes if the file does not exist
if not file_exists:
    # Get list of teams (sorted, so the seeded shuffle is reproducible)
    teams = sorted(set(data['stroke_team']))

    # Shuffle into random order
    random.seed(42)
    random.shuffle(teams)
    
    # Create dictionary
    teams_code_dict = dict()
    for i, j in enumerate(teams):
        teams_code_dict[j] = i + 1

    # Save teams ID to csv file
    col_names = ['stroke_team', 'team_code']
    teams_code_df = pd.DataFrame(
        teams_code_dict.items(), columns=col_names)
    teams_code_df.to_csv(filename,index=False)

else:
    # Use existing file to overwrite 'stroke_team' column
    teams_code_df = pd.read_csv(filename)

    # remove leading and trailing whitespace in team names
    teams_code_df["stroke_team"] = (
                    teams_code_df["stroke_team"].apply(lambda x: x.strip()))

    # Create dictionary mapping team name to team code
    teams_code_dict = dict(zip(teams_code_df['stroke_team'],
                               teams_code_df['team_code']))

# Overwrite stroke_team names with codes
data['stroke_team'] = data['stroke_team'].map(teams_code_dict)
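
The sketch below illustrates two points about this step on made-up team names. First, a seeded shuffle only gives the same codes on every run if the input order is fixed, and a Python `set` of strings has no guaranteed order, so the names are sorted first. Second, `Series.map` with a dictionary returns NaN for any team missing from it, which is worth checking when codes are read from an existing file. The helper `make_team_codes` is hypothetical, not part of the notebook.

```python
import random

import pandas as pd

def make_team_codes(team_names, seed=42):
    """Hypothetical helper: map each team name to a shuffled code from 1."""
    teams = sorted(set(team_names))        # fixed order before shuffling
    random.Random(seed).shuffle(teams)     # seeded, in-place shuffle
    return {name: code for code, name in enumerate(teams, start=1)}

# The same names in a different order give the same codes
codes_a = make_team_codes(['Team C', 'Team A', 'Team B'])
codes_b = make_team_codes(['Team B', 'Team C', 'Team A'])
print(codes_a == codes_b)  # True

# A team missing from the dictionary maps to NaN
column = pd.Series(['Team A', 'Team B', 'Team D'])
mapped = column.map(codes_a)
print(mapped.isna().sum())  # 1
```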


## Removing features
Build a list of features to remove, then remove them all in one step.

1. Remove 'onset\_known', as all remaining patients have the same value. Only patients with a value of 1 were kept, because records with an unknown onset time use a default onset time of midnight in the duration calculations.

In [16]:
remove_features = ['onset_known']

2. Remove anticoagulant types.

In [17]:
remove_features.append('afib_vit_k_anticoagulant')
remove_features.append('afib_doac_anticoagulant')
remove_features.append('afib_heparin_anticoagulant')

3. Remove thrombolysis, and keep the pathway durations.

A value in the scan\_to\_thrombolysis\_time will indicate the patient had thrombolysis. Keeping both in will mean dependencies in the features (SHAP assumes all features are independent).

In [18]:
remove_features.append('thrombolysis')

4. Remove features that contain information later in the pathway, or contain information in the target feature (discharge_disability)

In [19]:
remove_features.append('discharge_destination')
remove_features.append('death')
remove_features.append('disability_6_month')

5. Remove features about ambulance times (not fully filled in)

In [20]:
remove_features.append('call_to_ambulance_arrival_time')
remove_features.append('ambulance_on_scene_time')
remove_features.append('ambulance_travel_to_hospital_time')
remove_features.append('ambulance_wait_time_at_hospital')


6. If remove_thrombectomy_patients is True, remove the features about thrombectomy

In [21]:
if remove_thrombectomy_patients:
    remove_features.append('thrombectomy')
    remove_features.append('arrival_to_thrombectomy_time')

7. Remove any features (not yet identified to be removed) that have the same value for the whole dataset (those with 0 standard deviation)

In [22]:
for col in data.columns:
    if (data[col].dtype != 'O'):
        if (data[col].std()) == 0:
            if col not in remove_features:
                remove_features.append(col)
                print(f"Removing feature {col} as standard deviation = 0. All "
                      f"patients have value {data[col].iloc[0]}")

Removing feature infarction as standard deviation = 0. All patients have value 1.0
Removing feature arrive_by_ambulance as standard deviation = 0. All patients have value 1.0
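
As a possible alternative to testing for a standard deviation of exactly zero, constant numeric columns can be found by counting distinct values. This is a sketch on made-up toy data, not the approach used above. Note one difference in behaviour: `std()` ignores missing values, whereas `nunique(dropna=False)` counts NaN as a distinct value.

```python
import pandas as pd

# Toy data: 'infarction' is constant, 'age' varies, 'stroke_team' is text
toy = pd.DataFrame({
    'infarction': [1.0, 1.0, 1.0],
    'age': [62.5, 77.5, 82.5],
    'stroke_team': ['A', 'A', 'A'],
})

# Restrict to numeric columns, matching the dtype check in the loop above
numeric = toy.select_dtypes(include='number')

# A column is constant if it holds exactly one distinct value
constant_cols = [col for col in numeric.columns
                 if numeric[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['infarction']
```

Counting distinct values avoids relying on an exact floating-point comparison with zero.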


Remove the features from the dataset

In [23]:
data = data.drop(remove_features, axis=1)

## Save reformatted data 

Ready for machine learning (to predict the disability at discharge)

In [24]:
# Have different filename depending on if thrombectomy patients are included
save_filename = paths.data_save_filename

if not remove_thrombectomy_patients:
    save_filename = save_filename.replace(".csv", "_include_mt.csv")

filename = os.path.join(paths.data_save_path, 
                        (paths.notebook + save_filename))

data.to_csv(filename, index=False)