# 1: Group patients by similar features

## Plain English summary

This notebook defines many different "types" of patients. Each group of patients has similar properties, and for each group we look at all nine of the features that will be used in the machine learning model. This notebook defines 576 different groups, places the 10,000 patients in the test data into those groups, and saves a record of the group assignments for later. 

The groups are not divided up too finely, otherwise there would not be very many people in any group.
Also, two of the times in the patient data have been combined into one new value: we use onset-to-scan time instead of separate onset-to-arrival and arrival-to-scan times. If they were kept separate, there would be too few patients in any group.

The patients are grouped by:

| Feature | Options |
| --- | --- |
| Stroke severity | Mild (score below 8), moderate (8 to 32), severe (above 32) |
| Prior disability | Mild (mRS 0 or 1), moderate (mRS 2 or 3), severe (mRS 4 or 5) |
| Age | Below 80, 80 or higher |
| Infarction | Yes, no |
| Onset to _scan_ time | Four hours or less, more than four hours |
| Precise onset known | Yes, no |
| Onset during sleep | Yes, no |
| Afib anticoagulants | Yes, no |

All of the different combinations of these features make $3 \times 3 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2  = 576$ groups.

This notebook also uses the trained machine learning model to predict the probability of thrombolysis for each patient in the test data. This means we can save a data file with each patient, their group number, and their probability of thrombolysis.

## Load imports

In [73]:
import pandas as pd
import numpy as np
import yaml
import pickle
import copy

from dataclasses import dataclass
from sklearn.model_selection import train_test_split

import stroke_utilities.process_data as process_data

import matplotlib.pyplot as plt

# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")

## Set up paths and filenames

In [74]:
@dataclass(frozen=True)
class Paths:
    '''Singleton object for storing paths to data and database.'''

    data_read_path: str = './stroke_utilities/data/'
    data_read_filename: str = 'reformatted_data_thrombolysis_decision.csv'
    data_test_filename: str = 'cohort_10000_test.csv'
    data_train_filename: str = 'cohort_10000_train.csv'
    data_save_path: str = './stroke_utilities/data'
    model_folder: str = './stroke_utilities/models'
    notebook: str = ''

paths = Paths()

## Load data

Import the trained machine learning model:

In [75]:
with open(f'{paths.model_folder}/model.p', 'rb') as fp:
    model = pickle.load(fp)

Import the patient data. The following cell imports just the 10,000 patients in the test data and splits them into features (X) and labels (y) for the model. The stroke team is one-hot encoded and the year column is dropped.

In [76]:
filename = paths.data_read_path + paths.data_test_filename
test = pd.read_csv(filename)

X_test, y_test = process_data.split_X_and_y(test, 'thrombolysis')

data = process_data.one_hot_encode_column(
    X_test, 'stroke_team_id', prefix='team')
data = data.drop('year', axis=1)

save_str = ''

Alternative: import the _training_ patient data. When `training` is set to 1, the following cell imports just the ~100,000 patients in the training data, splits them into X and y for the model, and overwrites `data` from the previous cell.

In [77]:
training = 1
if training == 1:
    filename = paths.data_read_path + paths.data_train_filename
    train = pd.read_csv(filename)
    
    X_train, y_train = process_data.split_X_and_y(train, 'thrombolysis')
    
    data = process_data.one_hot_encode_column(
        X_train, 'stroke_team_id', prefix='team')
    data = data.drop('year', axis=1)
    
    save_str = 'train_'

Alternative: import the patient data for all ~110,000 patients, not just the 10,000 test patients. 

In [78]:
all_patients = 0
if all_patients == 1:
    filename = paths.data_read_path + paths.data_read_filename
    data = pd.read_csv(filename)
    
    
    # Shuffle the rows (fixed seed for reproducibility)
    
    data = data.sample(frac=1.0, random_state=42)
    
    ## Limit to scan with enough time for thrombolysis
    
    with open('./stroke_utilities/fixed_params.yml') as f:
        fixed_params = yaml.safe_load(f)
    
    # allowed_onset_to_needle_time_mins = fixed_params['allowed_onset_to_needle_time_mins']
    # minutes_left = fixed_params['minutes_left']
    allowed_onset_to_scan_time = fixed_params['allowed_onset_to_scan_time']
    
    def restrict_to_onset_to_scan_on_time(big_data):    
        # Time left after scan for thrombolysis
        big_data['onset_to_scan_time'] = (
            big_data['onset_to_arrival_time'] + 
            big_data['arrival_to_scan_time']
            )
    
        mask_to_include = big_data['onset_to_scan_time'] <= allowed_onset_to_scan_time
    
        # Restrict the data to these patients:
        big_data = big_data[mask_to_include]
        return big_data
    
    data = restrict_to_onset_to_scan_on_time(data)
    
    # mask = data['onset_to_arrival_time'] <= 240
    # data = data[mask]
    
    ## Limit to the 10 model features, plus year and the thrombolysis label
    
    features_to_use = [
        'stroke_team_id',
        'stroke_severity',
        'prior_disability',
        'age',
        'infarction',
        'onset_to_arrival_time',
        'precise_onset_known',
        'onset_during_sleep',
        'arrival_to_scan_time',
        'afib_anticoagulant',
        'year',    
        'thrombolysis'
    ]
    
    data = data[features_to_use]

    save_str = 'cleaned_'

## Define groups

Set up the groups by defining how many options there are for each feature. They are called "masks" because the feature options will be used to mask out unwanted patients from the full dataset.

In [79]:
# How many masks are in each category?
masks = {
    'onset_scan':2,
    'severity':3,
    'mrs':3,
    'age':2,
    'infarction':2,
    'precise':2,
    'sleep':2,
    'anticoag':2
}
masks_names = list(masks.keys())
masks_lens = list(masks.values())

Also store a way to convert the number labels back to more meaningful labels:

In [80]:
mask_str_dict = {
    'onset_scan':{0:'<=4hr', 1:'>4hr'},
    'severity':{0:'Mild', 1:'Moderate', 2:'Severe'},
    'mrs':{0:'0 to 1', 1:'2 to 3', 2:'4 to 5'},
    'age':{0:'Below 80', 1:'At least 80'},
    'infarction':{0:'No', 1:'Yes'},
    'precise':{0:'No', 1:'Yes'},
    'sleep':{0:'No', 1:'Yes'},
    'anticoag':{0:'No', 1:'Yes'},
    }

Name each feature option as a number (option 0, 1, 2...). Find every unique combination of these feature options and store the lists of numbers.

In [81]:
# This could be written more compactly but it works for now:
inds_lists = []

for a in range(masks_lens[0]):
    for b in range(masks_lens[1]):
        for c in range(masks_lens[2]):
            for d in range(masks_lens[3]):
                for e in range(masks_lens[4]):
                    for f in range(masks_lens[5]):
                        for g in range(masks_lens[6]):
                            for h in range(masks_lens[7]):
                                inds_lists.append([a, b, c, d, e, f, g, h])
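
As the comment in the cell above notes, the nested loops could be written more compactly. A minimal sketch using `itertools.product` from the standard library, which assumes the same feature order and option counts as the `masks` dictionary above, and which yields combinations with the last feature varying fastest, just like the innermost loop:

```python
from itertools import product
from math import prod

# Number of options per feature, in the same order as `masks` above.
masks_lens = [2, 3, 3, 2, 2, 2, 2, 2]

# One list of option indices per group; the last feature varies fastest.
inds_lists = [
    list(combo) for combo in product(*(range(n) for n in masks_lens))
]

# The number of combinations is the product of the option counts.
print(len(inds_lists) == prod(masks_lens))
print(inds_lists[:3])
```

This should produce the same list of 576 combinations in the same order as the explicit loops.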

The first few combinations look like this:

In [82]:
inds_lists[:8]

[[0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 1],
 [0, 0, 0, 0, 0, 1, 1, 0],
 [0, 0, 0, 0, 0, 1, 1, 1]]

Place these lists of lists into a dataframe so that we can label which number belongs to which feature. Also create a new column, `mask_number`, to label each unique combination of the features.

In [83]:
df_mask_numbers = pd.DataFrame(inds_lists, columns=[m + '_mask_number' for m in masks_names])

df_mask_numbers['mask_number'] = np.arange(len(df_mask_numbers))

df_mask_numbers

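| | onset_scan_mask_number | severity_mask_number | mrs_mask_number | age_mask_number | infarction_mask_number | precise_mask_number | sleep_mask_number | anticoag_mask_number | mask_number |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | 1 | 2 | 2 | 1 | 1 | 0 | 1 | 1 | 571 |
| 572 | 1 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 572 |
| 573 | 1 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 573 |
| 574 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 0 | 574 |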
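
Because the combinations are generated with the last feature varying fastest, `mask_number` behaves like a mixed-radix number built from the per-feature option indices. The sketch below illustrates this with NumPy's `ravel_multi_index` and `unravel_index`. It is not part of the original workflow, and it assumes the option counts and column order shown above.

```python
import numpy as np

# Number of options per feature, in the column order of the table above.
masks_lens = (2, 3, 3, 2, 2, 2, 2, 2)

# Option indices for one group (the row labelled 571 in the table above).
inds = (1, 2, 2, 1, 1, 0, 1, 1)

# Encode: option indices -> group number.
mask_number = int(np.ravel_multi_index(inds, masks_lens))

# Decode: group number -> option indices.
decoded = [int(i) for i in np.unravel_index(mask_number, masks_lens)]

print(mask_number, decoded)
```

This gives a way to look up a group number from its feature options, or the reverse, without searching the dataframe.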


Save the mask combinations and their labels to file:

In [84]:
df_mask_numbers.to_csv(f'./uncertainty/{save_str}mask_numbers.csv', index=False)

Translate the numbers back to strings for a better look:

In [85]:
df_mask_labels = df_mask_numbers.copy()

# Remove the _mask_number part of the columns:
df_mask_labels = df_mask_labels.rename(
    columns=dict(zip([m + '_mask_number' for m in masks_names], masks_names))
)

# Replace numbers with labels:
df_mask_labels = df_mask_labels.replace(mask_str_dict)

df_mask_labels

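| | onset_scan | severity | mrs | age | infarction | precise | sleep | anticoag | mask_number |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | <=4hr | Mild | 0 to 1 | Below 80 | No | No | No | No | 0 |
| 1 | <=4hr | Mild | 0 to 1 | Below 80 | No | No | No | Yes | 1 |
| 2 | <=4hr | Mild | 0 to 1 | Below 80 | No | No | Yes | No | 2 |
| 3 | <=4hr | Mild | 0 to 1 | Below 80 | No | No | Yes | Yes | 3 |
| 4 | <=4hr | Mild | 0 to 1 | Below 80 | No | Yes | No | No | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | >4hr | Severe | 4 to 5 | At least 80 | Yes | No | Yes | Yes | 571 |
| 572 | >4hr | Severe | 4 to 5 | At least 80 | Yes | Yes | No | No | 572 |
| 573 | >4hr | Severe | 4 to 5 | At least 80 | Yes | Yes | No | Yes | 573 |
| 574 | >4hr | Severe | 4 to 5 | At least 80 | Yes | Yes | Yes | No | 574 |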
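
The translation step relies on `DataFrame.replace` accepting a nested dictionary: the outer keys are column names and each inner dictionary maps old values to new ones, so the same code (for example 0 or 1) can mean different things in different columns. A small illustration on toy data, where the dataframe and its values are made up for the example:

```python
import pandas as pd

# Two toy columns that both use the codes 0 and 1.
df = pd.DataFrame({'age': [0, 1, 1], 'sleep': [1, 0, 1]})

# Outer keys are column names; inner dicts map old value -> new label.
labels = {
    'age': {0: 'Below 80', 1: 'At least 80'},
    'sleep': {0: 'No', 1: 'Yes'},
}

df_labelled = df.replace(labels)
print(df_labelled)
```

Columns that are not named in the dictionary, such as `mask_number` above, are left unchanged.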


## Place patients into groups

So far we have set up a labelling system for all of the combinations of masks. Now we will build the actual mask belonging to each label.

Each patient is checked against each condition (e.g. is the stroke severity mild? Is it moderate? Is it severe?) and an answer of True or False is stored for each patient in every case. For the test data, each mask contains 10,000 True and False values.

The following cell creates a dictionary. The dictionary contains a list of masks for each feature. Each list is a different length depending on how many options there are for the feature.

In [86]:
masks_severity = [
    (data['stroke_severity'] < 8),
    ((data['stroke_severity'] >= 8) & (data['stroke_severity'] <= 32)),
    (data['stroke_severity'] > 32)
    ]
masks_mrs = [
    ((data['prior_disability'] == 0) | (data['prior_disability'] == 1)),
    ((data['prior_disability'] == 2) | (data['prior_disability'] == 3)),
    ((data['prior_disability'] == 4) | (data['prior_disability'] == 5)),
    ]
masks_age = [
    (data['age'] < 80),
    (data['age'] >= 80)
    ]
masks_infarction = [
    (data['infarction'] == 0),
    (data['infarction'] != 0)
    ]
masks_onset_scan = [
    (data['onset_to_arrival_time'] + data['arrival_to_scan_time'] <= 4*60),
    (data['onset_to_arrival_time'] + data['arrival_to_scan_time'] > 4*60)
    ]
masks_precise = [
    (data['precise_onset_known'] == 0),
    (data['precise_onset_known'] != 0)
    ]
masks_sleep = [
    (data['onset_during_sleep'] == 0),
    (data['onset_during_sleep'] != 0)
    ]
masks_anticoag = [
    (data['afib_anticoagulant'] == 0),
    (data['afib_anticoagulant'] != 0)
    ]

# Store the masks in a dictionary:
masks = {
    'onset_scan':masks_onset_scan,
    'severity':masks_severity,
    'mrs':masks_mrs,
    'age':masks_age,
    'infarction':masks_infarction,
    'precise':masks_precise,
    'sleep':masks_sleep,
    'anticoag':masks_anticoag
}
masks_names = list(masks.keys())
masks_lists = list(masks.values())

For each of the 576 combinations of feature values, pick out the relevant masks from the above dictionary. Then multiply all eight of the chosen masks together. This means that only patients who answer True to each of the eight masks will end up as True in the combined mask.

The list `group_masks` will then let us pick out only the patients who belong to a certain group.

In [87]:
group_masks = []

for inds in inds_lists:
    # Patient assigned 1 if all masks are 1, else 0 (using np.prod to multiply masks).
    # np.product was deprecated and then removed in NumPy 2.0; np.prod is equivalent.
    group_masks.append(np.prod([masks_lists[m][i] for m, i in enumerate(inds)], axis=0))
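
To see why multiplying masks picks out the right patients, here is a small sketch on toy data (four hypothetical patients and two features, not the real dataset). It uses `np.prod`, the current NumPy name for the product function, and compares the result with an explicit logical AND.

```python
import numpy as np
import pandas as pd

# Toy patient data: four hypothetical patients, two features.
toy = pd.DataFrame({'age': [70, 85, 90, 60], 'infarction': [1, 1, 0, 0]})

mask_age_80_plus = (toy['age'] >= 80).to_numpy()
mask_infarction = (toy['infarction'] != 0).to_numpy()

# Multiplying boolean masks gives 1 only where every mask is True...
combined_product = np.prod([mask_age_80_plus, mask_infarction], axis=0)

# ...which selects the same patients as a logical AND.
combined_and = np.logical_and.reduce([mask_age_80_plus, mask_infarction])

print(combined_product, combined_and)
```

Only the second patient is both aged 80 or over and has an infarction, so only that entry is non-zero in the combined mask.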

In [88]:
n_groups = len(group_masks)

n_groups

576

Then convert this into a single array with one group number per patient (10,000 numbers for the test data). This is the complete ordered list of group numbers by patient.

In [89]:
group_numbers_all_patients = np.full(len(data), 0.0)

for i, g in enumerate(group_masks):
    print(i, end='\r')
    # Pick out which patients are in this group:
    patients_inds = np.arange(len(data))[g == True]

    group_numbers_all_patients[patients_inds] = i

575
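
As a possible alternative to looping over all 576 combined masks, the group number could be computed directly from the per-feature masks. The sketch below uses toy masks with the same layout as `masks_lists` (a list of masks per feature) and assumes that every patient is True in exactly one option per feature. The names `toy_masks_lists` and `toy_lens` are made up for the example.

```python
import numpy as np

# Toy masks for four hypothetical patients and two features
# (one feature with 3 options, one with 2).
toy_masks_lists = [
    [np.array([True, False, False, False]),
     np.array([False, True, False, True]),
     np.array([False, False, True, False])],
    [np.array([True, True, False, False]),
     np.array([False, False, True, True])],
]
toy_lens = tuple(len(m) for m in toy_masks_lists)

# For each feature, find which option is True for each patient.
option_inds = tuple(
    np.argmax(np.stack(m), axis=0) for m in toy_masks_lists
)

# Combine the per-feature option indices into one group number per patient.
group_numbers = np.ravel_multi_index(option_inds, toy_lens)

print(group_numbers)
```

One caveat: a patient who matches no option for a feature (for example because of a missing value) would silently be given option 0 by `argmax`, so such cases would need checking separately.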

## Check group sizes

Create a list of how many patients are in each group.

In [90]:
n_patients_per_group = []

for i in range(n_groups):
    n_patients = len(np.where(group_numbers_all_patients == i)[0])
    n_patients_per_group.append(n_patients)

n_patients_per_group = np.array(n_patients_per_group)
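
The same counts could also be obtained in one call with `np.bincount`. A minimal sketch on toy group numbers, where the array and the number of groups are made up for the example:

```python
import numpy as np

# Toy group numbers for six hypothetical patients and four possible groups.
toy_group_numbers = np.array([0.0, 2.0, 2.0, 3.0, 0.0, 2.0])
n_toy_groups = 4

# bincount needs integers; minlength keeps empty groups in the result.
toy_sizes = np.bincount(toy_group_numbers.astype(int), minlength=n_toy_groups)

print(toy_sizes)
```

The `minlength` argument matters here because groups at the end of the numbering may be empty and would otherwise be dropped from the counts.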

If this is the training data, save a copy of how many patients are in each group.

In [92]:
if training == 1:
    df_numbers = pd.DataFrame(
        np.stack(
            (np.arange(len(n_patients_per_group)), 
             n_patients_per_group), axis=-1),
        columns=['mask_number', 'number_of_patients']
    )
    
    df_numbers.to_csv(f'./uncertainty/{save_str}group_sizes.csv', index=False)

How many groups contain zero patients? How many groups contain very few patients?

In [None]:
count_empty = len(np.where(n_patients_per_group == 0)[0])
count_low = len(np.where(
    (n_patients_per_group > 0) &
    (n_patients_per_group < 10) 
)[0])

print(f'Number of empty groups: {count_empty}')
print(f'Number of small groups: {count_low}')

Plot distribution of group sizes for all non-empty groups:

In [None]:
plt.hist(n_patients_per_group[n_patients_per_group > 0], bins=50)
plt.xlabel('Number of patients in group')
plt.ylabel('Number of groups')
plt.show()

## Largest groups

Sort the group labels by group size:

In [None]:
sorted_group_sizes = sorted(n_patients_per_group)
sorted_groups = np.argsort(n_patients_per_group)

Show the properties of the largest groups:

In [None]:
df_largest_groups = pd.DataFrame(
    np.array(inds_lists)[sorted_groups[::-1][:10]],
    columns=[m for m in masks_names]
)

df_largest_groups['size'] = sorted_group_sizes[::-1][:10]

df_largest_groups = df_largest_groups.replace(mask_str_dict)

df_largest_groups

## Empty groups

Show the properties of the size-zero groups:

In [None]:
inds = np.where(n_patients_per_group == 0)[0]

df_zero_groups = pd.DataFrame(
    np.array(inds_lists)[inds],
    columns=[m for m in masks_names]
)

df_zero_groups['size'] = n_patients_per_group[inds]

df_zero_groups = df_zero_groups.replace(mask_str_dict)

df_zero_groups

Show the zero-size groups that remain after excluding two kinds of group: those with onset-to-scan time above four hours, and those with both a precisely known onset time and onset during sleep. For the groups that remain, there is no obvious logical reason why they should have zero size.

In [None]:
inds = np.where(n_patients_per_group == 0)[0]

df_zero_groups = pd.DataFrame(
    np.array(inds_lists)[inds],
    columns=[m for m in masks_names]
)

df_zero_groups['size'] = n_patients_per_group[inds]

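# Keep groups that are NOT (onset-to-scan above 4 hours, or
# both precise onset time and onset during sleep):
z_mask = ~(
    (df_zero_groups['onset_scan'] == 1) |
    ((df_zero_groups['precise'] == 1) &
     (df_zero_groups['sleep'] == 1))
)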

df_zero_groups = df_zero_groups[z_mask]


df_zero_groups = df_zero_groups.replace(mask_str_dict)

df_zero_groups

How many groups have precise=yes and sleep=yes?

In [None]:
df_groups = pd.DataFrame(
    np.array(inds_lists)[sorted_groups],
    columns=[m for m in masks_names]
)

df_groups['size'] = sorted_group_sizes

ps_mask = (
    (df_groups['precise'] == 1) &
    (df_groups['sleep'] == 1)
)

df_groups = df_groups[ps_mask]

df_groups = df_groups.replace(mask_str_dict)

len(df_groups)

## Small groups

Show the properties of the small but non-zero groups:

In [None]:
inds = np.where(
    (n_patients_per_group > 0) &
    (n_patients_per_group < 10) 
)[0]

df_small_groups = pd.DataFrame(
    np.array(inds_lists)[inds],
    columns=[m for m in masks_names]
)

df_small_groups['size'] = n_patients_per_group[inds]

df_small_groups = df_small_groups.replace(mask_str_dict)

df_small_groups

How many have infarction=No?

In [None]:
len(df_small_groups[df_small_groups['infarction'] == 'No'])

Only show small groups with infarction:

In [None]:
df_small_groups[df_small_groups['infarction'] == 'Yes']

## Predict probabilities of thrombolysis

The following cell picks out each group of patients in turn. For each group, the machine learning model is used to predict their probabilities of thrombolysis. 

In [None]:
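# The cells below use y_test, so stop here unless `data` holds the test patients.
if training == 1 or all_patients == 1:
    raise RuntimeError('Reload the test data before predicting probabilities.')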

In [None]:
results_arr = np.full((3, len(data)), 0.0)

for i in range(n_groups):
    print(i, end='\r')

    # Which patients are in this group?
    patients_inds = np.where(group_numbers_all_patients == i)[0]

    # Pick out the data for the model for these patients.
    # patients_inds are row positions, so use .iloc rather than .loc:
    patients_here = data.iloc[patients_inds]
    y_here = y_test.iloc[patients_inds].values

    # What are their predicted probabilities?
    if len(patients_here) > 0:
        # Only store them to the nearest %, i.e. round to 2 d.p.
        probs = np.round(model.predict_proba(patients_here)[:,1], 2)
    else:
        probs = np.array([])

    # Save the results to the big array:
    results_arr[0, patients_inds] = i
    results_arr[1, patients_inds] = probs
    results_arr[2, patients_inds] = y_here
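
If the model scores each row independently of the others, which is an assumption about the trained model and not something checked here, the probabilities could also be predicted in a single call for all patients instead of group by group. The sketch below illustrates the idea with a hypothetical stand-in, `ToyModel`, not the real trained model.

```python
import numpy as np
import pandas as pd


class ToyModel:
    """Hypothetical stand-in for the trained model (not the real one)."""

    def predict_proba(self, X):
        # Simple logistic score; each row is scored independently.
        p = 1.0 / (1.0 + np.exp(-(X['a'] - X['b']).to_numpy()))
        return np.column_stack([1.0 - p, p])


toy_data = pd.DataFrame({'a': [0.0, 1.0, 2.0, 3.0], 'b': [1.0, 1.0, 0.0, 0.0]})
toy_groups = np.array([0, 1, 0, 1])
toy_model = ToyModel()

# One call for every patient...
probs_all = np.round(toy_model.predict_proba(toy_data)[:, 1], 2)

# ...gives the same values as looping over the groups.
probs_loop = np.zeros(len(toy_data))
for g in np.unique(toy_groups):
    rows = np.where(toy_groups == g)[0]
    probs_loop[rows] = np.round(
        toy_model.predict_proba(toy_data.iloc[rows])[:, 1], 2)

print(np.array_equal(probs_all, probs_loop))
```

The group-by-group loop in the cell above remains useful when results are wanted per group as they are computed.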

Convert the results array to a dataframe so that we can label the columns:

In [None]:
df_groups_probs = pd.DataFrame(
    results_arr.T,
    columns=['mask_number', 'predicted_probs', 'thrombolysis']
)

df_groups_probs['mask_number'] = df_groups_probs['mask_number'].astype(int)
df_groups_probs['thrombolysis'] = df_groups_probs['thrombolysis'].astype(int)

df_groups_probs

Save this dataframe to file:

In [None]:
df_groups_probs.to_csv(f'./uncertainty/{save_str}masks_probabilities.csv', index=False)