# Investigating the output of neural net embedding subnets

## Aims

* To investigate the output of the hospital and clinical subnets of the embedding neural network.

* To examine the link between hospital subnet output and use of thrombolysis in hospitals - both the actual thrombolysis use, and the predicted thrombolysis use of a 10k set of patients passed through all hospital models.

* To examine the link between the patient clinical feature subnet output and the use of thrombolysis, and the link between patient features and the clinical feature subnet output

## Neural Network structure

The model contains three subnets that take portions of the data. The output of these subnets is an n-dimensional vector. In this case the output is a 1D vector, that is each subnet is reduced to a single value output. The subnets created are for:

1. *Patient clinical data*: Age, gender, ethnicity, disability before stroke, stroke scale data. Pass through one hidden layer (with 2x neurons as input features) and then to single neuron with sigmoid activation.

2. *Pathway process data*: Times of arrival and scan, time of day, day of week. Pass through one hidden layer (with 2x neurons as input features) and then to single neuron with sigmoid activation.

3. *Hospital ID* (one-hot encoded): Connect input directly to single neuron with sigmoid activation.

The outputs of the three subnet outputs are then passed to a single neuron with sigmoid activation for final output.

![](./images/embedding_1d_with_subnet_output.png)

## Fitting of model

The model has been pre-trained (see the notebook *Modular TensorFlow model with 1D embedding - Train and save model for 10k patient subset*)

## Load libraries

In [1]:
# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# sklearn for pre-processing
from sklearn.preprocessing import MinMaxScaler

# TensorFlow api model
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
from tensorflow.keras.losses import binary_crossentropy

2022-06-12 17:24:44.772019: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


## Define function to scale data

Scale input data 0-1 (MinMax scaling).

In [2]:
def scale_data(X_train, X_test):
    """Scale data 0-1 based on min and max in training set"""
    
    # Initialise a new scaling object for normalising input data
    sc = MinMaxScaler()

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_sc = sc.transform(X_train)
    test_sc = sc.transform(X_test)
    
    return train_sc, test_sc

## Get model outputs for test data

Get prediction probabilities for the test 10k training set. Training data is used only to scale test set X values.

This prediction run is used to check model, and get accuracy.

In [None]:
# Get data subgroups
subgroups = pd.read_csv('../data/subnet.csv', index_col='Item')
# Get list of clinical items
clinical_subgroup = subgroups.loc[subgroups['Subnet']=='clinical']
clinical_subgroup = list(clinical_subgroup.index)
# Get list of pathway items
pathway_subgroup = subgroups.loc[subgroups['Subnet']=='pathway']
pathway_subgroup = list(pathway_subgroup.index)
# Get list of hospital items
hospital_subgroup = subgroups.loc[subgroups['Subnet']=='hospital']
hospital_subgroup = list(hospital_subgroup.index)
  
# Load data
train = pd.read_csv(f'../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv(f'../data/10k_training_test/cohort_10000_test.csv')

# Limit subgroups to fields present
# Due to improved data selection in this repository
clinical_subgroup = [
    subgroup for subgroup in clinical_subgroup if subgroup in list(train)]
pathway_subgroup = [
    subgroup for subgroup in pathway_subgroup if subgroup in list(train)]

# OneHot encode stroke team
coded = pd.get_dummies(train['StrokeTeam'])
train = pd.concat([train, coded], axis=1)
train.drop('StrokeTeam', inplace=True, axis=1)
coded = pd.get_dummies(test['StrokeTeam'])
test = pd.concat([test, coded], axis=1)
test.drop('StrokeTeam', inplace=True, axis=1)

# Split into X, y
X_train_df = train.drop('S2Thrombolysis',axis=1) 
y_train_df = train['S2Thrombolysis']
X_test_df = test.drop('S2Thrombolysis',axis=1) 
y_test_df = test['S2Thrombolysis'] 

# Split train and test data by subgroups
X_train_patients = X_train_df[clinical_subgroup]
X_test_patients = X_test_df[clinical_subgroup]
X_train_pathway = X_train_df[pathway_subgroup]
X_test_pathway = X_test_df[pathway_subgroup]
X_train_hospitals = X_train_df[hospital_subgroup]
X_test_hospitals = X_test_df[hospital_subgroup]

# Convert to NumPy
X_train = X_train_df.values
X_test = X_test_df.values
y_train = y_train_df.values
y_test = y_test_df.values

# Scale data
X_train_patients_sc, X_test_patients_sc = \
    scale_data(X_train_patients, X_test_patients)

X_train_pathway_sc, X_test_pathway_sc = \
    scale_data(X_train_pathway, X_test_pathway)

X_train_hospitals_sc, X_test_hospitals_sc = \
    scale_data(X_train_hospitals, X_test_hospitals)

# Load model
path = './saved_models/1d_for_10k/'
filename = f'{path}10k_model.h5'
model = keras.models.load_model(filename)

# Test model
probability = model.predict(
    [X_test_patients_sc, X_test_pathway_sc, X_test_hospitals_sc])
own_unit_prob = probability
y_pred_test = probability >= 0.5
y_pred_test = y_pred_test.flatten()
accuracy_test = np.mean(y_pred_test == y_test)

In [4]:
print(f'Accuracy test {accuracy_test:0.3f}')

Accuracy test 0.852


## Get predictions for thrombolysis use of 10k set of patients at each hospital

Here we ask the counter-factual question - "what treatment would a patient be expected to receive at each of the 132 hospitals?".

Hospital is one-hot encoded as input to the hospital subnet. To make a prediction of treatment at different hospitals we change the one-hot encoding of the hospital when making prediction.

For each hospital we pass through the 10k test set, and record the proportion of the patients receiving thrombolysis at that hospital.

In [5]:
# Get number of hospitals
num_hospitals = len(X_test_hospitals_sc[0])
# Create test array for changing hospital ID
X_hospitals_alter = X_test_hospitals_sc.copy()
# Get classification for all patients at all hospials
patient_results = []

# Loop through setting hospital
hospital_results = []
for hosp in range(num_hospitals):
    # Set all hospitals to zero
    X_hospitals_alter[:,:] = 0
    # Set test hospital to 1
    X_hospitals_alter[:,hosp] = 1
    # Get probability of thrombolysis
    probability = model.predict(
        [X_test_patients_sc, X_test_pathway_sc, X_hospitals_alter])
    # Classify
    classified = probability >= 0.5
    patient_results.append(classified)

In [6]:
# Convert patient results to NumPy array
patient_results_np = 1 * np.array(patient_results).reshape(10000, -1)
# Put patient results in DataFrame
cohort_results_all_hospitals = pd.DataFrame(
    patient_results_np, columns=hospital_subgroup, index=test.index)

In [7]:
cohort_results_all_hospitals

Unnamed: 0,AGNOF1041H,AKCGO9726K,AOBTM3098N,APXEE8191H,ATDID5461S,BBXPQ0212O,BICAW1125K,BQZGT7491V,BXXZS5063A,CNBGF2713O,...,XKAWN3771U,XPABC1435F,XQAGA4299B,XWUBX0795L,YEXCH8391J,YPKYH1768F,YQMZV4284N,ZBVSO0975W,ZHCLE1578P,ZRRCV7012C
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,1,0,1,0,0
1,0,0,0,1,0,1,1,1,1,0,...,1,1,0,1,0,0,1,0,0,1
2,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,1,0,1,0
3,0,0,1,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,1
4,0,1,0,0,0,1,1,1,0,1,...,0,0,1,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,1,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,1
9997,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0


## Get subnet outpus

In [8]:
cohort_subnet_outputs = pd.DataFrame()

### Get hospital subnet outout

In [9]:
layer_name = 'hospital_encode'
hospital_encode_model = keras.Model(
    inputs=model.input,outputs=model.get_layer(layer_name).output)
hospital_encode_output = hospital_encode_model([
    X_test_patients_sc, X_test_pathway_sc, X_test_hospitals_sc])
hospital_encode_output = np.array(hospital_encode_output).flatten()
cohort_subnet_outputs['hospital_subnet'] = hospital_encode_output

### Get clinical subnet output

In [10]:
layer_name = 'patient_encode'
patient_encode_model = keras.Model(
    inputs=model.input,outputs=model.get_layer(layer_name).output)
patient_encode_output = patient_encode_model([
    X_test_patients_sc, X_test_pathway_sc, X_test_hospitals_sc])
patient_encode_output = np.array(patient_encode_output).flatten()
cohort_subnet_outputs['clinical_subnet'] = patient_encode_output

### Get pathway subnet output

In [11]:
layer_name = 'pathway_encode'
patient_encode_model = keras.Model(
    inputs=model.input,outputs=model.get_layer(layer_name).output)
pathway_encode_output = patient_encode_model([
    X_test_patients_sc, X_test_pathway_sc, X_test_hospitals_sc])
pathway_encode_output = np.array(pathway_encode_output).flatten()
cohort_subnet_outputs['pathway_subnet'] = pathway_encode_output

### Add actual thrombolysis

In [12]:
cohort_subnet_outputs['Thrombolysis'] = test['S2Thrombolysis'] 

In [13]:
cohort_subnet_outputs

Unnamed: 0,hospital_subnet,clinical_subnet,pathway_subnet,Thrombolysis
0,0.676831,0.009032,0.299687,0
1,0.469118,0.115673,0.977729,0
2,0.703789,0.852798,0.983000,1
3,0.410166,0.807462,0.027178,1
4,0.468869,0.379883,0.154912,0
...,...,...,...,...
9995,0.882832,0.217620,0.284438,0
9996,0.999913,0.337073,0.370088,0
9997,0.331239,0.802803,0.262890,1
9998,0.468869,0.581135,0.071124,0


## Scale all subnet output 0-1 with 1 = more likely to thrombolyse

In [14]:
cohort_subnet_outputs.groupby('Thrombolysis').mean()

Unnamed: 0_level_0,hospital_subnet,clinical_subnet,pathway_subnet
Thrombolysis,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.588916,0.428909,0.389943
1,0.53297,0.801131,0.133646


In [15]:
cohort_subnet_outputs.describe()

Unnamed: 0,hospital_subnet,clinical_subnet,pathway_subnet,Thrombolysis
count,10000.0,10000.0,10000.0,10000.0
mean,0.572378,0.538938,0.314181,0.2956
std,0.209576,0.306723,0.302895,0.456335
min,0.005932,0.002952,0.008583,0.0
25%,0.434379,0.271398,0.067754,0.0
50%,0.583605,0.605153,0.186311,0.0
75%,0.714548,0.810452,0.473568,1.0
max,0.999999,0.989428,1.0,1.0


In [16]:
mean_value_by_thrombolyse = cohort_subnet_outputs.groupby('Thrombolysis').mean()

for col in mean_value_by_thrombolyse:
    
    # Multipy values by -1 if thrombolysed patients have lower value
    reverse_values = \
        mean_value_by_thrombolyse[col][0] > mean_value_by_thrombolyse[col][1]
    if reverse_values:
        cohort_subnet_outputs[col] *= -1
    
    # Scale data
    min_value = cohort_subnet_outputs[col].min()
    max_value = cohort_subnet_outputs[col].max()
    diff = max_value - min_value
    scaled_values = (cohort_subnet_outputs[col] - min_value) / diff
    cohort_subnet_outputs[col] = scaled_values

In [17]:
cohort_subnet_outputs.groupby('Thrombolysis').mean()

Unnamed: 0_level_0,hospital_subnet,clinical_subnet,pathway_subnet
Thrombolysis,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.413536,0.431797,0.615339
1,0.469817,0.809122,0.873854


In [18]:
cohort_subnet_outputs.describe()

Unnamed: 0,hospital_subnet,clinical_subnet,pathway_subnet,Thrombolysis
count,10000.0,10000.0,10000.0,10000.0
mean,0.430171,0.543334,0.691757,0.2956
std,0.21083,0.310928,0.305517,0.456335
min,0.0,0.0,0.0,0.0
25%,0.287155,0.272127,0.53099,0.0
50%,0.41888,0.610457,0.820734,0.0
75%,0.568996,0.818571,0.940317,1.0
max,1.0,1.0,1.0,1.0


## Save Data Frames

In [19]:
# Add original proability
cohort_subnet_outputs['predicted_prob'] = own_unit_prob

cohort_results_all_hospitals.to_csv(
    './predictions/cohort_results_all_hospitals.csv', index=False)

cohort_subnet_outputs.to_csv(
    './predictions/cohort_results_subnet_output.csv', index=False)

X_test_patients.to_csv(
    './predictions/cohort_clinical_input.csv', index=False)

X_test_pathway.to_csv(
    './predictions/cohort_pathway_input.csv', index=False)

X_test_hospitals.to_csv(
    './predictions/cohort_hospital_input.csv', index=False)