This is a kernel under construction.

The goal is to use train data preprocessed in this kernel

https://www.kaggle.com/spacelx/2020-pe-preprocessing-train-data

and this kernel

https://www.kaggle.com/spacelx/2020-pe-preprocessing-train-table

to train basic convolutional neural networks in Pytorch to label pulmonary embolism in CT scans.

The train data has been preprocessed to extract lung features and has been resampled such that each study is associated with a 3D scan with 20 slices of 128x128 pixels each. The full preprocessed dataset can be found here

https://www.kaggle.com/spacelx/2020pe-preprocessed-train-data

In the current state this kernel gets the data ready and sets up a CNN intended to only classify 3D scans into (negative / indeterminate / positive) PE exams. A second CNN of the same structure is then trained to classify PE-positive studies into (acute / chronic / acute & chronic) PE exams.

Examining the confusion matrices from both stages, it's clear that none of this actually works at the moment. This is probably due in part to the preprocessing simplifying the train dataset quite a bit, and in part to the simplicity of the model used.

If anyone has any ideas for improvement, feel free to leave a comment! Would be nice if we could turn this into a somewhat-working-but-certainly-not-perfect kernel with a simple approach for fresh starters!

# Setup

Let's start by importing some useful packages.

In [None]:
import numpy as np
import pandas as pd

import time

# path management
from pathlib import Path

# progress bars
from tqdm import tqdm

# plotting
import matplotlib.pyplot as plt
import seaborn as sn

# pytorch
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

import torch.nn as nn
import torch.nn.functional as F

from torch.optim import SGD, Adam

# zipfile management
import zipfile

Saving paths of base data into variables

In [None]:
comp_data_path = Path('../input/rsna-str-pulmonary-embolism-detection')
prep_data_path = Path('../input/2020pe-preprocessed-train-data')

Some settings - the preprocessed dataset is 20 slices of size 128x128 for each study, showing only the lung sections.

The preprocessing was done in 

https://www.kaggle.com/spacelx/2020-pe-preprocessing-train-data

which is based on this kernel

https://www.kaggle.com/allunia/pulmonary-dicom-preprocessing,

and the complete preprocessed dataset can be found at

https://www.kaggle.com/spacelx/2020pe-preprocessed-train-data.

In [None]:
# set sizing
NSCANS = 20
NPX = 128

# set device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Here we load an example study, just to show what our preprocessed data looks like.

In [None]:
allfiles = list((prep_data_path / f'proc_{NSCANS}_{NPX}_train').glob('*_data.npy'))
sample_file = allfiles[0]
sample_scans = np.load(str(sample_file), allow_pickle=True)

fig, ax = plt.subplots(5, 4, figsize=(20,20))
ax = ax.flatten()
for m in range(NSCANS):
    ax[m].imshow(sample_scans[m], cmap='Blues_r')

Now reading in the preprocessed train data table, which includes one row per study (not per image as in the original version). Preprocessing was done in this kernel

https://www.kaggle.com/spacelx/2020-pe-preprocessing-train-table.

There are 20 columns (one for each resampled slice) noting whether there was PE present in the base data or not.

In [None]:
train = pd.read_csv(prep_data_path / 'train_proc.csv', index_col=0)
train.head()

# Dataset model

Just quickly setting up a Pytorch dataset...

In [None]:
class PE2020Dataset(Dataset):
    def __init__(self, scans, labels, datapath):
        self.scans = scans
        self.labels = labels
        self.datapath = Path(datapath)

    def __len__(self):
        return len(self.scans)

    def __getitem__(self, i):
        file = self.datapath / (self.scans[i] + '_data.npy')
        x = torch.tensor(np.load(str(file)), dtype=torch.float).to(device)
        y = torch.tensor(self.labels[i]).to(device)
        return x, y

# Model

We define a basic CNN with two convolutional and two fully connected linear layers. We will use this several times later on.
The constructor parameter <code>NOUT</code> sets the number of output classes.

In [None]:
class PE2020Net(nn.Module):
    def __init__(self, NOUT):
        super(PE2020Net, self).__init__()
        self.conv1 = nn.Conv2d(NSCANS, 32, 5)
        self.conv2 = nn.Conv2d(32, 64, 5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(64*29*29, 500)
        #self.fc2 = nn.Linear(5000, 5000)
        # use NOUT so the number of output classes follows the constructor argument
        self.fc3 = nn.Linear(500, NOUT)

    def forward(self, x):
        x = x.view(-1, NSCANS, NPX, NPX)
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = x.view(-1, 64*29*29)
        x = self.fc1(x)
        x = F.dropout(x, training=self.training)
        x = F.relu(x)
        #x = self.fc2(x)
        #x = F.relu(x)
        x = self.fc3(x)
        x = F.log_softmax(x, dim=-1)
        return x
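
The input size <code>64*29*29</code> of the first linear layer follows from the layer arithmetic: each 5x5 convolution without padding shrinks the image by 4 pixels, and each 2x2 max pooling halves it. Here is a small sketch that reproduces the number; the helper <code>conv_out</code> is just an illustration and not part of the model.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

NPX = 128  # slice width/height used in this kernel

s = conv_out(NPX, kernel=5)          # conv1, 5x5, no padding -> 124
s = conv_out(s, kernel=2, stride=2)  # max_pool2d(x, 2)       -> 62
s = conv_out(s, kernel=5)            # conv2, 5x5, no padding -> 58
s = conv_out(s, kernel=2, stride=2)  # max_pool2d(x, 2)       -> 29

flat_features = 64 * s * s  # conv2 has 64 output channels
print(s, flat_features)     # 29 53824
```

If <code>NPX</code> or the kernel sizes change, the size of <code>fc1</code> has to change accordingly.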

# First stage - classification into (PE / no PE / indeterminate)

Set up a dataset with all scans and labels <code>positive_exam_for_pe</code>,  <code>negative_exam_for_pe</code> and  <code>indeterminate</code>, and create DataLoaders.

In [None]:
# separate scans from labels
scans = train['dcmpath']
all_labels = train.drop(labels='dcmpath', axis=1).astype(int)
labels = all_labels[['negative_exam_for_pe', 'indeterminate']].copy()
labels['positive_exam_for_pe'] = 1 - labels[['negative_exam_for_pe', 'indeterminate']].sum(axis=1)

# keep label names for later
label_names = labels.columns.tolist()
# get label index for all studies
labels = np.where(labels.values)[1]

# split into train and validation and create datasets
tmp = train.sample(len(train)).index.values
idx_train = tmp[:int(0.8*len(tmp))]
idx_valid = tmp[int(0.8*len(tmp)):]
trainset = PE2020Dataset(scans.loc[idx_train].values, labels[idx_train], prep_data_path / f'proc_{NSCANS}_{NPX}_train')
validset = PE2020Dataset(scans.loc[idx_valid].values, labels[idx_valid], prep_data_path / f'proc_{NSCANS}_{NPX}_train')

# instantiate DataLoaders
BATCHSIZE = 20
trainloader = DataLoader(trainset, batch_size=BATCHSIZE, shuffle=True)
validloader = DataLoader(validset, batch_size=BATCHSIZE, shuffle=True)
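
Since <code>indeterminate</code> studies are rare, a purely random 80/20 split can leave very few of them in the validation set. One possible alternative (not what this kernel does) is a stratified split that shuffles within each class. Below is a minimal sketch in plain Python; the function name <code>stratified_split</code> and the toy label counts are made up for illustration, and the returned values are positions in the label list, not DataFrame index values.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_frac=0.2, seed=0):
    """Return (train_idx, valid_idx) with roughly the same class shares in both parts."""
    rng = random.Random(seed)
    # group sample positions by class
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train_idx, valid_idx = [], []
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)
        # take valid_frac of each class for validation
        n_valid = round(valid_frac * len(idx))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return train_idx, valid_idx

# hypothetical toy labels: 60 negative (0), 4 indeterminate (1), 36 positive (2)
toy_labels = [0] * 60 + [1] * 4 + [2] * 36
train_idx, valid_idx = stratified_split(toy_labels)
print(len(train_idx), len(valid_idx))
```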

Instantiate model and train, validate after each epoch.

In [None]:
model = PE2020Net(3).to(device)
optimiser = Adam(model.parameters(), lr=0.005)
epochs = 20
accuracy = {}
accuracy['train'] = []
accuracy['valid'] = []

for epoch in range(epochs):
    
    # training
    correct = 0
    losses = torch.tensor([])
    for batch_data in tqdm(trainloader):
        X, y = batch_data
        # zero gradients
        model.zero_grad()
        # forward pass
        output = model(X)
        
        # count accurate predictions
        predicted = torch.max(output.detach(),1)[1]
        correct += (predicted == y).sum()
        
        # calculate loss and keep it for later
        loss = F.cross_entropy(output, y)
        # .item() gives a plain Python float, so the stored losses stay on the CPU
        losses = torch.cat((losses, torch.tensor([loss.item()])), 0)
        
        # backward pass
        loss.backward()
        # update parameters
        optimiser.step()
    
    # calculate mean loss and accuracy and print
    # (divide by the true sample count; the last batch may be smaller than BATCHSIZE)
    meanacc = float(correct) / len(trainloader.dataset)
    meanloss = float(losses.mean())
    print('Epoch:', epoch, 'Loss:', meanloss, 'Accuracy:', meanacc)
    accuracy['train'].append(meanacc)
    # putting this in to keep the console clean
    time.sleep(0.5)
    
    
    
    # validation (eval mode switches dropout off)
    model.eval()
    correct = 0
    for batch_data in tqdm(validloader):
        # forward pass
        X, y = batch_data
        with torch.no_grad():
            output = model(X)
        # get number of accurate predictions
        predicted = torch.max(output,1)[1]
        correct += (predicted == y).sum()
    # back to training mode for the next epoch
    model.train()
    # calculate mean accuracy and print (divide by the true sample count)
    meanacc = float(correct) / len(validloader.dataset)
    print('Validation epoch:', epoch, 'Accuracy:', meanacc)
    accuracy['valid'].append(meanacc)
    # putting this in to keep the console clean
    time.sleep(0.5)
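
A note on the accuracy denominator: the number of batches times the batch size only equals the number of samples when the last batch is full. Otherwise the product is too large and the reported accuracy comes out too low. Here is a small sketch with made-up numbers; <code>accuracy_denominators</code> is a hypothetical helper, and it assumes the default <code>drop_last=False</code>, in which case the number of batches is the sample count divided by the batch size, rounded up.

```python
import math

def accuracy_denominators(n_samples, batch_size):
    """Return (batches * batch_size, n_samples) for a loader that keeps the last partial batch."""
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches * batch_size, n_samples

# hypothetical numbers: 101 samples in batches of 20 -> 6 batches, the last with 1 sample
padded, exact = accuracy_denominators(101, 20)
correct = 61
print('batches * batch size:', padded, '->', correct / padded)
print('true sample count:   ', exact, '->', correct / exact)
```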

Plotting accuracy per epoch here - we don't see much improvement. Our model gets a bit over 60% right, which doesn't seem totally bad but also not really awesome. But that is to be expected; we boiled the dataset down by quite a bit and our model isn't really advanced either.

In [None]:
plt.plot(accuracy['train'],label="Training Accuracy")
plt.plot(accuracy['valid'],label="Validation Accuracy")
plt.xlabel('No. of Epochs')
plt.ylabel('Accuracy')
plt.legend(frameon=False)
plt.show()

Let's quickly save the stage 1 model

In [None]:
torch.save(model, 'stg1_model.pt')
np.save('stg1_label_names', label_names)

Now we'll check our predictions with a confusion matrix

In [None]:
# switch off dropout for evaluation and collect true and predicted labels
model.eval()
all_y = np.array([])
all_y_pred = np.array([])
for x, y in validloader:
    with torch.no_grad():
        y_pred = model(x)
    all_y = np.append(all_y, y.tolist())
    all_y_pred = np.append(all_y_pred, torch.max(y_pred,1)[1].tolist())
    
df = pd.DataFrame(np.array([all_y, all_y_pred]).T, columns=['y','y_pred'])
confusion_matrix = pd.crosstab(df['y'], df['y_pred'], rownames=['y'], colnames=['y_pred'])

sn.heatmap(confusion_matrix, annot=True)
plt.show()

A few things to unpack here. First of all, our model never predicts <code>indeterminate</code> - there are barely any examples of that kind in the set. Secondly, our model pretty much always predicts <code>negative_exam_for_pe</code>, and the 60-something percent accuracy we get simply reflects the share of PE-negative studies in the dataset (around 60%).
So all in all we can safely say our model doesn't predict anything useful for now...
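
To make the comparison explicit, it helps to compute the accuracy of a "model" that always predicts the most frequent class. Below is a minimal sketch; the function <code>majority_baseline</code> and the toy counts are made up for illustration (chosen to mirror the roughly 60% of negative studies mentioned above), so the real numbers have to come from the actual label column.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# hypothetical toy labels, NOT the real class counts
toy = (['negative_exam_for_pe'] * 60
       + ['indeterminate'] * 4
       + ['positive_exam_for_pe'] * 36)

label, acc = majority_baseline(toy)
print(label, acc)
```

A classifier is only learning something if its validation accuracy clearly beats this baseline.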

# Second stage - for PE samples, classify into (acute / chronic / acute & chronic)

Let's try something else - now we use the same model architecture to classify PE-positive studies by PE type, so we train on PE-positive studies only. Apart from that, the steps are the same as in the previous stage.

In [None]:
# choose only PE-positive samples
train_pe_positive = train[(train['negative_exam_for_pe'] == 0) &
                          (train['indeterminate'] == 0)]

# separate scans from labels
scans = train_pe_positive['dcmpath']
all_labels = train_pe_positive.drop(labels='dcmpath', axis=1).astype(int)
labels = all_labels[['acute_pe', 'chronic_pe', 'acute_and_chronic_pe']].copy()

# keep label names for later
label_names = labels.columns.tolist()
# get label index for all studies
labels['label'] = np.where(labels.values)[1]

# split into train and validation and create datasets
tmp = train_pe_positive.sample(len(train_pe_positive)).index.values
idx_train = tmp[:int(0.8*len(tmp))]
idx_valid = tmp[int(0.8*len(tmp)):]
trainset = PE2020Dataset(scans.loc[idx_train].values, labels.loc[idx_train, 'label'].values,
                         prep_data_path / f'proc_{NSCANS}_{NPX}_train')
validset = PE2020Dataset(scans.loc[idx_valid].values, labels.loc[idx_valid, 'label'].values,
                         prep_data_path / f'proc_{NSCANS}_{NPX}_train')

# instantiate DataLoaders
BATCHSIZE = 20
trainloader = DataLoader(trainset, batch_size=BATCHSIZE, shuffle=True)
validloader = DataLoader(validset, batch_size=BATCHSIZE, shuffle=True)

We set up the same model as before and train it, this time on PE-positive studies only...

In [None]:
model = PE2020Net(3).to(device)
optimiser = Adam(model.parameters(), lr=0.005)
epochs = 20
accuracy = {}
accuracy['train'] = []
accuracy['valid'] = []

for epoch in range(epochs):
    
    # training
    correct = 0
    losses = torch.tensor([])
    for batch_data in tqdm(trainloader):
        X, y = batch_data
        # zero gradients
        model.zero_grad()
        # forward pass
        output = model(X)
        
        # count accurate predictions
        predicted = torch.max(output.detach(),1)[1]
        correct += (predicted == y).sum()
        
        # calculate loss and keep it for later
        loss = F.cross_entropy(output, y)
        # .item() gives a plain Python float, so the stored losses stay on the CPU
        losses = torch.cat((losses, torch.tensor([loss.item()])), 0)
        
        # backward pass
        loss.backward()
        # update parameters
        optimiser.step()
    
    # calculate mean loss and accuracy and print
    # (divide by the true sample count; the last batch may be smaller than BATCHSIZE)
    meanacc = float(correct) / len(trainloader.dataset)
    meanloss = float(losses.mean())
    print('Epoch:', epoch, 'Loss:', meanloss, 'Accuracy:', meanacc)
    accuracy['train'].append(meanacc)
    # putting this in to keep the console clean
    time.sleep(0.5)
    
    
    
    # validation (eval mode switches dropout off)
    model.eval()
    correct = 0
    for batch_data in tqdm(validloader):
        # forward pass
        X, y = batch_data
        with torch.no_grad():
            output = model(X)
        # get number of accurate predictions
        predicted = torch.max(output,1)[1]
        correct += (predicted == y).sum()
    # back to training mode for the next epoch
    model.train()
    # calculate mean accuracy and print (divide by the true sample count)
    meanacc = float(correct) / len(validloader.dataset)
    print('Validation epoch:', epoch, 'Accuracy:', meanacc)
    accuracy['valid'].append(meanacc)
    # putting this in to keep the console clean
    time.sleep(0.5)

We plot the accuracy... looks like we're overfitting a bit, training accuracy increases while validation accuracy doesn't...

In [None]:
plt.plot(accuracy['train'],label="Training Accuracy")
plt.plot(accuracy['valid'],label="Validation Accuracy")
plt.xlabel('No. of Epochs')
plt.ylabel('Accuracy')
plt.legend(frameon=False)
plt.show()

... and save the stage 2 model.

In [None]:
torch.save(model, 'stg2_model.pt')
np.save('stg2_label_names', label_names)

Let's plot a confusion matrix again

In [None]:
# switch off dropout for evaluation and collect true and predicted labels
model.eval()
all_y = np.array([])
all_y_pred = np.array([])
for x, y in validloader:
    with torch.no_grad():
        y_pred = model(x)
    all_y = np.append(all_y, y.tolist())
    all_y_pred = np.append(all_y_pred, torch.max(y_pred,1)[1].tolist())
    
df = pd.DataFrame(np.array([all_y, all_y_pred]).T, columns=['y','y_pred'])
confusion_matrix = pd.crosstab(df['y'], df['y_pred'], rownames=['y'], colnames=['y_pred'])

sn.heatmap(confusion_matrix, annot=True)
plt.show()

In [None]:
label_names

Looking at this, it is clear that nearly all PE-positive studies in the validation set (and so likely also in the training set) are acute, while nearly none are chronic or acute & chronic - we have a highly imbalanced dataset. No wonder our model has a hard time identifying chronic and acute & chronic studies. But then remembering stage 1 above, the model didn't fare much better even though the classes there were less imbalanced...
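
One common way to counter class imbalance (not tried in this kernel) is to weight the loss by inverse class frequency, so that mistakes on rare classes cost more. Here is a minimal sketch in plain Python; the function <code>inverse_frequency_weights</code> and the toy label counts are made up for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels, n_classes):
    """Weight for class c is n / (n_classes * count_c); classes without samples get 0.0."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (n_classes * counts[c]) if counts[c] else 0.0
            for c in range(n_classes)]

# hypothetical toy labels: 90 acute (0), 6 chronic (1), 4 acute & chronic (2)
toy_labels = [0] * 90 + [1] * 6 + [2] * 4
weights = inverse_frequency_weights(toy_labels, 3)
print(weights)
```

Such weights could then be passed as a float tensor on the model's device via the <code>weight</code> argument of <code>F.cross_entropy</code>. Whether that actually helps here would have to be tested.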