## Continuous IBDM

This notebook shows how to do image based data mining against a continuous outcome. The outcome can be anything you like, provided it is a continuous variable; examples include weight loss, muscle area loss and feeding tube duration.

In [None]:
## Import libraries and set up

import os
import time
import os.path
import numpy as np
try:
    from tqdm.notebook import tqdm
    haveTQDM = True
except ImportError:
    ## tqdm is optional, we just lose the progress bar without it
    haveTQDM = False
import SimpleITK as sitk
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Make the notebook use full width of display
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))



We will also download the HNSCC data and unzip it.

# NOTE: you might not need to do this if you end up on the same runtime as in the registration example, check the file browser on the left

In [None]:
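# -O sets the output filename, otherwise wget keeps the "?dl=0" suffix and unzip cannot find the file
!wget -O HNSCC_data.zip "https://www.dropbox.com/s/msb4uiq8rw79rqe/HNSCC_data.zip?dl=0"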
!unzip HNSCC_data.zip
!rm HNSCC_data.zip

# Getting started

I'm going to assume you've already run the binary IBDM notebook, and are familiar with what's going on there. As a result, we will skip straight to loading the same patients as we had in the binary mining, by looking at and filtering the clinical data. If any of this doesn't make sense, refer back to the previous notebook.

In [None]:
clinicalDataPath = "/content/HNSCC_data/clinicalData.csv"

clinicalData = pd.read_csv(clinicalDataPath)

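# Keep patients whose treatment summary starts with CCRT and has no additional ("+") treatment
treatmentSummary = clinicalData["Oncologic Treatment Summary"]
isCCRT = treatmentSummary.str.contains(r'^CCRT', regex=True, na=False)
hasAdditionalTreatment = treatmentSummary.str.contains(r'\+', regex=True, na=False)
ccrtOnlyPatients = clinicalData[isCCRT & ~hasAdditionalTreatment]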
len(ccrtOnlyPatients["Oncologic Treatment Summary"])

selectedPatients = ccrtOnlyPatients[ccrtOnlyPatients["Number of Fractions"].astype(int) < 40]
len(selectedPatients["Number of Fractions"])

def calculateVoxelwiseBED(dose, nFrac, alpha_beta=10.0):
    factor = 1.0 / (nFrac*alpha_beta)

    BED = dose*(1.0 + dose*factor)

    return BED


dosesPath = "/content/HNSCC_data/warpedDoses/"
availableDoses = ["HNSCC-01-{0}".format(a.split('.')[0]) for a in os.listdir(dosesPath)]

availablePatientsMask = selectedPatients['ID'].isin(availableDoses)
probeDose = sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, "{0:04d}.nii".format(int(2)))))

selectedPatients = selectedPatients.loc[availablePatientsMask]

# Patients go along the LAST axis, which is the layout the mining functions expect
doseArray = np.zeros((*probeDose.shape, len(selectedPatients)))
statusArray = np.zeros((len(selectedPatients),))

n = 0
for idx, pt in selectedPatients.iterrows():
    dose_arr = sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, f"{pt.ID.split('-')[-1]}.nii" ) ) )
    doseArray[...,n] = calculateVoxelwiseBED(dose_arr, pt["Number of Fractions"], alpha_beta=10.0)
    n += 1

Now that we have our data, and have corrected all the doses to the same BED, we are ready to do continuous outcome image based data mining.

For this we need to select a suitable outcome variable - I suggest weight loss as a good one to start with. 
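
Before moving on, the BED correction applied while loading the doses can be sanity-checked by hand. The sketch below re-implements the same formula as a standalone function (`voxelwise_bed` is a name used here for illustration only) and assumes a hypothetical schedule of 35 fractions, which is not taken from the clinical data.

```python
import numpy as np

def voxelwise_bed(dose, n_frac, alpha_beta=10.0):
    """BED = D * (1 + d / (alpha/beta)), with dose per fraction d = D / n_frac."""
    dose = np.asarray(dose, dtype=float)
    dose_per_fraction = dose / n_frac
    return dose * (1.0 + dose_per_fraction / alpha_beta)

# Hypothetical example values, chosen only to make the arithmetic easy to follow
example_doses = np.array([0.0, 35.0, 70.0])  # total physical dose in Gy
bed = voxelwise_bed(example_doses, n_frac=35, alpha_beta=10.0)
print(bed)
```

With 2 Gy per fraction and an alpha/beta of 10 Gy, a physical dose of 70 Gy corresponds to a BED of 70 x (1 + 2/10) = 84 Gy.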

The next cell defines a function that calculates the Pearson correlation coefficient in each voxel of the dose distribution. To do this, we slightly modify the online calculation of variance used in the binary data mining to do an online calculation of covariance. The formula for Pearson's correlation coefficient is then:

$ \rho = \frac{\mathrm{cov}(X,Y)}{\sigma_{X} \sigma_{Y}} $

where $X$ is the dose in a voxel and $Y$ is the outcome.


(Note: strictly, this formula is for a population and we have a sample. As long as the variance and covariance are both sample estimates, the normalisation cancels in the ratio and we get the sample correlation coefficient.)
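
To see how the online update works before applying it to whole images, here is a minimal sketch for two one-dimensional samples. The function name `online_correlation` and the synthetic data are assumptions made for illustration; the result is compared against `np.corrcoef` as a cross-check.

```python
import numpy as np

def online_correlation(xs, ys):
    """Pearson correlation from a single pass over paired samples (Welford-style updates)."""
    mean_x = mean_y = 0.0
    m2_x = m2_y = c_xy = 0.0  # running sums of squared deviations and co-deviations
    count = 0
    for x, y in zip(xs, ys):
        count += 1
        dx = x - mean_x              # deviation from the previous mean of x
        dy = y - mean_y              # deviation from the previous mean of y
        mean_x += dx / count
        mean_y += dy / count
        m2_x += dx * (x - mean_x)    # old mean times new mean deviation
        m2_y += dy * (y - mean_y)
        c_xy += dx * (y - mean_y)    # old mean of x, new mean of y
    # Variance and covariance share the same (count - 1) factor, so it cancels here
    return c_xy / np.sqrt(m2_x * m2_y)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

rho_online = online_correlation(x, y)
rho_numpy = np.corrcoef(x, y)[0, 1]
print(rho_online, rho_numpy)
```

The two numbers should agree to within floating point error. The per-voxel version does the same thing, with `x` replaced by a whole dose image for each patient.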


In [None]:
def calculateRho(doseData, continuousOutcome, mask=None):
    """
    Calculate a per-voxel Pearson correlation coefficient between dose and a continuous outcome.
    Uses Welford's method to calculate mean, variance and covariance.
    
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - continuousOutcome: the outcome values, one continuous number per patient, in the same order as doseData
        - mask: optional array the same shape as one dose image; only non-zero voxels are analysed. If None, all voxels are used
    Returns:
        - rho: an array of the same size as one of the images which contains the per-voxel rho values
    """
    if mask is None:
        mask = np.ones(doseData.shape[:-1], dtype=bool)
    inside = np.asarray(mask).astype(bool)

    doseMean = np.zeros_like(doseData[...,0])
    doseVar = np.zeros_like(doseData[...,0])
    C = np.zeros_like(doseData[...,0])
    rho = np.zeros_like(doseData[...,0])

    ## Initialise the running means with the first patient
    outcomeMean = float(continuousOutcome[0])
    outcomeVar = 0.0
    doseMean[inside] = doseData[...,0][inside]
    subjectCount = 1.0
    
    for n,y in zip(range(1, doseData.shape[-1]), continuousOutcome[1:]):
        x = doseData[...,n]
        subjectCount += 1.0
        dx = x - doseMean  ## deviation from the previous dose mean
        
        om = doseMean.copy()
        yom = outcomeMean  ## floats are immutable, so no copy is needed

        doseMean[inside] += dx[inside]/subjectCount
        outcomeMean += (y - outcomeMean)/subjectCount

        doseVar[inside] += (x[inside] - om[inside])*(x[inside] - doseMean[inside])
        outcomeVar += (y - yom)*(y - outcomeMean)

        C[inside] += dx[inside] * (y - outcomeMean)
        
    ## Bessel's correction for a sample, applied to all three estimates so that it cancels in rho
    doseVar[inside] /= (subjectCount - 1)
    outcomeVar /= (subjectCount - 1)
    covariance = C / (subjectCount - 1)

    rho[inside] = covariance[inside] / (np.sqrt(doseVar[inside]) * np.sqrt(outcomeVar))
    return rho

Now we can apply the mining to some data. The first thing we need to do is select a continuous outcome variable. We will try weight loss first, which we need to create from our database by taking the difference between the patient's weight at the start and at the end of treatment.
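
As a small illustration of this step, the sketch below builds the same difference on a toy table. The column names match the clinical data used in this notebook, but the three rows of values are invented for the example. A patient with a missing weight ends up with `NaN`, which would propagate into the correlation, so it is worth counting them.

```python
import numpy as np
import pandas as pd

# Toy data: the values are made up, only the column names follow the clinical spreadsheet
toy = pd.DataFrame({
    "BW Start tx (kg)": [80.0, 72.5, 65.0],
    "BW stop treat (kg)": [74.0, np.nan, 66.0],
})

toy = toy.assign(WeightLoss=lambda d: d["BW Start tx (kg)"] - d["BW stop treat (kg)"])
toy_weight_loss = toy.WeightLoss.values

# Count patients for whom the outcome could not be computed
n_missing = int(np.isnan(toy_weight_loss).sum())
print(toy_weight_loss, n_missing)
```

A positive value means the patient lost weight during treatment; a negative value means they gained weight.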

In [None]:
selectedPatients = selectedPatients.assign(WeightLoss = lambda d : d["BW Start tx (kg)"] - d["BW stop treat (kg)"])
weightLoss = selectedPatients.WeightLoss.values 

Now we can do the actual correlation analysis!

In [None]:
mask = sitk.GetArrayFromImage(sitk.ReadImage("/content/HNSCC_data/0002_mask.nii")).astype(np.int16)

import time
start = time.time()
rhoMap = calculateRho(doseArray, weightLoss, mask=mask)
print(time.time() - start)

referenceAnatomy = sitk.GetArrayFromImage(sitk.ReadImage("/content/HNSCC_data/downsampledCTs/0002.nii"))

fig = plt.figure(figsize=(10,10))
anatomy = plt.imshow(referenceAnatomy[::-1,64,...], cmap='Greys_r')
rhoMapOverlay = plt.imshow(rhoMap[::-1,64,...], alpha=0.5)
_ = plt.axis('off')

Now that we've got our correlation map with the true correspondence of weight loss to dose, we can compute the permutation distribution again to get the significance of the correlation.

This works just like before: we shuffle the weight loss values, re-calculate the correlation coefficient, and look at how the most extreme voxels behave.
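
The idea can be sketched on a small synthetic data set before running it on the real doses. Everything below is an assumption made for illustration: the array sizes, the random data, and the helper `voxelwise_rho`, which is a vectorised stand-in for the online calculation and not the function used in this notebook.

```python
import numpy as np

def voxelwise_rho(dose, outcome):
    """Pearson correlation between the outcome and every voxel; patients on the last axis."""
    dose_c = dose - dose.mean(axis=-1, keepdims=True)
    out_c = outcome - outcome.mean()
    numerator = (dose_c * out_c).sum(axis=-1)
    denominator = np.sqrt((dose_c ** 2).sum(axis=-1) * (out_c ** 2).sum())
    return numerator / denominator

rng = np.random.default_rng(42)
toy_dose = rng.normal(size=(4, 5, 6, 30))   # 4x5x6 voxels, 30 patients
toy_outcome = rng.normal(size=30)           # generated independently of the dose

true_max = voxelwise_rho(toy_dose, toy_outcome).max()

n_perm = 200
max_rho = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(toy_outcome)  # break any dose-outcome link
    max_rho[i] = voxelwise_rho(toy_dose, shuffled).max()

# Fraction of arrangements at least as extreme as the observed one,
# counting the observed arrangement itself (the +1 in numerator and denominator)
p_pos = (1 + np.sum(max_rho >= true_max)) / (n_perm + 1)
print(true_max, p_pos)
```

Taking the maximum over the whole map in each permutation is what makes the resulting p-value global: it accounts for the fact that we are testing every voxel at once.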

In [None]:
def doPermutation(doseData, outcome, mask=None):
    """
    Permute the outcome values and return the extreme rho values for this permutation
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - outcome: the outcome values. Should be a continuous number. These will be permuted in this function to 
                    assess the null hypothesis of no dose interaction
        - mask: A mask outside which we will ignore the returned correlation
    Returns:
        - (rhoMin, rhoMax): the extreme values of the whole rho map for this permutation
    """
    poutcome = np.random.permutation(outcome)
    permRho = calculateRho(doseData, poutcome, mask)
    return (np.min(permRho), np.max(permRho))


def permutationTest(doseData, outcome, nperm=1000, mask=None):
    """
    Perform a permutation test to get the global p-values and rho thresholds
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - outcome: the outcome values. Should be a continuous number.
        - nperm: The number of permutations to calculate. Defaults to 1000 which is the minimum for reasonable accuracy
        - mask: A mask outside which we will ignore the returned correlation
    Returns:
        - globalpNeg: the global significance of the test for negative rho values
        - globalpPos: the global significance of the test for positive rho values
        - rhoThreshNeg: the sorted list of minimum rho from all the permutations, use it to set a significance threshold.
        - rhoThreshPos: the sorted list of maximum rho from all the permutations, use it to set a significance threshold.
    """
    rhoThresh = []
    ## Counts start at 1 so the true arrangement is included as one of the permutations
    gtCount = 1.0
    ltCount = 1.0
    trueRho = calculateRho(doseData, outcome, mask=mask)
    trueMaxRho = np.max(trueRho)
    trueMinRho = np.min(trueRho)

    ## Show a progress bar if tqdm is available
    permutations = tqdm(range(nperm)) if haveTQDM else range(nperm)
    for perm in permutations:
        rhoThresh.append(doPermutation(doseData, outcome, mask))
        if rhoThresh[-1][1] > trueMaxRho:
            gtCount += 1.0
        if rhoThresh[-1][0] < trueMinRho:
            ltCount += 1.0
    
    ## Divide by nperm + 1 to match the counts starting at 1, so p can never exceed 1
    globalpPos = gtCount / float(nperm + 1)
    globalpNeg = ltCount / float(nperm + 1)
    rhoThresh = np.array(rhoThresh)
    return (globalpNeg, globalpPos, sorted(rhoThresh[:,0]), sorted(rhoThresh[:,1]))

If we run this function with our data, we will get back the global significance and threshold values of $\rho$. We can then use a contour plot at the 95th percentile to indicate regions of significance.
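
One detail worth spelling out: `np.percentile` expects the percentile on a 0 to 100 scale, so the threshold for p = 0.05 on the positive tail is the 95th percentile of the permutation maxima. Here is a small sketch with a made-up permutation distribution (the numbers are not results from this data set):

```python
import numpy as np

# Hypothetical maxima from 101 permutations, evenly spaced to make the answer obvious
fake_max_rho = np.linspace(0.0, 1.0, 101)

thresh_p005 = np.percentile(fake_max_rho, 95)        # about 5% of permutations exceed this
thresh_p010 = np.percentile(fake_max_rho, 90)        # about 10% of permutations exceed this
not_a_threshold = np.percentile(fake_max_rho, 0.95)  # the 0.95th percentile: almost the minimum

print(thresh_p005, thresh_p010, not_a_threshold)
```

Passing a fraction such as 0.95 does not raise an error, it just returns a value very close to the smallest entry, so this mistake is easy to miss.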

Let's try doing it now - this is once again equivalent to the binary data mining, but with the continuous data mining code we wrote above. The function call is identical to the binary version, but because the content of the functions is different, it is now doing the calculation with continuous IBDM.

*Warning: this cell will take a really long time to run! On my machine, it was about 13 minutes*

In [None]:
pNeg, pPos, threshNeg, threshPos = permutationTest(doseArray, weightLoss, nperm=100, mask=mask)

In [None]:
print(pNeg, pPos)
print(np.percentile(threshNeg, 10))
print(threshNeg)
print(np.min(rhoMap))

print(threshPos)
print(np.max(rhoMap))

The usual threshold for saying a result is statistically significant is p < 0.05. Unfortunately, in my example analysis we don't seem to have a globally significant result. Everything below here won't really work properly because there is no significant result in this case. However, let's do it anyway so you can see what to do when you mine something else later and get a significant result!

---

We also have our map of rho values, and the associated permutation test distribution, so we can plot the regions of significance overlaid on the rho-map and CT anatomy. To do this, we use matplotlib's imshow and contour functions as below.

In [None]:
fig = plt.figure(figsize=(10,10))

ax = fig.add_subplot(111)

# First show the CT
ctImg = ax.imshow(referenceAnatomy[:,:,64], cmap='Greys_r')

# Now add the rho map with some transparency
rhomapImg = ax.imshow(rhoMap[:,:,64], alpha=0.3)

plt.axis('off');

## np.percentile takes percentiles on a 0-100 scale, so p=0.05 is the 5th (or 95th) percentile
neg_p005 = np.percentile(threshNeg, 5)
neg_p010 = np.percentile(threshNeg, 10)
neg_p015 = np.percentile(threshNeg, 15) ## Extra levels at p=0.10 & 0.15 give nested contours around p=0.05
pos_p005 = np.percentile(threshPos, 95)
pos_p010 = np.percentile(threshPos, 90)
pos_p015 = np.percentile(threshPos, 85)

## Now do the contour plot; levels must be in increasing order, and the 95% level corresponds to p=0.05
pos_contourplot = plt.contour(rhoMap[:,:,64], levels=[pos_p015, pos_p010, pos_p005], colors='r')
# neg_contourplot = plt.contour(rhoMap[:,:,64], levels=[neg_p005, neg_p010, neg_p015], colors='g')

Now we have a complete pipeline to do image based data mining!

Now use this pipeline to try mining against some of the other outcomes in the clinical data, for example:

- Change in BMI between pre/post treatment
- Change in skeletal muscle pre/post treatment
- Change in other areas, e.g. fats
- Skeletal muscle change against CT image density
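
When switching to another outcome, some patients may have no recorded value. One way to handle this, sketched below, is to drop those patients from both the outcome and the dose array before mining. This assumes the dose array has patients along its last axis, as the correlation function expects. The helper name `dropMissingOutcome` and the toy arrays are hypothetical.

```python
import numpy as np

def dropMissingOutcome(doseData, outcome):
    """Keep only patients whose outcome is a finite number (patients on the last axis)."""
    outcome = np.asarray(outcome, dtype=float)
    keep = np.isfinite(outcome)
    return doseData[..., keep], outcome[keep]

# Toy example: 2x2x2 voxels, 4 patients, one of whom has no recorded outcome
toyDose = np.arange(2 * 2 * 2 * 4, dtype=float).reshape(2, 2, 2, 4)
toyOutcome = np.array([1.5, np.nan, -0.3, 2.0])

cleanDose, cleanOutcome = dropMissingOutcome(toyDose, toyOutcome)
print(cleanDose.shape, cleanOutcome)
```

The cleaned arrays can then be passed to the mining functions in place of the originals, so the dose images and outcome values stay aligned patient by patient.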

Have fun!