In [52]:
import numpy as np
import pandas as pd
from collections import namedtuple, defaultdict
from glob import glob
from pathlib import Path, PurePath
from typing import NamedTuple, Tuple

In [2]:
data_dir = Path("/datasets/luna16")

### Image files

The LUNA16 dataset is split into 10 subsets (`subset0` to `subset9`). Initially, I only downloaded
some of them. To make sure downstream code doesn't break because the annotation
files reference images from unavailable subsets, let's make a list of the files
that are currently available in the `data_dir` location:

In [78]:
mhd_files = list(data_dir.glob('subset*/*.mhd'))
presentOnDisk_set = {x.stem for x in mhd_files}  # a set gives fast membership tests
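
To see how this availability check behaves, it can be reproduced on a throwaway directory. The sketch below is illustrative only: the temporary directory, the made-up series names, and the helper `available_series` are assumptions and not part of the dataset.

```python
import tempfile
from pathlib import Path


def available_series(root: Path) -> set:
    """Return the stems of all .mhd files in root/subset*/ (hypothetical helper)."""
    return {p.stem for p in root.glob("subset*/*.mhd")}


# Build a fake layout with two subsets and made-up, dotted series names.
layout = [("subset0", "1.2.3.100"), ("subset0", "1.2.3.200"), ("subset3", "1.2.3.300")]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for subset, name in layout:
        (root / subset).mkdir(exist_ok=True)
        (root / subset / f"{name}.mhd").touch()
        (root / subset / f"{name}.raw").touch()  # not matched by the *.mhd glob
    found = available_series(root)

print(sorted(found))
```

`Path.stem` strips only the final suffix, so identifiers that contain dots themselves stay intact.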

### Annotation files

The LUNA16 dataset contains two CSV files with annotations:

1. `candidates.csv`
2. `annotations.csv`

Here, we explore these two files to understand what they contain.

In [3]:
candidates = pd.read_csv(data_dir / "candidates.csv")
annotations = pd.read_csv(data_dir / "annotations.csv")

In [4]:
candidates.shape

(551065, 5)

The `candidates.csv` file contains information about roughly half a million potential nodules,
including

- `seriesuid`: the unique identifier of the scan
- `coordX`, `coordY`, `coordZ`: the coordinates of the nodule _center_ in __patient coordinates__, measured in millimeters relative to an arbitrary origin.
- `class`: the nodule status: 0 if the candidate is not a nodule, 1 if it is a nodule, which can be either malignant or benign.
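
Because the coordinates are given in millimeters rather than voxel indices, they have to be converted before they can be used to index into an image array. The following is a minimal sketch of such a conversion, assuming the common convention `physical = origin + direction @ (voxel_index * spacing)`. The function name `xyz_to_irc` and the geometry values are hypothetical; real values would come from each scan's `.mhd` header.

```python
import numpy as np


def xyz_to_irc(center_xyz, origin_xyz, spacing_xyz, direction=None):
    """Convert patient coordinates (mm) to (index, row, col) voxel indices.

    Assumes physical = origin + direction @ (voxel_index * spacing).
    """
    direction = np.eye(3) if direction is None else np.asarray(direction, dtype=float)
    offset_mm = np.asarray(center_xyz, dtype=float) - np.asarray(origin_xyz, dtype=float)
    # Undo the axis orientation, then convert millimeters to voxel units.
    voxel_xyz = np.linalg.solve(direction, offset_mm) / np.asarray(spacing_xyz, dtype=float)
    # Image arrays are commonly stored slice-first, so reverse the x, y, z order.
    return tuple(int(v) for v in np.round(voxel_xyz)[::-1])


# Hypothetical scan geometry, chosen only for illustration.
origin = (-200.0, -200.0, -400.0)
spacing = (0.5, 0.5, 2.0)
irc = xyz_to_irc((-56.0, -68.0, -312.0), origin, spacing)
print(irc)  # (44, 264, 288)
```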



In [5]:
candidates.head()

Unnamed: 0,seriesuid,coordX,coordY,coordZ,class
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-56.08,-67.85,-311.92,0
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,53.21,-244.41,-245.17,0
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,103.66,-121.8,-286.62,0
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-33.66,-72.75,-308.41,0
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-32.25,-85.36,-362.51,0


Each of the 888 CT scans is identified by its `seriesuid` and contains between 32 and 1468 candidate nodules:

In [6]:
(candidates
 .groupby('seriesuid')['seriesuid']
 .count()
 .describe()
)

count     888.000000
mean      620.568694
std       229.923643
min        32.000000
25%       453.000000
50%       582.500000
75%       769.250000
max      1468.000000
Name: seriesuid, dtype: float64

The `annotations.csv` file contains information for those candidates that were flagged as nodules.
It contains 

- `seriesuid`: the unique identifier of the scan
- `coordX`, `coordY`, `coordZ`: the coordinates of the nodule _center_ in __patient coordinates__.
- `diameter_mm`: the diameter of the nodule, in millimeters

Interestingly, the coordinates were recorded with higher precision, i.e. more decimal places, than
in the `candidates.csv` file. (Perhaps because these nodules were considered more interesting?
Who knows...)

In [8]:
annotations.head()

Unnamed: 0,seriesuid,coordX,coordY,coordZ,diameter_mm
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-128.699421,-175.319272,-298.387506,5.651471
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,103.783651,-211.925149,-227.12125,4.224708
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793...,69.639017,-140.944586,876.374496,5.786348
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,-24.013824,192.102405,-391.081276,8.143262
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,2.441547,172.464881,-405.493732,18.54515


Only 1186 of the roughly half a million candidate nodules are annotated:

In [7]:
annotations.shape

(1186, 5)

As in the `candidates.csv` file, there can be multiple annotated nodules in the same
scan, although the range is much smaller: between 1 and 12 nodules were reported in each of 601 scans.

In [9]:
(annotations
 .groupby('seriesuid')['seriesuid']
 .count()
 .describe()
)

count    601.000000
mean       1.973378
std        1.483000
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max       12.000000
Name: seriesuid, dtype: float64

To complete the diagnostic pipeline, we need to perform multiple tasks, including building a classifier
that returns the status of a previously identified candidate nodule as either _benign_ or _malignant_.

This decision will be made on a nodule-by-nodule basis, and only requires local information (e.g. voxels
in the vicinity of the nodule center) as input.

First, we will combine information from the two CSV files to generate a list of candidate nodules, each
annotated with its scan identifier, position, nodule status, and (where available) diameter.

Averaged per scan, the annotated nodule diameters range from about 3.5 mm to about 27.4 mm.

In [10]:
(annotations
 .groupby('seriesuid')
 .agg({'diameter_mm': 'mean'})
 .describe()
)

Unnamed: 0,diameter_mm
count,601.0
mean,8.781195
std,4.699269
min,3.530974
25%,5.56275
50%,6.966073
75%,10.496969
max,27.442423


Based on this information, the authors decide to

1. Calculate the radius of each nodule.
2. Treat two nodules as the same if their centers are less than half the radius apart (Euclidean distance).

Expressed in terms of the diameter, the distance threshold is

`min_diameter = diameter_mm / 2 / 2`
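
As a worked example of this rule, the sketch below computes the threshold and applies it to two made-up pairs of centers. The helper names `match_threshold_mm` and `is_same_nodule` are hypothetical and only serve to illustrate the arithmetic.

```python
import math


def match_threshold_mm(diameter_mm: float) -> float:
    """Half of the radius, i.e. a quarter of the diameter."""
    radius_mm = diameter_mm / 2
    return radius_mm / 2


def is_same_nodule(center_a, center_b, diameter_mm: float) -> bool:
    """True if the Euclidean distance between the centers is below the threshold."""
    return math.dist(center_a, center_b) < match_threshold_mm(diameter_mm)


# A nodule with an 8 mm diameter has a 4 mm radius and a 2 mm threshold.
threshold = match_threshold_mm(8.0)
close = is_same_nodule((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 8.0)  # distance ~1.73 mm
far = is_same_nodule((0.0, 0.0, 0.0), (2.0, 1.0, 0.0), 8.0)    # distance ~2.24 mm
print(threshold, close, far)
```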

Because only a small subset of the candidate nodules are annotated with the `diameter_mm`
information, we also have to decide what to do with this missing information for the rest of them.
Here, we simply record a size of 0 mm for each candidate without a matching annotation. (After all,
if a candidate is _not_ a nodule, it doesn't really _have_ a size.)

First, let's check that every scan with an annotated nodule also appears in the `candidates.csv` file:

In [11]:
all_candidate_series = candidates.seriesuid.unique()
sum([x not in all_candidate_series for x in annotations.seriesuid]) == 0

True
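
The same kind of check can also report _which_ identifiers are missing, which is handy when it fails. Here is a small sketch on toy data; the frames `toy_candidates` and `toy_annotations` and their one-letter identifiers are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSV files.
toy_candidates = pd.DataFrame({"seriesuid": ["a", "a", "b", "c"]})
toy_annotations = pd.DataFrame({"seriesuid": ["a", "c", "d"]})

# Vectorized membership test: one boolean per annotation row.
is_known = toy_annotations["seriesuid"].isin(toy_candidates["seriesuid"])
all_present = bool(is_known.all())

# Set difference lists the identifiers that have no candidates at all.
missing = set(toy_annotations["seriesuid"]) - set(toy_candidates["seriesuid"])
print(all_present, missing)
```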

To combine information from the two files, we

1. Read information about all _annotated_ nodules into a dictionary keyed by `seriesuid`.
2. Record the information for each _candidate_ nodule and
  - Set the initial diameter to zero
  - Look up all annotated nodules from the same `seriesuid`
  - Loop over all annotated nodules for that `seriesuid`
  - Calculate their Euclidean distance and match those closer than `diameter_mm / 4` (half the radius)
  - Use the `diameter_mm` information for a match
3. Return a NamedTuple for each _candidate nodule_ containing
  - `isNodule_bool`: a boolean value indicating whether the candidate is a nodule or not
  - `candidateDiameter_mm`: the diameter of the nodule (or zero if no match was found)
  - `series_uid`: the unique identifier of the CT scan
  - `candidateCenter_xyz`: the position of the nodule (in mm)

In [12]:
class AnnotationInfoTuple(NamedTuple):
  center_xyz: Tuple[float, float, float]
  diameter_mm: float = 0

annotated_nodules = defaultdict(list)
for x in annotations.itertuples():
    annotated_nodules[x.seriesuid].append(
        AnnotationInfoTuple(
            (float(x.coordX), float(x.coordY), float(x.coordZ)),
            x.diameter_mm )
    )

In [13]:
list(annotated_nodules.items())[:2]

[('1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860',
  [AnnotationInfoTuple(center_xyz=(-128.6994211, -175.3192718, -298.3875064), diameter_mm=5.651470635),
   AnnotationInfoTuple(center_xyz=(103.7836509, -211.9251487, -227.12125), diameter_mm=4.224708481)]),
 ('1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208',
  [AnnotationInfoTuple(center_xyz=(69.63901724, -140.9445859, 876.3744957), diameter_mm=5.786347814)])]

In [29]:
class CandidateInfoTuple(NamedTuple):
    isNodule_bool: bool
    candidateDiameter_mm: float
    series_uid: str
    candidateCenter_xyz: Tuple[float, float, float]

requireOnDisk_bool = False  # set to True to only consider available images
candidateInfo_list = []

for x in candidates.itertuples():
    uid = x.seriesuid
    if requireOnDisk_bool and uid not in presentOnDisk_set:
        continue
    center_xyz = (float(x.coordX), float(x.coordY), float(x.coordZ))
    # 'class' is a Python keyword, so itertuples() renames that field to _5
    isNodule_bool = bool(x._5)
    diameter = 0
    # .get() avoids adding empty entries to the defaultdict for unannotated scans
    for annotation in annotated_nodules.get(uid, []):
        # match if the centers are closer than half the radius (= diameter / 4)
        threshold = annotation.diameter_mm / 4
        # Euclidean distance between the candidate and annotation centers
        distance = np.linalg.norm(np.subtract(center_xyz, annotation.center_xyz))
        if distance < threshold:
            diameter = annotation.diameter_mm
            break
    info = CandidateInfoTuple(isNodule_bool, diameter, uid, center_xyz)
    candidateInfo_list.append(info)

In [31]:
positiveInfo_list = [x for x in candidateInfo_list if x.isNodule_bool]
diameter_list = [x.candidateDiameter_mm for x in positiveInfo_list]

In [32]:
print(len(positiveInfo_list))
print(positiveInfo_list[0])

1351
CandidateInfoTuple(isNodule_bool=True, candidateDiameter_mm=4.224708481, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860', candidateCenter_xyz=(104.16480444, -211.685591018, -227.011363746))
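
There are more positive candidates (1351) than annotations (1186), so it can be useful to check how many positives kept the placeholder diameter of 0. The sketch below shows one way to count them; it runs on a made-up `toy_list` rather than the real `candidateInfo_list`, so the numbers it prints say nothing about the dataset.

```python
from typing import NamedTuple, Tuple


class CandidateInfoTuple(NamedTuple):
    isNodule_bool: bool
    candidateDiameter_mm: float
    series_uid: str
    candidateCenter_xyz: Tuple[float, float, float]


# Made-up records standing in for candidateInfo_list.
toy_list = [
    CandidateInfoTuple(True, 4.2, "scan_a", (104.2, -211.7, -227.0)),
    CandidateInfoTuple(True, 0.0, "scan_a", (10.0, 20.0, 30.0)),
    CandidateInfoTuple(False, 0.0, "scan_b", (-56.1, -67.9, -311.9)),
]

positives = [c for c in toy_list if c.isNodule_bool]
# Positives that found no annotation within the distance threshold.
unmatched = [c for c in positives if c.candidateDiameter_mm == 0]
print(f"{len(positives)} positives, {len(unmatched)} without a matched diameter")
```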
