# LUng Nodule Analysis 2016 (LUNA16) Dataset

## Dataset

### [Download](https://luna16.grand-challenge.org/Download/)

This dataset is based on the publicly available [LIDC/IDRI database](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI).

Data can be downloaded [here](https://luna16.grand-challenge.org/Download/). The data are split into two major parts, hosted as [part1](https://zenodo.org/records/3723295) and [part2](https://zenodo.org/records/4121926). The download page provides the CT data along with a lot of additional metadata for this dataset. The main data consists of ten subsets. Because the full dataset is approximately 220 GB, I will use only subsets 1 to 3 in my local environment. Later, I want to use Azure ML to train my model on the full data using its GPU VMs.
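A minimal sketch of restricting local work to a few subsets. It assumes the archives were extracted into sibling folders named `subset0` to `subset9` and that each scan has a `.mhd` header file; the `data/luna16` path and the helper name `list_scans` are hypothetical.

```python
from pathlib import Path


def list_scans(data_root, subset_ids=(1, 2, 3)):
    """Return sorted .mhd header paths found in the chosen subset folders.

    Assumes a layout like <data_root>/subset1/<series_uid>.mhd; adjust this
    to wherever the archives were actually extracted.
    """
    root = Path(data_root)
    scans = []
    for subset_id in subset_ids:
        subset_dir = root / f"subset{subset_id}"
        # Missing folders are skipped, so the call also works on a partial download.
        if subset_dir.is_dir():
            scans.extend(sorted(subset_dir.glob("*.mhd")))
    return scans


# Hypothetical location; an empty list is returned if it does not exist.
local_scans = list_scans("data/luna16")
print(f"{len(local_scans)} scans available locally")
```

Passing `subset_ids=range(10)` would select the full dataset, for example when running on a cloud GPU machine.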

### [Description](https://luna16.grand-challenge.org/Data/)

The data is structured as follows:
- **subset0.zip to subset9.zip**: 10 zip files which contain all CT images
- **annotations.csv**: csv file that contains the annotations used as reference standard for the 'nodule detection' track
- **sampleSubmission.csv**: an example of a submission file in the correct format
- **candidates.csv**: the original set of candidates used for the LUNA16 workshop at ISBI2016. This file is kept for completeness but should not be used; use candidates_V2.csv instead.
- **candidates_V2.csv**: csv file that contains an extended set of candidate locations for the ‘false positive reduction’ track. 
- **evaluation script**: the evaluation script that is used in the LUNA16 framework
- **lung segmentation**: a directory that contains the lung segmentation for CT images computed using automatic algorithms
- **additional_annotations.csv**: csv file that contains additional nodule annotations from our observer study. The file will be available soon.
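As a rough illustration of working with these files, the sketch below parses candidate rows using only the standard library. It assumes the header `seriesuid,coordX,coordY,coordZ,class`, which should be checked against the downloaded file; the `Candidate` tuple and `read_candidates` helper are hypothetical names, and the sample rows are made up.

```python
import csv
import io
from collections import namedtuple

Candidate = namedtuple("Candidate", "series_uid, center_xyz, is_nodule")


def read_candidates(csv_file):
    """Parse candidate rows from an open candidates_V2.csv-style file."""
    candidates = []
    for row in csv.DictReader(csv_file):
        center_xyz = (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"]))
        # The class column is assumed to be "1" for nodules and "0" otherwise.
        candidates.append(Candidate(row["seriesuid"], center_xyz, row["class"] == "1"))
    return candidates


# Tiny made-up sample in the assumed format, not real LUNA16 data.
sample = io.StringIO(
    "seriesuid,coordX,coordY,coordZ,class\n"
    "1.2.3,-56.08,-67.85,-311.92,0\n"
    "1.2.3,68.42,-74.48,-288.7,1\n"
)
candidates = read_candidates(sample)
print(sum(c.is_nodule for c in candidates), "of", len(candidates), "candidates are nodules")
```

With a real file, the same function can be called as `read_candidates(open(path, newline=""))`.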



## Human Evaluation vs. Deep Learning 

The majority of a CT scan is fundamentally uninteresting with regard to answering the question, “Does this patient have a malignant tumor?” This makes intuitive sense, since the vast majority of the patient’s body will consist of healthy cells. In the cases where there is a malignant tumor, up to 99.9999% of the voxels in the CT still won’t be cancer. That ratio is equivalent to a two-pixel blob of incorrectly tinted color somewhere on a high-definition television, or a single misspelled word out of a shelf of novels.
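The television comparison can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes "high-definition" means a 1920×1080 screen, and the 512×512×128 scan size is a hypothetical example rather than a property of the dataset.

```python
# 100% - 99.9999% = 0.0001% of voxels, i.e. about one voxel in a million.
tumor_fraction = 1e-6

hd_pixels = 1920 * 1080           # pixels on a 1080p screen (assumed meaning of "HD")
blob_fraction = 2 / hd_pixels     # two wrongly tinted pixels on that screen

# Hypothetical scan size, used only to get a feel for absolute numbers.
scan_voxels = 512 * 512 * 128
expected_tumor_voxels = scan_voxels * tumor_fraction

print(f"tumor fraction:     {tumor_fraction:.2e}")
print(f"two-pixel fraction: {blob_fraction:.2e}")
print(f"tumor voxels in a {scan_voxels:,}-voxel scan: about {expected_tumor_voxels:.0f}")
```

Both fractions come out at roughly one in a million, which is why the two-pixel blob is a fair picture of the class imbalance.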

<figure>
    <center>
        <img src="attachments/ct-scan-visualization.png"  style="width:750px;" >
    </center>
</figure>


## End-to-end vs. Specific model design

You might have seen elsewhere that _end-to-end approaches_ for detection and classification of objects are very successful in general vision tasks. TorchVision includes end-to-end models like Faster R-CNN and Mask R-CNN, but these are typically trained on hundreds of thousands of images, and those datasets aren’t constrained by the number of samples from rare classes. The project architecture we will use has the benefit of working well with a more modest amount of data.

So while it’s certainly theoretically possible to just throw an arbitrarily large amount of data at a neural network until it learns the specifics of the proverbial lost needle, as well as how to ignore the hay, it’s going to be practically prohibitive to collect enough data and wait for a long enough time to train the network properly. That won’t be the best approach since the results are poor, and most readers won’t have access to the compute resources to pull it off at all.

### Suitable end-to-end architecture

If we really wanted to use an end-to-end design, we could find an architecture that matches our needs (like Retina U-Net and FishNet). But to use them, we would first need to understand their complex design. These complicated designs are capable of producing high-quality results, but they’re not the best choice here because understanding the design decisions behind them requires having mastered fundamental concepts first.

# Domain Knowledge

## Computed Tomography (CT) Scan

We will be using data from CT scans extensively as the main data format for our project. CT scans are essentially 3D X-rays, represented as a 3D array of single-channel data. As we might recall from chapter 4, this is like a stacked set of grayscale PNG images. 


<figure>
    <center>
        <img src="attachments/ct-scan-voxel-example.png"  style="width:550px;" >
        <p><small>A CT scan of a human torso showing, from the top, skin, organs, spine, and patient support bed.</small></p>
    </center>
</figure>

> **Voxel - Volumetric pixel** is the 3D equivalent to the familiar two-dimensional pixel. It encloses a volume of space.

Each voxel of a CT scan has a numeric value that roughly corresponds to the average mass density of the matter contained inside. Most visualizations of that data show high-density material like bones and metal implants as white, low-density air and lung tissue as black, and fat and tissue as various shades of gray.
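The mapping from density to gray level can be sketched with NumPy. This assumes the voxel values are on the Hounsfield scale (air around -1000, water around 0, bone well above that); the window limits and the helper name `to_grayscale` are illustrative choices, not values taken from the dataset.

```python
import numpy as np


def to_grayscale(ct_volume, low=-1000.0, high=1000.0):
    """Map density values to 0..255 gray levels: low -> black, high -> white."""
    # Values outside the window are clamped, so very dense material saturates to white.
    clipped = np.clip(ct_volume, low, high)
    scaled = (clipped - low) / (high - low)
    return (scaled * 255).astype(np.uint8)


# Toy volume indexed as (index, row, column); the values are illustrative only.
ct_volume = np.full((2, 3, 3), -1000.0)   # start with air everywhere
ct_volume[0, 1, 1] = 40.0                 # a soft-tissue-like voxel
ct_volume[1, 1, 1] = 1500.0               # a bone-like voxel, above the window
gray = to_grayscale(ct_volume)
print(gray[0, 0, 0], gray[0, 1, 1], gray[1, 1, 1])
```

Air ends up black, the dense voxel white, and the soft-tissue voxel a mid gray, matching the description above.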


## Nodule

A nodule is any of the myriad lumps and bumps that might appear inside someone’s lungs. Some are problematic from a health-of-the-patient perspective; some are not. The precise definition limits the size of a nodule to 3 cm or less, with a larger lump being a lung mass; but we’re going to use nodule interchangeably for all such anatomical structures. A nodule can turn out to be benign or a malignant tumor (also referred to as cancer). From a radiological perspective, a nodule is really similar to other lumps that have a wide variety of causes: infection, inflammation, blood-supply issues, malformed blood vessels, and diseases other than tumors.

<figure>
    <center>
        <img src="attachments/malignant-nodule.png"  style="width:750px;" >
    </center>
</figure>


# An end-to-end project design

We’re going to use five main steps to go from examining a whole-chest CT scan to giving the patient a lung cancer diagnosis. The figure below depicts only the final path through the system once we’ve built and trained all of the requisite models. The actual work required to train the relevant models will be detailed as we get closer to implementing each step.

We will first work on step 1 (data loading) and then jump to step 4 before coming back to implement steps 2 and 3. Step 4 (classification) requires an approach similar to what we used in the bird & plane model: multiple convolutional and pooling layers aggregate spatial information before feeding it into a linear classifier. Once we’ve got a handle on our classification model, we can start working on step 2 (segmentation). Since segmentation is the more complicated topic, we want to tackle it without having to learn both segmentation and the fundamentals of CT scans and malignant tumors at the same time. Instead, we’ll explore the cancer-detection space while working on a more familiar classification problem.

<figure>
    <center>
        <img src="attachments/end-to-end-design.png"  style="width:750px;" >
    </center>
</figure>


### 1. Data Loading

Load our raw CT scan data into a form that we can use with PyTorch. Putting raw data into a form usable by PyTorch will be the first step in any project you face. The process is somewhat less complicated with 2D image data and simpler still with non-image data.

### 2. Segmentation

Identify the voxels of potential tumors in the lungs using PyTorch to implement a technique known as _segmentation_. This is roughly akin to producing a heatmap of areas that should be fed into our classifier in step 3. This will allow us to focus on potential tumors inside the lungs and ignore huge swaths of uninteresting anatomy (a person can’t have lung cancer in the stomach, for example).

### 3. Grouping

Group interesting voxels into lumps: that is, candidate nodules (see the Nodule section above for more information on nodules). Here, we will find the rough center of each hotspot on our heatmap.

Each nodule can be located by the index, row, and column of its center point. We do this to present a simple, constrained problem to the final classifier. Grouping voxels will not involve PyTorch directly, which is why we’ve pulled this out into a separate step. Often, when working with multistep solutions, there will be non-deep-learning glue steps between the larger, deep-learning-powered portions of the project.
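The grouping step can be sketched without any deep-learning code. The example below uses a plain breadth-first flood fill, which is a stand-in and not necessarily the method the project ends up using. It assumes the heatmap has already been thresholded into a 0/1 mask stored as nested lists; `group_voxels` is a hypothetical helper name.

```python
from collections import deque


def group_voxels(mask):
    """Return the rounded (index, row, column) center of each connected blob.

    `mask` is a nested list of 0/1 flags indexed as mask[index][row][col].
    Voxels that share a face (6-connectivity) belong to the same blob.
    """
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    centers = []
    for i in range(depth):
        for r in range(height):
            for c in range(width):
                if not mask[i][r][c] or (i, r, c) in seen:
                    continue
                # Breadth-first flood fill collects every voxel of one blob.
                blob = []
                queue = deque([(i, r, c)])
                seen.add((i, r, c))
                while queue:
                    vi, vr, vc = queue.popleft()
                    blob.append((vi, vr, vc))
                    for di, dr, dc in neighbors:
                        ni, nr, nc = vi + di, vr + dr, vc + dc
                        inside = 0 <= ni < depth and 0 <= nr < height and 0 <= nc < width
                        if inside and mask[ni][nr][nc] and (ni, nr, nc) not in seen:
                            seen.add((ni, nr, nc))
                            queue.append((ni, nr, nc))
                # The blob's center is the mean position of its voxels, per axis.
                centers.append(tuple(round(sum(axis) / len(blob)) for axis in zip(*blob)))
    return centers


# Toy mask: a three-voxel bar in slice 0 and a single voxel in slice 2.
mask = [[[0] * 4 for _ in range(4)] for _ in range(3)]
for col in (0, 1, 2):
    mask[0][1][col] = 1
mask[2][3][3] = 1
centers = group_voxels(mask)
print(centers)
```

Each returned tuple is one candidate location that can be handed to the classifier as the center of a crop.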

### 4. Classification

Classify candidate nodules as actual nodules or non-nodules using 3D convolution.

The features that determine the nature of a tumor from a candidate structure are local to the tumor in question, so this approach should provide a good balance between limiting input data size and excluding relevant information. Making scope-limiting decisions like this can keep each individual task constrained, which can help limit the number of things to examine when troubleshooting.
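To make the idea of classifying a small crop around each candidate concrete, here is a deliberately tiny 3D convolutional classifier in PyTorch. It is a sketch under assumptions, not the model the project will use: the class name `TinyNoduleClassifier`, the channel counts, and the 16×24×24 crop size are all hypothetical.

```python
import torch
from torch import nn


class TinyNoduleClassifier(nn.Module):
    """Hypothetical, deliberately small 3D CNN; not the project's final model."""

    def __init__(self, in_channels=1, conv_channels=4):
        super().__init__()
        # Convolution + pooling blocks aggregate spatial information.
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(conv_channels, conv_channels * 2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # Global pooling makes the linear head independent of the crop size.
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.head = nn.Linear(conv_channels * 2, 2)  # nodule vs. non-nodule logits

    def forward(self, crop):
        x = self.pool(self.features(crop))
        return self.head(x.flatten(start_dim=1))


# One batch of two single-channel crops; the crop size is an assumption.
crops = torch.randn(2, 1, 16, 24, 24)
logits = TinyNoduleClassifier()(crops)
print(logits.shape)
```

The input is a small crop around a candidate center rather than the whole scan, which is what keeps the classification task constrained.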

### 5. Nodule analysis and diagnosis

Diagnose the patient using the combined per-nodule classifications.

Similar to the nodule classifier in the previous step, we will attempt to determine whether the nodule is benign or malignant based on imaging data alone. We will take a simple maximum of the per-tumor malignancy predictions, as only one tumor needs to be malignant for a patient to have cancer. Other projects might want to use different ways of aggregating the per-instance predictions into a file score. Here, we are asking, “Is there anything suspicious?” so maximum is a good fit for aggregation. If we were looking for quantitative information like “the ratio of type A tissue to type B tissue,” we might take an appropriate mean instead.
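The aggregation described above fits in a few lines. This is a minimal sketch: `patient_score` is a hypothetical helper, the probabilities are made up, and treating "no candidates" as a score of 0.0 is an assumption.

```python
from statistics import mean


def patient_score(malignancy_probs):
    """Aggregate per-nodule malignancy predictions with a simple maximum."""
    # Assumption: no candidate nodules means nothing suspicious was found.
    return max(malignancy_probs, default=0.0)


per_nodule = [0.02, 0.91, 0.10]   # made-up predictions for three nodules
print("max: ", patient_score(per_nodule))
print("mean:", mean(per_nodule))
```

The maximum keeps the single suspicious nodule visible, whereas the mean would dilute it with the two benign-looking ones.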


## Autograd

Our approach for solving the problem won’t use **end-to-end gradient backpropagation** to directly optimize for our end goal. Instead, we’ll **optimize discrete chunks of the problem individually**, since our segmentation model and classification model won’t be trained in tandem with each other. That might limit the top-end effectiveness of our solution, but we feel that this will make for a much better learning experience.
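The sketch below illustrates what training the pieces separately means in PyTorch terms. Both models are toy stand-ins with hypothetical shapes and random data. Each has its own optimizer and its own loss, and because the classifier's input is not computed from the segmentation model's output, no gradient reaches the segmentation model during the classification step.

```python
import torch
from torch import nn

# Toy stand-ins for the two models; the real architectures will differ.
seg_model = nn.Conv3d(1, 1, kernel_size=3, padding=1)
cls_model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8 * 8, 2))
seg_opt = torch.optim.SGD(seg_model.parameters(), lr=0.01)
cls_opt = torch.optim.SGD(cls_model.parameters(), lr=0.01)

# Segmentation step: optimized against its own per-voxel target.
volume = torch.randn(1, 1, 8, 8, 8)
seg_target = torch.rand(1, 1, 8, 8, 8).round()
seg_opt.zero_grad()
seg_loss = nn.functional.binary_cross_entropy_with_logits(seg_model(volume), seg_target)
seg_loss.backward()
seg_opt.step()
seg_opt.zero_grad()  # clear gradients so the check below is meaningful

# Classification step: a separate crop, a separate loss, a separate optimizer.
crop = torch.randn(1, 1, 8, 8, 8)
cls_target = torch.tensor([1])
cls_opt.zero_grad()
cls_loss = nn.functional.cross_entropy(cls_model(crop), cls_target)
cls_loss.backward()
cls_opt.step()

# The classification loss did not produce gradients for the segmentation model.
seg_untouched = all(p.grad is None or not p.grad.any() for p in seg_model.parameters())
print("segmentation model untouched by classification loss:", seg_untouched)
```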

