# LUng Nodule Analysis 2016 (LUNA16) Dataset

## Dataset

### [Download](https://luna16.grand-challenge.org/Download/)

This dataset is based on the publicly available [LIDC/IDRI database](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI).

Data can be downloaded [here](https://luna16.grand-challenge.org/Download/). The data is split into two major parts, hosted at [part1](https://zenodo.org/records/3723295) and [part2](https://zenodo.org/records/4121926). These links provide the CT images along with a range of metadata for the dataset. The main data consists of ten subsets. Because the full dataset is approximately 220 GB, I will use only subsets 1 to 3 in my local environment. Later, I want to use Azure ML to train my model on the full dataset using its GPU VMs.

### [Description](https://luna16.grand-challenge.org/Data/)

The data is structured as follows:
- **subset0.zip to subset9.zip**: 10 zip files which contain all CT images
- **annotations.csv**: csv file that contains the annotations used as reference standard for the 'nodule detection' track
- **sampleSubmission.csv**: an example of a submission file in the correct format
- **candidates.csv**: the original set of candidates used for the LUNA16 workshop at ISBI2016. This file is kept for completeness but should not be used; use candidates_V2.csv instead (see more info below).
- **candidates_V2.csv**: csv file that contains an extended set of candidate locations for the ‘false positive reduction’ track. 
- **evaluation script**: the evaluation script that is used in the LUNA16 framework
- **lung segmentation**: a directory that contains the lung segmentation for CT images computed using automatic algorithms
- **additional_annotations.csv**: csv file that contains additional nodule annotations from our observer study. The file will be available soon
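
As a rough illustration of how the annotation file can be read, here is a minimal sketch using only the standard library. It assumes the header row of `annotations.csv` is `seriesuid,coordX,coordY,coordZ,diameter_mm` (check your downloaded copy). The rows below are made-up placeholders, not real LUNA16 entries, and `load_annotations` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed column layout of annotations.csv.
# The series UIDs and coordinates are placeholders, not real dataset entries.
SAMPLE_ANNOTATIONS = """\
seriesuid,coordX,coordY,coordZ,diameter_mm
series-a,-128.7,-175.3,-298.4,5.65
series-a,103.8,-211.9,-227.1,4.22
series-b,69.6,-140.9,876.4,5.79
"""


def load_annotations(csv_file):
    """Group annotated nodules by series UID as (x, y, z, diameter_mm) tuples."""
    nodules = defaultdict(list)
    for row in csv.DictReader(csv_file):
        nodules[row["seriesuid"]].append((
            float(row["coordX"]),
            float(row["coordY"]),
            float(row["coordZ"]),
            float(row["diameter_mm"]),
        ))
    return dict(nodules)


nodules_by_series = load_annotations(io.StringIO(SAMPLE_ANNOTATIONS))
for uid, nodules in nodules_by_series.items():
    print(uid, len(nodules), "nodule(s)")
```

With the real file, the same function can be called as `load_annotations(open("annotations.csv", newline=""))`.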



## Human Evaluation vs. Deep Learning 

The majority of a CT scan does not contribute to determining whether a patient has a malignant tumor. This is because most of the patient's body consists of healthy cells. Even in cases where a malignant tumor is present, up to 99.9999% of the voxels in the CT scan will not indicate cancer. This ratio is comparable to a two-pixel error on a high-definition television screen or a single misspelled word in a shelf full of novels.

<figure>
    <center>
        <img src="attachments/ct-scan-visualization.png"  style="width:750px;" >
    </center>
</figure>
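
The television comparison can be checked with a few lines of arithmetic. This sketch assumes "high-definition" means 1920 x 1080, and it uses a 512 x 512 x 300 scan purely as an illustrative size, not as a dataset statistic.

```python
# Fraction of voxels that do NOT indicate cancer, taken from the text above.
healthy_fraction = 0.999999
signal_fraction = 1 - healthy_fraction  # roughly one in a million

# Assumption: "high-definition" means Full HD, 1920 x 1080 pixels.
hd_pixels = 1920 * 1080
equivalent_pixels = hd_pixels * signal_fraction

# Assumption: an illustrative scan size of 512 x 512 voxels per slice, 300 slices.
scan_voxels = 512 * 512 * 300
signal_voxels = scan_voxels * signal_fraction

print(f"HD pixels: {hd_pixels:,} -> about {equivalent_pixels:.1f} pixels")
print(f"Scan voxels: {scan_voxels:,} -> about {signal_voxels:.0f} voxels")
```

Under these assumptions, only a few dozen voxels out of tens of millions carry the signal of interest, which is why the later steps narrow the search before classifying.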


## End-to-End versus Specific Model Design 

End-to-end models (e.g. Faster R-CNN and Mask R-CNN from TorchVision) perform well on general vision tasks but require vast datasets, which is impractical for rare classes. The approach used here is designed to work with a modest amount of data. Collecting a dataset at that scale is resource-intensive, and end-to-end models trained on less often yield poor results.


# Domain Knowledge

## Computed Tomography (CT) Scan

We will be using data from CT scans extensively as the main data format for our project. CT scans are essentially 3D X-rays, represented as a 3D array of single-channel data. Each element in the array is called a voxel, which is the 3D equivalent of a pixel.

<figure>
    <center>
        <img src="attachments/ct-scan-voxel-example.png"  style="width:550px;" >
        <p><small>A CT scan of a human torso showing, from the top, skin, organs, spine, and patient support bed.</small></p>
    </center>
</figure>

### Voxel

> **Voxel - Volumetric pixel** is the 3D equivalent to the familiar two-dimensional pixel. It encloses a volume of space.

Each voxel of a CT scan has a numeric value that roughly corresponds to the average mass density of the matter contained inside. Most visualizations of that data show high-density material like bones and metal implants as white, low-density air and lung tissue as black, and fat and tissue as various shades of gray.
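
The mapping from voxel value to gray level described above can be sketched as a simple clamp-and-scale. CT values are usually expressed in Hounsfield units, where air is about -1000 and water is 0. The display window of -1000 to 1000, the helper name `to_gray`, and the tiny nested-list "scan" are all assumptions made for illustration.

```python
def to_gray(value, low=-1000.0, high=1000.0):
    """Map a voxel value to an 8-bit gray level, clamping outside [low, high].

    The default window bounds are an illustrative assumption.
    """
    clamped = max(low, min(high, value))
    return round((clamped - low) / (high - low) * 255)


# A made-up 2 x 2 x 3 volume, indexed as volume[slice][row][col].
# Real scans are far larger and are normally held in an array library.
volume = [
    [[-1000.0, -850.0, 40.0], [60.0, 700.0, 3000.0]],
    [[-990.0, -100.0, 0.0], [30.0, 1200.0, -1000.0]],
]

gray = [[[to_gray(v) for v in row] for row in plane] for plane in volume]

# Air-like values end up near black (0), dense material near white (255).
for plane in gray:
    print(plane)
```

Values beyond the window, such as metal implants, saturate at white rather than stretching the scale for everything else.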


## Nodule

A nodule is any of the myriad lumps and bumps that might appear inside someone’s lungs. Some are problematic from a health-of-the-patient perspective; some are not. The precise definition limits the size of a nodule to 3 cm or less, with a larger lump being a lung mass; but we’re going to use nodule interchangeably for all such anatomical structures. A nodule can turn out to be benign or a malignant tumor (also referred to as cancer). From a radiological perspective, a nodule is really similar to other lumps that have a wide variety of causes: infection, inflammation, blood-supply issues, malformed blood vessels, and diseases other than tumors.

<figure>
    <center>
        <img src="attachments/malignant-nodule.png"  style="width:750px;" >
    </center>
</figure>


# An end-to-end project design

Our lung cancer diagnosis pipeline uses five key steps:

1. Data Loading: Convert raw CT scans into PyTorch-compatible format.
2. Segmentation: Detect tumor-associated voxels in lung regions using segmentation models.
3. Grouping: Cluster voxels into candidate nodules and identify their 3D centers (non-ML step).
4. Classification: Analyze 3D regions around candidate nodules using convolutional networks to predict malignancy.
5. Diagnosis: Aggregate nodule predictions, using maximum malignancy score for final diagnosis.
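
The two non-ML pieces of the steps above (grouping in step 3 and aggregation in step 5) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: voxels are grouped by face adjacency, the "center" is taken to be the cluster centroid, and the 0.5 decision threshold, the function names, and the sample inputs are all hypothetical.

```python
from collections import deque

# Offsets to the six face-adjacent neighbors of a voxel (an assumed connectivity).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def group_voxels(flagged):
    """Cluster adjacent flagged voxels and return each cluster's centroid.

    `flagged` is a set of (index, row, col) tuples standing in for the
    voxels marked by the segmentation step.
    """
    unvisited = set(flagged)
    centers = []
    while unvisited:
        start = unvisited.pop()
        queue = deque([start])
        cluster = [start]
        while queue:  # breadth-first flood fill over adjacent flagged voxels
            i, r, c = queue.popleft()
            for di, dr, dc in NEIGHBORS:
                nxt = (i + di, r + dr, c + dc)
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    queue.append(nxt)
                    cluster.append(nxt)
        n = len(cluster)
        centers.append(tuple(sum(v[axis] for v in cluster) / n for axis in range(3)))
    return sorted(centers)


def diagnose(malignancy_scores, threshold=0.5):
    """Return the maximum nodule score and whether it reaches the threshold."""
    worst = max(malignancy_scores, default=0.0)
    return worst, worst >= threshold


# Made-up segmentation output: two separate blobs of flagged voxels.
flagged = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (5, 5, 5), (5, 5, 6)}
centers = group_voxels(flagged)
print(centers)

# Placeholder scores standing in for the classifier's per-nodule output.
scores = [0.12, 0.81]
print(diagnose(scores))
```

Taking the maximum means a single high-scoring nodule is enough to flag the scan, regardless of how many low-scoring candidates surround it.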
