# End-to-end Lung Cancer Detection

The steps for combining the models and datasets we have built so far:

1. **Generate nodule candidates**. This is step 3 in the overall project. Three tasks go into this step:
    - **Segmentation** — The segmentation model from chapter 13 will predict if a given pixel is of interest: if we suspect it is part of a nodule. This will be done per 2D slice, and every 2D result will be stacked to form a 3D array of voxels containing nodule candidate predictions.
    - **Grouping** — We will group the voxels into nodule candidates by applying a threshold to the predictions, and then grouping connected regions of flagged voxels.
    - **Constructing sample tuples** — Each identified nodule candidate will be used to construct a sample tuple for classification. In particular, we need to produce the coordinates (index, row, column) of that nodule’s center.
2. **Classify nodules and malignancy**. We’ll take the nodule candidates we just produced, pass them to the candidate classifier, and then perform malignancy detection on the candidates flagged as nodules:
    - **Nodule classification** — Each nodule candidate from segmentation and grouping will be classified as either nodule or non-nodule. Doing so will allow us to screen out the many normal anatomical structures flagged by our segmentation process.
    - **ROC/AUC metrics** — Before we can start our last classification step, we’ll define some new metrics for examining the performance of classification models, as well as establish a baseline metric against which to compare our malignancy classifiers.
    - **Fine-tuning the malignancy model** — Once our new metrics are in place, we will define a model specifically for classifying benign and malignant nodules, train it, and see how it performs. We will do the training by fine-tuning: a process that cuts out some of the weights of an existing model and replaces them with fresh values that we then adapt to our new task.
3. **End-to-end detection**. Finally, we will put all of this together to get to the finish line, combining the components into an end-to-end solution that can look at a CT and answer the question “Are there malignant nodules present in the lungs?”
    - **IRC** — We will segment our CT to get nodule candidate samples to classify, each identified by the (index, row, column) coordinates of its center.
    - **Determine the nodules** — We will perform nodule classification on each candidate to determine whether it should be fed into the malignancy classifier.
    - **Determine malignancy** — We will perform malignancy classification on the nodules that pass through the nodule classifier to determine whether the patient has cancer.
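
The grouping step described in step 1 can be sketched as follows. This is a minimal illustration, not the chapter's actual code: the function name `group_candidates`, the `0.5` threshold, and the unweighted centroid are assumptions made for the example. It relies on `scipy.ndimage.label` to find connected regions of flagged voxels and on `scipy.ndimage.center_of_mass` to compute each region's center.

```python
import numpy as np
from scipy import ndimage


def group_candidates(prob_volume, threshold=0.5):
    """Turn a 3D array of per-voxel predictions into candidate centers.

    prob_volume: array of shape (index, row, col) built by stacking the
    per-slice segmentation outputs. The name and the default threshold
    are assumptions for this sketch.
    Returns a list of (index, row, col) integer tuples, one per region.
    """
    # 1. Flag voxels whose prediction exceeds the threshold.
    mask = prob_volume > threshold

    # 2. Group connected flagged voxels; label 0 is the background.
    labels, n_regions = ndimage.label(mask)
    if n_regions == 0:
        return []

    # 3. Compute the (unweighted) center of each labeled region.
    centers = ndimage.center_of_mass(
        mask.astype(np.float32),
        labels=labels,
        index=np.arange(1, n_regions + 1),
    )

    # Round to integer voxel coordinates for use in sample tuples.
    return [tuple(int(round(c)) for c in center) for center in centers]
```

Each returned tuple is the center that a classification sample would be cropped around. Weighting the centroid by voxel intensity instead of the binary mask is an equally reasonable choice.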

## Warning: The leak

For both the segmentation and classification models, we took care to split the data into a training set and an independent validation set, taking every tenth example for validation and the remainder for training.

However, the split for the classification model was done on the list of nodules, and the split for the segmentation model was done on the list of CT scans. This means we likely have nodules from the segmentation validation set in the training set of the classification model and vice versa.

<span style="color:red">**This is called a leak, and it would invalidate our validation.**</span>

To rectify this potential data leak, we need to rework the classification dataset to also work at the CT scan level, just as we did for the segmentation dataset.
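
A minimal sketch of a CT-level split is shown below. The `Candidate` record and the function names are hypothetical, and the stride of 10 mirrors the "every tenth example" rule described above. The idea is to choose the validation CT scans once, then assign every candidate to the side its scan belongs to, so that both models can share the same split.

```python
from collections import namedtuple

# Hypothetical candidate record: which CT it came from and its center.
Candidate = namedtuple("Candidate", ["series_uid", "center_irc"])


def split_series(series_uids, val_stride=10):
    """Pick every val_stride-th CT scan (in sorted order) for validation."""
    ordered = sorted(set(series_uids))
    val_series = set(ordered[::val_stride])
    train_series = set(ordered) - val_series
    return train_series, val_series


def split_candidates(candidates, val_series):
    """Assign each candidate to train or validation based on its CT scan."""
    train = [c for c in candidates if c.series_uid not in val_series]
    val = [c for c in candidates if c.series_uid in val_series]
    return train, val
```

Because the split is decided per scan, no CT can contribute candidates to both sides. Splitting the candidate list directly (for example `candidates[::10]`) gives no such guarantee.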

> **Takeaway**: Keep an eye on the end-to-end process when defining the validation set. Probably the easiest way to do this (and the way it is done for most important datasets) is to make the validation split as explicit as possible—for example, by having two separate directories for training and validation—and then stick to this split for your entire project.