These notes are taken from the book Deep Learning with PyTorch, chapter 9 onwards.

Our goal is to be able to produce a training sample given our inputs of raw CT
scan data and a list of annotations for those CTs. This might sound simple, but
quite a bit needs to happen before we can load, process, and extract the data.

### Raw CT data files

Our CT data comes in two files: a .mhd file containing metadata header information,
and a .raw file containing the raw bytes that make up the 3D array. Each file’s name starts
with a unique identifier called the series UID (the name comes from the Digital Imaging
and Communications in Medicine [DICOM] nomenclature) for the CT scan in question. 
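
To make the two-file layout concrete: a .mhd header is a small plain-text file of `Key = Value` lines (for example `DimSize`, `ElementSpacing`, and `ElementDataFile`, which names the companion .raw file). The following is a minimal sketch of reading such a header by hand. The helper names `parse_mhd_header` and `numeric_field` are hypothetical, and in practice a library such as SimpleITK is commonly used to load both files in one call.

```python
from pathlib import Path


def parse_mhd_header(mhd_path):
    """Parse the 'Key = Value' lines of a MetaImage (.mhd) header into a dict."""
    header = {}
    for line in Path(mhd_path).read_text().splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        header[key.strip()] = value.strip()
    return header


def numeric_field(header, key, cast=float):
    """Split a space-separated value (e.g. 'ElementSpacing') into numbers."""
    return [cast(token) for token in header[key].split()]
```

With a parsed header, `numeric_field(header, "DimSize", int)` gives the array dimensions needed to interpret the bytes in the .raw file.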

Our Ct class will consume those two files and produce the 3D array, as well as the
transformation matrix to convert from the patient coordinate system (which we will
discuss in more detail in section 10.6) to the index, row, column coordinates needed
by the array (these coordinates are shown as (I,R,C) in the figures and are denoted
with _irc variable suffixes in the code).
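
The conversion from patient coordinates to array coordinates can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact code used later: the names `IrcTuple` and `xyz_to_irc` are hypothetical, X is assumed to map to the column axis, Y to the row axis, and Z to the index axis, and the direction matrix is assumed to be orthonormal so that its inverse equals its transpose.

```python
from collections import namedtuple

# Hypothetical container for (index, row, column) array coordinates.
IrcTuple = namedtuple("IrcTuple", ["index", "row", "col"])


def xyz_to_irc(coord_xyz, origin_xyz, vx_size_xyz, direction):
    """Convert a patient-space (X, Y, Z) point in mm to voxel (I, R, C) indices."""
    # Offset of the point from the scan origin, in millimeters.
    delta = [c - o for c, o in zip(coord_xyz, origin_xyz)]
    # Undo the scan orientation. Assumption: `direction` is orthonormal,
    # so multiplying by its transpose is the same as applying its inverse.
    unrotated = [
        sum(direction[k][j] * delta[k] for k in range(3)) for j in range(3)
    ]
    # Scale millimeters to voxel counts; the result is in (C, R, I) order.
    cri = [round(u / size) for u, size in zip(unrotated, vx_size_xyz)]
    # Reverse the order to (I, R, C) to match the array layout.
    return IrcTuple(cri[2], cri[1], cri[0])
```

The origin, voxel size, and direction matrix would come from the .mhd metadata of the CT scan in question.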

We will also load the annotation data provided by LUNA, which will give us a list of
nodule coordinates, each with a malignancy flag, along with the series UID of the
relevant CT scan. By combining each nodule coordinate with the coordinate system
transformation information, we get the index, row, and column of the voxel at the
center of the nodule.

Using the (I,R,C) coordinates, we can crop a small 3D slice of our CT data to use as
the input to our model. Along with this 3D sample array, we must construct the rest of
our training sample tuple, which will have the sample array, nodule status flag, series
UID, and the index of this sample in the CT's list of nodule candidates. This sample
tuple is exactly what PyTorch expects from our Dataset subclass and represents the
last section of our bridge from our original raw data to the standard structure of
PyTorch tensors.
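
The cropping step can be sketched as computing one slice per axis around the center voxel. The function name `candidate_slices` is hypothetical, and shifting the window back inside the volume near the edges is an assumption made here for illustration; other boundary policies (such as padding) are equally plausible.

```python
def candidate_slices(center_irc, width_irc, shape_irc):
    """Return one slice per axis for a crop of size width_irc around center_irc."""
    slices = []
    for center, width, size in zip(center_irc, width_irc, shape_irc):
        assert 0 <= center < size, "center must lie inside the volume"
        assert width <= size, "crop cannot be larger than the volume"
        start = int(round(center - width / 2))
        end = start + width
        # Assumption: near an edge, shift the window so it stays in bounds
        # while keeping the requested width.
        if start < 0:
            start, end = 0, width
        if end > size:
            start, end = size - width, size
        slices.append(slice(start, end))
    return tuple(slices)
```

Given a 3D NumPy array `ct_a` (a hypothetical name), indexing with `ct_a[candidate_slices(center_irc, (32, 48, 48), ct_a.shape)]` would yield the sample array that goes into the training tuple.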

### Parsing LUNA's annotations

The candidates.csv file contains information about all lumps that potentially look like
nodules, whether those lumps are malignant, benign tumors, or something else altogether. We’ll use this as the basis for building a complete list of candidates that can
then be split into our training and validation datasets. The following shell commands, run from notebook cells, show what the files contain:

In [None]:
# Count the number of lines in annotations.csv
!wc -l ../input/luna-lung-cancer-dataset/annotations.csv

In [None]:
# Print the first few lines of candidates.csv
!head ../input/luna-lung-cancer-dataset/candidates.csv

In [None]:
# Print the first few lines of annotations.csv
!head ../input/luna-lung-cancer-dataset/annotations.csv
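
Beyond inspecting the file from the shell, the candidates can be loaded into Python with the standard `csv` module. This is a minimal sketch: the column names `seriesuid`, `coordX`, `coordY`, `coordZ`, and `class` are assumed and should be checked against the header row printed by `head`, and the names `CandidateInfo` and `read_candidates` are hypothetical.

```python
import csv
from collections import namedtuple

# Hypothetical record type for one row of candidates.csv.
CandidateInfo = namedtuple(
    "CandidateInfo", ["is_nodule", "series_uid", "center_xyz"]
)


def read_candidates(csv_file):
    """Read candidate rows from an open file object into CandidateInfo tuples."""
    candidates = []
    for row in csv.DictReader(csv_file):
        # Assumed column names; adjust if the file's header differs.
        center_xyz = (
            float(row["coordX"]),
            float(row["coordY"]),
            float(row["coordZ"]),
        )
        is_nodule = bool(int(row["class"]))
        candidates.append(CandidateInfo(is_nodule, row["seriesuid"], center_xyz))
    return candidates
```

A typical call would be `with open(path, newline="") as f: candidates = read_candidates(f)`, after which `sum(c.is_nodule for c in candidates)` counts the rows flagged as nodules.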

### Training and validation sets

For any standard supervised learning task (classification is the prototypical example),
we’ll split our data into training and validation sets. We want to make sure both sets
are representative of the range of real-world input data we’re expecting to see and handle normally. If either set is meaningfully different from our real-world use cases, it’s
pretty likely that our model will behave differently than we expect—all of the training
and statistics we collect won’t be predictive once we transfer over to production use!
We’re not trying to make this an exact science, but you should keep an eye out in
future projects for hints that you are training and testing on data that doesn’t make
sense for your operating environment.
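
One simple way to keep both sets representative is to sort the candidates and then send every N-th item to validation, so both sets span the same range of the data. The sketch below shows that idea; the function name `split_by_stride` and the parameter `val_stride` are hypothetical, and this is only one of several reasonable splitting strategies.

```python
def split_by_stride(items, val_stride):
    """Send every val_stride-th item to validation and the rest to training."""
    assert val_stride > 1, "a stride of 1 would leave no training data"
    validation = items[::val_stride]
    training = [
        item for position, item in enumerate(items) if position % val_stride != 0
    ]
    return training, validation
```

For example, `split_by_stride(sorted_candidates, 10)` would hold out roughly one tenth of a pre-sorted candidate list for validation.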

Unfortunately, the location information provided in annotations.csv doesn’t
always precisely line up with the coordinates in candidates.csv:

In [None]:
!grep 100225287222365663678666836860 ../input/luna-lung-cancer-dataset/annotations.csv

In [None]:
!grep '100225287222365663678666836860.*,1$' ../input/luna-lung-cancer-dataset/candidates_V2/candidates_V2.csv
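
Because the two files do not agree exactly, matching an annotation to a candidate needs some tolerance. The sketch below treats two centers as the same nodule when they differ by no more than a fraction of the annotated diameter along every axis. The function name `matches_annotation` and the default tolerance of one quarter of the diameter are assumptions chosen for illustration, and the coordinates in the usage note are made up.

```python
def matches_annotation(candidate_xyz, annotation_xyz, diameter_mm, tolerance=0.25):
    """Return True if the two centers agree within tolerance * diameter per axis."""
    # Assumption: the allowed offset scales with the size of the nodule.
    max_delta_mm = diameter_mm * tolerance
    return all(
        abs(c - a) <= max_delta_mm
        for c, a in zip(candidate_xyz, annotation_xyz)
    )
```

For instance, `matches_annotation((10.0, 20.0, 30.0), (10.5, 19.8, 30.2), diameter_mm=4.0)` accepts the pair, while shifting the candidate's X coordinate to 12.0 rejects it.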