# Training on the LuNA 16 Dataset

**The basic structure of what we’re going to implement is as follows:**

- Initialize our model and data loading.
- Loop over a semi-arbitrarily chosen number of epochs.
  - Loop over each batch of training data returned by the data loader.
    - The data-loader worker process loads the relevant batch of data in the background.
    - Pass the batch into the model to get predictions.
    - Calculate our loss based on the difference between our predicted results and our ground-truth data.
    - Record metrics about our model’s performance into a temporary data structure.
    - Update the model weights via backpropagation of the error.
  - Loop over each batch of validation data (in a manner very similar to the training loop).
    - Load the relevant batch of validation data (again, in the background worker process).
    - Classify the batch, and compute the loss.
    - Record information about how well the model performed on the validation data.
  - Log progress and performance information for this epoch.
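The outline above maps onto a fairly small amount of PyTorch. The following is a minimal sketch of that structure, not the project's actual training code: the tiny linear model, the random tensors, and the names `run_epoch`, `train_loader`, `val_loader`, and `history` are placeholders invented for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data: 32 samples with 10 features each and binary labels.
features = torch.randn(32, 10)
labels = torch.randint(0, 2, (32,))
train_loader = DataLoader(TensorDataset(features[:24], labels[:24]), batch_size=8)
val_loader = DataLoader(TensorDataset(features[24:], labels[24:]), batch_size=8)

model = nn.Linear(10, 2)  # stand-in for the real classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()


def run_epoch(loader, train):
    """Run one pass over `loader` and return the mean per-sample loss."""
    model.train(train)
    total_loss, n_samples = 0.0, 0
    # Gradients are only tracked during the training pass.
    with torch.set_grad_enabled(train):
        for inputs, targets in loader:
            logits = model(inputs)
            loss = loss_fn(logits, targets)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Record metrics in a temporary accumulator.
            total_loss += loss.item() * len(inputs)
            n_samples += len(inputs)
    return total_loss / n_samples


history = []
for epoch in range(1, 4):  # semi-arbitrary number of epochs
    train_loss = run_epoch(train_loader, train=True)
    val_loss = run_epoch(val_loader, train=False)
    history.append((train_loss, val_loss))
    print(f"epoch {epoch}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")
```

Sharing one function between the training and validation passes keeps the two loops as similar as the outline suggests; the only difference is whether gradients are computed and the weights updated.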


In [None]:
from torch.utils import data as data_utils

from luna16 import augmentations, datasets, dto

# Loading LuNA Data

Training on single samples at a time is typically an inefficient use of computing resources, because most processing platforms are capable of more parallel calculations than are required by a model to process a single training or validation sample. The solution is to group sample tuples together into a batch tuple, allowing multiple samples to be processed at the same time.

Conveniently, we don’t have to implement any of this _batching_: the PyTorch `DataLoader` class will handle all of the collation work for us. We’ve already built the bridge from the CT scans to PyTorch tensors with our `CutoutsDataset` class, so all that remains is to plug our dataset into a data loader.
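To make the collation concrete, here is a small sketch in which a `TensorDataset` of all-zero tensors stands in for `CutoutsDataset` (an assumption made so the example runs without the LuNA data). The names `fake_dataset` and `batch_shapes` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten fake single-channel cutouts of shape 32 x 48 x 48, with binary labels.
cutouts = torch.zeros(10, 1, 32, 48, 48)
labels = torch.tensor([0, 1] * 5)
fake_dataset = TensorDataset(cutouts, labels)

# The loader stacks individual samples along a new leading batch dimension.
loader = DataLoader(fake_dataset, batch_size=4)
batch_shapes = [tuple(batch.shape) for batch, _ in loader]
print(batch_shapes)
```

With ten samples and a batch size of four, the last batch holds only the two remaining samples, because `drop_last` defaults to `False`.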


In [None]:
ratio = dto.NoduleRatio(positive=1, negative=5)
transformations: list[augmentations.Transformation] = [
    augmentations.Flip(),
    augmentations.Offset(offset=0.1),
    augmentations.Scale(scale=0.2),
    augmentations.Rotate(),
]
filters: list[augmentations.Filter] = [
    augmentations.Noise(noise=25.0),
]
dataset = datasets.CutoutsDataset(
    ratio=ratio,
    validation_stride=20,
    transformations=transformations,
    filters=filters,
)

train_dataloader = data_utils.DataLoader(
    dataset=dataset,
    batch_size=64,
    num_workers=0,
    pin_memory=False,
)

## Data Module

The `DataModule` class is a custom data handling module built on top of PyTorch's `DataLoader`. It simplifies the process of creating and managing data loaders for training and validation datasets. The `DataModule` class handles the following tasks:

- Initializes with training and validation datasets, along with a specified batch size.
- Determines the device (CPU or GPU) to be used for data loading and model training.
- Creates data loaders for both training and validation datasets, ensuring efficient data loading with support for multiple workers and CUDA pin memory if available.
- Provides methods to retrieve the appropriate data loader (training or validation) based on the context.

This abstraction helps in organizing the data loading logic and makes the training and validation loops cleaner and more manageable.
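The responsibilities listed above could be arranged roughly as follows. This is a hypothetical sketch, not the `luna16` implementation: the class name `SketchDataModule`, the `num_workers` argument, and the `get_dataloader` method are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class SketchDataModule:
    """Hypothetical stand-in; not the luna16 `DataModule` implementation."""

    def __init__(self, batch_size: int, train: Dataset, validation: Dataset, num_workers: int = 0):
        # Decide once which device is available.
        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if self.use_cuda else "cpu")
        # Pinned memory only helps when batches are copied to a CUDA device.
        loader_kwargs = dict(
            batch_size=batch_size,
            num_workers=num_workers,
            pin_memory=self.use_cuda,
        )
        self._train = DataLoader(train, **loader_kwargs)
        self._validation = DataLoader(validation, **loader_kwargs)

    def get_dataloader(self, train: bool) -> DataLoader:
        """Return the training loader if `train` is true, else the validation loader."""
        return self._train if train else self._validation


toy = TensorDataset(torch.zeros(6, 3), torch.zeros(6))
module = SketchDataModule(batch_size=4, train=toy, validation=toy)
n_batches = len(module.get_dataloader(train=True))
print(n_batches)
```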

In [None]:
data_module = datasets.DataModule(
    batch_size=64,
    train=dataset,
    validation=dataset,  # this would normally be a validation dataset
)

# LuNA Model Architecture

Classification models often have a structure that consists of a tail, a backbone (or body), and a head. 

Using a structure like this can be a good first building block for a convolutional network. There are more complicated designs out there, but for many projects they’re overkill in terms of both implementation complexity and computational demands. It’s a good idea to start simple and add complexity only when there’s a demonstrable need for it.


## Network Composition

### Tail

The _tail_ is the first few layers that process the input to the network. These early layers often have a different structure or organization than the rest of the network, as they must adapt the input to the form expected by the backbone. Here we use a simple batch normalization layer, though often the tail contains convolutional layers as well. Such convolutional layers are often used to aggressively downsample the size of the image; since our image size is already small, we don’t need to do that here.

#### Our implementation

We are going to normalize our input using `nn.BatchNorm3d`, which, as we saw in chapter 8, will shift and scale our input so that it has a mean of 0 and a standard deviation of 1. Thus, the somewhat odd Hounsfield unit (HU) scale that our input is in won’t really be visible to the rest of the network.
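The effect can be checked on synthetic data. In this sketch the input is random noise scaled to a rough HU-like range, which is an assumption for illustration and not real CT data; `fake_hu` and `tail` are illustrative names.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Fake batch of 8 single-channel cutouts with HU-like values (air is about -1000).
fake_hu = torch.randn(8, 1, 4, 6, 6) * 300.0 - 600.0

tail = nn.BatchNorm3d(num_features=1)
tail.train()  # in training mode, statistics come from the current batch
normalized = tail(fake_hu)

print(f"before: mean {fake_hu.mean().item():.1f}, std {fake_hu.std().item():.1f}")
print(f"after:  mean {normalized.mean().item():.4f}, std {normalized.std().item():.4f}")
```

Note that in evaluation mode (`tail.eval()`) the layer uses its running estimates of the mean and variance instead of the statistics of the current batch.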


### Backbone (or body)

The _backbone_ of the network typically contains the bulk of the layers, which are usually arranged in a series of blocks. Each block has the same (or at least a similar) set of layers, though often the size of the expected input and the number of filters change from block to block. We will use a block that consists of two `3 × 3 × 3` convolutions, each followed by an activation, with a max-pooling operation at the end of the block. We can see this in the expanded view of the figure above labeled `Block[block1]`.
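A block of this shape might be written as follows. This is a sketch under stated assumptions rather than the project's model code: the class name `SketchBlock`, the use of `padding=1`, and the channel count of 8 are choices made for illustration.

```python
import torch
from torch import nn


class SketchBlock(nn.Module):
    """Two 3x3x3 convolutions, each followed by ReLU, then a 2x2x2 max pool."""

    def __init__(self, in_channels: int, conv_channels: int):
        super().__init__()
        # padding=1 keeps the spatial size unchanged through each convolution (an assumption).
        self.conv1 = nn.Conv3d(in_channels, conv_channels, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv3d(conv_channels, conv_channels, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.maxpool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        return self.maxpool(x)  # halves each spatial dimension


block = SketchBlock(in_channels=1, conv_channels=8)
out = block(torch.zeros(2, 1, 32, 48, 48))
out_shape = tuple(out.shape)
print(out_shape)
```

With padding of one voxel, only the max pool changes the spatial size, so a `32 × 48 × 48` input leaves the block as `16 × 24 × 24`.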

#### Our implementation

Our backbone is four repeated blocks, with the block implementation pulled out into a separate `nn.Module` subclass. Since each block ends with a `2 × 2 × 2` max-pool operation, after four blocks we will have decreased the resolution of the image 16 times in each dimension. Our data is returned in chunks that are `32 × 48 × 48` (`cutout_shape` in code), which will become `2 × 3 × 3` by the end of the backbone.
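The shape arithmetic can be verified in a few lines of plain Python. The sketch assumes that only the max pool changes the spatial size (that is, the convolutions are padded), and it takes the 64 output channels from the description of `self.block4` in the Head section; `flattened_features` is an illustrative name.

```python
cutout_shape = (32, 48, 48)
n_blocks = 4

shape = cutout_shape
for block_index in range(1, n_blocks + 1):
    # Each 2 x 2 x 2 max pool halves every spatial dimension.
    shape = tuple(size // 2 for size in shape)
    print(f"after block {block_index}: {shape}")

downsampling_factor = 2 ** n_blocks
# Number of values per sample once the 64-channel output is flattened.
flattened_features = 64 * shape[0] * shape[1] * shape[2]
print(downsampling_factor, flattened_features)
```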


### Head

The _head_ of the network takes the output from the backbone and converts it into the desired output form. For convolutional networks, this often involves flattening the intermediate output and passing it to a fully connected layer. For some networks, it makes sense to also include a second fully connected layer, although that is usually more appropriate for classification problems in which the imaged objects have more structure (think about cars versus trucks having wheels, lights, grill, doors, and so on) and for projects with a large number of classes. Since we are only doing binary classification, and we don’t seem to need the additional complexity, we flatten the output and use only a single fully connected layer.

#### Our implementation

Continuing on with our model definition, we come to a complication. We can’t just feed the output of `self.block4` into a fully connected layer, since that output is a per-sample `2 × 3 × 3` image with 64 channels, and fully connected layers expect a 1D vector as input (well, technically they expect a batch of 1D vectors, which is a 2D array, but the mismatch remains either way).

Note that before we pass data into a fully connected layer, we must flatten it using the `view` method. Since that operation is stateless (it has no parameters that govern its behavior), we can simply perform the operation in the `forward` method.

Our head is just a fully connected layer followed by a call to `nn.Softmax`. Softmax is a useful function for single-label classification tasks and has a few nice properties: it bounds the output between 0 and 1, it’s relatively insensitive to the absolute range of the inputs (only the relative values of the inputs matter), and it allows our model to express the degree of certainty it has in an answer.
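The claim that only the relative values of the inputs matter can be checked with a hand-written softmax. This plain-Python version is for illustration only and is not how `nn.Softmax` is implemented; the function name `softmax` and the sample inputs are arbitrary.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 3.0])
shifted = softmax([101.0, 103.0])  # same inputs, shifted by a constant
print(probs)
print(shifted)
```

Both calls produce the same pair of probabilities, because adding a constant to every input cancels out in the normalization.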

For the return value of the `forward` method, we return both the raw _logits_ and the softmax-produced _probabilities_. Logits are the numerical values produced by the network before the softmax layer normalizes them into probabilities; in other words, they are the raw input to the softmax layer. They can take any real value, and the softmax squashes them into the range 0–1.

We’ll use the logits when we calculate the `nn.CrossEntropyLoss` during training, and we’ll use the probabilities for when we want to actually classify the samples. This kind of slight difference between what’s used for training and what’s used in production is fairly common, especially when the difference between the two outputs is a simple, stateless function like softmax.
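Putting the pieces of the head together, a sketch might look like this. The random tensor stands in for the output of `self.block4`, and the names `head_linear` and `head_softmax` as well as the two-class output size are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Fake backbone output: batch of 5 samples, 64 channels, 2 x 3 x 3 voxels each.
block_out = torch.randn(5, 64, 2, 3, 3)

head_linear = nn.Linear(64 * 2 * 3 * 3, 2)
head_softmax = nn.Softmax(dim=1)

flat = block_out.view(block_out.size(0), -1)  # shape: (5, 1152)
logits = head_linear(flat)                    # raw scores, any real value
probabilities = head_softmax(logits)          # each row sums to 1

# During training, the loss is computed from the logits, not the probabilities.
targets = torch.tensor([0, 1, 0, 0, 1])
loss = nn.CrossEntropyLoss()(logits, targets)
print(tuple(flat.shape), tuple(probabilities.shape))
```

`nn.CrossEntropyLoss` applies a log-softmax internally, which is why it expects the logits and not the already-normalized probabilities.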


## Model Block

This block will be repeated multiple times to form our model’s backbone.

### Convolutional Layers

We’re using `3 × 3 × 3` convolutions in our block. A single `3 × 3 × 3` convolution has a receptive field of `3 × 3 × 3`, which is almost tautological. Twenty-seven voxels are fed in, and one comes out. 

It gets interesting when we use two `3 × 3 × 3` convolutions stacked back to back. Stacking convolutional layers allows the final output voxel (or pixel) to be influenced by an input further away than the size of the convolutional kernel suggests. If that output voxel is fed into another `3 × 3 × 3` kernel as one of the edge voxels, then some of the inputs to the first layer will be outside of the `3 × 3 × 3` area of input to the second. The final output of those two stacked layers has an effective receptive field of `5 × 5 × 5`. That means that when taken together, the stacked layers act similarly to a single convolutional layer with a larger kernel.

Put another way, each `3 × 3 × 3` convolutional layer adds a one-voxel-per-edge border to the receptive field. We can see this if we trace the arrows in figure 11.6 backward; our `2 × 2` output has a receptive field of `4 × 4`, which in turn has a receptive field of `6 × 6`. Two stacked `3 × 3 × 3` layers use fewer parameters than a full `5 × 5 × 5` convolution would (and so are also faster to compute).
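Both claims reduce to simple arithmetic. For stride-1 convolutions with kernel size k, stacking n layers gives a receptive field of 1 + n(k − 1) voxels per edge. The sketch below computes that and compares weight counts; the helper names and the channel count of 8 are arbitrary choices for illustration.

```python
def receptive_field(n_layers, kernel_size=3):
    """Edge length of the receptive field of stacked stride-1 convolutions."""
    return 1 + n_layers * (kernel_size - 1)


def conv3d_weights(kernel_size, in_channels, out_channels):
    """Number of weights (biases excluded) in one 3D convolution."""
    return kernel_size ** 3 * in_channels * out_channels


channels = 8  # arbitrary illustrative channel count
stacked = 2 * conv3d_weights(3, channels, channels)  # two 3x3x3 layers
single = conv3d_weights(5, channels, channels)       # one 5x5x5 layer
print(receptive_field(1), receptive_field(2))
print(stacked, single)
```

Per pair of input and output channels, the stacked layers need 2 × 27 = 54 weights against 125 for the single `5 × 5 × 5` kernel.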

### MaxPool

The output of our two stacked convolutions is fed into a `2 × 2 × 2` max pool, which means we’re taking a `6 × 6 × 6` effective field, throwing away seven-eighths of the data, and going with the one `5 × 5 × 5` field that produced the largest value. Now, those “discarded” input voxels still have a chance to contribute, since the max pool that’s one output voxel over has an overlapping input field, so it’s possible they’ll influence the final output that way.
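A tiny example shows the seven-eighths reduction directly. The input here is a single `2 × 2 × 2` volume with made-up values, chosen so that the result is easy to read.

```python
import torch
from torch import nn

# One sample, one channel, a 2 x 2 x 2 volume holding the values 0..7.
volume = torch.arange(8, dtype=torch.float32).reshape(1, 1, 2, 2, 2)

# A 2 x 2 x 2 max pool keeps only the largest of the eight voxels.
pooled = nn.MaxPool3d(kernel_size=2)(volume)
print(tuple(pooled.shape), pooled.item())
```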

### ReLU

In the `nn.ReLU` activation function, outputs greater than $0.0$ are left unchanged, and outputs less than $0.0$ are clamped to zero.
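A quick check of this behavior on a handful of made-up values:

```python
import torch
from torch import nn

relu = nn.ReLU()
activations = relu(torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0]))
relu_result = activations.tolist()
print(relu_result)
```

The negative entries come back as zero, while the non-negative entries pass through unchanged.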
