# Improving training with metrics and augmentation

After the initial implementation, our model reached a solid loss after 10-20 epochs. The problem was that our dataset had only a few positive samples, so even if the model did not classify any of them correctly, its score still looked great. That leaves two problems we must fix:
1. Raise the number of positive samples in our datasets
2. Change how we score our model to better include classification of positive samples


# False Positive vs. True Negative

A _false positive_ is an event that is classified as of interest or as a member of the desired class (positive as in “Yes, that’s the type of thing I’m interested in knowing about”) but that in truth is not really of interest. For the nodule-detection problem, it’s when an actually uninteresting candidate is flagged as a nodule and, hence, in need of a radiologist’s attention.

Contrast _false positives_ with _true positives_: items of interest that are classified correctly.

A _false negative_ is an event that is classified as not of interest or not a member of the desired class (negative as in “No, that’s not the type of thing I’m interested in knowing about”) but that in truth is actually of interest. For the nodule-detection problem, it’s when a nodule (that is, a potential cancer) goes undetected.

Contrast _false negatives_ with _true negatives_: uninteresting items that are correctly identified as such. 



## Graphing the positives and negatives

The actual input data we’re going to use has high dimensionality: we need to consider a ton of CT voxel values, along with more abstract things like candidate size, overall location in the lungs, and so on. The job of our model is to map each of these events and their respective properties into the rectangle shown in the figure below, in such a way that we can separate the positive and negative events cleanly using a single vertical line (our classification threshold). This is done by the `nn.Linear` layers at the end of our model. The position of the vertical line corresponds exactly to the `classification_threshold`, for which we chose the hardcoded value `0.5`.

> **Note for our data** The data presented is not two-dimensional; it goes from very-high-dimensional after the second-to-last layer, to one-dimensional (here, our X-axis) at the output—just a single scalar per sample (which is then bisected by the classification threshold). Here, we use the second dimension (the Y-axis) to represent per-sample features that our model cannot see or use: things like age or gender of the patient, location of the nodule candidate in the lung, or even local aspects of the candidate that the model hasn’t utilized. It also gives us a convenient way to represent confusion between non-nodule and nodule samples.

The quadrant areas in the figure below, and the count of samples contained in each, will be the values we use to discuss model performance, since we can use the ratios between these values to construct increasingly complex metrics that objectively measure how well we are doing. As they say, “the proof is in the proportions.”

<figure>
    <center>
        <img src="attachments/positives-negatives.png"  style="width:750px;" >
    </center>
</figure>

# Discrimination

> We define discrimination as “the ability to separate two classes from each other.” 

Some other definitions of discrimination are more problematic. While out of scope for the discussion of our work here, there is a larger issue with models trained from real-world data. If that real-world dataset is collected from sources that have a real-world discriminatory bias (for example, racial bias in arrest and conviction rates, or anything collected from social media), and that bias is not corrected for during dataset preparation or training, then the resulting model will continue to exhibit the same biases present in the training data. Just as in humans, racism is learned.

This means almost any model trained from internet-at-large data sources will be compromised in some fashion, unless extreme care is taken to scrub those biases from the model.

# Metrics for grading the model

A binary label and a binary classification threshold combine to partition the dataset into four quadrants: true positives, true negatives, false negatives, and false positives. These four quantities provide the basis for our improved performance metrics.

<figure>
    <center>
        <img src="attachments/confusion-matrix.png"  style="width:750px;" >
    </center>
</figure>
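
As a minimal sketch of how the four quadrants can be counted, the snippet below applies a threshold to per-sample scores and compares the result with the labels. The function name `confusion_counts` and the toy labels and scores are assumptions made for illustration; only the `0.5` threshold comes from the text above.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted_positive = score > threshold
        if predicted_positive and label:
            counts["tp"] += 1      # flagged, and really of interest
        elif predicted_positive:
            counts["fp"] += 1      # flagged, but not of interest
        elif label:
            counts["fn"] += 1      # missed a sample of interest
        else:
            counts["tn"] += 1      # correctly ignored
    return counts


# Toy data (made up): 1 = nodule, 0 = non-nodule.
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3]
counts = confusion_counts(labels, scores)
print(counts)
```

Moving `threshold` left or right shifts samples between the predicted-positive and predicted-negative columns, which is why the same model can produce different counts.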

Both _precision_ and _recall_ are valuable metrics to be able to track during training, since they provide important insight into how the model is behaving. If either of them drops to zero, it’s likely that our model has started to behave in a degenerate manner.

While neither _precision_ nor _recall_ can be the single metric used to grade our model, they are both useful numbers to have on hand during training. 


## Recall

> “Make sure you never miss any interesting events!”

_Recall_ (or _sensitivity_) is the ratio of the true positives to the union of true positives and false negatives, that is, TP / (TP + FN).

To improve recall, minimize false negatives.
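
A small sketch of that ratio, with a hypothetical helper `recall` and made-up counts. Returning `0.0` when there are no actual positives is a convention chosen here, not something the text prescribes.

```python
def recall(true_positives, false_negatives):
    """Fraction of the actual positives that the model flagged."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0  # assumed convention: nothing to find means recall 0
    return true_positives / actual_positives


# Made-up counts: 30 nodules found, 10 missed.
print(recall(true_positives=30, false_negatives=10))  # 0.75
```

A model that flags every candidate as positive has no false negatives and therefore a recall of 1.0, which is one reason recall cannot be the only metric.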


## Precision

> “Never bark unless you’re sure.”

_Precision_ is the ratio of the true positives to the union of true positives and false positives, that is, TP / (TP + FP).

To improve precision, minimize false positives.
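
The same kind of sketch for precision, again with a hypothetical helper name and made-up counts. Returning `0.0` when nothing was flagged is an assumed convention.

```python
def precision(true_positives, false_positives):
    """Fraction of the flagged samples that were really positive."""
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # assumed convention: nothing flagged means precision 0
    return true_positives / flagged


# Made-up counts: 30 correct flags, 90 false alarms.
print(precision(true_positives=30, false_positives=90))  # 0.25
```

A model that flags only the single candidate it is most sure about can reach a precision of 1.0 while missing almost every nodule, so precision alone is not enough either.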


## F1 Score

The generally accepted way of combining precision and recall is by using the [F1 score](https://en.wikipedia.org/wiki/F1_score). As with other metrics, the F1 score ranges between 0 (a classifier with no real-world predictive power) and 1 (a classifier that has perfect predictions).
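
The F1 score is the harmonic mean of precision and recall. A minimal sketch follows; the helper name `f1_score` and the input values are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both zero: no predictive power
    return 2 * precision * recall / (precision + recall)


print(f1_score(precision=0.25, recall=0.75))  # 0.375
print(f1_score(precision=0.0, recall=1.0))    # 0.0
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a model cannot score well on F1 by maximizing one of precision or recall while neglecting the other.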

Ideally, we want to see an image like figure 12.15, where our label threshold is nearly vertical. That’s what we want, because it means the label threshold and our classification threshold can line up reasonably well. Similarly, most of the samples are concentrated at either end of the diagram. Both of these things require that our data be easily separable and that our model have the capacity to perform that separation. Our model currently has enough capacity, so that’s not the issue. Instead, let’s take a look at our data.

Recall that our data is wildly imbalanced. There’s a 400:1 ratio of negative samples to positive ones. That’s crushingly imbalanced! Figure 12.16 shows what that looks like. No wonder our “actually nodule” samples are getting lost in the crowd!


<figure>
    <center>
        <img src="attachments/imbalanced-dataset.png"  style="width:750px;" >
    </center>
</figure>
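
A quick worked example shows why such an imbalance hides failures. It assumes 400 negative samples for every positive one and a degenerate model that always answers "not a nodule"; the variable names are illustrative.

```python
negatives = 400
positives = 1

# A degenerate model that classifies every sample as negative:
true_negatives = negatives   # every negative is "correct"
false_negatives = positives  # every positive is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / (negatives + positives)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.9975
print(f"recall:   {recall:.4f}")    # recall:   0.0000
```

The accuracy looks excellent even though the model has not found a single nodule, which is exactly the failure that recall, precision, and F1 are meant to expose.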


# Balancing the Dataset

Balancing the training set to have an equal number of positive and negative samples during training can result in the model performing better (defined as having a positive, increasing F1 score).


## Using `sampler` from `DataLoader` class

One of the optional arguments to `DataLoader` is `sampler=...`. This allows the data loader to override the iteration order native to the dataset passed in and instead shape, limit, or reemphasize the underlying data as desired. This can be incredibly useful when working with a dataset that isn’t under your control. Taking a public dataset and reshaping it to meet your needs is far less work than reimplementing that dataset from scratch.

The downside is that many of the mutations we could accomplish with samplers require that we break encapsulation of the underlying dataset. The `Dataset` API only specifies that subclasses provide `__len__` and `__getitem__`, but there is nothing direct we can use to ask how to sample the data.
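
For illustration, here is a minimal sketch of the sampler approach using PyTorch's `WeightedRandomSampler`. The toy tensors, the 400-to-4 class split, and the inverse-frequency weighting are assumptions for the example, not part of our project. The sketch needs the labels up front, which is the kind of knowledge about the dataset's internals that the `Dataset` API does not expose.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for an imbalanced dataset: 400 negatives, 4 positives.
labels = torch.cat([torch.zeros(400, dtype=torch.long),
                    torch.ones(4, dtype=torch.long)])
features = torch.randn(len(labels), 8)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class, so both
# classes are drawn with roughly equal probability.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),   # samples drawn per epoch
    replacement=True,           # positives must repeat to reach balance
    generator=torch.Generator().manual_seed(0),
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

drawn = torch.cat([batch_labels for _, batch_labels in loader])
positive_fraction = drawn.float().mean().item()
print(f"positive fraction: {positive_fraction:.2f}")
```

With these weights, about half of the drawn samples should be positive, even though positives make up only about 1% of the underlying data.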

## Modifying Dataset class directly

We are going to directly change our `LunaDataset` to present a balanced, one-to-one ratio of positive and negative samples for training. We will keep separate lists of negative training samples and positive training samples, and alternate returning samples from each of those two lists.
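
The alternating scheme can be sketched independently of `LunaDataset`. The class below is a hypothetical stand-in, not the actual implementation: it maps even indices to the positive list and odd indices to the negative list, wrapping around whichever list is shorter. The class name, the `length` parameter, and the string samples are assumptions.

```python
class AlternatingSamples:
    """Hypothetical helper that alternates positive and negative samples."""

    def __init__(self, positive_samples, negative_samples, length):
        self.positive_samples = positive_samples
        self.negative_samples = negative_samples
        self.length = length  # how many samples make up one epoch

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        pair_index = index // 2
        if index % 2 == 0:
            # Even positions: a positive sample, wrapping around the list.
            return self.positive_samples[pair_index % len(self.positive_samples)]
        # Odd positions: a negative sample, wrapping around the list.
        return self.negative_samples[pair_index % len(self.negative_samples)]


positives = ["pos-0", "pos-1"]
negatives = [f"neg-{i}" for i in range(6)]
balanced = AlternatingSamples(positives, negatives, length=8)
print([balanced[i] for i in range(len(balanced))])
```

Because the positive list is much shorter, each positive sample is returned many times per epoch, while only part of the negative list is visited.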

TBD: Explain implementation


# Data Augmentation

We _augment_ a dataset by applying synthetic alterations to individual samples, resulting in a new dataset with an effective size that is larger than the original. The typical goal is for the alterations to result in a synthetic sample that remains **representative of the same general class** as the source sample, but that **cannot be trivially memorized** alongside the original. 

When done properly, this augmentation can increase the training set size beyond what the model is capable of memorizing, resulting in the model being forced to increasingly rely on generalization.
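
As a rough sketch, the function below produces altered copies of a small 2D image by random mirroring and a little additive noise. Which alterations keep a sample representative of its class depends on the data, so the choice of mirroring and noise, the `noise_scale` value, and the tiny list-of-rows image are all assumptions made for illustration.

```python
import random


def augment(image, rng, noise_scale=0.01):
    """Return a randomly mirrored, slightly noisy copy of a 2D image."""
    rows = [list(row) for row in image]      # copy; leave the input untouched
    if rng.random() < 0.5:
        rows = [row[::-1] for row in rows]   # mirror left-right
    if rng.random() < 0.5:
        rows = rows[::-1]                    # mirror up-down
    # Small Gaussian noise makes each copy differ from the original.
    return [[value + rng.gauss(0.0, noise_scale) for value in row]
            for row in rows]


rng = random.Random(0)
original = [[0.0, 0.1, 0.2],
            [0.3, 0.4, 0.5],
            [0.6, 0.7, 0.8]]
variants = [augment(original, rng) for _ in range(4)]
for variant in variants:
    print([[round(value, 2) for value in row] for row in variant])
```

Each call yields a different sample from the same source, so a positive sample that is returned many times per epoch no longer looks identical every time.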