**The report is only 10 pages. The other 40 pages are just the code appendix**

# Problem Definition

This final project explores two ways of using transfer learning to classify chest X-ray images as showing pneumonia or not. One way is to retrain the top layer of VGG16 on the new data set. The other is to train a classifier on the activations of an earlier layer of VGG16.

# Solution Specification

I used a CNN for this classification task because the problem deals with image data. Neural networks, and convolutional neural networks in particular, fit the task better than other methods because they preserve the spatial structure of images. The model can therefore take the neighboring relations among pixels into account when extracting and learning helpful features.

I use transfer learning for the following two reasons:

1. To get good results, this task most likely requires a deep architecture with many layers, which requires extensive resources to train. Therefore, I chose VGG16, a CNN pre-trained on ImageNet that achieved 92.7% top-5 test accuracy on ImageNet (https://neurohive.io/en/popular-networks/vgg16/), so that I only have to train some of the top layers.
2. The data is taken from Kaggle (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia). The data set is not very big, so fully training a deep model on it runs a risk of overfitting. VGG16 was trained on ImageNet, a data set of over 14 million images, and is expected to extract image features well.

## Data
The data set comes from Kaggle and includes three subsets: training (5216 images), validation (16 images), and testing (624 images).

## VGG16 + CNN Overview 
This section gives a detailed description of VGG16, explaining the common components of a CNN along the way. The overview is based on this [cs231n note](http://cs231n.github.io/convolutional-networks/).

![VGG16 architecture](images/vgggraph.jpg)

Above is the graph from [neurohive.io](https://neurohive.io/en/popular-networks/vgg16/). It has the following types of layers:

### Convolution layer
This layer uses a small spatial filter (a 3D volume) that extends through the full depth of the input, which is also a 3D volume. As the filter slides across the input (from left to right, from top to bottom), it computes the dot product between its weights and the region it currently covers, producing a single number. After sliding over the whole input, each filter has produced a 2D feature map. Stacking the 2D maps from multiple filters gives the output volume.

We can control the size of the outputs via the following parameters:
1. Stride: how far the filter moves between successive dot products.
2. Padding: whether, and by how much, to pad the borders of the input with zeros.
3. Number of filters: the depth of outputs is the number of filters.
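
How these parameters determine the spatial output size can be sketched with the formula from the cs231n note, $(W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the padding, and $S$ the stride. The function name `conv_output_size` is my own and only meant as an illustration:

```python
def conv_output_size(input_size: int, filter_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Spatial size (width or height) of a conv or pooling layer's output."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


# VGG16-style 3x3 convolution with stride 1 and padding 1 keeps the size.
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
# A 2x2 window with stride 2 (as in max pooling) halves the size.
print(conv_output_size(224, 2, stride=2))  # 112
```

The depth of the output is not part of this formula; it is simply the number of filters.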

### Pooling layer
This layer reduces the spatial size of the outputs of a conv layer, which keeps the number of activations, and hence the parameters and computation in later layers, under control. It usually uses the max operation, and VGG uses this max pooling. The max operation is applied to, for example, $2\times 2$ regions and keeps only the largest value in each, discarding the other activations. A stride can also be used here.
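
As a minimal sketch of the max operation on $2\times 2$ regions with stride 2, written in plain Python on a made-up $4\times 4$ feature map (the function name `max_pool_2x2` and the numbers are illustrative only):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(window))  # keep the largest activation, drop the rest
        pooled.append(row)
    return pooled


activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(activations))  # [[4, 2], [2, 8]]
```

Each $2\times 2$ block collapses to one number, so the map shrinks from $4\times 4$ to $2\times 2$.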

### Fully connected layer
Each unit in this layer is connected to all activations in the previous layer, just like in a regular neural network. 

### ReLU
This is not a layer, but rather an activation function applied to the outputs of a layer. It sets all negative activations to 0.

### Softmax
This is another activation function. It maps a vector of scores to values between 0 and 1 that sum to 1, so they can be read as class probabilities. Suppose the two outputs are $a$ and $b$; they are transformed into
$$\frac{e^{a}}{e^{a}+e^{b}}$$
$$\frac{e^{b}}{e^{a}+e^{b}}$$
respectively. This is also what softmax does in VGG16.
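
The two formulas above can be checked numerically with a small sketch. Subtracting the largest score before exponentiating is a common numerical-stability trick that I added; it does not change the result:

```python
import math


def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(scores)  # stability trick: exp() never sees a large positive number
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


a, b = 2.0, 0.0
probs = softmax([a, b])
print(probs)       # first entry equals e^a / (e^a + e^b)
print(sum(probs))  # the outputs sum to 1 (up to floating-point rounding)
```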



## Road map
This section outlines what I did in this assignment. 

1. I retrained only the last layer of VGG16 on the training set (I'll call this Model 1). Some features of the training that are worth highlighting:
    * For Model 1 (and for the other two models as well), I only trained for 5 epochs because of resource constraints; this already took a lot of time (~2.5 hours per model). The notebook crashed in the middle of training several times before I could successfully complete a 5-epoch training session.
    * Because the validation set only has 16 images, I took $20\%$ of the images from the training set and moved them to the validation set. My reasoning was that the validation set needs to be large enough to give meaningful validation results.
    * I did not rebalance the data, although the training data is somewhat imbalanced (the positive/negative ratio is around $3$). Instead, I used the weighted cross entropy loss. The regular cross entropy loss for a sample $i$ is: $$-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)$$ Suppose the weights are $a$ and $b$ for the positive and negative classes respectively; then the weighted version of the cross entropy loss is: $$-ay_i\log\hat{y}_i-b(1-y_i)\log(1-\hat{y}_i)$$ This means we penalize misclassification differently for each class. In our case, since there are more positive cases than negative ones, the weight for the positive class should be smaller than that for the negative class (the two weights used are 0.5 and 1).


2. In the second attempt, I retrained VGG16 starting from an earlier layer, with the last 8 layers trainable (I'll call this Model 2). This is motivated by the fact that in a CNN, later layers learn features that are more specific to the training data, while earlier layers learn more generic ones. Because X-ray images are very different from the data used to train VGG16 (ImageNet), I decided to give this method of training on the activations of an earlier layer a go.


3. I compared the validation results of the two models above and picked the first one (training only the top fully-connected layer) for testing. The test result was very poor compared to the validation result. This made me suspect that the way I moved some of the images from the default training set to the validation set was flawed. Perhaps the way the training set was created makes some training images similar to each other. For example, this could happen if the train-val-test sets were not split randomly, or if the creators augmented the training data. Moving some training images to the validation set would then leak information into validation, which would explain why the validation result was so good and so similar to the training result.

4. As a result, I trained the top layer of VGG16 again, this time using the original train-val-test sets downloaded from Kaggle without any image relocation (I'll call this Model 3), and compared the validation and test results to see whether the discrepancy was alleviated.

Step 4 above makes this whole assignment an exploratory one rather than one that builds a model of any real use: the test set was used to tune the model, so the test results are no longer unbiased estimates of the chosen model's performance on unseen data (**#overfitting**).
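
The weighted cross entropy loss from item 1 of the road map can be sketched in plain Python as follows. This is only an illustration of the formula, not the code used for training. Assigning 0.5 to the positive class and 1 to the negative class is my reading of the statement that the positive weight should be the smaller one, and the clipping constant `eps` is an added assumption to avoid $\log 0$:

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=0.5, neg_weight=1.0, eps=1e-12):
    """Mean weighted binary cross entropy: -a*y*log(p) - b*(1-y)*log(1-p)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += (-pos_weight * y * math.log(p)
                  - neg_weight * (1 - y) * math.log(1 - p))
    return total / len(y_true)


# The same size of mistake costs twice as much on a negative case.
missed_positive = weighted_bce([1], [0.1])  # positive case scored 0.1
false_alarm = weighted_bce([0], [0.9])      # negative case scored 0.9
print(missed_positive, false_alarm)
```

With these weights, an equally confident mistake on a negative image contributes twice the loss of one on a positive image, which offsets the roughly 3:1 class imbalance only partially.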

# Testing and Analysis

## Model Metric
I only use the area under the ROC curve (AUC) for this task. I assume that the most important metric is recall, the ability to detect all the pneumonia cases, because missed detections are very costly for the patients. However, one can trivially achieve $100\%$ recall by predicting everything as positive. This is not ideal, as we also want to avoid the cost incurred by false positives (follow-up examination for positively diagnosed cases is costly). We therefore want to balance recall against the false positive rate (FPR). Both metrics in turn depend on the choice of threshold at which we classify a patient's case as positive. I believe it is more convenient and systematic to have a metric that is threshold-independent and holistic, and AUC is such a metric.

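If the AUC is 0, the model ranks every positive case below every negative case, i.e. it is $100\%$ wrong. An AUC of 0.5 means the model does no better than random guessing. The closer the AUC is to 1, the better the model.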
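
One way to see what these AUC values mean is the pairwise interpretation: AUC equals the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The sketch below uses this definition on made-up scores; it is not the library routine used in the notebook:

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # 1.0, perfect ranking
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # 0.0, perfectly inverted ranking
print(roc_auc(labels, [0.7, 0.7, 0.7, 0.7]))  # 0.5, same score for every image
```

The last case shows that a model which outputs the same score for every image gets an AUC of exactly 0.5.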

As a note, in the code the accuracy was also reported, but that's only because I forgot to delete the corresponding code. The accuracy is not analyzed in this assignment.

## Results

Just to recap: Model 1 refers to VGG16 with only the top fc layer retrained, Model 2 is VGG16 with the top 11 layers made trainable, and Model 3 is the same as Model 1, except that the data Model 3 uses is not interfered with as it was for Model 1.

### Model 1
Training AUC: 0.996
Validation AUC: 0.994

![ROC val](images/rocval.jpg)

Based on the validation results, the threshold chosen so that FPR <= 0.2 and TPR >= 0.99 is 0.37298495.

![Threshold](images/vsthreshold.jpg)

Test AUC: 0.937

![ROC test](images/roctest.jpg)

With the threshold inferred from validation above ($0.37$):

* FPR: 0.6837606837606838
* TPR: 0.9948717948717949

![Threshold](images/pic5.png)


### Model 2
Training AUC: 0.5
Validation AUC: 0.5

![ROC val](images/pic6.png)

### Model 3
Training AUC: 0.998
Validation AUC: 1

![ROC val](images/pic7.png)

Based on the validation results, the threshold chosen so that FPR <= 0.2 and TPR >= 0.99 is 0.98834604.

![Threshold](images/pic8.png)

Test AUC: 0.94

![ROC test](images/pic9.png)

With the threshold inferred from validation above ($0.988$):

* FPR: 0.11965811965811966
* TPR: 0.8692307692307693

![ROC test](images/pic10.png)




## Result Analysis
Between Model 1 and Model 2, Model 1 is far better: its validation AUC is 0.994, versus 0.5 for Model 2, which is no better than random guessing. Therefore, I chose to evaluate Model 1 on the test set.

The test AUC for Model 1 is only slightly worse than the validation AUC (0.937 vs 0.994), which is a good sign. However, with the threshold inferred from validation, the TPR is near perfect ($0.995$) but the FPR suffers at 0.684, nowhere near the intended 0.2 from the validation set. This hints that the validation data resembles the training data more than the test data does. That is quite plausible given how I processed the data at the beginning: I randomly moved $20\%$ of the training data to the validation set. If the training set was not constructed randomly (for example, if its creators applied data augmentation to the training data), then this explanation would be confirmed. This motivated me to take an exploratory path and build a third model, where the data is kept intact (no relocation).
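
The mismatch between validation and test rates at a fixed threshold can be illustrated with a small sketch. It assumes the threshold is taken as the highest score cut-off that satisfies both constraints on the validation set, which may differ from the exact selection rule in the notebook. All labels and scores below are made up for illustration:

```python
def rates_at(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos


def pick_threshold(y_true, scores, max_fpr=0.2, min_tpr=0.99):
    """Highest candidate threshold meeting both constraints, or None."""
    for t in sorted(set(scores), reverse=True):
        fpr, tpr = rates_at(y_true, scores, t)
        if fpr <= max_fpr and tpr >= min_tpr:
            return t
    return None


# Made-up validation and test scores, for illustration only.
val_labels = [1, 1, 1, 0, 0, 0, 0, 0]
val_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1, 0.05]
test_labels = [1, 1, 1, 0, 0, 0, 0, 0]
test_scores = [0.95, 0.7, 0.65, 0.8, 0.75, 0.62, 0.2, 0.1]

threshold = pick_threshold(val_labels, val_scores)
print("threshold:", threshold)
print("validation (FPR, TPR):", rates_at(val_labels, val_scores, threshold))
print("test (FPR, TPR):", rates_at(test_labels, test_scores, threshold))
```

In this toy example the threshold meets the FPR target on validation but not on the test scores, because the negative test cases are scored higher than the negative validation cases. This is the same pattern as in the Model 1 results.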

In Model 3, the training and validation AUCs are slightly better than their counterparts in Model 1. However, it has the same problem as Model 1: the threshold and the corresponding FPR and TPR do not generalize well to the test set. While the FPR is better now (0.120), the TPR of 0.869 is unacceptable compared to the intended 0.99. The caveat is that the validation set for Model 3 contains only 16 images, 8 negative and 8 positive, which probably makes the threshold determined from it not very meaningful. In fact, if we print out the TPR, FPR, and thresholds, we get:
```
Input: print(tpr)
Output: [0.125, 1.   , 1.   ]
```

```
Input: print(fpr)
Output: [0. 0. 1.]
```

```
Input: print(thresholds)
Output: [0.9999995  0.98834604 0.00752002]
```
My argument is that the test results might have fared better if there had simply been more validation data.
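
A quick way to see why 16 validation images are too few: with only 8 positive images, the TPR can only take values in steps of $1/8 = 0.125$, which is exactly the first entry of the printed `tpr` array. The helper name `achievable_rates` is my own:

```python
def achievable_rates(n_examples):
    """All values a rate (TPR or FPR) can take when computed over n_examples images."""
    return [k / n_examples for k in range(n_examples + 1)]


tpr_values = achievable_rates(8)  # 8 positive validation images
print(tpr_values)
# Only a perfect TPR satisfies the TPR >= 0.99 target on such a small set.
print([t for t in tpr_values if t >= 0.99])  # [1.0]
```

On this validation set the constraint TPR >= 0.99 is therefore the same as requiring every positive image to be detected, and a single image changes the rate by 12.5 percentage points.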

Regarding the extremely poor performance of Model 2, I admit that there is room for improvement:

* The activation at the final layer, the optimizer, and the loss function could be experimented with.

* Maybe the network simply needs to be trained for more epochs.

Additionally, the weights currently used in the cross entropy loss function are `[.5, 1]`, but they could also be tuned on the validation set for all three models.

Thus, the results show some evidence that training from earlier layers is not as good as training only the top fully-connected layer. This evidence is not very strong, however, as there are many parameters involved in Model 2 that could be tuned and could potentially lead to better results.




## What I've Learned

1. Using a weighted loss function is one way to deal with imbalanced data. Previously I only knew about downsampling for imbalanced data.

2. Simply moving some images from the training set to the validation set may not be a good way to increase the size of the validation set. More broadly, the metadata is important when building machine learning models: how was the data collected? If the data is presented to us already split into train-val-test, how was this split implemented? Knowing this information can inform data processing.
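
If the metadata revealed that related images exist (for example, several images of the same patient), one possible remedy is to move whole groups of related images to the validation set rather than individual images. The sketch below is hypothetical: it assumes that file names start with a patient identifier such as `person12_`, which I have not verified for this data set, and the helper names are my own:

```python
import random
from collections import defaultdict


def patient_id(filename):
    # Hypothetical naming scheme: "person12_bacteria_3.jpeg" -> "person12".
    return filename.split("_")[0]


def group_split(filenames, group_of=patient_id, val_fraction=0.2, seed=0):
    """Split so that all images sharing a group key land on the same side."""
    groups = defaultdict(list)
    for name in filenames:
        groups[group_of(name)].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # seeded shuffle for reproducibility
    target = val_fraction * len(filenames)
    train, val = [], []
    for key in keys:
        # Fill validation with whole groups first, then send the rest to training.
        (val if len(val) < target else train).extend(groups[key])
    return train, val


# Made-up file names: 10 patients with 3 images each.
files = [f"person{p}_bacteria_{i}.jpeg" for p in range(10) for i in range(3)]
train_files, val_files = group_split(files)
shared = {patient_id(f) for f in train_files} & {patient_id(f) for f in val_files}
print(len(train_files), len(val_files), shared)
```

Because groups are never divided, no patient contributes images to both sides, which removes one possible source of the leak suspected above.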