---
title: "Computer vision & CNNs: Transfer Learning"
output:
  html_notebook:
    toc: yes
    toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
ggplot2::theme_set(ggplot2::theme_bw())
```
In this module, we are going to use a ___pretrained___ CNN model to perform
image classification on our dogs vs. cats images.
Learning objectives:

- Why pretrained models can be efficient and effective.
- How to perform feature extraction with a pretrained model.
- How to fine tune a pretrained model and train it end-to-end.
# Requirements
```{r}
library(keras)
library(glue)
library(ggplot2)
```
# Data
We are working with the same dogs and cats images as before.
```{r image-file-paths}
# define the directories:
if (stringr::str_detect(here::here(), "conf-2020-user")) {
  image_dir <- "/home/conf-2020-user/data/dogs-vs-cats"
} else {
  image_dir <- here::here("materials", "data", "dogs-vs-cats")
}

train_dir <- file.path(image_dir, "train")
valid_dir <- file.path(image_dir, "validation")
test_dir <- file.path(image_dir, "test")
```
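
If you want to confirm the data are where you expect, a quick sanity check of
the image counts can help. This is a small sketch; it assumes each split
directory contains the `cats`/`dogs` class subfolders used in earlier modules.

```{r count-images}
# Count images under each split (recursive to include the class subfolders)
sapply(
  c(train = train_dir, validation = valid_dir, test = test_dir),
  function(dir) length(list.files(dir, recursive = TRUE))
)
```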
# Transfer Learning
There are _two main_ ways we can apply a pretrained CNN model to a new image
classification task:

1. __Feature extraction__: Use the convolutional base to do feature engineering
   on our images and then feed the results into a new densely connected
   classifier.
   - Most efficient
   - Does not require GPUs
   - Does not "personalize" feature extraction to the problem at hand
   - Likely leaves room for improvement
2. __Fine tune a pretrained model and run end-to-end__: Build a full sequential
   model with the convolutional base and a new densely connected classifier, and
   train the entire model with some or all of the convolutional base layers
   frozen.
   - Computationally demanding
   - Often requires GPUs
   - Tweaks pretrained convolution layers to extract problem-specific features
   - Maximizes performance
![](images/swapping_fc_classifier.png)
# Transfer Learning: Feature Extraction
## Get our pretrained model
There are several pretrained models available via keras; they all start with
`application_`. Here we'll use the VGG16 model: its structure is intuitive to
understand, it does a good job with this task, and it works well for teaching
purposes.

Best practices for picking pretrained models:

- Start with smaller models (e.g. VGG16, VGG19, ResNet34); the sketch below
  shows a quick way to compare model sizes.
- Check out the data the model was built on and see how it aligns to your data.
  Most common is [ImageNet](http://image-net.org/about-overview).
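
As a rough way to compare candidate bases by size, you can instantiate a couple
of the `application_*()` models without weights and count their parameters.
This is a quick sketch: `weights = NULL` skips the weight download and just
builds the architecture.

```{r compare-base-sizes}
# Compare parameter counts of two candidate convolutional bases
bases <- list(
  vgg16    = application_vgg16(weights = NULL, include_top = FALSE, input_shape = c(150, 150, 3)),
  resnet50 = application_resnet50(weights = NULL, include_top = FALSE, input_shape = c(150, 150, 3))
)

sapply(bases, count_params)
```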
Applying pretrained models that are already supplied by keras is simple:

- `weights`: The weights to use. Most pretrained models are built on
  ImageNet, and using these weights tends to do well.
- `include_top`: Whether to include the fully connected dense classifier.
  Typically we want the classifier to be specific to our problem, so we exclude it.
- `input_shape`: The shape of our inputs (150x150 pixel images w/ 3 color channels).

__Tip__: Check out the [tfhub](https://github.com/rstudio/tfhub) package, which
makes it easy to interact with [TensorFlow Hub](https://www.tensorflow.org/hub),
a library for publication, discovery, and consumption of reusable models.
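
For illustration only, here is a minimal tfhub sketch based on the pattern in
the tfhub README. The module handle is one of the ImageNet feature-vector
modules published on TensorFlow Hub; we haven't validated this particular
module against our dataset.

```{r tfhub-sketch, eval=FALSE}
library(tfhub)

# Swap the convolutional base for a TF Hub feature-vector module
hub_model <- keras_model_sequential() %>%
  layer_hub(
    handle = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
    input_shape = c(224, 224, 3)
  ) %>%
  layer_dense(units = 1, activation = "sigmoid")
```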
```{r pretrained-model}
conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(150, 150, 3)
)
```
```{r vgg16-model-structure}
summary(conv_base)
```
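
If you just want the shape of the features the base will produce, rather than
reading the full summary, you can query the model's output shape directly:

```{r vgg16-output-shape}
# The base maps 150x150x3 images to 4x4x512 feature maps
conv_base$output_shape
```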
## Extracting features using the pretrained convolutional base
This may seem a little daunting at first, but it is just a more manual version
of what you have already been doing. Here we create a function that will:

1. Create empty tensors to hold our transformed features and labels.
2. Create our data batch importing generator. Here we use a batch size of 20
   simply because we have 2000 training images and 1000 validation images, and
   a batch size of 20 divides evenly into both of these values.
3. Loop through our training data:
   a. import a batch of images,
   b. apply the pretrained base to the images to output the predicted features,
   c. add these new features and the labels to our tensors,
   d. continue until the number of iterations times the batch size equals or
      exceeds our total number of samples.
```{r image-generator-feature-extraction}
datagen <- image_data_generator(rescale = 1/255)
batch_size <- 20

extract_features <- function(directory, sample_count) {

  features <- array(0, dim = c(sample_count, 4, 4, 512))   # step 1
  labels <- array(0, dim = c(sample_count))                # step 1

  generator <- flow_images_from_directory(                 # step 2
    directory = directory,
    generator = datagen,
    target_size = c(150, 150),
    batch_size = batch_size,
    class_mode = "binary"
  )

  i <- 0
  while (TRUE) {                                           # step 3
    message("Processing batch ", i + 1, " of ", ceiling(sample_count / batch_size))
    batch <- generator_next(generator)                     # step 3a
    inputs_batch <- batch[[1]]
    labels_batch <- batch[[2]]
    features_batch <- conv_base %>% predict(inputs_batch)  # step 3b
    index_range <- ((i * batch_size) + 1):((i + 1) * batch_size)
    features[index_range,,,] <- features_batch             # step 3c
    labels[index_range] <- labels_batch                    # step 3c
    i <- i + 1
    if (i * batch_size >= sample_count) break              # step 3d
  }

  list(
    features = features,
    labels = labels
  )
}
```
Let's apply this function to our training, validation, and test data.

**Without a GPU this will take approximately 5 minutes to execute.**
```{r execute-feature-extraction}
train <- extract_features(train_dir, 2000)
validation <- extract_features(valid_dir, 1000)
test <- extract_features(test_dir, 1000)
```
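
Before reshaping, it's worth confirming what came back: a 4D feature tensor and
a label vector for each split.

```{r check-extracted-features}
# 2000 training samples, each a 4 x 4 x 512 feature map
dim(train$features)
length(train$labels)
```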
## Reshape features
The extracted features will be a 4D tensor (samples, 4, 4, 512). We can see this
in the last layer of our `conv_base` model above (`block5_pool (MaxPooling2D)`).
Consequently, we need to reshape (flatten) these into a 2D tensor to feed into
a densely connected classifier. This results in a 2D tensor of size
(samples, 4 * 4 * 512 = 8192).
```{r reshape-features}
reshape_features <- function(features) {
  array_reshape(features, dim = c(nrow(features), 4 * 4 * 512))
}
train$features <- reshape_features(train$features)
validation$features <- reshape_features(validation$features)
test$features <- reshape_features(test$features)
dim(train$features)
```
## Define densely connected classifier
We've extracted and flattened our features from the convolution layers so now
we only need to build the densely connected classifier portion of our model.
```{r model-classifier}
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 4 * 4 * 512) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)
```
Now we can compile and train our model. This will train quickly, taking
approximately 1 min when trained on your local CPU. Our validation loss also
improves over the previous CNN we built.
```{r train-model}
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.0001),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history1 <- model %>% fit(
  train$features, train$labels,
  epochs = 30,
  batch_size = 32,
  validation_data = list(validation$features, validation$labels),
  callbacks = list(
    callback_early_stopping(patience = 10),
    callback_reduce_lr_on_plateau(patience = 2)
  )
)
```
So we achieved a significant decrease in our loss score, increased our accuracy
to 90%, and did so in a fraction of the time!
```{r second-model-results}
best_epoch <- which.min(history1$metrics$val_loss)
best_loss <- history1$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history1$metrics$val_accuracy[best_epoch] %>% round(3)
glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")
```
```{r plot-model1-history}
plot(history1) +
  scale_x_continuous(limits = c(0, length(history1$metrics$val_loss)))
```
# Transfer Learning: End-to-End
⚠️⚠️ ONLY RUN ON GPU!! ⚠️⚠️
The above approach performed pretty well. However, we can see that we are still
overfitting, which may be reducing model performance. An alternative approach is
to run a pretrained model end-to-end. This approach is much slower and more
computationally intensive; however, it offers greater flexibility in using and
adjusting the pretrained model because it lets you:

1. use data augmentation to decrease overfitting (and usually increase model
   performance), and
2. fine tune parts of the pretrained model.

The following approach simply plugs the pretrained convolutional base into a
sequential model but ___freezes___ the convolutional base weights.
## Combining a densely-connected neural network with the convolutional base
In this case we can literally plug in our `conv_base` within our model
architecture.
```{r CNN-base-and-classifier}
model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model
```
## Freezing layers
Before you compile and train the model, it's important to freeze the convolutional
base weights. This prevents the weights from being updated during training. If
you don't do this, the representations learned by the pretrained model will be
modified and, potentially, completely destroyed. Note that freezing only takes
effect once the model is compiled, which we do below.
```{r freeze-parameters}
cat(length(model$trainable_weights), "trainable weight tensors before freezing.\n")
freeze_weights(conv_base)
cat(length(model$trainable_weights), "trainable weight tensors after freezing.\n")
```
## Training the model end-to-end with a frozen convolutional base
The following trains the model end-to-end using all CNN logic that you have seen
before:
1. data augmentation
2. image generator
3. compile our model
4. train our model
```{r train-end-to-end}
train_datagen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 40,
  width_shift_range = 0.2,
  height_shift_range = 0.2,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE,
  fill_mode = "nearest"
)

test_datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  train_dir,
  train_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

validation_generator <- flow_images_from_directory(
  valid_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-5),
  metrics = c("accuracy")
)

history2 <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50
)
```
```{r plot-model2-history}
plot(history2)
```
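
If you'd like to score this model on the held-out test images, here is a
generator-based evaluation sketch. It reuses `test_datagen` and assumes the
1000 test images from above, hence 50 steps of 20.

```{r evaluate-end-to-end, eval=FALSE}
# Evaluate the frozen-base model on the held-out test set
test_generator <- flow_images_from_directory(
  test_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% evaluate_generator(test_generator, steps = 50)
```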
# Transfer Learning: Fine Tune
Another widely used technique with pretrained models is to unfreeze a few of
the top layers of the convolutional base and allow those weights to be updated.
Recall that the early layers in a CNN identify detailed edges and shapes, while
later layers put these edges and shapes together to make higher-order parts of
the images we are trying to classify (e.g. cat ears, dog tails).

The more our images deviate from the images used to train the pretrained model,
the more likely you will want to retrain the last few layers, which makes those
higher-order features more relevant to your problem.
To fine-tune a pretrained model you:

1. Add your custom network on top of an already-trained base network (executed
   in the `CNN-base-and-classifier` code chunk).
2. Freeze the base network (executed in the `freeze-parameters` code chunk).
3. Train the part you added (executed in the `train-end-to-end` code chunk).
4. Unfreeze some layers in the base network.
5. Jointly train both these layers and the part you added.

We already did steps 1-3. The following executes steps 4 and 5.
## Unfreeze some layers in the base network
```{r unfreeze-some-layers}
unfreeze_weights(conv_base, from = "block3_conv1")
```
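
Before recompiling, you can verify which layers will now be updated by
inspecting each layer's `trainable` flag:

```{r check-trainable-layers}
# Which base layers will now be updated during training?
data.frame(
  layer     = sapply(conv_base$layers, function(l) l$name),
  trainable = sapply(conv_base$layers, function(l) l$trainable)
)
```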
## Jointly train both these layers and the part you added
```{r fine-tune-model}
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-5),
  metrics = c("accuracy")
)

history3 <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 100,
  validation_data = validation_generator,
  validation_steps = 50
)
```
```{r plot-model3-history}
plot(history3)
```
# Takeaways

* Pretrained models can be efficient and effective for problems that align to
  common computer vision (and other) tasks.
* Many pre-existing models exist on [TensorFlow Hub](https://www.tensorflow.org/hub)
  and are worth researching.
* Three main approaches exist for transfer learning; the decision depends on
  your desire to maximize efficiency (compute time) vs. effectiveness (accuracy):
  - feature extraction
  - end-to-end training with existing CNN weights
  - end-to-end training while fine-tuning latter CNN layer weights
[🏠](https://github.com/rstudio-conf-2020/dl-keras-tf)